Skip to content

Results

A sweep produces two kinds of artifact, and it helps to keep them straight:

  • Stored results — one committed results.<ext> per skill, written as the evals run. This is the source of truth: the raw per-model, per-case outcomes, kept deterministic so it commits and diffs cleanly. This page covers them.
  • ReportsEVALUATION.md and a machine-readable rollup, rendered from the stored results by evolve report. Nothing is measured there; the report is a view over what the results files already hold. See Reviewing reports.
Artifact Path Scope Written by
Stored results evals/<skill>/results.<ext> one skill the eval engine
Markdown report EVALUATION.md whole repo evolve report
Machine rollup EVALUATION.<format> whole repo evolve report

Both are generated — regenerate them, don't edit them by hand.

Where they live

Each skill's outcomes land in a single results.<ext> beside its authored definitions:

evals/<skill>/
├── triggers.<ext>     # Tier 1 definitions
├── evals.<ext>        # Tier 2 definitions
└── results.<ext>      # committed outcomes — both tiers

Supported formats are json, jsonc, yaml and yml; the active one follows --results-format. A sweep rewrites only the model entries it actually ran and leaves the rest untouched, so diffs stay scoped to what changed. Writes are atomic and deterministic — sorted keys, fixed field order, rounded floats, trailing newline — so re-running an unchanged suite produces a byte-identical file. The mechanics of the write are in How evaluations run.

What's inside

A results file is nested model-major: a top-level header, then a models map keyed by provider/model-id, and under each model up to two entries — triggers and evals — either of which is absent when that tier hasn't run for the model. The key is provider-qualified (anthropic/claude-opus-4-8) because a harness like Cursor can drive another vendor's model, and the prefix keeps those ids from colliding.

Field Meaning
schema Results schema version. evolve report --migrate upgrades older files in place.
plugin Plugin the skill belongs to.
skill Skill name (matches the eval directory).
models Map of provider/model-id → per-model triggers / evals entries.

Each tier entry carries a shared header and then its per-case results:

Field Meaning
provider, model, display Provider id, model id, and human-readable name.
harness Agent CLI that executed the run (claude, codex, …).
tool_version evolve version that wrote the entry.
ran_at RFC3339 UTC timestamp of the run.
content_hash Fingerprint of the skill content the entry was graded against — triggers hash the SKILL.md frontmatter, evals the whole skill directory. Drives --modified/--new staleness.
timeout_seconds Per-case timeout in force for the run.
pricing The input/output USD-per-MTok rates snapshotted at run time, or an explicit null when the model is unpriced.
results The per-case array — one entry per trigger query or eval case.
summary Aggregates over results: passed/failed/total, pass_rate, avg_run_seconds, and any usage rollup.

Per-case detail differs by tier. A trigger result records the query, whether it should_trigger, the hits over runs, the passed verdict (hit-rate ≥ 0.5), and avg_run_seconds. An eval result records the case id/name, a tri-state passed (null when skipped or errored), a runtime_error string when the agent run failed outright, the graded expectations (each with its text, passed, and evidence), and execution_metrics/timing for the run.

Usage and pricing are grouped, not nulled

Token figures live in two optional sub-objects: estimate (input tokens from the provider's counting API over SKILL.md + the query/prompt, priced at the input rate) and measured (the harness-reported consumption — fresh input, cache reads/writes, output, and total cost). A provider that can't count or report usage simply omits the sub-object, so an absent figure stays distinguishable from a measured zero.

Alongside the current run, an entry keeps compact prior snapshots so the report can show movement without a re-run:

  • previous (both tiers) — the run this one replaced. The one-back comparison, i.e. your iteration signal.
  • baseline (evals only) — the same cases run with the skill absent. The gap is the skill's measured lift.

Snapshots store only the summary and per-case scalars, never the full expectation/timing detail, so the file stays readable. The deltas themselves are derived at report time, not stored.

The schema is the contract

The field-by-field contract — including every optional sub-object — lives in the JSON Schemas under schemas/ (results.schema.json and the shared common.schema.json). Point your editor at them with a "$schema" key for validation and completion.

Once the results are written, evolve report renders them into the repository's reports — see Reviewing reports.