Results¶
A sweep produces two kinds of artifact, and it helps to keep them straight:
- Stored results — one committed
results.<ext>per skill, written as the evals run. This is the source of truth: the raw per-model, per-case outcomes, kept deterministic so it commits and diffs cleanly. This page covers them. - Reports —
EVALUATION.mdand a machine-readable rollup, rendered from the stored results byevolve report. Nothing is measured there; the report is a view over what the results files already hold. See Reviewing reports.
| Artifact | Path | Scope | Written by |
|---|---|---|---|
| Stored results | evals/<skill>/results.<ext> |
one skill | the eval engine |
| Markdown report | EVALUATION.md |
whole repo | evolve report |
| Machine rollup | EVALUATION.<format> |
whole repo | evolve report |
Both are generated — regenerate them, don't edit them by hand.
Where they live¶
Each skill's outcomes land in a single results.<ext> beside its authored definitions:
evals/<skill>/
├── triggers.<ext> # Tier 1 definitions
├── evals.<ext> # Tier 2 definitions
└── results.<ext> # committed outcomes — both tiers
Supported formats are json, jsonc, yaml and yml; the active one follows --results-format. A sweep rewrites
only the model entries it actually ran and leaves the rest untouched, so diffs stay scoped to what changed. Writes are
atomic and deterministic — sorted keys, fixed field order, rounded floats, trailing newline — so re-running an unchanged
suite produces a byte-identical file. The mechanics of the write are in
How evaluations run.
What's inside¶
A results file is nested model-major: a top-level header, then a models map keyed by provider/model-id, and
under each model up to two entries — triggers and evals — either of which is absent when that tier hasn't run for
the model. The key is provider-qualified (anthropic/claude-opus-4-8) because a harness like Cursor can drive another
vendor's model, and the prefix keeps those ids from colliding.
| Field | Meaning |
|---|---|
schema |
Results schema version. evolve report --migrate upgrades older files in place. |
plugin |
Plugin the skill belongs to. |
skill |
Skill name (matches the eval directory). |
models |
Map of provider/model-id → per-model triggers / evals entries. |
Each tier entry carries a shared header and then its per-case results:
| Field | Meaning |
|---|---|
provider, model, display |
Provider id, model id, and human-readable name. |
harness |
Agent CLI that executed the run (claude, codex, …). |
tool_version |
evolve version that wrote the entry. |
ran_at |
RFC3339 UTC timestamp of the run. |
content_hash |
Fingerprint of the skill content the entry was graded against — triggers hash the SKILL.md frontmatter, evals the whole skill directory. Drives --modified/--new staleness. |
timeout_seconds |
Per-case timeout in force for the run. |
pricing |
The input/output USD-per-MTok rates snapshotted at run time, or an explicit null when the model is unpriced. |
results |
The per-case array — one entry per trigger query or eval case. |
summary |
Aggregates over results: passed/failed/total, pass_rate, avg_run_seconds, and any usage rollup. |
Per-case detail differs by tier. A trigger result records the query, whether it should_trigger, the hits over
runs, the passed verdict (hit-rate ≥ 0.5), and avg_run_seconds. An eval result records the case id/name,
a tri-state passed (null when skipped or errored), a runtime_error string when the agent run failed outright, the
graded expectations (each with its text, passed, and evidence), and execution_metrics/timing for the run.
Usage and pricing are grouped, not nulled
Token figures live in two optional sub-objects: estimate (input tokens from the provider's counting API over
SKILL.md + the query/prompt, priced at the input rate) and measured (the harness-reported consumption — fresh
input, cache reads/writes, output, and total cost). A provider that can't count or report usage simply omits the
sub-object, so an absent figure stays distinguishable from a measured zero.
Alongside the current run, an entry keeps compact prior snapshots so the report can show movement without a re-run:
previous(both tiers) — the run this one replaced. The one-back comparison, i.e. your iteration signal.baseline(evals only) — the same cases run with the skill absent. The gap is the skill's measured lift.
Snapshots store only the summary and per-case scalars, never the full expectation/timing detail, so the file stays readable. The deltas themselves are derived at report time, not stored.
The schema is the contract
The field-by-field contract — including every optional sub-object — lives in the JSON Schemas under
schemas/ (results.schema.json and the shared
common.schema.json). Point your editor at them with a "$schema" key for validation and completion.
Once the results are written, evolve report renders them into the repository's reports — see
Reviewing reports.