Skip to content

Reviewing reports

Reports are the repository-level view over a sweep. evolve report reads every committed results.<ext>, rolls the entries up across plugins, skills, and models, and renders two artifacts: a human-readable EVALUATION.md and a machine-readable rollup. Nothing is measured here — the report is a projection of what the stored results already hold, so it re-renders identically until the results change.

Generating reports

evolve report is the last step of evolve run all, but it touches no agents, so you can re-run it any time the stored results change — after a partial sweep, a --migrate, or a models restriction.

evolve report                 # regenerate EVALUATION.md + the machine rollup
evolve report --migrate       # upgrade stored results to the latest schema first
evolve report --check         # also gate on configured thresholds (non-zero exit on breach)
Flag Description
--check Fail (exit 1) when a pass rate breaches a configured threshold.
--migrate Upgrade stored results files to the latest schema before rendering.
--min-triggers-pass-rate Minimum trigger pass rate 0..1 for --check (overrides the config threshold).
--min-evals-pass-rate Minimum eval pass rate 0..1 for --check (overrides the config threshold).
--stale-results keep\|drop What to do with stored results for models outside the active models set (default: prompt on a terminal, else keep).

The output format follows --results-format; switching formats removes the stale rollup from the previous choice. What gets written depends on the repository layout:

Layout EVALUATION.md (repo root) EVALUATION.<format> (repo root) Per-plugin EVALUATION.md
single rollup and per-skill detail tables yes
multi / marketplace rollup only yes one per plugin, with the detail tables

The Markdown report

EVALUATION.md opens with a generated-by marker, an # Skill evaluations heading, and a methodology paragraph explaining where the figures come from and how to read empty cells. If a models restriction is active, an ## Excluded models table follows, listing the catalog models left out (so a partial matrix never reads as the whole picture).

Then one ## <plugin> section per plugin, each with up to two rollup tables.

Two kinds of empty cell

The report distinguishes them deliberately, and the difference is decidable from the stored data:

  • — not measured yet. A rerun could fill it (e.g. a tier that hasn't run for that model).
  • n/a — the provider can never produce it: no counting API, no usage reporting, or no published pricing. It's structurally absent, not zero.

Rollup tables

The per-plugin Triggers rollup, one row per model:

Column What it shows
Provider Vendor display name.
Model Display name with the provider-local model id in backticks.
Passed Passing queries over total (1/2); (N errored) is appended when runs failed outright.
Pass rate Share of queries that passed, as a percentage.
Δ rate Change in pass rate versus the previous run; when there's nothing to compare against.
Avg run Mean wall-clock per run, weighted across queries.
Input tokens Estimated input tokens (SKILL.md + query) from the counting API.
Est. input cost That estimate priced at the model's input rate.

The per-plugin Evals rollup adds the measured-usage columns:

Column What it shows
Provider Vendor display name.
Model Display name with the provider-local model id in backticks.
Passed Passing evals over total; (N errored) for runs that failed outright (excluded from the ratio).
Δ rate Pass-rate change versus the previous run; (vs base) when only a baseline exists to compare to.
Lift vs base Pass-rate gain over the no-skill baseline run — the skill's measured contribution; with no baseline.
Avg run Mean executor duration (the agent run only, excluding grading).
Input tokens Estimated input tokens from the counting API.
Est. input cost The estimate priced at the input rate.
Measured in/out Harness-reported input/output tokens, as in/out.
Cache rd/wr Cache read / creation tokens, as read/write.
Measured cost Harness-reported total cost for the run.

A rendered slice of a single-layout report — the solo plugin section, with its Triggers and Evals rollups:

Triggers

Provider Model Passed Pass rate Δ rate Avg run Input tokens Est. input cost
Anthropic Claude Fable 5 (claude-fable-5) ½ 50% -50% 7.1s 2,770 $0.0277
Cursor Cursor Composer 2.5 (composer-2.5) 2/2 100% 12.7s n/a n/a
Google Gemini 3.5 Flash (gemini-3.5-flash) 2,580 $0.0039

Evals

Provider Model Passed Δ rate Lift vs base Avg run Input tokens Est. input cost Measured in/out Cache rd/wr Measured cost
Anthropic Claude Fable 5 (claude-fable-5) 0/1 -100% +0% 84.2s 1,827 $0.0183 8,200/3,142 220,000/5,480 $0.7824

Detail tables

The detail tables turn the view case-major — one heading per trigger query or per eval, with the models as rows, so comparing models on a single case is the default. In a single-layout repo they follow the rollups in the root EVALUATION.md; in multi/marketplace repos they move to each plugin's own EVALUATION.md.

Trigger detail, under a #### <query> (expected: yes|no) heading:

Column What it shows
Provider, Model Vendor and model, as in the rollup.
Result Per-query verdict: PASS, FAIL, or (ungraded).
Rate Hits over runs (3/3).
Δ rate Change versus the previous run; when not comparable.
Avg run Mean run duration.
Input tokens Estimated input tokens.
Est. cost The estimate priced at the input rate.

Eval detail, under a #### <id> — <name> heading, mirrors the eval rollup's columns but swaps the passed-count for a per-case Result: PASS, FAIL, ERROR (a runtime error — the agent run failed), or . Below the table, each failed run is itemised — runtime errors as - <model> runtime error: <message>, and each failed expectation as - <model> failed <assertion>: <evidence> — so a failure points straight at the judge's reasoning.

The machine-readable rollup

EVALUATION.<format> is the same data as a structured object, for CI gates, dashboards, and diffing. Its top level:

Field Meaning
schema Rollup schema version.
tool_version evolve version that generated the rollup.
latest_run The most recent ran_at across every entry.
plugins Map of plugin name → { triggers, evals, skills }.

Under each plugin, triggers and evals are maps of provider/model-id → a model rollup, and skills nests the same shape per skill for drill-down. Each model rollup carries the aggregates the Markdown tables render — passed, failed, errored, total, pass_rate, avg_run_seconds, the estimate and measured usage objects — plus the machine-only baseline summary and the previous_delta / baseline_delta objects (rate, avg_run_seconds, input_tokens, output_tokens, cost_usd) that the Markdown renders as Δ rate and Lift vs base.

Gating on thresholds

evolve report --check turns the rollup into a CI gate: it compares the trigger and eval pass rates against the thresholds in your .evolve config (or the --min-*-pass-rate overrides) and exits non-zero on a breach, printing each as FAIL: ….

evolve report --check --min-triggers-pass-rate 0.95 --min-evals-pass-rate 0.90

With no threshold configured and no flag passed, --check is an error rather than a silent pass. Exit codes follow the reference: 0 clean, 1 on a breach, 2 on a usage or runtime error.