Skip to content

Reference

Commands

Top-level:

Command Description
evolve doctor Check provider CLIs, credentials and counting APIs.
evolve models Show the effective provider/model matrix and pricing metadata.
evolve report Regenerate evaluation rollups from stored results.
evolve run Run static checks, trigger checks, behavioral evals, or the full pipeline.
evolve version Print build metadata.

Run tiers:

evolve run checks     Tier 0 — static validation (no agents run)
evolve run triggers   Tier 1 — does the expected skill activate?
evolve run evals      Tier 2 — behavioral cases in throwaway workspaces
evolve run all        check → triggers → evals → report

The generated command reference lives in docs/cli/evolve.md.

Global flags

Flag Description
--root PATH Repository root to operate on.
--layout auto\|single\|multi\|marketplace Repository layout.
--results-format json\|jsonc\|yaml Results and rollup format.
--json Emit machine-readable JSONL progress.
-v, --verbose Debug logging.

Run flags

Flag Description
--plugin a,b (alias --plugins) Restrict the run to one or more plugins.
--skill x,y (alias --skills) Restrict the run to one or more skills.
--model anthropic,openai (alias --models) Pick providers / model ids, or all.
--eval case-id Restrict run evals to one behavioral case.
--runs N Repeat each trigger prompt N times.
--jobs N Concurrency for behavioral evals.
--max-turns N Per-case turn cap.
--timeout SECONDS Per-case timeout.
--new Run only work with missing or stale stored results.
--modified Rerun only cases whose authored content changed since their results.
--keep-workspaces Leave temporary workspaces behind for debugging.
--count-only Compute token usage without running agents.
--stale-results keep\|drop What to do with results outside the models set.
--strict Turn check / eval failures into a non-zero exit.
--no-tui Force plain line output (also EVOLVE_NO_TUI=1).

Exit codes

Code Meaning
0 The run completed. By default, failed checks / evals only warn.
1 With --strict (or report --check): check / eval / threshold failures.
2 Usage, configuration, or runtime errors.

Providers

Provider Harness CLI Credential Triggers Evals Token counting
Anthropic claude ANTHROPIC_API_KEY (or OAuth token vars) yes yes yes
OpenAI codex OPENAI_API_KEY yes yes yes
Google gemini GEMINI_API_KEY / GOOGLE_API_KEY yes no yes
Cursor agent (cursor-agent) CURSOR_API_KEY yes yes no
GitHub Copilot copilot COPILOT_GITHUB_TOKEN (or GH_TOKEN / GITHUB_TOKEN) yes yes no
Antigravity agy OAuth login via agy yes yes no

Each provider needs its harness CLI on PATH and the credentials that CLI requires. Cursor, Copilot and Antigravity expose no token-counting API, so their figures render as n/a — structurally absent, not zero. Run evolve doctor to check the local environment.

Reports

evolve report rebuilds repository-level rollups from stored per-skill results, writing EVALUATION.md plus a machine-readable rollup in the configured format (per-plugin detail pages in marketplace / multi repos). Gate on thresholds:

evolve report --check --min-triggers-pass-rate 0.95 --min-evals-pass-rate 0.90

Reviewing reports walks through the report layout, every table and its columns, and the threshold gate.