Evaluate coding-agent plugins, across every model.
Evolve validates plugin structure, checks whether skills trigger on the right prompts, and runs behavioral eval suites in throwaway workspaces — across Anthropic, OpenAI, Google, Cursor, Copilot and Antigravity. It writes committed Markdown + JSON rollups for review and CI, all from one Go binary.
$ brew install --cask bitwise-media-group/tap/evolve
$ evolve run all
The Evolve live dashboard — [1] execution tree · [2] rollup · [3] runs · [4] details, streaming as agents finish.
One CLI for evaluating coding-agent plugins across every model.
Static checks, trigger-accuracy evals and behavioral evals in one pass.
The evolve run all command chains check → triggers → evals → report.
Anthropic, OpenAI, Google, Cursor, Copilot and Antigravity, each driven through its own harness CLI. Pick a single model or a whole provider with --model.
Tier 1 runs each trigger prompt several times and records whether the expected skill activated — scored per model.
Tier 2 runs behavioral cases in throwaway workspaces, graded by deterministic assertions — files, regexes, commands — and an optional LLM judge.
Committed Markdown + JSON rollups, deterministic to the byte. --strict and report --check gate CI on pass-rate thresholds.
Install once, run against any plugin repo. Auto-detects marketplace, multi and single layouts — terminal-native, no services.
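The per-model trigger scoring described above can be sketched in a few lines. This is only an illustration of the idea, not Evolve's implementation: the function name `trigger_pass_rates`, the tuple layout and the skill name `lint-fix` are all hypothetical.

```python
from collections import defaultdict


def trigger_pass_rates(runs):
    """Return {model: pass rate} for repeated trigger-prompt runs.

    `runs` is an iterable of (model, expected_skill, activated_skill)
    tuples. This record shape is an assumption for illustration only.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for model, expected, activated in runs:
        totals[model] += 1
        # A run passes only when the expected skill is the one that activated.
        if activated == expected:
            passes[model] += 1
    # Sort model names so the result is built in a stable order.
    return {model: passes[model] / totals[model] for model in sorted(totals)}


# Hypothetical data: the same prompt repeated several times per model.
runs = [
    ("anthropic", "lint-fix", "lint-fix"),
    ("anthropic", "lint-fix", "lint-fix"),
    ("anthropic", "lint-fix", None),  # skill did not activate
    ("cursor/composer-2.5", "lint-fix", "lint-fix"),
    ("cursor/composer-2.5", "lint-fix", "lint-fix"),
]

for model, rate in trigger_pass_rates(runs).items():
    print(f"{model}: {rate:.2f}")
```

Repeating each prompt matters because skill activation is not deterministic; a single run per prompt would hide flaky triggers that a pass rate exposes.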
Configure once in a .evolve.yaml at your repo root.
# .evolve.yaml (optional, at the repo root)
layout: marketplace
models:
  - anthropic
  - cursor/composer-2.5
harnesses:
  - claude
  - codex
results_format: json
max_turns: 12
stale_results: keep
report:
  thresholds:
    triggers_min_pass_rate: 0.8
    evals_min_pass_rate: 0.9
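To make the two report thresholds in the config above concrete, here is a minimal sketch of how a pass-rate gate and a byte-stable JSON rollup could work. It is not Evolve's code: the `gate` and `render_rollup` helpers and the rollup shape are hypothetical, and treating a rate exactly equal to the threshold as passing is an assumption.

```python
import json

# Values copied from the example config; the dict layout is illustrative.
THRESHOLDS = {"triggers_min_pass_rate": 0.8, "evals_min_pass_rate": 0.9}


def gate(rates, thresholds=THRESHOLDS):
    """Return the names of the tiers whose pass rate is below its minimum.

    Assumption: a rate equal to the threshold counts as passing.
    """
    failures = []
    if rates["triggers"] < thresholds["triggers_min_pass_rate"]:
        failures.append("triggers")
    if rates["evals"] < thresholds["evals_min_pass_rate"]:
        failures.append("evals")
    return failures


def render_rollup(rates):
    """Serialize with sorted keys and fixed indentation so that identical
    results always produce identical bytes, which keeps diffs clean."""
    return json.dumps(rates, sort_keys=True, indent=2) + "\n"


rates = {"triggers": 0.85, "evals": 0.88}
print(render_rollup(rates), end="")
failed = gate(rates)
print("gate failed for:", failed if failed else "nothing")
```

A CI job built on this idea would exit non-zero whenever `gate` returns a non-empty list, which is the role the pass-rate thresholds play in a gated pipeline.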
Homebrew cask, or build from source with go install.
$ brew install --cask bitwise-media-group/tap/evolve
Confirm harnesses, credentials and counting APIs.
$ evolve doctor
Launch the run and watch the TUI stream live.
$ evolve run all