MULTI-TIER PLUGIN EVAL CLI

EVOLVE

Evaluate coding-agent plugins, across every model.

Evolve validates plugin structure, checks whether skills trigger on the right prompts, and runs behavioral eval suites in throwaway workspaces — across Anthropic, OpenAI, Google, Cursor, Copilot and Antigravity. It writes committed Markdown + JSON rollups for review and CI, all from one Go binary.

zsh — install

$ brew install --cask bitwise-media-group/tap/evolve

$ evolve run all

Get started → Star on GitHub

The Evolve TUI dashboard showing execution tree, rollups, runs and details

The Evolve live dashboard — [1] execution tree · [2] rollup · [3] runs · [4] details, streaming as agents finish.

anthropic/claude-haiku-4-5·anthropic/claude-sonnet-4-6·anthropic/claude-opus-4-8·anthropic/claude-fable-5·openai/gpt-5.2·openai/gpt-5.3-codex·openai/gpt-5.3-codex-spark·openai/gpt-5.4-mini·openai/gpt-5.4·openai/gpt-5.5·google/gemini-3-flash-preview·google/gemini-3.1-flash-lite·google/gemini-3.1-pro-preview·google/gemini-3.1-pro·google/gemini-3.5-flash·cursor/composer-2.5·anthropic/claude-haiku-4-5·anthropic/claude-sonnet-4-6·anthropic/claude-opus-4-8·anthropic/claude-fable-5·openai/gpt-5.2·openai/gpt-5.3-codex·openai/gpt-5.3-codex-spark·openai/gpt-5.4-mini·openai/gpt-5.4·openai/gpt-5.5·google/gemini-3-flash-preview·google/gemini-3.1-flash-lite·google/gemini-3.1-pro-preview·google/gemini-3.1-pro·google/gemini-3.5-flash·cursor/composer-2.5·

2,058EVALS IN SUITE

1,213 ✓PASSED

367 ✕FAILED

$87.78SPENT THIS RUN

// WHAT EVOLVE DOES

One CLI for evaluating coding-agent plugins across every model.

Multi-tier by design

Static checks, trigger-accuracy evals and behavioral evals in one pass. evolve run all runs check → triggers → evals → report.

Every model

Anthropic, OpenAI, Google, Cursor, Copilot and Antigravity — each driven through its own harness CLI. Pick a single model or a whole provider with --model.

Trigger accuracy

Tier 1 runs each trigger prompt several times and records whether the expected skill activated — scored per model.

Behavioral evals

Tier 2 runs behavioral cases in throwaway workspaces, graded by deterministic assertions — files, regexes, commands — and an optional LLM judge.

Deterministic reports

Committed Markdown + JSON rollups, deterministic to the byte. --strict and report --check gate CI on pass-rate thresholds.

One Go binary

Install once, run against any plugin repo. Auto-detects marketplace, multi and single layouts — terminal-native, no services.

HARNESSES

claude codex gemini copilot agy

PROVIDERS

anthropic openai google cursor

// CONFIGURE ONCE

Configure once in a .evolve.yaml at your repo root.

.evolve.yaml

# .evolve.yaml — optional, at the repo root
layout: marketplace
models:
  - anthropic
  - cursor/composer-2.5
harnesses:
  - claude
  - codex
results_format: json
max_turns: 12
stale_results: keep
report:
  thresholds:
    triggers_min_pass_rate: 0.8
    evals_min_pass_rate: 0.9

// THREE COMMANDS TO YOUR FIRST RUN

Install

Homebrew cask, or build from source with go install.

$ brew install --cask bitwise-media-group/tap/evolve

Check

Confirm harnesses, credentials and counting APIs.

$ evolve doctor

Run

Launch the run and watch the TUI stream live.

$ evolve run all