Skip to content

Assertions

A Tier 2 eval succeeds when its assertions pass. Each assertion is one graded condition checked against the agent's final response and the throwaway workspace it left behind. There are two families:

  • Deterministic checksfile_exists, file_absent, regex, not_regex, command, tool_call. These inspect files, output, exit codes, or observed tool calls. They are cheap, fast, and reproducible.
  • The LLM judgellm. A pinned model reads the assertion and the agent's work and returns a verdict. Reserve it for holistic or subjective checks the deterministic ones can't express.

Every assertion is one entry in an eval's assertions array:

{
    "id": "parametrize",
    "prompt": "Write parametrized tests for the parse function …",
    "files": ["files/src/myapp/kv.py"],
    "assertions": [
        { "type": "file_exists", "path": "tests/test_kv.py" },
        {
            "type": "regex",
            "path": "tests/test_kv.py",
            "pattern": "pytest\\.mark\\.parametrize"
        }
    ]
}

An eval must declare at least one expectations entry or one assertions entry, or loading fails validation.

How assertions are graded

  • Each assertion is pass / fail / skipped. Most return pass or fail. A few return skipped when the check can't run — a command whose requires binary is missing, or a tool_call against a harness that can't report tool calls. Skipped assertions count neither for nor against the case.
  • Order is deterministic. expectations (see the llm section) expand to llm assertions graded first, in authored order, followed by the authored assertions in order.
  • Patterns are Go RE2 regexes — no backreferences or lookaround. In a JSON or JSONC file every backslash doubles (\\., \\s); YAML single-quoted strings keep one.

The closed set of types is file_exists, file_absent, regex, not_regex, command, tool_call, llm. Each is covered below with a realistic example drawn from the bitwise-media-group skills.

file_exists

Passes when path (workspace-relative) exists after the run. Use it to assert the agent created the file it was asked to.

{ "type": "file_exists", "path": "tests/test_kv.py" }
Field Required Meaning
path yes Path relative to the eval workspace root to stat for

file_absent

The mirror of file_exists: passes when path does not exist. Use it to assert the agent didn't create something it shouldn't have.

{ "type": "file_absent", "path": "internal/version/doc.go" }

Here the skill should recognise that a single-file package already carries its package comment, so adding a separate doc.go would be redundant — the assertion fails if the agent adds one anyway.

Field Required Meaning
path yes Path relative to the workspace that must not exist

regex

Passes when pattern matches. With a path, the pattern is matched against that workspace file's contents; without a path, it's matched against the agent's final response text. Patterns are compiled in multiline mode, so ^ and $ anchor to line boundaries.

{
    "type": "regex",
    "path": "tests/test_kv.py",
    "pattern": "pytest\\.mark\\.parametrize"
}

A path-less variant checks the agent's own report — for example, that it surfaced the command it ran:

{ "type": "regex", "pattern": "terraform validate" }
Field Required Meaning
pattern yes RE2 pattern, compiled multiline
path no Workspace file to match against; when omitted, matches the agent's output

A regex with a path whose file is missing fails (there is nothing to match against).

not_regex

The negation of regex: passes when pattern does not match. Use it to assert an anti-pattern is absent.

{
    "type": "not_regex",
    "path": "tests/test_kv.py",
    "pattern": "unittest|TestCase"
}

This guards a pytest suite against stdlib unittest leaking in.

A missing file fails not_regex

When path is set, the file is read before matching. If the file doesn't exist the assertion fails rather than vacuously passing — so not_regex with a path effectively asserts "the file exists and does not contain this". To assert a file is gone, use file_absent.

Field Required Meaning
pattern yes RE2 pattern, compiled multiline
path no Workspace file to match against; when omitted, matches the agent's output

command

Runs run through /bin/sh -c in the workspace and passes when the exit code matches expect_exit (default 0). This is the strongest deterministic check — it runs the real toolchain over the agent's output.

{
    "type": "command",
    "run": "terraform fmt -recursive -check",
    "requires": "terraform"
}

Use cwd to run inside a subdirectory of the workspace, and chain a setup step before the check:

{
    "type": "command",
    "run": "terraform init -backend=false -input=false >/dev/null && terraform validate",
    "cwd": "modules/widget",
    "requires": "terraform"
}

Set expect_exit when success means a non-zero code — for instance, asserting a forbidden token is gone (grep exits 1 when it finds no match):

{ "type": "command", "run": "grep -rq 'TODO' src/", "expect_exit": 1 }
Field Required Meaning
run yes Shell command run via /bin/sh -c
cwd no Workspace-relative directory to run in (default: the workspace root)
requires no A binary that must be on PATH; if absent the assertion is skipped, not failed
expect_exit no Exit code that counts as success (default 0)

requires keeps suites portable: an eval that needs terraform or python3 skips cleanly on a machine without it instead of reporting a false failure. The command shares the eval's timeout_seconds.

tool_call

Passes when the agent actually invoked a matching tool during the run — inspecting behaviour, not just the final artifact. tool is a regex matched against each observed tool name; the optional pattern is matched against that call's JSON-serialized arguments. An MCP tool surfaces as mcp__<server>__<tool>.

{ "type": "tool_call", "tool": "^Bash$", "pattern": "terraform validate" }

This confirms the agent ran terraform validate rather than only claiming to. To assert it reached for a specific MCP tool:

{ "type": "tool_call", "tool": "mcp__github__create_pull_request" }
Field Required Meaning
tool yes Regex matched against an observed tool name (e.g. ^Bash$, mcp__github__.*)
pattern no Regex matched against the call's JSON-serialized arguments

Tri-state: if the harness under test can't report tool calls the assertion is skipped; if it reports calls but none match, the assertion fails. Unlike regex, tool and pattern are not compiled in multiline mode.

llm

Graded by an LLM judge instead of a fixed rule. The judge is pinned to claude-sonnet-4-6 regardless of the model under test, so verdicts stay comparable across providers. It reads the assertion text, the agent's final response, the eval's expected_output (as context, never a separate check), and may Read/Glob/Grep the workspace before returning a pass/fail verdict with a short evidence quote.

{
    "type": "llm",
    "text": "The new tests cover the error paths, not just the happy path."
}
Field Required Meaning
text yes The statement the judge must verify

Reserve llm for genuinely subjective or holistic judgements; prefer a deterministic check whenever one can express the same condition, since each llm assertion costs a model call.

Bare-string shorthand

A plain string anywhere in assertions is shorthand for { "type": "llm", "text": <string> }:

"assertions": [
  "The commit message follows Conventional Commits.",
  { "type": "file_exists", "path": "CHANGELOG.md" }
]

expectations

expectations is a top-level array of statements, each expanded into an llm assertion and graded before the authored assertions, in authored order. Reports tag these with source: "expectation" so they're distinguishable from inline assertions.

{
    "id": "focused-tests",
    "prompt": "Add tests for the new validator and explain your choices.",
    "expectations": [
        "The summary explains why each test case was chosen.",
        "No production code under src/ was modified."
    ],
    "assertions": [{ "type": "file_exists", "path": "tests/test_validator.py" }]
}

expected_output is related but different: it's the author's prose description of success. The judge sees it as context, but it is never graded as its own assertion — so migrating a suite that only has expected_output keeps its pass rate.

A combined example

A single eval typically mixes families: deterministic checks pin the concrete artifacts, and one judge assertion covers the part only prose can describe.

{
    "id": "parametrize",
    "prompt": "Write parametrized tests for parse in src/myapp/kv.py, including the error paths, in tests/test_kv.py.",
    "allowed_tools": "Read Write Edit Glob Grep Skill Bash(uv *) Bash(pytest *) Bash(python3 *)",
    "files": ["files/src/myapp/kv.py"],
    "expectations": ["The agent explains which error paths it parametrized and why."],
    "assertions": [
        { "type": "file_exists", "path": "tests/test_kv.py" },
        {
            "type": "regex",
            "path": "tests/test_kv.py",
            "pattern": "pytest\\.mark\\.parametrize"
        },
        {
            "type": "regex",
            "path": "tests/test_kv.py",
            "pattern": "pytest\\.raises"
        },
        {
            "type": "not_regex",
            "path": "tests/test_kv.py",
            "pattern": "unittest|TestCase"
        },
        {
            "type": "command",
            "run": "python3 -m py_compile tests/test_kv.py",
            "requires": "python3"
        }
    ]
}

For the case around these assertions — the prompt, input files/fixtures, and the per-case run controls — see Behavioral evals. Every field above is validated by the evals JSON Schema; point your editor at it via a "$schema" key for completion and inline errors.