
If you can’t measure your agent, you can’t improve it. Worse, you won’t know when it regresses — and agents regress every time you change the model, the prompt, the tools, or the catalog. This page covers eval strategy for agents built on the Engine.

What you’re trying to measure

Agent evals split into three layers, each measuring something different:
  1. Output quality. Did the agent produce a good answer for one input? (Per-task accuracy, format compliance, factuality.)
  2. Process quality. Did the agent get there efficiently? (Tool call count, token count, latency, cost.)
  3. System behavior. Does the deployed agent stay within bounds? (Refusals firing correctly, HITL triggering at the right times, permission prompts not regressing.)
A good eval set covers all three. Most teams start at #1 and grow into #2 and #3 over time.
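If it helps to make the layers concrete, here is a small sketch of what a single eval record might capture per layer. The field names are illustrative, not part of the Engine's API:

from dataclasses import dataclass

@dataclass
class CaseResult:
    # Layer 1: output quality
    passed: bool          # did the checks for this case pass?
    # Layer 2: process quality
    tool_calls: int
    total_tokens: int
    latency_ms: float
    cost_usd: float
    # Layer 3: system behavior
    refusal_fired: bool
    hitl_triggered: bool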

Building a golden dataset

A golden dataset is a set of (input, expected outcome) pairs. Quality beats quantity — 50 carefully chosen cases are more useful than 5,000 auto-generated ones.

Where the cases come from

  • Real production traces. Pull task_ids where the agent did something interesting (or wrong). Anonymize, encode the expected result.
  • Hand-crafted edge cases. Cases that exercise tricky paths — ambiguous inputs, near-violations of the refusal policy, multi-step reasoning.
  • Regression cases. Every bug you find becomes a permanent eval case. Future regressions in that exact pattern get caught.

What an eval case looks like

{
  "id": "eval-2026-04-001",
  "category": "refactoring",
  "input": {
    "message": "Find the function that handles user login and add rate limiting.",
    "task_id": "eval-{run_id}-001",
    "context": {
      "files": ["auth.py", "middleware.py"]
    }
  },
  "expected": {
    "must_call_tools": ["read_file", "write_file"],
    "must_not_call_tools": ["bash"],
    "output_must_contain": ["rate_limit"],
    "output_must_not_contain": ["DELETE"],
    "max_tokens": 8000,
    "max_turns": 6
  }
}
The expected block is what you score against. Different cases score on different dimensions.

Scoring patterns

Three families, with different trade-offs.

Deterministic checks

Regex match, keyword presence, JSON schema validation, tool-call inspection. Fast, cheap, exact. Good for:
  • “Did the agent call the web_search tool at least once?”
  • “Does the JSON output match this schema?”
  • “Does the response contain the rate-limit configuration?”
Limit: only catches what you can specify in code.
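As a sketch, a deterministic scorer for the expected block shown earlier might look like this. It assumes the runner (shown later) hands it the final text, the tool-call list, and a usage dict; the usage keys are illustrative and depend on your own helpers:

def score(expected, text, tool_calls, usage):
    checks = [
        # Tool-call inspection: required tools must appear, forbidden ones must not.
        all(t in tool_calls for t in expected.get("must_call_tools", [])),
        not any(t in tool_calls for t in expected.get("must_not_call_tools", [])),
        # Keyword presence / absence in the final output.
        all(s in text for s in expected.get("output_must_contain", [])),
        not any(s in text for s in expected.get("output_must_not_contain", [])),
        # Budget checks from process metrics.
        usage.get("total_tokens", 0) <= expected.get("max_tokens", float("inf")),
        usage.get("turns", 0) <= expected.get("max_turns", float("inf")),
    ]
    return all(checks)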

LLM-as-judge

A second model (often the same provider, sometimes a smaller one) reads the input, the agent’s output, and a rubric, and assigns a score.
You are evaluating an agent's response.

Input: {input}
Response: {response}

Rubric:
- Does the response answer the question? (yes/no)
- Is the response factually accurate based on the provided context? (yes/no)
- Is the response in the right tone? (yes/no/somewhat)

Return JSON: { "answers": "...", "accurate": "...", "tone": "...", "notes": "..." }
Useful when:
  • You have qualitative criteria (tone, helpfulness, clarity).
  • Determinism would require an unreasonable amount of regex.
  • You want a per-category score, not just pass/fail.
Limits:
  • LLM-as-judge is biased toward verbose, structured responses. Calibrate.
  • Model swaps shift judge calibration too. Pin your judge separately.
  • Don’t use the same model as both agent and judge.
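Wiring the rubric above into a judge call might look like the sketch below. This assumes an OpenAI-compatible chat endpoint and a judge model you pick and pin yourself; none of it is Engine API:

import json
from openai import OpenAI

JUDGE_MODEL = "gpt-4o-mini"  # assumption: pin your own judge, separate from the agent
client = OpenAI()

JUDGE_PROMPT = """You are evaluating an agent's response.

Input: {input}
Response: {response}

Rubric:
- Does the response answer the question? (yes/no)
- Is the response factually accurate based on the provided context? (yes/no)
- Is the response in the right tone? (yes/no/somewhat)

Return JSON: {{ "answers": "...", "accurate": "...", "tone": "...", "notes": "..." }}"""

def judge(input_text, response_text):
    completion = client.chat.completions.create(
        model=JUDGE_MODEL,
        response_format={"type": "json_object"},  # force parseable JSON output
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(input=input_text, response=response_text),
        }],
    )
    return json.loads(completion.choices[0].message.content)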

Human evaluation

Slow, expensive, definitive. Run this on a small sample (~20 cases per release) to ground-truth your automated evals. If LLM-as-judge says something is great and humans say it’s bad, your judge needs work.
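One simple way to draw that sample, stratified by the category field from the eval case example above (a sketch; the per-category count is an assumption to tune):

import random
from collections import defaultdict

def human_review_sample(cases, per_category=4, seed=0):
    # Stratify by category so the human sample covers the whole suite,
    # not just whichever category dominates the dataset.
    by_category = defaultdict(list)
    for case in cases:
        by_category[case["category"]].append(case)
    rng = random.Random(seed)
    return [c for group in by_category.values()
            for c in rng.sample(group, min(per_category, len(group)))]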

Running an eval

The Engine doesn’t ship its own eval framework. Most teams write a thin runner:
import uuid

def run_eval(case, engine_url, engine_key):
    # Fresh task_id per case so memory never bleeds between runs (see Test isolation).
    task_id = f"eval-{case['id']}-{uuid.uuid4().hex[:8]}"

    # stream_execute and sum_usage are your own thin client helpers around
    # the Engine's streaming endpoint and its usage events.
    events = list(stream_execute(
        engine_url, engine_key,
        message=case["input"]["message"],
        task_id=task_id,
    ))

    # Reassemble the final text and the tool-call trace from the event stream.
    text = "".join(e["text"] for e in events if e.get("type") == "text_delta")
    tool_calls = [e["tool"] for e in events if e.get("type") == "tool_call"]
    usage = sum_usage(events)

    return {
        "case_id": case["id"],
        "passed": score(case["expected"], text, tool_calls, usage),
        "text": text,
        "tool_calls": tool_calls,
        "usage": usage,
    }
Run cases in parallel up to your concurrency budget. Most cases run in seconds; a full suite of 100 cases finishes in a few minutes.
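A minimal way to fan cases out under that budget (a sketch; MAX_CONCURRENCY is an assumption to tune to what your Engine deployment tolerates):

from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENCY = 8  # assumption: tune to your deployment's concurrency budget

def run_suite(cases, engine_url, engine_key):
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        return list(pool.map(
            lambda c: run_eval(c, engine_url, engine_key), cases))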

Test isolation

Use a fresh task_id per case (suffix it with the run ID). Otherwise memory from one case bleeds into the next and your scores become non-deterministic. For fully clean isolation, run against a fresh Engine instance with an empty brain, and ship a make eval-clean target that wipes the brain volume before each run.

Tracking results

A good eval system tracks:
  • Per-case pass/fail. With diffs against the previous run.
  • Per-category aggregate. “Refactoring: 47/50, +2 from last run.”
  • Cost and latency distribution. P50, P99, max per case.
  • Tool-call distribution. Which tools the agent uses, how often.
  • Regression flags. Cases that passed in the last run and fail in this one. Treat these as priority-zero.
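Computing the regression flags is a small diff over two runs. A sketch, assuming each run is saved as a JSON list of the per-case dicts the runner returns:

import json

def regression_flags(prev_path, curr_path):
    # Cases that passed in the previous run and fail in this one: priority-zero.
    prev = {r["case_id"]: r["passed"] for r in json.load(open(prev_path))}
    curr = {r["case_id"]: r["passed"] for r in json.load(open(curr_path))}
    return sorted(cid for cid, ok in curr.items()
                  if not ok and prev.get(cid, False))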

Frequency

Trigger                                  What to run
Every PR that touches a prompt or tool   Quick eval (~20 representative cases)
Every PR that bumps the Engine           Full eval, including process metrics
Every model upgrade                      Full eval + manual sample
Weekly                                   Long-running eval (multi-step cases the quick suite skips)
Before any production rollout            Full eval + manual sample

Patterns that fail

  • Eval-driven prompt overfit. If you write the eval first and tune the prompt to pass it, you’ll pass the eval and fail in the wild. Write evals that capture intent, not literal output.
  • Single-judge bias. One LLM judge is consistent but possibly wrong in a consistent direction. Run two judges and look for disagreement.
  • Stale golden data. A case that was a good test six months ago may no longer be representative. Audit the dataset quarterly.
  • Pass-rate as the only metric. A 95% pass rate is meaningless if the 5% are critical. Look at what fails, not just how much.
