If you can’t measure your agent, you can’t improve it. Worse, you won’t know when it regresses, and agents regress every time you change the model, the prompt, the tools, or the catalog. This page covers eval strategy for agents built on the Engine.
## What you’re trying to measure
Agent evals split into three layers, each measuring something different:

- Output quality. Did the agent produce a good answer for one input? (Per-task accuracy, format compliance, factuality.)
- Process quality. Did the agent get there efficiently? (Tool call count, token count, latency, cost.)
- System behavior. Does the deployed agent stay within bounds? (Refusals firing correctly, HITL triggering at the right times, permission prompts not regressing.)
## Building a golden dataset
A golden dataset is a set of (input, expected outcome) pairs. Quality beats quantity: 50 carefully chosen cases are more useful than 5,000 auto-generated ones.

### Where the cases come from
- Real production traces. Pull `task_id`s where the agent did something interesting (or wrong). Anonymize, and encode the expected result.
- Hand-crafted edge cases. Cases that exercise tricky paths: ambiguous inputs, near-violations of the refusal policy, multi-step reasoning.
- Regression cases. Every bug you find becomes a permanent eval case. Future regressions in that exact pattern get caught.
### What an eval case looks like

The `expected` block is what you score against. Different cases score on different dimensions.
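The case format is up to you; the Engine doesn’t prescribe one. A minimal sketch of one case as a Python dict (the field names are illustrative, not an Engine schema):

```python
# Hypothetical eval case shape; field names are illustrative, not an Engine API.
case = {
    "id": "rate-limit-001",
    "category": "config-lookup",
    "input": "What rate limits apply to the public API?",
    "expected": {
        # Deterministic dimension: the agent must call this tool at least once.
        "tool_called": "web_search",
        # Judged dimension: scored by an LLM judge against this rubric.
        "rubric": "Quotes the current rate-limit configuration and explains the tiers.",
    },
}
```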
## Scoring patterns
Three families, with different trade-offs.

### Deterministic checks

Regex match, keyword presence, JSON schema validation, tool-call inspection. Fast, cheap, exact. Good for:

- “Did the agent call the `web_search` tool at least once?”
- “Does the JSON output match this schema?”
- “Does the response contain the rate-limit configuration?”
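A few such checks in plain Python. The trace shape (a list of dicts with a `"tool"` key) is an assumption about your runner, not an Engine format:

```python
import json
import re

def check_tool_called(trace: list[dict], tool: str) -> bool:
    """Tool-call inspection: did the agent call `tool` at least once?
    Assumes the runner records each call as {"tool": ..., "args": ...}."""
    return any(call["tool"] == tool for call in trace)

def check_schema(output: str) -> bool:
    """Structural check: is the output a JSON object with the fields we expect?
    For real schemas, a library like jsonschema is the usual tool."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and {"limit", "window_seconds"} <= data.keys()

def check_mentions_rate_limit(output: str) -> bool:
    """Keyword/regex presence check."""
    return re.search(r"\brate[- ]limit", output, re.IGNORECASE) is not None
```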
### LLM-as-judge
A second model (often the same provider, sometimes a smaller one) reads the input, the agent’s output, and a rubric, and assigns a score. Use it when:

- You have qualitative criteria (tone, helpfulness, clarity).
- Determinism would require an unreasonable amount of regex.
- You want a per-category score, not just pass/fail.

Caveats:

- LLM-as-judge is biased toward verbose, structured responses. Calibrate.
- Model swaps shift judge calibration too. Pin your judge separately.
- Don’t use the same model as both agent and judge.
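A minimal judge sketch. `call_model` is a placeholder for whatever model client you use (pinned independently of the agent model, per the caveat above), and the prompt and 1–5 scale are illustrative:

```python
JUDGE_PROMPT = """You are grading an AI agent's answer.

Task: {task}
Agent answer: {answer}
Rubric: {rubric}

Score the answer from 1 (fails the rubric) to 5 (fully satisfies it).
Reply with a single integer and nothing else."""

def judge(task: str, answer: str, rubric: str, call_model) -> int:
    """`call_model` is a stand-in for your pinned judge-model client;
    it takes a prompt string and returns the model's text reply."""
    reply = call_model(JUDGE_PROMPT.format(task=task, answer=answer, rubric=rubric))
    return int(reply.strip())
```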
### Human evaluation
Slow, expensive, definitive. Run this on a small sample (~20 cases per release) to ground-truth your automated evals. If LLM-as-judge says something is great and humans say it’s bad, your judge needs work.

## Running an eval

The Engine doesn’t ship its own eval framework. Most teams write a thin runner:
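Something like the following, sketched in Python. `run_agent` and `score` stand in for your Engine client and your scoring functions; neither is an Engine API:

```python
import uuid

def run_eval(cases, run_agent, score):
    """Minimal runner sketch. `run_agent(task_id, input)` calls the agent;
    `score(case, result)` applies whichever checks the case declares."""
    run_id = uuid.uuid4().hex[:8]
    results = []
    for case in cases:
        # Fresh task_id per case, suffixed with the run ID (see "Test
        # isolation" below), so memory can't bleed between cases.
        task_id = f"eval-{case['id']}-{run_id}"
        result = run_agent(task_id, case["input"])
        results.append({"case": case["id"], "passed": score(case, result)})
    passed = sum(r["passed"] for r in results)
    print(f"{passed}/{len(results)} passed (run {run_id})")
    return results
```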
### Test isolation

Use a fresh `task_id` per case (suffix with the run ID). Otherwise memory from one case bleeds into another and your scores become non-deterministic.
For really clean isolation, run against a fresh Engine instance with an empty brain. Ship a `make eval-clean` target that wipes the brain volume before running.
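A sketch of that target, assuming the brain lives in a Docker volume named `engine-brain` and that `make eval` runs your suite (both are assumptions about your setup):

```make
# Sketch only: adjust the volume name and the eval entry point to your setup.
eval-clean:
	docker volume rm -f engine-brain
	$(MAKE) eval
```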
## Tracking results

A good eval system tracks:

- Per-case pass/fail. With diffs against the previous run.
- Per-category aggregate. “Refactoring: 47/50, +2 from last run.”
- Cost and latency distribution. P50, P99, max per case.
- Tool-call distribution. Which tools the agent uses, how often.
- Regression flags. Cases that passed in the last run and fail in this one. Treat these as priority-zero.
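Two of those pieces, regression flags and per-category aggregates, are a few lines each. A sketch, assuming each result record carries `category` and `passed` fields as in the runner above:

```python
from collections import defaultdict

def regression_flags(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Cases that passed in the last run and fail in this one.
    Both arguments map case ID -> pass/fail."""
    return [cid for cid, ok in current.items()
            if not ok and previous.get(cid, False)]

def per_category(results: list[dict]) -> dict[str, tuple[int, int]]:
    """Aggregate (passed, total) by category, e.g. {"refactoring": (47, 50)}."""
    agg = defaultdict(lambda: [0, 0])
    for r in results:
        agg[r["category"]][1] += 1
        agg[r["category"]][0] += int(r["passed"])
    return {cat: tuple(counts) for cat, counts in agg.items()}
```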
## Frequency
| Trigger | What to run |
|---|---|
| Every PR that touches a prompt or tool | Quick eval (~20 representative cases) |
| Every PR that bumps the Engine | Full eval, including process metrics |
| Every model upgrade | Full eval + manual sample |
| Weekly | Long-running eval (multi-step cases the quick suite skips) |
| Before any production rollout | Full eval + manual sample |
## Patterns that fail
- Eval-driven prompt overfit. If you write the eval first and tune the prompt to pass it, you’ll pass the eval and fail in the wild. Write evals that capture intent, not literal output.
- Single-judge bias. One LLM judge is consistent but possibly wrong in a consistent direction. Run two judges and look for disagreement.
- Stale golden data. A case that was a good test six months ago may no longer be representative. Audit the dataset quarterly.
- Pass-rate as the only metric. A 95% pass rate is meaningless if the 5% are critical. Look at what fails, not just how much.
## See also
- Eval harness — implementation details for setting up your eval runner.
- Golden datasets — sourcing and curation.
- Regression testing — catching regressions across Engine versions.

