A golden dataset is the foundation of every eval. It’s the set of (input, expected outcome) pairs that defines what “working” means for your agent. Quality matters more than quantity — fifty hand-curated cases beat 5000 auto-generated ones.Documentation Index
Fetch the complete documentation index at: https://internal.september.wtf/llms.txt
Use this file to discover all available pages before exploring further.
What makes a case “golden”
A case is golden when:- The input represents real usage. It’s something a real user might ask, in roughly the way they’d ask.
- The expected outcome is unambiguous. A reviewer can tell if the agent succeeded or failed.
- It exercises behavior worth testing. Routine cases catch regressions; edge cases catch bugs.
Sourcing cases
Real production traces
The best source. Pulltask_ids where the agent did something
interesting (or wrong), anonymize, encode the expected result.
What makes a trace promotable to an eval case:
- The user kept the conversation going (good signal: they got value).
- The agent took an unusual path (worth pinning).
- The agent failed (must be pinned to prevent regression).
- A user submitted feedback via
/feedback(explicit signal).
Edge cases
Hand-craft cases that exercise tricky paths:- Ambiguous inputs. “Delete the old thing.” Old which? The agent should ask, not guess.
- Refusal triggers. Cases that should hit the refusal policy. If the policy fails open, the eval catches it.
- Tool-selection ambiguity. Cases where two tools could plausibly apply. The eval pins which one is correct.
- Multi-step reasoning. “Find X, then find Y based on X, then combine.” Tests the loop’s depth.
- Rare paths. HITL prompts, compaction triggers, MCP failures. These are hard to hit organically.
Regression cases
Every bug is an eval case. When you fix a bug, encode the bug as an eval case and merge them together. The eval pins that the bug stays fixed. This is the cheapest insurance you have. Future regressions in that exact pattern get caught at eval time, not in production.Schema for a case
Use a stable schema so the runner can score:Categories
Group cases by category. Useful for:- Spotting category-level regressions (“research suite dropped from 95% to 80% — focus there”).
- Running subsets (“only run refactoring cases on PRs that touch refactor-related code”).
- Reporting (“we’re at 95% across all categories”).
| Category | What it covers |
|---|---|
basic-qa | Single-turn factual or reasoning questions. |
tool-selection | Cases where the right tool choice matters. |
multi-step | Cases that require multiple turns. |
refusals | Cases that should hit the refusal policy. |
permissions | Cases that should trigger HITL. |
error-recovery | Cases where a tool fails and the agent should adapt. |
voice-and-format | Cases that test output style. |
memory | Cases that exercise long-term memory. |
regressions | Bugs encoded as cases. |
Curation hygiene
A dataset goes stale. Audit quarterly:- Are these cases still representative? If your product has evolved, some cases may no longer reflect real usage. Retire them.
- Are the expected outcomes still right? If you’ve changed your refusal policy, “the agent must refuse X” cases may now be backwards.
- Are there gaps? New features need new cases.
- Are there duplicates? Two cases that exercise the same path are one case too many.
Anti-patterns
Eval-driven prompt overfit
If you write cases first and tune the agent until it passes them, you end up with an agent that passes the eval and fails in the wild. The eval should capture intent, not literal output. Pin the behavior, not the words.Output-equality scoring on free-form text
Comparing the agent’s text output to an exact expected string is brittle and almost never useful. Use must-contain/must-not-contain, LLM-as-judge, or schema validation instead.One-judge tyranny
A single LLM-as-judge has a consistent bias. If your eval depends entirely on one judge’s score, you’re optimizing toward that judge’s quirks. Use two judges and look for disagreement; periodically have humans audit a sample.Case explosion
If your dataset has 1000+ cases, you’ve stopped curating and started collecting. Trim ruthlessly. A focused 100-case suite beats a sprawling 1000-case one for both signal and run cost.Versioning
Treat the dataset like a public API:- Pin in tests. Every eval run runs against a specific dataset version.
- Diff cases in PRs. When you add or change a case, the diff is reviewable.
- Don’t mutate accepted cases. If the case is wrong, mark it deprecated and add a corrected one.
See also
- Eval harness — running cases.
- Regression — using the dataset for regression testing.
- Evaluation — the broader picture.

