A golden dataset is the foundation of every eval. It’s the set of (input, expected outcome) pairs that defines what “working” means for your agent. Quality matters more than quantity — fifty hand-curated cases beat five thousand auto-generated ones.

What makes a case “golden”

A case is golden when:
  • The input represents real usage. It’s something a real user might ask, in roughly the way they’d ask.
  • The expected outcome is unambiguous. A reviewer can tell if the agent succeeded or failed.
  • It exercises behavior worth testing. Routine cases catch regressions; edge cases catch bugs.
A bad case fails on any of those — synthetic inputs no user would write, vague success criteria, or behavior that’s not worth measuring.

Sourcing cases

Real production traces

The best source. Pull task_ids where the agent did something interesting (or wrong), anonymize them, and encode the expected result. What makes a trace promotable to an eval case (a filtering sketch follows the list):
  • The user kept the conversation going (good signal: they got value).
  • The agent took an unusual path (worth pinning).
  • The agent failed (must be pinned to prevent regression).
  • A user submitted feedback via /feedback (explicit signal).
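
Most of these signals can be checked mechanically before anyone reviews a trace by hand. A minimal Python sketch, assuming hypothetical trace fields (turn_count, tool_path_rarity, error, feedback); substitute whatever your tracing store actually records:

from typing import Iterable

def promotion_candidates(traces: Iterable[dict]) -> list[dict]:
    """Return traces worth a human look as potential eval cases."""
    candidates = []
    for t in traces:
        signals = {
            "long_conversation": t.get("turn_count", 0) >= 5,    # user kept going
            "unusual_path": t.get("tool_path_rarity", 0) > 0.9,  # rare tool sequence
            "failure": bool(t.get("error")),                     # must be pinned
            "explicit_feedback": bool(t.get("feedback")),        # /feedback signal
        }
        if any(signals.values()):
            candidates.append({"task_id": t["task_id"],
                               "signals": [k for k, v in signals.items() if v]})
    return candidates

The 5-turn and 0.9 thresholds are placeholders; a human still decides what actually gets promoted.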

Edge cases

Hand-craft cases that exercise tricky paths:
  • Ambiguous inputs. “Delete the old thing.” The old what? The agent should ask, not guess.
  • Refusal triggers. Cases that should hit the refusal policy. If the policy fails open, the eval catches it.
  • Tool-selection ambiguity. Cases where two tools could plausibly apply. The eval pins which one is correct.
  • Multi-step reasoning. “Find X, then find Y based on X, then combine.” Tests the loop’s depth.
  • Rare paths. HITL prompts, compaction triggers, MCP failures. These are hard to hit organically.

Regression cases

Every bug is an eval case. When you fix a bug, encode the failing input as an eval case and land it in the same change as the fix. The eval pins that the bug stays fixed. This is the cheapest insurance you have: future regressions in that exact pattern get caught at eval time, not in production.

Schema for a case

Use a stable schema so the runner can score cases mechanically:
{
  "id": "eval-2026-04-014",
  "category": "multi-step-reasoning",
  "description": "Agent should search the web, then summarize, then compare.",
  "tags": ["web_search", "synthesis"],
  "input": {
    "message": "Compare the response times of GPT-5.4-mini and Claude Haiku 4.5.",
    "task_id": null
  },
  "expected": {
    "must_call_tools": ["web_search"],
    "output_must_contain": ["GPT", "Claude", "ms"],
    "output_must_not_contain": ["I don't know"],
    "max_turns": 6,
    "max_tokens": 4000,
    "judge_rubric": "Does the response cite specific numbers and explain the comparison?"
  },
  "added_on": "2026-04-15",
  "added_by": "shrivathsanm",
  "source": "production trace task-2026-04-12-...",
  "notes": "First pass missed Claude entirely. Fixed in v2.3.1."
}
The schema can grow over time; keep it backwards-compatible.
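
One way to keep it backwards-compatible in practice is to treat a small core as required and funnel unknown fields into an untyped bucket, so older runners never choke on newer cases. A minimal sketch; the choice of required fields here is an assumption, not a contract:

from dataclasses import dataclass, field

REQUIRED = {"id", "category", "input", "expected"}

@dataclass
class Case:
    id: str
    category: str
    input: dict
    expected: dict
    extra: dict = field(default_factory=dict)  # bucket for fields added later

def load_case(raw: dict) -> Case:
    missing = REQUIRED - raw.keys()
    if missing:
        raise ValueError(f"case {raw.get('id', '?')} is missing fields: {missing}")
    return Case(**{k: raw[k] for k in REQUIRED},
                extra={k: v for k, v in raw.items() if k not in REQUIRED})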

Categories

Group cases by category. Useful for:
  • Spotting category-level regressions (“research suite dropped from 95% to 80% — focus there”).
  • Running subsets (“only run refactoring cases on PRs that touch refactor-related code”).
  • Reporting (“we’re at 95% across all categories”).
Suggested categories for a typical agentic deployment:
Category          What it covers
basic-qa          Single-turn factual or reasoning questions.
tool-selection    Cases where the right tool choice matters.
multi-step        Cases that require multiple turns.
refusals          Cases that should hit the refusal policy.
permissions       Cases that should trigger HITL.
error-recovery    Cases where a tool fails and the agent should adapt.
voice-and-format  Cases that test output style.
memory            Cases that exercise long-term memory.
regressions       Bugs encoded as cases.
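
A runner can take the grouping directly as a filter. A sketch, where run_case is a stand-in for your actual execute-and-score step:

from collections import defaultdict
from typing import Callable

def run_suite(cases: list[dict],
              run_case: Callable[[dict], bool],
              only: set[str] | None = None) -> dict[str, float]:
    """Run a suite (or a category subset) and report pass rate per category."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        cat = case["category"]
        if only is not None and cat not in only:
            continue
        total[cat] += 1
        passed[cat] += run_case(case)  # True counts as 1
    return {cat: passed[cat] / total[cat] for cat in total}

Calling run_suite(cases, run_case, only={"refusals", "permissions"}) on a policy-touching PR runs just the sensitive subsets.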

Curation hygiene

A dataset goes stale. Audit quarterly:
  • Are these cases still representative? If your product has evolved, some cases may no longer reflect real usage. Retire them.
  • Are the expected outcomes still right? If you’ve changed your refusal policy, “the agent must refuse X” cases may now be backwards.
  • Are there gaps? New features need new cases.
  • Are there duplicates? Two cases that exercise the same path are one case too many.
Each case should justify its existence. If you can’t say what bug or behavior a case is pinning, retire it.
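
The duplicate and justification checks can be partly mechanized; retiring a case stays a human call. A sketch against the schema above, where treating identical category-plus-expectations as a duplicate is a heuristic, not a guarantee:

import json

def audit(cases: list[dict]) -> dict[str, list[str]]:
    findings: dict[str, list[str]] = {"possible_duplicates": [], "unjustified": []}
    seen: dict[str, str] = {}
    for case in cases:
        # Two cases exercising the same path tend to share a category
        # and identical expectations.
        key = case["category"] + json.dumps(case["expected"], sort_keys=True)
        if key in seen:
            findings["possible_duplicates"].append(f"{seen[key]} ~ {case['id']}")
        else:
            seen[key] = case["id"]
        # A case with no source and no notes can't say what it pins.
        if not case.get("source") and not case.get("notes"):
            findings["unjustified"].append(case["id"])
    return findings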

Anti-patterns

Eval-driven prompt overfit

If you write cases first and tune the agent until it passes them, you end up with an agent that passes the eval and fails in the wild. The eval should capture intent, not literal output. Pin the behavior, not the words.

Output-equality scoring on free-form text

Comparing the agent’s text output to an exact expected string is brittle and almost never useful. Use must-contain/must-not-contain, LLM-as-judge, or schema validation instead.
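
A minimal sketch of the containment check, keyed to the expected block in the schema above (case-insensitive matching is an assumption; tighten it if casing matters):

def score_text(output: str, expected: dict) -> bool:
    """Substring checks instead of exact output equality."""
    text = output.lower()
    must = expected.get("output_must_contain", [])
    must_not = expected.get("output_must_not_contain", [])
    return (all(s.lower() in text for s in must)
            and not any(s.lower() in text for s in must_not))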

One-judge tyranny

A single LLM-as-judge has a consistent bias. If your eval depends entirely on one judge’s score, you’re optimizing toward that judge’s quirks. Use two judges and look for disagreement; periodically have humans audit a sample.
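
A sketch of the two-judge pattern, where judge_a and judge_b stand in for two different LLM-as-judge calls (different models, or at least different prompts); requiring both to pass is one policy choice among several:

from typing import Callable

Judge = Callable[[str, str], bool]  # (output, rubric) -> pass/fail

def dual_judge(output: str, rubric: str, judge_a: Judge, judge_b: Judge) -> dict:
    a = judge_a(output, rubric)
    b = judge_b(output, rubric)
    return {
        "passed": a and b,       # strict: both judges must agree it passes
        "disagreement": a != b,  # queue these for human audit
    }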

Case explosion

If your dataset has 1000+ cases, you’ve stopped curating and started collecting. Trim ruthlessly. A focused 100-case suite beats a sprawling 1000-case one for both signal and run cost.

Versioning

Treat the dataset like a public API:
  • Pin in tests. Every eval run executes against a specific dataset version.
  • Diff cases in PRs. When you add or change a case, the diff is reviewable.
  • Don’t mutate accepted cases. If the case is wrong, mark it deprecated and add a corrected one.
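
A sketch of what pinning plus deprecation can look like in the loader; the top-level version and per-case deprecated fields are assumptions layered onto the schema above:

import json

PINNED_VERSION = "2026-04-15"  # bump deliberately, in a reviewed PR

def load_dataset(path: str) -> list[dict]:
    with open(path) as f:
        data = json.load(f)
    if data["version"] != PINNED_VERSION:
        raise RuntimeError(
            f"dataset is {data['version']}, eval is pinned to {PINNED_VERSION}")
    # Deprecated cases stay in the file for history but never run.
    return [c for c in data["cases"] if not c.get("deprecated", False)]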
