A golden dataset is the foundation of every eval. It’s the set of (input, expected outcome) pairs that defines what “working” means for your agent. Quality matters more than quantity — fifty hand-curated cases beat five thousand auto-generated ones.

What makes a case “golden”

A case is golden when:
  • The input represents real usage. It’s something a real user might ask, in roughly the way they’d ask.
  • The expected outcome is unambiguous. A reviewer can tell if the agent succeeded or failed.
  • It exercises behavior worth testing. Routine cases catch regressions; edge cases catch bugs.
A bad case fails on any of those — synthetic inputs no user would write, vague success criteria, or behavior that’s not worth measuring.

Sourcing cases

Real production traces

The best source. Pull task_ids where the agent did something interesting (or wrong), anonymize them, and encode the expected result. What makes a trace promotable to an eval case (a filtering sketch follows the list):
  • The user kept the conversation going (good signal: they got value).
  • The agent took an unusual path (worth pinning).
  • The agent failed (must be pinned to prevent regression).
  • A user submitted feedback via /feedback (explicit signal).
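
Most of these signals can be checked mechanically before anyone reviews a trace by hand. A minimal Python sketch, assuming hypothetical trace fields (turn_count, tool_path_rarity, error, feedback); substitute whatever your tracing store actually records:

from typing import Iterable

def promotion_candidates(traces: Iterable[dict]) -> list[dict]:
    """Return traces worth a human look as potential eval cases."""
    candidates = []
    for t in traces:
        signals = {
            "long_conversation": t.get("turn_count", 0) >= 5,    # user kept going
            "unusual_path": t.get("tool_path_rarity", 0) > 0.9,  # rare tool sequence
            "failure": bool(t.get("error")),                     # must be pinned
            "explicit_feedback": bool(t.get("feedback")),        # /feedback signal
        }
        if any(signals.values()):
            candidates.append({"task_id": t["task_id"],
                               "signals": [k for k, v in signals.items() if v]})
    return candidates

The 5-turn and 0.9 thresholds are placeholders; a human still decides what actually gets promoted.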

Edge cases

Hand-craft cases that exercise tricky paths:
  • Ambiguous inputs. “Delete the old thing.” The old what? The agent should ask, not guess.
  • Refusal triggers. Cases that should hit the refusal policy. If the policy fails open, the eval catches it.
  • Tool-selection ambiguity. Cases where two tools could plausibly apply. The eval pins which one is correct.
  • Multi-step reasoning. “Find X, then find Y based on X, then combine.” Tests the loop’s depth.
  • Rare paths. HITL prompts, compaction triggers, MCP failures. These are hard to hit organically.

Regression cases

Every bug is an eval case. When you fix a bug, encode the failing input as an eval case and land it in the same change as the fix. The eval pins that the bug stays fixed. This is the cheapest insurance you have: future regressions in that exact pattern get caught at eval time, not in production.

Schema for a case

Use a stable schema so the runner can score cases mechanically:
{
  "id": "eval-2026-04-014",
  "category": "multi-step-reasoning",
  "description": "Agent should search the web, then summarize, then compare.",
  "tags": ["web_search", "synthesis"],
  "input": {
    "message": "Compare the response times of GPT-5.4-mini and Claude Haiku 4.5.",
    "task_id": null
  },
  "expected": {
    "must_call_tools": ["web_search"],
    "output_must_contain": ["GPT", "Claude", "ms"],
    "output_must_not_contain": ["I don't know"],
    "max_turns": 6,
    "max_tokens": 4000,
    "judge_rubric": "Does the response cite specific numbers and explain the comparison?"
  },
  "added_on": "2026-04-15",
  "added_by": "shrivathsanm",
  "source": "production trace task-2026-04-12-...",
  "notes": "First pass missed Claude entirely. Fixed in v2.3.1."
}
The schema can grow over time; keep it backwards-compatible.
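
One way to keep it backwards-compatible in practice is to treat a small core as required and funnel unknown fields into an untyped bucket, so older runners never choke on newer cases. A minimal sketch; the choice of required fields here is an assumption, not a contract:

from dataclasses import dataclass, field

REQUIRED = {"id", "category", "input", "expected"}

@dataclass
class Case:
    id: str
    category: str
    input: dict
    expected: dict
    extra: dict = field(default_factory=dict)  # bucket for fields added later

def load_case(raw: dict) -> Case:
    missing = REQUIRED - raw.keys()
    if missing:
        raise ValueError(f"case {raw.get('id', '?')} is missing fields: {missing}")
    return Case(**{k: raw[k] for k in REQUIRED},
                extra={k: v for k, v in raw.items() if k not in REQUIRED})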

Categories

Group cases by category. Useful for:
  • Spotting category-level regressions (“research suite dropped from 95% to 80% — focus there”).
  • Running subsets (“only run refactoring cases on PRs that touch refactor-related code”).
  • Reporting (“we’re at 95% across all categories”).
Suggested categories for a typical agentic deployment:
Category          What it covers
basic-qa          Single-turn factual or reasoning questions.
tool-selection    Cases where the right tool choice matters.
multi-step        Cases that require multiple turns.
refusals          Cases that should hit the refusal policy.
permissions       Cases that should trigger HITL.
error-recovery    Cases where a tool fails and the agent should adapt.
voice-and-format  Cases that test output style.
memory            Cases that exercise long-term memory.
regressions       Bugs encoded as cases.
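
A runner can take the grouping directly as a filter. A sketch, where run_case is a stand-in for your actual execute-and-score step:

from collections import defaultdict
from typing import Callable

def run_suite(cases: list[dict],
              run_case: Callable[[dict], bool],
              only: set[str] | None = None) -> dict[str, float]:
    """Run a suite (or a category subset) and report pass rate per category."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for case in cases:
        cat = case["category"]
        if only is not None and cat not in only:
            continue
        total[cat] += 1
        passed[cat] += run_case(case)  # True counts as 1
    return {cat: passed[cat] / total[cat] for cat in total}

Calling run_suite(cases, run_case, only={"refusals", "permissions"}) on a policy-touching PR runs just the sensitive subsets.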

Curation hygiene

A dataset goes stale. Audit quarterly:
  • Are these cases still representative? If your product has evolved, some cases may no longer reflect real usage. Retire them.
  • Are the expected outcomes still right? If you’ve changed your refusal policy, “the agent must refuse X” cases may now be backwards.
  • Are there gaps? New features need new cases.
  • Are there duplicates? Two cases that exercise the same path are one case too many.
Each case should justify its existence. If you can’t say what bug or behavior a case is pinning, retire it.
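
The duplicate and justification checks can be partly mechanized; retiring a case stays a human call. A sketch against the schema above, where treating identical category-plus-expectations as a duplicate is a heuristic, not a guarantee:

import json

def audit(cases: list[dict]) -> dict[str, list[str]]:
    findings: dict[str, list[str]] = {"possible_duplicates": [], "unjustified": []}
    seen: dict[str, str] = {}
    for case in cases:
        # Two cases exercising the same path tend to share a category
        # and identical expectations.
        key = case["category"] + json.dumps(case["expected"], sort_keys=True)
        if key in seen:
            findings["possible_duplicates"].append(f"{seen[key]} ~ {case['id']}")
        else:
            seen[key] = case["id"]
        # A case with no source and no notes can't say what it pins.
        if not case.get("source") and not case.get("notes"):
            findings["unjustified"].append(case["id"])
    return findings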

Anti-patterns

Eval-driven prompt overfit

If you write cases first and tune the agent until it passes them, you end up with an agent that passes the eval and fails in the wild. The eval should capture intent, not literal output. Pin the behavior, not the words.

Output-equality scoring on free-form text

Comparing the agent’s text output to an exact expected string is brittle and almost never useful. Use must-contain/must-not-contain, LLM-as-judge, or schema validation instead.
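
A minimal sketch of the containment check, keyed to the expected block in the schema above (case-insensitive matching is an assumption; tighten it if casing matters):

def score_text(output: str, expected: dict) -> bool:
    """Substring checks instead of exact output equality."""
    text = output.lower()
    must = expected.get("output_must_contain", [])
    must_not = expected.get("output_must_not_contain", [])
    return (all(s.lower() in text for s in must)
            and not any(s.lower() in text for s in must_not))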

One-judge tyranny

A single LLM-as-judge has a consistent bias. If your eval depends entirely on one judge’s score, you’re optimizing toward that judge’s quirks. Use two judges and look for disagreement; periodically have humans audit a sample.
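
A sketch of the two-judge pattern, where judge_a and judge_b stand in for two different LLM-as-judge calls (different models, or at least different prompts); requiring both to pass is one policy choice among several:

from typing import Callable

Judge = Callable[[str, str], bool]  # (output, rubric) -> pass/fail

def dual_judge(output: str, rubric: str, judge_a: Judge, judge_b: Judge) -> dict:
    a = judge_a(output, rubric)
    b = judge_b(output, rubric)
    return {
        "passed": a and b,       # strict: both judges must agree it passes
        "disagreement": a != b,  # queue these for human audit
    }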

Case explosion

If your dataset has 1000+ cases, you’ve stopped curating and started collecting. Trim ruthlessly. A focused 100-case suite beats a sprawling 1000-case one for both signal and run cost.

Versioning

Treat the dataset like a public API:
  • Pin in tests. Every eval run executes against a specific dataset version.
  • Diff cases in PRs. When you add or change a case, the diff is reviewable.
  • Don’t mutate accepted cases. If the case is wrong, mark it deprecated and add a corrected one.
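
A sketch of what pinning plus deprecation can look like in the loader; the top-level version and per-case deprecated fields are assumptions layered onto the schema above:

import json

PINNED_VERSION = "2026-04-15"  # bump deliberately, in a reviewed PR

def load_dataset(path: str) -> list[dict]:
    with open(path) as f:
        data = json.load(f)
    if data["version"] != PINNED_VERSION:
        raise RuntimeError(
            f"dataset is {data['version']}, eval is pinned to {PINNED_VERSION}")
    # Deprecated cases stay in the file for history but never run.
    return [c for c in data["cases"] if not c.get("deprecated", False)]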
