Agents regress. A model upgrade shifts behavior. A prompt change unblocks one path and breaks another. A new tool steals attention from the right one. Regression testing is how you find these before users do.
What’s a regression
A regression is any case that:

- Passed in a known-good baseline.
- Fails in the current run.
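As a sketch, with results keyed by case id and a boolean pass flag (the shape of the result map is an assumption, not the harness's real schema):

```python
# Sketch only: result maps keyed by case id, True = pass. A case missing
# from the current run counts as failed.
def regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case ids that were green in the baseline and are red now."""
    return [
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    ]

# "b" regressed; "c" was already failing, so it is not flagged.
assert regressions(
    {"a": True, "b": True, "c": False},
    {"a": True, "b": False, "c": False},
) == ["b"]
```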
When to run
Run a regression suite on:

- Every PR that touches a prompt or tool definition. Prompt changes are the most common cause.
- Every PR that touches the agent loop or context machinery. Engine changes can shift behavior subtly.
- Every Engine version bump. Even patch versions can shift behavior; major versions definitely do.
- Every model change. New model identifier, new provider, new default — all need a baseline check.
- On a schedule. Weekly, even when nothing has changed. Catches upstream provider drift.
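As a rough illustration of wiring those triggers into CI, here is a hedged sketch of a gate function; the path globs and flag names are placeholders for your own repo layout, not anything the harness prescribes:

```python
# Hypothetical CI gate: run the suite when any trigger fires. The path
# globs and flags are assumptions about your repo, not a prescribed layout.
from fnmatch import fnmatch

TRIGGER_GLOBS = [
    "prompts/*",   # system prompt changes
    "tools/*",     # tool definitions in the catalog
    "agent/*",     # agent loop and context machinery
]

def should_run_suite(changed_files: list[str], engine_bumped: bool,
                     model_changed: bool, scheduled: bool) -> bool:
    if engine_bumped or model_changed or scheduled:
        return True
    return any(fnmatch(path, glob)
               for path in changed_files
               for glob in TRIGGER_GLOBS)
```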
How to run
The runner is the same one you use for evals (see Eval harness). Regression testing is a comparison of two runs: a baseline run and a candidate run.
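A minimal sketch of that comparison, assuming the runner can emit results as JSONL with `case_id` and `passed` fields (the file format and paths are assumptions):

```python
# Assumed format: one JSON object per line with "case_id" and "passed".
import json

def load_results(path: str) -> dict[str, bool]:
    with open(path) as f:
        return {r["case_id"]: r["passed"] for r in map(json.loads, f)}

baseline = load_results("runs/baseline.jsonl")  # placeholder paths
current = load_results("runs/current.jsonl")

for case_id in sorted(baseline):
    was, now = baseline[case_id], current.get(case_id, False)
    if was and not now:
        print("REGRESSION", case_id)
    elif not was and now:
        print("FIXED     ", case_id)
```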
Choosing a baseline

Three options, in increasing rigor:

Last green main
The most recent commit on main where the eval was clean. Easy to compute; doesn’t require infrastructure.

Last release
The previous tagged release. Stable, but stale — doesn’t catch mid-release regressions until release time.

Pinned baseline
A specific commit you’ve designated as the bar. Move it forward deliberately. Useful when you want the suite to tolerate known regressions and flag only new ones.
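One way to make the three strategies concrete is a resolver that maps each to a commit-ish. The `.evals/` pointer files and the release-tag convention below are assumptions, not an established layout:

```python
# The .evals/ pointer files and the release-tag convention are assumptions.
import subprocess

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout.strip()

def resolve_baseline(strategy: str) -> str:
    """Map a baseline strategy to a commit-ish to compare against."""
    if strategy == "last-green-main":
        # assumption: CI writes the sha of the last clean main run here
        with open(".evals/last-green-main") as f:
            return f.read().strip()
    if strategy == "last-release":
        # nearest release tag reachable from HEAD
        return git("describe", "--tags", "--abbrev=0")
    if strategy == "pinned":
        # assumption: a pointer you move forward deliberately, checked in
        with open(".evals/pinned-baseline") as f:
            return f.read().strip()
    raise ValueError(f"unknown baseline strategy: {strategy}")
```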
What counts as “the agent” for regression purposes

Regression-test against everything that can change behavior:

- The agent’s system prompt.
- Tool definitions in the catalog.
- The model and its parameters.
- The Engine version.
- Provider behavior (out of your control, but worth pinning where possible so you notice drift).
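A sketch of turning that list into something checkable: fingerprint every behavior-affecting input, and re-baseline deliberately whenever the fingerprint moves. Field names and where these values live are assumptions:

```python
# Provider behavior itself can't be hashed, which is exactly why the
# scheduled weekly run exists. Field names here are assumptions.
import hashlib
import json

def agent_fingerprint(system_prompt: str, tool_defs: list[dict],
                      model: str, params: dict, engine_version: str) -> str:
    """Stable hash of the inputs that determine agent behavior."""
    blob = json.dumps(
        {
            "system_prompt": system_prompt,
            "tools": tool_defs,
            "model": model,            # identifier and provider
            "params": params,          # temperature, max tokens, ...
            "engine": engine_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(blob.encode()).hexdigest()
```

Store the fingerprint alongside each baseline; if it changes without a deliberate re-baseline, the diff is explained before you look at a single case.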
Triaging regressions
When the suite reports a regression:

- Reproduce locally. Run the single case against your current build to confirm.
- Bisect. If the regression is clearly tied to a recent change (one PR moved 10 cases from green to red), focus there. If not, bisect across the suspect window; a `git bisect` sketch follows this list.
- Categorize.
  - Real regression — the agent’s behavior on this case is genuinely worse. Fix the agent.
  - Stale case — the case’s expected outcome is no longer correct. Update the case.
  - Flake — the case passes some runs and fails others. Either fix the source of nondeterminism, or mark the case flaky and stop blocking on it (do this sparingly).
- Fix or update. Either the agent or the case has to change.
- Add a new case if the regression revealed a gap.
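For the bisect step, git can drive the search once a single case can report pass/fail via its exit code. The `run_case.py` entry point below is hypothetical; substitute the harness's real single-case command:

```python
# run_case.py is a stand-in for whatever runs one case and exits 0 on pass;
# substitute the harness's real single-case command.
import subprocess

def bisect_regression(good: str, bad: str, case_id: str) -> None:
    subprocess.run(["git", "bisect", "start", bad, good], check=True)
    subprocess.run(["git", "bisect", "run",
                    "python", "run_case.py", case_id], check=True)
    subprocess.run(["git", "bisect", "reset"], check=True)
```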
Statistical significance
Agent evals have noise. The same case can pass on one run and fail on another, especially at high sampling temperatures. Two patterns help manage this:

Run cases multiple times
For high-stakes cases, run each one 3 times and require all three to pass. The noise floor drops at the cost of 3× run time.
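A minimal sketch of the unanimity rule; `run_case` stands in for a hypothetical single-case entry point that returns True on pass:

```python
def passes_strict(run_case, case_id: str, n: int = 3) -> bool:
    """Require unanimity across n runs; one flaky failure fails the case."""
    return all(run_case(case_id) for _ in range(n))
```

Note the trade: a case that passes 90% of independent runs clears three-in-a-row only about 73% of the time (0.9³ ≈ 0.729), so reserve this for cases that ought to be deterministic.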
Track flakiness

Mark cases as flaky after they oscillate. Don’t block on flakes; treat them as “needs review” instead of “needs fix.” But fix the flakiness eventually: a growing list of flakes is a signal that something is off.
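Oscillation can be detected mechanically from run history; the window size here is an arbitrary assumption:

```python
def is_flaky(history: list[bool], window: int = 10) -> bool:
    """Flaky = both a pass and a failure somewhere in the recent window."""
    recent = history[-window:]
    return True in recent and False in recent
```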
Cross-version regression

When you bump the Engine across a major version:

- Run the full suite against the old version. Pin the results.
- Bump the Engine.
- Run the full suite again.
- Diff. Expect three buckets:
  - Some cases will regress because of legitimate behavior changes. Update them.
  - Some cases will fail because the new Engine has new defaults. Configure or update.
  - Some cases will fail because of bugs. File them.
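Putting the pin-bump-rerun-diff loop together, again assuming JSONL results keyed by case id (file names are placeholders):

```python
# Placeholder file names; format as in the comparison sketch above.
import json

def load_results(path: str) -> dict[str, bool]:
    with open(path) as f:
        return {r["case_id"]: r["passed"] for r in map(json.loads, f)}

old = load_results("runs/engine-old.jsonl")  # pinned before the bump
new = load_results("runs/engine-new.jsonl")  # after the bump

# Each entry needs one of the three verdicts above: legitimate behavior
# change, new Engine default, or Engine bug.
broke = sorted(c for c, passed in old.items() if passed and not new.get(c, False))
print(f"{len(broke)} cases to triage")
for case_id in broke:
    print(" ", case_id)
```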
See also
- Eval harness — the runner.
- Golden datasets — what to run against.
- Migration guides — version-specific behavior changes.

