Agents regress. A model upgrade shifts behavior. A prompt change unblocks one path and breaks another. A new tool steals attention from the right one. Regression testing is how you find these before users do.

What’s a regression

A regression is any case that:
  • Passed in a known-good baseline.
  • Fails in the current run.
Regressions are the highest-priority eval signal. Net pass rate can stay flat and still hide churn: if the failures are different cases than last time, you’ve both fixed and broken things.
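
A toy example of that arithmetic, with made-up case IDs, showing how a flat pass rate can coexist with regressions:
# Hypothetical numbers: 50 cases, 2 failing in each run, but not the same 2.
baseline_failures = {"research/eval-A", "browse/eval-B"}
current_failures = {"refactor/eval-C", "multi/eval-D"}

regressions = current_failures - baseline_failures  # 2 newly failing cases
fixes = baseline_failures - current_failures        # 2 newly passing cases
# Pass rate is 96% in both runs, yet two cases regressed and two were fixed.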

When to run

Run a regression suite on:
  • Every PR that touches a prompt or tool definition. Prompt changes are the most common cause of regressions.
  • Every PR that touches the agent loop or context machinery. Engine changes can shift behavior subtly.
  • Every Engine version bump. Even patch versions can shift behavior; major versions definitely do.
  • Every model change. New model identifier, new provider, new default — all need a baseline check.
  • On a schedule. Weekly, even when nothing has changed. Catches upstream provider drift.
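
In CI, the PR-triggered runs often reduce to a path filter, with the scheduled run as a separate cron job. A minimal sketch; the path prefixes and the engine lockfile name are placeholders for your repo’s layout:
REGRESSION_TRIGGERS = ("prompts/", "tools/", "agent/", "engine.lock")

def needs_regression_suite(changed_files: list[str]) -> bool:
    # Run the suite whenever a change touches anything that can shift behavior.
    return any(path.startswith(REGRESSION_TRIGGERS) for path in changed_files)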

How to run

The runner is the same one you use for evals (see Eval harness). A regression check is a diff of two runs:
def diff_runs(baseline: list[Result], current: list[Result]) -> Diff:
    # Index both runs by case ID so each case can be compared pairwise.
    by_id_baseline = {r.case_id: r for r in baseline}
    by_id_current = {r.case_id: r for r in current}

    regressions = []   # passed in baseline, fails now
    fixes = []         # failed in baseline, passes now
    new_failures = []  # added since baseline, currently failing
    new_passes = []    # added since baseline, currently passing
    removed = []       # present in baseline, absent from the current run

    for case_id in set(by_id_baseline) | set(by_id_current):
        b = by_id_baseline.get(case_id)
        c = by_id_current.get(case_id)
        if b is None:
            # New case: nothing to regress against, but report it.
            (new_failures if not c.passed else new_passes).append(c)
            continue
        if c is None:
            removed.append(b)  # case removed since the baseline
            continue
        if b.passed and not c.passed:
            regressions.append((b, c))
        elif not b.passed and c.passed:
            fixes.append((b, c))

    return Diff(regressions=regressions, fixes=fixes, ...)
Output:
Regressions (2):
  research/eval-2026-04-001
    baseline: passed
    current:  failed (missing tool call: web_search)
  multi-step/eval-2026-04-014
    baseline: passed
    current:  failed (output_tokens 12000 > 8000)

Fixes (1):
  refactoring/eval-2026-04-007
    baseline: failed
    current:  passed

New cases (3): ...
Removed cases (0): ...

Pass rate: 94% → 92% (-2 pp)
Cost:      $4.21 → $4.84 (+15%)
P99:       38s → 41s
A regression report leads with the regressions. Pass-rate is a summary metric; the cases that flipped are the news.
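
A report renderer can enforce that ordering directly: flipped cases first, summary metrics last. A sketch, assuming the Diff returned by diff_runs above; the failure_reason field and the layout beyond regressions/fixes are assumptions:
def render_report(diff: Diff) -> str:
    lines = [f"Regressions ({len(diff.regressions)}):"]
    for _, c in diff.regressions:
        lines.append(f"  {c.case_id}")
        lines.append("    baseline: passed")
        lines.append(f"    current:  failed ({c.failure_reason})")  # failure_reason is assumed
    lines.append(f"Fixes ({len(diff.fixes)}):")
    for _, c in diff.fixes:
        lines.append(f"  {c.case_id}")
    # Pass rate, cost, and latency go at the end: context, not the headline.
    return "\n".join(lines)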

Choosing a baseline

Three options, in increasing rigor:

Last green main

The most recent commit on main where the eval was clean. Easy to compute; doesn’t require infrastructure.

Last release

The previous tagged release. Stable, but stale — doesn’t catch mid-release regressions until release time.

Pinned baseline

A specific commit you’ve designated as the bar. Move it forward deliberately. Useful when you want the suite to tolerate already-known regressions and flag only new ones.
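
Whichever option you choose, resolving the baseline can live in one small helper. A sketch; the .eval-baseline file and the use of git tags are assumptions about your setup, and last green main usually needs CI metadata, so it is left out here:
import subprocess

def baseline_ref(strategy: str) -> str:
    if strategy == "pinned":
        # A commit hash checked into the repo and moved forward deliberately.
        with open(".eval-baseline") as f:
            return f.read().strip()
    if strategy == "last-release":
        # Most recent tag reachable from HEAD.
        out = subprocess.run(["git", "describe", "--tags", "--abbrev=0"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    raise ValueError(f"unknown strategy: {strategy}")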

What counts as “the agent” for regression purposes

Regression-test against everything that can change behavior:
  • The agent’s system prompt.
  • Tool definitions in the catalog.
  • The model and its parameters.
  • The Engine version.
  • Provider behavior (out of your control, but worth pinning what you can so drift shows up in the diff).
If your dataset doesn’t exercise a dimension, you can’t regression-test it. Coverage is in the dataset, not in the runner.
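
One way to make these dimensions diffable is to record a fingerprint of each one alongside every run. A sketch; the field names and hashing scheme are assumptions, not a required schema:
import hashlib
from dataclasses import dataclass

def _sha(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:12]

@dataclass(frozen=True)
class RunFingerprint:
    system_prompt_sha: str   # hash of the agent's system prompt
    tool_catalog_sha: str    # hash of the serialized tool definitions
    model: str               # identifier plus the parameters that matter
    engine_version: str
    provider: str            # not pinnable, but recorded so drift is attributable

def fingerprint(prompt: str, tool_defs: str, model: str,
                engine_version: str, provider: str) -> RunFingerprint:
    return RunFingerprint(_sha(prompt), _sha(tool_defs), model, engine_version, provider)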

Triaging regressions

When the suite reports a regression:
  1. Reproduce locally. Run the single case against your current build to confirm.
  2. Bisect. If the regression is clearly tied to a recent change (one PR moved 10 cases from green to red), focus there. If not, bisect across the suspect window.
  3. Categorize.
    • Real regression — the agent’s behavior on this case is genuinely worse. Fix the agent.
    • Stale case — the case’s expected outcome is no longer correct. Update the case.
    • Flake — the case passes some runs and fails others. Either fix the source of nondeterminism or mark the case as flaky and stop blocking on it (use this sparingly).
  4. Fix or update. Either the agent or the case has to change.
  5. Add a new case if the regression revealed a gap.

Statistical significance

Agent evals have noise. The same case can pass on one run and fail on another, especially with high-temperature models. Two patterns to manage this:

Run cases multiple times

For high-stakes cases, run each one 3 times and require all three to pass. The noise floor drops at the cost of 3× run time.
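
As a sketch, with run_case standing in for however your harness executes a single case:
def passes_strict(case_id: str, runs: int = 3) -> bool:
    # run_case is a placeholder for the harness's single-case runner.
    # All attempts must pass; one failure is enough to flag the case.
    return all(run_case(case_id).passed for _ in range(runs))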

Track flakiness

Mark cases as flaky after they oscillate. Don’t block on flakes; treat them as “needs review” instead of “needs fix.” But fix the flakiness eventually. A growing list of flakes is a quality signal that something is off.
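
A sketch of the oscillation check, assuming you keep a per-case pass/fail history; the window size is arbitrary:
def is_flaky(history: list[bool], window: int = 10) -> bool:
    # Flaky: the recent history contains both passes and failures.
    recent = history[-window:]
    return len(recent) >= 2 and any(recent) and not all(recent)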

Cross-version regression

When you bump the Engine across a major version:
  1. Run the full suite against the old version. Pin the results.
  2. Bump the Engine.
  3. Run the full suite again.
  4. Diff.
Expected outcomes:
  • Some cases will regress because of legitimate behavior changes. Update them.
  • Some cases will fail because the new Engine has new defaults. Configure or update.
  • Some cases will fail because of bugs. File them.
Don’t try to land a major-version bump while pretending nothing changed. The diff between versions is the value the eval delivers.
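
The procedure itself is small; a sketch, where run_suite, checkout_engine, and the version strings are all placeholders for your own tooling:
old_results = run_suite()                      # 1. full suite on the old Engine; pin these
checkout_engine("2.0.0")                       # 2. bump the Engine
new_results = run_suite()                      # 3. full suite again
report = diff_runs(old_results, new_results)   # 4. diff, then triage every flipped case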

See also