When to roll back
Roll back when:
- /health returns non-ok and you can’t fix it within minutes.
- Error rate or latency is sharply worse than baseline and not improving.
- A regression eval that passed pre-upgrade is now failing.
- Customer reports confirm the new version is broken.
Don’t roll back for:
- One outlier metric that hasn’t actually impacted users.
- Behavior changes that were expected per the changelog.
- Issues that are clearly local to one bad request rather than the service as a whole.
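The criteria above can be written down as a small checklist function, which is handy for keeping the decision consistent under pressure. This is a minimal sketch, not part of the Engine: the `Signals` fields, the 5-minute limit, and the 2x "sharply worse" factor are all hypothetical values you would tune yourself.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Post-upgrade observations. Every field name here is hypothetical."""
    health_ok: bool                   # does /health currently report ok?
    minutes_unhealthy: float          # how long /health has been non-ok
    error_rate: float                 # current service-wide error rate (0..1)
    baseline_error_rate: float        # error rate before the upgrade
    error_rate_improving: bool        # is the trend heading back to baseline?
    regression_eval_failing: bool     # an eval that passed pre-upgrade now fails
    customer_reports_confirmed: bool  # customers confirm the version is broken


def rollback_reasons(s, max_unhealthy_minutes=5.0, worse_factor=2.0):
    """Return the list of rollback criteria that currently hold (empty = stay)."""
    reasons = []
    if not s.health_ok and s.minutes_unhealthy >= max_unhealthy_minutes:
        reasons.append("/health non-ok and not fixed within minutes")
    # "Sharply worse" is modeled as a multiple of baseline (an assumption).
    sharply_worse = s.error_rate > worse_factor * s.baseline_error_rate
    if sharply_worse and not s.error_rate_improving:
        reasons.append("error rate sharply worse than baseline and not improving")
    if s.regression_eval_failing:
        reasons.append("regression eval that passed pre-upgrade now fails")
    if s.customer_reports_confirmed:
        reasons.append("customer reports confirm breakage")
    return reasons
```

A single outlier metric, an expected changelog behavior, or one bad request adds nothing to the list, so an empty result means no rollback on these criteria.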
What rollback looks like
Rollback has two layers:
- Image rollback: switch the Engine container back to the previous image tag.
- Data rollback: restore the brain database to a pre-upgrade snapshot, if migrations made the new version’s schema incompatible with the old.
Image rollback
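As a rough sketch only: assuming the Engine runs as a single Docker container, an image rollback means recreating that container from the previous tag. The container name `engine`, the image name `example/engine`, and the bare `docker run` arguments are placeholders, not the project’s documented procedure. A real deployment would reuse the original ports, volumes, and environment, or do the equivalent in its orchestrator.

```python
import subprocess


def image_rollback_commands(container, image, previous_tag):
    """Return the docker commands that recreate `container` from `previous_tag`."""
    target = f"{image}:{previous_tag}"
    return [
        ["docker", "pull", target],     # make sure the old image is available locally
        ["docker", "stop", container],  # stop the container running the new version
        ["docker", "rm", container],    # remove it so the name can be reused
        # Placeholder run command: add the ports, volumes, and env your Engine needs.
        ["docker", "run", "-d", "--name", container, target],
    ]


def run_image_rollback(container, image, previous_tag, dry_run=True):
    """Print each command; also execute it unless dry_run is True."""
    for cmd in image_rollback_commands(container, image, previous_tag):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises on the first failing command
```

Calling `run_image_rollback("engine", "example/engine", "v2.3.1")` only prints the commands; pass `dry_run=False` once they match your setup.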
If the new version’s migrations didn’t change the brain schema in a way the old version doesn’t understand, switching the Engine container back to the previous image tag is the whole rollback.
Brain rollback
When a forward-only migration shipped in the new version (as most migrations are), the old Engine version can’t read the upgraded brain. You have to restore the brain to the pre-upgrade state.
Restoring from snapshot
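The order of operations matters more than the specific commands. The sketch below assumes, as is common practice, that the Engine is stopped before the restore so nothing writes to the brain mid-restore, and that the old image is started only afterwards. The four callables are hypothetical hooks you would wire to your own tooling; the actual restore command depends on how your brain database is backed up.

```python
def brain_rollback(stop_engine, restore_snapshot, start_engine, health_ok,
                   snapshot_id, previous_tag):
    """Run a brain rollback in a safe order using caller-supplied hooks.

    stop_engine()          -- stop the new-version Engine (ideally with a drain)
    restore_snapshot(id)   -- restore the brain to the pre-upgrade snapshot
    start_engine(tag)      -- start the Engine on the previous image tag
    health_ok() -> bool    -- report whether /health is ok afterwards
    """
    stop_engine()                   # no writers while the brain is replaced
    restore_snapshot(snapshot_id)   # brain is now in its pre-upgrade state
    start_engine(previous_tag)      # old code against the old schema
    if not health_ok():
        raise RuntimeError(
            f"Engine unhealthy after restoring {snapshot_id} on {previous_tag}"
        )
```

Passing the steps in as functions keeps the sketch independent of any particular database or container runtime and makes the sequence easy to rehearse with stubs.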
What gets lost
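Restoring a pre-upgrade snapshot discards everything written to the brain during the bad window, the time between the upgrade and the rollback. Specifically:
- Any episodes, knowledge, or social-graph entries written during the bad window.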
- Any conversation history from the bad window.
- Any feedback submitted during the bad window.
What’s preserved
- The pre-upgrade brain state, exactly.
- All long-term memory from before the upgrade.
- All migrations applied before the upgrade.
Volume snapshot rollback
If your infrastructure provides volume snapshots (EBS, GCP persistent disks, etc.), you can roll the brain back by restoring its volume from the pre-upgrade snapshot.
Drain before stopping
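A sketch of stopping the Engine with a drain timeout, assuming it runs under Docker and begins its graceful shutdown on SIGTERM (an assumption; this page does not say what triggers the drain). `docker stop -t N` sends SIGTERM, waits up to N seconds, then sends SIGKILL. The container name and the 60-second default are placeholders.

```python
import subprocess


def stop_with_drain(container, drain_seconds=60, dry_run=True):
    """Build (and optionally run) a docker stop that allows time to drain."""
    cmd = ["docker", "stop", "-t", str(drain_seconds), container]
    if not dry_run:
        # Give docker itself a little longer than the drain before giving up.
        subprocess.run(cmd, check=True, timeout=drain_seconds + 30)
    return cmd
```

Pick a drain timeout longer than your slowest expected turn; anything still running when it expires is killed.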
The Engine has graceful-shutdown logic. Stopping with a drain timeout lets in-flight turns complete.
Communicating
While rolling back:
- Post in the incident channel: “Rolling back to v2.3.1 due to $symptom.”
- Update the status page if customer impact was real.
- After rollback completes: confirm /health is ok and errors have recovered, then post the all-clear.
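Before posting the all-clear, it helps to require several consecutive healthy checks instead of trusting a single one. The sketch below assumes that /health answering with HTTP 200 means ok (this page only says "ok", so adjust the check to the real response), and the URL in the usage note is a placeholder.

```python
import http.client
import time
import urllib.request


def health_is_ok(url, timeout=5.0):
    """Return True when GET `url` answers with HTTP 200 (assumed to mean ok)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx responses (HTTPError).
        return False


def wait_until_stable(check, required_successes=3, interval=10.0,
                      max_attempts=30, sleep=time.sleep):
    """Poll `check` until it passes `required_successes` times in a row."""
    streak = 0
    for attempt in range(max_attempts):
        streak = streak + 1 if check() else 0  # any failure resets the streak
        if streak >= required_successes:
            return True
        if attempt < max_attempts - 1:
            sleep(interval)
    return False
```

For example, `wait_until_stable(lambda: health_is_ok("http://localhost:8080/health"))` returns True only after three healthy responses in a row.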
Post-rollback
Once the rollback is in and stable:
- Capture the diagnostics: logs, traces, and metrics from the bad window. You’ll need them to fix forward.
- Don’t redeploy in a panic. Whatever broke needs a root cause first.
- File a postmortem ticket. See Postmortems.
- Add a regression eval. The exact failure mode should be a permanent test case. See Regression.
When rollback isn’t possible
Some failure modes can’t be cleanly rolled back:
- The brain is corrupted and snapshots are unusable. You’re now in a data-recovery scenario; involve whoever owns the brain.
- Migrations dropped data that the old version expects. Same scenario.
- The new version sent destructive MCP calls (deleted external resources, sent emails). The Engine’s state is recoverable; the external state isn’t.
Practice rollback
Run a rollback drill in staging quarterly. The exercise:
- Bring staging up on the current production version.
- Take a snapshot.
- Upgrade staging to a deliberately broken version (e.g. an internal build with a known bug).
- Roll back using the procedure above.
- Time how long it took. Identify what slowed you down.
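To make the timing step concrete, a small helper can record how long each drill step took and point at the slowest one. This is an illustrative sketch; `DrillTimer` and the step names are hypothetical, not part of any Engine tooling.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Record wall-clock duration per drill step."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the timer can be tested
        self.durations = {}   # step name -> seconds

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised.
            self.durations[name] = self._clock() - start

    def slowest(self):
        """Name of the step that took longest."""
        return max(self.durations, key=self.durations.get)

    def total(self):
        """Total seconds across all recorded steps."""
        return sum(self.durations.values())
```

Wrap each step of the drill in `with timer.step("restore snapshot"): ...`, then compare `timer.total()` and `timer.slowest()` across quarters to see whether the procedure is getting faster.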
See also
- Upgrade — how to do the upgrade so rollback isn’t needed.
- Postmortems — what comes after.
- Database — backup and restore details.

