Rollback

Plan rollback before you need it. The middle of an incident is not the time to figure out which command reverts which database, or whether your snapshots are recent enough.

When to roll back

Roll back when:

/health returns non-ok and you can’t fix it within minutes.
Error rate or latency is sharply worse than baseline and not improving.
A regression eval that passed pre-upgrade is now failing.
Customer reports confirm the new version is broken.

Don’t roll back for:

One outlier metric that hasn’t actually impacted users.
Behavior changes that were expected per the changelog.
Issues that are clearly local to one bad request rather than the service as a whole.

If you’re unsure, roll back. The cost of an unnecessary rollback is low; the cost of leaving a bad version live is high.
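
The first criterion can be turned into a mechanical check. The sketch below is one way to do that, assuming /health answers with a non-2xx status when unhealthy (so curl -f fails) and that ENGINE_URL is set as in the verify steps on this page. The function names check_health and should_roll_back are illustrative and not part of the Engine.

```shell
#!/bin/sh
# Sketch only: check_health and should_roll_back are hypothetical helper names.

# One health probe. Assumes an unhealthy Engine returns a non-2xx status,
# which makes curl -f exit non-zero.
check_health() {
  curl -fsS --max-time 5 "$ENGINE_URL/health" >/dev/null 2>&1
}

# should_roll_back ATTEMPTS INTERVAL_SECONDS
# Returns 0 (roll back) only if every probe fails.
# Returns 1 as soon as one probe succeeds.
should_roll_back() {
  attempts=$1
  interval=$2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if check_health; then
      return 1
    fi
    i=$((i + 1))
    # Sleep between probes, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
  done
  return 0
}
```

For example, `should_roll_back 6 30` probes six times over roughly two and a half minutes and succeeds only if /health never recovered in that window. It covers only the health criterion; error rate, latency, and eval regressions still need their own signals.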

What rollback looks like

Rollback has two layers:

Image rollback — switch the Engine container back to the previous image tag.
Data rollback — restore the brain database to a pre-upgrade snapshot, if migrations made the new version’s schema incompatible with the old.

Most rollbacks need only the image rollback. Some also need the data rollback.

Image rollback

If the new version’s migrations didn’t change the brain schema in a way the old version doesn’t understand:

# 1. update docker-compose.yml back to the previous image tag
# from: image: engine:2.4.0
# to:   image: engine:2.3.1

# 2. pull the previous image (it should still be in the registry)
docker compose pull engine

# 3. drain and restart
docker compose stop --timeout 30 engine
docker compose up -d engine

# 4. verify
curl -fsS "$ENGINE_URL/health" | jq

If the previous version’s image was deleted from the registry, you may need to rebuild from the previous git tag. Plan for this — keep at least the last 5 versions of each image.
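
Whether the previous images are still pullable can be checked ahead of time rather than during the incident. The sketch below assumes the docker CLI is logged in to your registry and uses docker manifest inspect, which exits non-zero when a tag cannot be found. The helper names are hypothetical, and any tags beyond engine:2.3.1 in the usage note are placeholders.

```shell
#!/bin/sh
# Sketch only: image_available and missing_images are hypothetical helper names.

# Succeeds if the registry still serves a manifest for the given image reference.
image_available() {
  docker manifest inspect "$1" >/dev/null 2>&1
}

# missing_images IMAGE_REF...
# Prints, one per line, every reference that is no longer in the registry.
missing_images() {
  for ref in "$@"; do
    if ! image_available "$ref"; then
      printf '%s\n' "$ref"
    fi
  done
}
```

Running `missing_images engine:2.3.1 engine:2.3.0` on a schedule and alerting on any output is one way to notice a registry cleanup policy that is more aggressive than your rollback plan.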

Brain rollback

When the new version shipped a forward-only migration (most migrations are forward-only), the old Engine version can’t read the upgraded brain. You have to restore the brain to its pre-upgrade state.

Restoring from snapshot

# 1. stop the engine
docker compose stop engine

# 2. restore from your pre-upgrade snapshot
# (assuming you made one before the upgrade — see Upgrade)
docker compose run --rm --entrypoint /bin/sh engine -c \
  'sqlite3 /data/brain.sqlite ".restore /data/brain-pre-upgrade.db"'

# 3. start the old version
# (docker-compose.yml must already point at the previous image tag; see Image rollback)
docker compose up -d engine

# 4. verify
curl -fsS "$ENGINE_URL/health" | jq

You will lose any data written between the snapshot and the rollback — typically minutes to an hour, depending on how long the bad version was live.
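
Before running step 2 it can be worth confirming that the snapshot is actually restorable. The sketch below assumes the sqlite3 CLI is available wherever the snapshot file lives (for example inside the engine container, as in step 2) and relies on SQLite's PRAGMA integrity_check, which prints ok for a healthy database. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: run_integrity_check and snapshot_usable are hypothetical helper names.

# Prints the result of SQLite's built-in integrity check ("ok" when healthy).
run_integrity_check() {
  sqlite3 "$1" 'PRAGMA integrity_check;'
}

# snapshot_usable FILE
# Succeeds only if FILE exists, is non-empty, and passes the integrity check.
snapshot_usable() {
  snap=$1
  if [ ! -s "$snap" ]; then
    echo "snapshot missing or empty: $snap" >&2
    return 1
  fi
  if [ "$(run_integrity_check "$snap" 2>/dev/null)" != "ok" ]; then
    echo "snapshot failed integrity check: $snap" >&2
    return 1
  fi
  return 0
}
```

A call such as `snapshot_usable /data/brain-pre-upgrade.db` before the .restore keeps a damaged snapshot from overwriting a brain that might still be salvageable.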

What gets lost

Specifically:

Any episodes, knowledge, or social-graph entries written during the bad window.
Any conversation history from the bad window.
Any feedback submitted during the bad window.

Depending on your application, you may need to inform affected users or replay important events.

What’s preserved

The pre-upgrade brain state, exactly.
All long-term memory from before the upgrade.
All migrations applied before the upgrade.

Volume snapshot rollback

If your infrastructure provides volume snapshots (EBS, GCP persistent disks, etc.):

# 1. stop the engine
docker compose stop engine

# 2. detach the engine_data volume

# 3. attach a snapshot from before the upgrade

# 4. start the old engine version
# (docker-compose.yml must already point at the previous image tag; see Image rollback)
docker compose up -d engine

Volume snapshots are usually faster to restore than file-level restores for large brains. Test the restore path before you need it.

Drain before stopping

The Engine has graceful-shutdown logic. Stopping with a drain timeout lets in-flight turns complete:

docker compose stop --timeout 60 engine

60 seconds is usually enough. Turns that run longer are cut off, but the brain state stays consistent because the agent loop persists at boundaries (after every model call and every tool result).

Communicating

While rolling back:

Post in the incident channel: “Rolling back to v2.3.1 due to $symptom.”
Update the status page if customer impact was real.
After the rollback completes, confirm that /health is ok and errors have recovered, then post the all-clear.

Post-rollback

Once the rollback is in and stable:

Capture diagnostics. Save logs, traces, and metrics from the bad window. You’ll need them to fix forward.
Don’t redeploy in panic. Whatever broke needs root cause first.
File a postmortem ticket. See Postmortems.
Add a regression eval. The exact failure mode should be a permanent test case. See Regression.

When rollback isn’t possible

Some failure modes can’t be cleanly rolled back:

The brain is corrupted and snapshots are unusable. You’re now in a data-recovery scenario; involve whoever owns the brain.
Migrations dropped data that the old version expects. Same scenario.
The new version sent destructive MCP calls (deleted external resources, sent emails). The Engine’s state is recoverable; the external state isn’t.

For these, the rollback is partial — restore what you can, accept the loss for what you can’t, and design future migrations to be reversible.

Practice rollback

Run a rollback drill in staging quarterly. The exercise:

Bring staging up on the current production version.
Take a snapshot.
Upgrade staging to a deliberately broken version (e.g. an internal build with a known bug).
Roll back using the procedure above.
Time how long it took. Identify what slowed you down.

A rollback you’ve never practiced is a rollback that won’t go to plan.
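
Timing the drill is easier if each step is wrapped rather than clocked by hand. The sketch below is one minimal way to do it; timed_step is a hypothetical helper, and it measures whole seconds with date +%s.

```shell
#!/bin/sh
# Sketch only: timed_step is a hypothetical helper name.

# timed_step NAME COMMAND [ARGS...]
# Runs the command, prints "NAME: <seconds>s (exit <status>)",
# and returns the command's exit status.
timed_step() {
  name=$1
  shift
  start=$(date +%s)
  if "$@"; then
    status=0
  else
    status=$?
  fi
  end=$(date +%s)
  printf '%s: %ss (exit %s)\n' "$name" "$((end - start))" "$status"
  return "$status"
}
```

During a drill this could look like `timed_step drain docker compose stop --timeout 60 engine` followed by `timed_step restart docker compose up -d engine`, which leaves a per-step record showing where the time went.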

See also

Local development

Deploy

Configuration

Infrastructure

On-call

Incidents

SLOs

Security
