The levels
Sev1 — Customer-impacting outage
Used for:
- The Engine is down for one or more customers.
- Data integrity is at risk (corruption, loss).
- Active security incident (suspected breach, credential leak, unauthorized access).
- Compliance breach (data exposure beyond authorized scope).
Response:
- Page primary, secondary, and the service owner immediately.
- Open the incident channel.
- Status page updated within 15 minutes.
- Customer comms drafted within 30 minutes for any customer with impact > 5 minutes.
- Mitigate first; root-cause later.
Examples:
- `/health` returns non-ok across the fleet.
- A migration corrupted user brains.
- An API key leaked and someone is using it.
Sev2 — Degraded service
Used for:
- Significant feature broken but not a full outage.
- Elevated error rate (> 5%) sustained for > 5 minutes.
- P99 latency > 2× baseline sustained.
- One customer’s Engine is hard-down (others fine).
- Major MCP connector down (Slack, Stripe, etc., affecting tier-2 agent capability).
Response:
- Page primary.
- Open the incident channel.
- Status page updated if customer-visible.
- Mitigate within the hour if possible.
Examples:
- The Learning Centre is failing every batch.
- One specific tool is broken (`web_search` returns 500s).
- A subset of users is seeing slow responses.
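The two numeric Sev2 triggers (error rate above 5% sustained for more than 5 minutes, and P99 latency above 2× baseline sustained) could be checked along the following lines. This is a minimal sketch rather than the Engine's actual alerting logic: the `Sample` type, the function names, and the idea of passing in a list of metric samples are all hypothetical, and since the text gives no duration for "sustained" latency, the same 5-minute window is reused as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

ERROR_RATE_THRESHOLD = 0.05  # "> 5%" from the Sev2 criteria
LATENCY_MULTIPLIER = 2.0     # "> 2x baseline" from the Sev2 criteria
SUSTAINED_SECONDS = 5 * 60   # "> 5 minutes"; reused for latency as an assumption


@dataclass
class Sample:
    """One hypothetical metrics sample."""
    timestamp: float   # seconds since epoch
    error_rate: float  # fraction of failed requests, 0.0 to 1.0
    p99_ms: float      # P99 latency in milliseconds


def breach_seconds(samples: Sequence[Sample],
                   is_breach: Callable[[Sample], bool]) -> float:
    """Length of the unbroken breaching run that ends at the newest sample."""
    run_start = None
    for sample in samples:  # samples are assumed to be ordered oldest first
        if is_breach(sample):
            if run_start is None:
                run_start = sample.timestamp
        else:
            run_start = None  # any healthy sample resets the run
    if run_start is None:
        return 0.0
    return samples[-1].timestamp - run_start


def meets_sev2_thresholds(samples: Sequence[Sample],
                          baseline_p99_ms: float) -> bool:
    """True if either numeric Sev2 trigger has been sustained long enough."""
    errors = breach_seconds(
        samples, lambda s: s.error_rate > ERROR_RATE_THRESHOLD)
    latency = breach_seconds(
        samples, lambda s: s.p99_ms > LATENCY_MULTIPLIER * baseline_p99_ms)
    return errors > SUSTAINED_SECONDS or latency > SUSTAINED_SECONDS
```

A single healthy sample resets the run, so a brief spike that recovers does not count as sustained.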
Sev3 — Caught early, no customer impact
Used for:
- Internal alert without user symptom yet.
- Single customer issue that’s being worked around.
- Capacity warnings (disk > 80%, memory pressure).
- Non-critical regressions caught by evals.
Response:
- File a ticket.
- Address next business day.
- No paging; no incident channel.
Examples:
- Brain database approaching 8 GB.
- Cache hit ratio dropped 20% after a deploy (not impacting users yet).
- Engine logs show occasional warnings about a deprecation.
Borderline cases
When you’re unsure between two levels, escalate. Better a false-Sev2 than a missed-Sev1. A few rules of thumb:
- Customer impact known to be > 5% → Sev1.
- Customer impact known to be 1-5% → Sev2.
- Internal-only with no current user symptom → Sev3.
- Could become Sev1 within an hour if not addressed → Sev2.
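The rules of thumb above can be written down as a small helper. This is an illustrative sketch, not an official triage tool: the function name and parameters are hypothetical. The rules do not say what to do when impact is above zero but below 1%, so the sketch applies "when unsure, escalate" and suggests Sev2 in that case, which is an assumption.

```python
def suggest_severity(impact_pct: float,
                     could_become_sev1_within_hour: bool = False) -> str:
    """Suggest a severity level from the borderline-case rules of thumb.

    impact_pct is the known share of customers affected, in percent;
    0 means internal-only with no current user symptom.
    """
    if impact_pct < 0 or impact_pct > 100:
        raise ValueError("impact_pct must be between 0 and 100")
    if impact_pct > 5:
        return "Sev1"
    if impact_pct >= 1:
        return "Sev2"
    if could_become_sev1_within_hour:
        return "Sev2"
    if impact_pct == 0:
        return "Sev3"
    # Assumption: impact between 0% and 1% is not covered by the rules,
    # so escalate rather than under-call it.
    return "Sev2"
```

For example, `suggest_severity(8)` gives `"Sev1"`, and `suggest_severity(0, could_become_sev1_within_hour=True)` gives `"Sev2"`. The result is a starting point; the on-call engineer still makes the call.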
Severity reduction
A live Sev1 can be downgraded to Sev2 if all of the following hold:
- Mitigation is complete.
- Customer impact has stopped.
- Only follow-up work remains (postmortem, monitoring of the fix).
Escalation matrix
| Time on Sev1 | Who’s involved |
|---|---|
| 0–15 min | Primary on-call |
| 15–30 min | + Secondary |
| 30–60 min | + Service owner / engineering lead |
| 1+ hour | + Leadership |
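The matrix is cumulative: each row adds people to those already involved. A minimal sketch of that lookup follows; the names are hypothetical. The table's ranges share their boundaries (15 minutes appears in two rows), so the sketch assumes the next tier joins exactly at the boundary.

```python
from typing import List

# (minutes on Sev1 at which this role joins, role), taken from the matrix
ESCALATION_STEPS = [
    (0, "Primary on-call"),
    (15, "Secondary"),
    (30, "Service owner / engineering lead"),
    (60, "Leadership"),
]


def involved_at(minutes_on_sev1: float) -> List[str]:
    """Return everyone who should be involved after the given elapsed time."""
    if minutes_on_sev1 < 0:
        raise ValueError("elapsed time cannot be negative")
    # Assumption: a role joins as soon as its boundary is reached.
    return [role for start, role in ESCALATION_STEPS
            if minutes_on_sev1 >= start]
```

For instance, `involved_at(20)` returns the primary on-call and the secondary, and anything at or past 60 minutes returns all four roles.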
Severity in postmortems
Every postmortem records the severity used. If it’s wrong in hindsight, note it:

> Severity: Sev2. In retrospect should have been Sev1: actual customer impact was 8%, not the 3% initially estimated.

Calibration improves over time. Pay attention.
Customer comms by severity
| Severity | Status page | Direct customer comm |
|---|---|---|
| Sev1 | Yes, within 15 min | Within 30 min for impacted customers |
| Sev2 | If customer-visible | Optional |
| Sev3 | No | No |
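The table translates into concrete deadlines once an incident has a start time. The sketch below shows one way to compute them; `CommsPlan` and `comms_plan` are hypothetical names, and the table gives no deadline for a Sev2 status page update, so that field is left empty rather than guessed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class CommsPlan:
    status_page_required: bool
    status_page_due: Optional[datetime]   # None when no deadline is stated
    direct_comm_required: bool
    direct_comm_due: Optional[datetime]   # None when no deadline is stated


def comms_plan(severity: str, opened_at: datetime,
               customer_visible: bool = False) -> CommsPlan:
    """Map a severity and incident start time to the comms table's rows."""
    if severity == "Sev1":
        return CommsPlan(
            status_page_required=True,
            status_page_due=opened_at + timedelta(minutes=15),
            direct_comm_required=True,  # applies to impacted customers
            direct_comm_due=opened_at + timedelta(minutes=30),
        )
    if severity == "Sev2":
        # Status page only if customer-visible; direct comms are optional.
        return CommsPlan(customer_visible, None, False, None)
    if severity == "Sev3":
        return CommsPlan(False, None, False, None)
    raise ValueError(f"unknown severity: {severity!r}")
```

A Sev1 opened at 12:00 would have a status page update due at 12:15 and direct customer comms due at 12:30.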
See also
- On-call rotation — who’s responsible.
- Handoff — what gets passed at rotation.
- Postmortems — after-action reports.

