When something goes wrong, the first decision is severity. Severity drives who gets paged, how fast they respond, and what the next 30 minutes look like. This page defines the levels.
The levels
Sev1 — Customer-impacting outage
Used for:
- The Engine is down for one or more customers.
- Data integrity is at risk (corruption, loss).
- Active security incident (suspected breach, credential leak, unauthorized access).
- Compliance breach (data exposure beyond authorized scope).
Response:
- Page primary, secondary, and the service owner immediately.
- Open the incident channel.
- Status page updated within 15 minutes.
- Customer comms drafted within 30 minutes for any customer with impact > 5 minutes.
- Mitigate first; root-cause later.
Examples:
- `/health` returns non-ok across the fleet.
- A migration corrupted user brains.
- An API key leaked and someone is using it.
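The fleet-wide versus single-customer distinction above is what separates a Sev1 from a Sev2. A minimal sketch of that triage logic, assuming a hypothetical map of instance name to `/health` result (the real check would query each Engine's endpoint):

```python
# Classify /health results across the fleet. The function name and the
# dict-of-booleans input shape are assumptions for illustration; only the
# severity rules come from this page.

def classify_health(results: dict[str, bool]) -> str:
    """results maps instance name -> True if /health returned ok."""
    down = [name for name, ok in results.items() if not ok]
    if not down:
        return "healthy"
    if len(down) == len(results):
        return "Sev1"  # non-ok across the fleet: customer-impacting outage
    return "Sev2"      # one customer's Engine hard-down, others fine
```

For example, `classify_health({"acme": False, "globex": False})` returns `"Sev1"`, while a single failing instance among healthy ones maps to `"Sev2"`.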
Sev2 — Degraded service
Used for:
- Significant feature broken but not a full outage.
- Elevated error rate (> 5%) sustained for > 5 minutes.
- P99 latency > 2× baseline sustained.
- One customer’s Engine is hard-down (others fine).
- Major MCP connector down (Slack, Stripe, etc., affecting tier-2 agent capability).
Response:
- Page primary.
- Open the incident channel.
- Status page updated if customer-visible.
- Mitigate within the hour if possible.
Examples:
- The Learning Centre is failing every batch.
- One specific tool is broken (`web_search` returns 500s).
- A subset of users is seeing slow responses.
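The Sev2 thresholds above (error rate > 5% sustained for > 5 minutes, or P99 latency > 2× baseline) can be expressed as a small predicate. This is a sketch, not the alerting implementation; metric names and units are illustrative assumptions:

```python
# Sev2 thresholds from this page. How the metrics are windowed and collected
# is assumed to live in the monitoring stack; this only encodes the rules.

def is_sev2(error_rate: float, minutes_sustained: float,
            p99_ms: float, baseline_p99_ms: float) -> bool:
    elevated_errors = error_rate > 0.05 and minutes_sustained > 5
    slow = p99_ms > 2 * baseline_p99_ms
    return elevated_errors or slow
```

So a 6% error rate for 6 minutes qualifies, as does a P99 of 250 ms against a 100 ms baseline; a brief error spike under 5 minutes does not.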
Sev3 — Caught early, no customer impact
Used for:
- Internal alert without user symptom yet.
- Single customer issue that’s being worked around.
- Capacity warnings (disk > 80%, memory pressure).
- Non-critical regressions caught by evals.
Response:
- File a ticket.
- Address next business day.
- No paging; no incident channel.
Examples:
- Brain database approaching 8 GB.
- Cache hit ratio dropped 20% after a deploy (not impacting users yet).
- Engine logs show occasional warnings about a deprecation.
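The disk > 80% capacity warning above can be checked locally with the standard library. A minimal sketch; the threshold comes from this page, while the path and how the result feeds an alert are assumptions:

```python
import shutil

# Check whether a filesystem is past the Sev3 capacity-warning threshold.
# The 0.80 threshold is from this page; the default path is an assumption.

def disk_usage_warning(path: str = "/", threshold: float = 0.80) -> bool:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > threshold
```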
Borderline cases
When you’re unsure between two levels, escalate. Better a false Sev2 than a missed Sev1. A few rules of thumb:
- Customer impact known to be > 5% → Sev1.
- Customer impact known to be 1-5% → Sev2.
- Internal-only with no current user symptom → Sev3.
- Could become Sev1 within an hour if not addressed → Sev2.
Severity reduction
A live Sev1 can be downgraded to Sev2 if:
- Mitigation is complete.
- Customer impact has stopped.
- Only follow-up work remains (postmortem, monitoring of the fix).
Escalation matrix
| Time on Sev1 | Who’s involved |
|---|---|
| 0–15 min | Primary on-call |
| 15–30 min | + Secondary |
| 30–60 min | + Service owner / engineering lead |
| 1+ hour | + Leadership |
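The escalation matrix is cumulative: each row adds people rather than replacing them. A sketch of that lookup, with role labels taken from the table and boundary minutes treated as entering the next row (an assumption, since the table's ranges overlap at their edges):

```python
# Cumulative Sev1 escalation roster by elapsed minutes, per the matrix above.

def escalation_roster(minutes_on_sev1: int) -> list[str]:
    roster = ["primary on-call"]
    if minutes_on_sev1 >= 15:
        roster.append("secondary")
    if minutes_on_sev1 >= 30:
        roster.append("service owner / engineering lead")
    if minutes_on_sev1 >= 60:
        roster.append("leadership")
    return roster
```

Ten minutes in, only the primary is involved; at 75 minutes, all four tiers are.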
Severity in postmortems
Every postmortem records the severity used. If it’s wrong in hindsight, note it:

> Severity: Sev2. In retrospect should have been Sev1 — actual customer impact was 8%, not the 3% initially estimated.

Calibration improves over time. Pay attention.
Customer comms by severity
| Severity | Status page | Direct customer comm |
|---|---|---|
| Sev1 | Yes, within 15 min | Within 30 min for impacted customers |
| Sev2 | If customer-visible | Optional |
| Sev3 | No | No |
See also
- On-call rotation — who’s responsible.
- Handoff — what gets passed at rotation.
- Postmortems — after-action reports.

