Severity definitions

When something goes wrong, the first decision is severity. Severity drives who gets paged, how fast they respond, and what the next 30 minutes look like. This page defines the levels.

The levels

Sev1 — Customer-impacting outage

Used for:

The Engine is down for multiple customers (a single customer hard-down while others are fine is Sev2).
Data integrity is at risk (corruption, loss).
Active security incident (suspected breach, credential leak, unauthorized access).
Compliance breach (data exposure beyond authorized scope).

Response:

Page primary, secondary, and the service owner immediately.
Open the incident channel.
Update the status page within 15 minutes.
Draft customer comms within 30 minutes for any customer impacted for more than 5 minutes.
Mitigate first; root-cause later.

Examples:

/health returns non-ok across the fleet.
A migration corrupted user brains.
An API key leaked and someone is using it.

Sev2 — Degraded service

Used for:

Significant feature broken but not a full outage.
Elevated error rate (> 5%) sustained for > 5 minutes.
P99 latency > 2× baseline sustained.
One customer’s Engine is hard-down (others fine).
Major MCP connector down (Slack, Stripe, etc., affecting tier-2 agent capability).

Response:

Page primary.
Open the incident channel.
Update the status page if the issue is customer-visible.
Mitigate within the hour if possible.

Examples:

The Learning Centre is failing every batch.
One specific tool is broken (web_search returns 500s).
A subset of users is seeing slow responses.
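The two numeric Sev2 criteria above (error rate and P99 latency) can be checked mechanically. The following is a minimal Python sketch, not the team's actual alerting logic: it assumes metrics arrive as one sample per minute, and all function and constant names are hypothetical. The page does not say how long a latency breach must be sustained, so the sketch reuses the 5-minute window from the error-rate rule as an explicit assumption.

```python
from typing import Sequence

ERROR_RATE_THRESHOLD = 0.05  # "> 5%" from the Sev2 criteria
SUSTAIN_MINUTES = 5          # "sustained for > 5 minutes"
LATENCY_MULTIPLIER = 2.0     # "P99 latency > 2x baseline"


def longest_breach(samples: Sequence[float], threshold: float) -> int:
    """Return the longest run of consecutive samples strictly above threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def error_rate_breaches_sev2(per_minute_error_rates: Sequence[float]) -> bool:
    """True if the error rate stayed above 5% for more than 5 one-minute samples in a row."""
    return longest_breach(per_minute_error_rates, ERROR_RATE_THRESHOLD) > SUSTAIN_MINUTES


def latency_breaches_sev2(
    per_minute_p99_ms: Sequence[float],
    baseline_p99_ms: float,
    sustain_minutes: int = SUSTAIN_MINUTES,  # assumption: the page gives no duration for latency
) -> bool:
    """True if P99 stayed above 2x baseline for more than sustain_minutes samples in a row."""
    threshold = LATENCY_MULTIPLIER * baseline_p99_ms
    return longest_breach(per_minute_p99_ms, threshold) > sustain_minutes
```

A short spike that recovers resets the run, so six breaching minutes split by one healthy minute would not trip either check under these assumptions.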

Sev3 — Caught early, no customer impact

Used for:

Internal alert with no user symptoms yet.
Single-customer issue that’s being worked around.
Capacity warnings (disk > 80%, memory pressure).
Non-critical regressions caught by evals.

Response:

File a ticket.
Address next business day.
No paging; no incident channel.

Examples:

Brain database approaching 8 GB.
Cache hit ratio dropped 20% after a deploy (not impacting users yet).
Engine logs show occasional warnings about a deprecation.

Borderline cases

When you’re unsure between two levels, escalate. Better a false-Sev2 than a missed-Sev1. A few rules of thumb:

Customer impact known to be > 5% → Sev1.
Customer impact known to be 1-5% → Sev2.
Internal-only with no current user symptom → Sev3.
Could become Sev1 within an hour if not addressed → Sev2.
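The rules of thumb above can be written as a small decision function. This is an illustrative Python sketch only: the function name, parameters, and the order in which the rules are checked are assumptions, not something this page specifies. It checks the "could become Sev1" rule before the "internal-only" rule, in line with the advice to escalate when unsure, and it returns `None` for cases the rules do not cover (for example, known impact below 1% with user symptoms), which remain a judgment call.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def rule_of_thumb(
    impact_pct: Optional[float],          # known customer impact in percent, or None if unknown
    user_symptom: bool,                   # is any user currently seeing a symptom?
    could_become_sev1_within_hour: bool,  # would it reach Sev1 within an hour if left alone?
) -> Optional[Severity]:
    """Apply the borderline-case rules; None means the rules do not decide."""
    if impact_pct is not None and impact_pct > 5:
        return Severity.SEV1
    if impact_pct is not None and 1 <= impact_pct <= 5:
        return Severity.SEV2
    if could_become_sev1_within_hour:
        return Severity.SEV2
    if not user_symptom:
        return Severity.SEV3
    return None  # not covered by the rules of thumb: decide by hand, escalating if unsure
```

For instance, `rule_of_thumb(3, True, False)` gives `Severity.SEV2`, while an internal-only alert that could turn into a Sev1 within the hour is also classified as Sev2 rather than Sev3.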

Severity reduction

A live Sev1 can be downgraded to Sev2 if all of the following hold:

Mitigation is complete.
Customer impact has stopped.
Only follow-up work remains (postmortem, monitoring of the fix).

Downgrading isn’t a celebration; it changes what response level the issue gets. The postmortem still happens.
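As a rough illustration, the downgrade check can be expressed as a conjunction of the three conditions listed above. This Python sketch assumes all three must hold at once; the class and field names are hypothetical and do not refer to any existing tooling.

```python
from dataclasses import dataclass


@dataclass
class Sev1Status:
    mitigation_complete: bool
    customer_impact_stopped: bool
    only_follow_up_remaining: bool  # e.g. postmortem, monitoring of the fix


def next_severity(status: Sev1Status) -> str:
    """Return "Sev2" only when every downgrade condition holds; otherwise stay at "Sev1"."""
    ready = (
        status.mitigation_complete
        and status.customer_impact_stopped
        and status.only_follow_up_remaining
    )
    return "Sev2" if ready else "Sev1"
```

The function only changes the response level; it deliberately has no effect on whether a postmortem is owed.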

Escalation matrix

Time on Sev1	Who’s involved
0–15 min	Primary on-call
15–30 min	+ Secondary
30–60 min	+ Service owner / engineering lead
1+ hour	+ Leadership

For Sev2, the matrix is shifted: primary takes the lead, secondary joins after 30 min if the incident is not resolving, and leadership joins only if it runs past 1 hour.
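The escalation matrix maps elapsed time to the set of people involved, which can be sketched as a lookup. The Python below is a hypothetical illustration, not an existing paging integration: the names are invented, and it assumes that a time falling exactly on a boundary (for example 15 min) belongs to the later row, since the table's ranges overlap at their edges.

```python
from datetime import timedelta
from typing import List

# (time on Sev1 at which the role joins, role), taken from the table above
SEV1_ESCALATION = [
    (timedelta(minutes=0), "Primary on-call"),
    (timedelta(minutes=15), "Secondary"),
    (timedelta(minutes=30), "Service owner / engineering lead"),
    (timedelta(hours=1), "Leadership"),
]


def sev1_involved(elapsed: timedelta) -> List[str]:
    """Everyone who should be involved once a Sev1 has been open for `elapsed`."""
    return [role for joins_at, role in SEV1_ESCALATION if elapsed >= joins_at]


def sev2_involved(elapsed: timedelta, resolving: bool) -> List[str]:
    """Shifted Sev2 matrix: secondary after 30 min if not resolving, leadership after 1 hour."""
    roles = ["Primary on-call"]
    if elapsed >= timedelta(minutes=30) and not resolving:
        roles.append("Secondary")
    if elapsed >= timedelta(hours=1):
        roles.append("Leadership")
    return roles
```

Under these assumptions, `sev1_involved(timedelta(minutes=20))` returns the primary and secondary, and a Sev2 that is 45 minutes old and still resolving involves only the primary.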

Severity in postmortems

Every postmortem records the severity used. If it’s wrong in hindsight, note it:

Severity: Sev2. In retrospect should have been Sev1 — actual customer impact was 8%, not the 3% initially estimated.

Calibration improves over time. Pay attention.

Customer comms by severity

Severity	Status page	Direct customer comm
Sev1	Yes, within 15 min	Within 30 min for impacted customers
Sev2	If customer-visible	Optional
Sev3	No	No

Status page templates and customer comm templates live with the handoff checklist.
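The comms table above can be turned into concrete deadlines. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the 15- and 30-minute clocks start when the incident is declared, which this page does not state.

```python
from datetime import datetime, timedelta
from typing import Dict

STATUS_PAGE_DEADLINE = timedelta(minutes=15)    # Sev1: "Yes, within 15 min"
CUSTOMER_COMM_DEADLINE = timedelta(minutes=30)  # Sev1: "Within 30 min for impacted customers"


def status_page_required(severity: str, customer_visible: bool) -> bool:
    """Whether the table calls for a status page update at this severity."""
    if severity not in ("Sev1", "Sev2", "Sev3"):
        raise ValueError(f"unknown severity: {severity!r}")
    if severity == "Sev1":
        return True
    if severity == "Sev2":
        return customer_visible
    return False  # Sev3: no status page update


def sev1_comms_deadlines(declared_at: datetime) -> Dict[str, datetime]:
    """Deadlines for a Sev1, assuming the clock starts at declaration time."""
    return {
        "status_page": declared_at + STATUS_PAGE_DEADLINE,
        "customer_comm": declared_at + CUSTOMER_COMM_DEADLINE,
    }
```

Sev2 and Sev3 have no deadlines in the table, so the sketch only computes times for Sev1.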
