
When something goes wrong, the first decision is severity. Severity drives who gets paged, how fast they respond, and what the next 30 minutes look like. This page defines the levels.

The levels

Sev1 — Customer-impacting outage

Used for:
  • The Engine is down for one or more customers.
  • Data integrity is at risk (corruption, loss).
  • Active security incident (suspected breach, credential leak, unauthorized access).
  • Compliance breach (data exposure beyond authorized scope).
Response:
  • Page primary, secondary, and the service owner immediately.
  • Open the incident channel.
  • Status page updated within 15 minutes.
  • Customer comms drafted within 30 minutes for any customer with impact > 5 minutes.
  • Mitigate first; root-cause later.
Examples:
  • /health returns non-ok across the fleet.
  • A migration corrupted user brains.
  • An API key leaked and someone is using it.
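
A minimal sketch of kicking off a Sev1: a paging fan-out plus the two comms clocks. The `Pager` class here is a hypothetical stand-in for whatever paging integration is actually in use.

```python
from datetime import datetime, timedelta, timezone


class Pager:
    """Hypothetical paging client; swap in the real integration."""

    def page(self, role: str, message: str) -> None:
        print(f"PAGE {role}: {message}")


def open_sev1(pager: Pager, title: str) -> dict:
    """Page every Sev1 role at once and record the comms deadlines."""
    now = datetime.now(timezone.utc)
    # Sev1 pages primary, secondary, and the service owner immediately.
    for role in ("primary", "secondary", "service_owner"):
        pager.page(role, f"[Sev1] {title}")
    # The two hard deadlines from this page: status page within 15 min,
    # customer comms draft within 30 min.
    return {
        "status_page_due": now + timedelta(minutes=15),
        "customer_comms_due": now + timedelta(minutes=30),
    }
```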

Sev2 — Degraded service

Used for:
  • Significant feature broken but not a full outage.
  • Elevated error rate (> 5%) sustained for > 5 minutes.
  • P99 latency > 2× baseline sustained.
  • One customer’s Engine is hard-down (others fine).
  • Major MCP connector down (Slack, Stripe, etc., affecting tier-2 agent capability).
Response:
  • Page primary.
  • Open the incident channel.
  • Status page updated if customer-visible.
  • Mitigate within the hour if possible.
Examples:
  • The Learning Centre is failing every batch.
  • One specific tool is broken (web_search returns 500s).
  • A subset of users is seeing slow responses.
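
The two quantitative triggers above (error rate > 5% sustained for > 5 minutes, P99 > 2× baseline) translate directly into a windowed check. A sketch, assuming per-minute samples in a shape this page doesn't prescribe:

```python
def breaches_sev2(samples: list[dict], p99_baseline_ms: float) -> bool:
    """samples: per-minute points, newest last, each carrying
    'error_rate' (0..1) and 'p99_ms'. Both key names are assumptions."""
    # Approximate "sustained for > 5 minutes" with the last five points.
    window = samples[-5:]
    if len(window) < 5:
        return False
    high_errors = all(s["error_rate"] > 0.05 for s in window)          # > 5% error rate
    slow_p99 = all(s["p99_ms"] > 2 * p99_baseline_ms for s in window)  # P99 > 2x baseline
    return high_errors or slow_p99
```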

Sev3 — Caught early, no customer impact

Used for:
  • Internal alert without user symptom yet.
  • Single customer issue that’s being worked around.
  • Capacity warnings (disk > 80%, memory pressure).
  • Non-critical regressions caught by evals.
Response:
  • File a ticket.
  • Address next business day.
  • No paging; no incident channel.
Examples:
  • Brain database approaching 8 GB.
  • Cache hit ratio dropped 20% after a deploy (not impacting users yet).
  • Engine logs show occasional warnings about a deprecation.
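
The capacity examples carry concrete thresholds, so they can be checked mechanically; the output is a ticket description, never a page. A sketch, where the "approaching" margin is an assumption:

```python
DISK_USED_MAX = 0.80       # "disk > 80%"
BRAIN_DB_LIMIT_GB = 8.0    # "approaching 8 GB"


def sev3_capacity_tickets(disk_used: float, brain_db_gb: float) -> list[str]:
    """Return ticket descriptions for next-business-day follow-up."""
    tickets = []
    if disk_used > DISK_USED_MAX:
        tickets.append(f"disk at {disk_used:.0%}, over the 80% threshold")
    # "Approaching" is a judgment call; 90% of the limit is an assumed margin.
    if brain_db_gb > 0.9 * BRAIN_DB_LIMIT_GB:
        tickets.append(f"brain DB at {brain_db_gb:.1f} GB, nearing the 8 GB limit")
    return tickets  # file these; no paging, no incident channel
```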

Borderline cases

When you’re unsure between two levels, escalate to the higher one. Better a false Sev2 than a missed Sev1. A few rules of thumb:
  • Customer impact known to be > 5% → Sev1.
  • Customer impact known to be 1–5% → Sev2.
  • Internal-only with no current user symptom → Sev3.
  • Could become Sev1 within an hour if not addressed → Sev2.
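
Encoded in order, the rules of thumb come out as a short decision function. The impact figure and the one-hour judgment are human inputs; this sketches the decision order, not automated policy:

```python
def triage(impact_pct: float | None, could_be_sev1_in_an_hour: bool) -> int:
    """Map the rules of thumb to a severity level (1, 2, or 3)."""
    if impact_pct is not None and impact_pct > 5:
        return 1   # known customer impact > 5%
    if impact_pct is not None and impact_pct >= 1:
        return 2   # known customer impact 1-5%
    if could_be_sev1_in_an_hour:
        return 2   # not yet a Sev1, but could be within the hour
    return 3       # internal-only, no current user symptom
```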

Severity reduction

A live Sev1 can be downgraded to Sev2 if:
  • Mitigation is complete.
  • Customer impact has stopped.
  • Only follow-up work remains (postmortem, monitoring of the fix).
Downgrading isn’t a celebration; it only changes the response level the issue gets from here on. The postmortem still happens.
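
As a gate, the criteria are an all-of-three check:

```python
def downgrade_sev1(mitigated: bool, impact_stopped: bool,
                   only_followup_left: bool) -> int:
    """Return the new severity for a live Sev1, per the criteria above."""
    if mitigated and impact_stopped and only_followup_left:
        return 2   # downgraded; the postmortem still happens
    return 1       # any open condition keeps it a Sev1
```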

Escalation matrix

| Time on Sev1 | Who’s involved |
| --- | --- |
| 0–15 min | Primary on-call |
| 15–30 min | + Secondary |
| 30–60 min | + Service owner / engineering lead |
| 1+ hour | + Leadership |
For Sev2, the matrix shifts: primary takes the lead, secondary joins after 30 minutes if the issue isn’t resolving, and leadership gets involved only once it ages past an hour.
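
Both matrices fit in a single lookup. A sketch, with illustrative role names:

```python
def involved(severity: int, elapsed_min: float) -> list[str]:
    """Who should be engaged after elapsed_min minutes on an incident."""
    who = ["primary"]
    if severity == 1:
        if elapsed_min >= 15:
            who.append("secondary")
        if elapsed_min >= 30:
            who.append("service_owner")  # service owner / engineering lead
        if elapsed_min >= 60:
            who.append("leadership")
    elif severity == 2:
        if elapsed_min >= 30:
            who.append("secondary")      # joins if the issue isn't resolving
        if elapsed_min >= 60:
            who.append("leadership")
    return who
```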

Severity in postmortems

Every postmortem records the severity used. If it’s wrong in hindsight, note it:
Severity: Sev2. In retrospect this should have been Sev1: actual customer impact was 8%, not the 3% initially estimated.
Calibration improves over time. Pay attention.

Customer comms by severity

| Severity | Status page | Direct customer comm |
| --- | --- | --- |
| Sev1 | Yes, within 15 min | Within 30 min for impacted customers |
| Sev2 | If customer-visible | Optional |
| Sev3 | No | No |
Status page templates and customer comm templates live with the handoff checklist.
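
For tooling that audits incident follow-through, the same matrix can live as data. Field names here are illustrative:

```python
COMMS_BY_SEVERITY = {
    1: {"status_page": "yes, within 15 min",
        "direct_comm": "within 30 min for impacted customers"},
    2: {"status_page": "if customer-visible", "direct_comm": "optional"},
    3: {"status_page": None, "direct_comm": None},
}
```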
