Postmortems

A postmortem is a written record of what went wrong and what we learned. We write them blamelessly: not because the people involved made no mistakes, but because the goal is to fix the system, not the people. A postmortem that names individuals as the failure point will not produce durable improvements. A postmortem that names the gaps in process, tooling, and design will.

When to write one

Severity	Postmortem?
Sev1 (customer-impacting outage)	Required
Sev2 (degraded service, partial outage)	Required
Sev3 (caught early, no customer impact)	Encouraged
Near-miss	Encouraged if interesting
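
If the table above needs to be enforced in tooling (an incident bot, for example), it could be encoded as a small lookup. The following is a minimal Python sketch under that assumption; `Severity`, `POSTMORTEM_POLICY`, and `postmortem_required` are hypothetical names for illustration, not part of any existing tool.

```python
from enum import Enum


class Severity(Enum):
    """Incident severities, as listed in the table above."""
    SEV1 = "sev1"            # customer-impacting outage
    SEV2 = "sev2"            # degraded service, partial outage
    SEV3 = "sev3"            # caught early, no customer impact
    NEAR_MISS = "near-miss"


# Hypothetical encoding of the table: severity -> postmortem expectation.
POSTMORTEM_POLICY = {
    Severity.SEV1: "required",
    Severity.SEV2: "required",
    Severity.SEV3: "encouraged",
    Severity.NEAR_MISS: "encouraged if interesting",
}


def postmortem_required(severity: Severity) -> bool:
    """Return True only when the policy makes a postmortem mandatory."""
    return POSTMORTEM_POLICY[severity] == "required"
```

A caller would still apply the "if in doubt, write one" rule by hand; the lookup only answers whether a postmortem is mandatory.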

When to skip

You don’t need a postmortem for:

A single dropped request that retried and succeeded.
A flaky test.
An alert that fired but was misconfigured (file an alerting fix instead).

If in doubt, write one. They take an hour and save weeks.

The blameless rule

In the postmortem document:

Refer to people by role, not by name. “The on-call engineer” not “Alice.”
Describe decisions, not motivations. “The engineer rolled forward rather than back” not “the engineer panicked and rolled forward.”
Describe system behavior, not human behavior. “The deploy script did not have a dry-run mode” not “the deployer ran the wrong command.”

This isn’t about being polite. It’s about producing accurate analysis. “Operator error” is a non-explanation that ends investigation. “The deployment system did not require confirmation before destroying the prod database” is the start of a fix.
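
The "roles, not names" rule is easy to break by accident in a long draft. One way to catch slips before review is a simple scan of the draft text, sketched below in Python. The function name `find_named_people` and the idea of passing in a roster of names are assumptions for illustration; no such check is described in this handbook, and it cannot judge the other two rules (decisions versus motivations, system versus human behavior).

```python
import re


def find_named_people(draft: str, roster: list[str]) -> list[tuple[int, str]]:
    """Return (line number, name) pairs for every roster name found in the draft.

    `roster` is a hypothetical list of first names or handles to look for.
    Matching is case-insensitive and limited to whole words.
    """
    hits = []
    for lineno, line in enumerate(draft.splitlines(), start=1):
        for name in roster:
            if re.search(rf"\b{re.escape(name)}\b", line, flags=re.IGNORECASE):
                hits.append((lineno, name))
    return hits
```

Each hit is a prompt to substitute a role such as "the on-call engineer", not an automatic failure: a name can legitimately appear as part of a hostname or service name.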

Lifecycle

Incident closes. The on-call engineer files a stub postmortem within 24 hours.
Author named. One person owns drafting. Usually whoever was on-call.
Draft within 5 working days. Cover: summary, impact, timeline, root cause, trigger, detection, resolution, action items, what went well, what went badly, lessons. Use roles, not names.
Review. Schedule a 30-minute review meeting with everyone involved plus the service owner. The review’s job is to challenge the timeline and the action items, not to assign blame.
Action items dated and owned. Every action item gets an owner and a target date. An action item missing either one doesn’t ship.
Publish. Postmortems live in operations/postmortems/. Internal-only. We circulate them in #engineering.
Track follow-through. Review action item status weekly until all are closed.
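
The deadlines and the action-item rule in the steps above can be sketched as code. This is a minimal Python illustration, not an existing tool. It assumes that "working days" means Monday to Friday with no holiday calendar, and that both clocks start when the incident closes. All names (`stub_deadline`, `draft_deadline`, `ActionItem`, `missing_owner_or_date`) are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional


def stub_deadline(closed_at: datetime) -> datetime:
    """The stub postmortem is due within 24 hours of the incident closing."""
    return closed_at + timedelta(hours=24)


def draft_deadline(closed_on: date, working_days: int = 5) -> date:
    """Count forward `working_days` weekdays from the close date.

    Assumption: weekends are skipped, public holidays are not.
    """
    current = closed_on
    remaining = working_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    target_date: Optional[date] = None


def missing_owner_or_date(items: list[ActionItem]) -> list[ActionItem]:
    """Return the action items that lack an owner or a target date."""
    return [item for item in items if not item.owner or item.target_date is None]
```

For example, an incident that closes on Wednesday 2024-03-06 would have its draft due on Wednesday 2024-03-13 under these assumptions. A review meeting could run `missing_owner_or_date` over the action items and refuse to publish until the result is empty.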

Where they live

operations/postmortems/YYYY-MM-DD-short-slug.mdx.
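
A small helper can keep filenames consistent with this pattern. The sketch below is a hypothetical Python example: the function name `postmortem_path` is invented, the slug rule (lowercase, hyphen-separated alphanumerics) is an assumption, and so is the choice of which date to pass in, since the pattern above only specifies the `YYYY-MM-DD` format.

```python
import re
from datetime import date


def postmortem_path(day: date, title: str) -> str:
    """Build operations/postmortems/YYYY-MM-DD-short-slug.mdx from a date and title."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics collapsed to one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"operations/postmortems/{day.isoformat()}-{slug}.mdx"
```

Calling `postmortem_path(date(2024, 3, 5), "Checkout API outage!")` returns `operations/postmortems/2024-03-05-checkout-api-outage.mdx`.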

What we don’t do

We don’t censor postmortems. The internal version is honest.
We don’t make postmortems performance-review inputs. (If they were, no one would write honest ones.)
We don’t write postmortems for the customer. If we owe a customer a public RCA, write it separately and link from here.
