Postmortem Template

Postmortems are how organizations learn from failure. A good postmortem identifies what went wrong, why, and what will prevent recurrence—without assigning blame to individuals. A bad postmortem becomes a paper exercise that gets filed and forgotten.

This template provides structure for effective, blameless incident analysis.


What problem this solves

Incidents reveal weaknesses in systems, processes, and organizational knowledge. Without structured analysis, the same failures recur. Teams learn nothing, trust erodes, and the organization becomes reactive instead of resilient.

Postmortems solve this by creating a forcing function for reflection: what happened, why, and what will we do differently? The written artifact ensures learning is captured and shared beyond the immediate responders.


When to use this

Use a postmortem for:

  • Any incident that significantly impacted users (SEV1 or SEV2).
  • Near-misses that could have been serious.
  • Incidents with interesting learning potential, even if impact was limited.
  • Any time the team feels "we need to understand what happened."

Don't use a postmortem for:

  • Minor issues with obvious fixes.
  • Team process retrospectives (use a retro instead).
  • Situations where the goal is to assign blame.

Roles and ownership

| Role | Responsibility |
| --- | --- |
| Postmortem owner | Writes the document, schedules the review meeting, and ensures actions are tracked. Usually the incident commander or a designated team member. |
| Incident responders | Contribute to the timeline and analysis. Review the draft for accuracy. |
| Leadership | Ensures postmortems happen. Models blameless culture. Allocates resources for follow-up actions. |

How to run the postmortem

Step 1: Draft the postmortem (within 48 hours of incident)

The postmortem owner creates the initial document, focusing on:

  • A clear timeline of events.
  • Impact summary (who was affected, for how long).
  • Initial analysis of contributing factors.

Don't wait until you have perfect information. Start with what you know; fill in gaps later.

Step 2: Gather input (1–3 days)

Share the draft with everyone involved. Ask them to:

  • Correct any timeline errors.
  • Add context you may have missed.
  • Suggest contributing factors.

For complex incidents, schedule a brief sync to walk through the timeline together.

Step 3: Conduct root cause analysis

Ask "why" repeatedly until you reach systemic causes:

  • Why did the service fail? (A bad config was deployed.)
  • Why was a bad config deployed? (It wasn't caught in review.)
  • Why wasn't it caught in review? (The config syntax is complex and error-prone.)
  • Why is the syntax error-prone? (We haven't invested in validation tooling.)

The goal is to find causes you can actually address—not "human error" (which is not actionable) but the system conditions that made the error possible.
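The systemic fix named in the example above ("invest in validation tooling") often takes the shape of a pre-deploy check. A minimal sketch of one, assuming a JSON config and illustrative field names (`service_name`, `timeout_ms`, `replicas` are hypothetical, not from any real incident):

```python
# Hypothetical CI gate: validate a service config before it can be deployed.
# Field names and rules here are illustrative assumptions.
import json
import sys

REQUIRED_FIELDS = {"service_name": str, "timeout_ms": int, "replicas": int}

def validate_config(path):
    """Return a list of human-readable errors; an empty list means the config is valid."""
    errors = []
    try:
        with open(path) as f:
            config = json.load(f)
    except (OSError, json.JSONDecodeError) as e:
        return [f"{path}: cannot parse: {e}"]
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in config:
            errors.append(f"{path}: missing required field '{field}'")
        elif not isinstance(config[field], expected_type):
            errors.append(f"{path}: '{field}' must be {expected_type.__name__}")
    if config.get("replicas", 1) < 1:
        errors.append(f"{path}: 'replicas' must be at least 1")
    return errors

if __name__ == "__main__":
    # Exit nonzero so CI blocks the deploy when any config is invalid.
    problems = [e for path in sys.argv[1:] for e in validate_config(path)]
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)
```

The point is not this particular script but the principle: a check that runs automatically removes the reliance on reviewers catching error-prone syntax by eye.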

Step 4: Define action items

Every postmortem must produce concrete follow-up actions:

  • Each action has an owner and a deadline.
  • Actions should prevent recurrence, not just fix the immediate issue.
  • Prioritize: not everything can be done immediately.

Be realistic about capacity. A postmortem with 15 action items that never get done is worse than one with 3 actions that get completed.

Step 5: Review and share

Hold a brief postmortem review meeting (30 min) to:

  • Walk through the analysis.
  • Confirm the team agrees on root causes.
  • Commit to the action items.

Share the final postmortem widely. Learning should spread beyond the team that experienced the incident.

Step 6: Track follow-through

Add action items to your tracking system. Review progress in subsequent team meetings. Close the loop—incomplete actions erode trust in the postmortem process.
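Part of this follow-through can be automated. A minimal, tracker-agnostic sketch that flags open action items past their due date (the item list is hardcoded for illustration; in practice you would pull these from your issue tracker):

```python
# Sketch: surface overdue postmortem action items in a recurring report.
# The hardcoded action list below is illustrative, not real data.
from datetime import date

def overdue_items(items, today):
    """Return open action items whose due date has passed."""
    return [i for i in items if i["status"] == "open" and i["due"] < today]

actions = [
    {"action": "Add config validation to CI", "owner": "alice",
     "due": date(2024, 3, 1), "status": "open"},
    {"action": "Alert on error-rate spike", "owner": "bob",
     "due": date(2024, 4, 15), "status": "done"},
]

for item in overdue_items(actions, today=date(2024, 4, 1)):
    print(f"OVERDUE: {item['action']} (owner: {item['owner']}, due {item['due']})")
```

Posting a report like this in a team channel each week keeps stalled items visible without anyone having to remember to check.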


Signals that postmortems are working

  • Incidents that have been through a postmortem don't recur.
  • Action items get completed within agreed timelines.
  • The same contributing factors don't appear repeatedly.
  • Team members feel safe contributing honestly.
  • Postmortems are read by people beyond the immediate team.

Failure modes and mitigations

| Failure mode | What it looks like | Mitigation |
| --- | --- | --- |
| Blame culture | Postmortems identify "who screwed up" | Leadership must model blamelessness; focus language on systems, not people |
| Incomplete analysis | Postmortem stops at "bad deploy" without asking why | Use "5 Whys" or a similar technique to reach systemic causes |
| Actions never done | Postmortem produces a list that gets ignored | Track actions in your normal work system; review in retros |
| Postmortems not written | Incidents happen but no analysis follows | Make postmortems mandatory for SEV1/SEV2; assign owners immediately |
| Too long / too detailed | Postmortem is 10 pages that no one reads | Keep it focused; the summary and key learnings should fit on one page |

The template

Postmortem document

# Postmortem: [Incident Title]

**Date of incident:** [Date]
**Duration:** [Start time] – [End time] ([X hours/minutes])
**Severity:** [SEV1 / SEV2 / SEV3]
**Postmortem owner:** [Name]
**Last updated:** [Date]

---

## Summary

[2–3 sentences: What happened? Who was affected? What was the impact?]

---

## Impact

| Metric          | Value                   |
| --------------- | ----------------------- |
| Users affected  | [Number or percentage]  |
| Duration        | [X hours/minutes]       |
| Revenue impact  | [If applicable]         |
| Support tickets | [Number]                |
| SLO impact      | [Error budget consumed] |

---

## Timeline

All times in [timezone].

| Time    | Event                      |
| ------- | -------------------------- |
| [HH:MM] | [Event description]        |
| [HH:MM] | [Event description]        |
| [HH:MM] | Alert fired: [alert name]  |
| [HH:MM] | [Name] began investigation |
| [HH:MM] | Root cause identified      |
| [HH:MM] | Mitigation applied         |
| [HH:MM] | Service restored           |

---

## Root cause analysis

### What happened

[Describe the technical failure. What broke and how?]

### Why it happened

[Apply "5 Whys" or similar analysis to reach systemic causes.]

1. **Why did [immediate cause] happen?**
   [Answer]

2. **Why did [answer 1] happen?**
   [Answer]

3. **Why did [answer 2] happen?**
   [Answer]

[Continue until you reach causes you can address.]

### Contributing factors

- [Factor 1: e.g., lack of monitoring for this failure mode]
- [Factor 2: e.g., documentation was outdated]
- [Factor 3: e.g., incident response process was unclear]

---

## What went well

- [Thing that worked during the incident or response]
- [Thing that worked]

---

## What could have gone better

- [Thing that made the incident worse or harder to resolve]
- [Thing that could have been better]

---

## Action items

| Action                         | Owner  | Priority | Due date | Status |
| ------------------------------ | ------ | -------- | -------- | ------ |
| [Action to prevent recurrence] | [Name] | P1       | [Date]   | ☐      |
| [Action to improve detection]  | [Name] | P2       | [Date]   | ☐      |
| [Action to improve response]   | [Name] | P2       | [Date]   | ☐      |

---

## Lessons learned

[Key takeaways that should be shared with the broader organization.]

---

## Appendix

[Links to relevant dashboards, logs, Slack threads, or other artifacts.]