Postmortem Template

Postmortems are how organizations learn from failure. A good postmortem identifies what went wrong, why, and what will prevent recurrence—without assigning blame to individuals. A bad postmortem becomes a paper exercise that gets filed and forgotten.

This template provides structure for effective, blameless incident analysis.


What problem this solves

Incidents reveal weaknesses in systems, processes, and organizational knowledge. Without structured analysis, the same failures recur. Teams learn nothing, trust erodes, and the organization becomes reactive instead of resilient.

Postmortems solve this by creating a forcing function for reflection: what happened, why, and what will we do differently? The written artifact ensures learning is captured and shared beyond the immediate responders.


When to use this

Use a postmortem for:

  • Any incident that significantly impacted users (SEV1 or SEV2).
  • Near-misses that could have been serious.
  • Incidents with interesting learning potential, even if impact was limited.
  • Any time the team feels "we need to understand what happened."

Don't use a postmortem for:

  • Minor issues with obvious fixes.
  • Team process retrospectives (use a retro instead).
  • Situations where the goal is to assign blame.

Roles and ownership

| Role | Responsibility |
| --- | --- |
| Postmortem owner | Writes the document, schedules the review meeting, and ensures actions are tracked. Usually the incident commander or a designated team member. |
| Incident responders | Contribute to the timeline and analysis. Review the draft for accuracy. |
| Leadership | Ensures postmortems happen. Models blameless culture. Allocates resources for follow-up actions. |

How to run the postmortem

Step 1: Draft the postmortem (within 48 hours of incident)

The postmortem owner creates the initial document, focusing on:

  • A clear timeline of events.
  • Impact summary (who was affected, for how long).
  • Initial analysis of contributing factors.

Don't wait until you have perfect information. Start with what you know; fill in gaps later.

Step 2: Gather input (1–3 days)

Share the draft with everyone involved. Ask them to:

  • Correct any timeline errors.
  • Add context you may have missed.
  • Suggest contributing factors.

For complex incidents, schedule a brief sync to walk through the timeline together.

Step 3: Conduct root cause analysis

Ask "why" repeatedly until you reach systemic causes:

  • Why did the service fail? (A bad config was deployed.)
  • Why was a bad config deployed? (It wasn't caught in review.)
  • Why wasn't it caught in review? (The config syntax is complex and error-prone.)
  • Why is the syntax error-prone? (We haven't invested in validation tooling.)

The goal is to find causes you can actually address—not "human error" (which is not actionable) but the system conditions that made the error possible.
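The systemic fix named in the example above ("invest in validation tooling") often takes the shape of a pre-deploy check. A minimal sketch of one, assuming a JSON config and illustrative field names (`service_name`, `timeout_ms`, `replicas` are hypothetical, not from any real incident):

```python
# Hypothetical CI gate: validate a service config before it can be deployed.
# Field names and rules here are illustrative assumptions.
import json
import sys

REQUIRED_FIELDS = {"service_name": str, "timeout_ms": int, "replicas": int}

def validate_config(path):
    """Return a list of human-readable errors; an empty list means the config is valid."""
    errors = []
    try:
        with open(path) as f:
            config = json.load(f)
    except (OSError, json.JSONDecodeError) as e:
        return [f"{path}: cannot parse: {e}"]
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in config:
            errors.append(f"{path}: missing required field '{field}'")
        elif not isinstance(config[field], expected_type):
            errors.append(f"{path}: '{field}' must be {expected_type.__name__}")
    if config.get("replicas", 1) < 1:
        errors.append(f"{path}: 'replicas' must be at least 1")
    return errors

if __name__ == "__main__":
    # Exit nonzero so CI blocks the deploy when any config is invalid.
    problems = [e for path in sys.argv[1:] for e in validate_config(path)]
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)
```

The point is not this particular script but the principle: a check that runs automatically removes the reliance on reviewers catching error-prone syntax by eye.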

Step 4: Define action items

Every postmortem must produce concrete follow-up actions:

  • Each action has an owner and a deadline.
  • Actions should prevent recurrence, not just fix the immediate issue.
  • Prioritize: not everything can be done immediately.

Be realistic about capacity. A postmortem with 15 action items that never get done is worse than one with 3 actions that get completed.

Step 5: Review and share

Hold a brief postmortem review meeting (30 min) to:

  • Walk through the analysis.
  • Confirm the team agrees on root causes.
  • Commit to the action items.

Share the final postmortem widely. Learning should spread beyond the team that experienced the incident.

Step 6: Track follow-through

Add action items to your tracking system. Review progress in subsequent team meetings. Close the loop—incomplete actions erode trust in the postmortem process.
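Part of this follow-through can be automated. A minimal, tracker-agnostic sketch that flags open action items past their due date (the item list is hardcoded for illustration; in practice you would pull these from your issue tracker):

```python
# Sketch: surface overdue postmortem action items in a recurring report.
# The hardcoded action list below is illustrative, not real data.
from datetime import date

def overdue_items(items, today):
    """Return open action items whose due date has passed."""
    return [i for i in items if i["status"] == "open" and i["due"] < today]

actions = [
    {"action": "Add config validation to CI", "owner": "alice",
     "due": date(2024, 3, 1), "status": "open"},
    {"action": "Alert on error-rate spike", "owner": "bob",
     "due": date(2024, 4, 15), "status": "done"},
]

for item in overdue_items(actions, today=date(2024, 4, 1)):
    print(f"OVERDUE: {item['action']} (owner: {item['owner']}, due {item['due']})")
```

Posting a report like this in a team channel each week keeps stalled items visible without anyone having to remember to check.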


Signals that postmortems are working

  • Incidents that have been through a postmortem don't recur.
  • Action items get completed within agreed timelines.
  • The same contributing factors don't appear repeatedly.
  • Team members feel safe contributing honestly.
  • Postmortems are read by people beyond the immediate team.

Failure modes and mitigations

| Failure mode | What it looks like | Mitigation |
| --- | --- | --- |
| Blame culture | Postmortems identify "who screwed up" | Leadership must model blamelessness; focus language on systems, not people |
| Incomplete analysis | Postmortem stops at "bad deploy" without asking why | Use "5 Whys" or a similar technique to reach systemic causes |
| Actions never done | Postmortem produces a list that gets ignored | Track actions in your normal work system; review in retros |
| Postmortems not written | Incidents happen but no analysis follows | Make postmortems mandatory for SEV1/SEV2; assign owners immediately |
| Too long / too detailed | Postmortem is 10 pages that no one reads | Keep it focused; the summary and key learnings should fit on one page |

The template

Postmortem document

# Postmortem: [Incident Title]

**Date of incident:** [Date]
**Duration:** [Start time] – [End time] ([X hours/minutes])
**Severity:** [SEV1 / SEV2 / SEV3]
**Postmortem owner:** [Name]
**Last updated:** [Date]

---

## Summary

[2–3 sentences: What happened? Who was affected? What was the impact?]

---

## Impact

| Metric          | Value                   |
| --------------- | ----------------------- |
| Users affected  | [Number or percentage]  |
| Duration        | [X hours/minutes]       |
| Revenue impact  | [If applicable]         |
| Support tickets | [Number]                |
| SLO impact      | [Error budget consumed] |

---

## Timeline

All times in [timezone].

| Time    | Event                      |
| ------- | -------------------------- |
| [HH:MM] | [Event description]        |
| [HH:MM] | [Event description]        |
| [HH:MM] | Alert fired: [alert name]  |
| [HH:MM] | [Name] began investigation |
| [HH:MM] | Root cause identified      |
| [HH:MM] | Mitigation applied         |
| [HH:MM] | Service restored           |

---

## Root cause analysis

### What happened

[Describe the technical failure. What broke and how?]

### Why it happened

[Apply "5 Whys" or similar analysis to reach systemic causes.]

1. **Why did [immediate cause] happen?**
   [Answer]

2. **Why did [answer 1] happen?**
   [Answer]

3. **Why did [answer 2] happen?**
   [Answer]

[Continue until you reach causes you can address.]

### Contributing factors

- [Factor 1: e.g., lack of monitoring for this failure mode]
- [Factor 2: e.g., documentation was outdated]
- [Factor 3: e.g., incident response process was unclear]

---

## What went well

- [Thing that worked during the incident or response]
- [Thing that worked]

---

## What could have gone better

- [Thing that made the incident worse or harder to resolve]
- [Thing that could have been better]

---

## Action items

| Action                         | Owner  | Priority | Due date | Status |
| ------------------------------ | ------ | -------- | -------- | ------ |
| [Action to prevent recurrence] | [Name] | P1       | [Date]   | ☐      |
| [Action to improve detection]  | [Name] | P2       | [Date]   | ☐      |
| [Action to improve response]   | [Name] | P2       | [Date]   | ☐      |

---

## Lessons learned

[Key takeaways that should be shared with the broader organization.]

---

## Appendix

[Links to relevant dashboards, logs, Slack threads, or other artifacts.]