Incident Response

When production breaks, how you respond matters as much as the fix. A good incident response minimizes user impact, keeps stakeholders informed, and captures learning that prevents recurrence. A bad response turns a small problem into a big one through chaos, blame, and missed lessons.

This page covers the delivery perspective on incident response: the practices and processes that make incidents manageable. For the complete framework including severity levels, roles, and communication, see Crisis Management.

What Problem This Solves

Incidents are inevitable in any system with sufficient complexity. The question is not whether you'll have incidents, but how you'll handle them.

Without a clear incident response process:

Chaos multiplies. Everyone investigates the same thing. Nobody knows who's in charge. Communication is inconsistent. The incident takes longer to resolve than it should.

Learning doesn't happen. The incident is fixed, everyone goes back to normal, and the same thing happens again in three months.

People burn out. A few heroes always handle incidents. Everyone else watches. On-call becomes a punishment, not a responsibility.

Trust erodes. Customers experience problems and hear nothing. Leadership finds out from Twitter. Support is blindsided.

Good incident response prevents all of these by making incident handling structured, distributed, and learning-oriented.


When to Use This Process

Invoke incident response when:

  • Users are experiencing degraded service or errors
  • Automated alerts fire indicating system problems
  • Internal users report something is broken
  • You suspect a security issue

Don't invoke when:

  • The issue is minor and can be fixed in minutes without coordination
  • It's a development/staging environment issue with no user impact
  • It's a known issue with a known timeline

When in doubt, invoke. It's easier to stand down from an incident than to escalate a problem that grew while you waited.


Ownership

| Role | Responsibility |
| --- | --- |
| On-call engineer | First responder; triages, escalates, often becomes Tech Lead for the incident |
| Incident Commander (IC) | Owns the incident end-to-end; coordinates response, manages communication |
| Tech Lead | Leads technical investigation and resolution |
| Comms Lead | Owns stakeholder communication (internal and external) |
| Engineering Manager | Ensures process is followed; supports IC; addresses team issues |

For full role definitions, see Crisis Management.


The Incident Lifecycle

Phase 1: Detection

Incidents begin when someone notices a problem. The faster you detect, the less impact users experience.

Detection sources:

  • Automated monitoring: Alerts on error rates, latency, availability
  • Customer reports: Support tickets, social media mentions
  • Internal reports: Engineer notices something wrong

Key actions:

  1. Acknowledge the alert or report
  2. Verify the problem is real (not a false positive)
  3. Assess initial impact: What's affected? How many users?
  4. Create an incident record
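The detection steps above can be sketched as a minimal incident record. This is an illustrative assumption, not a prescribed schema; the `Incident` dataclass and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    """Minimal incident record created at detection time (illustrative schema)."""
    title: str
    detected_at: datetime
    impact: str                 # initial impact assessment: what's affected, how many users
    severity: str = "unknown"   # assigned later, during triage
    timeline: list = field(default_factory=list)

    def log(self, entry: str) -> None:
        # Timestamped timeline entries make the postmortem much easier to write.
        self.timeline.append((datetime.now(timezone.utc), entry))

inc = Incident(
    title="Elevated 5xx on checkout API",
    detected_at=datetime.now(timezone.utc),
    impact="~10% of checkout requests failing",
)
inc.log("Alert acknowledged; verified against dashboard, not a false positive")
```

Whatever tool you use, the point is the same: capture the detection time and initial impact the moment you acknowledge, so triage and the postmortem start from facts rather than memory.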

Phase 2: Triage

Determine how serious this is and who needs to respond.

Severity assessment:

  • SEV-1: Complete outage or severe degradation affecting most users
  • SEV-2: Significant degradation affecting a subset of users
  • SEV-3: Partial degradation or non-critical feature failure
  • SEV-4: Minor issue with minimal user impact

See Crisis Management for full severity definitions.

Key actions:

  1. Assign severity
  2. Assign Incident Commander (for SEV-1/SEV-2)
  3. Create incident channel
  4. Notify relevant stakeholders
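One way to make the severity call consistent is to encode it as a simple triage heuristic. The thresholds below are hypothetical; replace them with the definitions from Crisis Management.

```python
def assign_severity(fraction_affected: float, critical_function_broken: bool) -> str:
    """Map initial impact to a severity level (thresholds are illustrative)."""
    if critical_function_broken or fraction_affected > 0.5:
        return "SEV-1"   # complete outage or severe degradation for most users
    if fraction_affected > 0.1:
        return "SEV-2"   # significant degradation for a subset of users
    if fraction_affected > 0.01:
        return "SEV-3"   # partial degradation or non-critical feature failure
    return "SEV-4"       # minor issue with minimal user impact

assign_severity(0.3, critical_function_broken=False)  # → "SEV-2"
```

Even a rough heuristic like this beats debating severity in the heat of the moment; you can always re-triage as impact becomes clearer.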

Phase 3: Investigation

Figure out what's wrong.

Key actions:

  1. Tech Lead coordinates investigation
  2. Responders investigate their assigned areas
  3. Findings are shared in the incident channel
  4. IC tracks progress and removes blockers

Investigation heuristics:

  • What changed recently? Deployments, config changes, traffic patterns
  • Where is the error? Error logs, traces, metrics
  • What's the blast radius? Which users, which regions, which features
  • What's the proximate cause? Even if you don't know root cause, what's immediately wrong?

Phase 4: Mitigation

Stop the bleeding. Mitigation is about restoring service, not fixing the root cause.

Mitigation options:

  • Rollback the recent deployment
  • Disable the problematic feature (feature flag)
  • Scale up resources
  • Route traffic away from the problem
  • Apply a hotfix
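Disabling a problematic feature is often the fastest mitigation after a rollback. As a sketch, assuming a hypothetical in-process flag store rather than any specific feature-flag service:

```python
FLAGS = {"new_checkout_flow": True}  # hypothetical flag store

def feature_enabled(name: str) -> bool:
    # Default to off: an unknown flag should fail safe during an incident.
    return FLAGS.get(name, False)

def kill_switch(name: str) -> None:
    """Mitigation: turn the feature off for everyone, immediately."""
    FLAGS[name] = False

kill_switch("new_checkout_flow")
feature_enabled("new_checkout_flow")  # → False
```

The design choice that matters is that the kill switch requires no deploy: flipping a flag should restore service in seconds, and root-causing can happen afterwards.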

Key principle: Mitigate first, investigate later. User impact matters more than understanding why.

Phase 5: Resolution

Confirm the incident is resolved and close it out.

Resolution criteria:

  • Service is functioning normally
  • Metrics are back to baseline
  • No ongoing user impact
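"Metrics are back to baseline" can be checked mechanically rather than by eyeballing a dashboard. A minimal sketch; the 10% relative tolerance is an assumption you should tune per metric:

```python
def back_to_baseline(current: dict, baseline: dict, tolerance: float = 0.1) -> bool:
    """True if every metric is within `tolerance` (relative) of its baseline."""
    for name, base in baseline.items():
        value = current.get(name, 0.0)
        if base == 0:
            if abs(value) > tolerance:  # absolute check when baseline is zero
                return False
        elif abs(value - base) / abs(base) > tolerance:
            return False
    return True

baseline = {"error_rate": 0.002, "p95_latency_ms": 180}
back_to_baseline({"error_rate": 0.0021, "p95_latency_ms": 185}, baseline)  # → True
back_to_baseline({"error_rate": 0.04, "p95_latency_ms": 185}, baseline)    # → False
```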

Key actions:

  1. IC declares incident resolved
  2. Send final stakeholder communication
  3. Update status page
  4. Schedule postmortem (for SEV-1/SEV-2)

Phase 6: Postmortem

Learn from the incident to prevent recurrence.

Postmortem principles:

  • Blameless. Focus on systems, not people. "What allowed this to happen?" not "Who did this?"
  • Honest. Capture what actually happened, including uncomfortable truths
  • Actionable. Generate specific action items with owners and deadlines
  • Shared. Make postmortems visible so the organization learns

See Postmortem Template for structure.


On-Call Best Practices

On-call is how you ensure someone is always ready to respond. It's also how you burn people out if you do it wrong.

Rotation Design

Fairness. Distribute on-call evenly. Track who's been paged and rebalance if needed.

Coverage. Define the on-call window. 24/7 coverage requires either follow-the-sun rotation or compensation for off-hours.

Escalation. On-call engineers should never feel stuck. Define clear escalation paths for when they need help.

Handoff. At shift changes, hand off context about ongoing issues. Don't make the next person start from zero.

On-Call Health

Page frequency. Track pages per on-call shift. High frequency leads to burnout and alert fatigue.

Actionable alerts. Pages should require human action. If a page can be ignored or auto-resolves, it shouldn't be a page.

Runbooks. On-call engineers shouldn't need to remember everything. Provide runbooks for common issues.

Blameless reviews. When on-call is rough, review why. Address systemic issues, not individual performance.


Runbooks

Runbooks provide step-by-step guidance for common operational tasks and incidents. They encode knowledge so on-call engineers don't have to figure things out under pressure.

What Makes a Good Runbook

Actionable. Tells you what to do, not just what's happening.

Current. Reflects the actual system, not how it worked six months ago.

Accessible. Easy to find during an incident. Linked from alerts when possible.

Tested. Someone has actually followed the steps and confirmed they work.

Runbook Structure

## Runbook: [Title]

### Overview

[What is this runbook for? When should you use it?]

### Symptoms

[How do you know you're in this situation? Alert names, error messages, metrics.]

### Impact

[What's the user impact? What's the business impact?]

### Prerequisites

[What access/tools do you need? Who should you notify?]

### Steps

1. [Step 1 with exact commands if applicable]
2. [Step 2]
3. [Step 3]

### Verification

[How do you confirm the issue is resolved?]

### Escalation

[When and how to escalate if this doesn't work]

### Related

[Links to related runbooks, dashboards, documentation]

See Runbook Template for the full template.


What Good Looks Like

You'll know incident response is working when:

| Signal | What it looks like |
| --- | --- |
| Fast detection | Incidents are caught by monitoring, not customer complaints |
| Calm coordination | IC takes charge quickly; the response is organized, not chaotic |
| Clear communication | Stakeholders know what's happening and when to expect updates |
| Quick mitigation | User impact is stopped within your target window |
| Thorough postmortems | Every significant incident gets a postmortem with actionable follow-ups |
| Declining recurrence | The same incidents don't keep happening |
| Healthy on-call | On-call is sustainable; people don't dread it |

Metrics to Track

  • MTTR (Mean Time to Recovery): From detection to resolution
  • MTTD (Mean Time to Detection): From incident start to detection
  • Incident frequency: Total incidents per period, by severity
  • Postmortem completion rate: Percentage of incidents that get postmortems
  • Action item completion rate: Percentage of postmortem actions that get done
  • On-call health: Pages per shift, off-hours pages, escalation rate
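MTTD and MTTR fall directly out of the timestamps on your incident records. A sketch, assuming each incident carries hypothetical `started`, `detected`, and `resolved` times:

```python
from datetime import datetime
from statistics import mean

def mean_minutes(pairs):
    """Mean gap between (earlier, later) timestamp pairs, in minutes."""
    return mean((later - earlier).total_seconds() / 60 for earlier, later in pairs)

# Hypothetical incident records: (started, detected, resolved).
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 5), datetime(2024, 5, 1, 10, 45)),
    (datetime(2024, 5, 9, 22, 0), datetime(2024, 5, 9, 22, 15), datetime(2024, 5, 9, 23, 0)),
]

mttd = mean_minutes((start, detect) for start, detect, _ in incidents)   # start → detection
mttr = mean_minutes((detect, resolve) for _, detect, resolve in incidents)  # detection → resolution
```

Computing these from the same records you create at detection time keeps the metrics honest; hand-reported times drift toward optimism.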

Failure Modes and Mitigations

The Solo Hero

Symptom: One person always handles incidents. They know all the systems. When they're on vacation, incidents go badly.

Root cause: Knowledge concentration, lack of documented process, hero culture.

Mitigation: Rotate IC role. Require handoff documentation. Pair junior engineers with senior during incidents. Invest in runbooks.

The Blame Game

Symptom: Postmortems become interrogations. People hide mistakes. Future incidents are harder to investigate because people are defensive.

Root cause: Leadership punishes failures instead of learning from them.

Mitigation: Enforce blameless postmortems from the top. Ask "what allowed this" not "who did this." Celebrate incident reporters rather than punishing them.

The Alert Storm

Symptom: Pages are constant. On-call is miserable. Most pages are noise or duplicate.

Root cause: Alerting thresholds too sensitive, no alert consolidation, alerts for things that don't need human response.

Mitigation: Audit alerts. Remove or tune noisy alerts. Consolidate related alerts. Every alert should require human action.
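"Every alert should require human action" is auditable with page data. A sketch, assuming a hypothetical log of `(alert_name, action_taken)` pairs; the thresholds are illustrative:

```python
from collections import Counter

def noisy_alerts(pages, min_pages=5, max_action_rate=0.2):
    """Flag alerts that page often but rarely require human action (thresholds illustrative)."""
    totals, actioned = Counter(), Counter()
    for alert, took_action in pages:
        totals[alert] += 1
        if took_action:
            actioned[alert] += 1
    return [alert for alert, count in totals.items()
            if count >= min_pages and actioned[alert] / count <= max_action_rate]

pages = [("disk_full", True)] * 3 + [("cpu_spike", False)] * 9 + [("cpu_spike", True)]
noisy_alerts(pages)  # → ["cpu_spike"]
```

Alerts this audit flags are candidates for tuning, consolidation, or demotion to a dashboard.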

The Postmortem Pile

Symptom: Postmortems are written, but action items never get done. The same incidents recur.

Root cause: Action items aren't tracked or prioritized. Postmortem is a ritual, not a learning mechanism.

Mitigation: Track postmortem action items in your normal backlog. Review completion rates. Escalate when patterns repeat.


Copy-Paste Artifact: Incident Response Quickstart

Post this in your on-call documentation and incident channels.

## Incident Response Quickstart

### When you detect a problem:

1. **Verify** - Is this real? Check metrics, logs, multiple sources
2. **Assess** - What's the impact? How many users? How severe?
3. **Communicate** - Create incident channel: #inc-YYYY-MM-DD-description
4. **Assign** - "I'm taking IC" or "I need an IC"

### Severity guide:

- **SEV-1:** Most/all users affected, critical function broken
- **SEV-2:** A significant share of users affected, workaround inadequate
- **SEV-3:** Some users affected, workaround exists
- **SEV-4:** Minimal impact, low priority fix

### IC responsibilities:

- Own the incident end-to-end
- Assign roles (Tech Lead, Comms Lead)
- Drive to mitigation (restore service first)
- Ensure communication (stakeholder updates)
- Declare resolution
- Schedule postmortem

### When in doubt:

- Escalate up, not down
- Mitigate first, investigate after
- Over-communicate, don't go silent
- Ask for help early

Copy-Paste Artifact: On-Call Handoff Template

## On-Call Handoff

**From:** [Name]
**To:** [Name]
**Date:** [Date]
**Shift:** [Time range]

### Current Status

[ ] All clear - no active issues
[ ] Active issue - see below

### Active Issues

| Issue         | Status                                | Next steps    | Notes     |
| ------------- | ------------------------------------- | ------------- | --------- |
| [Description] | [Investigating/Mitigating/Monitoring] | [What's next] | [Context] |

### Recent Incidents (last 7 days)

| Date   | Summary   | Resolved? | Postmortem? |
| ------ | --------- | --------- | ----------- |
| [Date] | [Summary] | [Y/N]     | [Link]      |

### Things to Watch

- [Anything that might cause problems]
- [Recent deployments to monitor]
- [Scheduled maintenance]

### Notes

[Any other context for the incoming on-call]

---

Handoff confirmed: [ ] (Incoming on-call acknowledges receipt)

Copy-Paste Artifact: Postmortem Action Item Tracker

## Postmortem Action Item Tracker

**Quarter:** [Q_ YYYY]

### Action Items

| Incident   | Action               | Owner  | Due    | Status              | Notes |
| ---------- | -------------------- | ------ | ------ | ------------------- | ----- |
| [Inc link] | [Action description] | [Name] | [Date] | [Open/Done/Blocked] |       |

### Summary

- Total action items: ___
- Completed: ___
- Open: ___
- Blocked: ___
- Overdue: ___

### Recurring Patterns

[Are any action items similar across incidents? Are we seeing the same root causes?]

### Process Improvements

[What would help us complete action items more reliably?]

Further Reading

  • Incident Management for Operations by Rob Schnepp, Ron Vidal, and Chris Hawley
  • Site Reliability Engineering by Google – Chapters on incident response and postmortems
  • The Field Guide to Understanding Human Error by Sidney Dekker