Incident Response

When production breaks, how you respond matters as much as the fix. A good incident response minimizes user impact, keeps stakeholders informed, and captures learning that prevents recurrence. A bad response turns a small problem into a big one through chaos, blame, and missed lessons.

This page covers the delivery perspective on incident response: the practices and processes that make incidents manageable. For the complete framework including severity levels, roles, and communication, see Crisis Management.

What Problem This Solves

Incidents are inevitable in any system with sufficient complexity. The question is not whether you'll have incidents, but how you'll handle them.

Without a clear incident response process:

Chaos multiplies. Everyone investigates the same thing. Nobody knows who's in charge. Communication is inconsistent. The incident takes longer to resolve than it should.

Learning doesn't happen. The incident is fixed, everyone goes back to normal, and the same thing happens again in three months.

People burn out. A few heroes always handle incidents. Everyone else watches. On-call becomes a punishment, not a responsibility.

Trust erodes. Customers experience problems and hear nothing. Leadership finds out from Twitter. Support is blindsided.

Good incident response prevents all of these by making incident handling structured, distributed, and learning-oriented.


When to Use This Process

Invoke incident response when:

  • Users are experiencing degraded service or errors
  • Automated alerts fire indicating system problems
  • Internal users report something is broken
  • You suspect a security issue

Don't invoke when:

  • The issue is minor and can be fixed in minutes without coordination
  • It's a development/staging environment issue with no user impact
  • It's a known issue with a known timeline

When in doubt, invoke. It's easier to stand down from an incident than to escalate a problem that grew while you waited.


Ownership

| Role | Responsibility |
| --- | --- |
| On-call engineer | First responder; triages, escalates, often becomes Tech Lead for the incident |
| Incident Commander (IC) | Owns the incident end-to-end; coordinates response, manages communication |
| Tech Lead | Leads technical investigation and resolution |
| Comms Lead | Owns stakeholder communication (internal and external) |
| Engineering Manager | Ensures process is followed; supports IC; addresses team issues |

For full role definitions, see Crisis Management.


The Incident Lifecycle

Phase 1: Detection

Incidents begin when someone notices a problem. The faster you detect, the less impact users experience.

Detection sources:

  • Automated monitoring: Alerts on error rates, latency, availability
  • Customer reports: Support tickets, social media mentions
  • Internal reports: Engineer notices something wrong

Key actions:

  1. Acknowledge the alert or report
  2. Verify the problem is real (not a false positive)
  3. Assess initial impact: What's affected? How many users?
  4. Create an incident record
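The detection steps above can be sketched as a minimal incident record. This is an illustrative assumption, not a prescribed schema; the `Incident` dataclass and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    """Minimal incident record created at detection time (illustrative schema)."""
    title: str
    detected_at: datetime
    impact: str                 # initial impact assessment: what's affected, how many users
    severity: str = "unknown"   # assigned later, during triage
    timeline: list = field(default_factory=list)

    def log(self, entry: str) -> None:
        # Timestamped timeline entries make the postmortem much easier to write.
        self.timeline.append((datetime.now(timezone.utc), entry))

inc = Incident(
    title="Elevated 5xx on checkout API",
    detected_at=datetime.now(timezone.utc),
    impact="~10% of checkout requests failing",
)
inc.log("Alert acknowledged; verified against dashboard, not a false positive")
```

Whatever tool you use, the point is the same: capture the detection time and initial impact the moment you acknowledge, so triage and the postmortem start from facts rather than memory.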

Phase 2: Triage

Determine how serious this is and who needs to respond.

Severity assessment:

  • SEV-1: Complete outage or severe degradation affecting most users
  • SEV-2: Significant degradation affecting a subset of users
  • SEV-3: Partial degradation or non-critical feature failure
  • SEV-4: Minor issue with minimal user impact

See Crisis Management for full severity definitions.

Key actions:

  1. Assign severity
  2. Assign Incident Commander (for SEV-1/SEV-2)
  3. Create incident channel
  4. Notify relevant stakeholders
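One way to make the severity call consistent is to encode it as a simple triage heuristic. The thresholds below are hypothetical; replace them with the definitions from Crisis Management.

```python
def assign_severity(fraction_affected: float, critical_function_broken: bool) -> str:
    """Map initial impact to a severity level (thresholds are illustrative)."""
    if critical_function_broken or fraction_affected > 0.5:
        return "SEV-1"   # complete outage or severe degradation for most users
    if fraction_affected > 0.1:
        return "SEV-2"   # significant degradation for a subset of users
    if fraction_affected > 0.01:
        return "SEV-3"   # partial degradation or non-critical feature failure
    return "SEV-4"       # minor issue with minimal user impact

assign_severity(0.3, critical_function_broken=False)  # → "SEV-2"
```

Even a rough heuristic like this beats debating severity in the heat of the moment; you can always re-triage as impact becomes clearer.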

Phase 3: Investigation

Figure out what's wrong.

Key actions:

  1. Tech Lead coordinates investigation
  2. Responders investigate their assigned areas
  3. Findings are shared in the incident channel
  4. IC tracks progress and removes blockers

Investigation heuristics:

  • What changed recently? Deployments, config changes, traffic patterns
  • Where is the error? Error logs, traces, metrics
  • What's the blast radius? Which users, which regions, which features
  • What's the proximate cause? Even if you don't know root cause, what's immediately wrong?

Phase 4: Mitigation

Stop the bleeding. Mitigation is about restoring service, not fixing the root cause.

Mitigation options:

  • Rollback the recent deployment
  • Disable the problematic feature (feature flag)
  • Scale up resources
  • Route traffic away from the problem
  • Apply a hotfix
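Disabling a problematic feature is often the fastest mitigation after a rollback. As a sketch, assuming a hypothetical in-process flag store rather than any specific feature-flag service:

```python
FLAGS = {"new_checkout_flow": True}  # hypothetical flag store

def feature_enabled(name: str) -> bool:
    # Default to off: an unknown flag should fail safe during an incident.
    return FLAGS.get(name, False)

def kill_switch(name: str) -> None:
    """Mitigation: turn the feature off for everyone, immediately."""
    FLAGS[name] = False

kill_switch("new_checkout_flow")
feature_enabled("new_checkout_flow")  # → False
```

The design choice that matters is that the kill switch requires no deploy: flipping a flag should restore service in seconds, and root-causing can happen afterwards.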

Key principle: Mitigate first, investigate later. User impact matters more than understanding why.

Phase 5: Resolution

Confirm the incident is resolved and close it out.

Resolution criteria:

  • Service is functioning normally
  • Metrics are back to baseline
  • No ongoing user impact
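"Metrics are back to baseline" can be checked mechanically rather than by eyeballing a dashboard. A minimal sketch; the 10% relative tolerance is an assumption you should tune per metric:

```python
def back_to_baseline(current: dict, baseline: dict, tolerance: float = 0.1) -> bool:
    """True if every metric is within `tolerance` (relative) of its baseline."""
    for name, base in baseline.items():
        value = current.get(name, 0.0)
        if base == 0:
            if abs(value) > tolerance:  # absolute check when baseline is zero
                return False
        elif abs(value - base) / abs(base) > tolerance:
            return False
    return True

baseline = {"error_rate": 0.002, "p95_latency_ms": 180}
back_to_baseline({"error_rate": 0.0021, "p95_latency_ms": 185}, baseline)  # → True
back_to_baseline({"error_rate": 0.04, "p95_latency_ms": 185}, baseline)    # → False
```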

Key actions:

  1. IC declares incident resolved
  2. Send final stakeholder communication
  3. Update status page
  4. Schedule postmortem (for SEV-1/SEV-2)

Phase 6: Postmortem

Learn from the incident to prevent recurrence.

Postmortem principles:

  • Blameless. Focus on systems, not people. "What allowed this to happen?" not "Who did this?"
  • Honest. Capture what actually happened, including uncomfortable truths
  • Actionable. Generate specific action items with owners and deadlines
  • Shared. Make postmortems visible so the organization learns

See Postmortem Template for structure.


On-Call Best Practices

On-call is how you ensure someone is always ready to respond. It's also how you burn people out if you do it wrong.

Rotation Design

Fairness. Distribute on-call evenly. Track who's been paged and rebalance if needed.

Coverage. Define the on-call window. 24/7 coverage requires either follow-the-sun rotation or compensation for off-hours.

Escalation. On-call engineers should never feel stuck. Define clear escalation paths for when they need help.

Handoff. At shift changes, hand off context about ongoing issues. Don't make the next person start from zero.

On-Call Health

Page frequency. Track pages per on-call shift. High frequency leads to burnout and alert fatigue.

Actionable alerts. Pages should require human action. If a page can be ignored or auto-resolves, it shouldn't be a page.

Runbooks. On-call engineers shouldn't need to remember everything. Provide runbooks for common issues.

Blameless reviews. When on-call is rough, review why. Address systemic issues, not individual performance.


Runbooks

Runbooks provide step-by-step guidance for common operational tasks and incidents. They encode knowledge so on-call engineers don't have to figure things out under pressure.

What Makes a Good Runbook

Actionable. Tells you what to do, not just what's happening.

Current. Reflects the actual system, not how it worked six months ago.

Accessible. Easy to find during an incident. Linked from alerts when possible.

Tested. Someone has actually followed the steps and confirmed they work.

Runbook Structure

## Runbook: [Title]

### Overview

[What is this runbook for? When should you use it?]

### Symptoms

[How do you know you're in this situation? Alert names, error messages, metrics.]

### Impact

[What's the user impact? What's the business impact?]

### Prerequisites

[What access/tools do you need? Who should you notify?]

### Steps

1. [Step 1 with exact commands if applicable]
2. [Step 2]
3. [Step 3]

### Verification

[How do you confirm the issue is resolved?]

### Escalation

[When and how to escalate if this doesn't work]

### Related

[Links to related runbooks, dashboards, documentation]

See Runbook Template for the full template.


What Good Looks Like

You'll know incident response is working when:

| Signal | What it looks like |
| --- | --- |
| Fast detection | Incidents are caught by monitoring, not customer complaints |
| Calm coordination | IC takes charge quickly; the response is organized, not chaotic |
| Clear communication | Stakeholders know what's happening and when to expect updates |
| Quick mitigation | User impact is stopped within your target window |
| Thorough postmortems | Every significant incident gets a postmortem with actionable follow-ups |
| Declining recurrence | The same incidents don't keep happening |
| Healthy on-call | On-call is sustainable; people don't dread it |

Metrics to Track

  • MTTR (Mean Time to Recovery): From detection to resolution
  • MTTD (Mean Time to Detection): From incident start to detection
  • Incident frequency: Total incidents per period, by severity
  • Postmortem completion rate: Percentage of incidents that get postmortems
  • Action item completion rate: Percentage of postmortem actions that get done
  • On-call health: Pages per shift, off-hours pages, escalation rate
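MTTD and MTTR fall directly out of the timestamps on your incident records. A sketch, assuming each incident carries hypothetical `started`, `detected`, and `resolved` times:

```python
from datetime import datetime
from statistics import mean

def mean_minutes(pairs):
    """Mean gap between (earlier, later) timestamp pairs, in minutes."""
    return mean((later - earlier).total_seconds() / 60 for earlier, later in pairs)

# Hypothetical incident records: (started, detected, resolved).
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 5), datetime(2024, 5, 1, 10, 45)),
    (datetime(2024, 5, 9, 22, 0), datetime(2024, 5, 9, 22, 15), datetime(2024, 5, 9, 23, 0)),
]

mttd = mean_minutes((start, detect) for start, detect, _ in incidents)   # start → detection
mttr = mean_minutes((detect, resolve) for _, detect, resolve in incidents)  # detection → resolution
```

Computing these from the same records you create at detection time keeps the metrics honest; hand-reported times drift toward optimism.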

Failure Modes and Mitigations

The Solo Hero

Symptom: One person always handles incidents. They know all the systems. When they're on vacation, incidents go badly.

Root cause: Knowledge concentration, lack of documented process, hero culture.

Mitigation: Rotate IC role. Require handoff documentation. Pair junior engineers with senior during incidents. Invest in runbooks.

The Blame Game

Symptom: Postmortems become interrogations. People hide mistakes. Future incidents are harder to investigate because people are defensive.

Root cause: Leadership punishes failures instead of learning from them.

Mitigation: Enforce blameless postmortems from the top. Ask "what allowed this" not "who did this." Celebrate incident reporters rather than punishing them.

The Alert Storm

Symptom: Pages are constant. On-call is miserable. Most pages are noise or duplicate.

Root cause: Alerting thresholds too sensitive, no alert consolidation, alerts for things that don't need human response.

Mitigation: Audit alerts. Remove or tune noisy alerts. Consolidate related alerts. Every alert should require human action.
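"Every alert should require human action" is auditable with page data. A sketch, assuming a hypothetical log of `(alert_name, action_taken)` pairs; the thresholds are illustrative:

```python
from collections import Counter

def noisy_alerts(pages, min_pages=5, max_action_rate=0.2):
    """Flag alerts that page often but rarely require human action (thresholds illustrative)."""
    totals, actioned = Counter(), Counter()
    for alert, took_action in pages:
        totals[alert] += 1
        if took_action:
            actioned[alert] += 1
    return [alert for alert, count in totals.items()
            if count >= min_pages and actioned[alert] / count <= max_action_rate]

pages = [("disk_full", True)] * 3 + [("cpu_spike", False)] * 9 + [("cpu_spike", True)]
noisy_alerts(pages)  # → ["cpu_spike"]
```

Alerts this audit flags are candidates for tuning, consolidation, or demotion to a dashboard.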

The Postmortem Pile

Symptom: Postmortems are written, but action items never get done. The same incidents recur.

Root cause: Action items aren't tracked or prioritized. Postmortem is a ritual, not a learning mechanism.

Mitigation: Track postmortem action items in your normal backlog. Review completion rates. Escalate when patterns repeat.


Copy-Paste Artifact: Incident Response Quickstart

Post this in your on-call documentation and incident channels.

## Incident Response Quickstart

### When you detect a problem:

1. **Verify** - Is this real? Check metrics, logs, multiple sources
2. **Assess** - What's the impact? How many users? How severe?
3. **Communicate** - Create incident channel: #inc-YYYY-MM-DD-description
4. **Assign** - "I'm taking IC" or "I need an IC"

### Severity guide:

- **SEV-1:** Most/all users affected, critical function broken
- **SEV-2:** A significant share of users affected, workaround inadequate
- **SEV-3:** Some users affected, workaround exists
- **SEV-4:** Minimal impact, low priority fix

### IC responsibilities:

- Own the incident end-to-end
- Assign roles (Tech Lead, Comms Lead)
- Drive to mitigation (restore service first)
- Ensure communication (stakeholder updates)
- Declare resolution
- Schedule postmortem

### When in doubt:

- Escalate up, not down
- Mitigate first, investigate after
- Over-communicate, don't go silent
- Ask for help early

Copy-Paste Artifact: On-Call Handoff Template

## On-Call Handoff

**From:** [Name]
**To:** [Name]
**Date:** [Date]
**Shift:** [Time range]

### Current Status

[ ] All clear - no active issues
[ ] Active issue - see below

### Active Issues

| Issue         | Status                                | Next steps    | Notes     |
| ------------- | ------------------------------------- | ------------- | --------- |
| [Description] | [Investigating/Mitigating/Monitoring] | [What's next] | [Context] |

### Recent Incidents (last 7 days)

| Date   | Summary   | Resolved? | Postmortem? |
| ------ | --------- | --------- | ----------- |
| [Date] | [Summary] | [Y/N]     | [Link]      |

### Things to Watch

- [Anything that might cause problems]
- [Recent deployments to monitor]
- [Scheduled maintenance]

### Notes

[Any other context for the incoming on-call]

---

Handoff confirmed: [ ] (Incoming on-call acknowledges receipt)

Copy-Paste Artifact: Postmortem Action Item Tracker

## Postmortem Action Item Tracker

**Quarter:** [Q_ YYYY]

### Action Items

| Incident   | Action               | Owner  | Due    | Status              | Notes |
| ---------- | -------------------- | ------ | ------ | ------------------- | ----- |
| [Inc link] | [Action description] | [Name] | [Date] | [Open/Done/Blocked] |       |

### Summary

- Total action items: ___
- Completed: ___
- Open: ___
- Blocked: ___
- Overdue: ___

### Recurring Patterns

[Are any action items similar across incidents? Are we seeing the same root causes?]

### Process Improvements

[What would help us complete action items more reliably?]

Further Reading

  • Incident Management for Operations by Rob Schnepp, Ron Vidal, and Chris Hawley
  • Site Reliability Engineering by Google – Chapters on incident response and postmortems
  • The Field Guide to Understanding Human Error by Sidney Dekker