Crisis Management¶
An incident is any unplanned event that disrupts or degrades service for users. Crisis management is how you prepare for, respond to, and recover from these events. This page provides a complete framework: severity definitions, role assignments, escalation paths, and the incident lifecycle from detection to postmortem.
The goal is not to prevent all incidents—that's impossible. The goal is to minimize impact when they occur, recover quickly, and learn enough to reduce future occurrences.
What Problem This Solves¶
Without a structured approach, incident response becomes chaotic and exhausting. Multiple people investigate the same thing. Communication is inconsistent. Decisions stall because nobody knows who's in charge. The team burns out, and incidents feel worse than they need to be.
A good incident management framework provides clarity in chaos. When an incident starts, everyone knows: Who's in charge? What's my role? When do we escalate? How do we communicate? These questions have answers before the incident, not during it.
When to Use This Framework¶
Use this framework when:
- Your product has users who depend on its availability
- You have engineers on-call or responsible for production systems
- You need to coordinate across multiple people or teams during incidents
- You want consistent, repeatable incident response
- Stakeholders (internal or external) expect communication during outages
Don't use this framework when:
- You're building a prototype with no production users
- Incidents are so rare and small that formal process adds more overhead than value
- You have a single developer who can handle everything informally
Even small teams benefit from lightweight incident structure. The templates here can be simplified for smaller contexts.
Severity Levels¶
Severity levels create shared vocabulary. They determine response intensity, communication cadence, and escalation paths. Define them before incidents happen so you're not debating severity during a crisis.
Severity Definitions¶
| Severity | Definition | Examples | Response |
|---|---|---|---|
| SEV-1 (Critical) | Complete outage or severe degradation affecting all or most users. Revenue or data loss occurring. Security breach in progress. | Full production down, payment processing broken, data leak discovered | All hands. Incident Commander required. Executive notification. External status page updated. |
| SEV-2 (High) | Significant degradation affecting a subset of users or critical functionality. Workaround may exist but is inadequate. | Major feature broken, significant latency, authentication issues for some users | Dedicated response team. Incident Commander recommended. Stakeholder notification. |
| SEV-3 (Medium) | Partial degradation or non-critical feature failure. Workaround exists and is acceptable short-term. | Minor feature broken, slow performance for small subset, non-critical integration failing | On-call engineer responds. Team lead notified. Normal prioritization for fix. |
| SEV-4 (Low) | Minor issue with minimal user impact. Cosmetic problems or edge cases. | Typo in UI, minor display bug, edge case failure | Logged and prioritized in normal backlog. No immediate response required. |
**When in doubt, escalate.** If you're unsure about severity, round up. It's better to over-respond and scale down than to under-respond and let an incident grow. You can always downgrade severity as you learn more.
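Teams that build tooling around incidents (paging bots, status-page automation) often encode the severity table as data so response rules are applied consistently. A minimal Python sketch; the field names and update intervals are illustrative, not part of the framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    """One row of the severity table, as data that tooling can consume."""
    level: str
    ic_required: bool           # must an IC be assigned? (recommended for SEV-2)
    external_status_page: bool  # update the public status page?
    update_interval_min: int    # communication cadence (0 = no live updates)

# Hypothetical mapping; tune the intervals to your own SLOs.
SEVERITIES = {
    1: Severity("SEV-1", ic_required=True,  external_status_page=True,  update_interval_min=30),
    2: Severity("SEV-2", ic_required=False, external_status_page=False, update_interval_min=60),
    3: Severity("SEV-3", ic_required=False, external_status_page=False, update_interval_min=0),
    4: Severity("SEV-4", ic_required=False, external_status_page=False, update_interval_min=0),
}

def round_up(level: int) -> int:
    """'When in doubt, escalate': move one step toward SEV-1."""
    return max(1, level - 1)
```

Note that `round_up` moves to a *lower* number: in this scheme, as in the table above, a lower number means a higher severity.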
Severity Escalation¶
Severity can change during an incident:
- Escalate when impact is broader than initially understood, when resolution is taking longer than expected, or when new information reveals greater risk.
- De-escalate when impact is contained, when a workaround is deployed, or when the affected population is smaller than feared.
Document severity changes in the incident timeline with reasoning.
Roles in Incident Response¶
Clear roles prevent duplication and gaps. Not every incident needs every role, but everyone should know which roles exist and who's filling them.
Incident Commander (IC)¶
Responsibility: Owns the incident end-to-end. Makes decisions, coordinates responders, manages communication cadence, and declares resolution.
What they do:
- Assess severity and assemble the response team
- Create and maintain the incident channel/thread
- Assign tasks and roles
- Make decisions when the team is stuck
- Ensure regular status updates
- Declare incident resolved and initiate postmortem
What they don't do:
- Debug code (unless they're the only responder)
- Write customer communications (that's the Comms Lead)
- Get pulled into technical rabbit holes
Who should be IC: Someone with enough technical context to understand updates, enough authority to make decisions, and enough calm to coordinate under pressure. This is usually an engineering manager, senior engineer, or tech lead. The role should rotate so the skill is built across the team.
Technical Lead¶
Responsibility: Leads the technical investigation and resolution.
What they do:
- Investigate root cause
- Coordinate debugging across responders
- Implement fixes or workarounds
- Communicate technical status to the IC
What they don't do:
- Manage the overall incident (that's the IC)
- Write stakeholder communications
Who should be Tech Lead: The person with the deepest knowledge of the affected system, or the on-call engineer if no one else is available.
Communications Lead¶
Responsibility: Owns all stakeholder communication—internal and external.
What they do:
- Draft and send internal updates (Slack, email)
- Update external status page
- Coordinate with customer support on messaging
- Handle executive inquiries
What they don't do:
- Debug or fix the technical issue
- Make decisions about incident response
Who should be Comms Lead: Someone who can write clearly under pressure. Often a product manager, engineering manager, or designated communications person. For smaller incidents, the IC may handle comms.
Responders¶
Responsibility: Investigate and resolve the incident under Tech Lead coordination.
What they do:
- Investigate assigned areas
- Report findings to the Tech Lead
- Implement fixes as directed
- Document actions in the incident timeline
Scribe (Optional)¶
Responsibility: Maintains the incident timeline and captures key decisions.
What they do:
- Record timestamps, actions, and findings
- Capture decisions and reasoning
- Prepare the timeline for postmortem
For smaller incidents, the IC or another responder can handle scribing. For major incidents, a dedicated scribe significantly improves postmortem quality.
Role Assignment During an Incident¶
When an incident starts, the IC explicitly assigns roles:
IC: "I'm taking Incident Commander. @alex, you're Tech Lead—start investigating the database. @sam, you're Comms Lead—draft an internal update for #engineering-all. @taylor, investigate the API layer and report back."
Explicit assignment prevents assumptions and gaps.
The Incident Lifecycle¶
Phase 1: Detection¶
An incident begins when someone detects a problem. Detection sources include:
- Automated alerting (monitoring, error tracking)
- Customer reports (support tickets, social media)
- Internal reports (engineer notices something wrong)
Key actions:
- Acknowledge the alert or report
- Perform initial assessment: What's the impact? What's affected?
- Create an incident record (ticket, Slack channel, or both)
- Assign initial severity based on available information
Phase 2: Response Mobilization¶
Once an incident is confirmed, mobilize the response team.
Key actions:
- Declare the incident formally (post in incident channel, update status page if warranted)
- Assign Incident Commander
- IC assigns other roles based on severity and scope
- Establish communication cadence (e.g., updates every 30 minutes for SEV-1)
For remote teams: Use a dedicated Slack channel per incident, named with a convention like #inc-2026-01-31-database-outage. This keeps noise out of regular channels and creates a searchable record.
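The naming convention is easy to automate so channel names stay consistent and searchable. A sketch, assuming a hypothetical helper name and Slack's lowercase/digits/hyphens constraints:

```python
import datetime
import re
from typing import Optional

def incident_channel_name(summary: str, when: Optional[datetime.date] = None) -> str:
    """Build a channel name following the #inc-YYYY-MM-DD-description convention."""
    when = when or datetime.date.today()
    # Slack channel names allow lowercase letters, digits, and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#inc-{when.isoformat()}-{slug}"
```

For example, `incident_channel_name("Database outage", datetime.date(2026, 1, 31))` returns `#inc-2026-01-31-database-outage`.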
Phase 3: Investigation and Diagnosis¶
The team works to understand the problem.
Key actions:
- Tech Lead coordinates investigation across systems
- Responders investigate their assigned areas and report findings
- IC tracks progress and removes blockers
- Comms Lead sends regular updates to stakeholders
- Scribe maintains the timeline
Remote-first practices:
- Use threaded replies in the incident channel to keep the main channel scannable
- Use voice/video for real-time debugging when typing is too slow
- Share screens rather than describing what you see
Phase 4: Mitigation¶
The immediate goal is to stop the bleeding—restore service, even if the root cause isn't fully understood.
Key actions:
- Implement a workaround or fix
- Verify the fix works (check metrics, confirm with affected users)
- Communicate the mitigation to stakeholders
- Decide: Is this good enough, or does more work need to happen now?
Mitigation vs. resolution: Mitigation stops the immediate impact. Resolution fully fixes the underlying problem. Sometimes mitigation is good enough for now, and resolution can happen in follow-up work. Make this decision explicitly.
Phase 5: Resolution and Handoff¶
The incident is resolved when service is restored to acceptable levels.
Key actions:
- IC declares the incident resolved
- Send final communication to stakeholders
- Update status page to "resolved"
- Schedule postmortem (within 24-72 hours for SEV-1/SEV-2)
- Hand off any follow-up work to the backlog
Criteria for resolution:
- Service is functioning normally
- Metrics are back to baseline
- No ongoing customer impact
Phase 6: Postmortem¶
Every significant incident deserves a postmortem. The goal is learning, not blame.
Key actions:
- Gather timeline, data, and evidence
- Hold blameless postmortem meeting
- Identify root causes and contributing factors
- Generate action items to prevent recurrence
- Share the postmortem with the broader team
See Postmortem Template for structure and guidance.
Remote-First Incident Response¶
Remote teams face unique challenges: no war room, time zone gaps, and reliance on text-based communication. Here's how to adapt:
Dedicated incident channels. Don't run incidents in #general or #engineering. Create a channel per incident for focus and searchability.
Video for complex debugging. When you're deep in investigation, switch from Slack to a video call. Screen sharing is faster than describing.
Explicit handoffs. When shifting responsibility (end of day, role change), make it explicit: "I'm handing IC to @alex—here's current status: [summary]."
Timezone-aware cadence. For long-running incidents, plan handoffs around time zones. Document status thoroughly so people waking up can catch up quickly.
Async postmortems. Not everyone can attend a synchronous postmortem meeting. Prepare a written document in advance, allow async comments, and record the meeting.
What Good Looks Like¶
You'll know your crisis management is working when you observe these signals:
| Signal | What it looks like |
|---|---|
| Fast detection | Incidents are detected by monitoring, not customer complaints |
| Clear ownership | IC is assigned within minutes of incident declaration |
| Calm coordination | The incident channel is focused, not chaotic |
| Regular communication | Stakeholders receive updates at predictable intervals |
| Quick mitigation | Impact is contained within your target SLO (e.g., SEV-1 mitigated in <1 hour) |
| Thorough postmortems | Every SEV-1/SEV-2 gets a postmortem with actionable follow-ups |
| Declining recurrence | Root causes are addressed; the same incidents don't keep happening |
| Sustainable on-call | Rotation is fair; engineers don't burn out |
Failure Modes and Mitigations¶
The Leaderless Incident¶
Symptom: Nobody takes charge. Multiple people investigate the same thing. Decisions stall. The incident drags on.
Root cause: No clear IC assignment, unclear escalation paths, or cultural reluctance to "take over."
Mitigation: Make IC assignment automatic. First responder to SEV-1/SEV-2 becomes IC until someone else explicitly takes over. Train people in IC skills so it's not scary.
The Communication Blackout¶
Symptom: Stakeholders have no idea what's happening. Executives start pinging engineers directly. Customer support is blindsided.
Root cause: No Comms Lead assigned, or the IC is too busy debugging to communicate.
Mitigation: Assign Comms Lead explicitly for SEV-1/SEV-2. Use templates so communication doesn't require creativity during stress.
The Blame Storm¶
Symptom: Postmortems become interrogations. People get defensive. Future incidents are hidden or downplayed.
Root cause: Leadership treats postmortems as accountability exercises rather than learning opportunities.
Mitigation: Enforce blameless postmortems. Leadership sets the tone by asking "what allowed this to happen?" rather than "who did this?" Celebrate incident reporters instead of punishing them.
The Zombie Incident¶
Symptom: Incidents are never formally closed. People drift away without clear resolution. The same issue resurfaces.
Root cause: No criteria for resolution, no IC discipline around closure.
Mitigation: IC must explicitly declare resolution with criteria: "Service is restored, metrics are nominal, closing the incident." Reopening is fine if things regress.
The Postmortem Graveyard¶
Symptom: Postmortems generate action items that never get done. The same root causes repeat.
Root cause: Action items aren't owned, prioritized, or tracked.
Mitigation: Every action item has an owner and a due date. Track completion in your normal sprint/backlog process. Review completion rates periodically.
Escalation Paths¶
Define escalation paths before incidents happen. Here's a template:
| Situation | Escalate to | How |
|---|---|---|
| Need more responders | Team's on-call rotation | Page via PagerDuty/Opsgenie |
| Incident affects multiple teams | Engineering leadership | Slack + phone call |
| Customer-facing impact likely to make news | VP Engineering + Comms/PR | Phone call |
| Security incident suspected | Security team + CISO | Dedicated security channel + phone |
| Need executive decision | CTO/CEO | Phone call via Eng leadership |
**Don't hesitate to escalate.** Escalation is not failure. It's getting the right resources to solve a problem. Leaders would rather be woken up for a real incident than learn about it from Twitter the next morning.
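The escalation template can also live in code or config so on-call tooling (or a chatbot) can answer "who do I escalate to?" without anyone hunting through docs. The keys and contacts below are placeholders mirroring the table above; substitute your own org chart:

```python
# Hypothetical escalation routing table: situation -> (who, how).
ESCALATIONS = {
    "need_responders":    ("team on-call rotation",        "page via PagerDuty/Opsgenie"),
    "multi_team_impact":  ("engineering leadership",       "Slack + phone call"),
    "newsworthy_impact":  ("VP Engineering + Comms/PR",    "phone call"),
    "security_incident":  ("security team + CISO",         "dedicated security channel + phone"),
    "executive_decision": ("CTO/CEO via Eng leadership",   "phone call"),
}

def escalation_target(situation: str) -> str:
    """Return a human-readable escalation instruction for a known situation."""
    who, how = ESCALATIONS[situation]
    return f"Escalate to {who} ({how})"
```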
On-Call Best Practices¶
On-call is part of incident response. A healthy on-call rotation makes crisis management sustainable.
Rotation fairness. Share the load equally. Track who's been paged and rebalance if necessary.
Reasonable expectations. On-call means you can respond within X minutes, not that you can't leave your house. Define the SLA explicitly.
Compensation. If on-call is burdensome, compensate for it (time off, money, or both). Unpaid on-call breeds resentment.
Runbooks. On-call engineers shouldn't need to remember everything. Provide runbooks for common issues. See Runbook Template.
Blameless culture. On-call engineers will make mistakes under pressure. Treat mistakes as system problems to fix, not personal failures.
Escalation clarity. On-call engineers should never feel stuck. Make sure they know when and how to escalate.
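Checking rotation fairness doesn't require special tooling: most paging systems can export page records, and a few lines can flag imbalance. A sketch, where the record shape and tolerance threshold are assumptions:

```python
from collections import Counter

def overloaded_engineers(pages, tolerance=2):
    """Given (engineer, timestamp) page records, return engineers paged more
    than `tolerance` times above the least-paged person — candidates for
    rebalancing the rotation."""
    counts = Counter(engineer for engineer, _ in pages)
    if not counts:
        return []
    floor = min(counts.values())
    return sorted(e for e, n in counts.items() if n - floor > tolerance)
```

Run this periodically (e.g., per rotation cycle) and rebalance before resentment builds, rather than after.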
Copy-Paste Artifact: Incident Response Checklist¶
## Incident Response Checklist
### Detection
- [ ] Acknowledge alert or report
- [ ] Assess initial impact (users affected, systems involved)
- [ ] Assign initial severity (SEV-1/2/3/4)
### Mobilization
- [ ] Create incident channel: #inc-YYYY-MM-DD-description
- [ ] Assign Incident Commander
- [ ] IC assigns: Tech Lead, Comms Lead, Responders
- [ ] Post initial status in incident channel
### Investigation
- [ ] Tech Lead coordinates debugging
- [ ] Responders investigate assigned areas
- [ ] Scribe/IC maintains timeline
- [ ] Comms Lead sends first stakeholder update
### Mitigation
- [ ] Implement workaround or fix
- [ ] Verify fix (check metrics, confirm with users)
- [ ] Update stakeholders on mitigation
- [ ] Decide: resolved now, or follow-up needed?
### Resolution
- [ ] IC declares incident resolved
- [ ] Send final stakeholder communication
- [ ] Update status page to "resolved"
- [ ] Schedule postmortem
### Postmortem
- [ ] Gather timeline and evidence
- [ ] Write postmortem document
- [ ] Hold blameless postmortem meeting
- [ ] Assign action items with owners and due dates
- [ ] Share postmortem broadly
Copy-Paste Artifact: Incident Channel Opening Message¶
🚨 **INCIDENT DECLARED** 🚨
**Severity:** SEV-[X]
**Summary:** [Brief description of the issue]
**Impact:** [Who/what is affected]
**Detection:** [How was this discovered]
**Roles:**
- Incident Commander: @[name]
- Tech Lead: @[name]
- Comms Lead: @[name]
- Responders: @[names]
**Current Status:** Investigating
**Next Update:** [time] or when we have significant news
---
Please keep this channel focused on incident response. Use threads for extended discussion.
Copy-Paste Artifact: Severity Assessment Questions¶
## Severity Assessment Questions
Ask these questions to determine severity:
1. **How many users are affected?**
- All/most users → likely SEV-1
- Significant subset → likely SEV-2
- Small subset or specific scenario → likely SEV-3/4
2. **What functionality is broken?**
- Core functionality (login, payments, primary feature) → escalate
- Secondary feature or edge case → de-escalate
3. **Is there data loss or security risk?**
- Yes → SEV-1, involve security team
4. **Is there a workaround?**
- No reasonable workaround → escalate
- Workaround exists and is acceptable → de-escalate
5. **Is the problem getting worse?**
- Expanding impact → escalate
- Stable or contained → maintain or de-escalate
6. **Are we past our SLO?**
- Yes → escalate
- No but approaching → prepare to escalate
When in doubt, round up. You can always de-escalate.
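As a first-pass triage aid, the questions can be condensed into a function. The answer encodings and one-level adjustments below are illustrative, and the output is only a suggestion, never a substitute for human judgment:

```python
def suggest_severity(users_affected: str,
                     core_feature_broken: bool,
                     data_loss_or_security: bool,
                     workaround_exists: bool) -> int:
    """Suggest a severity (1-4, lower = worse) from the assessment questions.

    users_affected is one of "all", "significant", or "small".
    """
    if data_loss_or_security:
        return 1  # Q3: always SEV-1; involve the security team
    sev = {"all": 1, "significant": 2, "small": 3}[users_affected]  # Q1
    if core_feature_broken:
        sev = max(1, sev - 1)  # Q2: escalate one level
    if workaround_exists:
        sev = min(4, sev + 1)  # Q4: de-escalate one level
    return sev
```

Questions 5 and 6 (worsening impact, SLO breach) are deliberately left to the humans on the call: they depend on live metrics, not a snapshot.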
Further Reading¶
- Incident Management for Operations by Rob Schnepp, Ron Vidal, and Chris Hawley – Practical incident management adapted from fire service practices
- Site Reliability Engineering by Google – Chapters on incident response and postmortems
- The Field Guide to Understanding Human Error by Sidney Dekker – Foundational text on blameless investigation
Related¶
- Outage Communication Playbook – How to communicate during incidents
- Incident Response – The delivery-focused view of incident handling
- Postmortem Template – Template for blameless reviews
- Runbook Template – Template for operational runbooks
- Reliability Practices – Building resilient systems