Crisis Management¶
An incident is any unplanned event that disrupts or degrades service for users. Crisis management is how you prepare for, respond to, and recover from these events. This page provides a complete framework: severity definitions, role assignments, escalation paths, and the incident lifecycle from detection to postmortem.
The goal is not to prevent all incidents—that's impossible. The goal is to minimize impact when they occur, recover quickly, and learn enough to reduce future occurrences.
What Problem This Solves¶
Without a structured approach, incident response becomes chaotic and exhausting. Multiple people investigate the same thing. Communication is inconsistent. Decisions stall because nobody knows who's in charge. The team burns out, and incidents feel worse than they need to be.
A good incident management framework provides clarity in chaos. When an incident starts, everyone knows: Who's in charge? What's my role? When do we escalate? How do we communicate? These questions have answers before the incident, not during it.
When to Use This Framework¶
Use this framework when:
- Your product has users who depend on its availability
- You have engineers on-call or responsible for production systems
- You need to coordinate across multiple people or teams during incidents
- You want consistent, repeatable incident response
- Stakeholders (internal or external) expect communication during outages
Don't use this framework when:
- You're building a prototype with no production users
- Incidents are so rare and small that formal process adds more overhead than value
- You have a single developer who can handle everything informally
Even small teams benefit from lightweight incident structure. The templates here can be simplified for smaller contexts.
Severity Levels¶
Severity levels create shared vocabulary. They determine response intensity, communication cadence, and escalation paths. Define them before incidents happen so you're not debating severity during a crisis.
Severity Definitions¶
| Severity | Definition | Examples | Response |
|---|---|---|---|
| SEV-1 (Critical) | Complete outage or severe degradation affecting all or most users. Revenue or data loss occurring. Security breach in progress. | Full production down, payment processing broken, data leak discovered | All hands. Incident Commander required. Executive notification. External status page updated. |
| SEV-2 (High) | Significant degradation affecting a subset of users or critical functionality. Workaround may exist but is inadequate. | Major feature broken, significant latency, authentication issues for some users | Dedicated response team. Incident Commander recommended. Stakeholder notification. |
| SEV-3 (Medium) | Partial degradation or non-critical feature failure. Workaround exists and is acceptable short-term. | Minor feature broken, slow performance for small subset, non-critical integration failing | On-call engineer responds. Team lead notified. Normal prioritization for fix. |
| SEV-4 (Low) | Minor issue with minimal user impact. Cosmetic problems or edge cases. | Typo in UI, minor display bug, edge case failure | Logged and prioritized in normal backlog. No immediate response required. |
**When in doubt, escalate.** If you're unsure about severity, round up. It's better to over-respond and scale down than to under-respond and let an incident grow. You can always downgrade severity as you learn more.
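Teams that build tooling around incidents (paging bots, status-page automation) often encode the severity table as data so response rules are applied consistently. A minimal Python sketch; the field names and update intervals are illustrative, not part of the framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    """One row of the severity table, as data that tooling can consume."""
    level: str
    ic_required: bool           # must an IC be assigned? (recommended for SEV-2)
    external_status_page: bool  # update the public status page?
    update_interval_min: int    # communication cadence (0 = no live updates)

# Hypothetical mapping; tune the intervals to your own SLOs.
SEVERITIES = {
    1: Severity("SEV-1", ic_required=True,  external_status_page=True,  update_interval_min=30),
    2: Severity("SEV-2", ic_required=False, external_status_page=False, update_interval_min=60),
    3: Severity("SEV-3", ic_required=False, external_status_page=False, update_interval_min=0),
    4: Severity("SEV-4", ic_required=False, external_status_page=False, update_interval_min=0),
}

def round_up(level: int) -> int:
    """'When in doubt, escalate': move one step toward SEV-1."""
    return max(1, level - 1)
```

Note that `round_up` moves to a *lower* number: in this scheme, as in the table above, a lower number means a higher severity.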
Severity Escalation¶
Severity can change during an incident:
- Escalate when impact is broader than initially understood, when resolution is taking longer than expected, or when new information reveals greater risk.
- De-escalate when impact is contained, when a workaround is deployed, or when the affected population is smaller than feared.
Document severity changes in the incident timeline with reasoning.
Roles in Incident Response¶
Clear roles prevent duplication and gaps. Not every incident needs every role, but everyone should know which roles exist and who's filling them.
Incident Commander (IC)¶
Responsibility: Owns the incident end-to-end. Makes decisions, coordinates responders, manages communication cadence, and declares resolution.
What they do:
- Assess severity and assemble the response team
- Create and maintain the incident channel/thread
- Assign tasks and roles
- Make decisions when the team is stuck
- Ensure regular status updates
- Declare incident resolved and initiate postmortem
What they don't do:
- Debug code (unless they're the only responder)
- Write customer communications (that's the Comms Lead)
- Get pulled into technical rabbit holes
Who should be IC: Someone with enough technical context to understand updates, enough authority to make decisions, and enough calm to coordinate under pressure. This is usually an engineering manager, senior engineer, or tech lead. The role should rotate so the skill is built across the team.
Technical Lead¶
Responsibility: Leads the technical investigation and resolution.
What they do:
- Investigate root cause
- Coordinate debugging across responders
- Implement fixes or workarounds
- Communicate technical status to the IC
What they don't do:
- Manage the overall incident (that's the IC)
- Write stakeholder communications
Who should be Tech Lead: The person with the deepest knowledge of the affected system, or the on-call engineer if no one else is available.
Communications Lead¶
Responsibility: Owns all stakeholder communication—internal and external.
What they do:
- Draft and send internal updates (Slack, email)
- Update external status page
- Coordinate with customer support on messaging
- Handle executive inquiries
What they don't do:
- Debug or fix the technical issue
- Make decisions about incident response
Who should be Comms Lead: Someone who can write clearly under pressure. Often a product manager, engineering manager, or designated communications person. For smaller incidents, the IC may handle comms.
Responders¶
Responsibility: Investigate and resolve the incident under Tech Lead coordination.
What they do:
- Investigate assigned areas
- Report findings to the Tech Lead
- Implement fixes as directed
- Document actions in the incident timeline
Scribe (Optional)¶
Responsibility: Maintains the incident timeline and captures key decisions.
What they do:
- Record timestamps, actions, and findings
- Capture decisions and reasoning
- Prepare the timeline for postmortem
For smaller incidents, the IC or another responder can handle scribing. For major incidents, a dedicated scribe significantly improves postmortem quality.
Role Assignment During an Incident¶
When an incident starts, the IC explicitly assigns roles:
IC: "I'm taking Incident Commander. @alex, you're Tech Lead—start investigating the database. @sam, you're Comms Lead—draft an internal update for #engineering-all. @taylor, investigate the API layer and report back."
Explicit assignment prevents assumptions and gaps.
The Incident Lifecycle¶
Phase 1: Detection¶
An incident begins when someone detects a problem. Detection sources include:
- Automated alerting (monitoring, error tracking)
- Customer reports (support tickets, social media)
- Internal reports (engineer notices something wrong)
Key actions:
- Acknowledge the alert or report
- Perform initial assessment: What's the impact? What's affected?
- Create an incident record (ticket, Slack channel, or both)
- Assign initial severity based on available information
Phase 2: Response Mobilization¶
Once an incident is confirmed, mobilize the response team.
Key actions:
- Declare the incident formally (post in incident channel, update status page if warranted)
- Assign Incident Commander
- IC assigns other roles based on severity and scope
- Establish communication cadence (e.g., updates every 30 minutes for SEV-1)
For remote teams: Use a dedicated Slack channel per incident, named with a convention like #inc-2026-01-31-database-outage. This keeps noise out of regular channels and creates a searchable record.
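The naming convention is easy to automate so channel names stay consistent and searchable. A sketch, assuming a hypothetical helper name and Slack's lowercase/digits/hyphens constraints:

```python
import datetime
import re
from typing import Optional

def incident_channel_name(summary: str, when: Optional[datetime.date] = None) -> str:
    """Build a channel name following the #inc-YYYY-MM-DD-description convention."""
    when = when or datetime.date.today()
    # Slack channel names allow lowercase letters, digits, and hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#inc-{when.isoformat()}-{slug}"
```

For example, `incident_channel_name("Database outage", datetime.date(2026, 1, 31))` returns `#inc-2026-01-31-database-outage`.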
Phase 3: Investigation and Diagnosis¶
The team works to understand the problem.
Key actions:
- Tech Lead coordinates investigation across systems
- Responders investigate their assigned areas and report findings
- IC tracks progress and removes blockers
- Comms Lead sends regular updates to stakeholders
- Scribe maintains the timeline
Remote-first practices:
- Use threaded replies in the incident channel to keep the main channel scannable
- Use voice/video for real-time debugging when typing is too slow
- Share screens rather than describing what you see
Phase 4: Mitigation¶
The immediate goal is to stop the bleeding—restore service, even if the root cause isn't fully understood.
Key actions:
- Implement a workaround or fix
- Verify the fix works (check metrics, confirm with affected users)
- Communicate the mitigation to stakeholders
- Decide: Is this good enough, or does more work need to happen now?
Mitigation vs. resolution: Mitigation stops the immediate impact. Resolution fully fixes the underlying problem. Sometimes mitigation is good enough for now, and resolution can happen in follow-up work. Make this decision explicitly.
Phase 5: Resolution and Handoff¶
The incident is resolved when service is restored to acceptable levels.
Key actions:
- IC declares the incident resolved
- Send final communication to stakeholders
- Update status page to "resolved"
- Schedule postmortem (within 24-72 hours for SEV-1/SEV-2)
- Hand off any follow-up work to the backlog
Criteria for resolution:
- Service is functioning normally
- Metrics are back to baseline
- No ongoing customer impact
Phase 6: Postmortem¶
Every significant incident deserves a postmortem. The goal is learning, not blame.
Key actions:
- Gather timeline, data, and evidence
- Hold blameless postmortem meeting
- Identify root causes and contributing factors
- Generate action items to prevent recurrence
- Share the postmortem with the broader team
See Postmortem Template for structure and guidance.
Remote-First Incident Response¶
Remote teams face unique challenges: no war room, time zone gaps, and reliance on text-based communication. Here's how to adapt:
Dedicated incident channels. Don't run incidents in #general or #engineering. Create a channel per incident for focus and searchability.
Video for complex debugging. When you're deep in investigation, switch from Slack to a video call. Screen sharing is faster than describing.
Explicit handoffs. When shifting responsibility (end of day, role change), make it explicit: "I'm handing IC to @alex—here's current status: [summary]."
Timezone-aware cadence. For long-running incidents, plan handoffs around time zones. Document status thoroughly so people waking up can catch up quickly.
Async postmortems. Not everyone can attend a synchronous postmortem meeting. Prepare a written document in advance, allow async comments, and record the meeting.
What Good Looks Like¶
You'll know your crisis management is working when you observe these signals:
| Signal | What it looks like |
|---|---|
| Fast detection | Incidents are detected by monitoring, not customer complaints |
| Clear ownership | IC is assigned within minutes of incident declaration |
| Calm coordination | The incident channel is focused, not chaotic |
| Regular communication | Stakeholders receive updates at predictable intervals |
| Quick mitigation | Impact is contained within your target SLO (e.g., SEV-1 mitigated in <1 hour) |
| Thorough postmortems | Every SEV-1/SEV-2 gets a postmortem with actionable follow-ups |
| Declining recurrence | Root causes are addressed; the same incidents don't keep happening |
| Sustainable on-call | Rotation is fair; engineers don't burn out |
Failure Modes and Mitigations¶
The Leaderless Incident¶
Symptom: Nobody takes charge. Multiple people investigate the same thing. Decisions stall. The incident drags on.
Root cause: No clear IC assignment, unclear escalation paths, or cultural reluctance to "take over."
Mitigation: Make IC assignment automatic. First responder to SEV-1/SEV-2 becomes IC until someone else explicitly takes over. Train people in IC skills so it's not scary.
The Communication Blackout¶
Symptom: Stakeholders have no idea what's happening. Executives start pinging engineers directly. Customer support is blindsided.
Root cause: No Comms Lead assigned, or the IC is too busy debugging to communicate.
Mitigation: Assign Comms Lead explicitly for SEV-1/SEV-2. Use templates so communication doesn't require creativity during stress.
The Blame Storm¶
Symptom: Postmortems become interrogations. People get defensive. Future incidents are hidden or downplayed.
Root cause: Leadership treats postmortems as accountability exercises rather than learning opportunities.
Mitigation: Enforce blameless postmortems. Leadership sets the tone by asking "what allowed this to happen?" rather than "who did this?" Celebrate incident reporters instead of punishing them.
The Zombie Incident¶
Symptom: Incidents are never formally closed. People drift away without clear resolution. The same issue resurfaces.
Root cause: No criteria for resolution, no IC discipline around closure.
Mitigation: IC must explicitly declare resolution with criteria: "Service is restored, metrics are nominal, closing the incident." Reopening is fine if things regress.
The Postmortem Graveyard¶
Symptom: Postmortems generate action items that never get done. The same root causes repeat.
Root cause: Action items aren't owned, prioritized, or tracked.
Mitigation: Every action item has an owner and a due date. Track completion in your normal sprint/backlog process. Review completion rates periodically.
Escalation Paths¶
Define escalation paths before incidents happen. Here's a template:
| Situation | Escalate to | How |
|---|---|---|
| Need more responders | Team's on-call rotation | Page via PagerDuty/Opsgenie |
| Incident affects multiple teams | Engineering leadership | Slack + phone call |
| Customer-facing impact likely to make news | VP Engineering + Comms/PR | Phone call |
| Security incident suspected | Security team + CISO | Dedicated security channel + phone |
| Need executive decision | CTO/CEO | Phone call via Eng leadership |
**Don't hesitate to escalate.** Escalation is not failure. It's getting the right resources to solve a problem. Leaders would rather be woken up for a real incident than learn about it from Twitter the next morning.
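The escalation template can also live in code or config so on-call tooling (or a chatbot) can answer "who do I escalate to?" without anyone hunting through docs. The keys and contacts below are placeholders mirroring the table above; substitute your own org chart:

```python
# Hypothetical escalation routing table: situation -> (who, how).
ESCALATIONS = {
    "need_responders":    ("team on-call rotation",        "page via PagerDuty/Opsgenie"),
    "multi_team_impact":  ("engineering leadership",       "Slack + phone call"),
    "newsworthy_impact":  ("VP Engineering + Comms/PR",    "phone call"),
    "security_incident":  ("security team + CISO",         "dedicated security channel + phone"),
    "executive_decision": ("CTO/CEO via Eng leadership",   "phone call"),
}

def escalation_target(situation: str) -> str:
    """Return a human-readable escalation instruction for a known situation."""
    who, how = ESCALATIONS[situation]
    return f"Escalate to {who} ({how})"
```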
On-Call Best Practices¶
On-call is part of incident response. A healthy on-call rotation makes crisis management sustainable.
Rotation fairness. Share the load equally. Track who's been paged and rebalance if necessary.
Reasonable expectations. On-call means you can respond within X minutes, not that you can't leave your house. Define the SLA explicitly.
Compensation. If on-call is burdensome, compensate for it (time off, money, or both). Unpaid on-call breeds resentment.
Runbooks. On-call engineers shouldn't need to remember everything. Provide runbooks for common issues. See Runbook Template.
Blameless culture. On-call engineers will make mistakes under pressure. Treat mistakes as system problems to fix, not personal failures.
Escalation clarity. On-call engineers should never feel stuck. Make sure they know when and how to escalate.
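Checking rotation fairness doesn't require special tooling: most paging systems can export page records, and a few lines can flag imbalance. A sketch, where the record shape and tolerance threshold are assumptions:

```python
from collections import Counter

def overloaded_engineers(pages, tolerance=2):
    """Given (engineer, timestamp) page records, return engineers paged more
    than `tolerance` times above the least-paged person — candidates for
    rebalancing the rotation."""
    counts = Counter(engineer for engineer, _ in pages)
    if not counts:
        return []
    floor = min(counts.values())
    return sorted(e for e, n in counts.items() if n - floor > tolerance)
```

Run this periodically (e.g., per rotation cycle) and rebalance before resentment builds, rather than after.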
Copy-Paste Artifact: Incident Response Checklist¶
## Incident Response Checklist
### Detection
- [ ] Acknowledge alert or report
- [ ] Assess initial impact (users affected, systems involved)
- [ ] Assign initial severity (SEV-1/2/3/4)
### Mobilization
- [ ] Create incident channel: #inc-YYYY-MM-DD-description
- [ ] Assign Incident Commander
- [ ] IC assigns: Tech Lead, Comms Lead, Responders
- [ ] Post initial status in incident channel
### Investigation
- [ ] Tech Lead coordinates debugging
- [ ] Responders investigate assigned areas
- [ ] Scribe/IC maintains timeline
- [ ] Comms Lead sends first stakeholder update
### Mitigation
- [ ] Implement workaround or fix
- [ ] Verify fix (check metrics, confirm with users)
- [ ] Update stakeholders on mitigation
- [ ] Decide: resolved now, or follow-up needed?
### Resolution
- [ ] IC declares incident resolved
- [ ] Send final stakeholder communication
- [ ] Update status page to "resolved"
- [ ] Schedule postmortem
### Postmortem
- [ ] Gather timeline and evidence
- [ ] Write postmortem document
- [ ] Hold blameless postmortem meeting
- [ ] Assign action items with owners and due dates
- [ ] Share postmortem broadly
Copy-Paste Artifact: Incident Channel Opening Message¶
🚨 **INCIDENT DECLARED** 🚨
**Severity:** SEV-[X]
**Summary:** [Brief description of the issue]
**Impact:** [Who/what is affected]
**Detection:** [How was this discovered]
**Roles:**
- Incident Commander: @[name]
- Tech Lead: @[name]
- Comms Lead: @[name]
- Responders: @[names]
**Current Status:** Investigating
**Next Update:** [time] or when we have significant news
---
Please keep this channel focused on incident response. Use threads for extended discussion.
Copy-Paste Artifact: Severity Assessment Questions¶
## Severity Assessment Questions
Ask these questions to determine severity:
1. **How many users are affected?**
- All/most users → likely SEV-1
- Significant subset → likely SEV-2
- Small subset or specific scenario → likely SEV-3/4
2. **What functionality is broken?**
- Core functionality (login, payments, primary feature) → escalate
- Secondary feature or edge case → de-escalate
3. **Is there data loss or security risk?**
- Yes → SEV-1, involve security team
4. **Is there a workaround?**
- No reasonable workaround → escalate
- Workaround exists and is acceptable → de-escalate
5. **Is the problem getting worse?**
- Expanding impact → escalate
- Stable or contained → maintain or de-escalate
6. **Are we past our SLO?**
- Yes → escalate
- No but approaching → prepare to escalate
When in doubt, round up. You can always de-escalate.
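As a first-pass triage aid, the questions can be condensed into a function. The answer encodings and one-level adjustments below are illustrative, and the output is only a suggestion, never a substitute for human judgment:

```python
def suggest_severity(users_affected: str,
                     core_feature_broken: bool,
                     data_loss_or_security: bool,
                     workaround_exists: bool) -> int:
    """Suggest a severity (1-4, lower = worse) from the assessment questions.

    users_affected is one of "all", "significant", or "small".
    """
    if data_loss_or_security:
        return 1  # Q3: always SEV-1; involve the security team
    sev = {"all": 1, "significant": 2, "small": 3}[users_affected]  # Q1
    if core_feature_broken:
        sev = max(1, sev - 1)  # Q2: escalate one level
    if workaround_exists:
        sev = min(4, sev + 1)  # Q4: de-escalate one level
    return sev
```

Questions 5 and 6 (worsening impact, SLO breach) are deliberately left to the humans on the call: they depend on live metrics, not a snapshot.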
Further Reading¶
- Incident Management for Operations by Rob Schnepp, Ron Vidal, and Chris Hawley – Practical incident management adapted from fire service practices
- Site Reliability Engineering by Google – Chapters on incident response and postmortems
- The Field Guide to Understanding Human Error by Sidney Dekker – Foundational text on blameless investigation
Related¶
- Outage Communication Playbook – How to communicate during incidents
- Incident Response – The delivery-focused view of incident handling
- Postmortem Template – Template for blameless reviews
- Runbook Template – Template for operational runbooks
- Reliability Practices – Building resilient systems