Crisis Management

An incident is any unplanned event that disrupts or degrades service for users. Crisis management is how you prepare for, respond to, and recover from these events. This page provides a complete framework: severity definitions, role assignments, escalation paths, and the incident lifecycle from detection to postmortem.

The goal is not to prevent all incidents—that's impossible. The goal is to minimize impact when they occur, recover quickly, and learn enough to reduce future occurrences.

What Problem This Solves

Without a structured approach, incident response becomes chaotic and exhausting. Multiple people investigate the same thing. Communication is inconsistent. Decisions stall because nobody knows who's in charge. The team burns out, and incidents feel worse than they need to be.

A good incident management framework provides clarity in chaos. When an incident starts, everyone knows: Who's in charge? What's my role? When do we escalate? How do we communicate? These questions have answers before the incident, not during it.


When to Use This Framework

Use this framework when:

  • Your product has users who depend on its availability
  • You have engineers on-call or responsible for production systems
  • You need to coordinate across multiple people or teams during incidents
  • You want consistent, repeatable incident response
  • Stakeholders (internal or external) expect communication during outages

Don't use this framework when:

  • You're building a prototype with no production users
  • Incidents are so rare and small that formal process adds more overhead than value
  • You have a single developer who can handle everything informally

Even small teams benefit from lightweight incident structure. The templates here can be simplified for smaller contexts.


Severity Levels

Severity levels create shared vocabulary. They determine response intensity, communication cadence, and escalation paths. Define them before incidents happen so you're not debating severity during a crisis.

Severity Definitions

| Severity | Definition | Examples | Response |
|---|---|---|---|
| SEV-1 (Critical) | Complete outage or severe degradation affecting all or most users. Revenue or data loss occurring. Security breach in progress. | Full production down, payment processing broken, data leak discovered | All hands. Incident Commander required. Executive notification. External status page updated. |
| SEV-2 (High) | Significant degradation affecting a subset of users or critical functionality. Workaround may exist but is inadequate. | Major feature broken, significant latency, authentication issues for some users | Dedicated response team. Incident Commander recommended. Stakeholder notification. |
| SEV-3 (Medium) | Partial degradation or non-critical feature failure. Workaround exists and is acceptable short-term. | Minor feature broken, slow performance for small subset, non-critical integration failing | On-call engineer responds. Team lead notified. Normal prioritization for fix. |
| SEV-4 (Low) | Minor issue with minimal user impact. Cosmetic problems or edge cases. | Typo in UI, minor display bug, edge case failure | Logged and prioritized in normal backlog. No immediate response required. |

When in doubt, escalate

If you're unsure about severity, round up. It's better to over-respond and scale down than to under-respond and let an incident grow. You can always downgrade severity as you learn more.

Severity Escalation

Severity can change during an incident:

  • Escalate when impact is broader than initially understood, when resolution is taking longer than expected, or when new information reveals greater risk.
  • De-escalate when impact is contained, when a workaround is deployed, or when the affected population is smaller than feared.

Document severity changes in the incident timeline with reasoning.
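A severity change like this can be captured as a structured timeline entry so the reasoning survives into the postmortem. A minimal sketch in Python (the in-memory `timeline` list and field names are illustrative, not part of any particular tooling):

```python
from datetime import datetime, timezone

# Illustrative in-memory timeline; a real incident would persist this
# in the incident ticket or channel.
timeline: list[dict] = []

def log_severity_change(new_severity: str, reason: str) -> None:
    """Append a timestamped severity change, with reasoning, to the timeline."""
    timeline.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "event": "severity_change",
        "severity": new_severity,
        "reason": reason,
    })

log_severity_change("SEV-2", "Impact contained to one region; workaround deployed")
```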


Roles in Incident Response

Clear roles prevent duplication and gaps. Not every incident needs every role, but everyone should know which roles exist and who's filling them.

Incident Commander (IC)

Responsibility: Owns the incident end-to-end. Makes decisions, coordinates responders, manages communication cadence, and declares resolution.

What they do:

  • Assess severity and assemble the response team
  • Create and maintain the incident channel/thread
  • Assign tasks and roles
  • Make decisions when the team is stuck
  • Ensure regular status updates
  • Declare incident resolved and initiate postmortem

What they don't do:

  • Debug code (unless they're the only responder)
  • Write customer communications (that's the Comms Lead)
  • Get pulled into technical rabbit holes

Who should be IC: Someone with enough technical context to understand updates, enough authority to make decisions, and enough calm to coordinate under pressure. This is usually an engineering manager, senior engineer, or tech lead. It should rotate to build the skill across the team.

Technical Lead

Responsibility: Leads the technical investigation and resolution.

What they do:

  • Investigate root cause
  • Coordinate debugging across responders
  • Implement fixes or workarounds
  • Communicate technical status to the IC

What they don't do:

  • Manage the overall incident (that's the IC)
  • Write stakeholder communications

Who should be Tech Lead: The person with the deepest knowledge of the affected system, or the on-call engineer if no one else is available.

Communications Lead

Responsibility: Owns all stakeholder communication—internal and external.

What they do:

  • Draft and send internal updates (Slack, email)
  • Update external status page
  • Coordinate with customer support on messaging
  • Handle executive inquiries

What they don't do:

  • Debug or fix the technical issue
  • Make decisions about incident response

Who should be Comms Lead: Someone who can write clearly under pressure. Often a product manager, engineering manager, or designated communications person. For smaller incidents, the IC may handle comms.

Responders

Responsibility: Investigate and resolve the incident under Tech Lead coordination.

What they do:

  • Investigate assigned areas
  • Report findings to the Tech Lead
  • Implement fixes as directed
  • Document actions in the incident timeline

Scribe (Optional)

Responsibility: Maintains the incident timeline and captures key decisions.

What they do:

  • Record timestamps, actions, and findings
  • Capture decisions and reasoning
  • Prepare the timeline for postmortem

For smaller incidents, the IC or another responder can handle scribing. For major incidents, a dedicated scribe significantly improves postmortem quality.

Role Assignment During an Incident

When an incident starts, the IC explicitly assigns roles:

IC: "I'm taking Incident Commander. @alex, you're Tech Lead—start investigating
the database. @sam, you're Comms Lead—draft an internal update for
#engineering-all. @taylor, investigate the API layer and report back."

Explicit assignment prevents assumptions and gaps.


The Incident Lifecycle

Phase 1: Detection

An incident begins when someone detects a problem. Detection sources include:

  • Automated alerting (monitoring, error tracking)
  • Customer reports (support tickets, social media)
  • Internal reports (engineer notices something wrong)

Key actions:

  1. Acknowledge the alert or report
  2. Perform initial assessment: What's the impact? What's affected?
  3. Create an incident record (ticket, Slack channel, or both)
  4. Assign initial severity based on available information

Phase 2: Response Mobilization

Once an incident is confirmed, mobilize the response team.

Key actions:

  1. Declare the incident formally (post in incident channel, update status page if warranted)
  2. Assign Incident Commander
  3. IC assigns other roles based on severity and scope
  4. Establish communication cadence (e.g., updates every 30 minutes for SEV-1)

For remote teams: Use a dedicated Slack channel per incident, named with a convention like #inc-2026-01-31-database-outage. This keeps noise out of regular channels and creates a searchable record.
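This naming convention is easy to automate when declaring an incident. A small sketch, assuming Slack-style channel names (lowercase, hyphen-separated):

```python
import re
from datetime import date

def incident_channel_name(description: str, on: date) -> str:
    """Build an incident channel name like #inc-2026-01-31-database-outage."""
    # Lowercase the description and replace anything that isn't a-z or 0-9
    # with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{on.isoformat()}-{slug}"

print(incident_channel_name("Database Outage", date(2026, 1, 31)))
# → #inc-2026-01-31-database-outage
```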

Phase 3: Investigation and Diagnosis

The team works to understand the problem.

Key actions:

  1. Tech Lead coordinates investigation across systems
  2. Responders investigate their assigned areas and report findings
  3. IC tracks progress and removes blockers
  4. Comms Lead sends regular updates to stakeholders
  5. Scribe maintains the timeline

Remote-first practices:

  • Use threaded replies in the incident channel to keep the main channel scannable
  • Use voice/video for real-time debugging when typing is too slow
  • Share screens rather than describing what you see

Phase 4: Mitigation

The immediate goal is to stop the bleeding—restore service, even if the root cause isn't fully understood.

Key actions:

  1. Implement a workaround or fix
  2. Verify the fix works (check metrics, confirm with affected users)
  3. Communicate the mitigation to stakeholders
  4. Decide: Is this good enough, or does more work need to happen now?

Mitigation vs. resolution: Mitigation stops the immediate impact. Resolution fully fixes the underlying problem. Sometimes mitigation is good enough for now, and resolution can happen in follow-up work. Make this decision explicitly.

Phase 5: Resolution and Handoff

The incident is resolved when service is restored to acceptable levels.

Key actions:

  1. IC declares the incident resolved
  2. Send final communication to stakeholders
  3. Update status page to "resolved"
  4. Schedule postmortem (within 24-72 hours for SEV-1/SEV-2)
  5. Hand off any follow-up work to the backlog

Criteria for resolution:

  • Service is functioning normally
  • Metrics are back to baseline
  • No ongoing customer impact
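The "metrics are back to baseline" criterion is easier to apply with an explicit tolerance. A sketch (the 10% tolerance and metric names are assumptions to tune for your systems):

```python
def back_to_baseline(current: dict[str, float], baseline: dict[str, float],
                     tolerance: float = 0.10) -> bool:
    """True when every baseline metric is within `tolerance` of its baseline value."""
    return all(
        abs(current[name] - value) <= tolerance * abs(value)
        for name, value in baseline.items()
    )

# Example: p99 latency has recovered to within 10% of normal.
baseline = {"p99_latency_ms": 200.0, "error_rate": 0.01}
print(back_to_baseline({"p99_latency_ms": 210.0, "error_rate": 0.01}, baseline))  # True
```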

Phase 6: Postmortem

Every significant incident deserves a postmortem. The goal is learning, not blame.

Key actions:

  1. Gather timeline, data, and evidence
  2. Hold blameless postmortem meeting
  3. Identify root causes and contributing factors
  4. Generate action items to prevent recurrence
  5. Share the postmortem with the broader team

See Postmortem Template for structure and guidance.


Remote-First Incident Response

Remote teams face unique challenges: no war room, time zone gaps, and reliance on text-based communication. Here's how to adapt:

Dedicated incident channels. Don't run incidents in #general or #engineering. Create a channel per incident for focus and searchability.

Video for complex debugging. When you're deep in investigation, switch from Slack to a video call. Screen sharing is faster than describing.

Explicit handoffs. When shifting responsibility (end of day, role change), make it explicit: "I'm handing IC to @alex—here's current status: [summary]."

Timezone-aware cadence. For long-running incidents, plan handoffs around time zones. Document status thoroughly so people waking up can catch up quickly.

Async postmortems. Not everyone can attend a synchronous postmortem meeting. Prepare a written document in advance, allow async comments, and record the meeting.


What Good Looks Like

You'll know your crisis management is working when you observe these signals:

| Signal | What it looks like |
|---|---|
| Fast detection | Incidents are detected by monitoring, not customer complaints |
| Clear ownership | IC is assigned within minutes of incident declaration |
| Calm coordination | The incident channel is focused, not chaotic |
| Regular communication | Stakeholders receive updates at predictable intervals |
| Quick mitigation | Impact is contained within your target SLO (e.g., SEV-1 mitigated in <1 hour) |
| Thorough postmortems | Every SEV-1/SEV-2 gets a postmortem with actionable follow-ups |
| Declining recurrence | Root causes are addressed; the same incidents don't keep happening |
| Sustainable on-call | Rotation is fair; engineers don't burn out |

Failure Modes and Mitigations

The Leaderless Incident

Symptom: Nobody takes charge. Multiple people investigate the same thing. Decisions stall. The incident drags on.

Root cause: No clear IC assignment, unclear escalation paths, or cultural reluctance to "take over."

Mitigation: Make IC assignment automatic. First responder to SEV-1/SEV-2 becomes IC until someone else explicitly takes over. Train people in IC skills so it's not scary.

The Communication Blackout

Symptom: Stakeholders have no idea what's happening. Executives start pinging engineers directly. Customer support is blindsided.

Root cause: No Comms Lead assigned, or the IC is too busy debugging to communicate.

Mitigation: Assign Comms Lead explicitly for SEV-1/SEV-2. Use templates so communication doesn't require creativity during stress.

The Blame Storm

Symptom: Postmortems become interrogations. People get defensive. Future incidents are hidden or downplayed.

Root cause: Leadership treats postmortems as accountability exercises rather than learning opportunities.

Mitigation: Enforce blameless postmortems. Leadership sets the tone by asking "what allowed this to happen?" not "who did this?" Celebrate people who report incidents rather than punishing them.

The Zombie Incident

Symptom: Incidents are never formally closed. People drift away without clear resolution. The same issue resurfaces.

Root cause: No criteria for resolution, no IC discipline around closure.

Mitigation: IC must explicitly declare resolution with criteria: "Service is restored, metrics are nominal, closing the incident." Reopening is fine if things regress.

The Postmortem Graveyard

Symptom: Postmortems generate action items that never get done. The same root causes repeat.

Root cause: Action items aren't owned, prioritized, or tracked.

Mitigation: Every action item has an owner and a due date. Track completion in your normal sprint/backlog process. Review completion rates periodically.
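Owned, dated, trackable action items can be modeled directly. A minimal sketch (field names are illustrative; in practice this lives in your ticket tracker):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    title: str
    owner: str          # every item has an owner...
    due: date           # ...and a due date
    done: bool = False

def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items past their due date."""
    return [i for i in items if not i.done and i.due < today]

def completion_rate(items: list[ActionItem]) -> float:
    """Fraction of items completed, for periodic review."""
    return sum(i.done for i in items) / len(items) if items else 1.0
```

Reviewing `overdue()` and `completion_rate()` periodically is the mechanism that keeps postmortems out of the graveyard.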


Escalation Paths

Define escalation paths before incidents happen. Here's a template:

| Situation | Escalate to | How |
|---|---|---|
| Need more responders | Team's on-call rotation | Page via PagerDuty/Opsgenie |
| Incident affects multiple teams | Engineering leadership | Slack + phone call |
| Customer-facing impact likely to make news | VP Engineering + Comms/PR | Phone call |
| Security incident suspected | Security team + CISO | Dedicated security channel + phone |
| Need executive decision | CTO/CEO | Phone call via Eng leadership |

Don't hesitate to escalate

Escalation is not failure. It's getting the right resources to solve a problem. Leaders would rather be woken up for a real incident than learn about it from Twitter the next morning.


On-Call Best Practices

On-call is part of incident response. A healthy on-call rotation makes crisis management sustainable.

Rotation fairness. Share the load equally. Track who's been paged and rebalance if necessary.
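Tracking page counts and flagging imbalance takes only a few lines. A sketch, assuming you can export who handled each page (the tolerance threshold is an assumption to tune):

```python
from collections import Counter

def rebalance_candidates(pages: list[str], roster: list[str],
                         tolerance: int = 2) -> list[str]:
    """Engineers paged more than `tolerance` times above the team minimum,
    i.e. candidates to relieve in the next rotation."""
    counts = Counter(pages)
    floor = min(counts.get(name, 0) for name in roster)
    return sorted(n for n in roster if counts.get(n, 0) - floor > tolerance)
```

Run it against a quarter's page log; anyone it returns is carrying a disproportionate share of the load.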

Reasonable expectations. On-call means you can respond within X minutes, not that you can't leave your house. Define the SLA explicitly.

Compensation. If on-call is burdensome, compensate for it (time off, money, or both). Unpaid on-call breeds resentment.

Runbooks. On-call engineers shouldn't need to remember everything. Provide runbooks for common issues. See Runbook Template.

Blameless culture. On-call engineers will make mistakes under pressure. Treat mistakes as system problems to fix, not personal failures.

Escalation clarity. On-call engineers should never feel stuck. Make sure they know when and how to escalate.


Copy-Paste Artifact: Incident Response Checklist

## Incident Response Checklist

### Detection

- [ ] Acknowledge alert or report
- [ ] Assess initial impact (users affected, systems involved)
- [ ] Assign initial severity (SEV-1/2/3/4)

### Mobilization

- [ ] Create incident channel: #inc-YYYY-MM-DD-description
- [ ] Assign Incident Commander
- [ ] IC assigns: Tech Lead, Comms Lead, Responders
- [ ] Post initial status in incident channel

### Investigation

- [ ] Tech Lead coordinates debugging
- [ ] Responders investigate assigned areas
- [ ] Scribe/IC maintains timeline
- [ ] Comms Lead sends first stakeholder update

### Mitigation

- [ ] Implement workaround or fix
- [ ] Verify fix (check metrics, confirm with users)
- [ ] Update stakeholders on mitigation
- [ ] Decide: resolved now, or follow-up needed?

### Resolution

- [ ] IC declares incident resolved
- [ ] Send final stakeholder communication
- [ ] Update status page to "resolved"
- [ ] Schedule postmortem

### Postmortem

- [ ] Gather timeline and evidence
- [ ] Write postmortem document
- [ ] Hold blameless postmortem meeting
- [ ] Assign action items with owners and due dates
- [ ] Share postmortem broadly

Copy-Paste Artifact: Incident Channel Opening Message

🚨 **INCIDENT DECLARED** 🚨

**Severity:** SEV-[X]
**Summary:** [Brief description of the issue]
**Impact:** [Who/what is affected]
**Detection:** [How was this discovered]

**Roles:**

- Incident Commander: @[name]
- Tech Lead: @[name]
- Comms Lead: @[name]
- Responders: @[names]

**Current Status:** Investigating

**Next Update:** [time] or when we have significant news

---

Please keep this channel focused on incident response. Use threads for extended discussion.

Copy-Paste Artifact: Severity Assessment Questions

## Severity Assessment Questions

Ask these questions to determine severity:

1. **How many users are affected?**
   - All/most users → likely SEV-1
   - Significant subset → likely SEV-2
   - Small subset or specific scenario → likely SEV-3/4

2. **What functionality is broken?**
   - Core functionality (login, payments, primary feature) → escalate
   - Secondary feature or edge case → de-escalate

3. **Is there data loss or security risk?**
   - Yes → SEV-1, involve security team

4. **Is there a workaround?**
   - No reasonable workaround → escalate
   - Workaround exists and is acceptable → de-escalate

5. **Is the problem getting worse?**
   - Expanding impact → escalate
   - Stable or contained → maintain or de-escalate

6. **Are we past our SLO?**
   - Yes → escalate
   - No but approaching → prepare to escalate

When in doubt, round up. You can always de-escalate.
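The decision rules above can be sketched as a triage helper. This is a simplification of the questions, not a substitute for judgment (the input encoding and round-up rule are assumptions):

```python
def assess_severity(users_affected: str, core_broken: bool,
                    data_or_security_risk: bool, has_workaround: bool) -> int:
    """Return a severity number (1 = most severe) from triage answers.

    users_affected is "all", "subset", or "few".
    """
    if data_or_security_risk:
        return 1  # data loss or security risk is always SEV-1
    severity = {"all": 1, "subset": 2, "few": 3}[users_affected]
    if core_broken:
        severity = min(severity, 2)       # core functionality broken: escalate
    if not has_workaround:
        severity = max(1, severity - 1)   # no acceptable workaround: round up
    return severity
```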

Further Reading

  • Incident Management for Operations by Rob Schnepp, Ron Vidal, and Chris Hawley – Practical incident management adapted from fire service practices
  • Site Reliability Engineering by Google – Chapters on incident response and postmortems
  • The Field Guide to Understanding Human Error by Sidney Dekker – Foundational text on blameless investigation