# Leadership in Crisis
A production incident doesn't announce itself politely. One moment the system is fine. The next, alerts are firing, Slack is exploding, and someone is asking "what's happening?" in a tone that makes it clear they already know something is very wrong.
This case study is about what happens next—not just the technical response, but the leadership response. How you show up in a crisis shapes trust for months afterward. Get it right, and the team emerges stronger, with clearer processes and deeper confidence. Get it wrong, and you erode the psychological safety you've spent quarters building.
I've been through enough incidents to know that the technical fix is often the easy part. The hard part is leading people through uncertainty while maintaining clarity, protecting the team from chaos while keeping stakeholders informed, and recovering trust after the immediate fire is out.
## The problem this solves
In a crisis, normal processes break. The usual cadences—standups, planning, async updates—become irrelevant. Decisions need to happen faster than usual, with less information than you'd like, under more scrutiny than normal.
Without clear leadership, several things go wrong:
**Communication fragments.** Multiple people answer the same questions inconsistently. Stakeholders hear different versions of reality. Rumors fill the gaps.
**Decision paralysis sets in.** No one is sure who can make the call. People wait for approval that doesn't come, or make decisions in silos that conflict.
**The team burns out.** Without someone protecting boundaries, engineers work unsustainable hours, skip meals, and accumulate stress that compounds long after the incident ends.
**Trust erodes.** If customers, stakeholders, or leadership feel uninformed or misled—even unintentionally—recovering that trust takes far longer than fixing the technical issue.
The role of the leader in a crisis is to create structure where chaos wants to live: clear communication, clear decisions, clear boundaries.
## When to use this approach
This playbook applies when:
- A production incident is impacting users at scale (severity 1 or 2 in most frameworks).
- The situation requires coordination across multiple people or teams.
- External communication is needed (customers, leadership, partners).
- The timeline extends beyond a quick fix—hours, not minutes.
- The team is visibly stressed or uncertain about what to do.
This is not just about outages. Security incidents, data integrity issues, critical bugs affecting key customers, or service degradations that don't trigger alerts but damage trust—all qualify.
## When this approach is not enough
This playbook assumes you have:
- Basic incident response infrastructure (alerting, on-call rotation, communication channels).
- Authority to make decisions or clear escalation paths to someone who does.
- A team that can execute technical remediation.
If you're missing these foundations, start with Crisis: Crisis Management to establish the operational baseline.
This playbook also doesn't cover:
- Legal or regulatory incidents (involve your legal team immediately).
- HR crises (different playbook entirely).
- Incidents where the root cause is personnel misconduct.
## Roles and ownership
In a crisis, role clarity is everything. Ambiguity about who does what creates duplication, gaps, and conflict.
| Role | Responsibility |
|---|---|
| Incident Commander (IC) | Owns the incident end-to-end. Makes decisions, prioritizes actions, calls for help. Usually a senior engineer or EM. |
| Technical Lead | Drives diagnosis and remediation. Coordinates engineers working on the fix. Reports status to IC. |
| Communications Lead | Owns all external and internal communication. Drafts updates, manages channels, ensures consistency. |
| Scribe | Documents everything: timeline, decisions, who did what. Captures information for the postmortem. |
| Engineering Manager | Protects team health. Manages escalations, shields the team from noise, handles stakeholder relationships. |
In smaller teams, one person may wear multiple hats—but name the hats explicitly. "I'm acting as IC and Comms Lead until someone else can take over" is better than assumed responsibility.
## The approach: leading through a crisis
### Phase 1: Stabilize the response (first 30 minutes)
The first phase is about creating order, not solving the problem.
**Establish the incident channel.** Create a dedicated space (Slack channel, video bridge, or both). Name it clearly: `#incident-2026-01-31-payment-failures`. Everyone working on the incident communicates here—not in DMs, not in other channels.
**Name the roles.** Within the first 10 minutes, someone needs to say: "I'm acting as Incident Commander. Sarah is Technical Lead. Alex is on comms." If no one steps up, you step up.
**Assess severity and scope.** What's broken? Who's affected? How many users, how much data, how much revenue? Get approximate answers fast—precision comes later.
**First external communication within 30 minutes.** Even if you don't have answers, communicate: "We're aware of an issue affecting X. We're investigating. Next update in 30 minutes." Silence is interpreted as ignorance or indifference.
**Protect focus.** The people diagnosing the problem need focus. Create a "war room" dynamic—only essential people, no drive-by questions. The IC shields the technical team from noise.
> **The 30-minute rule**
>
> If you haven't communicated externally within 30 minutes of a user-impacting incident, you're already behind. Stakeholders will fill the silence with assumptions—usually worse than reality.
### Phase 2: Coordinate and communicate (ongoing)
Once the initial chaos settles, the work becomes rhythmic: investigate, decide, communicate, repeat.
**Establish update cadence.** Decide how often you'll communicate—typically every 30–60 minutes during active incidents. Stick to the cadence even if there's no new information. "No update" is an update.
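Sticking to the cadence is mechanical enough to automate as a reminder check. A sketch under assumptions: the function names and the 30-minute default are illustrative, not a prescribed tool:

```python
from datetime import datetime, timedelta

def next_update_due(last_update: datetime, cadence_minutes: int = 30) -> datetime:
    """Return when the next status update is due under a fixed cadence."""
    return last_update + timedelta(minutes=cadence_minutes)

def update_overdue(last_update: datetime, now: datetime,
                   cadence_minutes: int = 30) -> bool:
    """True once the cadence has lapsed: time to post, even if the post
    is just 'no new information, still investigating'."""
    return now >= next_update_due(last_update, cadence_minutes)
```

Wired into a bot or a cron job, a check like this pings the Comms Lead so the cadence survives the chaos.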
**Centralize decisions.** All significant decisions go through the IC. This prevents conflicting actions and creates accountability. If someone needs to make a call, they bring it to the IC, who decides or explicitly delegates.
**Document as you go.** The scribe captures the timeline in real-time: timestamps, actions taken, decisions made, who was involved. This is invaluable for the postmortem and for anyone joining mid-incident.
**Communicate in layers.**
- **Technical team:** Real-time in the incident channel. Raw, detailed, fast.
- **Internal stakeholders (leadership, support, sales):** Summarized updates with impact and ETA. Less technical, more business context.
- **External (customers, status page):** Clear, honest, human. What's affected, what we're doing, when we'll update next.
**Rotate if it's long.** Incidents lasting more than 2–3 hours require rotation. No one makes good decisions after hours of high-stress focus. Plan handoffs before people are depleted.
### Phase 3: Resolve and verify
The fix is in. Now what?
**Verify resolution.** Don't just deploy and assume success. Define what "fixed" looks like—metrics returning to baseline, error rates dropping, customer confirmation. Monitor actively for regression.
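"Metrics returning to baseline" can be made explicit rather than eyeballed. A sketch, assuming a relative tolerance and a sample window that you would tune per service; the function name and thresholds are illustrative:

```python
def resolution_verified(error_rates: list[float], baseline: float,
                        tolerance: float = 0.1, window: int = 5) -> bool:
    """Consider the fix verified only after the last `window` samples
    stay within `tolerance` (relative) of the pre-incident baseline.

    Requiring a window of clean samples, not a single good reading,
    guards against declaring victory during a transient dip.
    """
    if len(error_rates) < window:
        return False  # not enough post-fix data yet; keep monitoring
    recent = error_rates[-window:]
    return all(rate <= baseline * (1 + tolerance) for rate in recent)
```

The same shape works for latency, queue depth, or any metric you picked as the definition of "fixed".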
**Communicate resolution.** Same channels, same cadence. "The issue affecting X has been resolved. Root cause was Y. We'll share a detailed postmortem within 48 hours."
**Capture immediate learnings.** Before people scatter, spend 15 minutes capturing: What was the timeline? What decisions did we make? What do we need to do next? This feeds the postmortem.
**Thank people.** Publicly acknowledge the people who helped. Name them specifically. Crisis response is stressful, and recognition matters.
### Phase 4: Recover and learn
The incident is over. The work isn't.
**Schedule the postmortem within 48–72 hours.** Memory fades fast. Get the key people in a room (virtual or physical) while the experience is fresh. Use the Postmortem Template.
**Focus on systems, not individuals.** The question is "what allowed this to happen?" not "who did this?" Blameless doesn't mean accountability-free—it means we fix systems, not people.
**Assign and track remediation items.** Every action item from the postmortem gets an owner and a deadline. Track them like any other work. Don't let them rot in a document no one reads.
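Tracking can be as simple as structured records with an owner and a deadline, plus a query that surfaces what's slipping. A sketch; the `RemediationItem` shape is illustrative, not a prescribed schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RemediationItem:
    title: str
    owner: str
    deadline: date
    done: bool = False

def overdue_items(items: list[RemediationItem], today: date) -> list[RemediationItem]:
    """Items past their deadline and still open -- the list to review
    in weekly syncs so postmortem actions don't quietly rot."""
    return [item for item in items if not item.done and item.deadline < today]
```

Whether this lives in a script, a ticket-tracker query, or a spreadsheet matters less than someone reviewing the overdue list on a fixed cadence.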
**Check on team health.** After a major incident, explicitly check in on the people involved. Adrenaline masks exhaustion. Some people need time off. Some need to talk. Some just need acknowledgment that it was hard.
**Rebuild stakeholder trust.** Share the postmortem externally if appropriate. Be transparent about what went wrong and what you're doing to prevent recurrence. Trust is rebuilt through honesty and follow-through, not spin.
## What good looks like
**Calm, clear communication.** Updates go out on time, in a consistent voice, with honest acknowledgment of what's known and unknown.
**Decisions happen quickly.** The IC makes calls without endless debate. Disagree-and-commit operates smoothly because roles are clear.
**The team stays functional.** People rotate, take breaks, eat food. No one burns out for a single incident.
**Stakeholders feel informed.** Leadership, customers, and partners aren't surprised. They may be unhappy about the incident, but they're not unhappy about how it was handled.
**The postmortem produces real change.** Action items get done. The same incident doesn't happen twice. The team has higher confidence, not lower.
| Signal | What it indicates |
|---|---|
| First external update within 30 minutes | Communication structure is working |
| Consistent update cadence maintained | Comms Lead is functioning effectively |
| No conflicting information in different channels | Centralized communication is holding |
| Team members taking breaks during long incidents | Team health is being protected |
| Postmortem scheduled within 72 hours | Learning culture is intact |
| Remediation items completed within committed timeframe | Follow-through is real |
## Failure modes and mitigations
| Failure mode | What it looks like | Mitigation |
|---|---|---|
| No clear IC | Multiple people making conflicting decisions. Everyone waiting for someone else. | First 10 minutes: explicitly name the IC. If no one does, step up. |
| Communication blackout | Stakeholders pinging constantly for updates. Rumors spreading. Frustration building. | Commit to update cadence and stick to it. Even "no new info" is an update. |
| Hero mode | One person working 12+ hours straight. Others afraid to interrupt or suggest rotation. | IC actively monitors time and mandates rotation. No one is irreplaceable. |
| Fix-first, communicate-later | Technical team heads-down while everyone else panics. Updates only after resolution. | Comms Lead works in parallel. Communication is not a distraction—it's part of the response. |
| Blame surfaces in real-time | Slack messages like "who deployed this?" or "why wasn't this tested?" | IC shuts it down immediately. "We'll cover that in the postmortem. Right now, focus on resolution." |
| Postmortem never happens | Incident ends, everyone moves on, nothing changes. | Schedule postmortem before the incident closes. Treat it as mandatory, not optional. |
| Remediation items rot | Action items assigned but never completed. Same incident repeats. | Track remediation items in your normal work system. Review completion in weekly syncs. |
## Copy-pastable artifacts
### Incident kickoff message (for #incident channel)
🚨 **Incident Declared: [Brief description]**
**Severity:** [SEV1/SEV2/SEV3]
**Impact:** [What's affected, who's affected, approximate scope]
**Status:** Investigating
**Roles:**
- Incident Commander: @[name]
- Technical Lead: @[name]
- Communications Lead: @[name]
- Scribe: @[name]
**Update cadence:** Every [30/60] minutes until resolved.
**Next update:** [time]
Please keep this channel clear for incident-related communication only. For questions, ping @[IC name] directly.
### External status update template
**[Service/Product] - Investigating issues with [feature/function]**
**Time:** [UTC timestamp]
We are currently investigating reports of [brief description of impact]. Some users may experience [specific symptoms].
Our team is actively working on diagnosis and resolution. We will provide an update by [specific time].
We apologize for any inconvenience and appreciate your patience.
### Internal stakeholder update template
**Incident Update: [title]**
**Time:** [timestamp]
**Severity:** [SEV level]
**Status:** [Investigating / Identified / Monitoring / Resolved]
**What's happening:**
[2-3 sentences on current state]
**Impact:**
- Users affected: [number or percentage]
- Features affected: [list]
- Revenue/business impact: [if known]
**Current actions:**
- [Action 1 - owner]
- [Action 2 - owner]
**ETA for resolution:** [estimate or "unknown - will update at X time"]
**Next update:** [specific time]
Questions: Contact @[comms lead or IC]
### Post-incident team check-in agenda (15 minutes, day after)
## Post-incident check-in
**Purpose:** Quick sync on how everyone is doing after the incident. Not a postmortem—that's separate.
**Agenda:**
1. How is everyone feeling? (Round-robin, 1-2 sentences each)
2. Does anyone need time off or reduced load this week?
3. Any immediate process concerns before the formal postmortem?
4. Confirm postmortem date/time
**Norms:**
- This is a safe space. Frustration, exhaustion, and relief are all valid.
- No problem-solving right now—just acknowledging.
### Postmortem scheduling message
**Postmortem scheduled for incident: [title]**
**When:** [date/time, within 72 hours of resolution]
**Where:** [link]
**Duration:** 60 minutes
**Attendees:** [IC, Tech Lead, Comms Lead, key responders, affected team leads]
**Preparation:**
- Review the incident timeline (linked in thread)
- Come with observations about what worked and what didn't
- Think about: what should we do differently next time?
**Norms:**
- Blameless: we focus on systems, not individuals
- Honest: we want the real story, not the polished version
- Action-oriented: we leave with concrete next steps
Please confirm attendance.
## A reflection: what I've learned about crisis leadership
The first few crises I led, I made a common mistake: I tried to be everywhere at once. Answering questions, reviewing fixes, drafting communications, managing stakeholders. By the time the incident ended, I was exhausted and had done none of those things well.
The lesson was simple but hard to internalize: your job is not to do everything—it's to make sure everything gets done by the right people. That means delegating, even when it feels faster to do it yourself. It means trusting people you've hired and trained. It means protecting your own bandwidth so you can make the decisions that only you can make.
The other lesson was about honesty. Early in my career, I saw leaders minimize incidents to stakeholders, downplay impact, or delay communication hoping the problem would resolve before anyone noticed. It never works. Stakeholders always find out. And when they do, the trust cost is far higher than the incident itself.
Now I over-communicate. I assume stakeholders want to know, even if the news is bad. I share what we know, what we don't know, and what we're doing about it. I've never regretted being too transparent. I've often regretted being too cautious.
Crises reveal culture. If your team has psychological safety, it shows: people speak up, admit mistakes, ask for help. If they don't, that shows too: silence, blame, finger-pointing. The best time to build that culture is before the crisis. But if you're already in it—lead by example. Admit what you don't know. Thank people for flagging risks. Focus on fixing systems, not punishing people.
The incident ends. The trust you build—or break—lasts much longer.
## Further reading
- The Field Guide to Understanding Human Error by Sidney Dekker — How to think about failure in complex systems.
- Incident Management for Operations by Rob Schnepp et al. — Practical frameworks for incident response.
- Turn the Ship Around! by L. David Marquet — Leadership through intent-based delegation, relevant under pressure.
- The Checklist Manifesto by Atul Gawande — Why checklists work in high-stakes environments.
## Related chapters
- Crisis: Crisis Management — The operational baseline for incident response.
- Crisis: Outage Communication Playbook — Detailed scripts and templates for incident communication.
- Resources: Postmortem Template — Structure for learning from incidents.
- Resources: Runbook Template — How to document operational procedures.
- Principles: Core Principles — The values that guide leadership under pressure.
- Team Ops: Conflict Resolution — Useful when post-incident tensions need addressing.