Outage Communication Playbook

When systems fail, communication is as important as the fix. Silence breeds anxiety. Stakeholders who don't hear from you will assume the worst, escalate unnecessarily, or lose trust in your team. But communication done poorly—vague, inconsistent, or overly optimistic—is almost as bad as silence.

This page provides a complete playbook for incident communication: who needs to know what, when to tell them, and exactly what to say. It includes templates you can copy and adapt for your context.

What Problem This Solves

During an incident, engineers are focused on investigation and resolution. Communication is often an afterthought—something that happens when someone asks "should we tell anyone about this?" By then, stakeholders may already be confused, worried, or angry.

Poor communication during outages creates these problems:

Customer trust erodes. Users who learn about an outage from their own errors—rather than your status page—feel blindsided. Trust is hard to rebuild.

Support is overwhelmed. Without proactive communication, every affected user submits a ticket asking if there's a problem. Support drowns while engineering debugs.

Leadership panics. Executives who hear about outages from customers or social media instead of their own teams will intervene in unhelpful ways.

Engineers get interrupted. Without a Comms Lead, responders get pulled into answering questions instead of fixing the problem.

A structured communication playbook solves all of these by making incident communication predictable, proactive, and independent of the technical response.


When to Use This Playbook

Use this playbook when:

  • You have a SEV-1 or SEV-2 incident with user impact
  • Customers may notice degraded performance or errors
  • The issue will take more than a few minutes to resolve
  • External parties (customers, partners) depend on your service
  • Internal stakeholders need to coordinate responses (support, sales, executives)

Don't use this playbook when:

  • The issue is internal-only with no user impact
  • You can resolve it faster than it takes to communicate about it
  • It's a SEV-3/SEV-4 that doesn't warrant external communication

Err toward communicating more rather than less. Users appreciate transparency even for small issues.


Roles in Incident Communication

Communications Lead (Comms Lead)

Responsibility: Owns all stakeholder communication during the incident.

What they do:

  • Draft and send internal updates
  • Update external status page
  • Coordinate with customer support on messaging
  • Field questions from executives so engineers can focus
  • Prepare post-incident communications (resolution notice, postmortem summary)

What they don't do:

  • Debug or fix the technical issue
  • Make decisions about incident response strategy

Who should be Comms Lead: For smaller incidents, the Incident Commander can handle comms. For SEV-1 incidents, designate a separate Comms Lead so the IC can focus on coordination. Good candidates: engineering managers, product managers, or designated communications staff.

Incident Commander's Role in Communication

The IC provides information to the Comms Lead:

  • Current status and impact
  • Estimated time to resolution (if known)
  • What's being done
  • When the next technical update will be available

The Comms Lead translates this into stakeholder-appropriate messaging.


Communication Audiences

Different audiences need different information at different times.

| Audience | What they need | Channel | Cadence |
| --- | --- | --- | --- |
| Engineering team | Technical details, tasks, coordination | Incident Slack channel | Real-time |
| Customer support | What to tell customers, known workarounds | Support channel or direct message | As new info is available |
| Internal stakeholders (leadership, sales, etc.) | Impact summary, expected resolution, business implications | #incidents-internal or email | Every 30-60 min for SEV-1 |
| Customers | What's affected, workaround if any, expected resolution | Status page, in-app notice, email for major outages | Every 30 min for SEV-1, less for lower severity |

Communicate proactively

Don't wait for people to ask. Push updates on a schedule. If there's nothing new to report, say so: "We're still investigating. Next update in 30 minutes."


Communication Cadence by Severity

| Severity | Internal updates | External updates (status page) | Email to customers |
| --- | --- | --- | --- |
| SEV-1 | Every 15-30 minutes | Every 30 minutes | If outage >1 hour |
| SEV-2 | Every 30-60 minutes | Every 60 minutes | Only if requested/significant |
| SEV-3 | At resolution | At resolution (if posted) | No |
| SEV-4 | At resolution | No | No |

Adjust based on your context. A B2B SaaS with enterprise customers may need more frequent, more formal communication than a consumer app.
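The cadence table above can be encoded as a small lookup so tooling (for example, a Slack reminder bot) can compute when the next update is due. This is an illustrative sketch: the severity labels and intervals mirror the table, but the `CADENCE_MINUTES` structure and `next_update_due` helper are hypothetical names, not part of any standard tool.

```python
from datetime import datetime, timedelta

# Update intervals from the cadence table above, using the shorter end
# of each range. None means no scheduled update for that channel
# (SEV-3/SEV-4 communicate at resolution only).
CADENCE_MINUTES = {
    "SEV-1": {"internal": 15, "status_page": 30},
    "SEV-2": {"internal": 30, "status_page": 60},
    "SEV-3": {"internal": None, "status_page": None},
    "SEV-4": {"internal": None, "status_page": None},
}

def next_update_due(severity: str, channel: str, last_update: datetime):
    """Return when the next update is due, or None if none is scheduled."""
    minutes = CADENCE_MINUTES[severity][channel]
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)

# Example: a SEV-1 internal update sent at 14:00 is due again by 14:15.
due = next_update_due("SEV-1", "internal", datetime(2024, 5, 1, 14, 0))
print(due)  # 2024-05-01 14:15:00
```

A reminder bot built on this could ping the Comms Lead when `next_update_due` passes, reinforcing the "push updates on a schedule" principle even when there is nothing new to report.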


Communication Phases

Phase 1: Initial Acknowledgment

Goal: Let stakeholders know you're aware of the issue and working on it.

Timing: Within 10-15 minutes of incident declaration for SEV-1/SEV-2.

What to include:

  • Acknowledgment that there's an issue
  • Brief description of impact (what users experience)
  • Statement that you're investigating
  • When to expect the next update

What NOT to include:

  • Root cause (you don't know yet)
  • Estimated resolution time (you don't know yet)
  • Blame or speculation

Phase 2: Investigation Updates

Goal: Keep stakeholders informed while you investigate.

Timing: At the cadence defined for the severity level.

What to include:

  • Current status (still investigating, identified cause, implementing fix)
  • Any changes in impact
  • Workaround if available
  • Expected next update time

Key principle: If you have nothing new, say so. "We're still investigating. The team is focused on [specific area]. Next update in 30 minutes." Silence is worse than repetition.

Phase 3: Mitigation Announcement

Goal: Let stakeholders know the immediate impact is contained.

Timing: As soon as mitigation is confirmed.

What to include:

  • Confirmation that the issue is mitigated
  • What users should expect now
  • Whether there's any follow-up action needed from users
  • That you'll share more details in the postmortem

Phase 4: Resolution Announcement

Goal: Close the loop with stakeholders.

Timing: When the incident is fully resolved.

What to include:

  • Confirmation that the incident is resolved
  • Brief summary of what happened (high level)
  • Apology for the impact
  • Commitment to share postmortem findings

Phase 5: Postmortem Follow-Up

Goal: Rebuild trust through transparency.

Timing: 24-72 hours after resolution for major incidents.

What to include:

  • What happened (plain language)
  • How you responded
  • What you're doing to prevent recurrence
  • Apology and commitment to improvement

For enterprise customers, consider sharing a summary of the postmortem. For public products, consider a blog post for major incidents.


What Good Looks Like

You'll know your incident communication is working when:

| Signal | What it looks like |
| --- | --- |
| Proactive awareness | Stakeholders learn about incidents from you, not from errors or social media |
| Support is prepared | Customer support knows what's happening before tickets arrive |
| Predictable cadence | Updates arrive when promised |
| Appropriate detail | Technical enough to be useful, not so technical that it confuses |
| Trust preserved | Customers frustrated by the outage, but not by your communication |
| Engineers undisturbed | Responders aren't fielding stakeholder questions during the incident |

Failure Modes and Mitigations

The Communication Vacuum

Symptom: Long silences during an incident. Stakeholders start asking "does anyone know what's happening?"

Root cause: No Comms Lead assigned, or Comms Lead doesn't have information from the technical team.

Mitigation: Assign Comms Lead explicitly. IC provides status updates to Comms Lead every 15 minutes. Comms Lead pushes updates on schedule, even if it's "no new information."

Overpromising on Resolution Time

Symptom: "We expect to resolve this in 30 minutes" turns into 3 hours. Stakeholders lose trust.

Root cause: Pressure to give optimistic estimates, or not enough buffer in estimates.

Mitigation: Don't give specific times unless you're confident. Say "we're working on it" or give wide ranges: "We expect this to take between one and three hours to resolve." Under-promise, over-deliver.

The Blame Leak

Symptom: Communication includes blame or finger-pointing: "A vendor failure caused..." or "Human error led to..."

Root cause: Instinct to explain, or pressure from leadership to assign blame.

Mitigation: Keep incident communication blameless. Focus on impact and actions, not causes. Save root cause discussion for the postmortem. Publicly blaming vendors or individuals during an incident damages relationships.

Inconsistent Channels

Symptom: Status page says one thing, support says another, internal Slack says a third thing.

Root cause: Multiple people communicating without coordination.

Mitigation: Comms Lead owns all external messaging. All updates flow through them. Internal updates can be faster/more detailed, but external communication is controlled.

The Over-Technical Update

Symptom: Status page says "We're experiencing elevated error rates in our Kafka consumer groups causing message lag in the event processing pipeline." Customers have no idea what this means.

Root cause: Engineers writing customer-facing communication.

Mitigation: Comms Lead translates technical updates to customer language. Focus on user impact: "Some users may experience delays in [feature]." Save technical details for internal updates.


Remote-First Adaptations

Async-friendly templates. Keep message templates ready so the Comms Lead doesn't need to craft prose under pressure.

Clear escalation paths. Know how to reach the Comms Lead outside working hours. They should be pageable for SEV-1.

Time zone handoffs. For long-running incidents, hand off Comms Lead role explicitly: "I'm handing comms to @sam—here's current status and the next scheduled update time."

Recorded postmortem summaries. If sharing postmortems with customers, consider a Loom video for enterprise accounts. It's more personal than a document.


Copy-Paste Artifact: Status Page Templates

Initial Acknowledgment

**Investigating: [Feature/Service] Issues**

We are aware of an issue affecting [brief description of what users experience]. Our team is investigating and we will provide updates as we learn more.

Next update in 30 minutes or when we have significant news.

Investigation Update

**Update: [Feature/Service] Issues**

We are continuing to investigate [brief description]. [Optional: We have identified the cause and are working on a fix / We are still working to identify the root cause].

[If available: Some users may be able to work around this by [workaround].]

Next update in 30 minutes or when we have significant news.

Mitigation Announcement

**Update: [Feature/Service] Issues - Mitigated**

We have implemented a fix and service is recovering. [Feature/Service] should be returning to normal.

If you continue to experience issues, please [refresh the page / contact support / etc.].

We will provide a full update once the incident is resolved.

Resolution Announcement

**Resolved: [Feature/Service] Issues**

The issue affecting [feature/service] has been resolved. All systems are operating normally.

We apologize for the disruption. We will be conducting a thorough review of this incident and will share our findings.

Thank you for your patience.

Copy-Paste Artifact: Internal Incident Update Template

## Incident Update: [Title]

**Time:** [timestamp]
**Severity:** SEV-[X]
**Status:** [Investigating / Identified / Mitigating / Resolved]

### Current Impact

[Brief description of user impact]

### What We Know

[Current understanding of the situation]

### What We're Doing

[Current actions being taken]

### Timeline

- [HH:MM] - [Event]
- [HH:MM] - [Event]
- [HH:MM] - [Event]

### Next Update

[Time] or when we have significant news

---

IC: @[name] | Comms: @[name] | Tech Lead: @[name]
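If you automate updates, the template above can be filled programmatically so the Comms Lead isn't hand-editing prose under pressure. A minimal sketch using Python's standard-library `string.Template`, with a condensed version of the template; the field names here are illustrative, not a prescribed schema:

```python
from string import Template

# A condensed form of the internal update template above, with
# $placeholders for the fields the Comms Lead fills in each cycle.
INTERNAL_UPDATE = Template(
    "## Incident Update: $title\n"
    "**Time:** $time | **Severity:** SEV-$sev | **Status:** $status\n\n"
    "### Current Impact\n$impact\n\n"
    "### Next Update\n$next_update or when we have significant news"
)

update = INTERNAL_UPDATE.substitute(
    title="Checkout errors",
    time="14:30 UTC",
    sev="1",
    status="Investigating",
    impact="Roughly 20% of checkout attempts are failing with an error.",
    next_update="15:00 UTC",
)
print(update)
```

`Template.substitute` raises `KeyError` if a field is missing, which is desirable here: an update with a blank status or impact should fail loudly rather than go out half-filled.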

Copy-Paste Artifact: Customer Email for Major Outage

Use this for extended outages (1+ hours) affecting a significant portion of your customers.

Subject: [Company Name] Service Disruption - [Date]

Dear [Customer],

We wanted to reach out to let you know about a service disruption that may have affected your experience with [Product Name].

**What happened:**
[Brief, plain-language description of the issue - focus on user experience, not technical details]

**Impact:**
[Timeframe of impact, what features were affected]

**What we're doing:**
Our team responded immediately and [brief description of resolution]. Service has been restored as of [time].

**Preventing future issues:**
We're conducting a thorough review of this incident. [If you have specific actions planned, mention them briefly.]

We understand how important [Product] is to your [workflow/business/etc.], and we're sorry for any inconvenience this caused.

If you have any questions or are still experiencing issues, please contact [support email].

Thank you for your patience and continued trust in [Company Name].

[Signature]

Copy-Paste Artifact: Support Team Briefing Template

Share this with customer support during an incident so they know how to respond to customers.

## Support Briefing: [Incident Title]

**Status:** [Active / Resolved]
**Last Updated:** [timestamp]

### What's happening

[Plain-language description of the issue]

### What customers experience

[Specific symptoms: error messages, slow load times, feature unavailable, etc.]

### What to tell customers

"We're aware of an issue affecting [feature]. Our team is actively working on it. [Workaround if available]. We expect to have an update within [timeframe]."

### Known workaround

[If any, describe steps. If none: "No workaround at this time."]

### What NOT to say

- Don't speculate on cause
- Don't promise specific resolution times
- Don't mention internal details (server names, employee names, etc.)

### Point customers to

- Status page: [URL]
- [Any other relevant resources]

### Escalation

If customers report impact not covered above, let us know in #[incident-channel].

Copy-Paste Artifact: Executive Briefing Template

For keeping leadership informed during major incidents.

## Executive Briefing: [Incident Title]

**Severity:** SEV-[X]
**Status:** [Active / Resolved]
**Duration:** [start time] - [current/end time]
**Prepared by:** [name]

### Summary

[2-3 sentence summary suitable for sharing with board or customers]

### Impact

- Users affected: [number or percentage]
- Revenue impact: [if known, or "assessing"]
- Customer escalations: [number]

### Current Status

[What's happening right now]

### Actions Taken

[Key response actions]

### Next Steps

[What happens next]

### Communication

- Status page: [Updated / Will update at X]
- Customer email: [Sent / Will send if >X hours / Not needed]
- Social media: [Monitoring / Response needed]

### Questions for Leadership

[If you need decisions or air cover, list them here]

Further Reading

  • Incident Management for Operations by Rob Schnepp, Ron Vidal, and Chris Hawley – Chapter on incident communication
  • Crucial Conversations by Patterson, Grenny, McMillan, and Switzler – Communicating under pressure
  • Atlassian's Incident Communication guides – Practical templates and examples