Outage Communication Playbook

When systems fail, communication is as important as the fix. Silence breeds anxiety. Stakeholders who don't hear from you will assume the worst, escalate unnecessarily, or lose trust in your team. But communication done poorly—vague, inconsistent, or overly optimistic—is almost as bad as silence.

This page provides a complete playbook for incident communication: who needs to know what, when to tell them, and exactly what to say. It includes templates you can copy and adapt for your context.

What Problem This Solves

During an incident, engineers are focused on investigation and resolution. Communication is often an afterthought—something that happens when someone asks "should we tell anyone about this?" By then, stakeholders may already be confused, worried, or angry.

Poor communication during outages creates these problems:

Customer trust erodes. Users who learn about an outage from their own errors—rather than your status page—feel blindsided. Trust is hard to rebuild.

Support is overwhelmed. Without proactive communication, every affected user submits a ticket asking if there's a problem. Support drowns while engineering debugs.

Leadership panics. Executives who hear about outages from customers or social media instead of their own teams will intervene in unhelpful ways.

Engineers get interrupted. Without a Comms Lead, responders get pulled into answering questions instead of fixing the problem.

A structured communication playbook solves all of these by making incident communication predictable, proactive, and independent of the technical response.


When to Use This Playbook

Use this playbook when:

  • You have a SEV-1 or SEV-2 incident with user impact
  • Customers may notice degraded performance or errors
  • The issue will take more than a few minutes to resolve
  • External parties (customers, partners) depend on your service
  • Internal stakeholders need to coordinate responses (support, sales, executives)

Don't use this playbook when:

  • The issue is internal-only with no user impact
  • You can resolve it faster than it takes to communicate about it
  • It's a SEV-3/SEV-4 that doesn't warrant external communication

Err toward communicating more rather than less. Users appreciate transparency even for small issues.


Roles in Incident Communication

Communications Lead (Comms Lead)

Responsibility: Owns all stakeholder communication during the incident.

What they do:

  • Draft and send internal updates
  • Update external status page
  • Coordinate with customer support on messaging
  • Field questions from executives so engineers can focus
  • Prepare post-incident communications (resolution notice, postmortem summary)

What they don't do:

  • Debug or fix the technical issue
  • Make decisions about incident response strategy

Who should be Comms Lead: For smaller incidents, the Incident Commander can handle comms. For SEV-1 incidents, designate a separate Comms Lead so the IC can focus on coordination. Good candidates: engineering managers, product managers, or designated communications staff.

Incident Commander's Role in Communication

The IC provides information to the Comms Lead:

  • Current status and impact
  • Estimated time to resolution (if known)
  • What's being done
  • When the next technical update will be available

The Comms Lead translates this into stakeholder-appropriate messaging.


Communication Audiences

Different audiences need different information at different times.

| Audience | What they need | Channel | Cadence |
| --- | --- | --- | --- |
| Engineering team | Technical details, tasks, coordination | Incident Slack channel | Real-time |
| Customer support | What to tell customers, known workarounds | Support channel or direct message | As new info is available |
| Internal stakeholders (leadership, sales, etc.) | Impact summary, expected resolution, business implications | #incidents-internal or email | Every 30-60 min for SEV-1 |
| Customers | What's affected, workaround if any, expected resolution | Status page, in-app notice, email for major outages | Every 30 min for SEV-1, less for lower severity |

Communicate proactively

Don't wait for people to ask. Push updates on a schedule. If there's nothing new to report, say so: "We're still investigating. Next update in 30 minutes."


Communication Cadence by Severity

| Severity | Internal updates | External updates (status page) | Email to customers |
| --- | --- | --- | --- |
| SEV-1 | Every 15-30 minutes | Every 30 minutes | If outage >1 hour |
| SEV-2 | Every 30-60 minutes | Every 60 minutes | Only if requested/significant |
| SEV-3 | At resolution | At resolution (if posted) | No |
| SEV-4 | At resolution | No | No |

Adjust based on your context. A B2B SaaS with enterprise customers may need more frequent, more formal communication than a consumer app.
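The cadence table above can be encoded as a small lookup so tooling (for example, a Slack reminder bot) can compute when the next update is due. This is an illustrative sketch: the severity labels and intervals mirror the table, but the `CADENCE_MINUTES` structure and `next_update_due` helper are hypothetical names, not part of any standard tool.

```python
from datetime import datetime, timedelta

# Update intervals from the cadence table above, using the shorter end
# of each range. None means no scheduled update for that channel
# (SEV-3/SEV-4 communicate at resolution only).
CADENCE_MINUTES = {
    "SEV-1": {"internal": 15, "status_page": 30},
    "SEV-2": {"internal": 30, "status_page": 60},
    "SEV-3": {"internal": None, "status_page": None},
    "SEV-4": {"internal": None, "status_page": None},
}

def next_update_due(severity: str, channel: str, last_update: datetime):
    """Return when the next update is due, or None if none is scheduled."""
    minutes = CADENCE_MINUTES[severity][channel]
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)

# Example: a SEV-1 internal update sent at 14:00 is due again by 14:15.
due = next_update_due("SEV-1", "internal", datetime(2024, 5, 1, 14, 0))
print(due)  # 2024-05-01 14:15:00
```

A reminder bot built on this could ping the Comms Lead when `next_update_due` passes, reinforcing the "push updates on a schedule" principle even when there is nothing new to report.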


Communication Phases

Phase 1: Initial Acknowledgment

Goal: Let stakeholders know you're aware of the issue and working on it.

Timing: Within 10-15 minutes of incident declaration for SEV-1/SEV-2.

What to include:

  • Acknowledgment that there's an issue
  • Brief description of impact (what users experience)
  • Statement that you're investigating
  • When to expect the next update

What NOT to include:

  • Root cause (you don't know yet)
  • Estimated resolution time (you don't know yet)
  • Blame or speculation

Phase 2: Investigation Updates

Goal: Keep stakeholders informed while you investigate.

Timing: At the cadence defined for the severity level.

What to include:

  • Current status (still investigating, identified cause, implementing fix)
  • Any changes in impact
  • Workaround if available
  • Expected next update time

Key principle: If you have nothing new, say so. "We're still investigating. The team is focused on [specific area]. Next update in 30 minutes." Silence is worse than repetition.

Phase 3: Mitigation Announcement

Goal: Let stakeholders know the immediate impact is contained.

Timing: As soon as mitigation is confirmed.

What to include:

  • Confirmation that the issue is mitigated
  • What users should expect now
  • Whether there's any follow-up action needed from users
  • That you'll share more details in the postmortem

Phase 4: Resolution Announcement

Goal: Close the loop with stakeholders.

Timing: When the incident is fully resolved.

What to include:

  • Confirmation that the incident is resolved
  • Brief summary of what happened (high level)
  • Apology for the impact
  • Commitment to share postmortem findings

Phase 5: Postmortem Follow-Up

Goal: Rebuild trust through transparency.

Timing: 24-72 hours after resolution for major incidents.

What to include:

  • What happened (plain language)
  • How you responded
  • What you're doing to prevent recurrence
  • Apology and commitment to improvement

For enterprise customers, consider sharing a summary of the postmortem. For public products, consider a blog post for major incidents.


What Good Looks Like

You'll know your incident communication is working when:

| Signal | What it looks like |
| --- | --- |
| Proactive awareness | Stakeholders learn about incidents from you, not from errors or social media |
| Support is prepared | Customer support knows what's happening before tickets arrive |
| Predictable cadence | Updates arrive when promised |
| Appropriate detail | Technical enough to be useful, not so technical that it confuses |
| Trust preserved | Customers frustrated by the outage, but not by your communication |
| Engineers undisturbed | Responders aren't fielding stakeholder questions during the incident |

Failure Modes and Mitigations

The Communication Vacuum

Symptom: Long silences during an incident. Stakeholders start asking "does anyone know what's happening?"

Root cause: No Comms Lead assigned, or Comms Lead doesn't have information from the technical team.

Mitigation: Assign Comms Lead explicitly. IC provides status updates to Comms Lead every 15 minutes. Comms Lead pushes updates on schedule, even if it's "no new information."

Overpromising on Resolution Time

Symptom: "We expect to resolve this in 30 minutes" turns into 3 hours. Stakeholders lose trust.

Root cause: Pressure to give optimistic estimates, or not enough buffer in estimates.

Mitigation: Don't give specific times unless you're confident. Say "we're working on it" or give wide ranges: "We expect this to take between one and three hours to resolve." Under-promise, over-deliver.

The Blame Leak

Symptom: Communication includes blame or finger-pointing: "A vendor failure caused..." or "Human error led to..."

Root cause: Instinct to explain, or pressure from leadership to assign blame.

Mitigation: Keep incident communication blameless. Focus on impact and actions, not causes. Save root cause discussion for the postmortem. Publicly blaming vendors or individuals during an incident damages relationships.

Inconsistent Channels

Symptom: Status page says one thing, support says another, internal Slack says a third thing.

Root cause: Multiple people communicating without coordination.

Mitigation: Comms Lead owns all external messaging. All updates flow through them. Internal updates can be faster/more detailed, but external communication is controlled.

The Over-Technical Update

Symptom: Status page says "We're experiencing elevated error rates in our Kafka consumer groups causing message lag in the event processing pipeline." Customers have no idea what this means.

Root cause: Engineers writing customer-facing communication.

Mitigation: Comms Lead translates technical updates to customer language. Focus on user impact: "Some users may experience delays in [feature]." Save technical details for internal updates.


Remote-First Adaptations

Async-friendly templates. Keep message templates ready so the Comms Lead doesn't need to craft prose under pressure.

Clear escalation paths. Know how to reach the Comms Lead outside working hours. They should be pageable for SEV-1.

Time zone handoffs. For long-running incidents, hand off Comms Lead role explicitly: "I'm handing comms to @sam—here's current status and the next scheduled update time."

Recorded postmortem summaries. If sharing postmortems with customers, consider a Loom video for enterprise accounts. It's more personal than a document.


Copy-Paste Artifact: Status Page Templates

Initial Acknowledgment

**Investigating: [Feature/Service] Issues**

We are aware of an issue affecting [brief description of what users experience]. Our team is investigating and we will provide updates as we learn more.

Next update in 30 minutes or when we have significant news.

Investigation Update

**Update: [Feature/Service] Issues**

We are continuing to investigate [brief description]. [Optional: We have identified the cause and are working on a fix / We are still working to identify the root cause].

[If available: Some users may be able to work around this by [workaround].]

Next update in 30 minutes or when we have significant news.

Mitigation Announcement

**Update: [Feature/Service] Issues - Mitigated**

We have implemented a fix and service is recovering. [Feature/Service] should be returning to normal.

If you continue to experience issues, please [refresh the page / contact support / etc.].

We will provide a full update once the incident is resolved.

Resolution Announcement

**Resolved: [Feature/Service] Issues**

The issue affecting [feature/service] has been resolved. All systems are operating normally.

We apologize for the disruption. We will be conducting a thorough review of this incident and will share our findings.

Thank you for your patience.

Copy-Paste Artifact: Internal Incident Update Template

## Incident Update: [Title]

**Time:** [timestamp]
**Severity:** SEV-[X]
**Status:** [Investigating / Identified / Mitigating / Resolved]

### Current Impact

[Brief description of user impact]

### What We Know

[Current understanding of the situation]

### What We're Doing

[Current actions being taken]

### Timeline

- [HH:MM] - [Event]
- [HH:MM] - [Event]
- [HH:MM] - [Event]

### Next Update

[Time] or when we have significant news

---

IC: @[name] | Comms: @[name] | Tech Lead: @[name]
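If you automate updates, the template above can be filled programmatically so the Comms Lead isn't hand-editing prose under pressure. A minimal sketch using Python's standard-library `string.Template`, with a condensed version of the template; the field names here are illustrative, not a prescribed schema:

```python
from string import Template

# A condensed form of the internal update template above, with
# $placeholders for the fields the Comms Lead fills in each cycle.
INTERNAL_UPDATE = Template(
    "## Incident Update: $title\n"
    "**Time:** $time | **Severity:** SEV-$sev | **Status:** $status\n\n"
    "### Current Impact\n$impact\n\n"
    "### Next Update\n$next_update or when we have significant news"
)

update = INTERNAL_UPDATE.substitute(
    title="Checkout errors",
    time="14:30 UTC",
    sev="1",
    status="Investigating",
    impact="Roughly 20% of checkout attempts are failing with an error.",
    next_update="15:00 UTC",
)
print(update)
```

`Template.substitute` raises `KeyError` if a field is missing, which is desirable here: an update with a blank status or impact should fail loudly rather than go out half-filled.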

Copy-Paste Artifact: Customer Email for Major Outage

Use this for extended outages (1+ hours) affecting a significant portion of your customers.

Subject: [Company Name] Service Disruption - [Date]

Dear [Customer],

We wanted to reach out to let you know about a service disruption that may have affected your experience with [Product Name].

**What happened:**
[Brief, plain-language description of the issue - focus on user experience, not technical details]

**Impact:**
[Timeframe of impact, what features were affected]

**What we're doing:**
Our team responded immediately and [brief description of resolution]. Service has been restored as of [time].

**Preventing future issues:**
We're conducting a thorough review of this incident. [If you have specific actions planned, mention them briefly.]

We understand how important [Product] is to your [workflow/business/etc.], and we're sorry for any inconvenience this caused.

If you have any questions or are still experiencing issues, please contact [support email].

Thank you for your patience and continued trust in [Company Name].

[Signature]

Copy-Paste Artifact: Support Team Briefing Template

Share this with customer support during an incident so they know how to respond to customers.

## Support Briefing: [Incident Title]

**Status:** [Active / Resolved]
**Last Updated:** [timestamp]

### What's happening

[Plain-language description of the issue]

### What customers experience

[Specific symptoms: error messages, slow load times, feature unavailable, etc.]

### What to tell customers

"We're aware of an issue affecting [feature]. Our team is actively working on it. [Workaround if available]. We expect to have an update within [timeframe]."

### Known workaround

[If any, describe steps. If none: "No workaround at this time."]

### What NOT to say

- Don't speculate on cause
- Don't promise specific resolution times
- Don't mention internal details (server names, employee names, etc.)

### Point customers to

- Status page: [URL]
- [Any other relevant resources]

### Escalation

If customers report impact not covered above, let us know in #[incident-channel].

Copy-Paste Artifact: Executive Briefing Template

For keeping leadership informed during major incidents.

## Executive Briefing: [Incident Title]

**Severity:** SEV-[X]
**Status:** [Active / Resolved]
**Duration:** [start time] - [current/end time]
**Prepared by:** [name]

### Summary

[2-3 sentence summary suitable for sharing with board or customers]

### Impact

- Users affected: [number or percentage]
- Revenue impact: [if known, or "assessing"]
- Customer escalations: [number]

### Current Status

[What's happening right now]

### Actions Taken

[Key response actions]

### Next Steps

[What happens next]

### Communication

- Status page: [Updated / Will update at X]
- Customer email: [Sent / Will send if >X hours / Not needed]
- Social media: [Monitoring / Response needed]

### Questions for Leadership

[If you need decisions or air cover, list them here]

Further Reading

  • Incident Management for Operations by Rob Schnepp, Ron Vidal, and Chris Hawley – Chapter on incident communication
  • Crucial Conversations by Patterson, Grenny, McMillan, and Switzler – Communicating under pressure
  • Atlassian's Incident Communication guides – Practical templates and examples