Runbook Template

A runbook is a set of documented procedures that anyone on the team can follow to complete a task or respond to an incident. Good runbooks reduce reliance on tribal knowledge, speed up incident response, and make on-call less stressful.

What problem this solves

When operational knowledge lives only in people's heads:

Incidents take longer to resolve because responders don't know what to do.
On-call becomes a burden because only certain people can handle issues.
Knowledge is lost when people leave or move to other teams.
Simple tasks become blocked waiting for the "expert."

Runbooks solve this by externalizing operational knowledge into documented procedures that anyone can follow.

When to use this

Create a runbook for:

Any task that multiple people need to perform.
Common incident scenarios and their remediation steps.
Deployment procedures.
Scaling operations (manual or semi-automated).
Maintenance tasks that happen regularly.

Link runbooks to alerts. When an alert fires, the responder should know immediately where to find the relevant runbook.

Roles and ownership

Role	Responsibility
Service owner	Ensures runbooks exist for their service. Reviews and updates them after incidents.
On-call engineer	Uses runbooks during incidents. Reports when runbooks are missing or outdated.
Tech lead / Manager	Ensures runbook coverage for the team's services. Prioritizes runbook creation as part of operational readiness.

Runbooks should have explicit owners—someone responsible for keeping them accurate and up-to-date.

How to create a runbook

Step 1: Identify the procedure

What task or scenario does this runbook cover? Be specific:

"Restart the payment service" (good)
"Handle payment issues" (too vague)

Step 2: Document prerequisites

What does someone need before they can follow this runbook?

Access permissions
Tools or CLIs
Context about the system

Step 3: Write step-by-step instructions

Write for someone who doesn't have your context. Each step should be:

Specific enough to follow without interpretation.
Verifiable—how do you know the step succeeded?

Include commands that can be copy-pasted, but also explain what they do.

Step 4: Add troubleshooting

What can go wrong? What are the common variations? How do you recover if a step fails?

Step 5: Define escalation

When should someone stop following the runbook and escalate? Who do they contact?

Step 6: Test the runbook

Have someone unfamiliar with the procedure follow the runbook. Note where they get stuck. Refine.

Step 7: Link to alerts

If this runbook is for incident response, link it from the alert. The link should appear in the alert message itself, so the responder sees it the moment the alert fires.

Signals that runbooks are working

On-call responders can resolve issues without waking up the service owner.
Incident resolution times decrease.
Runbooks are used and updated regularly.
New team members can perform operational tasks quickly.
The same incidents don't require re-learning each time.

Failure modes and mitigations

Failure mode	What it looks like	Mitigation
Runbooks don't exist	Responders improvise; outcomes vary	Require runbooks before a service is production-ready
Runbooks are outdated	Steps don't work; responders waste time	Review after every incident; tag with "last verified" date
Runbooks aren't linked to alerts	Responders can't find them when needed	Include runbook link in alert metadata
Too much detail	20-page document; no one reads it	Keep it focused; separate reference docs from procedures
Not tested	Steps have errors; responders discover them during incidents	Have someone else follow the runbook periodically
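
The "last verified" mitigation in the table above is easier to enforce if the date can be checked mechanically. The following is a minimal sketch, not a definitive implementation. It assumes the runbook header contains a line of the form `**Last verified:** YYYY-MM-DD` (the template only specifies `[Date]`, so the ISO format is an assumption), and the function names and the 180-day default threshold are hypothetical choices for illustration.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Matches a header line such as "**Last verified:** 2024-05-01".
# Assumption: dates are written in ISO format (YYYY-MM-DD).
LAST_VERIFIED = re.compile(
    r"^\*\*Last verified:\*\*[ \t]*(\d{4}-\d{2}-\d{2})[ \t]*$",
    re.MULTILINE,
)


def last_verified(runbook_text: str) -> Optional[date]:
    """Return the "Last verified" date, or None if it is missing or invalid."""
    match = LAST_VERIFIED.search(runbook_text)
    if match is None:
        return None
    try:
        return date.fromisoformat(match.group(1))
    except ValueError:
        # Right shape but not a real calendar date, e.g. 2024-13-40.
        return None


def is_stale(runbook_text: str, today: date, max_age_days: int = 180) -> bool:
    """Treat a runbook as stale if it has no valid date or the date is too old."""
    verified = last_verified(runbook_text)
    if verified is None:
        return True
    return today - verified > timedelta(days=max_age_days)
```

A check like this could run on a schedule and open a ticket for the runbook owner. Treating a missing or unfilled date as stale is a deliberate choice here, so that a runbook copied from the template without being verified is flagged too.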

The template

Runbook document

# Runbook: [Procedure Name]

**Service:** [Service name]
**Owner:** [Name or team]
**Last verified:** [Date]
**Alert link:** [Link to associated alert, if applicable]

---

## Purpose

[What does this runbook help you do? When would you use it?]

---

## Prerequisites

- [ ] Access to [system/tool]
- [ ] [CLI tool] installed
- [ ] Familiarity with [concept] (link to docs if needed)
- [ ] VPN connected (if applicable)

---

## Procedure

### Step 1: [Step name]

[What to do]

```bash
# Command to run
command --with-flags
```

**Expected output:**

```text
[What you should see]
```

**If this fails:**
[What to try]

---

### Step 2: [Step name]

[What to do]

**Verification:**
[How to confirm this step succeeded]

---

### Step 3: [Step name]

[...]

---

## Troubleshooting

### Issue: [Common problem]

**Symptoms:** [What you observe]

**Cause:** [Why this happens]

**Resolution:**
[Steps to fix]

---

### Issue: [Another common problem]

[...]

---

## Rollback

If something goes wrong and you need to undo your changes:

1. [Rollback step 1]
2. [Rollback step 2]

---

## Escalation

Escalate if:

- The runbook steps don't resolve the issue.
- You're unsure about a step.
- The impact is larger than expected.

**Escalation contacts:**

| Role            | Contact       | When to page                             |
| --------------- | ------------- | ---------------------------------------- |
| Service owner   | [Name/handle] | If the runbook doesn't resolve the issue |
| On-call manager | [Name/handle] | If SEV1 or customer impact               |

---

## Related resources

- [Link to service documentation]
- [Link to architecture diagram]
- [Link to related runbooks]

---

## Changelog

| Date   | Author | Changes         |
| ------ | ------ | --------------- |
| [Date] | [Name] | Initial version |
| [Date] | [Name] | [What changed]  |
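
Runbooks written from the template above can be checked for completeness before they are published. The sketch below is one possible approach under stated assumptions: it treats six of the template's `## ` headings as mandatory (which ones are mandatory is a team decision, not something the template prescribes), and `missing_sections` is a hypothetical helper name.

```python
import re

# Section headings taken from the template; the choice of which
# ones are required is an assumption made for this example.
REQUIRED_SECTIONS = (
    "Purpose",
    "Prerequisites",
    "Procedure",
    "Troubleshooting",
    "Rollback",
    "Escalation",
)


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required "## " headings that the runbook does not contain."""
    # Collect level-2 headings only; "### Step 1: ..." lines do not match
    # because the pattern requires a space right after the second "#".
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^## (.+)$", runbook_text, re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

An empty result means every required section is present. The check only confirms that the headings exist. Whether the steps under them are correct still needs a person to follow the runbook.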

Alert integration example

When creating alerts, include runbook links in the alert metadata:

# Example alert rule (Prometheus/Alertmanager style)
- alert: PaymentServiceHighErrorRate
  expr: rate(payment_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Payment service error rate is high'
    runbook: 'https://wiki.example.com/runbooks/payment-service-high-error-rate'
    dashboard: 'https://grafana.example.com/d/payment-service'
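
To keep alerts and runbooks linked over time, a small check can flag alert rules that have no runbook annotation. This is a minimal sketch under assumptions: the rules have already been parsed from YAML into a list of dictionaries (the parsing step is left out), the annotation key is `runbook` as in the example above, and `alerts_missing_runbook` is a hypothetical helper name.

```python
from typing import Any


def alerts_missing_runbook(rules: list[dict[str, Any]]) -> list[str]:
    """Return the names of alert rules with no non-empty `runbook` annotation."""
    missing = []
    for rule in rules:
        if "alert" not in rule:
            # Recording rules use a `record` key instead of `alert`; skip them.
            continue
        annotations = rule.get("annotations") or {}
        link = str(annotations.get("runbook", "")).strip()
        if not link:
            missing.append(rule["alert"])
    return missing
```

Running a check like this in CI for the alert rules repository would catch new alerts that ship without a runbook link. If your rules use a different annotation key, change the key in the lookup accordingly.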

Platform: Reliability Practices — Operational discipline including runbooks.
Delivery: Incident Response — How runbooks fit into incident handling.
Postmortem Template — Document when runbooks need improvement.
Outage Communications Template — Communicating during incidents.