Runbook Template

A runbook is a set of documented procedures that anyone on the team can follow to complete a task or respond to an incident. Good runbooks reduce reliance on tribal knowledge, speed up incident response, and make on-call less stressful.


What problem this solves

When operational knowledge lives only in people's heads:

  • Incidents take longer to resolve because responders don't know what to do.
  • On-call becomes a burden because only certain people can handle issues.
  • Knowledge is lost when people leave or move to other teams.
  • Simple tasks become blocked waiting for the "expert."

Runbooks solve this by externalizing operational knowledge into documented procedures that anyone can follow.


When to use this

Create a runbook for:

  • Any task that multiple people need to perform.
  • Common incident scenarios and their remediation steps.
  • Deployment procedures.
  • Scaling operations (manual or semi-automated).
  • Maintenance tasks that happen regularly.

Link runbooks to alerts. When an alert fires, the responder should know immediately where to find the relevant runbook.


Roles and ownership

| Role                | Responsibility                                                                                                  |
| ------------------- | --------------------------------------------------------------------------------------------------------------- |
| Service owner       | Ensures runbooks exist for their service. Reviews and updates them after incidents.                             |
| On-call engineer    | Uses runbooks during incidents. Reports when runbooks are missing or outdated.                                  |
| Tech lead / Manager | Ensures runbook coverage for the team's services. Prioritizes runbook creation as part of operational readiness. |

Every runbook should have an explicit owner: a person or team responsible for keeping it accurate and up to date.


How to create a runbook

Step 1: Identify the procedure

What task or scenario does this runbook cover? Be specific:

  • "Restart the payment service" (good)
  • "Handle payment issues" (too vague)

Step 2: Document prerequisites

What does someone need before they can follow this runbook?

  • Access permissions
  • Tools or CLIs
  • Context about the system
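
The "Tools or CLIs" prerequisite can be checked mechanically before someone starts the procedure. The following is a minimal sketch, not part of the original template: the function name `missing_tools` and the tool names `kubectl` and `jq` are hypothetical placeholders for whatever the runbook actually requires.

```python
import shutil


def missing_tools(required: list[str]) -> list[str]:
    """Return the required command-line tools that are not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


# Hypothetical tool list for illustration; substitute the runbook's own prerequisites.
missing = missing_tools(["kubectl", "jq"])
if missing:
    print("Install before starting:", ", ".join(missing))
```

A check like this only covers installed tools; access permissions and system context still need to be spelled out in the runbook itself.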

Step 3: Write step-by-step instructions

Write for someone who doesn't have your context. Each step should be:

  • Specific enough to follow without interpretation.
  • Verifiable: it states how to confirm that the step succeeded.

Include commands that can be copy-pasted, but also explain what they do.
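
One way to enforce the "verifiable" rule is to lint runbooks for steps that lack a verification hint. The sketch below assumes runbooks follow the template in this document, with `### Step N: ...` headings and the labels `**Expected output:**` or `**Verification:**`; the function name and the marker list are illustrative assumptions, not an established tool.

```python
import re

# Labels that count as a verification hint. They mirror the template's
# "Expected output" and "Verification" labels; adjust if your runbooks differ.
VERIFY_MARKERS = ("**Expected output:**", "**Verification:**")


def steps_without_verification(markdown: str) -> list[str]:
    """Return the titles of '### Step N: ...' sections with no verification marker."""
    missing = []
    # Split at each step heading; the capture group keeps the heading text.
    # Result: [preamble, title1, body1, title2, body2, ...]
    parts = re.split(r"^### (Step \d+: .*)$", markdown, flags=re.MULTILINE)
    for title, body in zip(parts[1::2], parts[2::2]):
        # A step's body ends at the next level-2 heading (e.g. "## Troubleshooting").
        body = re.split(r"^## ", body, maxsplit=1, flags=re.MULTILINE)[0]
        if not any(marker in body for marker in VERIFY_MARKERS):
            missing.append(title.strip())
    return missing


example = """\
### Step 1: Restart the service
**Expected output:**
ok

### Step 2: Check the dashboard
Look at the graphs.

## Troubleshooting
**Verification:** not part of a step
"""
print(steps_without_verification(example))  # → ['Step 2: Check the dashboard']
```

A lint like this only confirms that a verification section is present; whether the check is meaningful still needs a human reviewer.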

Step 4: Add troubleshooting

What can go wrong? What are the common variations? How do you recover if a step fails?

Step 5: Define escalation

When should someone stop following the runbook and escalate? Who do they contact?

Step 6: Test the runbook

Have someone unfamiliar with the procedure follow the runbook. Note where they get stuck. Refine.

If this runbook is for incident response, link it from the alert so that the link appears in the alert message when the alert fires.


Signals that runbooks are working

  • On-call responders can resolve issues without waking up the service owner.
  • Incident resolution times decrease.
  • Runbooks are used and updated regularly.
  • New team members can perform operational tasks quickly.
  • Recurring incidents don't require re-learning the fix each time.

Failure modes and mitigations

| Failure mode                     | What it looks like                                          | Mitigation                                                 |
| -------------------------------- | ----------------------------------------------------------- | ---------------------------------------------------------- |
| Runbooks don't exist             | Responders improvise; outcomes vary                         | Require runbooks before a service is production-ready      |
| Runbooks are outdated            | Steps don't work; responders waste time                     | Review after every incident; tag with "last verified" date |
| Runbooks aren't linked to alerts | Responders can't find them when needed                      | Include runbook link in alert metadata                     |
| Too much detail                  | 20-page document; no one reads it                           | Keep it focused; separate reference docs from procedures   |
| Not tested                       | Steps have errors; responders discover them during incidents | Have someone else follow the runbook periodically          |
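
The "last verified" mitigation is easier to sustain when stale runbooks are flagged automatically. This is a minimal sketch under stated assumptions: the runbook carries a `**Last verified:**` line as in the template, the date is written as ISO `YYYY-MM-DD` (the template does not fix a format), and the 90-day threshold is an arbitrary example value.

```python
import re
from datetime import date, timedelta

# Matches a "**Last verified:** 2024-01-15" line. The ISO date format is an
# assumption of this sketch.
LAST_VERIFIED = re.compile(
    r"^\*\*Last verified:\*\*\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE
)


def is_stale(markdown: str, today: date, max_age_days: int = 90) -> bool:
    """Return True if the runbook has no parseable date or was verified too long ago."""
    match = LAST_VERIFIED.search(markdown)
    if match is None:
        return True  # a missing date is treated as stale
    try:
        verified = date.fromisoformat(match.group(1))
    except ValueError:
        return True  # e.g. month 13; an invalid date is treated as stale
    return today - verified > timedelta(days=max_age_days)


runbook = "# Runbook: Restart the payment service\n**Last verified:** 2024-01-15\n"
print(is_stale(runbook, today=date(2024, 3, 1)))  # → False
print(is_stale(runbook, today=date(2024, 6, 1)))  # → True
```

Passing `today` explicitly keeps the check deterministic, which makes it easy to run in CI or a scheduled job.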

The template

Runbook document

# Runbook: [Procedure Name]

**Service:** [Service name]
**Owner:** [Name or team]
**Last verified:** [Date]
**Alert link:** [Link to associated alert, if applicable]

---

## Purpose

[What does this runbook help you do? When would you use it?]

---

## Prerequisites

- [ ] Access to [system/tool]
- [ ] [CLI tool] installed
- [ ] Familiarity with [concept] (link to docs if needed)
- [ ] VPN connected (if applicable)

---

## Procedure

### Step 1: [Step name]

[What to do]

```bash
# Command to run
command --with-flags
```

**Expected output:**

```text
[What you should see]
```

**If this fails:**
[What to try]

---

### Step 2: [Step name]

[What to do]

**Verification:**
[How to confirm this step succeeded]

---

### Step 3: [Step name]

[...]

---

## Troubleshooting

### Issue: [Common problem]

**Symptoms:** [What you observe]

**Cause:** [Why this happens]

**Resolution:**
[Steps to fix]

---

### Issue: [Another common problem]

[...]

---

## Rollback

If something goes wrong and you need to undo:

1. [Rollback step 1]
2. [Rollback step 2]

---

## Escalation

Escalate if:

- The runbook steps don't resolve the issue.
- You're unsure about a step.
- The impact is larger than expected.

**Escalation contacts:**

| Role            | Contact       | When to page               |
| --------------- | ------------- | -------------------------- |
| Service owner   | [Name/handle] | If runbook doesn't resolve |
| On-call manager | [Name/handle] | If SEV1 or customer impact |

---

## Related resources

- [Link to service documentation]
- [Link to architecture diagram]
- [Link to related runbooks]

---

## Changelog

| Date   | Author | Changes         |
| ------ | ------ | --------------- |
| [Date] | [Name] | Initial version |
| [Date] | [Name] | [What changed]  |

Alert integration example

When creating alerts, include runbook links in the alert metadata:

# Example alert rule (Prometheus/Alertmanager style)
- alert: PaymentServiceHighErrorRate
  expr: rate(payment_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: 'Payment service error rate is high'
    runbook: 'https://wiki.example.com/runbooks/payment-service-high-error-rate'
    dashboard: 'https://grafana.example.com/d/payment-service'
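
Runbook links in alert metadata tend to go missing as rules are added, so it can help to check for them automatically. The sketch below assumes the rules have already been loaded from YAML into a list of dictionaries (for example with a YAML parser of your choice) and that the annotation key is `runbook`, as in the example above. The function name is hypothetical.

```python
def alerts_missing_runbook(rules: list[dict], key: str = "runbook") -> list[str]:
    """Return the names of alert rules whose annotations lack an http(s) runbook link."""
    missing = []
    for rule in rules:
        if "alert" not in rule:
            continue  # skip entries without an alert name, such as recording rules
        link = (rule.get("annotations") or {}).get(key, "")
        if not str(link).startswith(("http://", "https://")):
            missing.append(rule["alert"])
    return missing


# Rules as they would look after parsing; the second alert is made up for illustration.
rules = [
    {
        "alert": "PaymentServiceHighErrorRate",
        "annotations": {
            "runbook": "https://wiki.example.com/runbooks/payment-service-high-error-rate"
        },
    },
    {"alert": "ExampleAlertWithoutRunbook", "annotations": {"summary": "No link here"}},
]
print(alerts_missing_runbook(rules))  # → ['ExampleAlertWithoutRunbook']
```

Running a check like this in CI for the alert rules repository would catch a missing link before the alert ever fires.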