
Reliability Practices

Reliability is not a feature you ship once. It's a continuous discipline—a combination of technical practices, operational habits, and cultural norms that together determine whether your systems work when users need them.

This page covers the practices that make reliability sustainable: how to define what "reliable" means for your systems, how to observe whether you're meeting that bar, and how to respond and learn when things go wrong.


What problem this solves

Every team says they care about reliability. But without concrete practices, reliability becomes an aspiration that gets traded away under deadline pressure. The result is systems that work most of the time but fail unpredictably, eroding user trust and consuming roadmap capacity in firefighting.

Reliability practices solve this by:

  • Making reliability expectations explicit and measurable (SLOs).
  • Providing visibility into system behavior (observability).
  • Creating consistent, effective responses to failures (incident management).
  • Building a culture where reliability is everyone's responsibility, not an afterthought.

The cost of not having these practices is that reliability becomes reactive. You find out you have a problem when users tell you—or when they stop using your product.


When to invest in reliability

Invest now if:

  • Incident frequency is increasing or staying high.
  • On-call engineers are burned out or dreading their rotations.
  • Postmortems identify the same root causes repeatedly.
  • Users complain about reliability, or you see it in churn data.
  • You don't have SLOs, or existing SLOs aren't actionable.
  • Observability is insufficient to diagnose issues quickly.

Defer deeper investment if:

  • Reliability is already meeting user expectations.
  • Other themes (security, scaling) are more urgent.
  • You're in a very early stage where stability isn't the primary concern.

Even in early stages, basic reliability hygiene (monitoring, alerting, incident process) should exist. What you can defer is optimization and advanced practices.


Core reliability practices

1. Service Level Objectives (SLOs)

SLOs define what "reliable enough" means for your system. They are the contract between your team and your users—explicit targets for availability, latency, or other quality dimensions.

Why SLOs matter:

Without SLOs, reliability is a vague goal. "We want to be reliable" doesn't help you make trade-offs. SLOs give you a concrete bar: if you're meeting your SLO, you have room to take risks and ship features. If you're missing it, you need to prioritize reliability work.

How to define SLOs:

  1. Identify user-impacting behaviors. What do users care about? Can they load the page? Can they complete a transaction? Is the response fast enough?

  2. Choose metrics that reflect those behaviors. Availability (error rate), latency (p50, p95, p99), throughput. Avoid internal metrics that don't map to user experience.

  3. Set targets based on user expectations and business needs. 99.9% availability sounds impressive, but it still allows roughly 8.8 hours of downtime per year. Is that acceptable? Is 99.5% enough?

  4. Define error budgets. If your SLO is 99.9% availability, your error budget is 0.1%. When the budget is exhausted, reliability work takes priority over features.
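The arithmetic behind steps 3 and 4 is worth making concrete. A minimal sketch (the function name is illustrative) that converts an SLO target into an allowed-downtime budget:

```python
# Sketch: convert an SLO target into an allowed-downtime budget.
# 99.9% over a year allows ~8.8 hours; over a 30-day month, ~43 minutes.

def downtime_budget(slo_target: float, window_hours: float) -> float:
    """Hours of downtime permitted by `slo_target` over `window_hours`."""
    return window_hours * (1 - slo_target)

for target in (0.995, 0.999, 0.9999):
    yearly = downtime_budget(target, 365 * 24)            # hours per year
    monthly = downtime_budget(target, 30 * 24) * 60       # minutes per 30-day month
    print(f"{target:.2%}: {yearly:.1f} h/year, {monthly:.1f} min/month")
```

Running the numbers like this before committing to a target makes the trade-off visible: each extra nine shrinks the budget by a factor of ten.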

Common mistakes:

  • SLOs that are too ambitious and routinely missed (erodes trust in the system).
  • SLOs that are too easy and don't reflect user expectations.
  • SLOs without error budgets (no mechanism for prioritization).
  • Too many SLOs to track meaningfully.

Start with 1–3 SLOs per service. Keep them simple and actionable.

2. Observability

Observability is the ability to understand what's happening inside your systems based on their external outputs. It's the foundation for detecting problems, diagnosing root causes, and verifying fixes.

The three pillars:

  • Logs: Detailed records of discrete events. Useful for debugging specific issues but expensive to store and search at scale.
  • Metrics: Aggregated numerical measurements over time. Useful for alerting and trend analysis. Cheaper to store than logs.
  • Traces: Records of requests as they flow through distributed systems. Essential for understanding latency and dependencies.

What good observability looks like:

  • You can answer "what's broken?" within minutes of an alert.
  • You can trace a user request through all the services it touches.
  • Dashboards show the health of each service at a glance.
  • On-call engineers can diagnose issues without tribal knowledge.

Common mistakes:

  • Logging everything (expensive, noisy, hard to search).
  • Alerting on every metric (alert fatigue).
  • Observability owned by a central team with no input from service owners.
  • Missing correlation between logs, metrics, and traces.

Practical guidance:

  • Use structured logging with consistent fields (request ID, user ID, service name).
  • Alert on symptoms (SLO violations), not causes (CPU high).
  • Ensure every service has a health dashboard and basic runbook.
  • Trace requests end-to-end, especially through async boundaries.
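Structured logging with consistent fields can be sketched as a small JSON-lines helper. This is a hypothetical convention, not a specific library's API; the field names (`request_id`, `user_id`, `service`) and the service name are illustrative:

```python
# Minimal sketch of structured logging: every line is one JSON object with a
# shared set of fields plus event-specific ones. Field names are illustrative.
import json
import sys
import time

SERVICE = "checkout"  # hypothetical service name

def log(level: str, message: str, **fields) -> None:
    """Emit one JSON log line with shared fields plus event-specific ones."""
    record = {
        "ts": time.time(),
        "level": level,
        "service": SERVICE,
        "message": message,
        **fields,  # e.g. request_id, user_id, latency_ms
    }
    sys.stdout.write(json.dumps(record) + "\n")

log("info", "payment authorized", request_id="req-123", user_id="u-42", latency_ms=87)
```

Because every line carries the same keys, logs can be filtered by `request_id` across services, which is what makes the "trace a user request through all the services it touches" goal achievable.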

3. Alerting

Alerts are how systems tell humans that something needs attention. Good alerting catches real problems early. Bad alerting creates noise that gets ignored.

Principles for effective alerting:

  • Alert on user impact, not internal metrics. An alert should mean "users are affected" or "users will be affected soon." High CPU is not inherently a problem—failed requests are.

  • Every alert should be actionable. If there's nothing to do when an alert fires, it shouldn't be an alert. Consider making it a dashboard metric instead.

  • Reduce noise aggressively. Alert fatigue is real. If on-call ignores alerts because most are noise, you'll miss the real ones.

  • Include context in alerts. The alert should link to a dashboard, a runbook, and have enough information to start diagnosis.
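Alerting on symptoms rather than causes often reduces to a burn-rate check against the SLO: page only when errors are consuming the budget fast enough to matter. A sketch, with an illustrative fast-burn threshold:

```python
# Sketch of a burn-rate alert condition. The 14.4x threshold is a common
# fast-burn choice: at that rate, one hour consumes ~2% of a 30-day budget.

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on target' the budget is burning."""
    budget = 1 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def should_page(error_rate: float, slo_target: float, threshold: float = 14.4) -> bool:
    return burn_rate(error_rate, slo_target) >= threshold

print(should_page(0.02, 0.999))    # 20x burn rate → True, page
print(should_page(0.0005, 0.999))  # 0.5x burn rate → False, within budget
```

The payoff is that the alert fires on user impact by construction: a high error rate on a loose SLO stays quiet, while a modest error rate on a tight SLO pages.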

Common mistakes:

  • Paging for non-urgent issues.
  • Alerts without runbooks.
  • Multiple alerts for the same underlying problem.
  • Thresholds that trigger during normal operations.

4. Incident management

When things go wrong, a clear incident process ensures problems are resolved quickly and consistently. Chaos during incidents extends outages and burns out responders.

Key elements:

  • Clear roles: Incident Commander (coordinates), Technical Lead (diagnoses), Communications Lead (updates stakeholders). Roles can be combined for smaller incidents.

  • Severity levels: Define what constitutes P1, P2, P3. This determines who gets paged, how quickly, and what the response expectations are.

  • Communication channels: Dedicated incident channel, status page updates, stakeholder notifications. Don't let communication happen in scattered threads.

  • Escalation paths: When to escalate, to whom, and how. Make it easy for responders to get help.

  • Post-incident process: Every significant incident gets a postmortem. No exceptions.

See Delivery: Incident Response for detailed incident playbooks.

5. Postmortems

Postmortems are how you learn from failures. A good postmortem identifies what went wrong, why, and what will prevent recurrence. A bad postmortem assigns blame and gets filed without action.

Principles:

  • Blameless: Focus on systems and processes, not individuals. The question is "what allowed this to happen?" not "who did this?"

  • Thorough: Understand the full timeline, contributing factors, and near-misses.

  • Action-oriented: Every postmortem should produce concrete follow-up items with owners and deadlines.

  • Shared: Postmortems should be readable by anyone in the organization. Transparency builds trust and spreads learning.

What a postmortem should include:

  • Summary and impact (what happened, who was affected, for how long)
  • Timeline (detailed sequence of events)
  • Root cause analysis (what failed and why)
  • Contributing factors (what made it worse or harder to resolve)
  • What went well (what worked in the response)
  • Action items (specific, owned, time-bound)

See Resources: Postmortem Template for a copy-pastable format.

6. On-call

On-call is how teams maintain accountability for their systems outside working hours. Good on-call is sustainable and effective. Bad on-call burns out engineers and doesn't actually improve reliability.

Principles:

  • You build it, you run it. Teams that are responsible for production are more careful about what they ship.

  • On-call should be sustainable. If on-call means regular sleep deprivation, something is broken—either the systems or the process.

  • Compensate appropriately. On-call is a burden. Acknowledge it with time off, pay, or other compensation.

  • Reduce toil. Track what on-call pages for. If the same issues recur, invest in fixing them rather than accepting the burden.

Healthy on-call signals:

  • Pages are rare (< 2 per week per person).
  • Most pages are actionable and require human judgment.
  • Runbooks exist for common issues.
  • Handoffs are smooth with clear context.

Unhealthy on-call signals:

  • Frequent pages for non-urgent issues.
  • Responders don't know what to do when paged.
  • On-call weeks are dreaded and avoided.
  • No follow-up on recurring issues.

Roles and ownership

| Role | Responsibilities |
| ---- | ---------------- |
| Service-owning teams | Define and maintain SLOs for their services. Respond to incidents. Write and maintain runbooks. Participate in on-call. |
| Platform/SRE team | Provide observability tooling and standards. Support incident response. Consult on reliability improvements. Own shared infrastructure SLOs. |
| Engineering Leadership | Prioritize reliability investment. Enforce error budget policies. Ensure on-call is sustainable. Model blameless culture. |
| Product Leadership | Understand SLOs and their implications. Accept trade-offs when error budgets are exhausted. Advocate for reliability when needed. |

Templates and artifacts

SLO definition template

# SLO: [Service Name]

**Owner:** [Team]
**Last reviewed:** [Date]

## User journey

What user-facing behavior does this SLO protect?

[e.g., "Users can successfully complete checkout within acceptable time"]

## SLI (Service Level Indicator)

| SLI           | Definition                                         | Measurement     |
| ------------- | -------------------------------------------------- | --------------- |
| Availability  | Percentage of requests that succeed (HTTP 2xx/3xx) | [Metric source] |
| Latency (p95) | 95th percentile response time                      | [Metric source] |

## SLO Targets

| SLI           | Target  | Measurement window |
| ------------- | ------- | ------------------ |
| Availability  | 99.9%   | Rolling 30 days    |
| Latency (p95) | < 500ms | Rolling 30 days    |

## Error budget

- **Availability budget:** 0.1% = ~43 minutes/month of downtime
- **Current burn rate:** [X]% of monthly budget consumed

## Error budget policy

- **Budget healthy (> 50% remaining):** Normal feature development
- **Budget caution (25–50% remaining):** Prioritize reliability work alongside features
- **Budget critical (< 25% remaining):** Freeze non-critical changes; focus on reliability
- **Budget exhausted:** Feature freeze until reliability improves

## Review schedule

- Weekly: Check burn rate in team standup
- Monthly: Review SLO performance in team retro
- Quarterly: Revisit targets and adjust if needed
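The budget policy bands in the template map directly to a threshold check; a sketch using the template's own bands (labels and return strings are illustrative):

```python
# Sketch of the error budget policy above as a simple lookup.
# Band boundaries mirror the template: healthy > 50%, caution 25-50%,
# critical < 25%, exhausted at 0.

def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (0.0-1.0) to the policy band from the template."""
    if remaining_fraction <= 0:
        return "exhausted: feature freeze until reliability improves"
    if remaining_fraction < 0.25:
        return "critical: freeze non-critical changes; focus on reliability"
    if remaining_fraction <= 0.50:
        return "caution: prioritize reliability work alongside features"
    return "healthy: normal feature development"

print(budget_policy(0.8))  # → healthy: normal feature development
print(budget_policy(0.1))  # → critical: freeze non-critical changes; focus on reliability
```

Encoding the policy this way keeps the weekly burn-rate check in the template mechanical: the team reads one number and gets one prescribed response.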

On-call health checklist

# On-Call Health Review: [Team]

**Period:** [Date range]
**Reviewed by:** [Name]

## Metrics

- Total pages: [#]
- Pages per on-call person: [#]
- Pages outside business hours: [#]
- Mean time to acknowledge: [X min]
- Mean time to resolve: [X min]

## Page analysis

| Category                   | Count | % of total |
| -------------------------- | ----- | ---------- |
| Actionable, required human |       |            |
| Could be automated         |       |            |
| Noise / false positive     |       |            |
| Duplicate / cascading      |       |            |

## Top recurring issues

1. [Issue]: [# occurrences] — [Status: Fixed / In progress / Needs investment]
2. [Issue]: [# occurrences] — [Status]
3. [Issue]: [# occurrences] — [Status]

## Runbook coverage

- Services with runbooks: [X/Y]
- Runbooks used this period: [#]
- Runbooks needing updates: [List]

## Team sentiment

How did on-call feel this period? (Survey or discussion)

[Summary]

## Actions

- [ ] [Action to reduce page volume or improve response]
- [ ] [Action to improve runbooks or tooling]

Reliability review meeting agenda

# Reliability Review: [Team/Service]

**Date:** [Date]
**Attendees:** [Team members, stakeholders]

## Agenda

1. **SLO performance (10 min)**
   - Are we meeting our SLOs?
   - Error budget status
   - Any SLO violations to discuss?

2. **Incident review (15 min)**
   - Incidents since last review
   - Status of postmortem action items
   - Patterns or recurring themes

3. **On-call health (10 min)**
   - Page volume and quality
   - Team feedback on on-call burden
   - Toil reduction opportunities

4. **Observability gaps (10 min)**
   - Any blind spots identified?
   - Tooling or dashboard needs?

5. **Reliability investments (10 min)**
   - Current reliability initiatives: status
   - Prioritization for next period

## Decisions

[Record decisions]

## Actions

- [ ] [Action with owner and deadline]

Signals that reliability practices are working

| Signal | What it indicates |
| ------ | ----------------- |
| SLOs are met consistently | Reliability targets are appropriate and achievable |
| Error budgets drive prioritization | Trade-off mechanism is working |
| Incidents are resolved quickly | Process and observability are effective |
| Postmortem actions get completed | Learning loop is closing |
| On-call is sustainable | Systems are stable; toil is managed |
| Reliability is part of planning | Cultural integration is happening |

Failure modes and mitigations

| Failure mode | What it looks like | Mitigation |
| ------------ | ------------------ | ---------- |
| SLOs without teeth | SLOs are missed but nothing changes | Enforce error budget policies; make consequences real |
| Alert fatigue | Too many alerts; responders ignore them | Audit alerts quarterly; remove noise aggressively |
| Blameful postmortems | People hide mistakes; learning doesn't happen | Model blameless behavior from leadership; focus on systems |
| Unsustainable on-call | Burnout, attrition, people avoiding rotations | Track toil; invest in fixes; compensate fairly |
| Observability as afterthought | Can't diagnose issues; incidents take longer | Make observability a requirement for shipping |
| Reliability as someone else's job | Product teams don't feel ownership | "You build it, you run it"; include reliability in team goals |