The fastest way to make an on-call rotation worthless is to page people for things that don't matter. Once engineers learn that 90% of pages are noise, they stop reading them — and the one that mattered gets muted along with the rest. This article is about building observability that answers real questions and an alerting layer engineers trust enough to wake up for. It assumes you already run Prometheus or similar and know what a metric is.

Monitoring vs Observability

The distinction is not marketing. Monitoring answers questions you knew to ask in advance: is CPU high, is the queue backing up, is the error rate above 1%. Observability is the property of being able to ask new questions about your system's behaviour after the fact, without shipping new code — typically by retaining high-cardinality, correlated signals you can slice arbitrarily.

You need both. Monitoring catches known failure modes cheaply; observability lets you debug the novel incident at 3am that no dashboard anticipated.

The Three Pillars and Where Each Earns Its Keep

| Pillar  | Best for | Weakness | Cost driver |
|---------|----------|----------|-------------|
| Metrics | Alerting, trends, SLOs; cheap to store, fast to query | Pre-aggregated, low context | Cardinality (label combinations) |
| Logs    | Forensic detail, exact error messages | Expensive at volume, hard to aggregate | Volume (GB ingested) |
| Traces  | Latency attribution across services | Sampling loses tail detail | Span volume / retention |

The mistake is treating these as interchangeable. Alert on metrics — they are cheap and deterministic. Use traces to find which service in a request path is slow. Use logs to find why once you know where to look. An alert built on log scraping is usually a sign you should have emitted a metric.

USE for Resources, RED for Services

Two complementary methods stop you from guessing what to measure.

USE (Brendan Gregg) — for every resource (CPU, memory, disk, network, connection pools):

  • Utilisation — how busy it is
  • Saturation — queued work it can't service yet
  • Errors — error events

RED (Tom Wilkie) — for every request-driven service:

  • Rate — requests per second
  • Errors — failed requests per second
  • Duration — latency distribution

USE finds the saturated resource causing a problem; RED tells you the user-facing impact. A disk at 95% utilisation (USE) only warrants a page if it is driving up request duration or errors (RED); otherwise it is a capacity-planning item for working hours.
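
As a rough illustration of the RED signals, the sketch below derives rate, error rate and a latency percentile from one window of request records. The Request type, its fields, the "5xx means failed" rule and the nearest-rank percentile are all assumptions made for this example; in a real service a metrics library would export counters and histograms and the backend would do this arithmetic.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code of the response
    duration_s: float  # request latency in seconds


def red_summary(requests, window_s, quantile=0.99):
    """Return (rate, error_rate, latency_quantile) for one observation window."""
    if not requests:
        return 0.0, 0.0, 0.0
    rate = len(requests) / window_s                          # Rate: requests/s
    errors = sum(1 for r in requests if r.status >= 500) / window_s  # Errors: failures/s
    durations = sorted(r.duration_s for r in requests)
    # Duration: nearest-rank percentile over the sorted latencies
    rank = max(1, math.ceil(quantile * len(durations)))
    return rate, errors, durations[rank - 1]
```

Reporting a percentile rather than a mean matters here: a handful of very slow requests barely moves the average but is exactly what users notice.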

SLIs, SLOs and Error Budgets as the Basis for Alerting

Stop alerting on raw thresholds and start alerting on whether you are about to breach a promise. An SLI is a measured ratio of good events to total events; an SLO is the target for that ratio; the error budget is 1 − SLO.

If your availability SLO is 99.9% over 30 days, your error budget is 0.1% — roughly 43 minutes of downtime. Alerts should fire based on how fast you are consuming that budget, not on instantaneous symptoms.
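
The arithmetic behind that figure is short enough to write down. This is a minimal sketch, assuming an SLO expressed as a fraction (0.999) and a rolling window in days; the function names are hypothetical.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full downtime the SLO tolerates over the window."""
    budget_fraction = 1.0 - slo            # error budget = 1 - SLO
    return budget_fraction * window_days * 24 * 60


def budget_remaining(slo, good_events, total_events):
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    if total_events == 0:
        return 1.0                         # no traffic, nothing spent
    sli = good_events / total_events       # SLI = good events / total events
    allowed_bad = 1.0 - slo
    actual_bad = 1.0 - sli
    return 1.0 - actual_bad / allowed_bad
```

error_budget_minutes(0.999) comes out at about 43.2, and budget_remaining(0.999, 999_500, 1_000_000) at about 0.5: an observed SLI of 99.95% against a 99.9% target means half the budget is already spent.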

Symptom-Based Alerting Kills Fatigue

Alert on user pain, not on every contributing cause. A single node having high CPU is a cause; elevated checkout latency is a symptom. If users aren't affected, it can wait for business hours. This single principle eliminates the majority of noisy pages.

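The most effective pattern is the multi-window, multi-burn-rate SLO alert: a fast-burn alert catches sudden outages and a slow-burn alert catches gradual degradation. Each alert requires both a long and a short window to exceed the threshold. The long window keeps brief blips from firing, and the short window lets the alert clear soon after the errors stop, which suppresses flapping.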

# Prometheus rule: multi-window burn-rate alert on a request-error SLO (99.9%)
groups:
- name: slo-burn-rate
  rules:
  - alert: ErrorBudgetFastBurn
    # 14.4x burn over 1h AND 5m confirms a real, fast outage
    expr: |
      (
        job:slo_errors:ratio_rate1h{job="checkout"} > (14.4 * 0.001)
        and
        job:slo_errors:ratio_rate5m{job="checkout"} > (14.4 * 0.001)
      )
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "Checkout burning error budget fast (14.4x)"
      runbook: "https://runbooks.example.com/checkout-error-budget"

  - alert: ErrorBudgetSlowBurn
    # 6x burn over 6h AND 30m catches slow degradation
    expr: |
      (
        job:slo_errors:ratio_rate6h{job="checkout"} > (6 * 0.001)
        and
        job:slo_errors:ratio_rate30m{job="checkout"} > (6 * 0.001)
      )
    for: 15m
    labels:
      severity: ticket
    annotations:
      summary: "Checkout slow error-budget burn (6x)"
      runbook: "https://runbooks.example.com/checkout-error-budget"

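The job:slo_errors:ratio_rate* series come from recording rules, one per window the alerts reference (5m, 30m, 1h and 6h), which precompute the ratio so the alert query stays cheap. The 1h rule is shown here; the others differ only in the range:

- record: job:slo_errors:ratio_rate1h
  # sum by (job) keeps the job label; a bare sum() would drop it and
  # the alerts' {job="checkout"} selector would match nothing
  expr: |
    sum by (job) (rate(http_requests_total{job="checkout",code=~"5.."}[1h]))
      /
    sum by (job) (rate(http_requests_total{job="checkout"}[1h]))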
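
The alerts above reference four windows, so four recording rules are needed. One way to avoid hand-copying them is to render the rules from a template. The sketch below is only an illustration: the render_recording_rules helper and the WINDOWS list are hypothetical, and it aggregates with sum by (job) so the recorded series keep the job label that the alert selectors match on.

```python
# Windows referenced by the fast-burn (1h, 5m) and slow-burn (6h, 30m) alerts.
WINDOWS = ["5m", "30m", "1h", "6h"]

# Doubled braces are literal braces in str.format templates.
TEMPLATE = """\
- record: job:slo_errors:ratio_rate{window}
  expr: |
    sum by (job) (rate(http_requests_total{{job="{job}",code=~"5.."}}[{window}]))
      /
    sum by (job) (rate(http_requests_total{{job="{job}"}}[{window}]))
"""


def render_recording_rules(job, windows=WINDOWS):
    """Return YAML text with one error-ratio recording rule per window."""
    return "".join(TEMPLATE.format(job=job, window=w) for w in windows)


if __name__ == "__main__":
    print(render_recording_rules("checkout"))
```

The output is meant to be pasted under a rules: key and validated with promtool check rules before it is deployed.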

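The burn-rate multipliers answer the question "what fraction of the monthly budget would this error rate consume?". A burn rate of 1x spends exactly the whole budget over the SLO window. 14.4x sustained for one hour consumes 2% of a 30-day budget, and would exhaust it in about 50 hours, which is fast enough to warrant a page; 6x sustained for six hours consumes 5%. Tune the windows and multipliers to your SLO rather than copying them blindly.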
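
To derive multipliers for a different SLO window or tolerance, the relationship can be sketched as below. The function names are hypothetical, and should_fire only mirrors the "long AND short window" condition of the rules above; it is not how Prometheus evaluates them.

```python
def burn_rate(budget_fraction, alert_window_hours, slo_window_days=30):
    """Burn rate that spends `budget_fraction` of the budget within the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(rate, slo):
    """Error ratio at which the service burns budget at `rate` times the sustainable pace."""
    return rate * (1.0 - slo)


def should_fire(long_ratio, short_ratio, rate, slo):
    """Multi-window condition: both windows must exceed the same threshold."""
    threshold = error_ratio_threshold(rate, slo)
    return long_ratio > threshold and short_ratio > threshold
```

burn_rate(0.02, 1) gives 14.4 and burn_rate(0.05, 6) gives 6, the two multipliers used in the rules above; for a 99.9% SLO the fast-burn threshold is therefore an error ratio of 0.0144.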

Severity, Routing and On-Call Hygiene

Two severities are enough for most teams: page (wake a human now) and ticket (handle in hours). Route them differently and never page on a ticket-grade signal.

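# Alertmanager: route by severity and group related alerts to avoid notification storms
# NOTE: Alertmanager does not expand ${VAR} in its config. Render this file with a
# templating step (e.g. envsubst) before loading, or use service_key_file / api_url_file.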
route:
  receiver: default-ticket
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: [ severity="page" ]
      receiver: pagerduty-oncall
      group_wait: 10s
      repeat_interval: 1h
    - matchers: [ severity="ticket" ]
      receiver: slack-eng

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: ${PD_SERVICE_KEY}
  - name: slack-eng
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts'
  - name: default-ticket
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
        channel: '#alerts-low'

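group_by collapses alerts that share the listed label values into one notification, so a bad deploy that trips fifty instances produces one page rather than fifty. The labels must actually exist on your alerts: the rules above carry job, so either add a service label to them or group by job instead. Grouping too finely brings the storm back; grouping too coarsely lumps unrelated incidents into one notification. Track page volume per rotation as a first-class metric; a sustained rate above a handful of actionable pages per shift is a fatigue problem to fix, not tolerate.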
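
The effect of grouping is easy to see in a toy model. The sketch below is a simplification for illustration only, not Alertmanager's implementation: alerts are plain label dictionaries, and group_alerts is a hypothetical helper that buckets them by the group_by labels.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname", "service")):
    """Bucket alerts (dicts of labels) by their values for the group_by labels."""
    groups = defaultdict(list)
    for labels in alerts:
        # A missing label is treated as an empty value, so such alerts share a bucket.
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


if __name__ == "__main__":
    # Fifty instances of the same service failing after one bad deploy.
    storm = [
        {"alertname": "HighErrorRate", "service": "checkout", "instance": f"pod-{i}"}
        for i in range(50)
    ]
    print(len(group_alerts(storm)))                        # one notification group
    print(len(group_alerts(storm, group_by=("instance",))))  # one group per instance
```

Grouping by a label that differs per alert, such as instance, yields as many groups as alerts, which is the notification storm that grouping is supposed to prevent.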

Every Alert Links to a Runbook

An alert with no runbook is a riddle delivered at 3am. Make the runbook link a required annotation — enforce it in CI so rules without one fail the build.

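# CI gate: reject alert rules missing a runbook annotation
# Assumes mikefarah yq v4. "yq -e" alone does not fail when only some rules lack
# the annotation, so list the offending alerts and fail if the list is non-empty.
promtool check rules rules/*.yml || exit 1
missing=$(yq -N '.groups[].rules[] | select(has("alert")) | select(.annotations.runbook == null) | .alert' rules/*.yml)
[ -z "$missing" ] || { echo "alerts missing runbook annotation: $missing"; exit 1; }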

The runbook should state what the alert means, the likely causes, the first three diagnostic queries to run, and how to mitigate — not how to write Prometheus.

Cardinality and Cost

Metric cost is driven by cardinality: every unique combination of label values is a separate time series. Putting a user ID, request ID or raw URL in a label can explode one metric into millions of series and bankrupt your storage. Keep labels bounded (status code, route template, region) and push high-cardinality identifiers into traces and logs where they belong.

Before adding a label, ask: how many distinct values can it take, multiplied across all other labels? If the answer is unbounded, it does not belong on a metric.
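
That multiplication can be sketched as a quick check to run before a new label ships. The label names and value counts below are made-up examples, and the product is a worst-case upper bound: real series counts are lower when some label combinations never occur.

```python
import math


def worst_case_series(label_value_counts):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return math.prod(label_value_counts.values())


if __name__ == "__main__":
    bounded = {"status": 5, "route": 40, "region": 4}
    print(worst_case_series(bounded))                # 800 series: fine

    with_user_id = {**bounded, "user_id": 1_000_000}
    print(worst_case_series(with_user_id))           # 800,000,000 series: not fine
```

A single unbounded label multiplies everything that was already there, which is why one careless user_id turns hundreds of series into hundreds of millions.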

OpenTelemetry and Tooling

Instrument with OpenTelemetry rather than vendor SDKs. It is vendor-neutral, so you emit once and route metrics, logs and traces to whatever backend you choose — and you can switch backends without re-instrumenting. A pragmatic open-source stack:

  • Prometheus — metrics and alert evaluation
  • Alertmanager — routing, grouping, silencing
  • Grafana — dashboards across all three pillars
  • Loki — logs, queried with the same label model as metrics
  • Tempo — traces, with exemplars linking a latency spike straight to a trace

The goal is a tight loop: an SLO burn-rate alert pages you, the dashboard exemplar jumps you to the slow trace, and the trace's span points you at the log line explaining why. Build that loop, gate every rule on a runbook, alert on symptoms, and your team will stop muting the pager.

