Stopping Production Incidents Before They Happen

Most organisations treat reliability as an operational afterthought: build the feature, ship it, then wire up alerts and hope the on-call rota absorbs the fallout. That model scales badly. Once you cross a few hundred deploys a month, the only sustainable strategy is to shift reliability left — to make the system resistant to failure before code reaches production. This article covers the concrete engineering practices that prevent incidents rather than merely respond to them.

Reliability is a property you design in, not bolt on

The cheapest incident is the one that never happens, and the second cheapest is the one a machine resolved at 03:00 without paging anyone. Proactive reliability work targets both. It rests on a simple premise: every dependency will eventually fail, every node will eventually die, and every release carries risk. Your job is to make those events boring.

That requires four pillars working together: pre-production gates that catch regressions, progressive delivery that limits blast radius, capacity engineering that absorbs load spikes, and resilience patterns that contain failures. Underpinning all of them is observability — without it, the rest is guesswork.

Pre-production gates: fail in CI, not in prod

Functional tests tell you the code behaves correctly for the cases you thought to test. They say nothing about how it behaves under sustained load, memory pressure, or a flaky downstream. Add these gates before promotion:

Load testing to validate throughput against SLO targets, run on every release candidate.
Soak testing — hold realistic load for hours to surface memory leaks, connection-pool exhaustion, and file-descriptor creep that a 5-minute test will never reveal.
Dependency failure injection in a staging environment: kill the database, add 2s of latency to a downstream API, drop 10% of packets, and assert the service degrades gracefully rather than cascading.

# k6 soak test gating a release in CI
# Thresholds are declared in soak.js under options.thresholds, for example:
#   'http_req_duration{expected_response:true}': ['p(99)<400'],
#   http_req_failed: ['rate<0.01'],
k6 run --vus 200 --duration 2h soak.js
# k6 exits non-zero when a threshold is breached, which fails the pipeline stage

Gate the pipeline on these results. A soak test that nobody reads is theatre; a soak test that blocks promotion is engineering.
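
The pass/fail decision behind such a gate is small enough to sketch. The Python below is a hypothetical illustration, not how k6 evaluates thresholds: the function names, the nearest-rank percentile method, and the sample numbers are all assumptions, and the limits simply mirror the p99 < 400 ms and error rate < 1% targets used above.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def gate_passes(latencies_ms, failed, total,
                p99_limit_ms=400, max_error_rate=0.01):
    """Return True only if both the latency and error-rate limits hold."""
    if total == 0 or not latencies_ms:
        return False  # treat missing data as a failed gate, never a pass
    p99 = percentile(latencies_ms, 99)
    error_rate = failed / total
    return p99 < p99_limit_ms and error_rate < max_error_rate

# Hypothetical soak-test results: 1,000 requests, 5 of them failed.
latencies = [120] * 985 + [350] * 10 + [900] * 5
print(gate_passes(latencies, failed=5, total=1000))
```

In a pipeline, the boolean would be turned into the process exit code so that a failed gate stops promotion. Treating an empty result set as a failure matters: a test that silently produced no data should not wave a release through.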

Progressive delivery: shrink the blast radius

A big-bang deploy means a bad release hits 100% of traffic instantly. Progressive delivery exposes new code to a small slice of traffic, watches the SLOs, and rolls back automatically before most users are affected. Argo Rollouts makes this declarative:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
spec:
  replicas: 12
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
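        # SLO analysis runs in the background (see analysis below). An inline
        # analysis step using a template with an interval but no count would
        # never complete, so the rollout would stall at this step.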
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
      analysis:
        templates:
          - templateName: error-rate-slo
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-slo
spec:
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.01   # Prometheus returns a vector; check its first element
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{job="checkout-api",code=~"5.."}[2m]))
            / sum(rate(http_requests_total{job="checkout-api"}[2m]))

If the canary's 5xx rate breaches the SLO more than twice (failureLimit: 2 tolerates two failed measurements, so the third one fails the analysis), the rollout aborts and reverts. Compare the common strategies:

Strategy	Blast radius	Rollback speed	Infra cost	Best for
Blue-green	Full (instant cutover)	Instant (flip back)	2x during cutover	Stateless apps needing instant rollback
Canary	Small, growing	Fast (abort + revert)	Marginal	Most stateless services
Feature flags	Per-user/segment	Instant (toggle)	None	Decoupling deploy from release

Feature flags are the sharpest tool here: they let you deploy code dark and release it independently, ramping a feature to 1% of users and switching it off without a redeploy, typically within seconds depending on how quickly flag changes propagate.
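
A percentage ramp is usually built on stable hashing, so that a given user stays in or out of the cohort as the percentage grows. The sketch below is a minimal illustration under that assumption; `bucket`, `is_enabled`, the flag name, and the kill switch parameter are hypothetical and do not reflect any particular flag service's API.

```python
import hashlib

def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000

def is_enabled(flag_name: str, user_id: str,
               rollout_percent: float, kill_switch: bool = False) -> bool:
    """Enable the flag for roughly rollout_percent of users."""
    if kill_switch:
        return False  # the kill switch overrides any rollout percentage
    # 1% corresponds to 100 of the 10,000 buckets.
    return bucket(flag_name, user_id) < rollout_percent * 100

# Hypothetical usage: count how many of 10,000 users see a 1% ramp.
enabled = sum(is_enabled("new-checkout", str(uid), 1) for uid in range(10_000))
print(enabled)
```

Hashing the flag name together with the user ID keeps cohorts independent across flags, so the same 1% of users are not the test subjects for every new feature. Because the bucket is fixed per user, raising the percentage only adds users and never removes anyone already enabled.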

Capacity planning and sane autoscaling

Outages frequently trace back to running too hot. Autoscaling is necessary but not sufficient — it has a reaction lag, and a cold-start storm during a traffic spike can be worse than the spike itself. Keep deliberate headroom and combine pod-level and node-level scaling.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1    # target the Rollout defined above, not a Deployment
    kind: Rollout
    name: checkout-api
  minReplicas: 6          # never scale below baseline + headroom
  maxReplicas: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # damp flapping
    scaleUp:
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30             # double fast when needed
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60        # 40% headroom for spikes

Pair the HPA with a node provisioner such as Karpenter so that the pods it adds do not sit Pending for want of nodes. Target around 60% utilisation (which the HPA measures against the pods' CPU requests) rather than 90%: the 40% headroom is what absorbs a spike while new capacity provisions.
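
To see why the target matters, it helps to work through the scaling rule the HPA documents: desired replicas = ceil(current replicas x current metric / target metric), with no change while the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below applies that rule with the values from the manifest above. It is a simplification: it ignores the behavior policies, the stabilisation window, and pod readiness handling, and the function name is hypothetical.

```python
import math

def desired_replicas(current: int, current_util: float, target_util: float = 60,
                     min_replicas: int = 6, max_replicas: int = 60,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA scaling rule, clamped to the min/max replica bounds."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current  # close enough to target: do not scale
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(10, 90))   # 15: 90% against a 60% target scales up by half
print(desired_replicas(10, 62))   # 10: within tolerance, no change
print(desired_replicas(10, 30))   # 6: would be 5, clamped to minReplicas
print(desired_replicas(40, 120))  # 60: would be 80, clamped to maxReplicas
```

With a 90% target, the same pods would have to run far past saturation before the ratio moved enough to trigger a scale-up, which is the lag the headroom is there to cover.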

Autoscaling reacts; it does not predict. For known events (sales, launches, batch windows) pre-scale on a schedule. Reactive scaling alone will always lag the leading edge of a spike.

A PodDisruptionBudget protects availability during voluntary disruptions like node upgrades or Karpenter consolidation:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: checkout-api
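
A percentage-based budget interacts with replica count in a way worth checking by hand, because Kubernetes rounds a percentage minAvailable up to a whole pod. The sketch below works through that arithmetic; the function name is hypothetical and the replica counts are illustrative.

```python
def allowed_disruptions(healthy_pods: int, expected_pods: int,
                        min_available_percent: int) -> int:
    """Pods that may be evicted voluntarily under a percentage minAvailable."""
    # Integer ceiling of expected_pods * percent / 100 (percentages round up).
    required = (expected_pods * min_available_percent + 99) // 100
    return max(0, healthy_pods - required)

print(allowed_disruptions(12, 12, 80))  # 2: ceil(9.6) = 10 pods must stay up
print(allowed_disruptions(6, 6, 80))    # 1: ceil(4.8) = 5 pods must stay up
print(allowed_disruptions(4, 4, 80))    # 0: ceil(3.2) = 4, so no eviction is allowed
```

The last case is the one to watch: at four replicas or fewer, an 80% budget permits no voluntary evictions at all, which can block node drains and upgrades. With the HPA floor of 6 replicas shown earlier, one pod at a time can still be moved.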

Resilience patterns and graceful degradation

Map your dependencies and classify each as critical or non-critical. A non-critical dependency (recommendations, analytics) must never take down a critical path (checkout). Enforce that with the standard resilience primitives:

Timeouts on every network call — an unbounded call is a latent outage.
Retries with exponential backoff and jitter — synchronised retries cause thundering herds; jitter spreads them.
Circuit breakers to stop hammering a failing dependency and fail fast instead.
Bulkheads — isolate connection pools and thread pools per dependency so one slow service cannot exhaust resources shared with healthy ones.
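
The retry item above can be sketched in a few lines. This is a minimal illustration of exponential backoff with full jitter, not a production client: `call_with_retries`, its parameters, and the choice of retryable exceptions are assumptions, and the wrapped call is expected to enforce its own timeout.

```python
import random
import time

def call_with_retries(fn, attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep,
                      retry_on=(TimeoutError, ConnectionError)):
    """Call fn, retrying transient failures with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The ceiling doubles on each attempt, up to the cap.
            ceiling = min(cap, base * 2 ** attempt)
            # Full jitter: pick a random delay in [0, ceiling] so that clients
            # which failed at the same moment do not retry at the same moment.
            sleep(random.uniform(0, ceiling))

# Hypothetical usage: a call that times out twice, then succeeds.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("downstream timed out")
    return "ok"

print(call_with_retries(flaky, sleep=lambda seconds: None))
```

Only errors that are plausibly transient are retried here; retrying a validation error or a non-idempotent write would add load without any chance of success. Capping both the number of attempts and the delay keeps the worst-case latency of a call bounded.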

When a non-critical dependency trips its breaker, degrade gracefully: serve a cached or default response rather than an error. The user sees a slightly worse experience, not a 500.
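
A circuit breaker with a fallback can be sketched as a small state machine. The version below is a single-threaded illustration with hypothetical names and thresholds; a real service would more likely use an established library or a service mesh policy, and would need locking around the shared state.

```python
import time

class CircuitBreaker:
    """Minimal breaker: closed -> open after repeated failures -> one trial call."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast without touching the dependency
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()  # degrade gracefully instead of raising
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result

# Hypothetical usage: recommendations are non-critical, so fall back to defaults.
def fetch_recommendations():
    raise TimeoutError("recommendations service timed out")

breaker = CircuitBreaker(failure_threshold=2, reset_after=30.0)
for _ in range(3):
    print(breaker.call(fetch_recommendations, fallback=lambda: ["bestsellers"]))
```

The third call in the usage example never reaches the failing service: the breaker opened after the second failure and returns the default list immediately. That is the behaviour that keeps a slow recommendations service from consuming the checkout path's threads and connections.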

Chaos engineering and game days

Once the patterns are in place, prove they work. Chaos engineering injects controlled failure into production-like environments to validate that the system behaves as designed. Start small — terminate one pod, add latency to one dependency — and form a hypothesis ("checkout SLO holds when the recommendations service is down") before each experiment. Game days take this further: a scheduled, organised exercise where the team rehearses a realistic failure end to end, including the human response and runbooks. The findings almost always expose a missing alert, a stale runbook, or a dependency you did not know was on the critical path.

Error budgets gate risky work

Error budgets convert reliability into a currency. If your SLO is 99.9% availability, you have roughly 43 minutes of allowable downtime a month. While there is budget remaining, ship aggressively. When the budget is exhausted, freeze risky releases and divert effort to reliability until it recovers. This removes the perennial dev-versus-ops argument by replacing opinion with a shared, data-driven policy.
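
The arithmetic behind that figure, and the release policy built on it, fits in a few lines. This is a sketch under simple assumptions: a 30-day window, availability measured in minutes of downtime, and a hypothetical `release_allowed` rule that freezes risky releases once the budget is spent.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowable downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget

def release_allowed(slo_percent: float, downtime_minutes: float) -> bool:
    """Hypothetical policy: risky releases proceed only while budget remains."""
    return budget_remaining(slo_percent, downtime_minutes) > 0

print(round(error_budget_minutes(99.9), 1))   # 43.2 minutes in a 30-day window
print(round(error_budget_minutes(99.99), 2))  # 4.32 minutes: each extra nine costs 10x
print(release_allowed(99.9, downtime_minutes=10))  # True: budget remains
print(release_allowed(99.9, downtime_minutes=50))  # False: budget exhausted
```

Many teams act before the budget reaches zero, for example by slowing releases once a set fraction is spent; the threshold is a policy choice rather than something the arithmetic dictates.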

Production readiness reviews

Before any new service takes production traffic, run it through a Production Readiness Review checklist: defined SLOs and alerts, runbooks for the top failure modes, dashboards, capacity plan, dependency map with degradation behaviour, rollback mechanism, and on-call ownership. The PRR is the gate that ensures observability and resilience exist before the service can hurt anyone — not after the first incident teaches you they were missing.

Reliability is engineered upstream, not patched downstream. If you want a partner to design progressive delivery, chaos programmes, and error-budget policy into your platform before the next outage, talk to i2zone.