Most organisations treat reliability as an operational afterthought: build the feature, ship it, then wire up alerts and hope the on-call rota absorbs the fallout. That model scales badly. Once you cross a few hundred deploys a month, the only sustainable strategy is to shift reliability left — to make the system resistant to failure before code reaches production. This article covers the concrete engineering practices that prevent incidents rather than merely respond to them.
Reliability is a property you design in, not bolt on
The cheapest incident is the one that never happens, and the second cheapest is the one a machine resolved at 03:00 without paging anyone. Proactive reliability work targets both. It rests on a simple premise: every dependency will eventually fail, every node will eventually die, and every release carries risk. Your job is to make those events boring.
That requires four pillars working together: pre-production gates that catch regressions, progressive delivery that limits blast radius, capacity engineering that absorbs load spikes, and resilience patterns that contain failures. Underpinning all of them is observability — without it, the rest is guesswork.
Pre-production gates: fail in CI, not in prod
Functional tests tell you the code does what it should for the inputs you tested. They say nothing about how it behaves under sustained load, memory pressure, or a flaky downstream. Add these gates before promotion:
- Load testing to validate throughput against SLO targets, run on every release candidate.
- Soak testing — hold realistic load for hours to surface memory leaks, connection-pool exhaustion, and file-descriptor creep that a 5-minute test will never reveal.
- Dependency failure injection in a staging environment: kill the database, add 2s of latency to a downstream API, drop 10% of packets, and assert the service degrades gracefully rather than cascading.
```sh
# k6 soak test gating a release in CI.
# Thresholds are declared in soak.js under options.thresholds, for example:
#   'http_req_duration{expected_response:true}': ['p(99)<400']
#   'http_req_failed': ['rate<0.01']
k6 run --vus 200 --duration 2h soak.js
# non-zero exit on threshold breach fails the pipeline stage
```
Gate the pipeline on these results. A soak test that nobody reads is theatre; a soak test that blocks promotion is engineering.
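The same pass/fail rule can be expressed outside the load-testing tool. The following is a minimal Python sketch, assuming you already have a list of request latencies and failure counts from a test run; the function names and the nearest-rank percentile method are illustrative choices, not how k6 computes its thresholds.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def gate_passes(latencies_ms, failed, total,
                p99_limit_ms=400, max_error_rate=0.01):
    """Return True only when both thresholds hold.

    The defaults mirror the soak-test thresholds above:
    p99 latency under 400 ms and fewer than 1% failed requests.
    """
    p99 = percentile(latencies_ms, 99)
    error_rate = failed / total
    return p99 < p99_limit_ms and error_rate < max_error_rate
```

A CI step would call `gate_passes(...)` on the run's results and exit non-zero when it returns `False`, which is what turns the measurement into a gate.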
Progressive delivery: shrink the blast radius
A big-bang deploy means a bad release hits 100% of traffic instantly. Progressive delivery exposes new code to a small slice, watches the SLOs, and rolls back automatically before most users notice. Argo Rollouts makes this declarative:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
spec:
  replicas: 12
  # selector and pod template omitted for brevity
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: error-rate-slo
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
      # background analysis, runs alongside the steps
      analysis:
        templates:
          - templateName: error-rate-slo
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-slo
spec:
  metrics:
    - name: error-rate
      interval: 1m
      # Prometheus returns a vector, so read its first element
      successCondition: result[0] < 0.01
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{job="checkout-api",code=~"5.."}[2m]))
            / sum(rate(http_requests_total{job="checkout-api"}[2m]))
```
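To make the abort rule in the template concrete, here is a minimal Python sketch of how a failure limit could be evaluated. It assumes `failureLimit` is the number of failed measurements the analysis tolerates before it is marked failed; the function name and the `"abort"`/`"continue"` return values are illustrative and not part of Argo Rollouts.

```python
def analysis_outcome(error_rates, threshold=0.01, failure_limit=2):
    """Classify a sequence of per-interval error rates.

    A measurement fails when it is not below the threshold (a NaN from
    an empty query result therefore counts as a failure). The run is
    treated as failed once failures exceed failure_limit.
    """
    failures = 0
    for rate in error_rates:
        if not rate < threshold:
            failures += 1
            if failures > failure_limit:
                return "abort"
    return "continue"
```

Under this reading, two bad intervals are tolerated and the third triggers the abort, so the limit acts as a noise filter rather than a hair trigger.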
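If the 5xx rate breaches the 1% threshold more than twice (failureLimit: 2 tolerates two failed measurements), the analysis fails and the rollout aborts, shifting traffic back to the stable version. Compare the common strategies: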
| Strategy | Blast radius | Rollback speed | Infra cost | Best for |
|---|---|---|---|---|
| Blue-green | Full (instant cutover) | Instant (flip back) | 2x during cutover | Stateless apps needing instant rollback |
| Canary | Small, growing | Fast (abort + revert) | Marginal | Most stateless services |
| Feature flags | Per-user/segment | Instant (toggle) | None | Decoupling deploy from release |
Feature flags are the sharpest tool here: they let you deploy code dark and release it independently, ramping a feature to 1% of users and switching it off without a redeploy, as fast as the flag change propagates to running instances.
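A percentage ramp needs each user to land consistently on the same side of the flag. One common way to get that is to hash the flag name and user ID into a stable bucket. The sketch below assumes that approach; `bucket` and `is_enabled` are hypothetical helpers, not the API of any particular feature-flag product.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map (flag, user) to a stable bucket in the range 0..9999."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(flag_name, user_id, rollout_percent):
    """True if the user falls inside the rollout percentage (0 to 100)."""
    return bucket(flag_name, user_id) < rollout_percent * 100
```

Because the bucket depends only on the flag name and user ID, raising the percentage from 1 to 50 keeps the original 1% enabled, and setting it to 0 turns the feature off for everyone.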
Capacity planning and sane autoscaling
Outages frequently trace back to running too hot. Autoscaling is necessary but not sufficient — it has a reaction lag, and a cold-start storm during a traffic spike can be worse than the spike itself. Keep deliberate headroom and combine pod-level and node-level scaling.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
spec:
  scaleTargetRef:
    # target kind Rollout (argoproj.io/v1alpha1) instead if
    # Argo Rollouts manages this workload
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 6   # never scale below baseline + headroom
  maxReplicas: 60
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # damp flapping
    scaleUp:
      policies:
        - type: Percent
          value: 100
          periodSeconds: 30   # allow doubling every 30s
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # 40% headroom for spikes
```
Pair the HPA with Karpenter for node provisioning so pods are not left Pending when the HPA scales out. Target around 60% utilisation (measured against each pod's CPU request) rather than 90%: the roughly 40% headroom is what absorbs a spike while new capacity provisions.
Autoscaling reacts; it does not predict. For known events (sales, launches, batch windows) pre-scale on a schedule. Reactive scaling alone will always lag the leading edge of a spike.
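The reactive nature of the HPA is visible in its scaling rule, which the Kubernetes documentation gives as desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0. The Python sketch below reproduces that rule under simplifying assumptions: it ignores scale-up policies, stabilisation windows, and unready pods, and the function names are illustrative.

```python
import math


def desired_replicas(current_replicas, current_util, target_util=60,
                     min_replicas=6, max_replicas=60, tolerance=0.1):
    """Replica count the HPA rule would ask for, clamped to min/max."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def load_multiplier_before_saturation(target_util=60):
    """How much load can grow before pods hit 100% of their request."""
    return 100 / target_util
```

At a 60% target, load can grow by about 1.67x before existing pods saturate. The rule only reacts once utilisation has already risen, which is why known events are better handled by raising `minReplicas` ahead of time.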
A PodDisruptionBudget protects availability during voluntary disruptions like node upgrades or Karpenter consolidation:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: checkout-api
```
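It helps to check what a percentage budget means at your actual replica count. The sketch below assumes Kubernetes rounds a percentage `minAvailable` up to the next whole pod, as its documentation describes; the function name is illustrative.

```python
import math


def allowed_disruptions(healthy_pods, expected_pods, min_available_percent):
    """Voluntary evictions permitted right now under a percentage PDB."""
    required = math.ceil(expected_pods * min_available_percent / 100)
    return max(0, healthy_pods - required)
```

With the HPA's floor of 6 replicas, 80% rounds up to 5 required pods, so only one pod can be evicted at a time. With a single replica the budget allows no voluntary evictions at all, which blocks node drains until the budget or replica count changes.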
Resilience patterns and graceful degradation
Map your dependencies and classify each as critical or non-critical. A non-critical dependency (recommendations, analytics) must never take down a critical path (checkout). Enforce that with the standard resilience primitives:
- Timeouts on every network call — an unbounded call is a latent outage.
- Retries with exponential backoff and jitter — synchronised retries cause thundering herds; jitter spreads them.
- Circuit breakers to stop hammering a failing dependency and fail fast instead.
- Bulkheads — isolate connection pools and thread pools per dependency so one slow service cannot exhaust resources shared with healthy ones.
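As an illustration of the retry primitive in the list above, here is a minimal Python sketch of exponential backoff with "full jitter", where each delay is drawn uniformly between zero and a capped exponential bound. The function names, defaults, and the injectable `sleep` and `rng` parameters are assumptions made for the example; it also assumes the wrapped call is idempotent and enforces its own timeout.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield jittered delays bounded by min(cap, base * 2**n)."""
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(func, attempts=4, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call func(); on an exception wait a jittered delay and retry.

    The final failure is re-raised so the caller can fail fast
    or fall back.
    """
    delays = backoff_delays(attempts - 1, base, cap, rng)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(next(delays))
```

The random factor is what prevents a fleet of clients that failed at the same moment from retrying at the same moment.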
When a non-critical dependency trips its breaker, degrade gracefully: serve a cached or default response rather than an error. The user sees a slightly worse experience, not a 500.
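The breaker-plus-fallback behaviour can be sketched in a few lines of Python. This is a deliberately simplified, single-threaded illustration: the class name, thresholds, and injectable `clock` are assumptions for the example, and a production breaker would also need locking and a limit on concurrent half-open trial calls.

```python
import time


class CircuitBreaker:
    """Open after repeated failures, then retry after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast, skip the dependency
            # cool-down elapsed: half-open, let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            half_open = self.opened_at is not None
            if half_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the breaker
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the breaker
        return result
```

For a recommendations call, `fallback` might return a cached or empty list, so checkout renders without the widget instead of returning an error.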
Chaos engineering and game days
Once the patterns are in place, prove they work. Chaos engineering injects controlled failure into production-like environments to validate that the system behaves as designed. Start small — terminate one pod, add latency to one dependency — and form a hypothesis ("checkout SLO holds when the recommendations service is down") before each experiment. Game days take this further: a scheduled, organised exercise where the team rehearses a realistic failure end to end, including the human response and runbooks. The findings almost always expose a missing alert, a stale runbook, or a dependency you did not know was on the critical path.
Error budgets gate risky work
Error budgets convert reliability into a currency. If your SLO is 99.9% availability, you have roughly 43 minutes of allowable downtime in a 30-day month (0.1% of 43,200 minutes). While there is budget remaining, ship aggressively. When the budget is exhausted, freeze risky releases and divert effort to reliability until it recovers. This removes the perennial dev-versus-ops argument by replacing opinion with a shared, data-driven policy.
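The arithmetic behind the budget is simple enough to automate as a release check. The following Python sketch assumes a 30-day window and a policy that blocks risky releases once the budget is fully spent; the function names and the policy itself are illustrative.

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    return (1 - slo) * days * 24 * 60


def budget_remaining(slo, downtime_minutes, days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, days)
    return (budget - downtime_minutes) / budget


def release_allowed(slo, downtime_minutes, days=30):
    """Permit risky releases only while some budget remains."""
    return budget_remaining(slo, downtime_minutes, days) > 0
```

For a 99.9% SLO, `error_budget_minutes(0.999)` comes to about 43.2 minutes, so 20 minutes of downtime leaves just over half the budget and 50 minutes triggers the freeze.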
Production readiness reviews
Before any new service takes production traffic, run it through a Production Readiness Review checklist: defined SLOs and alerts, runbooks for the top failure modes, dashboards, capacity plan, dependency map with degradation behaviour, rollback mechanism, and on-call ownership. The PRR is the gate that ensures observability and resilience exist before the service can hurt anyone — not after the first incident teaches you they were missing.
Reliability is engineered upstream, not patched downstream. If you want a partner to design progressive delivery, chaos programmes, and error-budget policy into your platform before the next outage, talk to i2zone.