Disaster Recovery on AWS That Actually Works

Most disaster recovery plans on AWS fail not because the architecture was wrong, but because nobody ever tested the failover under realistic conditions. A runbook that has never been executed is a hypothesis, not a plan. This article assumes you already know what a VPC, an RDS instance and Route 53 are — the focus here is on choosing a DR strategy that maps to your actual recovery objectives, building it so it can be rebuilt, and proving it works before the incident that matters.

Start With RTO and RPO, Not With Architecture

Every meaningful DR decision flows from two numbers, and they must be set per workload, not per company.

RTO (Recovery Time Objective): how long the business can tolerate the service being down before recovery completes.
RPO (Recovery Point Objective): how much data, measured in time, you can afford to lose.

An RPO of five minutes mandates continuous or near-continuous replication; an RPO of 24 hours means a nightly snapshot is sufficient. An RTO of one minute rules out anything that requires booting infrastructure on demand. These numbers are commercial decisions dressed up as technical ones — get the product owner and finance in the room before you design anything.

Do not set a single RTO/RPO for the whole estate. Your payments path might need an RTO of minutes; your internal reporting dashboard can tolerate hours. Over-engineering the latter to match the former is how DR budgets triple for no business value.

The Four Strategies

AWS frames DR as four canonical patterns. They sit on a spectrum trading recovery speed against standing cost and operational complexity.

Strategy	Typical RTO	Typical RPO	Relative cost	Complexity	When to use
Backup & Restore	Hours	Hours	Lowest	Low	Non-critical workloads, dev/test, cost-sensitive tiers
Pilot Light	Tens of minutes	Minutes	Low–medium	Medium	Important workloads: core data replicated live, compute started on demand
Warm Standby	Minutes	Seconds–minutes	Medium–high	High	Business-critical, can tolerate brief degradation
Multi-Site Active/Active	Near-zero	Near-zero	Highest	Very high	Revenue-critical, regulatory near-zero downtime

The honest reality is that most organisations need Pilot Light or Warm Standby for their critical tier and Backup & Restore for everything else. Active/active is genuinely expensive — not just in standing infrastructure but in the engineering discipline required to run two regions hot with conflict resolution, and it is justified far less often than vendors suggest.
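
As a rough illustration, the table can be read as a lookup from recovery objectives to the cheapest pattern that meets them. The sketch below assumes cut-offs of 60, 10 and 1 minute; the function name `pick_strategy` and those thresholds are hypothetical and should be replaced with figures agreed with the business.

```shell
# Map RTO/RPO (whole minutes) to the cheapest DR pattern that satisfies both.
# The thresholds are illustrative assumptions, not AWS guidance.
pick_strategy() (
  rto_min=$1   # tolerated downtime, in minutes
  rpo_min=$2   # tolerated data loss, in minutes
  if [ "$rto_min" -ge 60 ] && [ "$rpo_min" -ge 60 ]; then
    echo "backup-restore"
  elif [ "$rto_min" -ge 10 ] && [ "$rpo_min" -ge 1 ]; then
    echo "pilot-light"
  elif [ "$rto_min" -ge 1 ]; then
    echo "warm-standby"
  else
    echo "active-active"
  fi
)
```

Under these assumptions, `pick_strategy 30 5` prints `pilot-light` and `pick_strategy 240 1440` prints `backup-restore`. Running it per workload rather than once for the estate keeps the cheap tiers cheap.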

Backups Done Right

Backups are the foundation under every strategy. The common failure is treating them as a checkbox rather than a tested, isolated, immutable asset.

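Use AWS Backup to centralise policy across services, and copy every critical recovery point both cross-region and cross-account. A backup in the same account as production offers no protection against a compromised root credential or a ransomware actor with IAM access. Cross-account copy requires both accounts to belong to the same AWS Organizations organization with cross-account backup enabled, and the destination vault's access policy must allow the copy.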

# Cross-region, cross-account copy via an AWS Backup plan rule
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "critical-tier",
  "Rules": [{
    "RuleName": "daily-with-dr-copy",
    "TargetBackupVaultName": "prod-vault",
    "ScheduleExpression": "cron(0 2 * * ? *)",
    "Lifecycle": { "DeleteAfterDays": 35 },
    "CopyActions": [{
      "DestinationBackupVaultArn":
        "arn:aws:backup:eu-west-1:222233334444:backup-vault:dr-vault",
      "Lifecycle": { "DeleteAfterDays": 90 }
    }]
  }]
}'
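
A backup plan protects nothing until resources are assigned to it. One common way to do that is a tag-based selection, sketched below. The function name `assign_by_tag`, the selection name and the tag key `backup-tier` are assumptions for illustration; the plan ID is the `BackupPlanId` returned by `create-backup-plan`, and the role is whichever IAM role AWS Backup should assume in your account.

```shell
# Assign every resource tagged backup-tier=critical to an existing backup plan.
# $1: backup plan ID, $2: ARN of the IAM role AWS Backup assumes.
# The tag key and value are hypothetical; use your own tagging scheme.
assign_by_tag() {
  aws backup create-backup-selection \
    --backup-plan-id "$1" \
    --backup-selection "{
      \"SelectionName\": \"critical-by-tag\",
      \"IamRoleArn\": \"$2\",
      \"ListOfTags\": [{
        \"ConditionType\": \"STRINGEQUALS\",
        \"ConditionKey\": \"backup-tier\",
        \"ConditionValue\": \"critical\"
      }]
    }"
}
```

Selecting by tag rather than by resource ARN means new databases and volumes join the plan as soon as they are tagged, which avoids the usual gap where a resource exists for months before anyone remembers to back it up.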

Then make the DR vault immutable with Vault Lock in compliance mode. Once locked, even the root user cannot shorten retention or delete recovery points until the retention period expires — this is your ransomware backstop.

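# Run with credentials for the DR account (222233334444), which owns dr-vault.
# Passing --changeable-for-days selects compliance mode; without it the lock
# stays in governance mode and can be removed by sufficiently privileged users.
aws backup put-backup-vault-lock-configuration \
  --region eu-west-1 \
  --backup-vault-name dr-vault \
  --min-retention-days 30 \
  --max-retention-days 365 \
  --changeable-for-days 3   # cooling-off window (minimum 3 days) before the lock is permanent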

Compliance-mode Vault Lock is irreversible after the cooling-off window. Test it in a throwaway account first. The flip side: an attacker who deletes your primary backups cannot touch the locked DR copy, which is the entire point.
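
It is worth confirming the lock actually took effect rather than assuming it. A minimal sketch, assuming the vault name and region used above; `vault_is_locked` is a hypothetical helper that reads the `Locked` field from `describe-backup-vault`.

```shell
# Succeeds only if the named vault reports an active Vault Lock.
# $1: vault name, $2: region.
vault_is_locked() (
  locked=$(aws backup describe-backup-vault \
    --backup-vault-name "$1" --region "$2" \
    --query 'Locked' --output text | tr '[:upper:]' '[:lower:]')
  [ "$locked" = "true" ]
)
```

A scheduled job could call `vault_is_locked dr-vault eu-west-1` and alert on a non-zero exit. The `LockDate` field in the same response indicates when the configuration becomes immutable, which is useful to check during the cooling-off window.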

Data Replication Options

Backups give you a recovery point; replication gives you a low RPO. Choose per data store:

Service	Mechanism	Typical RPO	Notes
RDS	Cross-region read replica	Seconds–minutes	Promote on failover; async, watch replica lag
Aurora	Global Database	~1 second	Sub-minute promotion, purpose-built for DR
S3	Cross-Region Replication (CRR)	Seconds–minutes	Enable versioning; replicates new objects only
DynamoDB	Global Tables	Typically ~1 second	Multi-region, last-writer-wins conflict handling

Aurora Global Database is the standout for relational DR: replication lag is typically around a second, and a failover typically promotes a secondary region in under a minute. Treat both figures as typical rather than guaranteed, and measure them for your own workload. For S3, remember CRR only acts on objects written after it is enabled, so backfill existing objects with S3 Batch Replication.
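
Replication only delivers its RPO while lag stays below it, so the lag itself needs a check. The sketch below assumes you have already pulled recent lag samples in seconds, for example from the `ReplicaLag` CloudWatch metric that RDS publishes for read replicas; the helper names `max_of` and `lag_within_rpo` are hypothetical.

```shell
# Print the largest value from whitespace-separated numbers on stdin.
max_of() {
  tr -s ' \t' '\n' |
    awk 'NF && (!seen || $1 + 0 > max + 0) { max = $1; seen = 1 }
         END { if (seen) print max }'
}

# Succeed if the observed lag (seconds) is within the RPO (seconds).
# $1: observed lag, $2: RPO. Fractions are allowed.
lag_within_rpo() {
  awk -v lag="$1" -v rpo="$2" 'BEGIN { exit !(lag + 0 <= rpo + 0) }'
}
```

Comparing the worst sample in the window, not the average, is the deliberate choice here: an RPO is a promise about the bad minute, not the typical one.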

Infrastructure Recoverability

If you cannot rebuild the region from code, you do not have a DR plan — you have a single region with extra steps. Everything that defines the environment must live in IaC (Terraform, CDK or CloudFormation) and be deployable to the DR region with a parameter change, not a manual rebuild.

# Region is an input, not a hardcoded value — the same module deploys DR
variable "region" { type = string }

provider "aws" {
  region = var.region
}

module "platform" {
  source       = "../modules/platform"
  environment  = "dr"
  vpc_cidr     = "10.20.0.0/16"   # distinct from primary to allow peering
}

The subtle trap is config that lives outside IaC: manually created Secrets Manager values, hand-tuned security group rules, ACM certificates (which are regional and must be issued again in the DR region), or service quotas that differ between regions. Service quotas are tracked per region, so increases granted in the primary do not carry over; the DR region sits at default values that can silently cap your recovery. Request increases ahead of time, not during the incident.
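
Quota gaps are easy to find mechanically. The sketch below assumes two files of `QuotaCode<TAB>Value` lines, one per region, such as the output of `aws service-quotas list-service-quotas --service-code ec2 --region <region> --query 'Quotas[].[QuotaCode,Value]' --output text`; the helper name `quota_gaps` is hypothetical.

```shell
# List quotas whose DR value is lower than the primary value.
# $1: primary-region file, $2: DR-region file (QuotaCode<TAB>Value per line).
# Prints: QuotaCode primary_value dr_value
quota_gaps() {
  awk 'NR == FNR { primary[$1] = $2; next }
       ($1 in primary) && ($2 + 0 < primary[$1] + 0) {
         print $1, primary[$1], $2
       }' "$1" "$2"
}
```

Anything this prints is a candidate for a quota increase request in the DR region. Repeat per service code that the workload depends on, since each call covers a single service.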

DNS Failover With Route 53

Route 53 health checks plus failover routing give you automated traffic redirection. Pair a primary and secondary record set, attach a health check to the primary, and Route 53 shifts traffic when the primary endpoint fails.

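# Health check against the primary region's own endpoint. Point it at a
# region-specific name (api-primary.example.com is a placeholder), not the
# failover record itself, or the check turns healthy again once DNS fails over.
aws route53 create-health-check --caller-reference "dr-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api-primary.example.com",
    "ResourcePath": "/healthz",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'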

Make the health check hit a deep health endpoint that verifies database connectivity and critical dependencies, not just that the load balancer answers. A shallow check passing while the database is unreachable means Route 53 happily keeps routing to a broken region.
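
The record pair itself can be sketched as a change batch. This assumes CNAME records with a 60-second TTL; the hostnames `api-primary.example.com` and `api-dr.example.com` are placeholders for region-specific endpoints, and `failover_change_batch` is a hypothetical helper that takes the health check ID as its only argument.

```shell
# Print a Route 53 change batch with a PRIMARY/SECONDARY failover pair.
# $1: ID of the health check attached to the primary record.
failover_change_batch() {
  cat <<EOF
{
  "Changes": [
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "primary", "Failover": "PRIMARY",
        "HealthCheckId": "$1",
        "ResourceRecords": [{ "Value": "api-primary.example.com" }] } },
    { "Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "secondary", "Failover": "SECONDARY",
        "ResourceRecords": [{ "Value": "api-dr.example.com" }] } }
  ]
}
EOF
}
```

The output can be passed to `aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" --change-batch "$(failover_change_batch "$HC_ID")"`, where `ZONE_ID` and `HC_ID` are variables you set yourself. Only the primary record carries a health check in this sketch, so the secondary is always considered healthy.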

The Part Everyone Skips: Game Days

A DR plan decays the moment it is written. The only way to keep it valid is to execute it regularly under controlled chaos. Run scheduled game days where you actually fail over to the DR region, serve real (or shadow) traffic, and measure achieved RTO/RPO against target.

A useful cadence:

Tabletop quarterly — walk the runbook, confirm dependencies and owners.
Live failover twice a year — promote the DR region, route a slice of traffic, fail back.
Unannounced drill annually — page the on-call team cold and time the real response.

The metric that matters is achieved RTO/RPO, not the target. If your stated RTO is 15 minutes but the last game day took 90, your real RTO is 90 minutes. Document and close that gap, or revise the commitment.
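
Scoring a drill is simple arithmetic, which is a good reason to automate it so the number cannot be rounded down in the write-up. A minimal sketch, assuming you record epoch timestamps for when the failure was declared and when service was restored; `elapsed_minutes` and `score_drill` are hypothetical helpers.

```shell
# Minutes between two epoch timestamps, rounded up.
# $1: start (epoch seconds), $2: end (epoch seconds).
elapsed_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# Compare achieved RTO with the target.
# $1: target RTO in minutes, $2: outage start, $3: service restored.
score_drill() (
  achieved=$(elapsed_minutes "$2" "$3")
  if [ "$achieved" -le "$1" ]; then
    echo "PASS achieved=${achieved}m target=${1}m"
  else
    echo "FAIL achieved=${achieved}m target=${1}m gap=$(( achieved - $1 ))m"
  fi
)
```

For the example in the text, a 15-minute target and a drill that ran from 1000 to 6400 seconds, `score_drill 15 1000 6400` prints `FAIL achieved=90m target=15m gap=75m`.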

Common Pitfalls

Untested runbooks — steps reference deleted resources, stale credentials or people who have left.
Config drift — the DR region falls behind primary because changes are applied by hand to prod only. Enforce parity through CI that deploys identical IaC to both.
Forgotten dependencies — a third-party API allowlists only the primary region's NAT IPs, or an external SaaS webhook points at a single region.
Data egress costs — cross-region replication and restore traffic incur transfer charges. Model this; for large datasets it is a material monthly line item, and restore-time egress during a real DR event can be substantial.
DNS TTLs too high — a 3600-second TTL means resolvers that cached the record can keep sending clients to the dead region for up to an hour, regardless of how fast Route 53 fails over. Keep failover-critical records at 60 seconds.
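
The TTL point is easier to argue with numbers. As a rough upper bound, a client can keep resolving the failed endpoint for the health check detection time plus one full TTL. The sketch below is an approximation under that assumption and ignores clients that cache beyond the TTL; `worst_case_stale_seconds` is a hypothetical helper.

```shell
# Rough upper bound, in seconds, on how long DNS keeps pointing at a dead region.
# $1: health check interval (s), $2: failure threshold, $3: record TTL (s).
worst_case_stale_seconds() {
  echo $(( $1 * $2 + $3 ))
}
```

With a 30-second interval and a threshold of 3, `worst_case_stale_seconds 30 3 60` prints `150`, while the same check with a 3600-second TTL prints `3690`. The TTL dominates, which is why it belongs in the RTO budget.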

The difference between a DR plan that works and one that exists on a wiki is repeated, honest testing. Pick the cheapest strategy that meets each workload's RTO/RPO, build it entirely in code, lock your backups against tampering, and prove the failover on a schedule before reality forces the test.
