Near-Zero-Downtime Database Migration

The naive way to migrate a database is pg_dump on the old host, pg_restore on the new one, and a maintenance window long enough to cover both. For anything beyond a few gigabytes, that window stretches into hours — and for a transactional system, hours of downtime is a business outage, not a maintenance window. This article covers how to get the downtime down to seconds: replicate continuously, then cut over fast. We'll use PostgreSQL and AWS DMS as the worked example, but the playbook generalises.

Why dump/restore doesn't scale

A logical dump/restore is O(data size) for downtime because writes must stop the moment the dump begins, or the restored copy is inconsistent. A 500 GB database with foreign keys and indexes can take many hours to restore — index rebuilds dominate — and the source is frozen the entire time. Physical snapshot/restore is faster but ties you to identical engine versions and storage, and still requires a write freeze for the final delta.

The fix is to decouple the bulk copy from the cutover: copy the bulk while the source is live, stream changes on top, and only freeze writes for the final, tiny delta.

The strategies compared

Strategy	Downtime	Complexity	Risk	When
Backup / restore	Hours	Low	Low	Small DBs, generous window, same engine
Native logical replication / CDC	Seconds	Medium	Medium	Same major engine (PG→PG); full control
AWS DMS (full load + CDC)	Seconds	Medium	Medium	Heterogeneous or cross-account/region moves
Dual-write at the app	Near-zero	High	High	No replication path; app can write both stores

Dual-write looks attractive on paper but pushes consistency into application code: you now own ordering, partial-failure handling, and reconciliation between two systems. Reserve it for cases where no replication path exists. For most PostgreSQL-to-RDS/Aurora moves, DMS full-load + CDC or native logical replication is the right tool.

The canonical approach: replicate, then cut over

The shape is always the same:

Full load — bulk-copy existing rows while the source serves traffic.
CDC (change data capture) — stream every insert/update/delete that happened during and after the load.
Catch up — let replication lag drop to zero.
Cut over — brief write freeze, verify lag is zero, switch the endpoint.

AWS DMS: full load plus ongoing CDC

DMS reads the source's logical replication stream. For PostgreSQL sources, set the prerequisites first:

-- On the SOURCE Postgres (or RDS parameter group equivalents)
ALTER SYSTEM SET wal_level = logical;
ALTER SYSTEM SET max_replication_slots = 10;
ALTER SYSTEM SET max_wal_senders = 10;
-- requires restart; DMS will create its own replication slot

Create the replication task with full load and CDC in one task, plus validation:

aws dms create-replication-task \
  --replication-task-identifier pg-prod-to-aurora \
  --source-endpoint-arn arn:aws:dms:eu-west-2:111122223333:endpoint:SRC \
  --target-endpoint-arn arn:aws:dms:eu-west-2:111122223333:endpoint:TGT \
  --replication-instance-arn arn:aws:dms:eu-west-2:111122223333:rep:INST \
  --migration-type full-load-and-cdc \
  --table-mappings file://table-mappings.json \
  --replication-task-settings file://task-settings.json

# Watch CDC latency — this number must hit ~0 before cutover
aws dms describe-replication-tasks \
  --query 'ReplicationTasks[0].ReplicationTaskStats'

In task-settings.json, enable validation so DMS row-count- and checksum-compares source against target continuously:

{
  "ValidationSettings": { "EnableValidation": true, "ThreadCount": 8 },
  "Logging": { "EnableLogging": true }
}

DMS CDC reads the WAL via a replication slot. If the task stalls, the slot holds WAL on the source and disk fills up. Monitor CDCLatencySource and the source's replication-slot lag — a stuck task can take down your live database.

Expand/contract schema migrations

The dangerous part of a migration is rarely the data — it's schema changes that break compatibility between old and new code. The expand/contract (parallel change) pattern keeps both versions working throughout:

Expand — add the new column/table; make it nullable or backfill in batches. Old code ignores it.
Migrate — deploy code that writes both old and new, reads old.
Switch reads — deploy code that reads new.
Contract — once nothing references the old schema, drop it.

-- EXPAND: additive, non-breaking
ALTER TABLE orders ADD COLUMN customer_uuid uuid;
-- backfill in batches to avoid long locks
UPDATE orders SET customer_uuid = c.uuid
FROM customers c WHERE orders.customer_id = c.id
  AND orders.customer_uuid IS NULL AND orders.id BETWEEN 1 AND 100000;
-- ... later, after read-switch deploy ...
-- CONTRACT
ALTER TABLE orders DROP COLUMN customer_id;

This decouples schema rollout from the data cutover, so neither blocks the other and each step is independently reversible.

Connection management for fast failover

The cutover is only as fast as your ability to repoint connections. Hard-coding the database host in the app means a redeploy at cutover — slow and risky. Put a proxy in front:

RDS Proxy sits between app and database, pools connections, and lets you flip the target with no app change. It also survives failovers without the app seeing dropped pools.
PgBouncer is the self-managed equivalent; repoint by editing databases and issuing RELOAD.

# pgbouncer.ini — flip host then: psql -p 6432 pgbouncer -c "RELOAD;"
[databases]
app = host=new-aurora-writer.eu-west-2.rds.amazonaws.com port=5432 dbname=app

Point your application at the proxy from day one — well before migration. Retrofitting a proxy during the cutover window adds an untested variable to the riskiest moment.

The cutover runbook

A tight, rehearsed checklist is what turns a multi-hour outage into a 30-second blip:

Pre-flight — DMS validation green, CDC latency near zero, rollback plan confirmed, on-call assembled.
Freeze writes — set the app to read-only or stop write workers. Note the time.
Drain — let in-flight transactions on the source complete (seconds).
Verify lag = 0 — confirm CDCLatencySource and CDCLatencyTarget are zero; the target has every committed change.
Fix sequences — reset target sequences past the source's max (see pitfalls).
Switch endpoint — repoint RDS Proxy / PgBouncer to the new database. Re-enable writes.
Smoke test — run scripted read+write checks against critical paths.
Reconcile — row counts and checksums on the hot tables.
Monitor — watch error rates and latency for the dual-running window.

Rollback: if smoke tests fail before writes resume on the new database, repoint the proxy back — the source never lost authority. After writes have landed on the new database, rollback means replaying those writes back to the source, so the no-go decision must happen at step 7, not 9.

Data reconciliation and validation

Trust, but verify. Beyond DMS validation, run an independent count and a checksum on the tables that matter:

-- Run identically on source and target, compare
SELECT count(*), md5(string_agg(id::text, ',' ORDER BY id)) FROM orders;

Pitfalls that bite

Sequences are not replicated by CDC. The target's sequences will be stale; reset every one (SELECT setval('orders_id_seq', (SELECT max(id) FROM orders))) at cutover or you'll get duplicate-key errors on the first inserts.
Large tables slow the full load and can stall CDC. Split them with parallel-load segments in the table mappings.
LOBs — DMS truncates large objects unless you set LobMaxSize / full-LOB mode; full-LOB mode is slow, so size it deliberately.
Foreign keys — disable FK validation on the target during full load (DMS does this for you with ENABLE_VALIDATION off per phase) or the unordered load fails constraint checks. Re-enable before cutover.

Migrating a production database and can't afford the downtime? i2zone builds and rehearses the full DMS-plus-cutover playbook so your switchover lands in seconds, not hours — talk to us.