The AWS Cost Optimization Playbook

Most AWS cost programmes fail not because the levers are unknown, but because they are pulled in the wrong order, once, by the wrong team. Rightsizing before you have allocation tags is guesswork; buying Savings Plans before you have a stable baseline locks in waste for a year. This playbook sequences the work the way a senior engineer actually runs it: visibility first, then compute, storage, the data-transfer costs nobody budgets for, and finally the operating model that stops the savings from eroding the moment you look away.

Visibility before action

You cannot optimise what you cannot attribute. The Cost Explorer UI is fine for executives; engineers need the Cost and Usage Report (CUR) queried directly. Enable CUR with resource IDs and hourly granularity, land it in S3, and query it with Athena.

-- Top 20 cost line items for the previous calendar month, by unblended cost
SELECT
  line_item_product_code        AS service,
  line_item_resource_id         AS resource,
  resource_tags_user_team       AS team,
  SUM(line_item_unblended_cost)  AS cost
FROM cur.cur_table
WHERE line_item_usage_start_date >= date_trunc('month', date_add('month', -1, current_date))
  AND line_item_usage_start_date <  date_trunc('month', current_date)
  AND line_item_line_item_type = 'Usage'
GROUP BY 1, 2, 3
ORDER BY cost DESC
LIMIT 20;

The hard part is allocation. A tag is only useful if it is enforced and activated as a cost allocation tag in the billing console. Mandate a minimal taxonomy (team, env, service, cost-center) and enforce it at two levels: a Service Control Policy can deny creation of resources that lack the required tags, and an AWS Config rule such as required-tags flags anything that slips through.

Untagged spend is unowned spend. Aim for >95% of CUR cost mapped to a tag before you touch a single instance type — otherwise your savings land in someone else's budget and the optimisation work has no internal sponsor.
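As a rough illustration of the 95% target, the coverage figure can be computed from any export of cost rows. This is a minimal sketch: the row shape (a `cost` field plus a `team` tag) and the function name `tag_coverage` are assumptions for the example, not a CUR schema.

```python
from collections import defaultdict


def tag_coverage(rows, tag_key="team"):
    """Return (coverage_fraction, cost_by_owner) for a list of cost rows.

    Each row is a dict with a numeric "cost" and an optional tag value under
    `tag_key`. Missing, None or blank tag values count as untagged.
    """
    by_owner = defaultdict(float)
    tagged = 0.0
    total = 0.0
    for row in rows:
        cost = float(row["cost"])
        owner = (row.get(tag_key) or "").strip()
        total += cost
        if owner:
            tagged += cost
            by_owner[owner] += cost
        else:
            by_owner["(untagged)"] += cost
    coverage = tagged / total if total else 0.0
    return coverage, dict(by_owner)


# Hypothetical monthly rows, e.g. the result of an Athena query over the CUR.
rows = [
    {"cost": 700.0, "team": "payments"},
    {"cost": 250.0, "team": "search"},
    {"cost": 50.0, "team": ""},
]
coverage, by_owner = tag_coverage(rows)
print(f"{coverage:.1%} of cost is attributed to a team")
```

Tracking this one number week over week is usually enough to tell whether tag enforcement is working.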

Decide early between showback (teams see their cost, no internal billing) and chargeback (teams are actually billed). Showback is the right starting point; chargeback only works once allocation is trusted, otherwise you spend every month arguing about the shared-cost split for NAT, data transfer and support.
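One common way to settle the shared-cost argument is to split shared items in proportion to each team's directly attributed spend. The sketch below assumes that rule; the function name and the figures are hypothetical, and other keys (request counts, headcount, a flat split) are equally defensible.

```python
def allocate_shared(direct_costs, shared_cost):
    """Spread shared_cost across teams in proportion to their direct cost."""
    total_direct = sum(direct_costs.values())
    if total_direct <= 0:
        raise ValueError("no direct cost to allocate against")
    return {
        team: cost + shared_cost * cost / total_direct
        for team, cost in direct_costs.items()
    }


# Hypothetical month: 2,000 of shared NAT, data transfer and support cost.
direct = {"payments": 6000.0, "search": 3000.0, "platform": 1000.0}
showback = allocate_shared(direct, shared_cost=2000.0)
for team, cost in showback.items():
    print(f"{team}: {cost:.2f}")
```

Whatever rule you pick, publish it alongside the numbers so the split is debated once rather than every month.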

Compute: commitments, rightsizing, Graviton

Compute is usually 50–70% of the bill and where the largest structural savings live. Three commitment models exist alongside uncommitted Spot capacity, and the right answer is almost always a blend.

Mechanism	Max discount vs on-demand	Flexibility	Commitment	Best for
Compute Savings Plans	up to ~66%	Any region, family, OS, Fargate, Lambda	1 or 3 yr, $/hr	Baseline that shifts across services
EC2 Instance Savings Plans	up to ~72%	Same family in a region	1 or 3 yr, $/hr	Stable family, want max discount
Reserved Instances	up to ~72%	Standard: locked; Convertible: exchangeable	1 or 3 yr	RDS/ElastiCache/Redshift (no SP coverage)
Spot	up to ~90%	Interruptible (2-min notice)	None	Stateless, fault-tolerant, batch

The decision rule: cover your stable 24/7 baseline with Compute Savings Plans (the flexibility is worth a few points of discount), use RIs for the managed services Savings Plans do not cover, and push fault-tolerant and batch workloads to Spot. Never commit above your trough — buy to roughly the 1- or 3-month minimum usage and let on-demand absorb the peaks.
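The "never commit above your trough" rule can be sketched as arithmetic. The example assumes you have hourly on-demand-equivalent spend for the eligible usage and an assumed blended discount. Both the sample series and the 30% figure are hypothetical. A Savings Plan commitment is expressed in discounted dollars per hour, so the trough is multiplied by (1 - discount).

```python
def plan_commitment(hourly_on_demand, discount):
    """Size a commitment to the minimum observed hourly usage.

    hourly_on_demand: on-demand-equivalent $/hr samples over the lookback.
    discount: assumed discount as a fraction (0.30 means 30%).
    """
    if not hourly_on_demand:
        raise ValueError("need at least one hourly sample")
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    trough = min(hourly_on_demand)
    commitment = trough * (1 - discount)
    cost_before = sum(hourly_on_demand)
    # Each hour: pay the commitment (covers `trough` of usage at the
    # discounted rate), and anything above the trough stays on-demand.
    cost_after = sum(commitment + (u - trough) for u in hourly_on_demand)
    return {
        "trough": trough,
        "commitment_per_hour": commitment,
        "cost_before": cost_before,
        "cost_after": cost_after,
        "savings": cost_before - cost_after,
    }


sample = [10.0, 12.0, 18.0, 25.0, 11.0, 10.0]  # hypothetical $/hr
plan = plan_commitment(sample, discount=0.30)
print(plan)
```

Because the commitment never exceeds the trough, every committed dollar is used in every hour of the sample; the trade-off is that the peaks earn no discount.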

Rightsizing comes from Compute Optimizer, not intuition. Export its recommendations and act on the over-provisioned instances first.

# Over-provisioned EC2 instances, with the first recommendation option for each
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned \
  --query 'instanceRecommendations[].{id:instanceArn, current:currentInstanceType,
     rec:recommendationOptions[0].instanceType,
     savings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value}' \
  --output table

Graviton is the single highest-leverage migration available. ARM-based instances typically deliver 20–40% better price-performance for most general-purpose, web and containerised workloads. The migration risk is binary compatibility — rebuild your images for arm64 (or build multi-arch with docker buildx), confirm any native dependencies have ARM wheels, and roll out behind a canary. Managed services (RDS, ElastiCache, OpenSearch) make it nearly free: change the instance class and reboot.

Storage: the slow leak

Storage rarely triggers an alert but compounds relentlessly.

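S3: use Intelligent-Tiering as the default storage class for unpredictable access patterns. It moves objects between access tiers automatically and charges no retrieval fees, in exchange for a small per-object monitoring charge; objects under 128 KB are not monitored or auto-tiered. For known patterns, use explicit lifecycle rules to expire or archive.
EBS gp2 → gp3: gp3 is roughly 20% cheaper per GB and decouples IOPS/throughput from volume size. There is rarely a reason to stay on gp2, with one caveat: gp3 starts at a baseline of 3,000 IOPS and 125 MiB/s, so gp2 volumes above 1 TiB (where gp2's 3 IOPS/GiB exceeds 3,000) or volumes that rely on gp2's higher throughput need IOPS/throughput provisioned explicitly on gp3 to avoid a regression.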
Snapshots: orphaned snapshots from deregistered AMIs are a classic silent cost.

# Migrate all gp2 volumes to gp3 in a region
for v in $(aws ec2 describe-volumes \
  --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[].VolumeId' --output text); do
  aws ec2 modify-volume --volume-id "$v" --volume-type gp3
done
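The loop above changes only the volume type, which leaves every volume at the gp3 baseline. A small helper can decide which volumes need extra IOPS passed to modify-volume. This is a sketch based on the published gp2 rule (3 IOPS per GiB, minimum 100, maximum 16,000) and the gp3 baseline of 3,000 IOPS; the function names are illustrative, and throughput is deliberately left out.

```python
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, floored at 100 and capped at 16,000."""
    return max(100, min(16000, 3 * size_gib))


def gp3_iops_to_match(size_gib):
    """IOPS to request on gp3 so the volume is no slower than its gp2 baseline."""
    return max(GP3_BASELINE_IOPS, gp2_baseline_iops(size_gib))


for size in (10, 500, 2000, 6000):
    needed = gp3_iops_to_match(size)
    extra = needed - GP3_BASELINE_IOPS
    note = "baseline is enough" if extra == 0 else f"provision {needed} IOPS"
    print(f"{size:>5} GiB: gp2={gp2_baseline_iops(size):>5} IOPS -> {note}")
```

Volumes at or below 1,000 GiB need nothing extra; larger ones should be migrated with an explicit IOPS value, and provisioned IOPS above the baseline are billed separately.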

{
  "Rules": [{
    "ID": "expire-logs",
    "Filter": { "Prefix": "logs/" },
    "Status": "Enabled",
    "Transitions": [{ "Days": 30, "StorageClass": "INTELLIGENT_TIERING" }],
    "Expiration": { "Days": 365 }
  }]
}

The data-transfer costs nobody budgets for

This is where I find the most surprise spend, because it is invisible in the EC2 line item and split across networking codes in the CUR.

NAT Gateway charges per-GB processed on top of the hourly fee. A chatty private subnet pulling from S3 or ECR through NAT can cost more than the compute. Replace those paths with VPC Gateway Endpoints (free, for S3 and DynamoDB) and Interface Endpoints for ECR, Secrets Manager, etc. Interface Endpoints carry their own hourly and per-GB charges, but the per-GB rate is well below NAT's.
Cross-AZ traffic is billed in both directions (~$0.01/GB each way). A service mesh or database replica chatting across AZs at high volume adds up fast — use topology-aware routing to keep traffic in-zone where availability allows.
Inter-region and egress to internet are the most expensive; front public egress with CloudFront, which has cheaper data-out rates and offloads origin traffic.

-- Where is data-transfer spend actually going?
SELECT line_item_usage_type, SUM(line_item_unblended_cost) AS cost
FROM cur.cur_table
WHERE line_item_usage_type LIKE '%DataTransfer%'
   OR line_item_usage_type LIKE '%AWS-Out-Bytes%'    -- inter-region transfer
   OR line_item_usage_type LIKE '%NatGateway-Bytes%'
GROUP BY 1 ORDER BY cost DESC;
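To see why these line items surprise people, it helps to put numbers on them. The sketch below uses illustrative rates passed as parameters ($0.045 per hour and per GB for NAT, $0.01 per GB each way for cross-AZ); these are assumptions for the example, so check the price list for your region before relying on them.

```python
HOURS_PER_MONTH = 730


def nat_monthly_cost(gb_processed, hourly=0.045, per_gb=0.045):
    """Monthly cost of one NAT Gateway: hourly fee plus per-GB processing."""
    return hourly * HOURS_PER_MONTH + per_gb * gb_processed


def gateway_endpoint_savings(gb_moved, nat_per_gb=0.045):
    """NAT processing charge avoided by moving S3/DynamoDB traffic to a
    Gateway Endpoint, which has no per-GB or hourly charge."""
    return gb_moved * nat_per_gb


def cross_az_monthly_cost(gb, per_gb_each_way=0.01):
    """Cross-AZ traffic is billed on both the sending and receiving side."""
    return gb * per_gb_each_way * 2


nat_total = nat_monthly_cost(50_000)         # 50 TB through NAT
saved = gateway_endpoint_savings(40_000)     # 80% of it was S3 traffic
cross_az = cross_az_monthly_cost(10_000)     # 10 TB between AZs
print(f"NAT: ${nat_total:,.2f}  avoidable: ${saved:,.2f}  cross-AZ: ${cross_az:,.2f}")
```

At these assumed rates the hourly fee is a rounding error; nearly all of the NAT bill is per-GB processing, which is exactly the part an endpoint removes.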

Managed and serverless trade-offs

Serverless (Lambda, Fargate, Aurora Serverless v2) eliminates idle cost and operational overhead, but per-unit compute is more expensive than a well-utilised reserved EC2 fleet. The crossover is utilisation: below roughly 40–50% steady utilisation, serverless usually wins on total cost of ownership once you price in the engineering time you are not spending on patching and scaling. Above that, and especially at 24/7 high load, committed EC2/EKS with Graviton is cheaper. Decide per workload, not per dogma.
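The crossover can be sketched with a deliberately simple model: committed capacity costs a fixed amount per hour regardless of load (plus an allowance for the engineering time it consumes), while serverless cost scales linearly with utilisation. All three prices below are hypothetical placeholders, not AWS rates, and real workloads are rarely this linear.

```python
def break_even_utilisation(committed_hourly, ops_overhead_hourly,
                           serverless_full_load_hourly):
    """Utilisation (0..1) at which serverless and committed capacity cost the same.

    committed_hourly: fixed $/hr for the reserved fleet.
    ops_overhead_hourly: engineering time for patching/scaling, as $/hr.
    serverless_full_load_hourly: what serverless would cost at 100% load.
    """
    if serverless_full_load_hourly <= 0:
        raise ValueError("serverless price must be positive")
    return (committed_hourly + ops_overhead_hourly) / serverless_full_load_hourly


def cheaper_option(utilisation, committed_hourly, ops_overhead_hourly,
                   serverless_full_load_hourly):
    """Return which model is cheaper at a given steady utilisation."""
    threshold = break_even_utilisation(
        committed_hourly, ops_overhead_hourly, serverless_full_load_hourly)
    return "serverless" if utilisation < threshold else "committed"


# Hypothetical prices for one workload.
prices = dict(committed_hourly=0.40, ops_overhead_hourly=0.10,
              serverless_full_load_hourly=1.10)
threshold = break_even_utilisation(**prices)
print(f"break-even at {threshold:.0%} utilisation")
print(cheaper_option(0.20, **prices), cheaper_option(0.80, **prices))
```

The useful part is the shape rather than the numbers: raising the ops overhead pushes the break-even up, which is why small teams often stay on serverless longer than raw compute prices suggest.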

Governance: stop the regression

Savings decay. Lock them in with automated guardrails.

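# Anomaly monitor across all services; the $200 alert threshold is set on the
# alert subscription (aws ce create-anomaly-subscription), not on the monitor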
aws ce create-anomaly-monitor --anomaly-monitor \
  '{"MonitorName":"all-services","MonitorType":"DIMENSIONAL","MonitorDimension":"SERVICE"}'

AWS Budgets: per-team budgets with alerts at 80%/100% of forecast, wired to the owning team's channel.
Cost Anomaly Detection: catches the 3am runaway Lambda loop before it becomes a finance conversation.
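Off-hours scheduling: non-prod accounts rarely need to run nights and weekends. A 40-hour working week leaves 128 of 168 hours idle (about 76%); in practice, tag-driven start/stop of EC2/RDS/ASGs typically cuts non-prod compute by 60–70%, because schedules are wider than 40 hours and some resources have to stay up.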
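The scheduling arithmetic is easy to check for any running window. A minimal sketch (the function name is illustrative):

```python
HOURS_PER_WEEK = 7 * 24  # 168


def idle_fraction(hours_per_day, days_per_week):
    """Share of the week a resource is stopped under a start/stop schedule."""
    running = hours_per_day * days_per_week
    if not 0 <= running <= HOURS_PER_WEEK:
        raise ValueError("schedule does not fit in a week")
    return (HOURS_PER_WEEK - running) / HOURS_PER_WEEK


strict = idle_fraction(8, 5)     # 8 hours a day, weekdays only
generous = idle_fraction(12, 5)  # 12 hours a day, weekdays only
print(f"8x5: {strict:.0%} idle, 12x5: {generous:.0%} idle")
```

An 8x5 schedule stops resources for 128 of 168 hours, and a more forgiving 12x5 schedule still reclaims 108 hours, close to two thirds of the week.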

A FinOps operating model

Tooling without ownership regresses within a quarter. The model that holds:

Cost is a non-functional requirement owned by engineering, informed by finance — not a quarterly fire drill run by finance alone.

Establish a small FinOps function (often one platform engineer plus a finance partner) that owns the CUR pipeline, the dashboards and the commitment portfolio. Make unit economics visible — cost per request, per tenant, per deployment — so teams optimise against a metric that maps to value, not an abstract dollar figure. Run a monthly review of the largest movers, feed anomalies straight back to owning teams, and treat commitment purchasing as a continuous rolling decision rather than an annual event. The goal is a flywheel: visibility drives accountability, accountability drives optimisation, and automation prevents the regression.
