AWS Cost Optimization: The Complete Guide
- The cost optimization framework
- Tier 1: Quick wins (do this week)
- Tier 2: Commitment discounts (do this month)
- Tier 3: Architectural optimization (strategic)
- Tools and resources
- Building a cost-conscious culture
- Common mistakes to avoid
- When to get help
- Next steps
Every company I’ve audited has the same story: AWS costs grew faster than expected.
It’s not because AWS is expensive. It’s because:
- Defaults favor availability over cost — Multi-AZ, On-Demand, S3 Standard
- Engineers don’t see the bill — no feedback loop between provisioning and spending
- Growth happens faster than cleanup — old resources accumulate
- Nobody owns cloud cost — it’s everyone’s job, so it’s nobody’s job
- Fear of breaking things — “let’s just leave it running”
The good news: these problems are fixable. This guide shows you how.
The cost optimization framework
Think of optimization in three tiers:

- Tier 1: Quick wins (do this week): idle resources, right-sizing, storage classes
- Tier 2: Commitment discounts (do this month): Reserved Instances and Savings Plans
- Tier 3: Architectural optimization (strategic): Spot, Graviton, network design

Always start at Tier 1. Work your way up.
Tier 1: Quick wins (do this week)
1.1 Find and terminate idle resources
Idle EC2 instances
#!/bin/bash
# find-idle-ec2.sh
# Find EC2 instances with <5% average CPU over 14 days
# Requires GNU date (Linux). On macOS, install coreutils and use gdate.
THRESHOLD=5
DAYS=14
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].[InstanceId,InstanceType,Tags[?Key==`Name`]|[0].Value]' \
--output text | while IFS=$'\t' read -r instance_id instance_type name; do
avg_cpu=$(aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=$instance_id \
--start-time $(date -d "-${DAYS} days" -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 86400 \
--statistics Average \
--query 'Datapoints[].Average' \
--output text | awk '{sum+=$1; count++} END {if(count>0) print sum/count; else print 0}')
if (( $(echo "$avg_cpu < $THRESHOLD" | bc -l) )); then
echo "IDLE: $instance_id ($name) - Type: $instance_type - Avg CPU: ${avg_cpu}%"
fi
done
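Once the script above surfaces idle instances, translate them into dollars so the list gets prioritized. A minimal sketch (the hourly rates below are illustrative On-Demand figures, not live pricing; `monthly_waste` and the instance list are hypothetical names for this example):

```python
# Rough monthly waste estimate for idle instances.
# Rates are ILLUSTRATIVE us-east-1 On-Demand prices, not authoritative;
# pull real rates from the AWS Pricing API for actual decisions.
HOURLY_RATES = {
    "m5.large": 0.096,
    "m5.xlarge": 0.192,
    "c5.2xlarge": 0.34,
}

HOURS_PER_MONTH = 730  # AWS billing convention

def monthly_waste(idle_instances):
    """idle_instances: list of (instance_id, instance_type) tuples."""
    total = 0.0
    for instance_id, instance_type in idle_instances:
        rate = HOURLY_RATES.get(instance_type)
        if rate is None:
            continue  # unknown type: look up its rate before estimating
        total += rate * HOURS_PER_MONTH
    return total

idle = [("i-0abc", "m5.large"), ("i-0def", "c5.2xlarge")]
print(f"Estimated monthly waste: ${monthly_waste(idle):,.2f}")
```

For anything beyond a rough estimate, fetch current rates with the Price List API (`aws pricing get-products --service-code AmazonEC2 ...`) instead of hard-coding them.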
Unattached EBS volumes
#!/bin/bash
# find-unattached-ebs.sh
echo "Unattached EBS Volumes:"
echo "========================"
aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
--query 'Volumes[].[VolumeId,Size,VolumeType,CreateTime]' \
--output table
# Calculate total cost
total_gb=$(aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
--query 'sum(Volumes[].Size)' \
--output text)
echo ""
echo "Total unattached storage: ${total_gb} GB"
echo "Estimated monthly cost: \$$(echo "$total_gb * 0.10" | bc) (gp2 rate; gp3 is about \$0.08/GB-month)"
Old EBS snapshots
#!/usr/bin/env python3
# find-old-snapshots.py
import boto3
from datetime import datetime, timezone
ec2 = boto3.client('ec2')
MAX_AGE_DAYS = 90
# Get account ID
sts = boto3.client('sts')
account_id = sts.get_caller_identity()['Account']
# Get all snapshots owned by this account
snapshots = ec2.describe_snapshots(OwnerIds=[account_id])['Snapshots']
old_snapshots = []
total_size = 0
for snap in snapshots:
age = datetime.now(timezone.utc) - snap['StartTime']
if age.days > MAX_AGE_DAYS:
old_snapshots.append({
'SnapshotId': snap['SnapshotId'],
'Size': snap['VolumeSize'],
'Age': age.days,
'Description': snap.get('Description', 'N/A')[:50]
})
total_size += snap['VolumeSize']
print(f"\nSnapshots older than {MAX_AGE_DAYS} days:")
print("=" * 80)
for snap in sorted(old_snapshots, key=lambda x: x['Age'], reverse=True)[:20]:
print(f"{snap['SnapshotId']} | {snap['Size']:>5} GB | {snap['Age']:>4} days | {snap['Description']}")
print(f"\nTotal old snapshots: {len(old_snapshots)}")
print(f"Total size: {total_size} GB")
print(f"Estimated monthly cost: ${total_size * 0.05:.2f}")
1.2 Right-size over-provisioned instances
Check EC2 recommendations from Compute Optimizer
# Enable Compute Optimizer (one-time)
aws compute-optimizer update-enrollment-status --status Active
# Get EC2 recommendations
aws compute-optimizer get-ec2-instance-recommendations \
--query 'instanceRecommendations[?finding==`OVER_PROVISIONED`].[
instanceArn,
currentInstanceType,
recommendationOptions[0].instanceType,
recommendationOptions[0].projectedUtilizationMetrics
]' \
--output table
RDS right-sizing analysis
#!/usr/bin/env python3
# analyze-rds-utilization.py
import boto3
from datetime import datetime, timezone, timedelta
rds = boto3.client('rds')
cloudwatch = boto3.client('cloudwatch')
def get_rds_metrics(db_identifier, days=14):
"""Get CPU and memory metrics for RDS instance"""
end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(days=days)
metrics = {}
for metric_name in ['CPUUtilization', 'FreeableMemory', 'DatabaseConnections']:
response = cloudwatch.get_metric_statistics(
Namespace='AWS/RDS',
MetricName=metric_name,
Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': db_identifier}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Average', 'Maximum']
)
if response['Datapoints']:
avg = sum(d['Average'] for d in response['Datapoints']) / len(response['Datapoints'])
max_val = max(d['Maximum'] for d in response['Datapoints'])
metrics[metric_name] = {'average': avg, 'maximum': max_val}
return metrics
# Analyze all RDS instances
instances = rds.describe_db_instances()['DBInstances']
print("RDS Instance Utilization Analysis")
print("=" * 80)
for instance in instances:
db_id = instance['DBInstanceIdentifier']
instance_class = instance['DBInstanceClass']
metrics = get_rds_metrics(db_id)
if metrics.get('CPUUtilization'):
cpu_avg = metrics['CPUUtilization']['average']
cpu_max = metrics['CPUUtilization']['maximum']
status = "OK"
if cpu_avg < 20 and cpu_max < 50:
status = "OVER-PROVISIONED - Consider downsizing"
elif cpu_avg > 80:
status = "UNDER-PROVISIONED - Consider upsizing"
print(f"\n{db_id} ({instance_class})")
print(f" CPU Avg: {cpu_avg:.1f}% | CPU Max: {cpu_max:.1f}%")
print(f" Status: {status}")
1.3 Storage class optimization
S3 bucket analysis
#!/usr/bin/env python3
# analyze-s3-storage.py
import boto3
from datetime import datetime, timezone, timedelta
s3 = boto3.client('s3')
cloudwatch = boto3.client('cloudwatch')
def get_bucket_size(bucket_name):
"""Get bucket size from CloudWatch metrics"""
now = datetime.now(timezone.utc)
start = now.replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=1)
response = cloudwatch.get_metric_statistics(
Namespace='AWS/S3',
MetricName='BucketSizeBytes',
Dimensions=[
{'Name': 'BucketName', 'Value': bucket_name},
{'Name': 'StorageType', 'Value': 'StandardStorage'}
],
StartTime=start,
EndTime=now,
Period=86400,
Statistics=['Average']
)
if response['Datapoints']:
return response['Datapoints'][0]['Average']
return 0
buckets = s3.list_buckets()['Buckets']
print("S3 Bucket Storage Analysis")
print("=" * 80)
total_standard = 0
recommendations = []
for bucket in buckets:
name = bucket['Name']
try:
# Check if lifecycle policy exists
try:
s3.get_bucket_lifecycle_configuration(Bucket=name)
has_lifecycle = True
except s3.exceptions.ClientError:
has_lifecycle = False
size_bytes = get_bucket_size(name)
size_gb = size_bytes / (1024 ** 3)
if size_gb > 1: # Only show buckets > 1 GB
total_standard += size_gb
if not has_lifecycle and size_gb > 10:
recommendations.append({
'bucket': name,
'size_gb': size_gb,
                'potential_savings': size_gb * 0.7 * (0.023 - 0.0125)  # Assume 70% can move to IA; savings is the rate difference, not the full Standard rate
})
print(f"{name}: {size_gb:.2f} GB | Lifecycle: {'Yes' if has_lifecycle else 'NO'}")
except Exception as e:
print(f"{name}: Error - {e}")
print(f"\n{'=' * 80}")
print(f"Total Standard Storage: {total_standard:.2f} GB")
print(f"Monthly cost (Standard): ${total_standard * 0.023:.2f}")
if recommendations:
print("\nBuckets needing lifecycle policies:")
for rec in sorted(recommendations, key=lambda x: x['size_gb'], reverse=True):
print(f" {rec['bucket']}: {rec['size_gb']:.2f} GB - Potential savings: ${rec['potential_savings']:.2f}/mo")
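To sanity-check a lifecycle policy before writing one, it helps to compare storage-class rates directly. A sketch with illustrative us-east-1 per-GB rates (verify against the S3 pricing page; `monthly_storage_cost` is a hypothetical helper for this example):

```python
# Monthly cost per GB by S3 storage class.
# Rates are ILLUSTRATIVE us-east-1 figures, not authoritative.
RATES_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_storage_cost(size_gb, storage_class="STANDARD"):
    """Storage cost only; excludes request and retrieval charges."""
    return size_gb * RATES_PER_GB[storage_class]

size = 500  # GB
for cls in RATES_PER_GB:
    print(f"{cls:>13}: ${monthly_storage_cost(size, cls):8.2f}/mo")
```

Note the caveat in the docstring: IA and Glacier classes add retrieval and per-request charges, so the raw storage rate overstates savings for frequently accessed data.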
1.4 Set up cost alerts
# Create a budget with alerts
aws budgets create-budget \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget '{
"BudgetName": "Monthly-AWS-Budget",
"BudgetLimit": {
"Amount": "10000",
"Unit": "USD"
},
"TimeUnit": "MONTHLY",
"BudgetType": "COST"
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{
"SubscriptionType": "EMAIL",
"Address": "your-email@company.com"
}
]
},
{
"Notification": {
"NotificationType": "FORECASTED",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [
{
"SubscriptionType": "EMAIL",
"Address": "your-email@company.com"
}
]
}
]'
Tier 2: Commitment discounts (do this month)
2.1 Reserved Instances vs Savings Plans
| Feature | Reserved Instances | Savings Plans |
|---|---|---|
| Discount | Up to 72% | Up to 72% (EC2 Instance SP), up to 66% (Compute SP) |
| Flexibility | Instance-type specific | Any instance type |
| Region-locked | Yes | Compute SP: No |
| Best for | Predictable, stable workloads | Variable workloads |
| Recommendation | RDS, ElastiCache | EC2, Fargate, Lambda |
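Whichever flavor you pick, the decision reduces to arithmetic on hourly rates and any upfront payment. A small helper (a sketch with hypothetical rates; `ri_breakeven_months` is my own name, not an AWS API):

```python
def ri_breakeven_months(on_demand_hourly, effective_hourly, upfront=0.0,
                        hours_per_month=730):
    """Months until cumulative RI/SP savings cover the upfront cost.

    effective_hourly is the commitment's recurring hourly rate;
    upfront is any one-time payment (partial/all-upfront options).
    """
    monthly_savings = (on_demand_hourly - effective_hourly) * hours_per_month
    if monthly_savings <= 0:
        return float("inf")  # commitment never pays off
    return upfront / monthly_savings if upfront else 0.0

# Hypothetical numbers for illustration, not real rates:
# $0.096/hr On-Demand vs $0.060/hr under a 1-year no-upfront plan.
print(ri_breakeven_months(0.096, 0.060))  # 0.0 (saves from hour one)
print(ri_breakeven_months(0.096, 0.050, upfront=200))
```

No-upfront plans save from the first hour, which is one reason they are the safer starting point even though all-upfront variants carry a deeper discount.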
2.2 RI / SP decision framework
#!/usr/bin/env python3
# ri-sp-analyzer.py
import boto3
ce = boto3.client('ce')
def get_ri_recommendations(service):
"""Get RI purchase recommendations"""
response = ce.get_reservation_purchase_recommendation(
Service=service,
TermInYears='ONE_YEAR',
PaymentOption='NO_UPFRONT',
LookbackPeriodInDays='SIXTY_DAYS'
)
return response.get('Recommendations', [])
def get_sp_recommendations():
"""Get Savings Plans recommendations"""
response = ce.get_savings_plans_purchase_recommendation(
SavingsPlansType='COMPUTE_SP',
TermInYears='ONE_YEAR',
PaymentOption='NO_UPFRONT',
LookbackPeriodInDays='SIXTY_DAYS'
)
return response.get('SavingsPlansPurchaseRecommendation', {})
print("Reserved Instance Recommendations")
print("=" * 60)
for service in [
'Amazon Elastic Compute Cloud - Compute',
'Amazon Relational Database Service',
'Amazon ElastiCache',
]:
recs = get_ri_recommendations(service)
if recs:
for rec in recs:
details = rec.get('RecommendationDetails', [])
for detail in details[:3]: # Top 3 recommendations
print(f"\n{service}")
print(f" Instance: {detail.get('InstanceDetails', {})}")
print(f" Monthly savings: ${float(detail.get('EstimatedMonthlySavingsAmount', 0)):.2f}")
print(f" Upfront cost: ${float(detail.get('UpfrontCost', 0)):.2f}")
print("\n" + "=" * 60)
print("Savings Plans Recommendations")
print("=" * 60)
sp_rec = get_sp_recommendations()
if sp_rec:
details = sp_rec.get('SavingsPlansPurchaseRecommendationDetails', [{}])[0]
print(f"Recommended hourly commitment: ${details.get('HourlyCommitmentToPurchase', 'N/A')}")
print(f"Estimated monthly savings: ${float(details.get('EstimatedMonthlySavingsAmount', 0)):.2f}")
2.3 Implementing commitments safely
Start-small strategy:
- Week 1: Buy RIs/SPs for 50% of stable On-Demand usage
- Month 1: Monitor utilization, ensure >80% RI/SP usage
- Month 2: Increase to 70% coverage
- Month 3: Reach target 80-90% coverage
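The ramp above matters because a commitment only saves money on the hours you actually consume; unused commitment is paid in full. A back-of-envelope model of that trade-off (my own simplification, ignoring expiring plans and mixed discount rates):

```python
def net_savings_fraction(commitment_fraction, utilization, discount):
    """Net savings vs all-On-Demand, as a fraction of the On-Demand bill.

    commitment_fraction: commitment size relative to total usage (0-1)
    utilization: fraction of the commitment actually consumed (0-1)
    discount: commitment discount vs On-Demand (0-1)

    You pay commitment_fraction * (1 - discount) for the plan and
    On-Demand rates for whatever it doesn't cover. Unused commitment
    is pure loss, which works out to:
        savings = commitment_fraction * (utilization + discount - 1)
    """
    return commitment_fraction * (utilization + discount - 1)

# 50% coverage, fully utilized, 30% discount: saves 15% of the bill.
print(round(net_savings_fraction(0.5, 1.0, 0.3), 4))
# Same plan at 60% utilization: loses 5% of the bill.
print(round(net_savings_fraction(0.5, 0.6, 0.3), 4))
```

When utilization plus discount drops below 1.0, the "discount" costs more than On-Demand, which is exactly why the schedule above insists on >80% utilization before increasing coverage.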
# Monitor RI utilization (date -d is GNU date; on macOS use gdate from coreutils)
aws ce get-reservation-utilization \
--time-period Start=$(date -d "-30 days" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity MONTHLY \
--query 'UtilizationsByTime[].Total.[UtilizationPercentage]'
# Monitor Savings Plans utilization
aws ce get-savings-plans-utilization \
--time-period Start=$(date -d "-30 days" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity MONTHLY
Tier 3: Architectural optimization (strategic)
3.1 Spot instances for stateless workloads
EKS with Karpenter v1 spot configuration
# karpenter-spot-nodepool.yaml — Karpenter v1 (GA)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: spot-workloads
spec:
template:
metadata:
labels:
workload-type: spot
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["5"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: spot-template
taints:
- key: spot
value: "true"
effect: NoSchedule
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
limits:
cpu: 500
memory: 500Gi
---
# Deployment that uses Spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: processor
          image: batch-processor:latest
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
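A rough way to size the Spot opportunity before migrating: blend On-Demand and Spot rates across the fleet. Spot discounts vary by instance pool and over time; the 70% default below is a common ballpark, not a guarantee, and the $0.10/hr rate is hypothetical:

```python
def blended_hourly_cost(node_count, on_demand_hourly, spot_fraction,
                        spot_discount=0.70):
    """Fleet hourly cost with a mix of On-Demand and Spot capacity."""
    od_nodes = node_count * (1 - spot_fraction)
    spot_nodes = node_count * spot_fraction
    return (od_nodes * on_demand_hourly
            + spot_nodes * on_demand_hourly * (1 - spot_discount))

# 20 nodes at a hypothetical $0.10/hr On-Demand rate, 80% on Spot:
all_od = blended_hourly_cost(20, 0.10, 0.0)
mixed = blended_hourly_cost(20, 0.10, 0.8)
print(f"All On-Demand: ${all_od:.2f}/hr, 80% Spot: ${mixed:.2f}/hr")
```

The model deliberately ignores interruption overhead (rescheduling, wasted partial work), so treat the result as an upper bound on savings for anything that isn't cleanly restartable.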
3.2 Graviton migration
Terraform for Graviton EKS node group
# Graviton node group — 20% cheaper than x86
resource "aws_eks_node_group" "graviton" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "graviton-nodes"
node_role_arn = aws_iam_role.eks_nodes.arn
subnet_ids = aws_subnet.private[*].id
# Graviton instance types
instance_types = ["m6g.large", "m6g.xlarge", "c6g.large", "c6g.xlarge"]
  # ARM64 AMI (AL2023; clusters still on Amazon Linux 2 use AL2_ARM_64)
  ami_type = "AL2023_ARM_64_STANDARD"
scaling_config {
desired_size = 3
min_size = 1
max_size = 10
}
labels = {
"kubernetes.io/arch" = "arm64"
"node-type" = "graviton"
}
taint {
key = "arch"
value = "arm64"
effect = "NO_SCHEDULE"
}
tags = {
Name = "graviton-node"
Environment = var.environment
CostCenter = "platform"
}
}
Multi-arch Docker build
# Dockerfile for multi-architecture support
FROM --platform=$BUILDPLATFORM golang:1.21-alpine AS builder
ARG TARGETARCH
ARG TARGETOS
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 GOOS=${TARGETOS} GOARCH=${TARGETARCH} \
go build -o /app/server ./cmd/server
# Final image
FROM alpine:3.18
COPY --from=builder /app/server /server
ENTRYPOINT ["/server"]
# Build and push multi-arch image
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t your-registry/app:latest \
--push .
3.3 VPC endpoint optimization
# vpc-endpoints.tf
locals {
gateway_endpoints = ["s3", "dynamodb"]
interface_endpoints = [
"ecr.api",
"ecr.dkr",
"logs",
"monitoring",
"secretsmanager",
"ssm",
"ssmmessages",
"ec2messages",
"sts"
]
}
# Gateway endpoints (free for S3/DynamoDB)
resource "aws_vpc_endpoint" "gateway" {
for_each = toset(local.gateway_endpoints)
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.${each.key}"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
tags = {
Name = "${each.key}-endpoint"
}
}
# Interface endpoints (charged per hour + data)
resource "aws_vpc_endpoint" "interface" {
for_each = toset(local.interface_endpoints)
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.${each.key}"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
tags = {
Name = "${each.key}-endpoint"
}
}
resource "aws_security_group" "vpc_endpoints" {
name_prefix = "vpc-endpoints-"
vpc_id = aws_vpc.main.id
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [aws_vpc.main.cidr_block]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
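The reason endpoints matter for cost: without them, traffic from private subnets to AWS services rides the NAT gateway at a per-GB processing charge. A sketch of the break-even (rates are illustrative us-east-1 figures; both function names are mine, and you should check current NAT gateway and PrivateLink pricing):

```python
HOURS_PER_MONTH = 730

def nat_monthly_cost(gb, hourly=0.045, per_gb=0.045):
    """NAT gateway: hourly charge plus per-GB data processing."""
    return hourly * HOURS_PER_MONTH + gb * per_gb

def interface_endpoint_monthly_cost(gb, azs=2, hourly=0.01, per_gb=0.01):
    """Interface endpoint: one ENI per AZ billed hourly, plus per-GB."""
    return azs * hourly * HOURS_PER_MONTH + gb * per_gb

for gb in (100, 1_000, 10_000):
    print(f"{gb:>6} GB/mo  NAT: ${nat_monthly_cost(gb):9.2f}"
          f"  endpoint: ${interface_endpoint_monthly_cost(gb):9.2f}")
```

Gateway endpoints for S3 and DynamoDB carry no hourly or per-GB charge at all, which is why the Terraform above creates them unconditionally and reserves interface endpoints for services that justify their ENI cost.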
Tools and resources
Free tools
| Tool | Purpose | Link |
|---|---|---|
| AWS Cost Explorer | Primary cost analysis | Built into AWS |
| AWS Compute Optimizer | Right-sizing recommendations | Built into AWS |
| AWS Trusted Advisor | Best practice checks | Limited free tier |
| Infracost | Terraform cost estimation | infracost.io |
| Komiser | Multi-cloud cost dashboard | github.com/tailwarden/komiser |
Building a cost-conscious culture
Technical optimization is only half the battle. The other half is organizational.
1. Make costs visible
#!/usr/bin/env python3
# weekly-team-cost-report.py
import boto3
from datetime import datetime, timezone, timedelta
def generate_team_cost_report():
ce = boto3.client('ce')
end = datetime.now(timezone.utc).strftime('%Y-%m-%d')
start = (datetime.now(timezone.utc) - timedelta(days=7)).strftime('%Y-%m-%d')
response = ce.get_cost_and_usage(
TimePeriod={'Start': start, 'End': end},
Granularity='DAILY',
Metrics=['UnblendedCost'],
GroupBy=[{'Type': 'TAG', 'Key': 'Team'}]
)
# Format and send to Slack/email
for day in response['ResultsByTime']:
for group in day['Groups']:
team = group['Keys'][0].replace('Team$', '') or 'Untagged'
cost = float(group['Metrics']['UnblendedCost']['Amount'])
print(f"{day['TimePeriod']['Start']} | {team}: ${cost:.2f}")
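To push the report somewhere visible, one option is a Slack incoming webhook. A minimal formatter (the message shape here is the simple `text` payload Slack webhooks accept; `format_cost_message` is a hypothetical helper, and posting the payload is left to `urllib.request` or `requests` against your webhook URL):

```python
import json

def format_cost_message(rows):
    """rows: list of (date, team, cost) tuples -> Slack webhook JSON payload."""
    lines = ["*Weekly AWS cost by team*"]
    for date, team, cost in rows:
        lines.append(f"{date} | {team}: ${cost:,.2f}")
    return json.dumps({"text": "\n".join(lines)})

payload = format_cost_message([("2024-01-01", "platform", 1234.5)])
print(payload)
```

Wiring this into the script above means collecting the printed rows into tuples instead and POSTing the payload once per run, for example from a weekly scheduled Lambda or cron job.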
2. Set team budgets
Each team should have visibility into their own costs and accountability for staying within budget.
3. Celebrate cost wins
Make cost optimization achievements as visible as feature launches.
Common mistakes to avoid
- Optimizing too early — Don’t buy 3-year RIs for a startup you can’t predict
- Over-committing — Start with 50% coverage, increase gradually
- Ignoring data transfer — Often 5-15% of the bill, completely invisible
- Forgetting about dev/staging — Often running 24/7 unnecessarily
- Not tagging resources — You can’t optimize what you can’t see
When to get help
Consider getting professional help if:
- AWS spend is >$10K/month and growing
- No one has reviewed the bill in 6+ months
- You lack in-house AWS expertise
- Previous optimization attempts haven’t stuck
- You need to show ROI quickly
Next steps
- This week: Run the quick-win scripts, set up cost alerts
- This month: Analyze RI/SP opportunities, implement storage lifecycle
- This quarter: Evaluate architectural changes, build cost culture