AWS Cost Optimization: The Complete Guide

Everything you need to know about reducing your AWS bill by 20-40% — from quick wins to architectural changes.

Every company I’ve audited has the same story: AWS costs grew faster than expected.

It’s not because AWS is expensive. It’s because:

  • Defaults favor availability over cost — Multi-AZ, On-Demand, S3 Standard
  • Engineers don’t see the bill — no feedback loop between provisioning and spending
  • Growth happens faster than cleanup — old resources accumulate
  • Nobody owns cloud cost — it’s everyone’s job, so it’s nobody’s job
  • Fear of breaking things — “let’s just leave it running”

The good news: these problems are fixable. This guide shows you how.

The cost optimization framework

Think of optimization in three tiers:

[Figure: the three-tier cost optimization pyramid — Tier 1 Quick Wins at the base (hours of effort, 10-30% savings), Tier 2 Commitments (days, 20-40% savings), Tier 3 Architecture at the top (months, 30-50% savings)]

Always start at Tier 1. Work your way up.

Tier 1: Quick wins (do this week)

1.1 Find and terminate idle resources

Idle EC2 instances

#!/bin/bash
# find-idle-ec2.sh
# Find EC2 instances with <5% average CPU over 14 days
# Requires GNU date (Linux). On macOS, install coreutils and use gdate.

THRESHOLD=5
DAYS=14

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[InstanceId,InstanceType,Tags[?Key==`Name`]|[0].Value]' \
  --output text | while IFS=$'\t' read -r instance_id instance_type name; do

  avg_cpu=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=$instance_id \
    --start-time $(date -d "-${DAYS} days" -u +%Y-%m-%dT%H:%M:%SZ) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
    --period 86400 \
    --statistics Average \
    --query 'Datapoints[].Average' \
    --output text | awk '{sum+=$1; count++} END {if(count>0) print sum/count; else print 0}')

  if (( $(echo "$avg_cpu < $THRESHOLD" | bc -l) )); then
    echo "IDLE: $instance_id ($name) - Type: $instance_type - Avg CPU: ${avg_cpu}%"
  fi
done
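
Once the list has been reviewed by the instance owners, stop (don't terminate) as a reversible first step: compute charges end immediately, and nothing is lost if someone objects. A minimal boto3 sketch; the script name and instance ID are placeholders:

#!/usr/bin/env python3
# stop-reviewed-idle.py (hypothetical helper)

import boto3

ec2 = boto3.client('ec2')

# Replace with IDs confirmed idle by their owners
idle_instance_ids = ['i-0123456789abcdef0']

# Stop, not terminate: reversible, and halts compute charges immediately.
# Attached EBS volumes continue to accrue storage charges until deleted.
ec2.stop_instances(InstanceIds=idle_instance_ids)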

Unattached EBS volumes

#!/bin/bash
# find-unattached-ebs.sh

echo "Unattached EBS Volumes:"
echo "========================"

aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[].[VolumeId,Size,VolumeType,CreateTime]' \
  --output table

# Calculate total cost
total_gb=$(aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'sum(Volumes[].Size)' \
  --output text)

echo ""
echo "Total unattached storage: ${total_gb} GB"
echo "Estimated monthly cost: \$$(echo "$total_gb * 0.10" | bc) (gp2/gp3)"

Old EBS snapshots

#!/usr/bin/env python3
# find-old-snapshots.py

import boto3
from datetime import datetime, timezone

ec2 = boto3.client('ec2')
MAX_AGE_DAYS = 90

# Get account ID
sts = boto3.client('sts')
account_id = sts.get_caller_identity()['Account']

# Get all snapshots owned by this account (describe_snapshots caps at 1,000 per call, so paginate)
snapshots = []
for page in ec2.get_paginator('describe_snapshots').paginate(OwnerIds=[account_id]):
    snapshots.extend(page['Snapshots'])

old_snapshots = []
total_size = 0

for snap in snapshots:
    age = datetime.now(timezone.utc) - snap['StartTime']
    if age.days > MAX_AGE_DAYS:
        old_snapshots.append({
            'SnapshotId': snap['SnapshotId'],
            'Size': snap['VolumeSize'],
            'Age': age.days,
            'Description': snap.get('Description', 'N/A')[:50]
        })
        total_size += snap['VolumeSize']

print(f"\nSnapshots older than {MAX_AGE_DAYS} days:")
print("=" * 80)
for snap in sorted(old_snapshots, key=lambda x: x['Age'], reverse=True)[:20]:
    print(f"{snap['SnapshotId']} | {snap['Size']:>5} GB | {snap['Age']:>4} days | {snap['Description']}")

print(f"\nTotal old snapshots: {len(old_snapshots)}")
print(f"Total size: {total_size} GB")
print(f"Estimated monthly cost: ${total_size * 0.05:.2f}")

1.2 Right-size over-provisioned instances

Check EC2 recommendations from Compute Optimizer

# Enable Compute Optimizer (one-time)
aws compute-optimizer update-enrollment-status --status Active

# Get EC2 recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[?finding==`OVER_PROVISIONED`].[
    instanceArn,
    currentInstanceType,
    recommendationOptions[0].instanceType,
    recommendationOptions[0].projectedUtilizationMetrics
  ]' \
  --output table

RDS right-sizing analysis

#!/usr/bin/env python3
# analyze-rds-utilization.py

import boto3
from datetime import datetime, timezone, timedelta

rds = boto3.client('rds')
cloudwatch = boto3.client('cloudwatch')

def get_rds_metrics(db_identifier, days=14):
    """Get CPU and memory metrics for RDS instance"""
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days)

    metrics = {}

    for metric_name in ['CPUUtilization', 'FreeableMemory', 'DatabaseConnections']:
        response = cloudwatch.get_metric_statistics(
            Namespace='AWS/RDS',
            MetricName=metric_name,
            Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': db_identifier}],
            StartTime=start_time,
            EndTime=end_time,
            Period=3600,
            Statistics=['Average', 'Maximum']
        )

        if response['Datapoints']:
            avg = sum(d['Average'] for d in response['Datapoints']) / len(response['Datapoints'])
            max_val = max(d['Maximum'] for d in response['Datapoints'])
            metrics[metric_name] = {'average': avg, 'maximum': max_val}

    return metrics

# Analyze all RDS instances
instances = rds.describe_db_instances()['DBInstances']

print("RDS Instance Utilization Analysis")
print("=" * 80)

for instance in instances:
    db_id = instance['DBInstanceIdentifier']
    instance_class = instance['DBInstanceClass']

    metrics = get_rds_metrics(db_id)

    if metrics.get('CPUUtilization'):
        cpu_avg = metrics['CPUUtilization']['average']
        cpu_max = metrics['CPUUtilization']['maximum']

        status = "OK"
        if cpu_avg < 20 and cpu_max < 50:
            status = "OVER-PROVISIONED - Consider downsizing"
        elif cpu_avg > 80:
            status = "UNDER-PROVISIONED - Consider upsizing"

        print(f"\n{db_id} ({instance_class})")
        print(f"  CPU Avg: {cpu_avg:.1f}% | CPU Max: {cpu_max:.1f}%")
        print(f"  Status: {status}")

1.3 Storage class optimization

S3 bucket analysis

#!/usr/bin/env python3
# analyze-s3-storage.py

import boto3
from datetime import datetime, timezone, timedelta

s3 = boto3.client('s3')
cloudwatch = boto3.client('cloudwatch')

def get_bucket_size(bucket_name):
    """Get bucket size from CloudWatch metrics"""
    now = datetime.now(timezone.utc)
    start = now.replace(hour=0, minute=0, second=0, microsecond=0) - timedelta(days=1)

    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/S3',
        MetricName='BucketSizeBytes',
        Dimensions=[
            {'Name': 'BucketName', 'Value': bucket_name},
            {'Name': 'StorageType', 'Value': 'StandardStorage'}
        ],
        StartTime=start,
        EndTime=now,
        Period=86400,
        Statistics=['Average']
    )
    if response['Datapoints']:
        return response['Datapoints'][0]['Average']
    return 0

buckets = s3.list_buckets()['Buckets']

print("S3 Bucket Storage Analysis")
print("=" * 80)

total_standard = 0
recommendations = []

for bucket in buckets:
    name = bucket['Name']

    try:
        # Check if lifecycle policy exists
        try:
            s3.get_bucket_lifecycle_configuration(Bucket=name)
            has_lifecycle = True
        except s3.exceptions.ClientError:
            has_lifecycle = False

        size_bytes = get_bucket_size(name)
        size_gb = size_bytes / (1024 ** 3)

        if size_gb > 1:  # Only show buckets > 1 GB
            total_standard += size_gb

            if not has_lifecycle and size_gb > 10:
                recommendations.append({
                    'bucket': name,
                    'size_gb': size_gb,
                    'potential_savings': size_gb * 0.7 * (0.023 - 0.0125)  # ~70% to Standard-IA ($0.0125/GB-mo); Glacier saves more
                })

            print(f"{name}: {size_gb:.2f} GB | Lifecycle: {'Yes' if has_lifecycle else 'NO'}")

    except Exception as e:
        print(f"{name}: Error - {e}")

print(f"\n{'=' * 80}")
print(f"Total Standard Storage: {total_standard:.2f} GB")
print(f"Monthly cost (Standard): ${total_standard * 0.023:.2f}")

if recommendations:
    print("\nBuckets needing lifecycle policies:")
    for rec in sorted(recommendations, key=lambda x: x['size_gb'], reverse=True):
        print(f"  {rec['bucket']}: {rec['size_gb']:.2f} GB - Potential savings: ${rec['potential_savings']:.2f}/mo")

1.4 Set up cost alerts

# Create a budget with alerts
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "Monthly-AWS-Budget",
    "BudgetLimit": {
      "Amount": "10000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST"
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "your-email@company.com"
        }
      ]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        {
          "SubscriptionType": "EMAIL",
          "Address": "your-email@company.com"
        }
      ]
    }
  ]'

Tier 2: Commitment discounts (do this month)

2.1 Reserved Instances vs Savings Plans

Feature          Reserved Instances               Savings Plans
Discount         Up to 72%                        Up to 66%
Flexibility      Instance-type specific           Any instance type
Region-locked    Yes                              Compute SP: No
Best for         Predictable, stable workloads    Variable workloads
Recommendation   RDS, ElastiCache                 EC2, Fargate, Lambda

2.2 RI / SP decision framework

#!/usr/bin/env python3
# ri-sp-analyzer.py

import boto3

ce = boto3.client('ce')

def get_ri_recommendations(service):
    """Get RI purchase recommendations"""
    response = ce.get_reservation_purchase_recommendation(
        Service=service,
        TermInYears='ONE_YEAR',
        PaymentOption='NO_UPFRONT',
        LookbackPeriodInDays='SIXTY_DAYS'
    )
    return response.get('Recommendations', [])

def get_sp_recommendations():
    """Get Savings Plans recommendations"""
    response = ce.get_savings_plans_purchase_recommendation(
        SavingsPlansType='COMPUTE_SP',
        TermInYears='ONE_YEAR',
        PaymentOption='NO_UPFRONT',
        LookbackPeriodInDays='SIXTY_DAYS'
    )
    return response.get('SavingsPlansPurchaseRecommendation', {})

print("Reserved Instance Recommendations")
print("=" * 60)

for service in [
    'Amazon Elastic Compute Cloud - Compute',
    'Amazon Relational Database Service',
    'Amazon ElastiCache',
]:
    recs = get_ri_recommendations(service)
    if recs:
        for rec in recs:
            details = rec.get('RecommendationDetails', [])
            for detail in details[:3]:  # Top 3 recommendations
                print(f"\n{service}")
                print(f"  Instance: {detail.get('InstanceDetails', {})}")
                print(f"  Monthly savings: ${float(detail.get('EstimatedMonthlySavingsAmount', 0)):.2f}")
                print(f"  Upfront cost: ${float(detail.get('UpfrontCost', 0)):.2f}")

print("\n" + "=" * 60)
print("Savings Plans Recommendations")
print("=" * 60)

sp_rec = get_sp_recommendations()
if sp_rec:
    details = sp_rec.get('SavingsPlansPurchaseRecommendationDetails', [{}])[0]
    print(f"Recommended hourly commitment: ${details.get('HourlyCommitmentToPurchase', 'N/A')}")
    print(f"Estimated monthly savings: ${float(details.get('EstimatedMonthlySavingsAmount', 0)):.2f}")

2.3 Implementing commitments safely

Start-small strategy:

  • Week 1: Buy RIs/SPs for 50% of stable On-Demand usage
  • Month 1: Monitor utilization, ensure >80% RI/SP usage
  • Month 2: Increase to 70% coverage
  • Month 3: Reach target 80-90% coverage

# Monitor RI utilization
aws ce get-reservation-utilization \
  --time-period Start=$(date -d "-30 days" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --query 'UtilizationsByTime[].Total.[UtilizationPercentage]'

# Monitor Savings Plans utilization
aws ce get-savings-plans-utilization \
  --time-period Start=$(date -d "-30 days" +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY
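
Utilization tells you whether what you bought is being used; coverage tells you how much of your eligible usage is commitment-backed, which is what the 50-90% targets above refer to. A sketch pulling both coverage figures from Cost Explorer:

#!/usr/bin/env python3
# commitment-coverage.py (hypothetical helper)

import boto3
from datetime import date, timedelta

ce = boto3.client('ce')
period = {
    'Start': (date.today() - timedelta(days=30)).isoformat(),
    'End': date.today().isoformat(),
}

# Reserved Instance coverage
for t in ce.get_reservation_coverage(TimePeriod=period, Granularity='MONTHLY')['CoveragesByTime']:
    print("RI coverage:", t['Total']['CoverageHours']['CoverageHoursPercentage'], "%")

# Savings Plans coverage
for t in ce.get_savings_plans_coverage(TimePeriod=period, Granularity='MONTHLY')['SavingsPlansCoverages']:
    print("SP coverage:", t['Coverage']['CoveragePercentage'], "%")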

Tier 3: Architectural optimization (strategic)

3.1 Spot instances for stateless workloads

EKS with Karpenter v1 spot configuration

# karpenter-spot-nodepool.yaml — Karpenter v1 (GA)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    metadata:
      labels:
        workload-type: spot
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spot-template
      taints:
        - key: spot
          value: "true"
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 500
    memory: 500Gi

---
# Deployment that uses Spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: processor
          image: batch-processor:latest
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"

3.2 Graviton migration

Terraform for Graviton EKS node group

# Graviton node group — typically ~20% cheaper than comparable x86 instances
resource "aws_eks_node_group" "graviton" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "graviton-nodes"
  node_role_arn   = aws_iam_role.eks_nodes.arn
  subnet_ids      = aws_subnet.private[*].id

  # Graviton instance types
  instance_types = ["m6g.large", "m6g.xlarge", "c6g.large", "c6g.xlarge"]

  # ARM64 AMI
  ami_type = "AL2_ARM_64"

  scaling_config {
    desired_size = 3
    min_size     = 1
    max_size     = 10
  }

  labels = {
    "kubernetes.io/arch" = "arm64"
    "node-type"          = "graviton"
  }

  taint {
    key    = "arch"
    value  = "arm64"
    effect = "NO_SCHEDULE"
  }

  tags = {
    Name        = "graviton-node"
    Environment = var.environment
    CostCenter  = "platform"
  }
}

Multi-arch Docker build

# Dockerfile for multi-architecture support
FROM --platform=$BUILDPLATFORM golang:1.21-alpine AS builder

ARG TARGETARCH
ARG TARGETOS

WORKDIR /app
COPY . .

RUN CGO_ENABLED=0 GOOS=${TARGETOS} GOARCH=${TARGETARCH} \
    go build -o /app/server ./cmd/server

# Final image
FROM alpine:3.18
COPY --from=builder /app/server /server
ENTRYPOINT ["/server"]

# Build and push multi-arch image
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t your-registry/app:latest \
  --push .

3.3 VPC endpoint optimization

Traffic from private subnets to AWS APIs normally rides through a NAT gateway at roughly $0.045/GB of data processing. Gateway endpoints for S3 and DynamoDB are free and bypass NAT entirely; interface endpoints cost about $0.01/hour per AZ plus $0.01/GB, which usually wins at volume.

# vpc-endpoints.tf

locals {
  gateway_endpoints = ["s3", "dynamodb"]

  interface_endpoints = [
    "ecr.api",
    "ecr.dkr",
    "logs",
    "monitoring",
    "secretsmanager",
    "ssm",
    "ssmmessages",
    "ec2messages",
    "sts"
  ]
}

# Gateway endpoints (free for S3/DynamoDB)
resource "aws_vpc_endpoint" "gateway" {
  for_each = toset(local.gateway_endpoints)

  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id

  tags = {
    Name = "${each.key}-endpoint"
  }
}

# Interface endpoints (charged per hour + data)
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(local.interface_endpoints)

  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "${each.key}-endpoint"
  }
}

resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpc-endpoints-"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
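
To size the payoff before building, check what NAT gateway data processing actually costs you. A sketch that groups last month's spend by usage type and filters client-side (NAT usage types contain "NatGateway", e.g. USE1-NatGateway-Bytes):

#!/usr/bin/env python3
# nat-spend-check.py (hypothetical helper)

import boto3
from datetime import date, timedelta

ce = boto3.client('ce')

resp = ce.get_cost_and_usage(
    TimePeriod={
        'Start': (date.today() - timedelta(days=30)).isoformat(),
        'End': date.today().isoformat(),
    },
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'USAGE_TYPE'}],
)

for period in resp['ResultsByTime']:
    for group in period['Groups']:
        if 'NatGateway' in group['Keys'][0]:
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            print(f"{group['Keys'][0]}: ${cost:.2f}")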

Tools and resources

Free tools

Tool                    Purpose                         Link
AWS Cost Explorer       Primary cost analysis           Built into AWS
AWS Compute Optimizer   Right-sizing recommendations    Built into AWS
AWS Trusted Advisor     Best practice checks            Limited free tier
Infracost               Terraform cost estimation       infracost.io
Komiser                 Multi-cloud cost dashboard      github.com/tailwarden/komiser

Building a cost-conscious culture

Technical optimization is only half the battle. The other half is organizational.

1. Make costs visible

#!/usr/bin/env python3
# weekly-team-cost-report.py

import boto3
from datetime import datetime, timezone, timedelta

def generate_team_cost_report():
    ce = boto3.client('ce')

    end = datetime.now(timezone.utc).strftime('%Y-%m-%d')
    start = (datetime.now(timezone.utc) - timedelta(days=7)).strftime('%Y-%m-%d')

    response = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'TAG', 'Key': 'Team'}]
    )

    # Format and send to Slack/email
    for day in response['ResultsByTime']:
        for group in day['Groups']:
            team = group['Keys'][0].replace('Team$', '') or 'Untagged'
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            print(f"{day['TimePeriod']['Start']} | {team}: ${cost:.2f}")

2. Set team budgets

Each team should have visibility into their own costs and accountability for staying within budget.
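
The budget CLI call from section 1.4 works per team too: scope it with a cost-allocation tag filter. A sketch assuming a "Team" tag has been activated as a cost allocation tag; the team name and limit are hypothetical:

import boto3

budgets = boto3.client('budgets')
account_id = boto3.client('sts').get_caller_identity()['Account']

budgets.create_budget(
    AccountId=account_id,
    Budget={
        'BudgetName': 'team-platform-monthly',  # hypothetical team
        'BudgetLimit': {'Amount': '2000', 'Unit': 'USD'},
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST',
        # Requires "Team" to be activated as a cost allocation tag first
        'CostFilters': {'TagKeyValue': ['user:Team$platform']},
    },
)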

3. Celebrate cost wins

Make cost optimization achievements as visible as feature launches.

Common mistakes to avoid

  • Optimizing too early — Don’t buy 3-year RIs for a startup you can’t predict
  • Over-committing — Start with 50% coverage, increase gradually
  • Ignoring data transfer — Often 5-15% of the bill, completely invisible
  • Forgetting about dev/staging — Often running 24/7 unnecessarily
  • Not tagging resources — You can’t optimize what you can’t see
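
The dev/staging item is the most automatable of the five: let teams opt in with a tag, then stop those instances outside working hours from a scheduled job (cron, EventBridge). A sketch; the Schedule tag convention is hypothetical:

#!/usr/bin/env python3
# stop-office-hours-instances.py (hypothetical helper, run on a schedule)

import boto3

ec2 = boto3.client('ec2')

# Running instances that opted in via a Schedule=office-hours tag
reservations = ec2.describe_instances(
    Filters=[
        {'Name': 'tag:Schedule', 'Values': ['office-hours']},
        {'Name': 'instance-state-name', 'Values': ['running']},
    ]
)['Reservations']

instance_ids = [i['InstanceId'] for r in reservations for i in r['Instances']]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print(f"Stopped {len(instance_ids)} instances until the next working window")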

When to get help

Consider getting professional help if:

  • AWS spend is >$10K/month and growing
  • No one has reviewed the bill in 6+ months
  • You lack in-house AWS expertise
  • Previous optimization attempts haven’t stuck
  • You need to show ROI quickly

Next steps

  • This week: Run the quick-win scripts, set up cost alerts
  • This month: Analyze RI/SP opportunities, implement storage lifecycle
  • This quarter: Evaluate architectural changes, build cost culture
