How AWS S3 is built

A behind-the-scenes look at how Amazon S3 is designed for durability and correctness at massive scale, with Mai-Lan Tomsen Bukovec at AWS, drawing on over a decade of operating one of the world’s largest distributed systems.

Amazon S3 operates at immense scale, handling hundreds of millions of transactions per second and storing hundreds of exabytes of data. It gets there through meticulous engineering and a design philosophy centered on building for failure. The system has evolved significantly: performance-critical code has been rewritten in Rust, and formal methods are used to verify correctness, particularly in areas like consistency and cross-region replication. Key to its reliability are strategies to mitigate correlated failures and the principle that scale should be an advantage, leading to continuous improvements in performance and user experience.

  • S3 handles hundreds of millions of transactions per second globally and stores over 500 trillion objects, amounting to hundreds of exabytes of data.
  • The system achieved strong consistency without compromising availability or increasing costs through innovations like a replicated journal and a new cache coherency protocol.
  • Performance-critical code paths in S3 have been largely rewritten in Rust to maximize performance and minimize latency.
  • S3’s 11 nines of durability are continuously measured by auditor microservices, with automated repair systems addressing detected issues.
  • Formal methods, including automated reasoning, are extensively used in production to verify code correctness, especially for the index subsystem and cross-region replication.
  • Correlated failures, where multiple components fail simultaneously due to shared fault domains, are a primary concern and are mitigated by replicating data across multiple availability zones and using quorum-based algorithms.
  • S3 comprises around 200 microservices, many dedicated to durability tasks like health checks and repairs, emphasizing simplified, focused services.
  • S3 Vectors, a new data structure for searching high-dimensional vector spaces, achieves sub-100ms query times by precomputing vector neighborhoods.
  • Crash consistency is a core design philosophy, with engineers reasoning about system states under failure conditions.
  • The ‘Scale Is to Your Advantage’ principle holds that increased scale should improve system attributes like reliability.
  • Read the full article

https://newsletter.pragmaticengineer.com/p/how-aws-s3-is-built
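
The quorum idea in the correlated-failures point above can be illustrated with a small sketch. This is not S3's actual protocol, which the summary does not describe. It is a generic versioned quorum store, and every name in it (Replica, QuorumStore, the az-1 to az-3 labels, the quorum sizes) is hypothetical. It assumes N replicas, one per availability zone, with a write quorum W and read quorum R chosen so that R + W > N and 2W > N. Under those assumptions any read quorum overlaps the most recent write, so a replica that missed a write while its zone was down is outvoted.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one copy of the data in one availability zone."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)  # key -> (version, value)


class QuorumStore:
    """Illustrative quorum-replicated key-value store (not S3's real design)."""

    def __init__(self, replicas, write_quorum, read_quorum):
        n = len(replicas)
        # R + W > N: every read quorum overlaps every write quorum.
        # 2W > N: any two write quorums overlap, so versions only grow.
        if write_quorum + read_quorum <= n or 2 * write_quorum <= n:
            raise ValueError("need R + W > N and 2W > N")
        self.replicas = replicas
        self.w = write_quorum
        self.r = read_quorum

    def _live(self):
        return [rep for rep in self.replicas if rep.up]

    def put(self, key, value):
        live = self._live()
        if len(live) < self.w:
            raise RuntimeError("write quorum unavailable")
        # New version is one more than the highest version any live replica holds.
        version = 1 + max(rep.data.get(key, (0, None))[0] for rep in live)
        for rep in live:
            rep.data[key] = (version, value)
        return version

    def get(self, key):
        live = self._live()
        if len(live) < self.r:
            raise RuntimeError("read quorum unavailable")
        # Ask only R replicas and trust the answer with the highest version.
        answers = [rep.data.get(key, (0, None)) for rep in live[: self.r]]
        return max(answers, key=lambda answer: answer[0])[1]


zones = [Replica("az-1"), Replica("az-2"), Replica("az-3")]
store = QuorumStore(zones, write_quorum=2, read_quorum=2)

store.put("photo.jpg", "v1")   # all three zones hold version 1
zones[0].up = False            # az-1 fails
store.put("photo.jpg", "v2")   # az-2 and az-3 hold version 2
zones[0].up = True             # az-1 returns with stale data
zones[2].up = False            # az-3 fails
print(store.get("photo.jpg"))  # reads az-1 and az-2; prints v2
```

Losing a single zone never blocks reads or writes here, while losing two zones makes both fail loudly rather than return possibly stale data. That is the trade a quorum makes against correlated failures within one fault domain.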
