System Design — Distributed Systems
How Systems Survive Chaos
“Everything fails, all the time.” — Werner Vogels, Amazon CTO. CAP theorem, consistent hashing, rate limiting, circuit breakers, and leader election — the patterns that keep systems alive when everything goes wrong.
The difference between 99.9% and 99.99% costs millions. This page explains the patterns that make it possible.
Werner Vogels, Amazon's CTO, has a famous quote: “Everything fails, all the time.” He isn't being dramatic. At Amazon's scale — millions of servers, billions of requests per day — hardware failures aren't exceptional events. They're constants. Hard drives fail, networks partition, entire datacenters go dark. The question isn't if things will break, but how the system behaves when they do.
This page covers the hard problems in distributed systems: the mathematical impossibilities (CAP theorem), the elegant algorithms (consistent hashing), the protective patterns (circuit breakers, rate limiters), and the coordination protocols (leader election). These aren't theoretical — Netflix, Amazon, Google, and Discord use every single one in production.
If you haven't seen The Data Layer yet, that covers databases, caching, queues, and replication. This page builds on those concepts — you'll see why replication requires consensus, why caching needs consistent hashing, and why every service call needs a circuit breaker.
CAP Theorem — The Impossible Triangle
During a network partition, a distributed system must choose: serve consistent data (but some nodes refuse requests) or stay available (but some nodes serve stale data). You can't have both.
The tradeoff you face every day isn't C vs A
Network partitions are rare — maybe a few times per year even at massive scale. The tradeoff you actually face on every single request is latency vs consistency. Daniel Abadi formalized this as PACELC: if there's a Partition, choose Availability or Consistency. Else (no partition), choose Latency or Consistency.
DynamoDB lets you choose per query. Set ConsistentRead: true for critical reads (checking inventory before purchase) and use the default eventually consistent read for everything else (browsing product pages). The consistent read costs 2x the capacity units and adds a few milliseconds — a tax you pay only when correctness matters.
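The per-query choice is just one flag on the read request. A minimal sketch of building DynamoDB `GetItem` parameters (the table name "Inventory" and key shape are hypothetical; `ConsistentRead` is the actual API parameter):

```python
def build_get_item(table, key, critical=False):
    """Return GetItem parameters; pay the strongly consistent read
    tax only when correctness matters (e.g. checking inventory)."""
    params = {"TableName": table, "Key": key}
    if critical:
        params["ConsistentRead"] = True   # 2x capacity units, a few ms slower
    return params

# Checking stock before a purchase: strongly consistent.
checkout = build_get_item("Inventory", {"sku": {"S": "A123"}}, critical=True)
# Browsing a product page: the eventually consistent default.
browse = build_get_item("Inventory", {"sku": {"S": "A123"}})
```

Wrapping the flag in a helper like this keeps the "when do we need consistency" decision in one auditable place instead of scattered across call sites.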
Google Spanner claims to offer “effectively CA” by using atomic clocks and GPS receivers in every datacenter to keep global time synchronized to within a few microseconds. This lets them do globally consistent reads without the typical latency penalty. The catch: it costs $0.90/node-hour and requires Google's custom hardware. For the rest of us, it's back to choosing between latency and consistency.
Consistent Hashing — The Elegant Solution
Regular hashing (hash(key) mod N) remaps roughly (N−1)/N of all keys whenever the server count changes: with four servers, removing one moves about 75% of keys. Consistent hashing? Only the dead server's keys move. Everything else stays put.
Consistent Hash Ring
The paper that powers every CDN
In 1997, David Karger and his co-authors at MIT published “Consistent Hashing and Random Trees.” They were solving distributed web caching — how do you spread billions of cached objects across thousands of servers without reshuffling everything when a server dies? Their solution was the hash ring, and the research soon became the foundation of Akamai.
Today, consistent hashing is everywhere. Amazon DynamoDB uses it to distribute data across partitions — when you add capacity, only 1/N of your data migrates instead of reshuffling everything. Cassandra's entire architecture is called “the ring” for this reason. Discord uses it to shard messages by guild ID. Every CDN uses it for cache routing.
Virtual nodes solved the last major problem. With just 4 physical servers on a ring, distribution can be wildly uneven — one server might hold 40% of keys while another holds 10%. By giving each physical server 100-150 virtual positions on the ring, distribution variance drops below 3%. It's an elegant trick: the ring doesn't know or care that 150 points are actually the same machine.
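The whole ring, virtual nodes included, fits in a few dozen lines. A minimal sketch in Python (150 virtual nodes per server as described above; MD5 as the ring hash is a conventional choice, not mandated by the paper):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=150):
        self.vnodes = vnodes
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each physical server gets `vnodes` positions on the ring; the ring
        # doesn't know or care that 150 points are the same machine.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        # Walk clockwise to the first virtual node at or past the key's hash.
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0                          # wrap around the ring
        return self._ring[idx][1]
```

Removing a server simply deletes its virtual nodes; every key that hashed to it walks clockwise to the next surviving node, and no other key moves.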
Rate Limiting — Protecting the Gates
Without rate limiting, one abusive client sending 50,000 requests/sec ruins it for everyone. Four algorithms (token bucket, leaky bucket, fixed window, sliding window), each with different burst handling.
Tokens drip into a bucket at a steady rate. Each request takes one token. Bucket empty? Request rejected.
Refill rate: 10 tokens/sec. Max capacity: 50 tokens. Allows bursts up to 50, then throttles to steady rate.
Handles bursts gracefully. Simple to implement. Most popular algorithm.
Burst size is fixed by bucket capacity. Memory efficient but less precise.
AWS API Gateway, Stripe, and GitHub all use token bucket. GitHub: 5,000 tokens/hour per user.
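The refill-and-take mechanics translate directly to code. A minimal single-process sketch (values like the threshold are illustrative; `time.monotonic` is used so clock adjustments can't mint free tokens):

```python
import time

class TokenBucket:
    """Tokens drip in at `rate` per second up to `capacity`;
    each request spends one token or is rejected."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity               # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Lazily refill based on elapsed time instead of running a timer.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `TokenBucket(rate=10, capacity=50)` you get the configuration from the example above: bursts of up to 50 requests pass, then traffic is throttled to the steady 10/sec refill rate.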
71 million requests per second
In February 2023, Cloudflare mitigated the largest HTTP DDoS attack ever recorded: 71 million requests per second, launched from a botnet of over 30,000 compromised devices. The attack lasted just a few minutes, but without rate limiting at the edge, it would have overwhelmed any origin server on earth.
Rate limiting isn't just for DDoS protection. It's fundamental to API design. GitHub gives you 5,000 API requests per hour. Twitter allows 300 tweet lookups per 15 minutes. Stripe rate-limits by API key and endpoint. These limits exist to ensure fair access and prevent one customer from degrading the service for everyone else.
The distributed rate limiting problem is subtle. With 10 API servers, each maintaining its own counter, a user could send 100 requests to each server — 1,000 total, but each server thinks it's within the 100-request limit. The fix: centralized counting with Redis. All servers atomically increment the same Redis key using INCR with a TTL. Redis handles this at microsecond latency, adding negligible overhead to each request.
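The centralized-counter fix is a few lines around INCR. A sketch of the fixed-window scheme, with a plain dict standing in for Redis so it runs without a server (in production, `allow` would issue INCR on the shared window key and EXPIRE it on first increment; the actual Redis client calls are omitted here):

```python
import time

class FixedWindowLimiter:
    """Fixed-window counting. All API servers share one counter per user per
    window; Redis makes the increment atomic, so 10 servers can't each grant
    a separate allowance."""
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window = window_secs
        self.counters = {}   # stand-in for Redis; INCR is atomic there

    def allow(self, user_id):
        # One key per user per window, e.g. "alice:478532" for the current hour.
        window_key = f"{user_id}:{int(time.time() // self.window)}"
        count = self.counters.get(window_key, 0) + 1   # Redis: INCR window_key
        self.counters[window_key] = count              # Redis: EXPIRE on first hit
        return count <= self.limit
```

Because the key embeds the window number, old counters simply stop being read; in Redis the TTL garbage-collects them.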
Fault Tolerance — Building Unbreakable Systems
Services will fail. Networks will partition. Disks will corrupt. These four patterns ensure one failure doesn't cascade into a total outage.
After N failures, stop calling the failing service entirely. Return a fallback instead.
CLOSED (normal) → failures accumulate → OPEN (all calls blocked, fallback returned) → timer expires → HALF-OPEN (one test request) → success: CLOSE / failure: stay OPEN.
Service A keeps calling broken Service B, tying up threads. A's threads fill up → A stops responding → A's callers fail → cascade failure.
Netflix built Hystrix for this. If the recommendation service dies, you still watch — just no 'Because You Watched' section.
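The CLOSED → OPEN → HALF-OPEN cycle above is compact in code. A minimal sketch (threshold and timeout values are illustrative; production breakers like Hystrix add sliding-window failure rates and metrics on top of this core):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures; OPEN fails fast
    with a fallback; after `reset_timeout` seconds, one HALF-OPEN trial call."""
    def __init__(self, threshold=5, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()        # fail fast: don't touch the broken service
            self.state = "HALF_OPEN"     # timer expired: allow one trial request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0                # success closes the circuit
        self.state = "CLOSED"
        return result
```

The key property is that an open circuit returns the fallback without ever invoking the failing service, so Service A's threads are never tied up waiting on a dead Service B.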
Demo scenario: 5 services in a chain. Service E slows down and the failure cascades upstream; with fault tolerance enabled, circuit breakers contain the damage.
Netflix breaks their own systems on purpose
In 2011, Netflix built Chaos Monkey — a tool that randomly kills production instances during business hours. The reasoning: if your system can't survive random failures in controlled conditions, it will definitely fail during real outages at 2 AM on Christmas Eve. They expanded this to the “Simian Army” — Chaos Gorilla kills entire availability zones, Latency Monkey injects random delays, Conformity Monkey shuts down instances that don't follow best practices.
The investment paid off spectacularly. On Christmas Eve 2012, AWS Elastic Load Balancing had a major outage. Services across the internet went down. Netflix stayed up. Their circuit breakers detected the ELB failures, stopped routing through affected paths, and continued streaming from healthy instances. Millions of people watched holiday movies while the rest of the internet was in crisis.
Amazon's 2017 S3 outage taught the rest of the industry the same lesson. A single mistyped command took down S3 in us-east-1 for 4 hours. Slack, Trello, Quora, Business Insider, and thousands of other services went down — not because their own code failed, but because they had no circuit breakers or fallbacks for S3 dependencies. The post-mortem led to a wave of adoption of resilience patterns across the industry.
Distributed Transactions — When One Database Isn't Enough
User pays, inventory decreases, order creates — across three services. If step 2 fails, step 1 already committed. You can't just “rollback” across independent databases. This is the hardest coordination problem in microservices.
The Problem — When One Database Isn't Enough
User buys a Widget: create order + charge card + reserve inventory + ship. Four services, four databases. What happens when step 3 fails but steps 1 and 2 already committed?
Create order (status: PENDING)
Charge $49.99 to credit card
Reserve 1x Widget (stock: 50 → 49)
Create shipment label
Two-Phase Commit (2PC) — All or Nothing
A coordinator asks every participant “Can you commit?” If all say yes → COMMIT. If any says no → ABORT. Simple concept, but the coordinator holds locks during the entire vote.
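A coordinator for this protocol is short; what makes real 2PC painful is everything the sketch omits — durable logging, coordinator recovery, and participants stuck holding locks if the coordinator dies mid-vote. A sketch with hypothetical participant objects exposing prepare/commit/abort:

```python
def two_phase_commit(participants):
    """Phase 1: every participant votes via prepare(). Phase 2: commit only
    if all voted yes; otherwise abort everyone who already prepared."""
    prepared = []
    for p in participants:
        if p.prepare():            # participant takes locks and votes yes
            prepared.append(p)
        else:                      # one 'no' vote aborts the whole transaction
            for q in prepared:
                q.abort()
            return False
    for p in prepared:             # unanimous yes: commit everywhere
        p.commit()
    return True

class FakeParticipant:
    """Hypothetical participant that records the calls it receives."""
    def __init__(self, vote):
        self.vote, self.log = vote, []
    def prepare(self):
        self.log.append("prepare")
        return self.vote
    def commit(self):
        self.log.append("commit")
    def abort(self):
        self.log.append("abort")

# Payments votes no, so the already-prepared order service must abort.
orders, payments = FakeParticipant(True), FakeParticipant(False)
aborted = two_phase_commit([orders, payments])
```

Note that between `prepare()` and the final decision, every participant sits holding locks and waiting on the coordinator — the availability cost the table below charges 2PC for.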
Saga Pattern — Compensating Transactions
Instead of locking everything, each service commits independently. If a step fails, run compensating transactions backwards to undo previous steps. No distributed locks. No coordinator blocking.
Create order (status: PENDING)
Charge $49.99 to credit card
Reserve 1x Widget (stock: 50 → 49)
Create shipment label
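The backward-compensation loop is the heart of a saga. A minimal sketch for the four-step order above (the action and compensation names are illustrative; a real saga also persists its progress so it survives crashes mid-flight):

```python
def run_saga(steps):
    """steps: (action, compensation) pairs. Each action commits in its own
    service; on failure, run compensations for completed steps in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):
                comp()                      # compensating transaction
            return False
    return True

log = []

def ok(name):
    return lambda: log.append(name)

def fails(name):
    def action():
        raise RuntimeError(name + " unavailable")
    return action

# Reserving inventory fails, so the saga refunds the card and
# cancels the order — in reverse order of the original steps.
succeeded = run_saga([
    (ok("create_order"),      ok("cancel_order")),
    (ok("charge_card"),       ok("refund_card")),
    (fails("reserve_widget"), ok("restock_widget")),
    (ok("create_shipment"),   ok("void_label")),
])
```

No distributed locks, no blocking coordinator — the price is that every step needs a correct, idempotent compensation, and outside observers can briefly see the intermediate states.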
Choreography vs Orchestration
A central coordinator manages the saga. It tells each service what to do and handles failures. Like a conductor leading an orchestra.
Saga Orchestrator → tells Order Service to create order → waits for response → tells Payment Service to charge → waits → tells Inventory to reserve → ...
Pros
- Easy to understand — all logic in one place
- Easy to debug — the orchestrator knows the full state
- Handles complex workflows with branching and retries

Cons
- The orchestrator is a single point of complexity
- Tighter coupling — the orchestrator knows about all services
- Risk of the orchestrator becoming a “god service”
Uber built Cadence (now Temporal) for this — 100+ use cases internally. Netflix, Maersk, DigitalOcean, and ANZ Bank all use Temporal for saga orchestration.
2PC vs Saga — When to Use Each
| Aspect | 2PC | Saga |
|---|---|---|
| Consistency | Strong (ACID) — all or nothing | Eventual — brief inconsistency windows |
| Availability | Low — coordinator failure blocks everyone | High — each step is independent |
| Latency | High — holds locks until all vote | Low — no distributed locks |
| Scalability | Poor — locks + coordinator bottleneck | Excellent — each service scales independently |
| Complexity | Simple concept, hard to implement correctly | Requires compensating transactions for every step |
| Use case | 2-3 databases, short transactions | Microservices, long-running workflows |
How Uber coordinates a ride across 15 states
To a user, an Uber ride is five steps: request, match, pickup, trip, complete. Architecturally, it's a 15+ state distributed state machine coordinating the rider, driver, and backend platform across six event streams. A single trip can span 30 minutes, involve dozens of services, and must survive crashes, network failures, and app restarts.
Uber built Cadence to solve this — an open-source workflow engine that grew to 100+ internal use cases. Before Cadence, 80% of requests to Uber's API gateway were polling calls: services endlessly asking “is it done yet?” Cadence replaced this with durable, stateful workflows where each step automatically retries and compensates on failure.
The creators of Cadence later founded Temporal, which has become the industry standard for saga orchestration. ANZ Bank reduced home loan processing development from over a year to weeks. Maersk cut logistics feature delivery from 60-80 days to 5-10 days. Netflix, DigitalOcean, and Stripe all use Temporal. The lesson: distributed transactions need a coordinator — but that coordinator must itself be distributed, fault-tolerant, and horizontally scalable.
Leader Election — Who's in Charge?
In a cluster, one node must coordinate writes. When the leader dies, the remaining nodes must agree on a new leader — fast, and without creating two leaders (split-brain).
The Split-Brain Problem
A network partition splits the cluster in two. Each side thinks the other is dead and elects its own leader. Two leaders accept conflicting writes. Quorum voting (majority required) prevents this.
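Majority voting is nearly a one-liner, and that is exactly why it prevents split-brain: two disjoint sides of the same cluster can never both hold a strict majority. A sketch:

```python
def has_quorum(votes, cluster_size):
    """Leader election succeeds only with a strict majority of ALL nodes,
    counting the unreachable ones on the far side of the partition."""
    return votes > cluster_size // 2

# A 5-node cluster partitions into sides of 3 and 2 nodes.
majority_elects = has_quorum(3, cluster_size=5)   # may elect a leader
minority_elects = has_quorum(2, cluster_size=5)   # must refuse and wait
```

This is also why clusters run an odd number of nodes: a 6-node cluster tolerates the same single quorum loss as a 5-node one but can deadlock on a 3/3 split, where neither side can elect anyone.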
The algorithm that keeps the internet coordinated
Leslie Lamport devised the Paxos consensus algorithm in the late 1980s, though his paper, “The Part-Time Parliament,” wasn't published until 1998. It was mathematically elegant and provably correct — and notoriously difficult to implement. So many teams got it wrong that in 2014, Diego Ongaro and John Ousterhout designed Raft as a more understandable alternative. Raft achieves the same guarantees as Paxos but with a clearer mental model: leader election, log replication, and safety.
etcd (used by Kubernetes for all cluster state) runs Raft. Consul (used by HashiCorp for service discovery) runs Raft. CockroachDB (used by DoorDash for global transactions) runs Raft. Every time Kubernetes schedules a pod, it's writing to etcd through Raft consensus — ensuring that even if one etcd node crashes mid-write, the cluster state stays consistent.
Kafka historically relied on ZooKeeper for leader election and metadata management. KRaft (Kafka Raft) eliminates that dependency entirely and was declared production-ready in 2022 with Kafka 3.3. Each Kafka controller now participates in a Raft quorum directly. This simplified operations enormously — no more managing a separate ZooKeeper cluster alongside your Kafka cluster.
What is the CAP theorem?
What's the difference between consistent hashing and regular hashing?
How does a circuit breaker prevent cascade failures?
Why use exponential backoff instead of immediate retries?
What is split-brain in distributed systems?
What's the difference between Raft and Paxos?
How does Netflix's Chaos Monkey work?
What is the bulkhead pattern?
How do rate limiters work in distributed systems?
What are the five nines of availability?
What's Next?
You've learned the patterns that keep distributed systems alive. Next, explore how systems protect themselves — TLS handshakes, OAuth flows, JWT tokens, and the attacks that monitoring helps you catch.
Sources
- Brewer, Eric — CAP Theorem (PODC 2000, proved by Gilbert & Lynch 2002)
- Karger et al. — Consistent Hashing and Random Trees (MIT, 1997)
- Ongaro & Ousterhout — In Search of an Understandable Consensus Algorithm (Raft, 2014)
- Netflix Tech Blog — Chaos Engineering and Hystrix
- Amazon Builders Library — Timeouts, retries, and backoff with jitter
- Cloudflare Blog — DDoS attack mitigation statistics
- Google Spanner Paper — Globally-consistent distributed database
- Abadi, Daniel — PACELC: Consistency and Availability in Distributed Systems
- Apache Kafka Documentation — KRaft consensus
- ByteByteGo — System Design resources