System Design — Distributed Systems
How Systems Survive Chaos
“Everything fails, all the time.” — Werner Vogels, Amazon CTO. CAP theorem, consistent hashing, rate limiting, circuit breakers, and leader election — the patterns that keep systems alive when everything goes wrong.
The difference between 99.9% and 99.99% costs millions. This page explains the patterns that make it possible.
Werner Vogels, Amazon's CTO, has a famous quote: “Everything fails, all the time.” He isn't being dramatic. At Amazon's scale — millions of servers, billions of requests per day — hardware failures aren't exceptional events. They're constants. Hard drives fail, networks partition, entire datacenters go dark. The question isn't if things will break, but how the system behaves when they do.
This page covers the hard problems in distributed systems: the mathematical impossibilities (CAP theorem), the elegant algorithms (consistent hashing), the protective patterns (circuit breakers, rate limiters), and the coordination protocols (leader election). These aren't theoretical — Netflix, Amazon, Google, and Discord use every single one in production.
If you haven't seen The Data Layer yet, that covers databases, caching, queues, and replication. This page builds on those concepts — you'll see why replication requires consensus, why caching needs consistent hashing, and why every service call needs a circuit breaker.
CAP Theorem — The Impossible Triangle
During a network partition, a distributed system must choose: serve consistent data (but some nodes refuse requests) or stay available (but some nodes serve stale data). You can't have both.
The tradeoff you face every day isn't C vs A
Network partitions are rare — maybe a few times per year even at massive scale. The tradeoff you actually face on every single request is latency vs consistency. Daniel Abadi formalized this as PACELC: if there's a Partition, choose Availability or Consistency. Else (no partition), choose Latency or Consistency.
DynamoDB lets you choose per query. Set ConsistentRead: true for critical reads (checking inventory before purchase) and use the default eventually consistent read for everything else (browsing product pages). The consistent read costs 2x the capacity units and adds a few milliseconds — a tax you pay only when correctness matters.
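The per-query choice is just one flag on the read request. A minimal sketch of building DynamoDB `GetItem` parameters (the table name "Inventory" and key shape are hypothetical; `ConsistentRead` is the actual API parameter):

```python
def build_get_item(table, key, critical=False):
    """Return GetItem parameters; pay the strongly consistent read
    tax only when correctness matters (e.g. checking inventory)."""
    params = {"TableName": table, "Key": key}
    if critical:
        params["ConsistentRead"] = True   # 2x capacity units, a few ms slower
    return params

# Checking stock before a purchase: strongly consistent.
checkout = build_get_item("Inventory", {"sku": {"S": "A123"}}, critical=True)
# Browsing a product page: the eventually consistent default.
browse = build_get_item("Inventory", {"sku": {"S": "A123"}})
```

Wrapping the flag in a helper like this keeps the "when do we need consistency" decision in one auditable place instead of scattered across call sites.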
Google Spanner claims to offer “effectively CA” by using atomic clocks and GPS receivers in every datacenter to keep global time synchronized to within a few microseconds. This lets them do globally consistent reads without the typical latency penalty. The catch: it costs $0.90/node-hour and requires Google's custom hardware. For the rest of us, it's back to choosing between latency and consistency.
Consistent Hashing — The Elegant Solution
Regular hashing (hash(key) mod N) remaps roughly (N−1)/N of all keys whenever the server count changes: with four servers, removing one moves about 75% of keys. Consistent hashing? Only the dead server's keys move. Everything else stays put.
Consistent Hash Ring
The paper that powers every CDN
In 1997, David Karger and his co-authors at MIT published “Consistent Hashing and Random Trees.” They were solving distributed web caching — how do you spread billions of cached objects across thousands of servers without reshuffling everything when a server dies? Their solution was the hash ring, and the research soon became the foundation of Akamai.
Today, consistent hashing is everywhere. Amazon DynamoDB uses it to distribute data across partitions — when you add capacity, only 1/N of your data migrates instead of reshuffling everything. Cassandra's entire architecture is called “the ring” for this reason. Discord uses it to shard messages by guild ID. Every CDN uses it for cache routing.
Virtual nodes solved the last major problem. With just 4 physical servers on a ring, distribution can be wildly uneven — one server might hold 40% of keys while another holds 10%. By giving each physical server 100-150 virtual positions on the ring, distribution variance drops below 3%. It's an elegant trick: the ring doesn't know or care that 150 points are actually the same machine.
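The whole ring, virtual nodes included, fits in a few dozen lines. A minimal sketch in Python (150 virtual nodes per server as described above; MD5 as the ring hash is a conventional choice, not mandated by the paper):

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes=(), vnodes=150):
        self.vnodes = vnodes
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Each physical server gets `vnodes` positions on the ring; the ring
        # doesn't know or care that 150 points are the same machine.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        # Walk clockwise to the first virtual node at or past the key's hash.
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0                          # wrap around the ring
        return self._ring[idx][1]
```

Removing a server simply deletes its virtual nodes; every key that hashed to it walks clockwise to the next surviving node, and no other key moves.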
Rate Limiting — Protecting the Gates
Without rate limiting, one abusive client sending 50,000 requests/sec ruins it for everyone. Four algorithms (token bucket, leaky bucket, fixed window, sliding window), each with different burst handling.
Tokens drip into a bucket at a steady rate. Each request takes one token. Bucket empty? Request rejected.
Refill rate: 10 tokens/sec. Max capacity: 50 tokens. Allows bursts up to 50, then throttles to steady rate.
Handles bursts gracefully. Simple to implement. Most popular algorithm.
Burst size is fixed by bucket capacity. Memory efficient but less precise.
AWS API Gateway, Stripe, and GitHub all use token bucket. GitHub: 5,000 tokens/hour per user.
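The refill-and-take mechanics translate directly to code. A minimal single-process sketch (values like the threshold are illustrative; `time.monotonic` is used so clock adjustments can't mint free tokens):

```python
import time

class TokenBucket:
    """Tokens drip in at `rate` per second up to `capacity`;
    each request spends one token or is rejected."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity               # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Lazily refill based on elapsed time instead of running a timer.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `TokenBucket(rate=10, capacity=50)` you get the configuration from the example above: bursts of up to 50 requests pass, then traffic is throttled to the steady 10/sec refill rate.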
71 million requests per second
In February 2023, Cloudflare mitigated the largest HTTP DDoS attack ever recorded: 71 million requests per second, launched from a botnet of over 30,000 compromised devices. The attack lasted just a few minutes, but without rate limiting at the edge, it would have overwhelmed any origin server on earth.
Rate limiting isn't just for DDoS protection. It's fundamental to API design. GitHub gives you 5,000 API requests per hour. Twitter allows 300 tweet lookups per 15 minutes. Stripe rate-limits by API key and endpoint. These limits exist to ensure fair access and prevent one customer from degrading the service for everyone else.
The distributed rate limiting problem is subtle. With 10 API servers, each maintaining its own counter, a user could send 100 requests to each server — 1,000 total, but each server thinks it's within the 100-request limit. The fix: centralized counting with Redis. All servers atomically increment the same Redis key using INCR with a TTL. Redis handles this at microsecond latency, adding negligible overhead to each request.
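The centralized-counter fix is a few lines around INCR. A sketch of the fixed-window scheme, with a plain dict standing in for Redis so it runs without a server (in production, `allow` would issue INCR on the shared window key and EXPIRE it on first increment; the actual Redis client calls are omitted here):

```python
import time

class FixedWindowLimiter:
    """Fixed-window counting. All API servers share one counter per user per
    window; Redis makes the increment atomic, so 10 servers can't each grant
    a separate allowance."""
    def __init__(self, limit, window_secs):
        self.limit = limit
        self.window = window_secs
        self.counters = {}   # stand-in for Redis; INCR is atomic there

    def allow(self, user_id):
        # One key per user per window, e.g. "alice:478532" for the current hour.
        window_key = f"{user_id}:{int(time.time() // self.window)}"
        count = self.counters.get(window_key, 0) + 1   # Redis: INCR window_key
        self.counters[window_key] = count              # Redis: EXPIRE on first hit
        return count <= self.limit
```

Because the key embeds the window number, old counters simply stop being read; in Redis the TTL garbage-collects them.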
Fault Tolerance — Building Unbreakable Systems
Services will fail. Networks will partition. Disks will corrupt. These four patterns ensure one failure doesn't cascade into a total outage.
After N failures, stop calling the failing service entirely. Return a fallback instead.
CLOSED (normal) → failures accumulate → OPEN (all calls blocked, fallback returned) → timer expires → HALF-OPEN (one test request) → success: CLOSE / failure: stay OPEN.
Service A keeps calling broken Service B, tying up threads. A's threads fill up → A stops responding → A's callers fail → cascade failure.
Netflix built Hystrix for this. If the recommendation service dies, you still watch — just no 'Because You Watched' section.
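The CLOSED → OPEN → HALF-OPEN cycle above is compact in code. A minimal sketch (threshold and timeout values are illustrative; production breakers like Hystrix add sliding-window failure rates and metrics on top of this core):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures; OPEN fails fast
    with a fallback; after `reset_timeout` seconds, one HALF-OPEN trial call."""
    def __init__(self, threshold=5, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()        # fail fast: don't touch the broken service
            self.state = "HALF_OPEN"     # timer expired: allow one trial request
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0                # success closes the circuit
        self.state = "CLOSED"
        return result
```

The key property is that an open circuit returns the fallback without ever invoking the failing service, so Service A's threads are never tied up waiting on a dead Service B.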
Demo scenario: 5 services in a chain. Service E slows down and the failure cascades upstream; with fault tolerance enabled, circuit breakers contain the damage.
Netflix breaks their own systems on purpose
In 2011, Netflix built Chaos Monkey — a tool that randomly kills production instances during business hours. The reasoning: if your system can't survive random failures in controlled conditions, it will definitely fail during real outages at 2 AM on Christmas Eve. They expanded this to the “Simian Army” — Chaos Gorilla kills entire availability zones, Latency Monkey injects random delays, Conformity Monkey shuts down instances that don't follow best practices.
The investment paid off spectacularly. On Christmas Eve 2012, AWS Elastic Load Balancing had a major outage. Services across the internet went down. Netflix stayed up. Their circuit breakers detected the ELB failures, stopped routing through affected paths, and continued streaming from healthy instances. Millions of people watched holiday movies while the rest of the internet was in crisis.
Amazon's 2017 S3 outage taught the rest of the industry the same lesson. A single mistyped command took down S3 in us-east-1 for 4 hours. Slack, Trello, Quora, Business Insider, and thousands of other services went down — not because their own code failed, but because they had no circuit breakers or fallbacks for S3 dependencies. The post-mortem led to a wave of adoption of resilience patterns across the industry.
Distributed Transactions — When One Database Isn't Enough
User pays, inventory decreases, order creates — across three services. If step 2 fails, step 1 already committed. You can't just “rollback” across independent databases. This is the hardest coordination problem in microservices.
The Problem — When One Database Isn't Enough
User buys a Widget: create order + charge card + reserve inventory + ship. Four services, four databases. What happens when step 3 fails but steps 1 and 2 already committed?
Create order (status: PENDING)
Charge $49.99 to credit card
Reserve 1x Widget (stock: 50 → 49)
Create shipment label
Two-Phase Commit (2PC) — All or Nothing
A coordinator asks every participant “Can you commit?” If all say yes → COMMIT. If any says no → ABORT. Simple concept, but the coordinator holds locks during the entire vote.
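A coordinator for this protocol is short; what makes real 2PC painful is everything the sketch omits — durable logging, coordinator recovery, and participants stuck holding locks if the coordinator dies mid-vote. A sketch with hypothetical participant objects exposing prepare/commit/abort:

```python
def two_phase_commit(participants):
    """Phase 1: every participant votes via prepare(). Phase 2: commit only
    if all voted yes; otherwise abort everyone who already prepared."""
    prepared = []
    for p in participants:
        if p.prepare():            # participant takes locks and votes yes
            prepared.append(p)
        else:                      # one 'no' vote aborts the whole transaction
            for q in prepared:
                q.abort()
            return False
    for p in prepared:             # unanimous yes: commit everywhere
        p.commit()
    return True

class FakeParticipant:
    """Hypothetical participant that records the calls it receives."""
    def __init__(self, vote):
        self.vote, self.log = vote, []
    def prepare(self):
        self.log.append("prepare")
        return self.vote
    def commit(self):
        self.log.append("commit")
    def abort(self):
        self.log.append("abort")

# Payments votes no, so the already-prepared order service must abort.
orders, payments = FakeParticipant(True), FakeParticipant(False)
aborted = two_phase_commit([orders, payments])
```

Note that between `prepare()` and the final decision, every participant sits holding locks and waiting on the coordinator — the availability cost the table below charges 2PC for.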
Saga Pattern — Compensating Transactions
Instead of locking everything, each service commits independently. If a step fails, run compensating transactions backwards to undo previous steps. No distributed locks. No coordinator blocking.
Create order (status: PENDING)
Charge $49.99 to credit card
Reserve 1x Widget (stock: 50 → 49)
Create shipment label
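The backward-compensation loop is the heart of a saga. A minimal sketch for the four-step order above (the action and compensation names are illustrative; a real saga also persists its progress so it survives crashes mid-flight):

```python
def run_saga(steps):
    """steps: (action, compensation) pairs. Each action commits in its own
    service; on failure, run compensations for completed steps in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):
                comp()                      # compensating transaction
            return False
    return True

log = []

def ok(name):
    return lambda: log.append(name)

def fails(name):
    def action():
        raise RuntimeError(name + " unavailable")
    return action

# Reserving inventory fails, so the saga refunds the card and
# cancels the order — in reverse order of the original steps.
succeeded = run_saga([
    (ok("create_order"),      ok("cancel_order")),
    (ok("charge_card"),       ok("refund_card")),
    (fails("reserve_widget"), ok("restock_widget")),
    (ok("create_shipment"),   ok("void_label")),
])
```

No distributed locks, no blocking coordinator — the price is that every step needs a correct, idempotent compensation, and outside observers can briefly see the intermediate states.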
Choreography vs Orchestration
A central coordinator manages the saga. It tells each service what to do and handles failures. Like a conductor leading an orchestra.
Saga Orchestrator → tells Order Service to create order → waits for response → tells Payment Service to charge → waits → tells Inventory to reserve → ...
Pros
- Easy to understand — all logic in one place
- Easy to debug — the orchestrator knows the full state
- Handles complex workflows with branching and retries

Cons
- The orchestrator is a single point of complexity
- Tighter coupling — the orchestrator knows about all services
- Risk of the orchestrator becoming a “god service”
Uber built Cadence (now Temporal) for this — 100+ use cases internally. Netflix, Maersk, DigitalOcean, and ANZ Bank all use Temporal for saga orchestration.
2PC vs Saga — When to Use Each
| Aspect | 2PC | Saga |
|---|---|---|
| Consistency | Strong (ACID) — all or nothing | Eventual — brief inconsistency windows |
| Availability | Low — coordinator failure blocks everyone | High — each step is independent |
| Latency | High — holds locks until all vote | Low — no distributed locks |
| Scalability | Poor — locks + coordinator bottleneck | Excellent — each service scales independently |
| Complexity | Simple concept, hard to implement correctly | Requires compensating transactions for every step |
| Use case | 2-3 databases, short transactions | Microservices, long-running workflows |
How Uber coordinates a ride across 15 states
To a user, an Uber ride is five steps: request, match, pickup, trip, complete. Architecturally, it's a 15+ state distributed state machine coordinating the rider, driver, and backend platform across six event streams. A single trip can span 30 minutes, involve dozens of services, and must survive crashes, network failures, and app restarts.
Uber built Cadence to solve this — an open-source workflow engine that grew to 100+ internal use cases. Before Cadence, 80% of requests to Uber's API gateway were polling calls: services endlessly asking “is it done yet?” Cadence replaced this with durable, stateful workflows where each step automatically retries and compensates on failure.
The creators of Cadence later founded Temporal, which has become the industry standard for saga orchestration. ANZ Bank reduced home loan processing development from over a year to weeks. Maersk cut logistics feature delivery from 60-80 days to 5-10 days. Netflix, DigitalOcean, and Stripe all use Temporal. The lesson: distributed transactions need a coordinator — but that coordinator must itself be distributed, fault-tolerant, and horizontally scalable.
Leader Election — Who's in Charge?
In a cluster, one node must coordinate writes. When the leader dies, the remaining nodes must agree on a new leader — fast, and without creating two leaders (split-brain).
The Split-Brain Problem
A network partition splits the cluster in two. Each side thinks the other is dead and elects its own leader. Two leaders accept conflicting writes. Quorum voting (majority required) prevents this.
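Majority voting is nearly a one-liner, and that is exactly why it prevents split-brain: two disjoint sides of the same cluster can never both hold a strict majority. A sketch:

```python
def has_quorum(votes, cluster_size):
    """Leader election succeeds only with a strict majority of ALL nodes,
    counting the unreachable ones on the far side of the partition."""
    return votes > cluster_size // 2

# A 5-node cluster partitions into sides of 3 and 2 nodes.
majority_elects = has_quorum(3, cluster_size=5)   # may elect a leader
minority_elects = has_quorum(2, cluster_size=5)   # must refuse and wait
```

This is also why clusters run an odd number of nodes: a 6-node cluster tolerates the same single quorum loss as a 5-node one but can deadlock on a 3/3 split, where neither side can elect anyone.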
The algorithm that keeps the internet coordinated
Leslie Lamport devised the Paxos consensus algorithm in the late 1980s, though his paper, “The Part-Time Parliament,” wasn't published until 1998. It was mathematically elegant and provably correct — and notoriously difficult to implement. So many teams got it wrong that in 2014, Diego Ongaro and John Ousterhout designed Raft as a more understandable alternative. Raft achieves the same guarantees as Paxos but with a clearer mental model: leader election, log replication, and safety.
etcd (used by Kubernetes for all cluster state) runs Raft. Consul (used by HashiCorp for service discovery) runs Raft. CockroachDB (used by DoorDash for global transactions) runs Raft. Every time Kubernetes schedules a pod, it's writing to etcd through Raft consensus — ensuring that even if one etcd node crashes mid-write, the cluster state stays consistent.
Kafka historically relied on ZooKeeper for leader election and metadata management. KRaft (Kafka Raft) eliminates that dependency entirely and was declared production-ready in 2022 with Kafka 3.3. Each Kafka controller now participates in a Raft quorum directly. This simplified operations enormously — no more managing a separate ZooKeeper cluster alongside your Kafka cluster.
What is the CAP theorem?
What's the difference between consistent hashing and regular hashing?
How does a circuit breaker prevent cascade failures?
Why use exponential backoff instead of immediate retries?
What is split-brain in distributed systems?
What's the difference between Raft and Paxos?
How does Netflix's Chaos Monkey work?
What is the bulkhead pattern?
How do rate limiters work in distributed systems?
What are the five nines of availability?
What's Next?
You've learned the patterns that keep distributed systems alive. Next, explore how systems protect themselves — TLS handshakes, OAuth flows, JWT tokens, and the attacks that monitoring helps you catch.
Sources
- Brewer, Eric — CAP Theorem (PODC 2000, proved by Gilbert & Lynch 2002)
- Karger et al. — Consistent Hashing and Random Trees (MIT, 1997)
- Ongaro & Ousterhout — In Search of an Understandable Consensus Algorithm (Raft, 2014)
- Netflix Tech Blog — Chaos Engineering and Hystrix
- Amazon Builders Library — Timeouts, retries, and backoff with jitter
- Cloudflare Blog — DDoS attack mitigation statistics
- Google Spanner Paper — Globally-consistent distributed database
- Abadi, Daniel — PACELC: Consistency and Availability in Distributed Systems
- Apache Kafka Documentation — KRaft consensus
- ByteByteGo — System Design resources