TheHowPage

System Design — The Data Layer

How Your Data Stays Alive

One tweet lives in 7 places simultaneously. Databases, caches, search indexes, queues, and analytics pipelines — all keeping copies of your data for different purposes. Keeping them in sync is the hardest problem in system design.


When you post a tweet, it doesn't just sit in one database. Within milliseconds, that 280-character message lives in 7 different places — a primary database, read replicas, Redis cache, CDN, Elasticsearch index, Kafka queue, and analytics pipeline. Each copy exists for a different reason, optimized for a different access pattern. Keeping them all in sync is the hardest problem in system design.

This page explores the data layer: how companies like Netflix, Twitter, and Amazon choose between SQL and NoSQL, when to shard versus replicate, why caching is 100x faster than querying, and how message queues let services fail gracefully. Every number comes from real production systems — Stack Overflow serving 1.3 billion page views on 9 servers, Redis handling 1.8 million operations per second, LinkedIn processing 7 trillion Kafka messages per day.

If you haven't seen The Network Layer yet, that covers the first half of the journey — DNS, TCP, HTTP, load balancers, CDNs. This page picks up where the request arrives at the server and needs to read or write data.

SQL vs NoSQL — The Great Database Debate

Same dataset, two different databases, five head-to-head experiments to see which wins at each task. Spoiler: neither wins every time.

SELECT * FROM users WHERE id = 42

PostgreSQL: TIE, 0.2ms. Primary key lookup via B-tree index. Instant.

MongoDB: TIE, 0.2ms. Document lookup by _id. Instant.

Simple lookups are fast everywhere. The differences appear when things get complex.

Final score across the experiments: 2 wins for SQL, 2 wins for NoSQL.

It's not SQL vs NoSQL — it's “which tradeoff matters more for YOUR use case?”

Quick Decision Helper

Do you need complex joins between related data? (Yes: lean SQL)

Does your data schema change often? (Yes: lean NoSQL)

Do you need ACID transactions? (Yes: lean SQL)

Do you need 100K+ writes per second? (Yes: lean NoSQL)

The real answer is “both”

Instagram uses PostgreSQL for user accounts and relationships (where data integrity is critical) and Cassandra for the activity feed (where write throughput matters more than consistency). Uber uses MySQL for trip data and Redis for real-time geolocation. Netflix uses CockroachDB for billing and Cassandra for viewing history. The question isn't “SQL or NoSQL?” — it's “which database for which access pattern?”

The ACID vs BASE tradeoff is at the heart of this. ACID (Atomicity, Consistency, Isolation, Durability) guarantees that every transaction either fully completes or fully rolls back. Your bank transfer will never leave you with money debited but not credited. BASE (Basically Available, Soft state, Eventually consistent) trades strict consistency for availability and partition tolerance. Your Instagram like count might show 1,003 on one server and 1,005 on another for a few milliseconds — and that's fine.
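The ACID half is easy to see in code. Below is a minimal sketch using Python's built-in sqlite3 module; the accounts table and amounts are invented for illustration. The point is that the failed transfer rolls back completely, so money is never debited without being credited.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates succeed or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # triggers the rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
    except ValueError:
        pass  # transaction rolled back; balances unchanged

transfer(conn, 1, 2, 150)  # too much: rolls back, nothing is half-done
balances = [row[0] for row in conn.execute("SELECT balance FROM accounts ORDER BY id")]
print(balances)  # [100, 0]
```

A BASE system would instead accept the write locally and reconcile replicas later, which is exactly why you don't run bank transfers on one.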

MongoDB started as a pure document store with no joins and no transactions. By 2024, it added multi-document ACID transactions, time series collections, and a SQL-like aggregation pipeline. PostgreSQL started as a relational database and added JSONB columns, full-text search, and time-series extensions. The two worlds are converging. Pick based on your primary access pattern, then extend as needed.

Schema Design — Modeling Your Data

Picking SQL vs NoSQL is step one. Step two is designing the schema inside your database — how tables relate, which columns to index, and how to avoid the partition key mistake that takes down your service at 2 AM.

Normalization vs Denormalization — The Fundamental Tradeoff

Data lives in one place. No duplication. Updates are safe. Reads require JOINs.

users
id | name
1  | Jane
2  | Alex

orders
id  | user_id | product_id | qty
101 | 1       | 501        | 2
102 | 1       | 502        | 1
103 | 2       | 501        | 3

products
id  | name   | price
501 | Widget | $9.99
502 | Gadget | $24.99
Query:
SELECT u.name, p.name, o.qty
FROM orders o
JOIN users u ON o.user_id = u.id
JOIN products p ON o.product_id = p.id
Read Speed: Slower — a 3-table JOIN is required
Write Speed: Fast — update a price once, everywhere sees it
Consistency: Always consistent — single source of truth

Banking systems, e-commerce inventory, anything where data integrity is non-negotiable. Stack Overflow uses normalized SQL Server — 2 instances serve 1.3B page views/month.

Indexing — Why Your Query Is Slow

The single-column index is the bread and butter — it covers roughly 80% of cases.

SELECT * FROM users WHERE email = 'jane@example.com'

B-tree lookup: ~3 node reads → direct row. ~0.1ms

How it works: a B-tree stores keys in sorted order with pointers between nodes. Finding 1 row in 10M takes about 23 key comparisons (log₂ 10M ≈ 23), but because each node packs hundreds of keys, that works out to only ~3 node reads.

B-tree handles 90% of indexing needs. PostgreSQL creates one automatically on every primary key.
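You can watch an index change the query plan with nothing but Python's sqlite3. The table and data below are invented for the demo, and the exact plan wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(10_000)])

def plan(sql):
    """Ask SQLite how it would execute the query."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'user42@example.com'"

before = plan(query)   # full table scan: touches all 10,000 rows
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # B-tree search: a handful of node reads

print(before)  # e.g. SCAN users
print(after)   # e.g. SEARCH users USING INDEX idx_users_email (email=?)
```

Same query, same data; the only thing that changed is the access path.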

The Hot Partition Problem — When One Shard Melts

Justin Bieber posts → 100M fans write likes to the same partition

Partition key: post_id

All likes for one post hash to the same partition. DynamoDB limits: 3,000 RCU and 1,000 WCU per partition. Celebrity posts blow past this instantly.

Partition load: P1 95%, P2 12%, P3 8%, P4 15%, P5 5%. One partition melts while the rest sit idle.
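A standard fix is write sharding, also called key salting: append a small suffix to the partition key so one hot post spreads across many partitions, and aggregate across the suffixes at read time. The sketch below is illustrative only; the names and the md5-based partitioner are invented stand-ins for the store's real hash function.

```python
import hashlib

NUM_SUFFIXES = 10   # spread each post's writes across 10 logical keys

def salted_key(post_id, user_id):
    """Write path: derive a suffix so one hot post maps to many partitions."""
    return f"{post_id}#{user_id % NUM_SUFFIXES}"

def all_keys(post_id):
    """Read path: a like-count query must aggregate across every suffix."""
    return [f"{post_id}#{i}" for i in range(NUM_SUFFIXES)]

def partition_for(key, partitions=5):
    """Stand-in for the store's hash partitioner."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % partitions

# 100,000 likes on one celebrity post no longer melt a single partition
hits = [0] * 5
for user_id in range(100_000):
    hits[partition_for(salted_key("bieber-post-1", user_id))] += 1
print(hits)  # load is spread instead of one partition taking 100%
```

The tradeoff is classic: writes scale out, but every read now fans out to NUM_SUFFIXES keys.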

The schema change that saved Instagram

In the early 2010s, Instagram faced a critical scalability challenge nicknamed the “Justin Bieber Problem.” Celebrity posts garnered millions of likes within moments, and every like triggered a COUNT(*) query across a normalized likes table. The database couldn't keep up — queries that should take milliseconds were taking seconds, slowing the platform for everyone.

The fix was elegant: denormalize. They added a LikeCount column directly to the Posts table and incremented it atomically on each like. Reads became 100x faster — one row lookup instead of a cross-table aggregation. The tradeoff? Writes became slightly more complex (update two tables instead of one), and there's a brief window where the count might be off by one. For a social media platform, that's an acceptable trade.

This is the schema design lesson: normalize for correctness, denormalize for performance. Most real systems use both — normalized tables for the source of truth, denormalized views or caches for read-heavy access patterns. Uber uses composite indexes on (trip_id, timestamp) to serve their most common query. Stack Overflow's 2 SQL Server instances handle 1.3 billion page views because they index aggressively.
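The pattern can be sketched in a few lines of SQL driven from Python's sqlite3 (schema names invented): the likes table stays the source of truth, while a denormalized counter on posts turns the hot read into a one-row lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, like_count INTEGER DEFAULT 0);
    CREATE TABLE likes (post_id INTEGER, user_id INTEGER,
                        PRIMARY KEY (post_id, user_id));
    INSERT INTO posts (id) VALUES (1);
""")

def add_like(post_id, user_id):
    """Write both the source of truth and the denormalized counter atomically."""
    with conn:
        conn.execute("INSERT INTO likes VALUES (?, ?)", (post_id, user_id))
        conn.execute("UPDATE posts SET like_count = like_count + 1 WHERE id = ?",
                     (post_id,))

for user in range(1000):
    add_like(1, user)

# Read path: one-row lookup instead of COUNT(*) over the likes table
fast = conn.execute("SELECT like_count FROM posts WHERE id = 1").fetchone()[0]
slow = conn.execute("SELECT COUNT(*) FROM likes WHERE post_id = 1").fetchone()[0]
print(fast, slow)  # 1000 1000: same answer, very different cost at scale
```

Keeping both writes in one transaction is what keeps the counter honest; drop that and the "off by one" window grows.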

Database Scaling — When One Machine Isn't Enough

Traffic keeps growing. Your single database is sweating. Four strategies to scale, in the order you should try them.

Get a bigger machine. More CPU, more RAM, more disk. Same single server.

Benefit:

No code changes. No distributed systems complexity. Just throw money at hardware.

Limitation:

There's a ceiling. The biggest AWS instance (u-24tb1.metal) has 24TB RAM, costs $218/hour. Even that has limits.

Stack Overflow runs on 2 SQL Servers. They squeezed everything from vertical scaling before touching horizontal. 1.3B page views/month.

Stack Overflow doesn't shard — and neither should you (yet)

Stack Overflow serves 1.3 billion page views per month from 9 web servers and 2 SQL Server instances. No sharding. No NoSQL. Just aggressive indexing, smart caching, and a well-designed schema. They estimate that a single modern database server can handle 10,000+ queries per second with proper indexes.

Sharding introduces real operational pain. Cross-shard queries become expensive joins across the network. Schema changes must be coordinated across every shard. Rebalancing data when you add a shard can take days. Discord learned this the hard way — their Cassandra cluster had billions of messages, but certain channels (like popular gaming communities) created hot partitions. They eventually migrated to ScyllaDB, dropping p99 latency from 200ms to 5ms.

The scaling ladder is deliberate: start with better hardware (vertical scaling is underrated — a single AWS r7g.16xlarge has 512GB RAM and 64 cores). Then add read replicas to offload reads. Add caching to reduce DB load by 99%. Only when you've exhausted these options should you consider sharding. Most applications will never reach that point.
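The "read replicas before sharding" rung usually means a small routing layer in the application: writes go to the primary, reads round-robin over replicas. A toy sketch, where the node objects are just lists recording which statements they served:

```python
import itertools

class ReplicaRouter:
    """Toy router: writes hit the primary, reads round-robin over replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = itertools.cycle(replicas)

    def execute(self, sql):
        if sql.lstrip().upper().startswith("SELECT"):
            node = next(self.replicas)   # read: any replica will do
        else:
            node = self.primary          # write: only the primary accepts it
        node.append(sql)                 # stand-in for node.execute(sql)
        return node

primary, r1, r2 = [], [], []
router = ReplicaRouter(primary, [r1, r2])
router.execute("INSERT INTO users VALUES (1, 'Jane')")
for _ in range(4):
    router.execute("SELECT * FROM users")
print(len(primary), len(r1), len(r2))  # 1 2 2
```

Real routers (ProxySQL, pgpool, or an ORM's replica support) also handle replica lag, which this sketch ignores.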

Caching — The Speed Multiplier

The fastest database query is the one you never make. Caching stores hot data in memory — 100x faster than disk. Four strategies, each with tradeoffs.

App checks cache
Miss → App reads DB
App writes to cache
Next request → Hit!

Most common pattern. App manages cache explicitly. Data loaded into cache on first access.

Pro:

Only caches what's actually requested. Resilient — if cache dies, app falls back to DB.

Con:

First request for every item is slow (cold start). Stale data if DB changes without cache invalidation.

Instagram uses cache-aside. First load of a popular post is slow. Every subsequent load is instant from cache.
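Cache-aside fits in a dozen lines, with plain dicts standing in for Redis and the database (the key and TTL here are invented):

```python
import time

db = {"post:1": "First post!"}   # stand-in for the database
cache = {}                       # stand-in for Redis
stats = {"hits": 0, "misses": 0}

def get_post(key, ttl=60):
    """Cache-aside: check cache, fall back to DB, populate cache on the way out."""
    entry = cache.get(key)
    if entry and entry["expires"] > time.monotonic():
        stats["hits"] += 1
        return entry["value"]
    stats["misses"] += 1
    value = db[key]   # the slow path
    cache[key] = {"value": value, "expires": time.monotonic() + ttl}
    return value

get_post("post:1")   # miss: first read is slow
get_post("post:1")   # hit: served from memory
print(stats)  # {'hits': 1, 'misses': 1}
```

Note the failure mode baked into the pattern: if the cache dies, get_post still works, just slower.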


Eviction Policies — Cache Full, What Gets Kicked?

Least Recently Used

Evict what hasn't been touched in the longest time. Recent access = stay.

Cleaning your desk — remove what you haven't used lately.

The most popular choice. Redis ships an approximate LRU eviction mode (allkeys-lru, enabled via maxmemory-policy), and Memcached uses LRU by default.
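LRU itself fits in a few lines. A minimal sketch on top of Python's OrderedDict; production caches use an approximation of this to avoid the per-access bookkeeping:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU: an OrderedDict keeps keys in access order;
    the least recently used key sits at the front and is evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # touched: now most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the LRU entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")        # touch "a", so "b" is now least recently used
cache.put("c", 3)     # over capacity: "b" is evicted
print(list(cache.items))  # ['a', 'c']
```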

The Thundering Herd Problem

A popular cache key expires. Thousands of requests arrive simultaneously. All see "cache miss" and all slam the database at once. Pinterest saw this crash their DB when viral pins expired.
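The usual fix is a stampede lock: the first request that misses rebuilds the entry while everyone else waits briefly. A threaded sketch, with the slow query and the key simulated:

```python
import threading
import time

cache = {}
lock = threading.Lock()
db_queries = {"count": 0}

def expensive_db_read(key):
    db_queries["count"] += 1
    time.sleep(0.05)   # simulate a slow query
    return f"value-for-{key}"

def get_with_stampede_lock(key):
    """Only the first miss rebuilds the cache; everyone else waits for it."""
    if key in cache:
        return cache[key]
    with lock:              # one rebuilder at a time
        if key in cache:    # double-check: someone may have filled it already
            return cache[key]
        cache[key] = expensive_db_read(key)
        return cache[key]

threads = [threading.Thread(target=get_with_stampede_lock, args=("viral-pin",))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(db_queries["count"])  # 1: a hundred concurrent misses, one DB query
```

A single global lock is the simplest version; per-key locks or a short "rebuilding" marker in the cache scale better.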

The two hardest problems in caching

Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. He wasn't joking about the first one. When the underlying data changes, every cached copy must be updated or deleted. Miss one, and users see stale data. Facebook maintains 75 billion cache items across their Memcached fleet — invalidating the right entries when a user updates their profile picture is a distributed coordination problem.

Twitter builds your entire home timeline in Redis before you open the app. When someone you follow tweets, that tweet is written to the timelines of all their followers in Redis — a process called “fan-out on write.” For Lady Gaga with 84 million followers, one tweet triggers 84 million cache writes. This is why Twitter's Redis deployment is one of the largest in the world.

The speed gap between cache and database is staggering. Redis responds in microseconds (0.001ms). PostgreSQL responds in milliseconds (1-10ms). That's a 1,000-10,000x difference. Netflix's EVCache handles 30 million reads per second. Without it, they'd need 100x more database capacity — hundreds of thousands of database instances instead of a few thousand.

GitHub rebuilt search from scratch — in Rust

For years, GitHub used Elasticsearch for code search. It worked, but searching across 200 million repositories with complex regex patterns was pushing Elasticsearch beyond its design. In 2023, they shipped Blackbird — a custom search engine written in Rust that indexes 115TB of code and returns results in under 100ms.

The architecture is clever: instead of indexing every file in full, Blackbird builds a content-addressable index where identical files across repositories share a single index entry. Since open-source projects are frequently forked, this deduplication dramatically reduces index size. The result? You can regex search across all of GitHub's public code faster than you can grep your own laptop.

Amazon took a different path. Their A9 product search engine evolved into OpenSearch — an open-source fork of Elasticsearch that now powers search for 350 million+ products. The key innovation: semantic search using vector embeddings alongside traditional inverted indexes. Type “comfy chair for reading” and it understands intent, not just keywords. Elasticsearch itself has since added native vector search for AI-powered retrieval-augmented generation (RAG) pipelines — the search world is converging on hybrid approaches.

Message Queues — Decoupling Everything

Your user clicks "Sign Up." Do you make them wait while you send an email, update analytics, and notify Slack? Or do you save to DB, say "done," and handle the rest in the background?

Save to DB: 50ms
Send welcome email: 200ms
Update analytics: 100ms
Notify admin Slack: 300ms
Generate recommendations: 500ms

Done synchronously, the user waits 1,150ms. Save to DB, queue the rest, and they wait 50ms.
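The asynchronous version of that signup can be sketched with Python's standard queue module and a background worker thread. Task names are invented, and a real system would use SQS, RabbitMQ, or Kafka rather than an in-process queue:

```python
import queue
import threading
import time

tasks = queue.Queue()
completed = []

def worker():
    """Background worker: drains the queue at its own pace."""
    while True:
        task = tasks.get()
        if task is None:
            break
        time.sleep(0.01)   # simulate email/analytics/Slack work
        completed.append(task)
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

def sign_up(user):
    """Fast path: save to DB, enqueue everything else, return immediately."""
    # save_to_db(user) would go here (~50ms)
    for task in ("welcome_email", "analytics", "slack_notify", "recommendations"):
        tasks.put((user, task))
    return "done"   # the user never waits for the background work

print(sign_up("jane"))   # done
tasks.join()             # demo only: wait for the worker to finish
print(len(completed))    # 4
```

If the worker crashes, nothing is lost as long as the queue is durable; that durability is exactly what the hosted brokers sell.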

Queue vs Pub/Sub

Each message consumed by exactly ONE consumer. Once processed, it's gone.

Work distribution. Process this payment. Encode this video. Each job done once.

AWS SQS at Amazon. Each message processed by exactly one worker. Used for order processing, image resizing.

Real Example: YouTube Upload Pipeline

When a creator uploads a video, one Kafka event triggers 5 parallel services. The creator doesn't wait for encoding to finish — the queue handles it.

1. Upload Service receives the video
2. Event pushed to Kafka
3. Encoding Service encodes to 240p, 480p, 720p, 1080p, 4K
4. Thumbnail Service generates 3 thumbnail options
5. Copyright Service scans for copyrighted content (Content ID)
6. Notification Service notifies subscribers

How Kafka became the backbone of the modern internet

Kafka was built at LinkedIn in 2011 to solve a specific problem: they had hundreds of services that all needed to share data, and direct service-to-service communication was creating an unmaintainable web of dependencies. Jay Kreps designed Kafka as an append-only log — messages are written sequentially to disk, which is 100x faster than random writes.

Today, LinkedIn's Kafka deployment processes 7 trillion messages per day. Netflix pushes 700 billion events per day. When you press “play” on Netflix, one event triggers 8+ downstream services: recommendations update, viewing history records, bandwidth monitoring starts, A/B test metrics collect, content delivery optimizes, and billing tracks your usage — all from one Kafka event.

The key insight is that queues turn synchronous operations into asynchronous ones. Shopify's checkout during Black Friday handles 100x normal traffic by queueing order processing. The checkout is fast (payment captured, queue message sent). Inventory updates, email confirmations, and warehouse notifications all process from the queue at their own pace. If the email service goes down, no customer sees an error — the emails just send when it recovers.
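The append-only log that makes this work is conceptually tiny. Here is a toy sketch of a Kafka-style partition in which the broker stores nothing per consumer and each consumer tracks its own offset:

```python
class Log:
    """Toy Kafka partition: an append-only list of messages."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1   # the message's offset

    def read(self, offset, max_count=10):
        return self.messages[offset:offset + max_count]

class Consumer:
    """Each consumer remembers its own position, so the broker tracks nothing."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read(self.offset)
        self.offset += len(batch)
        return batch

log = Log()
for event in ("signup", "play", "pause"):
    log.append(event)

billing, analytics = Consumer(log), Consumer(log)
print(billing.poll())     # ['signup', 'play', 'pause']
log.append("stop")
print(billing.poll())     # ['stop'], resuming from its own offset
print(analytics.poll())   # ['signup', 'play', 'pause', 'stop'], an independent replay
```

Because messages are never deleted on read, any number of consumers can replay the same history at their own pace.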

Replication — Copies Keep You Alive

One database machine will eventually die. Replication copies your data to multiple machines. The question: do you wait for the copy to confirm, or trust it will get there eventually?

Write to Primary → Primary waits for ALL replicas to confirm → then tells client 'done'. Every replica has every write.

Speed:

Slow — 3 network round trips before confirming to client.

Safety:

Zero data loss. If primary crashes, replicas have everything.

Tradeoff:

Follower dies? Primary blocks. Writes halt until follower recovers or is removed.

Banks use synchronous replication. Wrong balance = lawsuit. Zero data loss is non-negotiable.

Primary (reads + writes) replicates to Follower 1 (reads only) and Follower 2 (reads only).
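The sync-versus-async choice can be sketched in a few lines (the classes and names are invented for illustration). Synchronous replication blocks when a follower is down; asynchronous replication confirms immediately and accepts that a lagging replica may lose recent writes:

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.online = True

    def apply(self, key, value):
        if not self.online:
            raise ConnectionError("replica down")
        self.data[key] = value

def write_sync(primary, replicas, key, value):
    """Synchronous: confirm only after EVERY replica has the write."""
    primary[key] = value
    for r in replicas:
        r.apply(key, value)   # blocks (and fails) if any replica is down
    return "committed"

def write_async(primary, replicas, key, value):
    """Asynchronous: confirm immediately; replicas catch up eventually."""
    primary[key] = value
    lagging = [r for r in replicas if not r.online]
    for r in replicas:
        if r.online:
            r.apply(key, value)
    return "committed", lagging   # the caller never waits on a dead replica

primary, r1, r2 = {}, Replica(), Replica()
r2.online = False   # one follower just died

try:
    write_sync(primary, [r1, r2], "balance:42", 100)
except ConnectionError:
    print("sync write blocked")   # writes halt until the follower recovers

status, lagging = write_async(primary, [r1, r2], "balance:42", 100)
print(status, len(lagging))       # committed 1, fast but r2 may lose this write
```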

The Stale Read Problem

With async replication, a client might read from a replica before it gets the latest write. For 10-50ms, different replicas return different values.

Automatic Failover

When the primary dies, a follower gets promoted. With sync replication, zero data loss. With async, you might lose the last few seconds of writes.

The CAP theorem in practice

Eric Brewer's CAP theorem states that during a network partition, a distributed system must choose between consistency (every read returns the latest write) and availability (every request gets a response). You can't have both. Banks choose consistency — a transfer must be correct even if the system is briefly unavailable. Amazon chooses availability — you can always add to your cart, even if it means occasional inventory inconsistencies.

In practice, network partitions are rare but devastating. During Amazon's 2017 S3 outage, a single typo in a command took down services across the internet for 4 hours — Slack, Trello, IFTTT, and thousands of other services that depended on S3. This is why multi-region replication exists: Airbnb replicates MySQL across availability zones so that if an entire datacenter goes down, a replica in another zone takes over in seconds.

The numbers tell the story: 99.9% uptime means 8.7 hours of downtime per year. 99.99% means 52 minutes. 99.999% (“five nines”) means 5.2 minutes. Getting from four nines to five nines is exponentially harder and more expensive — it requires synchronous replication, automatic failover, and infrastructure redundancy across multiple geographic regions.
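The nines arithmetic is worth doing once yourself:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes(availability):
    """Minutes of downtime per year allowed at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

for label, a in [("99.9%", 0.999), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    print(f"{label} uptime allows {downtime_minutes(a):.1f} min of downtime/year")
# 99.9% is ~526 min (8.7 hours); 99.99% is ~53 min; 99.999% is ~5.3 min
```

Each extra nine cuts the budget by 10x, which is why each one costs so much more than the last.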

Object Storage — How the Cloud Stores Everything

The internet runs on object storage. S3 alone holds over 400 trillion objects. Netflix stores 19 petabytes of video. Dropbox saved $75M by building their own. Understanding block vs file vs object storage — and how erasure coding achieves 11 nines of durability — is essential for any system design that touches storage at scale.

Block vs File vs Object — Pick the Right One

Object Storage (Cloud)

Flat namespace — no directories, just keys and values. Each object includes data + metadata + a globally unique key. Built for the web: HTTP API, infinite scale, pay-per-GB.

Access Pattern: PUT/GET by key over HTTP. No partial updates — replace the entire object.

Latency: 10–100ms to first byte (S3); <10ms when cached via CDN.

Max Size: 5 TB per object (S3); unlimited total storage.

Examples: AWS S3, Azure Blob, Google Cloud Storage, MinIO.

Best For: Images, videos, backups, data lakes, static websites, ML datasets.

Not For: Databases, frequently updated files, low-latency random reads.

                Block Storage       File Storage        Object Storage
Structure       Fixed-size blocks   Hierarchical dirs   Flat key-value
Protocol        iSCSI / NVMe        NFS / SMB           HTTP REST API
Shared Access   Single server       Multiple servers    Global (internet)
Scalability     TB scale            PB scale            Unlimited
Cost/GB         $$$                 $$                  $
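The object model itself is tiny. This toy store (an invented API, loosely shaped like S3's PUT/GET/LIST) shows the two properties that matter: a flat namespace where "folders" are just key prefixes, and whole-object writes:

```python
import hashlib

class ObjectStore:
    """Toy object store: a flat namespace of key -> (bytes, metadata).

    Like S3, there are no directories and no partial updates: a PUT
    replaces the whole object."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data, **metadata):
        metadata.setdefault("etag", hashlib.md5(data).hexdigest())
        self.objects[key] = (data, metadata)

    def get(self, key):
        return self.objects[key]

    def list(self, prefix=""):
        return sorted(k for k in self.objects if k.startswith(prefix))

store = ObjectStore()
store.put("images/cat.jpg", b"\xff\xd8...", content_type="image/jpeg")
store.put("images/dog.jpg", b"\xff\xd8...", content_type="image/jpeg")
store.put("backups/db.dump", b"...")
print(store.list("images/"))   # ['images/cat.jpg', 'images/dog.jpg']
```

The listing by prefix is what makes object stores feel like filesystems without actually being one.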

S3 — The Object Store That Runs the Internet

Launched in 2006 with a few thousand objects. Today it's the backbone of the internet — Netflix, Airbnb, NASA, and the majority of the web's static assets sit in S3.

400+ trillion objects stored: up from 100 trillion in 2021.

100M+ requests per second: S3's peak request rate.

99.999999999% durability: 11 nines — store 10 million objects and expect to lose one every 10,000 years.

99.99% availability: 52 minutes of downtime per year for S3 Standard.

How S3 Achieves 11 Nines — Erasure Coding vs Replication

Split data into fragments, add parity fragments using Reed-Solomon math. Reconstruct from any subset. 4x less storage than replication for better durability.

How It Works

A 6+3 scheme splits data into 6 chunks and creates 3 parity chunks. Store all 9 across different drives. Any 6 of 9 can reconstruct the original. Lose up to 3 and nothing is lost.

Storage Overhead: ~50% (e.g., RS 6+3)

Durability: 99.999999999% (11 nines)

Used By: AWS S3, Azure Blob Storage, Google Cloud Storage, MinIO, Ceph

Replication: ~99.9999% durability (6 nines)
Erasure Coding: 99.999999999% durability (11 nines)

Erasure coding: better durability at less than half the storage cost
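Full Reed-Solomon needs finite-field math, but the single-parity special case (one XOR parity chunk, RAID-style) shows the principle in a few lines. This is a deliberate simplification, not what S3 actually runs:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k=4):
    """Split data into k chunks and add ONE XOR parity chunk.

    Real erasure coding (Reed-Solomon) adds m parity chunks and survives
    m losses; XOR parity is the m=1 case and shows the core idea."""
    size = -(-len(data) // k)   # ceil division
    chunks = [data[i*size:(i+1)*size].ljust(size, b"\0") for i in range(k)]
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return chunks, parity

def recover(chunks, parity, lost_index):
    """Rebuild one lost chunk by XOR-ing the parity with the survivors."""
    rebuilt = parity
    for i, c in enumerate(chunks):
        if i != lost_index:
            rebuilt = xor_bytes(rebuilt, c)
    return rebuilt

chunks, parity = encode(b"object storage is cheap!", k=4)
lost = chunks[2]
restored = recover(chunks, parity, lost_index=2)
print(restored == lost)  # True: one dead drive, zero data loss
```

Overhead here is 25% (4 data + 1 parity) for one-failure tolerance; RS 6+3 pays 50% and tolerates three.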

Content-Addressable Storage — Deduplicate Everything

Instead of naming files, hash their content. The hash is the address. Same content = same hash = stored once. This is how Git, Docker, and IPFS achieve massive deduplication.

Every commit, tree, and blob is stored by its SHA-1 hash. Two identical files = one object. This is why git repos are small despite tracking every version.

How: git hash-object file.txt → SHA-1 hash → stored in .git/objects/
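Git's object address is reproducible in a few lines: it's the SHA-1 of a small header plus the file's bytes. Anything with the same bytes gets the same address, which is the whole deduplication trick:

```python
import hashlib

def git_blob_hash(content):
    """Compute the same SHA-1 Git uses for a blob:
    sha1(b"blob <size>\\0" + content)."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo 'hello' | git hash-object --stdin`
print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a

# Two identical files, one stored object: the basis of Git/Docker/IPFS dedup
assert git_blob_hash(b"same bytes") == git_blob_hash(b"same bytes")
```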

Every storage decision has consequences

Block for databases. File for shared access. Object for everything on the web. Erasure coding for durability without bankruptcy. These choices compound — pick the wrong storage type and you're either overpaying 4x (replication where erasure coding works) or fighting latency you can't fix (object storage for a database workload).

You've now covered the full data layer: how to choose a database, design schemas, scale reads and writes, cache intelligently, search at speed, queue work asynchronously, replicate for safety, and store at scale. Next up — how distributed systems survive when things go wrong.

When should I use SQL vs NoSQL?
Use SQL (PostgreSQL, MySQL) when you need complex joins, ACID transactions, or strict data integrity — billing, user accounts, inventory. Use NoSQL (MongoDB, Cassandra, DynamoDB) when you need flexible schemas, massive write throughput, or horizontal scaling — IoT data, user activity logs, content catalogs. Most companies use both.
What's the difference between Redis and Memcached?
Redis supports rich data structures (lists, sets, sorted sets, hashes), persistence, replication, and pub/sub. Memcached is simpler — just key-value with no persistence. Redis is the default choice for most use cases. Memcached is preferred when you need maximum simplicity and multi-threaded performance for pure caching.
When should I shard my database?
Only when vertical scaling and read replicas aren't enough. Most apps never need sharding. Stack Overflow serves 1.3 billion page views/month without it. Shard when: your data doesn't fit on one machine, or write throughput exceeds what a single primary can handle. Start with indexing, caching, and read replicas first.
What happens when a cache expires and thousands of requests hit the DB?
This is called a 'thundering herd' or 'cache stampede.' The fix is a stampede lock: the first request that finds the cache empty acquires a lock, rebuilds the cache, and all other requests wait briefly for the cache to repopulate. Pinterest implemented this after viral pins repeatedly crashed their database.
Why use a message queue instead of direct API calls?
Decoupling, resilience, and speed. If your email service is down, queued emails will be sent when it recovers — no user sees an error. Queues absorb traffic spikes (Black Friday), enable parallel processing, and keep response times fast by deferring non-critical work. The tradeoff: eventual processing instead of immediate.
What's eventual consistency and when is it OK?
Eventual consistency means all replicas will converge to the same value, but there's a brief window where different replicas might return different values. It's OK for: social media likes, view counts, 'continue watching' positions, search results, recommendations. It's NOT OK for: bank balances, inventory counts, anything involving money or safety.
How does Kafka handle 7 trillion messages per day?
Kafka writes messages sequentially to disk (append-only log), which is 100x faster than random writes. It partitions topics across brokers for parallelism, batches messages for network efficiency, and uses zero-copy transfers from disk to network. Consumers track their own position in the log, so Kafka doesn't need to track message delivery state.
What's the difference between replication and sharding?
Replication copies ALL data to multiple machines — every replica has everything. It improves read throughput and availability. Sharding splits data across machines — each shard has a SUBSET. It improves write throughput and total storage. Most large systems use both: sharded for writes, with replicas of each shard for reads.
Why does one tweet live in 7 places?
Each copy serves a different purpose optimized for a specific access pattern: primary DB for durability, replica for read throughput, cache for speed, CDN for media delivery, search index for full-text search, message queue for async fan-out, analytics pipeline for metrics. It's the same data, stored in 7 different formats optimized for 7 different use cases.
How do I choose between Kafka, RabbitMQ, and SQS?
Kafka: high-throughput event streaming, replay capability, ordered processing. Best for event sourcing and real-time analytics. RabbitMQ: traditional message broker with flexible routing, good for task queues and RPC patterns. SQS: fully managed, zero ops, auto-scales. Best for AWS-centric apps that want simplicity over features.

What's Next?

You've explored how data is stored, cached, queued, replicated, and scaled. Next, dive into how distributed systems survive chaos — circuit breakers, rate limiters, consensus algorithms, and the patterns that keep services running when everything goes wrong.

Sources

  • Stack Overflow Architecture — Nick Craver (nickcraver.com)
  • Discord Engineering Blog — Cassandra to ScyllaDB migration
  • Netflix Tech Blog — EVCache architecture
  • LinkedIn Engineering — Kafka at scale
  • Twitter Engineering — Timeline caching and fan-out
  • AWS Architecture Blog — Database scaling patterns
  • Redis Documentation — Performance benchmarks
  • PostgreSQL Documentation — Streaming replication
  • MongoDB Documentation — Replica sets and sharding
  • ByteByteGo — System Design resources
