
System Design — Page 6

Monitoring & Observability

Metrics, logs, traces, golden signals, SLOs, alerting, and incident response — how engineering teams detect, diagnose, and resolve production issues in minutes, not hours.


Building a system is the first half. Keeping it running is the second — and it's harder. In 2025, AWS went down for 15 hours due to a DNS race condition. Google Cloud was out for 7 hours because of a null pointer. Cloudflare took down ChatGPT, Claude, and X for 2 hours with a database permission change. Every one of these was detected by monitoring, diagnosed through logs and traces, and resolved through incident response.

This page covers the seven monitoring concepts that come up in every system design interview: the three pillars of observability, Google's golden signals, SLOs and error budgets, distributed tracing, alerting, log aggregation, and real incident response. Every diagram uses real data from production systems.

If you haven't seen Distributed Systems yet, start there — fault tolerance and circuit breakers are the patterns that monitoring detects and alerting acts on. And Security & Authentication covers the attacks that monitoring helps you catch.

The Three Pillars — Metrics, Logs, Traces

Every observability system is built on three types of telemetry data. Metrics tell you WHAT is wrong. Logs tell you WHY. Traces tell you WHERE. None is sufficient alone — you need all three to debug a distributed system.

Metrics — What is happening?

Numerical measurements collected at regular intervals. CPU usage, request count, error rate, latency percentiles. Cheap to store, fast to query, ideal for dashboards and alerts. But they tell you WHAT is wrong, not WHY.

Data Shape
Time series: { timestamp, metric_name, value, labels }
Example: { t: 14:03:22, metric: "http_requests_total", value: 14523, method: "GET", status: "500" }
API error rate spikes to 15%
Dashboard shows http_error_rate jumped from 0.1% to 15% at 14:03.
You know something broke. You don't know why.
Metrics answer: WHAT is wrong (error rate is high).
Next step: check logs to find WHY.
Tools: Prometheus, Datadog, CloudWatch, InfluxDB
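To make that data shape concrete, here's a stdlib-only sketch of the labeled counter behind a metric like http_requests_total. Real client libraries (e.g. prometheus_client) add registries, exposition formats, and thread safety on top of this core idea:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Counter:
    """Minimal labeled counter: one monotonically increasing value
    per unique label combination (each combination is its own series)."""
    name: str
    series: dict = field(default_factory=lambda: defaultdict(float))

    def inc(self, amount: float = 1.0, **labels):
        key = tuple(sorted(labels.items()))
        self.series[key] += amount

requests = Counter("http_requests_total")
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="500")
requests.inc(method="GET", status="500")

# Query one label combination's current value
key = (("method", "GET"), ("status", "500"))
print(requests.series[key])  # 2.0
```

A scrape endpoint would then serialize every (name, labels, value) triple at a regular interval, producing exactly the time-series shape shown above.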

Same Incident, Three Lenses

A payment database goes down at 14:03. Here's what each pillar reveals.

Metrics @ 14:03
http_error_rate: 0.1% → 15%
http_latency_p99: 45ms → 2,300ms
payment_success_rate: 99.8% → 0%
postgres_connections_active: 50 → 0

Metrics detected the problem in seconds. Alerts fired. But WHY did all payments fail?

Three types of data, one source of truth

The three pillars are useless in isolation. Metrics without logs is “something is wrong but I don't know why.” Logs without traces is “I found the error but I don't know which service caused it.” The magic happens when you correlate: a metric alert leads to a log search, which reveals a trace_id, which shows the full request path. OpenTelemetry unifies all three with a single trace_id — that's why 89% of production users demand it.
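That correlation chain can be shown as a toy example. The data below is invented for illustration: an alert narrows the time window, the ERROR log in that window carries a trace_id, and the trace_id yields the full request path:

```python
# Invented sample data standing in for a log store and a trace store.
logs = [
    {"level": "INFO",  "event": "checkout_ok",    "trace_id": "def456"},
    {"level": "ERROR", "event": "payment_failed", "trace_id": "abc123"},
]
traces = {
    "abc123": ["api-gateway", "payment-service", "postgres"],
    "def456": ["api-gateway", "inventory-service"],
}

# Metric alert fired -> search logs for errors -> follow the trace_id
error = next(log for log in logs if log["level"] == "ERROR")
print(traces[error["trace_id"]])  # ['api-gateway', 'payment-service', 'postgres']
```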

But even with perfect telemetry, you need to know what to watch. Google distilled decades of SRE experience into four numbers.

The Four Golden Signals

Google's SRE book defines four signals that every service should monitor. If you only have four dashboards, make them these. They've been the industry standard since 2014 because they catch 90% of production issues.

Latency

How long it takes to serve a request. Track both successful AND failed requests — a fast 500 error is still a problem, and a slow error wastes user time.

What to Measure

p50 (median), p95, p99 latency. The p99 matters most — it's what your worst 1% of users experience. If p99 = 3s but p50 = 50ms, 1% of users are having a terrible time.

Danger Zone

p99 > 1s for API endpoints. p99 > 3s for page loads.

Real Example

Stripe targets p99 < 200ms for payment APIs. At 3s+ latency, users assume the page is broken and retry — causing duplicate charges.
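Why the median hides the tail is easy to see in code. Here's a minimal nearest-rank percentile sketch with synthetic numbers (production systems compute percentiles from pre-aggregated histograms such as Prometheus buckets, not raw samples):

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value at or above p% of samples."""
    ranked = sorted(samples)
    idx = min(len(ranked) - 1, int(p / 100 * len(ranked)))
    return ranked[idx]

# Synthetic traffic: 99% fast requests, 1% stuck behind a slow dependency
latencies = [50.0] * 990 + [3000.0] * 10

print(percentile(latencies, 50))  # 50.0 (median looks healthy)
print(percentile(latencies, 99))  # 3000.0 (the tail tells the real story)
```

A dashboard showing only the average or median here would read "all fine" while 1% of users wait three seconds.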

Signals are meaningless without targets

“Latency is 200ms” — is that good or bad? It depends on your target. A 200ms p99 is excellent for a payment API and terrible for a CDN edge response. You need a number that defines “good enough” — a Service Level Objective. And you need a way to measure it — a Service Level Indicator. Get these wrong and you either over-invest in reliability nobody needs, or under-invest until users leave.

SLOs, SLIs, SLAs — Defining “Reliable Enough”

99.9% uptime sounds great until you realize it means 43 minutes of downtime per month. 99.99% means 4.3 minutes. The difference between three nines and four nines is a 10x engineering investment — and every system must decide how many nines it needs.

The Nines — How Much Downtime Can You Afford?

At 99.9% ("three nines"), your total allowed downtime is:

  • 8.77 hours per year
  • 43.8 minutes per month
  • 10.1 minutes per week

Typical use case: most SaaS products, APIs.

Error budget: 0.100% (43.8 minutes / month). That thin sliver is your entire error budget. Use it wisely.
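The downtime math is simple enough to sanity-check in a few lines, using an average month of roughly 43,830 minutes (which is where the 43.8-minute figure comes from):

```python
def downtime_minutes(availability_pct, period_minutes):
    """Error-budget arithmetic: allowed downtime per period at a target."""
    return period_minutes * (1 - availability_pct / 100)

MONTH = 365.25 * 24 * 60 / 12  # average month, ~43,830 minutes

for target in (99.0, 99.9, 99.99):
    budget = downtime_minutes(target, MONTH)
    print(f"{target}% -> {budget:.1f} min of downtime per month")
```

Each extra nine divides the budget by ten, which is why each one costs roughly an order of magnitude more engineering effort.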

SLI → SLO → SLA — The Chain

SLI — Service Level Indicator

A quantitative measurement of one aspect of service quality. The raw number. What you actually measure.

Percentage of HTTP requests that complete in < 200ms.
Measured: 99.3% of requests were under 200ms last month.

Audience: Engineers

SLI → SLO → SLA

SLIs feed SLOs. SLOs feed SLAs. Measure → Target → Contract.
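In code, the SLI is just a measured ratio and the SLO is the target it's compared against. A sketch with made-up numbers:

```python
def availability_sli(latencies_ms, threshold_ms=200):
    """SLI: fraction of requests completing under the latency threshold."""
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)

SLO = 0.999  # target: 99.9% of requests under 200ms

window = [120] * 997 + [2300] * 3  # 3 slow requests out of 1000
sli = availability_sli(window)

print(f"SLI = {sli:.3f}")  # SLI = 0.997
print("SLO met" if sli >= SLO else "SLO breached")
```

Here the measured SLI (99.7%) falls short of the 99.9% objective: the error budget is spent, which under an SLA might also mean customer credits.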

When your SLO breaches, you need to find the bottleneck — fast

Your error budget is burning. The golden signals show latency spiked. Logs reveal a timeout error. But which of your 15 microservices is the culprit? This is where distributed tracing saves you. A single trace shows the full request path — every service, every database call, every external API — with millisecond timing. The bottleneck lights up like a red flag.

Distributed Tracing — Following a Request Across Services

When a checkout request touches 5 services, which one is slow? Distributed tracing follows the request hop by hop, timing each span. OpenTelemetry is the standard — the second most active CNCF project after Kubernetes, with 89% of production users demanding vendor compliance.

Trace: POST /api/checkout — 412ms total

API Gateway — 412ms
    Auth Service — 12ms
    Payment Service — 345ms
        Stripe API — 280ms
    Inventory Service — 35ms
        PostgreSQL — 18ms

Bottleneck: Stripe API (280ms / 412ms = 68%)

The Stripe API call dominates total latency. Optimization options: cache Stripe responses for idempotent requests, use Stripe's async payment intents, or add a circuit breaker for Stripe timeouts.
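Here's a stripped-down sketch of what span instrumentation does. Real OpenTelemetry SDKs also propagate context across process boundaries (e.g. via the traceparent HTTP header) and export spans to a collector; the service names below are illustrative:

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # in a real system, exported to Jaeger / Tempo / Datadog

@contextmanager
def span(name, trace_id, parent=None):
    """Time a unit of work and record it against the shared trace_id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "parent": parent,
            "ms": (time.perf_counter() - start) * 1000,
        })

trace_id = uuid.uuid4().hex
with span("POST /api/checkout", trace_id):
    with span("auth-service", trace_id, parent="POST /api/checkout"):
        time.sleep(0.005)
    with span("payment-service", trace_id, parent="POST /api/checkout"):
        time.sleep(0.030)  # the slow hop

# The bottleneck is the slowest non-root span
children = [s for s in spans if s["parent"] is not None]
print(max(children, key=lambda s: s["ms"])["name"])  # payment-service
```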

Traces find the problem. Alerts tell you it exists.

You can have the best dashboards in the world, but if nobody is looking at them at 3 AM, they're useless. Alerting bridges the gap between data and action. The challenge: too few alerts and you miss problems; too many and your team ignores them all. PagerDuty reports that teams with mature alerting acknowledge incidents in under 5 minutes; teams without processes average 30+ minutes.

Alerting — When to Page, When to Ignore

The worst monitoring setup isn't one with no alerts — it's one with too many. Alert fatigue kills incident response. PagerDuty reports that mature teams have MTTA under 5 minutes; teams without processes average 30+ minutes. Every alert must be actionable, or it's noise.

Should This Alert Page Someone?

Alerting Anti-Patterns

The Problem

Too many alerts. Team ignores them. Mean time to acknowledge (MTTA) increases from 2 min to 30 min. Critical alerts get lost in noise.

The Fix

Audit every alert. If nobody acted on it in 30 days, delete it. Target < 5 actionable alerts per on-call shift.
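One concrete de-noising technique is requiring a condition to hold across a sustained window before paging, rather than firing on a single data point. This mirrors the `for:` clause in Prometheus alerting rules; the numbers below are illustrative:

```python
def should_page(error_rates, threshold=0.05, sustained_points=5):
    """Page only if the last N samples all breach the threshold,
    so a single bad scrape doesn't wake anyone up."""
    recent = error_rates[-sustained_points:]
    return len(recent) == sustained_points and all(r > threshold for r in recent)

blip      = [0.01, 0.01, 0.20, 0.01, 0.01, 0.01]  # one bad scrape
sustained = [0.01, 0.06, 0.08, 0.09, 0.12, 0.15]  # rising for 5 points

print(should_page(blip))       # False (no page)
print(should_page(sustained))  # True (page the on-call)
```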

Alerts fire. Now where do you look?

The alert says “payment error rate > 5%.” Your first move: check the logs. But when Netflix generates 1.3 petabytes of logs per day, you can't just SSH into a server and grep. You need a log aggregation system that can ingest, index, and search billions of events in seconds. The choice between ELK, Loki, Datadog, and Splunk is one of the most consequential infrastructure decisions a team makes.

Log Aggregation — Making Sense of a Billion Events

Netflix generates 1.3 petabytes of logs per day. You can't SSH into a server and grep anymore. Log aggregation tools collect, index, and search logs from thousands of services. The choice between ELK, Loki, Datadog, and Splunk comes down to cost, scale, and how much you want to manage.

ELK Stack (open-source)

Elasticsearch + Logstash + Kibana. The original log aggregation stack. Elasticsearch indexes logs for full-text search; Logstash ingests and transforms; Kibana visualizes.

Best For

Self-hosted, full control, complex search queries

Scale

Netflix, LinkedIn, and eBay run massive ELK deployments. Netflix indexes 1.3 PB/day.

Cost

Free (open source) but expensive to operate. Elasticsearch is memory-hungry — budget $0.50-2.00/GB/day for infrastructure.

Log Levels — When to Use Each

DEBUG

Detailed diagnostic info. Development only. Never in production (too noisy).

INFO

Normal operations. Request handled, user logged in, job completed.

WARN

Something unexpected but not broken. Retry succeeded, approaching limit, deprecated API used.

ERROR

Something failed. Request returned 500, database query failed, external API timeout.

FATAL

System is crashing. Out of memory, can't connect to database on startup, unrecoverable state.
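These levels map directly onto standard logging libraries. In Python's stdlib, for example, setting the threshold to WARNING in production suppresses DEBUG and INFO noise at the source:

```python
import logging

# Production threshold: WARNING and above are emitted, the rest dropped
logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")

logging.debug("cache key computed for user 12345")   # suppressed
logging.info("request handled in 45ms")              # suppressed
logging.warning("retry succeeded on attempt 2")      # emitted
logging.error("payment query failed: timeout")       # emitted
```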

Structured vs Unstructured Logs

Unstructured (Bad)
[2026-03-07 14:03:22] ERROR - Payment failed for user 12345, amount $49.99, reason: connection timeout to stripe

Good luck parsing this with regex at 1M logs/sec.
Structured (Good)
{
  "timestamp": "2026-03-07T14:03:22Z",
  "level": "ERROR",
  "service": "payment-api",
  "event": "payment_failed",
  "user_id": "12345",
  "amount": 49.99,
  "reason": "connection_timeout",
  "target": "stripe",
  "trace_id": "abc123"
}

Structured logs are JSON. Every field is queryable. “Show me all payments > $100 that failed due to Stripe timeouts in the last hour” — one query, instant results.
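Producing structured logs like this from application code is straightforward. Here's a minimal JSON formatter sketch for Python's stdlib logging, with field names following the example above:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "event": record.getMessage(),
        }
        # Structured fields attached via logger.*(..., extra={"fields": ...})
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("payment_failed",
             extra={"fields": {"user_id": "12345", "amount": 49.99,
                               "reason": "connection_timeout", "target": "stripe"}})
```

Each emitted line is a self-contained JSON object, ready for ingestion by ELK, Loki, or any other aggregator without regex parsing.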

Theory is great. Here's what it looks like when everything goes wrong.

On November 18, 2025, a single database permission change at Cloudflare caused the internet to break. ChatGPT, Claude, X, Shopify, and thousands of other services went down for approximately 2 hours. Let's walk through the incident step by step — from automated detection to blameless postmortem — to see how everything we've covered works in practice.

Real Incident — The Cloudflare Outage That Broke the Internet

On November 18, 2025, a single database permission change took down ChatGPT, Claude, X, Shopify, Indeed, and thousands more for ~2 hours. This is the anatomy of a real incident — from detection to postmortem — showing exactly how monitoring, alerting, and incident response work in practice.

November 18, 2025

Date

~2 hours

Duration

DB permission change

Root Cause

Hundreds of millions

Impact

Incident Timeline

11:20 UTC — Automated alerts fire

Cloudflare's monitoring detects significant failures delivering core network traffic. Error pages start appearing for end users. Internal dashboards show error rates spiking globally.


Continue the Series

Monitoring is the feedback loop that keeps everything else working. Next: learn how to present all of this in a 45-minute interview.

Frequently Asked Questions

What are the three pillars of observability?
Metrics (what is happening — numerical time series), Logs (why it happened — timestamped event records), and Traces (where the bottleneck is — request flow across services). Each answers a different question. Combined, they give you full visibility into distributed systems.
What are the four golden signals?
Latency (how long requests take), Traffic (how many requests you're getting), Errors (what percentage of requests fail), and Saturation (how full your resources are). Defined by Google's SRE book. If you only monitor four things, monitor these.
What is the difference between SLI, SLO, and SLA?
SLI (Service Level Indicator) is what you measure: '99.3% of requests completed in < 200ms.' SLO (Service Level Objective) is your target: '99.9% of requests must complete in < 200ms.' SLA (Service Level Agreement) is the contract: 'If we drop below 99.9%, customers get credits.' SLIs feed SLOs, SLOs feed SLAs.
What is an error budget and why does it matter?
If your SLO is 99.9% uptime, your error budget is 0.1% — about 43 minutes of downtime per month. When the budget is healthy, ship features fast. When it's depleted, freeze deployments and fix reliability. Error budgets turn the reliability-vs-velocity tradeoff into a measurable number.
What is distributed tracing?
Distributed tracing follows a single request as it flows through multiple microservices. Each service adds a 'span' with timing data. The result is a waterfall view showing exactly which service is slow or failing. Tools: Jaeger, Zipkin, Datadog APM, Grafana Tempo. Standard: OpenTelemetry.
What is OpenTelemetry and why should I use it?
OpenTelemetry (OTel) is a vendor-neutral standard for collecting metrics, logs, and traces. It's the second most active CNCF project after Kubernetes. Using OTel means you can switch between Datadog, Grafana, and Jaeger without re-instrumenting your code. 89% of production users demand OTel compliance from their vendors.
How do I avoid alert fatigue?
Three rules: (1) Every alert must be actionable — if nobody acts on it, delete it. (2) Every alert must have a runbook explaining what to do. (3) Implement alert grouping so one root cause doesn't generate 50 pages. Target < 5 actionable alerts per on-call shift. Audit alerts monthly.
ELK Stack vs Grafana Loki — which should I use?
ELK (Elasticsearch + Logstash + Kibana) gives you full-text search across all log content — powerful but expensive ($0.50-2/GB/day). Loki indexes only metadata (labels), making it 10-100x cheaper. Use ELK if you need complex search queries across log content. Use Loki if you're on Kubernetes and cost-sensitive.
How should I handle monitoring in a system design interview?
End every design with: 'Here's how I'd monitor it.' Cover the four golden signals (latency, traffic, errors, saturation), define one SLO (e.g., 'p99 latency < 200ms, 99.9% of the time'), add distributed tracing for the request path, and mention alerting with PagerDuty. This shows you think about production, not just architecture.
What happens during a real incident response?
Detection (automated alerts fire) → Triage (severity assigned, incident commander declared) → Investigation (check dashboards, logs, traces for root cause) → Mitigation (revert the change, fail over, scale up) → Recovery (confirm services are healthy) → Postmortem (blameless review: what happened, why, how to prevent it). The Cloudflare Nov 2025 outage followed this exact pattern — 2 minutes from alert to acknowledged.

Sources & References

  • Google — Site Reliability Engineering: How Google Runs Production Systems (sre.google/sre-book)
  • Cloudflare — November 18, 2025 Outage Post-Mortem (blog.cloudflare.com)
  • Grafana Labs — Observability Survey Report 2025 (grafana.com/observability-survey/2025)
  • CNCF — OpenTelemetry Adoption and Jaeger v2.0 (cncf.io/blog)
  • PagerDuty — From Alert to Resolution: Incident Response Automation (pagerduty.com)
  • Gartner — 70% Distributed Tracing Adoption Forecast for Microservices Organizations
  • AWS, Google Cloud, Azure — 2025 Outage Reports and Postmortems
  • Netflix Tech Blog — Log Pipeline Architecture (netflixtechblog.com)
