Attention Mechanism Explained for Beginners

The attention mechanism is the core of modern AI. Learn how it works with the math, multi-head attention, causal masking, and why it costs O(n²).

If you've heard the phrase "Attention Is All You Need," you've encountered the most important paper in modern AI. Published by Google researchers in 2017, it replaced the entire recurrent architecture with a single, elegant idea. But what does attention actually do?

The problem attention solves

Before attention, language models processed text sequentially — one word at a time, left to right. Recurrent neural networks (RNNs) and LSTMs passed a hidden state forward through each step, like a game of telephone. By the time the model reached the end of a long sentence, early information had degraded.

In the sentence "The cat that I saw yesterday at the park sat on the mat," the model needed to remember "cat" across seven intervening words before reaching "sat." RNNs struggled with this. Information faded. Gradients vanished.

Attention solves this by letting every word look at every other word, all at once. No sequential bottleneck. No information decay. Every token has direct access to every other token in the sequence.

How it works: Queries, Keys, and Values

For each token in a sentence, the model creates three vectors:

  1. Query (Q): "What am I looking for?"
  2. Key (K): "What do I contain?"
  3. Value (V): "What information should I pass along?"

These vectors are computed by multiplying the token's embedding by three learned weight matrices (W_Q, W_K, W_V). Each is typically 64 or 128 dimensions.
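This projection step can be sketched in a few lines of NumPy. The dimensions and the random weight matrices here are illustrative stand-ins; in a real model, W_Q, W_K, and W_V are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 512, 64   # embedding size and Q/K/V size (illustrative choices)
n_tokens = 6             # e.g. a six-token sentence

# Token embeddings: one row per token
X = rng.standard_normal((n_tokens, d_model))

# Learned projection matrices (random stand-ins here)
W_Q = rng.standard_normal((d_model, d_k)) * 0.02
W_K = rng.standard_normal((d_model, d_k)) * 0.02
W_V = rng.standard_normal((d_model, d_k)) * 0.02

# Every token gets its own query, key, and value vector
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)  # (6, 64) each
```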

The attention score between two tokens is the dot product of the Query of one and the Key of the other. High dot product = high relevance. The formula:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

That √d_k term (the square root of the key dimension) is crucial. Without it, dot products grow large as dimensionality increases, pushing softmax into regions where gradients nearly vanish. The scaling keeps values in a numerically stable range — a small detail that makes training actually work.
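The whole formula fits in a short function. This is a minimal NumPy sketch of scaled dot-product attention for a single head, with random inputs standing in for real projected vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) relevance scores, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of the value vectors

rng = np.random.default_rng(0)
n, d_k = 5, 64
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (5, 64): one context-mixed vector per token
```

Each output row is a blend of all the value vectors, weighted by how strongly that token's query matched every key.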

A concrete example

Consider: "The bank by the river was eroding."

When the model processes "bank," its Query vector asks: "What kind of bank am I?" The Keys of "river" and "eroding" produce high dot products with this Query. So the attention weights concentrate on those words, and the Value vectors from "river" and "eroding" dominate the output.

The result? The model understands this is a river bank, not a financial bank — even though both meanings map to the same token. Context resolves ambiguity through attention.

Multi-head attention

A single attention computation can only capture one type of relationship. But language has many simultaneous relationships: syntax, semantics, coreference, temporal order, sentiment.

Multi-head attention runs several attention computations in parallel, each with its own Q, K, V weight matrices. Each "head" learns to focus on different patterns:

  • One head might track subject-verb agreement
  • Another might follow pronoun references
  • A third might capture positional proximity
  • A fourth might learn semantic similarity

The outputs of all heads are concatenated and projected through a final linear layer to produce the combined representation.
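The concatenate-and-project step looks like this in a NumPy sketch. The per-head loop below trades the batched matrix tricks real implementations use for readability, and all weights are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, d_k, rng):
    """Illustrative multi-head attention with random stand-in weights."""
    n, d_model = X.shape
    head_outputs = []
    for _ in range(n_heads):
        # Each head gets its own learned Q/K/V projections
        W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) * 0.02
                         for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(d_k)
        head_outputs.append(softmax(scores) @ V)
    concat = np.concatenate(head_outputs, axis=-1)        # (n, n_heads * d_k)
    W_O = rng.standard_normal((n_heads * d_k, d_model)) * 0.02
    return concat @ W_O                                   # final linear projection

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 512))
out = multi_head_attention(X, n_heads=8, d_k=64, rng=rng)
print(out.shape)  # (6, 512): back to the model dimension
```

Note the final projection W_O maps the concatenated heads back to the model dimension, so the output can flow into the next layer unchanged in shape.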

The numbers are staggering. GPT-3 uses 96 attention heads per layer across 96 layers — that's 9,216 parallel attention computations for every single token. GPT-4 is estimated at ~128 heads across ~120 layers, totaling over 15,000 attention operations per token.

Research from Anthropic, Google, and others has revealed that specific heads reliably learn specific roles. Some heads consistently track syntax trees. Others learn induction patterns — "if A followed B before, and A appears again, predict B." This is not programmed; it emerges from training.

Causal masking: Why LLMs can't peek ahead

In a decoder-only model (GPT, Claude, Llama), there's an important constraint: when generating text, the model can only attend to tokens that came before the current position. You can't use future words to predict the present one.

This is enforced through a causal mask — a triangular matrix that sets all "future" attention scores to negative infinity before softmax. The result: zero attention weight on tokens that haven't been generated yet.
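Here is the trick in miniature: set every score above the diagonal to negative infinity, and softmax turns those positions into exactly zero weight. Uniform zero scores are used below so the surviving weights are easy to read:

```python
import numpy as np

n = 4
scores = np.zeros((n, n))  # pretend raw attention scores (all equal)

# Boolean mask that is True wherever the key position lies in the future
future = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[future] = -np.inf   # -inf becomes zero weight after softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # row i spreads weight only over positions 0..i
```

Row 0 attends only to itself; row 3 attends evenly to all four positions. No token ever receives weight from a position after it.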

This is why these models are called autoregressive: each token is generated based solely on the tokens before it.

Encoder models like BERT don't use causal masking — they allow bidirectional attention, looking both forward and backward. This makes them powerful for understanding tasks but unable to generate text token by token.

The cost: O(n²) and why it matters

Here's the catch. Every token attends to every other token. For a sequence of length n, that's n × n attention scores per head per layer. The computational cost is O(n²).

For a short prompt of 100 tokens, that's 10,000 scores — trivial. But for a 128,000-token context window, it's over 16 billion scores per head per layer. This is why long-context models are expensive to run and why researchers are racing to find sub-quadratic alternatives like linear attention and sparse attention patterns, alongside engineering work like FlashAttention (which keeps attention exact but restructures the memory access pattern to shrink the constant factor).
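The arithmetic behind those numbers is a one-liner worth making explicit:

```python
def attention_scores(n_tokens: int) -> int:
    """Pairwise attention scores per head per layer: every token vs. every token."""
    return n_tokens * n_tokens

print(f"{attention_scores(100):,}")      # 10,000
print(f"{attention_scores(128_000):,}")  # 16,384,000,000
```

Growing the context 1,280x (100 to 128,000 tokens) grows the score count by 1,280² — more than 1.6 million times. That is the quadratic wall.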

Every time you hear about a model supporting "1 million tokens of context," know that the engineering behind it is fighting this quadratic wall.

Why attention changed everything

Before 2017, the best language models were recurrent, slow to train (no parallelism), and limited in the relationships they could capture. Attention made three things possible at once:

  1. Parallelism — all attention scores computed simultaneously (GPUs love this)
  2. Long-range connections — direct paths between any two tokens
  3. Interpretability — you can visualize exactly which words attend to which

This combination unlocked the scaling revolution. Transformers could be made bigger and trained faster than anything before them. Every major language model since — GPT, Claude, Gemini, Llama — is built on this foundation.

See it in action

Try our interactive attention visualizer — click on any word and watch the attention arcs light up, showing exactly how each token weighs every other token in real time. Toggle between single-head and multi-head views to see how different heads specialize.