How AI Text Generation Actually Works

LLMs generate text one token at a time. Learn the autoregressive loop, KV caching, beam search vs sampling, stop conditions, and why the same prompt gives different outputs.

You type a prompt. The model replies with a paragraph. It feels instantaneous and holistic, as if the model composed the entire response at once. But that's an illusion. Under the hood, every language model generates text one token at a time, in a loop, each token chosen based on everything that came before it.

Understanding this process explains a lot about why LLMs behave the way they do — why they sometimes start strong and go off the rails, why they can't "go back and fix" earlier mistakes, and why the same prompt produces different outputs each time.

The autoregressive loop

Here's the core algorithm. It's remarkably simple:

  1. Take the entire sequence so far (prompt + any generated tokens)
  2. Run it through the transformer
  3. Get a probability distribution over all possible next tokens (~100,000 options)
  4. Select one token from that distribution
  5. Append it to the sequence
  6. Go back to step 1

That's it. The model generates one token, appends it, and runs the entire model again to generate the next one. A 500-token response requires 500 complete forward passes through the transformer.
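In code, the loop above is only a few lines. Here is a minimal sketch, assuming a hypothetical `model` callable that runs a full forward pass and returns next-token scores, and a `sample` function that picks one token from them:

```python
def generate(model, sample, prompt_tokens, max_new_tokens, eos_token):
    # `model` and `sample` are hypothetical stand-ins: `model` runs a full
    # forward pass over the whole sequence, `sample` selects one next token.
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = model(tokens)        # steps 1-3: forward pass -> distribution
        next_token = sample(scores)   # step 4: select one token
        tokens.append(next_token)     # step 5: commit it; no going back
        if next_token == eos_token:   # stop condition (covered below)
            break
    return tokens
```

A 500-token response really does mean 500 trips through this loop; much of the rest of this article is about making those trips cheaper or choosing the token in step 4 more cleverly.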

This is called autoregressive generation — each step depends on all previous steps. The model can't plan ahead. It can't outline paragraph 3 before writing paragraph 1. It can't go back and revise. Each token is committed the moment it's generated.

This explains a common LLM behavior: a model might start answering a question one way, realize midway that the approach isn't working, and then awkwardly course-correct with phrases like "Actually, let me reconsider." It can't edit its earlier tokens — it can only append new ones.

KV caching: Making it fast

Running the full transformer for every single token sounds expensive — and it would be, without KV caching.

Here's the insight: when generating token 501, you don't actually need to recompute the attention Keys and Values for tokens 1-500. Those haven't changed. Only the new token's Query, Key, and Value vectors are needed.

KV caching stores the Key and Value vectors from all previous tokens in memory. For each new token, the model only computes the new token's Q, K, V, looks up the cached K and V from all previous tokens, and computes attention between the new token and the entire history.

The speedup is dramatic. Without KV caching, each of the n generation steps reprocesses the entire sequence, so the total work grows as O(n²) token computations. With KV caching, each step processes only the single new token against the cached history, bringing the total down to O(n).
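Here is a toy sketch of the idea, with scalar "embeddings" and made-up projection weights so the caching logic stays visible; a real model caches one Key and Value tensor per layer and per attention head:

```python
import math

# KV cache shared across steps. In a real system this lives on the GPU,
# one cache per layer and per attention head.
k_cache, v_cache = [], []

def attend(x_new, wq=0.5, wk=1.0, wv=2.0):
    """Process ONE new token 'embedding' against the cached history."""
    q = x_new * wq                # only the new token needs a Query
    k_cache.append(x_new * wk)    # append its Key and Value; old entries
    v_cache.append(x_new * wv)    # are reused as-is, never recomputed
    scores = [k * q for k in k_cache]           # attend over full history
    m = max(scores)                             # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, v_cache))
```

The point is in the two `append` calls: everything already in `k_cache` and `v_cache` is reused, so each step's new work is proportional to one token, not the whole sequence.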

The tradeoff is memory. For a model on the scale of GPT-4 (reportedly on the order of 120 layers and 128 attention heads), the KV cache for a 128K-token context can consume tens of gigabytes of GPU memory. This is often the bottleneck for long-context inference: not compute, but memory.
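To see where "tens of gigabytes" comes from, here is a back-of-envelope estimate. Every number is an illustrative assumption (the layer count, grouped-query attention with 8 KV heads, the head size, fp16 storage); frontier-model architectures are not public:

```python
layers = 120             # assumed transformer depth
kv_heads = 8             # assumed grouped-query attention: few shared KV heads
head_dim = 128           # assumed dimension per head
context = 128_000        # tokens held in the cache
bytes_per_value = 2      # fp16/bf16 storage

# x2 because both Keys and Values are cached at every layer
cache_bytes = 2 * layers * context * kv_heads * head_dim * bytes_per_value
print(round(cache_bytes / 1e9))  # roughly 63 GB for a single sequence
```

With full multi-head attention (128 KV heads instead of 8) the same arithmetic lands near a terabyte, which is why techniques like grouped-query attention and cache quantization matter so much for long contexts.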

Prompt processing vs generation

There's an important distinction between two phases:

Prefill (prompt processing): The entire prompt is processed in parallel, generating KV cache entries for every token simultaneously. This is fast because the full prompt is known upfront and GPUs are optimized for parallel matrix operations.

Decoding (token generation): Tokens are generated one at a time, sequentially. Each step is a small computation but cannot be parallelized with the next step. This is why LLM responses appear to "stream" — each token arrives as it's computed.

The prefill phase is typically much faster per token than the decoding phase. A 1,000-token prompt might process in 200ms, but generating 1,000 tokens of response takes several seconds because each token must wait for the previous one.
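The two phases can be sketched as follows, assuming a hypothetical model object with a batched `prefill` method and a one-token `decode_step` method (both names invented for illustration):

```python
def run(model, prompt_tokens, n_new):
    # Prefill: the whole prompt goes through in ONE parallel call,
    # producing KV cache entries for every prompt token at once.
    kv_cache = model.prefill(prompt_tokens)
    token = prompt_tokens[-1]
    generated = []
    # Decode: strictly sequential; each step waits on the previous token.
    for _ in range(n_new):
        token, kv_cache = model.decode_step(token, kv_cache)
        generated.append(token)
    return generated
```

The asymmetry is structural: `prefill` is one big matrix operation GPUs love, while the decode loop is a chain of small dependent steps that cannot be parallelized across time.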

How the next token is selected

At each step, the model outputs ~100,000 probabilities — one for each token in its vocabulary. How does it pick one?

Greedy decoding (temperature = 0)

Always pick the highest-probability token. Deterministic — same input always produces same output. Fast and consistent, but tends to produce repetitive, generic text. The model gets stuck in high-probability loops.

Sampling (temperature > 0)

Randomly sample from the probability distribution, with temperature controlling how random. Temperature < 1 concentrates probability on top tokens (conservative). Temperature > 1 flattens the distribution (creative). This is what most chatbots use.
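A minimal implementation of temperature sampling over toy logits (greedy decoding, temperature 0, is handled as a special case in real systems, since dividing by zero is undefined):

```python
import math, random

def sample_with_temperature(logits, temperature, rng=random):
    # Divide logits by T, then softmax: T < 1 sharpens the distribution
    # toward the top token, T > 1 flattens it toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()                          # inverse-CDF sampling
    cumulative = 0.0
    for token_id, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return token_id
    return len(probs) - 1                     # guard against rounding
```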

For a deeper exploration of temperature and sampling strategies, see our post on what temperature means in AI.

Beam search

Maintain the top-N (typically 4-5) candidate sequences at each step, expanding each by one token and keeping the best N overall. This finds high-probability sequences rather than greedily picking one token at a time.

Beam search was dominant in early NLP tasks like machine translation and summarization. It tends to produce fluent, safe, but somewhat generic text. For open-ended generation and chatbots, sampling has largely replaced it because beam search outputs lack the diversity and naturalness that users expect.
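A minimal beam search, assuming a hypothetical `expansions(seq)` function that returns (token, log-probability) pairs for the possible continuations of a sequence:

```python
def beam_search(expansions, start_tokens, beam_width, steps):
    # Each beam is (cumulative log-probability, token sequence).
    beams = [(0.0, list(start_tokens))]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for token, log_p in expansions(seq):
                candidates.append((score + log_p, seq + [token]))
        # Keep only the top-N sequences OVERALL, not per beam.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][1]   # highest-probability sequence found
```

Because whole sequences are scored, a beam can recover from a token that looked mediocre in isolation but led somewhere good, which greedy decoding never can.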

Speculative decoding

A newer technique for faster inference. A small, fast "draft" model generates several candidate tokens at once. The large model then verifies them in parallel (since verification is parallelizable, unlike generation). If the draft tokens match what the large model would have produced, they're accepted in bulk. If not, the process falls back to the large model.

This can achieve 2-3x speedups without changing the output distribution: the accepted tokens are guaranteed to follow exactly the same distribution the large model would have produced on its own.
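The accept/reject dance can be sketched with greedy drafting and verification (real systems compare full probability distributions, which is what preserves the sampling distribution exactly). `draft` and `target` are hypothetical next-token predictors:

```python
def speculative_step(target, draft, tokens, k):
    # 1. The small draft model proposes k tokens cheaply, one at a time.
    proposed = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. The large target model verifies every proposed position. A real
    #    system does this in ONE batched forward pass; shown sequentially
    #    here for clarity.
    accepted = []
    ctx = list(tokens)
    for t in proposed:
        if target(ctx) == t:
            accepted.append(t)        # draft agreed: accept for free
            ctx.append(t)
        else:
            accepted.append(target(ctx))  # first mismatch: take the
            break                         # target's own token and stop
    return tokens + accepted
```

When the draft model agrees often, most tokens are accepted in bulk and the expensive model is invoked far less than once per token.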

Stop conditions: When does generation end?

The model doesn't inherently know when to stop. Left unchecked, it would generate tokens forever. Several mechanisms trigger the end:

End-of-sequence (EOS) token: During training, the model learns a special token that means "I'm done." When the model assigns high probability to the EOS token, generation stops. This is the primary mechanism — the model learns when a response is complete.

Maximum length: A hard cap (e.g., 4,096 tokens) prevents runaway generation. If the model hasn't produced an EOS token by this limit, generation is truncated. This is why long responses sometimes end abruptly mid-sentence.

Stop sequences: The API user can specify strings (e.g., "\n\n", "User:") that trigger early termination. Useful for structured formats like conversations, where you want the model to stop after its turn.

Repetition penalties: If the model starts repeating itself (a common failure mode at low temperatures), a penalty on recently generated tokens can break the loop — or generation can be stopped entirely.
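The first three mechanisms combine into a simple check that a serving loop runs after every token. The parameter names below mirror common API options but are invented for this sketch:

```python
def check_stop(generated_text, last_token, n_generated,
               eos_token="<|eos|>", max_tokens=4096, stop_sequences=()):
    """Return the reason generation should stop, or None to continue."""
    if last_token == eos_token:
        return "eos"             # the model itself signalled completion
    if n_generated >= max_tokens:
        return "length"          # hard cap; may truncate mid-sentence
    for s in stop_sequences:
        if generated_text.endswith(s):
            return "stop_sequence"   # caller-specified terminator
    return None
```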

Why the same prompt gives different outputs

When temperature > 0, each token selection involves randomness. Even if the probability distribution is identical each time, the random sampling produces different tokens. And since each token affects all subsequent tokens, a single different choice early on can cascade into a completely different response.

Consider a fork at token 10 where "however" and "additionally" both have 15% probability. One path leads to a counterargument. The other leads to supporting evidence. Same prompt, same model, entirely different outputs — because of one random choice.

This is a feature, not a bug. It's what makes LLMs feel creative and conversational rather than robotic. But it also means that evaluating an LLM on a single output is unreliable — you're sampling from a distribution of possible responses.

At temperature 0, the output is deterministic (same prompt → same output, assuming identical infrastructure). This is why low temperature is preferred for tasks requiring consistency.

The implications

Understanding autoregressive generation explains several LLM quirks:

  • No revision: The model can't go back and edit. If it realizes a mistake on token 200, it can only try to correct course going forward.
  • Left-to-right bias: The beginning of a response is generated with less context than the end. The model "commits" to a direction early.
  • Length sensitivity: Longer outputs have more opportunities for drift, error accumulation, and repetition.
  • Streaming: Responses appear token by token because they're generated token by token. What you see is exactly the process happening.
  • Planning limitations: The model can't outline a full response before starting. It writes the first sentence without knowing the last. Chain-of-thought prompting helps by making the model "think out loud" before answering.

Watch generation happen in real time

Try our interactive generation visualizer — watch the token tree branch as the model generates text, see the probability distribution shift at each step, and understand why each token was chosen.