What Is Tokenization in NLP?
Tokenization is the first step in how AI models understand text. Learn how BPE works step by step, why 'strawberry' breaks models, and how vocabulary size affects everything.
Before a language model can understand a single word you write, it needs to break your text into small pieces. Those pieces are called tokens, and the process of creating them is called tokenization. It sounds mundane — just splitting text, right? But tokenization is quietly one of the most consequential design decisions in all of AI.
Why can't models just read words?
Computers don't understand words — they understand numbers. Tokenization is the bridge: it maps text into a sequence of numerical IDs that the model can process.
But here's the twist: modern tokenizers don't split on word boundaries. The word "understanding" might become two tokens: understand + ing. The word "cat" is one token. An unusual word like "defenestration" might become three or four tokens: def + en + est + ration.
This subword approach is a deliberate tradeoff. Pure word-level tokenization would need millions of entries to handle every word form, proper noun, and typo. Character-level tokenization would make sequences impossibly long. Subword tokenization hits the sweet spot: a manageable vocabulary (30,000–100,000 tokens) that can represent any text.
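To make the mapping concrete, here's a minimal greedy longest-match subword tokenizer over a toy vocabulary. Everything here is simplified for illustration: the vocabulary entries and IDs are invented, and real BPE tokenizers apply learned merge rules rather than longest-match lookup (longest-match is closer to how WordPiece segments text).

```python
# Toy vocabulary: entries and IDs are invented for this example.
# Real models learn vocabularies of 30,000-100,000 entries from data.
TOY_VOCAB = {
    "understand": 0, "ing": 1, "cat": 2,
    "def": 3, "en": 4, "est": 5, "ration": 6,
}

def tokenize(text: str, vocab: dict[str, int]) -> list[str]:
    """Split text into the longest matching vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("understanding", TOY_VOCAB))   # ['understand', 'ing']
print(tokenize("defenestration", TOY_VOCAB))  # ['def', 'en', 'est', 'ration']
```

The model never sees the strings, only the IDs: `[vocab[t] for t in tokens]` is what actually enters the network.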
Byte-Pair Encoding: A worked example
Most modern LLMs use Byte-Pair Encoding (BPE) or a close variant. Here's how it works, step by step.
Start with a tiny training corpus: "low low low lowest newest"
Step 1: Split everything into individual characters plus a word-end marker · (to keep this example simple, we'll only count and merge pairs that don't include the marker):
l o w · | l o w · | l o w · | l o w e s t · | n e w e s t ·
Step 2: Count all adjacent character pairs. The most frequent pair is l, o (appears 4 times). Merge it into a new token lo:
lo w · | lo w · | lo w · | lo w e s t · | n e w e s t ·
Step 3: Now lo, w is most frequent (4 times). Merge into low:
low · | low · | low · | low e s t · | n e w e s t ·
Step 4: e, s appears twice. Merge into es:
low · | low · | low · | low es t · | n e w es t ·
Step 5: es, t appears twice. Merge into est:
low · | low · | low · | low est · | n e w est ·
Keep merging until you reach your target vocabulary size. GPT-4 runs roughly 100,000 merges to build its vocabulary of ~100,000 tokens. Each merge is a learned compression of the training data's statistical patterns.
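The merge loop above can be sketched in a few lines of Python. This follows the simplified convention of the worked example (pairs containing the word-end marker are never merged, and ties go to the pair seen first); production BPE implementations differ in details like tie-breaking and byte-level handling.

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a list of words, as in the worked example."""
    # Each word starts as individual characters plus a word-end marker.
    words = [list(w) + ["·"] for w in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair (skipping the word-end marker).
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                if "·" not in (a, b):
                    pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

corpus = "low low low lowest newest".split()
print(bpe_merges(corpus, 4))
# [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
```

Running it on the tiny corpus reproduces exactly the four merges walked through above.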
BPE vs SentencePiece vs WordPiece
Not all tokenizers use the same algorithm:
| Algorithm | Used By | Key Difference |
|-----------|---------|----------------|
| BPE | GPT-2, GPT-3, GPT-4 | Merges most frequent byte pairs |
| SentencePiece | Llama, T5, ALBERT | Works on raw text (no pre-tokenization), language-agnostic |
| WordPiece | BERT, DistilBERT | Maximizes training data likelihood instead of frequency |
SentencePiece is particularly notable because it treats the input as a raw stream of text, making no assumptions about spaces, words, or language. This makes it more consistent across languages, though byte-level BPE (used by the GPT models) achieves similar coverage.
The multilingual problem
Tokenization is biased toward the language dominant in training data. English text tokenizes efficiently, at roughly 1.3 tokens per word. Other languages fare worse:
- English: "Hello, how are you?" → ~6 tokens
- Japanese: The same meaning → ~12–15 tokens
- Hindi: Similar meaning → ~10–14 tokens
- Korean: Similar meaning → ~10–13 tokens
This means non-English users hit context limits faster and pay more per API call for the same amount of meaning. It also means models have less "thinking room" per concept in languages with lower token efficiency. Some researchers argue this is a structural source of performance disparity across languages.
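A back-of-the-envelope calculation shows what this inflation costs. The token counts come from the illustrative examples above; the per-token price is hypothetical, since real API pricing varies by provider and model.

```python
PRICE_PER_1K_TOKENS = 0.01  # hypothetical USD, for illustration only

def cost(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """API cost of a request, billed per token."""
    return tokens / 1000 * price_per_1k

english = cost(6)    # "Hello, how are you?" at ~6 tokens
japanese = cost(13)  # the same meaning at ~13 tokens
print(f"Same sentence, about {japanese / english:.1f}x the cost in Japanese")
```

The ratio is independent of the price you plug in: whatever a token costs, a language that needs twice as many tokens pays twice as much for the same meaning.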
Why LLMs can't count the R's in "strawberry"
This is the most famous tokenization failure. Ask an LLM "How many R's are in strawberry?" and many models answer "2" instead of "3." Why?
Because the model never sees the individual letters. "Strawberry" is typically tokenized as something like str + aw + berry. The model processes these chunks, not characters. It has no built-in mechanism to scan individual letters within a token — it would need to reason about the substructure of its own input representation.
This isn't a reasoning failure. It's a representation failure. The model literally cannot see what it's being asked to count. The same issue affects tasks like reversing strings, detecting palindromes, and any operation that requires character-level access.
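You can see the gap between the two views of the text directly: character counting is trivial on the raw string, but the model only ever receives token IDs. The split and the IDs below are made up for illustration.

```python
# Character-level counting is trivial on raw text...
print("strawberry".count("r"))  # 3

# ...but a model sees only token IDs. This split and these IDs are
# hypothetical, just to show the shape of the model's actual input.
tokens = ["str", "aw", "berry"]
token_ids = [496, 675, 15717]  # made-up IDs

# The model receives [496, 675, 15717]. No 'r' is visible anywhere in
# that sequence; to count letters, the model would need to have
# memorized each token's spelling during training.
rs_per_token = {t: t.count("r") for t in tokens}
print(rs_per_token)  # {'str': 1, 'aw': 0, 'berry': 2}
```

The letter counts live inside the token strings, which the model never sees: from the ID sequence alone, there is nothing to count.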
Vocabulary size tradeoffs
How big should the vocabulary be? It's a genuine engineering tension:
Larger vocabulary (100K+):
- More words become single tokens → shorter sequences → cheaper inference
- Better representation of rare words and multilingual text
- But: larger embedding tables consume more memory, and rare tokens get less training signal
Smaller vocabulary (30K–50K):
- Smaller model footprint, faster loading
- Every token is well-trained (appeared many times)
- But: common words split into subwords → longer sequences → more computation
GPT-2 used ~50,000 tokens. GPT-4 expanded to ~100,000. Llama 2 uses 32,000. There's no universally "right" answer — it depends on the training data, target languages, and deployment constraints.
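The memory side of this tradeoff is easy to estimate: the embedding table holds vocab_size × d_model parameters. A sketch, assuming fp16 weights and an illustrative model width of 4096 (the widths and pairings here are examples, not any specific model's published configuration):

```python
def embedding_table_mb(vocab_size: int, d_model: int,
                       bytes_per_param: int = 2) -> float:
    """Memory for the token embedding matrix, in MB (fp16 by default)."""
    return vocab_size * d_model * bytes_per_param / 1024**2

# Same model width, different vocabulary sizes:
print(f"{embedding_table_mb(32_000, 4096):.0f} MB")   # 32K vocab:  250 MB
print(f"{embedding_table_mb(100_000, 4096):.0f} MB")  # 100K vocab: 781 MB
```

Tripling the vocabulary triples the embedding table, and most models pay this twice (input embeddings plus the output projection) unless the two matrices are tied.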
Why this matters for you
Every time you use an LLM, tokenization shapes your experience:
- API costs are per token, not per word — understanding tokenization helps you estimate costs
- Context windows are measured in tokens — a 128K context window is roughly 96K English words
- Prompt engineering benefits from token awareness — some phrasings tokenize more efficiently
- Model behavior on letter-level tasks (counting, spelling, reversing) is directly limited by tokenization
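For quick estimates, a common rule of thumb for English is about 0.75 words per token (equivalently, ~1.33 tokens per word). A small converter, assuming that ratio; actual ratios vary by tokenizer and by text:

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb ratio for English text

def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a token budget."""
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words: int) -> int:
    """Estimate the token cost of a given word count."""
    return round(words / WORDS_PER_TOKEN)

print(tokens_to_words(128_000))  # a 128K context ~ 96,000 English words
print(words_to_tokens(1_500))    # a 1,500-word article ~ 2,000 tokens
```

For non-English text, scale the estimate by the token inflation factors discussed above before trusting it.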
Try it yourself
Head over to our How LLMs Work interactive explainer and try the tokenizer — type any text and watch it split into tokens in real time, with BPE merge steps visualized as they happen.