Tokens: The Pieces of Language
In Lesson 2, we said that language models predict 'the next token' over and over. But what exactly is a token? It is not a letter, and it is not always a full word. It is often something in between: a chunk of text that the model's tokenizer treats as a single unit. Understanding tokens is genuinely useful: it explains why models sometimes stumble on unusual words, why they count letters and syllables imperfectly, why API providers charge by the token, and how to write prompts that work within a model's limits.
What a Token Is
A tokenizer is a program that breaks text into tokens before passing it to a language model. Different models use different tokenizers, but most popular ones use a method called Byte Pair Encoding (BPE) or a close relative. Here is the core idea:
- Common words get their own token. Words such as 'the', 'is', 'and', 'cat', and 'school' appear so often that each becomes a single token.
- Rare or long words get split into pieces. 'unbelievable' might become ['un', 'believ', 'able']. 'photosynthesis' might be ['photo', 'synthesis'] or even more pieces.
- Numbers and punctuation often become their own tokens. '2024' might tokenize as ['20', '24'] or ['2', '0', '2', '4'], depending on the tokenizer.
- Whitespace is often attached to the following token. So in the middle of a sentence, the words 'black cat' might tokenize as [' black', ' cat'] rather than ['black', ' ', 'cat'].
On average, one token is roughly four characters of English text. A 1,000-word essay is roughly 1,300-1,500 tokens.
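To make the BPE idea concrete, here is a minimal Python sketch of how merge rules could be learned from a tiny word list. The training words, the function names (`merge_pair`, `learn_bpe_merges`, `apply_merges`), and the number of merges are all assumptions chosen for illustration. Production tokenizers typically work on bytes, handle whitespace, and learn tens of thousands of merges from far larger corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged everywhere.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(list(symbols), best))] += freq
        corpus = new_corpus
    return merges


def apply_merges(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Made-up training data: 'the' is common, so its letters get merged first.
training_words = ["the"] * 5 + ["this"] * 3 + ["then"] * 2 + ["cat"]
merges = learn_bpe_merges(training_words, num_merges=2)

print(merges)                        # [('t', 'h'), ('th', 'e')]
print(apply_merges("the", merges))   # ['the']
print(apply_merges("thaw", merges))  # ['th', 'a', 'w']
```

In this toy run the frequent word 'the' ends up as a single token, while the unseen word 'thaw' is still representable as smaller pieces, which mirrors the common-versus-rare behavior described above.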
A token is the atomic unit a language model processes. It is produced by a tokenizer that splits text into subword chunks — pieces smaller than whole words but larger than individual letters. The model never sees raw text; it sees a sequence of integer IDs, one per token.
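The step from tokens to integer IDs can be sketched in a few lines of Python. The vocabulary and the ID numbers below are made up for illustration; a real tokenizer ships with its own fixed table containing tens of thousands of entries.

```python
# Hypothetical vocabulary: the tokens and ID numbers are invented for this example.
vocab = {"The": 17, " quick": 402, " brown": 88, " fox": 951, ".": 3}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(tokens):
    """Map each token string to its integer ID."""
    return [vocab[token] for token in tokens]


def decode(ids):
    """Map integer IDs back to token strings and join them into text."""
    return "".join(id_to_token[token_id] for token_id in ids)


tokens = ["The", " quick", " brown", " fox", "."]
ids = encode(tokens)

print(ids)          # [17, 402, 88, 951, 3]  <- this is all the model receives
print(decode(ids))  # The quick brown fox.
```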
Here is a concrete example. Take the sentence: 'The quick brown fox jumps over the lazy dog.' Using a typical tokenizer, this might split into 10 tokens: ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.'] Now try a less common phrase: 'Bioluminescent jellyfish glow beautifully.' This might split into 8 tokens: ['Bio', 'lumin', 'escent', ' jelly', 'fish', ' glow', ' beautifully', '.'] That is 8 tokens for only 4 words, compared with 10 tokens for 9 words in the first sentence, because the tokenizer had to break up 'bioluminescent' and 'jellyfish.' This matters for performance, since the model must process each token as a separate step, and it matters for cost, since most APIs charge per token.
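The arithmetic behind this comparison can be checked with a short Python sketch. The two token lists are the illustrative splits from the example above, not the output of a real tokenizer, and the price is a made-up figure used only to show how per-token billing scales.

```python
# Illustrative splits from the text (a real tokenizer may split differently).
common = ["The", " quick", " brown", " fox", " jumps",
          " over", " the", " lazy", " dog", "."]
rare = ["Bio", "lumin", "escent", " jelly", "fish",
        " glow", " beautifully", "."]


def tokens_per_word(tokens):
    """Ratio of tokens to whitespace-separated words in the reassembled text."""
    text = "".join(tokens)
    return len(tokens) / len(text.split())


def estimate_cost(num_tokens, price_per_million_tokens):
    """Cost of processing `num_tokens` at a given price per million tokens."""
    return num_tokens * price_per_million_tokens / 1_000_000


HYPOTHETICAL_PRICE = 2.00  # assumed dollars per million tokens, not a real rate

print(round(tokens_per_word(common), 2))  # 1.11
print(tokens_per_word(rare))              # 2.0
print(estimate_cost(1_500, HYPOTHETICAL_PRICE))
```

Text full of rare words costs more tokens per word, so the same word count can produce a noticeably different bill.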
Why Not Just Use Letters or Words?
This is a fair question. Letters are tempting: there are only 26 in English (plus digits, punctuation, and capitals), so the vocabulary would be tiny. But letters carry almost no meaning on their own, and text split into letters becomes a very long sequence. The model would need to make hundreds or even thousands of predictions to generate a paragraph, and each prediction would add very little.
Whole words are also tempting. But there are hundreds of thousands of English words, plus names, technical terms, abbreviations, and words in other languages. A model with only whole-word tokens could not represent any word missing from its vocabulary; it would have to fall back on a generic 'unknown' placeholder.
Subword tokens offer a practical middle ground. Common words get their own token (efficient, meaningful). Rare words are composed from known subword pieces (flexible, handles novelty). The vocabulary stays manageable, often around 50,000 to 100,000 tokens and sometimes a few hundred thousand in newer models, while covering virtually any text the model might encounter.
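The trade-off between the three options can be seen side by side in a small Python sketch. Everything here is an assumption made for illustration: the tiny vocabularies, the `<unk>` placeholder, and the function names. The subword function uses greedy longest-match, a simplification that is not the same procedure as BPE, but it shows the same fallback behavior.

```python
def char_tokens(text):
    """Letter-level: tiny vocabulary, but one token per character."""
    return list(text)


def word_tokens(text, word_vocab):
    """Word-level: unknown words collapse to a placeholder and lose their content."""
    return [word if word in word_vocab else "<unk>" for word in text.split()]


def subword_tokens(word, pieces):
    """Subword-level via greedy longest-match (a simplification, not real BPE)."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest candidate piece first, then shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


word_vocab = {"the", "cat"}           # hypothetical whole-word vocabulary
pieces = {"un", "believ", "able"}     # hypothetical subword vocabulary

print(len(char_tokens("unbelievable")))             # 12
print(word_tokens("the unbelievable cat", word_vocab))
# ['the', '<unk>', 'cat']
print(subword_tokens("unbelievable", pieces))       # ['un', 'believ', 'able']
print(subword_tokens("unzip", pieces))              # ['un', 'z', 'i', 'p']
```

The letter version needs 12 steps for one word, the word version loses the unfamiliar word entirely, and the subword version covers it in 3 tokens while still handling 'unzip' by falling back to smaller pieces.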
Because models think in tokens, not letters, they are not naturally good at tasks that require letter-level analysis — like counting the number of times a letter appears in a word, or spelling out a word backwards. Knowing this helps you set realistic expectations and avoid prompts that demand letter-perfect tasks the model's architecture makes difficult.
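A short Python sketch shows the gap between what a model receives and what a letter-level task requires. The split of 'strawberry' and the ID numbers are hypothetical, chosen only to illustrate the point; a real tokenizer may split the word differently.

```python
# Hypothetical split and made-up IDs for the word 'strawberry'.
pieces = ["str", "aw", "berry"]
hypothetical_ids = {"str": 496, "aw": 675, "berry": 15717}

# What the model receives: three integers with no visible letters.
model_input = [hypothetical_ids[piece] for piece in pieces]
print(model_input)  # [496, 675, 15717]

# What a letter-level task needs: the characters inside every piece.
word = "".join(pieces)
print(word.count("r"))                      # 3
print([piece.count("r") for piece in pieces])  # [1, 0, 2]
print(word[::-1])                           # yrrebwarts
```

The three occurrences of 'r' are spread across two different tokens, so answering correctly requires knowing the spelling inside each piece, something the integer IDs do not show directly. Ordinary code, by contrast, handles the same task trivially.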
Why do language models use subword tokens rather than individual letters?
An unusual scientific term like 'thermodynamics' is most likely tokenized as:
Tokenize It Yourself
- Take the following five phrases and try to predict how a tokenizer would split them. Write your predictions:
  1. 'hello world'
  2. 'unimaginable'
  3. 'ChatGPT'
  4. 'the year 2025'
  5. 'photosynthesis is fascinating'
- For each, predict: how many tokens? What are the pieces?
- Now visit platform.openai.com/tokenizer (if you have access) or search for 'OpenAI tokenizer' and paste in each phrase to see the real answer.
- Which predictions surprised you most? What pattern do you notice about which words stay whole versus get split?