Subwords & BPE
Theory
Modern LLMs don't tokenize word-by-word. They use Byte-Pair Encoding (BPE), a subword algorithm that builds a vocabulary of common character sequences.
- Split every word in the corpus into individual characters
- Find the most frequent adjacent pair of tokens across the corpus
- Merge that pair into a new token; repeat until the vocabulary is full
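The loop above can be sketched in a few lines of Python. This is a toy illustration on a small word list, not a production tokenizer; the function name `learn_bpe` and the corpus are made up for this example:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair merged into one symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

Real tokenizers train on byte sequences over gigabytes of text, but the core loop is exactly this: count pairs, merge the winner, repeat.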
Start:      [h][e][l][l][o]
Merge "he": [he][l][l][o]
Merge "ll": [he][ll][o]
Final:      [hello]
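Once merges are learned, encoding a new word just replays them in order. A minimal sketch, assuming a hand-written merge list that reproduces a trace like the one above (`apply_bpe` is a name invented for this example):

```python
def apply_bpe(word, merges):
    """Segment a word by applying learned merges in training order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            # Merge the pair wherever it appears in the current symbols.
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
print(apply_bpe("hello", merges))  # → ['hello']
print(apply_bpe("holly", merges))  # → ['h', 'o', 'll', 'y']
```

Note that a word the merges were not trained for ("holly") still gets segmented; it just ends up as more, smaller pieces.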
Common word: run → 1 token. Frequent enough to earn its own slot.

Rare compound: internationalization → inter + national + ization = 3 tokens.

GPT-4 uses ~100K token types. Any word can be assembled from subword pieces — no word is truly "unknown" — but rare words cost more tokens, eating context budget and money.
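The coverage claim is easy to demonstrate. The sketch below uses greedy longest-match segmentation (WordPiece-style, simpler than replaying BPE merges in order, but it illustrates the same point); the vocabulary contents are invented for this example:

```python
def segment(word, vocab):
    """Greedily take the longest vocabulary piece at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Character not in vocab at all: emit it as-is.
            pieces.append(word[i])
            i += 1
    return pieces

# Toy vocab: three subword pieces plus every single letter as a fallback.
vocab = {"inter", "national", "ization"} | set("abcdefghijklmnopqrstuvwxyz")
print(segment("internationalization", vocab))  # → ['inter', 'national', 'ization']
print(segment("run", vocab))                   # → ['r', 'u', 'n']
```

Because every single character is in the vocabulary, no input can fail to tokenize; rare words simply decompose into more pieces, which is exactly why they cost more.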