Subwords & BPE
Theory
Modern LLMs don't tokenize word-by-word. They use Byte-Pair Encoding (BPE), a subword algorithm that builds a vocabulary of common character sequences.
- Split every word in the corpus into individual characters
- Find the most frequent adjacent pair of tokens across the corpus
- Merge that pair into a new token; repeat until the vocabulary is full
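The loop above can be sketched in a few lines of Python. This is a toy illustration on a small word list, not a production tokenizer; the function name `learn_bpe` and the corpus are made up for this example:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from characters.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair merged into one symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

Real tokenizers train on byte sequences over gigabytes of text, but the core loop is exactly this: count pairs, merge the winner, repeat.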
Start:      [h][e][l][l][o]
Merge "he": [he][l][l][o]
Merge "ll": [he][ll][o]
Final:      [hello]
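Once merges are learned, encoding a new word just replays them in order. A minimal sketch, assuming a hand-written merge list that reproduces a trace like the one above (`apply_bpe` is a name invented for this example):

```python
def apply_bpe(word, merges):
    """Segment a word by applying learned merges in training order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            # Merge the pair wherever it appears in the current symbols.
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
print(apply_bpe("hello", merges))  # → ['hello']
print(apply_bpe("holly", merges))  # → ['h', 'o', 'll', 'y']
```

Note that a word the merges were not trained for ("holly") still gets segmented; it just ends up as more, smaller pieces.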
Common word: run → 1 token. Frequent enough to earn its own slot.

Rare compound: internationalization → inter + national + ization = 3 tokens.

GPT-4 uses ~100K token types. Any word can be assembled from subword pieces — no word is truly "unknown" — but rare words cost more tokens, eating context budget and money.
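The coverage claim is easy to demonstrate. The sketch below uses greedy longest-match segmentation (WordPiece-style, simpler than replaying BPE merges in order, but it illustrates the same point); the vocabulary contents are invented for this example:

```python
def segment(word, vocab):
    """Greedily take the longest vocabulary piece at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Character not in vocab at all: emit it as-is.
            pieces.append(word[i])
            i += 1
    return pieces

# Toy vocab: three subword pieces plus every single letter as a fallback.
vocab = {"inter", "national", "ization"} | set("abcdefghijklmnopqrstuvwxyz")
print(segment("internationalization", vocab))  # → ['inter', 'national', 'ization']
print(segment("run", vocab))                   # → ['r', 'u', 'n']
```

Because every single character is in the vocabulary, no input can fail to tokenize; rare words simply decompose into more pieces, which is exactly why they cost more.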