Subwords & BPE

Act 1 · Foundations · ~4 min

Theory

Modern LLMs don't tokenize word-by-word. They use Byte-Pair Encoding (BPE), a subword algorithm that builds a vocabulary of common character sequences from a training corpus.

    1. Start with every input as individual characters
    2. Find the most frequent adjacent pair across the corpus
    3. Merge that pair into a new token; repeat until vocab is full
    Start:      [h][e][l][l][o]
    Merge "he": [he][l][l][o]
    Merge "ll": [he][ll][o]
    Final:      [hello]

BPE merges the most frequent adjacent pair, step by step; a minimal Python sketch of this merge loop follows below.
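For readers who prefer code, here is a minimal sketch of that merge loop, assuming a toy corpus of character-split words with counts; the helper names and the corpus are illustrative, not taken from any real tokenizer library.

    from collections import Counter

    def most_frequent_pair(vocab):
        # Count adjacent symbol pairs across all words, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs.most_common(1)[0][0] if pairs else None

    def merge_pair(vocab, pair):
        # Replace every occurrence of the pair with one merged symbol.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        return merged

    # Step 1: every word starts as individual characters (toy corpus with counts).
    vocab = {tuple("hello"): 5, tuple("help"): 3, tuple("hell"): 2}

    # Steps 2-3: merge the most frequent adjacent pair, repeat until the budget is spent.
    for _ in range(4):
        pair = most_frequent_pair(vocab)
        if pair is None:
            break
        vocab = merge_pair(vocab, pair)
        print("merged", pair, "->", [" ".join(w) for w in vocab])

With this toy corpus the first merge is "h" + "e"; after four merges the frequent word "hello" has collapsed into a single token, while the less frequent "help" is still two pieces.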
Common word
run → 1 token. Frequent enough to earn its own slot.

Rare compound
internationalization → inter + national + ization = 3 tokens.
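To check how a real vocabulary splits these words, you can inspect them with a BPE tokenizer library; the sketch below assumes OpenAI's tiktoken package is installed, and the exact pieces and counts may differ from the illustration above depending on the encoding.

    import tiktoken

    # cl100k_base is the ~100K-token encoding used by GPT-4-era models.
    enc = tiktoken.get_encoding("cl100k_base")

    for word in ["run", "internationalization"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        print(f"{word!r}: {len(ids)} token(s) -> {pieces}")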

GPT-4 uses ~100K token types. Any word can be assembled from subword pieces — no word is truly "unknown" — but rare words cost more tokens, eating context budget and money.
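As a back-of-the-envelope illustration of that budget, the snippet below counts a prompt's tokens and applies a hypothetical price and context window; both numbers are placeholders, so substitute your provider's real figures.

    import tiktoken

    PRICE_PER_1K_TOKENS = 0.01   # hypothetical rate, not a real price quote
    CONTEXT_WINDOW = 128_000     # example window size, model-dependent

    enc = tiktoken.get_encoding("cl100k_base")
    prompt = "Summarize the internationalization requirements for the release."
    n_tokens = len(enc.encode(prompt))

    print(f"{n_tokens} tokens")
    print(f"~${n_tokens / 1000 * PRICE_PER_1K_TOKENS:.5f} per call (input only)")
    print(f"{n_tokens / CONTEXT_WINDOW:.2%} of the context window")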