Tokens Across Languages
Theory
Token efficiency varies sharply by language because BPE vocabularies are built from training corpora that skew heavily toward English, so non-English text is split into more, smaller pieces.
| Language | Tokens vs. English (×) | Tokens per ~100-word passage |
|---|---|---|
| English | 1.0× | ~133 tokens |
| Spanish / Portuguese | 1.1–1.2× | ~150 tokens |
| French | 1.1–1.15× | ~145 tokens |
| Arabic | 1.5–2.0× | ~220 tokens |
| Japanese | 1.8–2.5× | ~260 tokens |
| Chinese (Simplified) | 2.0–2.5× | ~270 tokens |
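The last column simply scales the English baseline: at roughly 1.33 tokens per English word (the usual ~¾-word-per-token rule of thumb), a 100-word passage is about 133 tokens, and each other row multiplies that figure by its ratio in the middle column.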
GPT-family models (tiktoken's cl100k_base encoding) use a vocabulary of roughly 100K token types, most of them English subwords. Individual Chinese and Japanese characters appear far less often in the training corpus, so instead of merging into efficient multi-character subwords they typically map to 1.5–2.5 tokens each, with rarer characters falling back to multiple UTF-8 byte-level tokens.
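To see these ratios concretely, here is a minimal sketch using tiktoken's cl100k_base encoding to count tokens for roughly parallel sentences. The sample sentences (and therefore the exact counts) are illustrative, not a benchmark:

```python
# A minimal sketch, assuming the tiktoken package is installed
# (pip install tiktoken). Sample sentences are illustrative; exact
# counts and ratios will vary with the text you choose.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English":  "The weather is nice today, so we decided to walk to the park.",
    "Spanish":  "El clima está agradable hoy, así que decidimos caminar al parque.",
    "Japanese": "今日は天気がいいので、公園まで歩くことにしました。",
    "Chinese":  "今天天气很好，所以我们决定走路去公园。",
}

baseline = len(enc.encode(samples["English"]))
for lang, text in samples.items():
    n = len(enc.encode(text))
    print(f"{lang:9s} {n:3d} tokens  ({n / baseline:.2f}x vs English)")

# Per-character view for CJK: common characters often get a single token,
# while rarer ones fall back to multiple byte-level tokens.
for ch in "公園経済質問":
    print(ch, enc.encode(ch))
```

On short parallel sentences like these, the Spanish count should land close to the English one while the CJK counts come out noticeably higher, in line with the table above.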