Tokens Across Languages
Theory
Token efficiency varies sharply by language because BPE vocabularies are built from training corpora that skew heavily toward English, so non-English text is split into more, smaller pieces.
| Language | Tokens vs. English (×) | Tokens per ~100-word passage |
|---|---|---|
| English | 1.0× | ~133 tokens |
| Spanish / Portuguese | 1.1–1.2× | ~150 tokens |
| French | 1.1–1.15× | ~145 tokens |
| Arabic | 1.5–2.0× | ~220 tokens |
| Japanese | 1.8–2.5× | ~260 tokens |
| Chinese (Simplified) | 2.0–2.5× | ~270 tokens |
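The last column simply scales the English baseline: at roughly 1.33 tokens per English word (the usual ~¾-word-per-token rule of thumb), a 100-word passage is about 133 tokens, and each other row multiplies that figure by its ratio in the middle column.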
GPT-family models (tiktoken's cl100k_base encoding) use a vocabulary of roughly 100K token types, most of them English subwords. Individual Chinese and Japanese characters appear far less often in the training corpus, so instead of merging into efficient multi-character subwords they typically map to 1.5–2.5 tokens each, with rarer characters falling back to multiple UTF-8 byte-level tokens.
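To see these ratios concretely, here is a minimal sketch using tiktoken's cl100k_base encoding to count tokens for roughly parallel sentences. The sample sentences (and therefore the exact counts) are illustrative, not a benchmark:

```python
# A minimal sketch, assuming the tiktoken package is installed
# (pip install tiktoken). Sample sentences are illustrative; exact
# counts and ratios will vary with the text you choose.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English":  "The weather is nice today, so we decided to walk to the park.",
    "Spanish":  "El clima está agradable hoy, así que decidimos caminar al parque.",
    "Japanese": "今日は天気がいいので、公園まで歩くことにしました。",
    "Chinese":  "今天天气很好，所以我们决定走路去公园。",
}

baseline = len(enc.encode(samples["English"]))
for lang, text in samples.items():
    n = len(enc.encode(text))
    print(f"{lang:9s} {n:3d} tokens  ({n / baseline:.2f}x vs English)")

# Per-character view for CJK: common characters often get a single token,
# while rarer ones fall back to multiple byte-level tokens.
for ch in "公園経済質問":
    print(ch, enc.encode(ch))
```

On short parallel sentences like these, the Spanish count should land close to the English one while the CJK counts come out noticeably higher, in line with the table above.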