# Tokens
## Theory
A token is the basic unit of text a language model processes. Tokenizers split your input before the model sees it: one token per common word, more for rare words, punctuation, numbers, or non-Latin script.
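To build intuition for how splitting works, here is a simplified, word-level sketch. Real tokenizers use subword vocabularies learned from data (e.g. BPE), so this is only an illustration of the idea that punctuation and symbols become separate pieces; the function name is hypothetical.

```python
import re

def naive_tokenize(text: str) -> list[str]:
    # Simplified sketch: split into runs of word characters and single
    # punctuation marks. Real subword tokenizers (BPE and similar) also
    # split rare words into several learned pieces.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("The cat sat."))   # words plus the trailing "."
print(naive_tokenize('{"k":1}'))        # punctuation-heavy JSON fragments into many pieces
```

Note how the short JSON snippet produces more pieces than the three-word sentence, which is the pattern the table below quantifies.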
### Token density by content type
| Content type | Token density | Example |
|---|---|---|
| English prose | ~1.3 / word | "the cat sat" → 3 tokens |
| Code / JSON | ~2 / word | `{"k":1}` → 6 tokens |
| Chinese / Japanese | ~1–2 / char | "你好" → ~3 tokens |
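The ratios in the table can be turned into a rough budgeting heuristic. This is a sketch under the table's own approximations; the function name and category keys are made up, and real counts vary by tokenizer.

```python
# Rough token-count estimates from the density table above. Treat these
# as budgeting heuristics, not exact figures: the true count depends on
# the specific tokenizer.
RATIOS = {
    "english": 1.3,   # tokens per word
    "code": 2.0,      # tokens per word
}

def estimate_tokens(text: str, kind: str = "english") -> int:
    if kind == "cjk":
        # CJK scripts tokenize per character rather than per word;
        # ~1.5 tokens/char is a middle-of-the-road assumption.
        return round(len(text) * 1.5)
    return round(len(text.split()) * RATIOS[kind])

print(estimate_tokens("the cat sat"))
print(estimate_tokens("你好", kind="cjk"))
```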
Tokens drive two things:
- Cost — providers bill per million tokens, counting both input and output.
- Capacity — the context window is a fixed token count, shared by prompt, history, and response.
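Both points above are simple arithmetic. The sketch below makes them concrete; the dollar rates and token counts are hypothetical, not any provider's actual pricing.

```python
def request_cost(tokens_in: int, tokens_out: int,
                 price_in: float, price_out: float) -> float:
    # Prices are per million tokens. Input and output are usually
    # billed at different rates.
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

def fits_context(prompt: int, history: int, response: int, window: int) -> bool:
    # Prompt, conversation history, and the reserved response budget
    # all share one fixed-size context window.
    return prompt + history + response <= window

# 20k tokens in, 1k out, at $3 / $15 per million (hypothetical rates).
print(request_cost(20_000, 1_000, 3.0, 15.0))
print(fits_context(prompt=6_000, history=100_000, response=4_000, window=128_000))
```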
Prompt text → Tokenizer → Token IDs (integers) → Language model → New token IDs
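The pipeline can be sketched end to end with a toy vocabulary. Everything here is a stand-in: a real tokenizer has tens of thousands of subword entries, and the model is a neural network predicting the next token ID rather than a canned reply.

```python
# Toy sketch of the pipeline: text -> token IDs -> model -> new IDs -> text.
VOCAB = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
INV = {i: w for w, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    # Tokenizer: map each piece of text to an integer ID.
    return [VOCAB[w] for w in text.split()]

def fake_model(ids: list[int]) -> list[int]:
    # Stand-in "language model": always continues with "on the mat".
    return [VOCAB["on"], VOCAB["the"], VOCAB["mat"]]

def decode(ids: list[int]) -> str:
    # Map new token IDs back to text for the user.
    return " ".join(INV[i] for i in ids)

prompt_ids = encode("the cat sat")
new_ids = fake_model(prompt_ids)
print(decode(prompt_ids + new_ids))   # "the cat sat on the mat"
```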