# Tokens
## Theory
A token is the basic unit of text a language model processes. Tokenizers split your input before the model sees it: one token per common word, more for rare words, punctuation, numbers, or non-Latin script.
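To build intuition for how splitting works, here is a simplified, word-level sketch. Real tokenizers use subword vocabularies learned from data (e.g. BPE), so this is only an illustration of the idea that punctuation and symbols become separate pieces; the function name is hypothetical.

```python
import re

def naive_tokenize(text: str) -> list[str]:
    # Simplified sketch: split into runs of word characters and single
    # punctuation marks. Real subword tokenizers (BPE and similar) also
    # split rare words into several learned pieces.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("The cat sat."))   # words plus the trailing "."
print(naive_tokenize('{"k":1}'))        # punctuation-heavy JSON fragments into many pieces
```

Note how the short JSON snippet produces more pieces than the three-word sentence, which is the pattern the table below quantifies.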
### Token density by content type
| Content type | Token density | Example |
|---|---|---|
| English prose | ~1.3 / word | "the cat sat" → 3 tokens |
| Code / JSON | ~2 / word | `{"k":1}` → 6 tokens |
| Chinese / Japanese | ~1–2 / char | "你好" → ~3 tokens |
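The ratios in the table can be turned into a rough budgeting heuristic. This is a sketch under the table's own approximations; the function name and category keys are made up, and real counts vary by tokenizer.

```python
# Rough token-count estimates from the density table above. Treat these
# as budgeting heuristics, not exact figures: the true count depends on
# the specific tokenizer.
RATIOS = {
    "english": 1.3,   # tokens per word
    "code": 2.0,      # tokens per word
}

def estimate_tokens(text: str, kind: str = "english") -> int:
    if kind == "cjk":
        # CJK scripts tokenize per character rather than per word;
        # ~1.5 tokens/char is a middle-of-the-road assumption.
        return round(len(text) * 1.5)
    return round(len(text.split()) * RATIOS[kind])

print(estimate_tokens("the cat sat"))
print(estimate_tokens("你好", kind="cjk"))
```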
Tokens drive two things:
- Cost — providers bill per million tokens, counting both input and output.
- Capacity — the context window is a fixed token count, shared by prompt, history, and response.
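Both points above are simple arithmetic. The sketch below makes them concrete; the dollar rates and token counts are hypothetical, not any provider's actual pricing.

```python
def request_cost(tokens_in: int, tokens_out: int,
                 price_in: float, price_out: float) -> float:
    # Prices are per million tokens. Input and output are usually
    # billed at different rates.
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

def fits_context(prompt: int, history: int, response: int, window: int) -> bool:
    # Prompt, conversation history, and the reserved response budget
    # all share one fixed-size context window.
    return prompt + history + response <= window

# 20k tokens in, 1k out, at $3 / $15 per million (hypothetical rates).
print(request_cost(20_000, 1_000, 3.0, 15.0))
print(fits_context(prompt=6_000, history=100_000, response=4_000, window=128_000))
```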
Prompt text → Tokenizer → Token IDs (integers) → Language model → New token IDs
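The pipeline can be sketched end to end with a toy vocabulary. Everything here is a stand-in: a real tokenizer has tens of thousands of subword entries, and the model is a neural network predicting the next token ID rather than a canned reply.

```python
# Toy sketch of the pipeline: text -> token IDs -> model -> new IDs -> text.
VOCAB = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
INV = {i: w for w, i in VOCAB.items()}

def encode(text: str) -> list[int]:
    # Tokenizer: map each piece of text to an integer ID.
    return [VOCAB[w] for w in text.split()]

def fake_model(ids: list[int]) -> list[int]:
    # Stand-in "language model": always continues with "on the mat".
    return [VOCAB["on"], VOCAB["the"], VOCAB["mat"]]

def decode(ids: list[int]) -> str:
    # Map new token IDs back to text for the user.
    return " ".join(INV[i] for i in ids)

prompt_ids = encode("the cat sat")
new_ids = fake_model(prompt_ids)
print(decode(prompt_ids + new_ids))   # "the cat sat on the mat"
```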