Sampling Basics
Theory
After the forward pass, softmax turns logits into a probability distribution over the vocabulary. Sampling is what picks a single token from that distribution.
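As a minimal sketch of that step in Python with NumPy: the logits below are hypothetical values, chosen so the resulting probabilities roughly match the numbers used in the examples that follow.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Hypothetical logits for a tiny 4-token vocabulary; softmax of
# these is roughly [0.82, 0.09, 0.06, 0.03].
logits = np.array([4.0, 1.8, 1.4, 0.7])
probs = softmax(logits)
token_id = np.random.choice(len(probs), p=probs)  # draw one token id
```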
Greedy decoding
Always picks the top token.
"Paris" (p=0.82) → always
Deterministic and coherent, but prone to loops. "The cat sat on the mat. The cat sat on the mat..."
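As a minimal illustration, reusing the hypothetical logits from the sketch above, greedy decoding is just an argmax:

```python
import numpy as np

def greedy_pick(logits):
    # Greedy decoding: always return the highest-logit token id.
    return int(np.argmax(logits))

logits = np.array([4.0, 1.8, 1.4, 0.7])
greedy_pick(logits)  # always 0 ("Paris"), on every call
```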
Stochastic sampling
Draws proportionally from the distribution.
"Paris" 82% / "Lyon" 9% / "Europe" 6%
More diverse and natural-sounding. Temperature reshapes these probabilities before the draw (see the sketch after the table below).
Token        Probability
Paris        0.82
Lyon         0.09
Europe       0.06
delicious    0.03
… (tail)     thousands of tokens, each ≈ 0
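Here is a sketch of temperature-scaled sampling over those hypothetical logits: dividing the logits by a temperature T before the softmax sharpens the distribution when T < 1 and flattens it when T > 1.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    # T < 1 sharpens the distribution toward the top token;
    # T > 1 flattens it, giving tail tokens more chance.
    z = logits / temperature
    z = np.exp(z - np.max(z))
    probs = z / z.sum()
    return np.random.choice(len(probs), p=probs)

logits = np.array([4.0, 1.8, 1.4, 0.7])
sample_with_temperature(logits, 0.7)  # almost always "Paris"
sample_with_temperature(logits, 1.5)  # alternatives appear more often
```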
The tail is the problem: thousands of near-zero entries still accumulate mass. Sampling the full distribution lets junk tokens slip through. Top-k and top-p solve this by cutting the vocabulary to plausible candidates before drawing.
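As a sketch of both filters over the distribution above (the k and p values here are illustrative, not recommendations):

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep only the k most probable tokens and renormalize.
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    # Nucleus sampling: keep the smallest prefix of tokens (by
    # descending probability) whose cumulative mass reaches p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.82, 0.09, 0.06, 0.03])
top_k_filter(probs, k=3)     # drops "delicious"
top_p_filter(probs, p=0.95)  # keeps Paris+Lyon+Europe (0.97 cumulative)
```

Either filter zeroes out the tail before the draw, so junk tokens can no longer slip through.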