Sampling Basics
Theory
After the forward pass, softmax turns logits into a probability distribution over the vocabulary. Sampling is what picks a single token from that distribution.
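As a minimal sketch of that step in Python with NumPy: the logits below are hypothetical values, chosen so the resulting probabilities roughly match the numbers used in the examples that follow.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Hypothetical logits for a tiny 4-token vocabulary; softmax of
# these is roughly [0.82, 0.09, 0.06, 0.03].
logits = np.array([4.0, 1.8, 1.4, 0.7])
probs = softmax(logits)
token_id = np.random.choice(len(probs), p=probs)  # draw one token id
```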
Greedy decoding
Always picks the top token.
"Paris" (p=0.82) → always
Deterministic and coherent, but prone to loops. "The cat sat on the mat. The cat sat on the mat..."
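As a minimal illustration, reusing the hypothetical logits from the sketch above, greedy decoding is just an argmax:

```python
import numpy as np

def greedy_pick(logits):
    # Greedy decoding: always return the highest-logit token id.
    return int(np.argmax(logits))

logits = np.array([4.0, 1.8, 1.4, 0.7])
greedy_pick(logits)  # always 0 ("Paris"), on every call
```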
Stochastic sampling
Draws proportionally from the distribution.
"Paris" 82% / "Lyon" 9% / "Europe" 6%
More diverse and natural-sounding. Temperature reshapes these probabilities before the draw (see the sketch after the table below).
Token        Probability
Paris        0.82
Lyon         0.09
Europe       0.06
delicious    0.03
… (tail)     thousands of tokens, each ≈ 0
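Here is a sketch of temperature-scaled sampling over those hypothetical logits: dividing the logits by a temperature T before the softmax sharpens the distribution when T < 1 and flattens it when T > 1.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    # T < 1 sharpens the distribution toward the top token;
    # T > 1 flattens it, giving tail tokens more chance.
    z = logits / temperature
    z = np.exp(z - np.max(z))
    probs = z / z.sum()
    return np.random.choice(len(probs), p=probs)

logits = np.array([4.0, 1.8, 1.4, 0.7])
sample_with_temperature(logits, 0.7)  # almost always "Paris"
sample_with_temperature(logits, 1.5)  # alternatives appear more often
```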
The tail is the problem: thousands of near-zero entries still accumulate mass. Sampling the full distribution lets junk tokens slip through. Top-k and top-p solve this by cutting the vocabulary to plausible candidates before drawing.
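As a sketch of both filters over the distribution above (the k and p values here are illustrative, not recommendations):

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep only the k most probable tokens and renormalize.
    keep = np.argsort(probs)[::-1][:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    # Nucleus sampling: keep the smallest prefix of tokens (by
    # descending probability) whose cumulative mass reaches p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    out = np.zeros_like(probs)
    out[order[:cutoff]] = probs[order[:cutoff]]
    return out / out.sum()

probs = np.array([0.82, 0.09, 0.06, 0.03])
top_k_filter(probs, k=3)     # drops "delicious"
top_p_filter(probs, p=0.95)  # keeps Paris+Lyon+Europe (0.97 cumulative)
```

Either filter zeroes out the tail before the draw, so junk tokens can no longer slip through.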