Sampling Basics

Act 2 · ~4 min

Theory

After the forward pass, softmax turns logits into a probability distribution over the vocabulary. Sampling is what picks a single token from that distribution.
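To make that concrete, here is a minimal NumPy sketch of the softmax step over a toy four-token vocabulary. The logit values are invented for illustration, chosen so the resulting probabilities roughly match the ones used in this section.

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability, then normalize.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

# Hypothetical logits for a tiny vocabulary (values invented for illustration).
vocab = ["Paris", "Lyon", "Europe", "delicious"]
logits = np.array([4.0, 1.8, 1.4, 0.7])

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.2f}")  # ≈ Paris 0.82, Lyon 0.09, Europe 0.06, delicious 0.03
```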

Greedy decoding

Always picks the top token.

"Paris" (p=0.82) → always

Deterministic and coherent, but prone to loops. "The cat sat on the mat. The cat sat on the mat..."
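Greedy decoding is just an argmax over the distribution. A minimal sketch, reusing the toy probabilities from above:

```python
import numpy as np

# Greedy decoding: ignore the distribution's shape and take the argmax.
vocab = ["Paris", "Lyon", "Europe", "delicious"]
probs = np.array([0.82, 0.09, 0.06, 0.03])

print(vocab[int(np.argmax(probs))])  # "Paris", deterministically, every time
```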

Stochastic sampling

Draws proportionally from the distribution.

"Paris" 82% / "Lyon" 9% / "Europe" 6%

Diverse and natural. Temperature reshapes these probabilities first.
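A minimal sketch of a stochastic sampler, including the temperature rescaling mentioned above. The helper and its logit values are illustrative, not a specific library's API.

```python
import numpy as np

def sample(logits, temperature=1.0, rng=None):
    # Temperature rescales logits before softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it, T = 1 leaves it unchanged.
    rng = rng or np.random.default_rng()
    z = logits / temperature
    z -= z.max()  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    # Draw one token index proportionally to its probability.
    return rng.choice(len(probs), p=probs)

vocab = ["Paris", "Lyon", "Europe", "delicious"]
logits = np.array([4.0, 1.8, 1.4, 0.7])

# At T=1 this returns "Paris" about 82% of the time; higher T spreads the draws.
print(vocab[sample(logits, temperature=1.0)])
```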

Paris 0.82
Lyon 0.09
Europe 0.06
delicious 0.03
long tail: thousands of tokens ≈ 0
Vocabulary distribution after softmax: the long tail leaks junk if sampled directly.

The tail is the problem: thousands of near-zero entries still accumulate mass. Sampling the full distribution lets junk tokens slip through. Top-k and top-p solve this by cutting the vocabulary to plausible candidates before drawing.
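As a preview, here is one way that truncation can be sketched in NumPy. The function names and cutoff values are illustrative, not a fixed API.

```python
import numpy as np

def top_k_filter(probs, k=50):
    # Keep only the k most likely tokens, zero out the rest, renormalize.
    out = np.zeros_like(probs)
    keep = np.argsort(probs)[-k:]
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p=0.9):
    # Keep the smallest set of tokens whose cumulative mass reaches p.
    order = np.argsort(probs)[::-1]          # indices, most likely first
    csum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(csum, p)) + 1  # include the token that crosses p
    out = np.zeros_like(probs)
    keep = order[:cutoff]
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.82, 0.09, 0.06, 0.03])
print(top_p_filter(probs, p=0.9))  # only "Paris" and "Lyon" survive, renormalized
```

Either filter runs before the draw, so the sampler from the previous sketch only ever sees the surviving candidates; the junk in the tail gets exactly zero probability.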