Top-P & Top-K Sampling
Theory
Two standard ways to clip the low-probability tail of the next-token distribution: top-k keeps a fixed-size shortlist; top-p (nucleus) adapts the shortlist to each step.
| Property | Top-k | Top-p (nucleus) |
|---|---|---|
| Candidate count | Fixed: always k tokens | Variable: smallest set whose cumulative probability reaches p |
| When model is confident | May keep weak tokens | Shrinks to strong ones |
| When model is uncertain | May cut valid options | Expands to cover the spread |
| Typical default | 40–50 | 0.9–0.95 |
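The contrast in the table can be checked on two toy distributions. This is a minimal sketch, not any library's implementation: the helper names `top_k_indices` and `top_p_indices`, the two probability lists, and the settings k = 3 and p = 0.9 are all illustrative assumptions.

```python
def top_k_indices(probs, k):
    """Return the indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def top_p_indices(probs, p):
    """Return the shortest sorted prefix whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept


# Hypothetical distributions: one peaked (confident), one spread out (uncertain).
confident = [0.92, 0.04, 0.02, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.20, 0.15, 0.15]

counts = {}
for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    counts[name] = (len(top_k_indices(probs, 3)), len(top_p_indices(probs, 0.9)))
    print(name, "top-k keeps", counts[name][0], "| top-p keeps", counts[name][1])
```

With these numbers top-k keeps three tokens in both cases, while top-p keeps one token for the peaked distribution and all five for the spread-out one.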
Step 1 — Temperature first
Temperature rescales raw logits.
logits / T → softmax → distribution
Lower T sharpens the peak; higher T flattens it.
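The effect of T can be seen in a small sketch of the `logits / T → softmax` pipeline. The function name `softmax_with_temperature` and the example logits are assumptions chosen for illustration; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical raw scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)  # T < 1: more mass on the top token
base = softmax_with_temperature(logits, 1.0)   # T = 1: plain softmax
flat = softmax_with_temperature(logits, 2.0)   # T > 1: closer to uniform

for label, dist in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(label, [round(q, 3) for q in dist])
```

The top token's probability is highest at T = 0.5 and lowest at T = 2.0, while the ranking of the tokens stays the same.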
Step 2 — Top-p second
Top-p (nucleus) sampling then selects the truncated candidate set.
sort → cumsum until ≥ p → renormalize → sample
Top-p sees the temperature-reshaped distribution, so the order is fixed: temperature first, then top-p.
Paris 0.82 · cum 0.82
Lyon 0.09 · cum 0.91
Europe 0.06 · excluded
…tail · excluded
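The `sort → cumsum until ≥ p → renormalize → sample` pipeline can be sketched on the Paris example. Several things here are assumptions rather than part of the original: the threshold p = 0.9 (consistent with the cutoff after Lyon), the single `<tail>` entry that lumps the remaining 0.03 of probability mass into one placeholder token, the helper name `top_p_filter`, and the fixed random seed.

```python
import random


def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cum += prob
        if cum >= p:  # the token that crosses the threshold is included
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# "<tail>" is a hypothetical stand-in for all remaining low-probability tokens.
probs = {"Paris": 0.82, "Lyon": 0.09, "Europe": 0.06, "<tail>": 0.03}

nucleus = top_p_filter(probs, p=0.9)  # p = 0.9 is an assumed setting
rng = random.Random(0)                # seeded only to make the run repeatable
token = rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

print(nucleus)
print("sampled:", token)
```

Under these assumptions the nucleus contains only Paris and Lyon, renormalized to roughly 0.90 and 0.10, and Europe and the tail can never be sampled.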