# Preference Alignment (DPO)

## Theory
Preference alignment teaches a model not just to follow instructions, but to do so well — shaped by human judgments of quality, safety, and tone.
### DPO vs. RLHF
| Property | RLHF | DPO |
|---|---|---|
| Reward model | Separate model, trained first | None; the policy itself acts as an implicit reward model |
| Algorithm | PPO (online RL) | Direct gradient descent on a classification-style loss |
| Stability | Can be unstable | Stable, with fewer hyperparameters to tune |
| Data sensitivity | Moderate | High; noisy preference pairs hurt |
Dataset format: each sample is a triplet of prompt, chosen (preferred) response, and rejected (dispreferred) response. The rejected response carries as much signal as the chosen one; it defines the behavior to move away from.
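Below is a minimal sketch of one sample in this triplet format. The field names follow the common prompt/chosen/rejected convention used by libraries such as TRL; the texts themselves are invented for illustration:

```python
# One preference-alignment sample: a prompt plus a ranked pair of responses.
# Field names follow the common prompt/chosen/rejected convention (e.g., TRL).
sample = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "chosen": (
        "Plants are like tiny solar-powered kitchens: they take in sunlight, "
        "water, and air, and cook up their own sugary food."
    ),
    "rejected": "Photosynthesis is the conversion of CO2 and H2O into C6H12O6.",
}

# A training set is simply a list of such triplets, often stored as JSONL.
dataset = [sample]
```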
The beta hyperparameter controls the strength of KL regularization: how far the aligned model may drift from the SFT reference. Higher beta keeps the policy closer to the reference, giving more conservative alignment.
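For reference, this is the DPO objective from Rafailov et al. (2023); beta scales the log-ratio margin between the policy $\pi_\theta$ and the frozen SFT reference $\pi_{\text{ref}}$:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ is the chosen response, $y_l$ the rejected one, and $\sigma$ the logistic sigmoid; the log-ratios against $\pi_{\text{ref}}$ are what implement the KL constraint.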
Pipeline (figure): SFT Model (instruction-following) → Preference Data (prompt, chosen, rejected) → DPO Training (log-ratio loss) → Aligned Model (preferred outputs).
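To make the log-ratio loss concrete, here is a minimal PyTorch sketch of the batch DPO loss. The function and argument names are illustrative, not from the text; each argument is assumed to be a tensor of summed per-token log-probabilities over a whole response:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(chosen | prompt), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(rejected | prompt), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(chosen | prompt), frozen SFT reference
    ref_rejected_logps: torch.Tensor,     # log pi_ref(rejected | prompt)
    beta: float = 0.1,                    # KL-regularization strength
) -> torch.Tensor:
    # Log-ratios of the policy against the frozen reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid of the beta-scaled margin: minimized when the policy
    # ranks chosen above rejected more strongly than the reference does
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    return losses.mean()
```

Because the reference log-probabilities come from the frozen SFT model, training is a plain gradient step on this loss, with no separate reward model and no PPO rollout, matching the table above.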
When to use DPO: reducing harmful outputs, aligning style, and tuning tone. Prefer RLHF when preferences form complex, multi-dimensional hierarchies that a single pairwise signal cannot capture.