# Preference Alignment (DPO)

## Theory
Preference alignment teaches a model not just to follow instructions, but to do so well — shaped by human judgments of quality, safety, and tone.
### DPO vs. RLHF
| Property | RLHF | DPO |
|---|---|---|
| Reward model | Separate model, trained first | None; the policy itself acts as an implicit reward model |
| Algorithm | PPO (online RL) | Direct gradient descent on a classification-style loss |
| Stability | Can be unstable | Stable, with fewer hyperparameters to tune |
| Data sensitivity | Moderate | High; noisy preference pairs hurt |
Dataset format: each sample is a triplet of prompt, chosen (preferred) response, and rejected (dispreferred) response. The rejected response carries as much signal as the chosen one; it defines the behavior to move away from.
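Below is a minimal sketch of one sample in this triplet format. The field names follow the common prompt/chosen/rejected convention used by libraries such as TRL; the texts themselves are invented for illustration:

```python
# One preference-alignment sample: a prompt plus a ranked pair of responses.
# Field names follow the common prompt/chosen/rejected convention (e.g., TRL).
sample = {
    "prompt": "Explain photosynthesis to a ten-year-old.",
    "chosen": (
        "Plants are like tiny solar-powered kitchens: they take in sunlight, "
        "water, and air, and cook up their own sugary food."
    ),
    "rejected": "Photosynthesis is the conversion of CO2 and H2O into C6H12O6.",
}

# A training set is simply a list of such triplets, often stored as JSONL.
dataset = [sample]
```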
The beta hyperparameter controls the strength of KL regularization: how far the aligned model may drift from the SFT reference. Higher beta keeps the policy closer to the reference, giving more conservative alignment.
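For reference, this is the DPO objective from Rafailov et al. (2023); beta scales the log-ratio margin between the policy $\pi_\theta$ and the frozen SFT reference $\pi_{\text{ref}}$:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Here $y_w$ is the chosen response, $y_l$ the rejected one, and $\sigma$ the logistic sigmoid; the log-ratios against $\pi_{\text{ref}}$ are what implement the KL constraint.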
Pipeline (figure): SFT Model (instruction-following) → Preference Data (prompt, chosen, rejected) → DPO Training (log-ratio loss) → Aligned Model (preferred outputs).
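To make the log-ratio loss concrete, here is a minimal PyTorch sketch of the batch DPO loss. The function and argument names are illustrative, not from the text; each argument is assumed to be a tensor of summed per-token log-probabilities over a whole response:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(chosen | prompt), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(rejected | prompt), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(chosen | prompt), frozen SFT reference
    ref_rejected_logps: torch.Tensor,     # log pi_ref(rejected | prompt)
    beta: float = 0.1,                    # KL-regularization strength
) -> torch.Tensor:
    # Log-ratios of the policy against the frozen reference for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid of the beta-scaled margin: minimized when the policy
    # ranks chosen above rejected more strongly than the reference does
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    return losses.mean()
```

Because the reference log-probabilities come from the frozen SFT model, training is a plain gradient step on this loss, with no separate reward model and no PPO rollout, matching the table above.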
When to use DPO: reducing harmful outputs, aligning style, and tuning tone. Prefer RLHF when preferences form complex, multi-dimensional hierarchies that a single pairwise signal cannot capture.