# Supervised Fine-Tuning

## Theory
Supervised fine-tuning picks up where pretraining leaves off. A pretrained model predicts the next token over raw text — it has no concept of "task" or "correct answer." SFT introduces that structure by continuing training on labelled (instruction, response) pairs.
| Stage | Data | Objective |
|---|---|---|
| Pretraining | Raw text, billions of tokens | Next-token prediction everywhere |
| SFT | (instruction, response) pairs | Next-token on response tokens only |
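For concreteness, here is one labelled pair as it might appear in a raw dataset before any chat template is applied; the field names are illustrative, not a fixed schema:

```python
# A single SFT training sample: an instruction and its target response.
# Field names ("instruction", "response") are illustrative, not a standard.
sample = {
    "instruction": "Summarise the following paragraph in one sentence.",
    "response": "Loss masking restricts the SFT gradient to response tokens.",
}
```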
**Dataset format.** Each sample uses a chat template with three roles: system (persona), user (instruction), assistant (target response). Templates differ across model families, so always use `tokenizer.apply_chat_template` rather than hand-rolling separators.
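A minimal sketch of template application, assuming a Hugging Face tokenizer whose checkpoint ships a chat template; the checkpoint name and message contents are illustrative:

```python
from transformers import AutoTokenizer

# Any chat-tuned checkpoint with a bundled template works here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise SQL assistant."},
    {"role": "user", "content": "Count orders per customer."},
    {"role": "assistant", "content": "SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;"},
]

# The tokenizer inserts this model family's own special tokens and separators.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```

Because the template supplies each family's special tokens, the same `messages` list ports across model families unchanged.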
**Loss masking.** Cross-entropy loss is computed only on the assistant response tokens; instruction tokens are masked out and contribute zero gradient, so the model is not penalised for failing to predict text it received as input.
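A minimal sketch of the masking step in plain PyTorch, assuming the prompt length is already known from tokenizing the instruction separately; -100 is the ignore index that PyTorch's cross-entropy (and Hugging Face model losses) skip by default:

```python
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Build SFT labels: copy the inputs, then blank out the instruction span."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX  # zero gradient from instruction tokens
    return labels

# Toy sequence: 4 instruction tokens followed by 3 response tokens.
input_ids = torch.tensor([101, 7592, 2088, 102, 345, 678, 910])
print(mask_prompt_tokens(input_ids, prompt_len=4))
# tensor([-100, -100, -100, -100,  345,  678,  910])
```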
**When to SFT.** Enforcing a consistent output format, teaching domain vocabulary the base model lacks, transferring a style or persona. **When not to.** Injecting facts that update frequently; use RAG for that.
**What comes next.** LoRA cuts the trainable parameter count to roughly 1–5% of the model; DPO aligns outputs with human preferences without training a separate reward model.
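As a preview, a minimal LoRA sketch using the peft library; the checkpoint, rank, and target modules are illustrative assumptions, not a recommendation:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Rank-16 adapters on the attention query/value projections only.
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                    lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports a small fraction of params as trainable
```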