Pre-Training
Theory
Pretraining is the first and most expensive stage in building a foundation model. The model is trained with a simple self-supervised objective: given a sequence of tokens, predict the next one. Run at sufficient scale, this single task teaches language, world knowledge, and reasoning.
Training pipeline: raw text corpus → pretraining (next-token prediction) → base model → SFT (instruction pairs) → RLHF / DPO (preferences) → chat model.
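To make the pretraining objective concrete, here is a minimal PyTorch sketch of one next-token prediction step. A toy embedding-plus-linear model stands in for a real transformer; all sizes are illustrative assumptions. The shift between inputs and targets is the whole trick.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen for illustration only.
vocab_size, d_model, seq_len, batch = 100, 32, 16, 4

# Stand-in for a transformer: embed tokens, project back to vocabulary logits.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for real text

# Shift by one position: the model sees tokens[:, :-1] and must predict tokens[:, 1:].
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)  # (batch, seq_len - 1, vocab_size)

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),  # flatten positions into one big batch
    targets.reshape(-1),
)
loss.backward()  # self-supervised: no labels beyond the text itself
print(f"next-token loss: {loss.item():.3f}")
```

Everything downstream (SFT, RLHF/DPO) reuses this machinery; only the data and the reward signal change.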
Scaling laws (Chinchilla, 2022): capability scales predictably with model parameters, training tokens, and compute. Compute-optimal runs scale parameters and tokens together, roughly 20 training tokens per parameter; a large model trained on too few tokens wastes compute.
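A back-of-envelope sketch of what that balance means, assuming the common C ≈ 6·N·D FLOPs approximation and the Chinchilla rule of thumb of roughly 20 tokens per parameter (both are approximations, not exact constants):

```python
def compute_optimal(budget_flops: float) -> tuple[float, float]:
    """Split a FLOP budget C between parameters N and tokens D, with D = 20N.

    Assumes C ≈ 6 * N * D, so C = 120 * N^2 and N = sqrt(C / 120).
    """
    n_params = (budget_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: roughly the Chinchilla-70B budget, ~5.8e23 FLOPs.
n, d = compute_optimal(5.8e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")
# An undertrained large model spends the same budget on a bigger N
# with far fewer than 20N tokens, and ends at a higher loss.
```

Running this recovers roughly 70B parameters and 1.4T tokens, the actual Chinchilla configuration.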
Open weights
Examples: Llama, Mistral. You self-host the model and can fine-tune it on private data, giving full control at the cost of running the infrastructure yourself.
Closed weights
Examples: GPT-4, Claude. A managed API gets you started faster, but you get no access to the weights and no customization below the API layer.
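As a sketch of the open-weights path, the snippet below loads a checkpoint locally with the Hugging Face transformers library; the model name is just an illustrative example, and any open-weights checkpoint works the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; substitute any open-weights model you have access to.
name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)  # weights land on your hardware

inputs = tok("Pretraining teaches a model to", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))

# Because the weights are local, you can fine-tune them on private data.
# A closed-weights API exposes only generation endpoints, never the weights.
```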