Pre-Training
Theory
Pretraining is the first and most expensive stage in building a foundation model. The model is trained with a simple self-supervised objective: given a sequence of tokens, predict the next one. Run at sufficient scale, this single task teaches language, world knowledge, and reasoning.
Training pipeline: raw text corpus → pretraining (next-token prediction) → base model → SFT (instruction pairs) → RLHF / DPO (preferences) → chat model.
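To make the pretraining objective concrete, here is a minimal PyTorch sketch of one next-token prediction step. A toy embedding-plus-linear model stands in for a real transformer; all sizes are illustrative assumptions. The shift between inputs and targets is the whole trick.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen for illustration only.
vocab_size, d_model, seq_len, batch = 100, 32, 16, 4

# Stand-in for a transformer: embed tokens, project back to vocabulary logits.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in for real text

# Shift by one position: the model sees tokens[:, :-1] and must predict tokens[:, 1:].
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)  # (batch, seq_len - 1, vocab_size)

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),  # flatten positions into one big batch
    targets.reshape(-1),
)
loss.backward()  # self-supervised: no labels beyond the text itself
print(f"next-token loss: {loss.item():.3f}")
```

Everything downstream (SFT, RLHF/DPO) reuses this machinery; only the data and the reward signal change.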
Scaling laws (Chinchilla, 2022): capability scales predictably with model parameters, training tokens, and compute. Compute-optimal runs scale parameters and tokens together, roughly 20 training tokens per parameter; a large model trained on too few tokens wastes compute.
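A back-of-envelope sketch of what that balance means, assuming the common C ≈ 6·N·D FLOPs approximation and the Chinchilla rule of thumb of roughly 20 tokens per parameter (both are approximations, not exact constants):

```python
def compute_optimal(budget_flops: float) -> tuple[float, float]:
    """Split a FLOP budget C between parameters N and tokens D, with D = 20N.

    Assumes C ≈ 6 * N * D, so C = 120 * N^2 and N = sqrt(C / 120).
    """
    n_params = (budget_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: roughly the Chinchilla-70B budget, ~5.8e23 FLOPs.
n, d = compute_optimal(5.8e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")
# An undertrained large model spends the same budget on a bigger N
# with far fewer than 20N tokens, and ends at a higher loss.
```

Running this recovers roughly 70B parameters and 1.4T tokens, the actual Chinchilla configuration.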
Open weights
Examples: Llama, Mistral. You self-host the model and can fine-tune it on private data, giving full control at the cost of running the infrastructure yourself.
Closed weights
Examples: GPT-4, Claude. A managed API gets you started faster, but you get no access to the weights and no customization below the API layer.
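As a sketch of the open-weights path, the snippet below loads a checkpoint locally with the Hugging Face transformers library; the model name is just an illustrative example, and any open-weights checkpoint works the same way.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; substitute any open-weights model you have access to.
name = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)  # weights land on your hardware

inputs = tok("Pretraining teaches a model to", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))

# Because the weights are local, you can fine-tune them on private data.
# A closed-weights API exposes only generation endpoints, never the weights.
```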