# Quantization

## Theory
Quantization reduces the numerical precision of model weights to lower memory usage and speed up matrix multiplications.
| Precision | Bits | VRAM vs FP16 | Quality Loss | Common Tool / Use |
|---|---|---|---|---|
| FP32 | 32 | +100% | None (baseline) | Training only |
| FP16 / BF16 | 16 | Baseline | None | transformers default |
| INT8 | 8 | −50% | Minimal (under 1%) | bitsandbytes |
| GPTQ | 4 | −75% | Small | AutoGPTQ |
| AWQ | 4 | −75% | Very small | AutoAWQ |
| GGUF Q4 | 4 | −75% | Small | llama.cpp |
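To make the mechanics concrete, here is a minimal sketch of round-to-nearest symmetric INT8 quantization in NumPy. The function names are illustrative; real tools (bitsandbytes, GPTQ, AWQ) layer per-channel scales, outlier handling, and error-compensating weight updates on top of this basic idea.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an FP32 approximation; rounding error per weight is at most ~scale/2.
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```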
Two strategies:
- PTQ (post-training quantization) – quantize an already-trained model, typically with a small calibration dataset. Fast, no training cluster needed. Standard for GPTQ and AWQ; see the loading sketch after this list.
- QAT (quantization-aware training) – simulate low precision during training so the model adapts to it. Higher INT4 quality, but costs a full training run.
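As a PTQ example, the transformers + bitsandbytes path loads an FP16 checkpoint and quantizes its weights to INT8 on the fly; unlike GPTQ/AWQ, no separate calibration pass is run. The checkpoint name below is only an example.

```python
# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes INT8: weights are quantized at load time.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # example checkpoint; substitute your own
    quantization_config=bnb_config,
    device_map="auto",            # place layers across available GPUs/CPU
)
```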
Weight VRAM relative to FP32:
- FP32 – baseline (training)
- FP16 – −50% (serving default)
- INT8 – −75% (minimal loss)
- INT4 – −87% (small quality drop)
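These percentages follow directly from the bit widths. A quick check for an assumed 7B-parameter model (weights only; KV cache, activations, and runtime overhead are extra):

```python
params = 7e9  # assumed 7B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB (-50%), INT8: 7.0 GB (-75%), INT4: 3.5 GB (-87.5%)
```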
Format by deployment:
- GGUF – CPU/edge inference via llama.cpp; partial GPU offload is supported
- GPTQ – GPU batch serving; slightly lower quality than AWQ
- AWQ – GPU production serving; activation-aware weight quantization, recommended for vLLM (see the sketch after this list)
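A minimal vLLM serving sketch for an already-quantized AWQ checkpoint. The model name is an example; the `quantization` argument can usually be omitted, since vLLM detects the format from the model config.

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Load a checkpoint that was quantized ahead of time with AutoAWQ.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")  # example checkpoint

outputs = llm.generate(
    ["Explain weight quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```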
Quantized weights free VRAM → room for a larger KV cache (longer contexts, more concurrent requests).
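A back-of-the-envelope sketch of that trade, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head_dim 128) and an FP16 cache; GQA models have fewer KV heads, so their per-token cost is lower.

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
layers, kv_heads, head_dim, elem_bytes = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * elem_bytes
print(f"KV cache per token: {kv_per_token / 2**20:.2f} MiB")   # 0.50 MiB

freed = (14.0 - 3.5) * 1e9   # bytes saved going from FP16 to INT4 weights (7B model)
print(f"extra cacheable tokens: {freed / kv_per_token:,.0f}")  # ~20,000
```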