# Quantization

## Theory
Quantization reduces the numerical precision of model weights to lower memory usage and speed up matrix multiplications.
| Precision | Bits | VRAM vs FP16 | Quality Loss | Common Tool / Use |
|---|---|---|---|---|
| FP32 | 32 | +100% | None (baseline) | Training only |
| FP16 / BF16 | 16 | Baseline | None | transformers default |
| INT8 | 8 | −50% | Minimal (under 1%) | bitsandbytes |
| GPTQ | 4 | −75% | Small | AutoGPTQ |
| AWQ | 4 | −75% | Very small | AutoAWQ |
| GGUF Q4 | 4 | −75% | Small | llama.cpp |
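To make the mechanics concrete, here is a minimal sketch of round-to-nearest symmetric INT8 quantization in NumPy. The function names are illustrative; real tools (bitsandbytes, GPTQ, AWQ) layer per-channel scales, outlier handling, and error-compensating weight updates on top of this basic idea.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    # Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127].
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover an FP32 approximation; rounding error per weight is at most ~scale/2.
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.abs(w - w_hat).max())
```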
Two strategies:
- PTQ (post-training quantization) – quantize an already-trained model, typically with a small calibration dataset. Fast, no training cluster needed. Standard for GPTQ and AWQ; see the loading sketch after this list.
- QAT (quantization-aware training) – simulate low precision during training so the model adapts to it. Higher INT4 quality, but costs a full training run.
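As a PTQ example, the transformers + bitsandbytes path loads an FP16 checkpoint and quantizes its weights to INT8 on the fly; unlike GPTQ/AWQ, no separate calibration pass is run. The checkpoint name below is only an example.

```python
# pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes INT8: weights are quantized at load time.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # example checkpoint; substitute your own
    quantization_config=bnb_config,
    device_map="auto",            # place layers across available GPUs/CPU
)
```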
Weight VRAM relative to FP32:
- FP32 – baseline (training)
- FP16 – −50% (serving default)
- INT8 – −75% (minimal loss)
- INT4 – −87% (small quality drop)
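These percentages follow directly from the bit widths. A quick check for an assumed 7B-parameter model (weights only; KV cache, activations, and runtime overhead are extra):

```python
params = 7e9  # assumed 7B-parameter model
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB (-50%), INT8: 7.0 GB (-75%), INT4: 3.5 GB (-87.5%)
```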
Format by deployment:
- GGUF – CPU/edge inference via llama.cpp; partial GPU offload is supported
- GPTQ – GPU batch serving; slightly lower quality than AWQ
- AWQ – GPU production serving; activation-aware weight quantization, recommended for vLLM (see the sketch after this list)
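A minimal vLLM serving sketch for an already-quantized AWQ checkpoint. The model name is an example; the `quantization` argument can usually be omitted, since vLLM detects the format from the model config.

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Load a checkpoint that was quantized ahead of time with AutoAWQ.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")  # example checkpoint

outputs = llm.generate(
    ["Explain weight quantization in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```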
Quantized weights free VRAM → room for a larger KV cache (longer contexts, more concurrent requests).
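A back-of-the-envelope sketch of that trade, assuming Llama-2-7B-like dimensions (32 layers, 32 KV heads, head_dim 128) and an FP16 cache; GQA models have fewer KV heads, so their per-token cost is lower.

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
layers, kv_heads, head_dim, elem_bytes = 32, 32, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * elem_bytes
print(f"KV cache per token: {kv_per_token / 2**20:.2f} MiB")   # 0.50 MiB

freed = (14.0 - 3.5) * 1e9   # bytes saved going from FP16 to INT4 weights (7B model)
print(f"extra cacheable tokens: {freed / kv_per_token:,.0f}")  # ~20,000
```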