LoRA & PEFT
Theory
Full fine-tuning trains every parameter. For a 7B model that means holding weights, gradients, and optimizer state for all 7B parameters — with a mixed-precision Adam setup that is roughly 16 bytes per parameter, so well over 80 GB of GPU memory. LoRA (Low-Rank Adaptation), the most widely used PEFT (parameter-efficient fine-tuning) method, sidesteps this by decomposing the weight update mathematically.
Instead of learning a full ΔW matrix, LoRA learns two small matrices:
W_adapted = W_frozen + A · B
W_frozen ∈ R^(d×k) — base model, never updated
A ∈ R^(d×r) — trainable
B ∈ R^(r×k) — trainable
r << d, k — rank is the bottleneck
Only A and B are trained. At inference the adapter can be merged into the frozen weight (W_frozen + A · B becomes a single matrix), so there is zero latency overhead.
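To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer using the A · B notation above. The class name, init scale, and the alpha/r scaling convention are illustrative assumptions, not the internals of any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen nn.Linear with a trainable low-rank update A · B (names follow the text above).
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # W_frozen: never updated
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)   # d x r, trainable
        self.B = nn.Parameter(torch.zeros(r, k))          # r x k, trainable; zero init so A·B starts at 0
        self.scale = alpha / r                            # common scaling convention (assumed here)

    def forward(self, x):
        # frozen path plus low-rank path: (W_frozen + scale * A·B) x
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)

    @torch.no_grad()
    def merge(self):
        # fold the adapter into the frozen weight for zero-overhead inference
        self.base.weight += self.scale * (self.A @ self.B)
```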
Full fine-tuning: updates all ~7B parameters; requires 80GB+ VRAM; best theoretical quality.
LoRA: updates ~0.8% of parameters; runs on a 24GB GPU; comparable quality in practice.
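In practice the Hugging Face peft library handles the wrapping. A minimal sketch, assuming a Llama-style 7B checkpoint and q_proj/v_proj as the adapted projections (both are assumptions, not requirements):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Model id and target module names are assumptions for a Llama-style 7B checkpoint.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype="auto")

config = LoraConfig(
    r=16,                                 # rank of the A·B decomposition
    lora_alpha=32,                        # typically 2x the rank
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()        # reports trainable vs. total parameter counts
```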
QLoRA adds one step: quantize the frozen base to 4-bit (NF4) before attaching the LoRA adapters. Shrinking the 16-bit frozen weights to 4 bits cuts their footprint by about 4x and roughly halves total training memory, so a 7B model trains on a single 16GB GPU.
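A QLoRA setup mainly changes how the base model is loaded. A sketch using the transformers + bitsandbytes 4-bit path, with the same hypothetical checkpoint as above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                   # assumed checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)     # dtype fixes + gradient checkpointing
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))
```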
Rank (r) is the quality-vs-cost dial: start at 16, drop to 4 if VRAM is tight, raise toward 64 if quality falls short. Alpha scales the adapter's contribution (the update is multiplied by alpha/r); it is typically set to 2× the rank and rarely needs independent tuning.
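To see what the rank dial costs, note that one adapted d×k projection adds r·(d + k) trainable parameters (A is d×r, B is r×k). A quick check, assuming 4096×4096 projections as in a typical 7B-class model:

```python
# Trainable parameters for one adapted d x k weight: A (d x r) plus B (r x k) -> r * (d + k).
d, k = 4096, 4096                  # assumed projection size for a 7B-class model
for r in (4, 16, 64):
    print(f"r={r:<3} -> {r * (d + k):,} trainable params per adapted matrix")
# r=4  ->  32,768   r=16 -> 131,072   r=64 -> 524,288
```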