LoRA & PEFT
Theory
Full fine-tuning trains every parameter. For a 7B model that means holding weights, gradients, and optimizer state for all 7B parameters — with a mixed-precision Adam setup that is roughly 16 bytes per parameter, so well over 80 GB of GPU memory. LoRA (Low-Rank Adaptation), the most widely used PEFT (parameter-efficient fine-tuning) method, sidesteps this by decomposing the weight update mathematically.
Instead of learning a full ΔW matrix, LoRA learns two small matrices:
W_adapted = W_frozen + A · B
W_frozen ∈ R^(d×k) — base model, never updated
A ∈ R^(d×r) — trainable
B ∈ R^(r×k) — trainable
r << d, k — rank is the bottleneck
Only A and B are trained. At inference the adapter can be merged into the frozen weight (W_frozen + A · B becomes a single matrix), so there is zero latency overhead.
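To make the decomposition concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer using the A · B notation above. The class name, init scale, and the alpha/r scaling convention are illustrative assumptions, not the internals of any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen nn.Linear with a trainable low-rank update A · B (names follow the text above).
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # W_frozen: never updated
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)   # d x r, trainable
        self.B = nn.Parameter(torch.zeros(r, k))          # r x k, trainable; zero init so A·B starts at 0
        self.scale = alpha / r                            # common scaling convention (assumed here)

    def forward(self, x):
        # frozen path plus low-rank path: (W_frozen + scale * A·B) x
        return self.base(x) + self.scale * (x @ self.B.T @ self.A.T)

    @torch.no_grad()
    def merge(self):
        # fold the adapter into the frozen weight for zero-overhead inference
        self.base.weight += self.scale * (self.A @ self.B)
```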
Full fine-tuning: updates all ~7B parameters; requires 80GB+ VRAM; best theoretical quality.
LoRA: updates ~0.8% of parameters; runs on a 24GB GPU; comparable quality in practice.
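In practice the Hugging Face peft library handles the wrapping. A minimal sketch, assuming a Llama-style 7B checkpoint and q_proj/v_proj as the adapted projections (both are assumptions, not requirements):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Model id and target module names are assumptions for a Llama-style 7B checkpoint.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype="auto")

config = LoraConfig(
    r=16,                                 # rank of the A·B decomposition
    lora_alpha=32,                        # typically 2x the rank
    target_modules=["q_proj", "v_proj"],  # which projections get adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()        # reports trainable vs. total parameter counts
```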
QLoRA adds one step: quantize the frozen base to 4-bit (NF4) before attaching the LoRA adapters. Shrinking the 16-bit frozen weights to 4 bits cuts their footprint by about 4x and roughly halves total training memory, so a 7B model trains on a single 16GB GPU.
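A QLoRA setup mainly changes how the base model is loaded. A sketch using the transformers + bitsandbytes 4-bit path, with the same hypothetical checkpoint as above:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization for the frozen base weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                   # assumed checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)     # dtype fixes + gradient checkpointing
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))
```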
Rank (r) is the quality-vs-cost dial: start at 16, drop to 4 if VRAM is tight, raise toward 64 if quality falls short. Alpha scales the adapter's contribution (the update is multiplied by alpha/r); it is typically set to 2× the rank and rarely needs independent tuning.
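To see what the rank dial costs, note that one adapted d×k projection adds r·(d + k) trainable parameters (A is d×r, B is r×k). A quick check, assuming 4096×4096 projections as in a typical 7B-class model:

```python
# Trainable parameters for one adapted d x k weight: A (d x r) plus B (r x k) -> r * (d + k).
d, k = 4096, 4096                  # assumed projection size for a 7B-class model
for r in (4, 16, 64):
    print(f"r={r:<3} -> {r * (d + k):,} trainable params per adapted matrix")
# r=4  ->  32,768   r=16 -> 131,072   r=64 -> 524,288
```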