KV Cache & Batching
Theory
The KV cache stores the key and value tensors produced by each attention layer for every token processed so far. On each decode step the model reads past K/V from the cache and computes only the current token's pair, so the projection work per step drops from O(sequence length), i.e. re-running the whole prefix, to O(1); the attention read itself still scales with context length.
| Concept | What it solves | Cost |
|---|---|---|
| KV cache | Redundant attention recomputation | VRAM: grows with context Ă— batch Ă— layers |
| Continuous batching | GPU idle time between requests | Slightly more scheduling complexity |
| PagedAttention | VRAM fragmentation from unequal sequence lengths | Requires vLLM (or compatible runtime) |
Cache memory formula (the leading 2 accounts for keys and values):
cache_size = 2 × n_layers × n_heads × dim_head × context_len × batch_size × bytes_per_element
A typical 7B model (32 layers, 32 heads, head dim 128) at FP16 with a 2048-token context needs about 1 GB of KV cache per sequence; at batch 16 that is roughly 17 GB, before counting the weights.
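A quick sanity check of the formula in code. The dimensions below assume a Llama-7B-style architecture (32 layers, 32 heads, head dim 128); substitute your model's actual config.

```python
# Back-of-the-envelope KV cache size from the formula above.
def kv_cache_bytes(n_layers, n_heads, dim_head, context_len, batch_size, bytes_per_element):
    # Factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_heads * dim_head * context_len * batch_size * bytes_per_element

size = kv_cache_bytes(
    n_layers=32, n_heads=32, dim_head=128,   # assumed Llama-7B-style dims
    context_len=2048, batch_size=16,
    bytes_per_element=2,                     # FP16
)
print(f"{size / 1024**3:.1f} GiB")           # ~16.0 GiB (~17.2 GB)
```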
| Step | What happens |
|---|---|
| Prompt (prefill) | K/V computed once for all prompt tokens, stored in cache |
| Token n+1 | Read past K/V from cache; compute new K/V only |
| Token n+2 | Cache grows by one entry per step |
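A minimal sketch of the flow above, using a toy single-head attention in NumPy. Shapes, weights, and function names are made up for illustration; real models cache K/V per layer and per head.

```python
import numpy as np

dim = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

def prefill(prompt_embeds):
    # Compute K/V for every prompt token once and keep them as the cache.
    return prompt_embeds @ W_k, prompt_embeds @ W_v          # (n, dim), (n, dim)

def decode_step(x, k_cache, v_cache):
    # Only the new token's K/V are computed; past K/V come from the cache.
    q, k_new, v_new = x @ W_q, x @ W_k, x @ W_v
    k_cache = np.vstack([k_cache, k_new])                    # cache grows by one row
    v_cache = np.vstack([v_cache, v_new])
    scores = k_cache @ q / np.sqrt(dim)                      # attention still reads every row
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

k_cache, v_cache = prefill(rng.standard_normal((10, dim)))   # 10-token prompt
out, k_cache, v_cache = decode_step(rng.standard_normal(dim), k_cache, v_cache)
print(out.shape, k_cache.shape)                              # (64,) (11, 64)
```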
Continuous batching evicts finished sequences immediately and admits a waiting request into the freed slot, so the GPU stays busy. PagedAttention manages the cache in fixed-size blocks, like virtual-memory pages, so blocks freed by one sequence are instantly reusable by another.
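A toy block allocator in the spirit of PagedAttention: the KV cache is a pool of fixed-size blocks, and each sequence holds a block table instead of one contiguous region. Class and method names here are illustrative assumptions, not vLLM's internals.

```python
class BlockPool:
    def __init__(self, n_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(n_blocks))        # free physical cache blocks
        self.tables = {}                         # seq_id -> list of block ids
        self.lengths = {}                        # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one more token of this sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_tokens == 0:           # last block full (or first token)
            if not self.free:
                raise MemoryError("cache exhausted; preempt or queue the request")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: return its blocks to the pool immediately."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = BlockPool(n_blocks=4, block_tokens=2)
for _ in range(3):
    pool.append_token("req-A")                   # 3 tokens span 2 blocks
pool.release("req-A")                            # blocks go straight back to the pool
print(len(pool.free))                            # 4
```

The same release-and-refill idea is what continuous batching exploits at the scheduler level: as soon as a request finishes, its cache blocks and its batch slot are handed to the next waiting request.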