KV Cache & Batching
Theory
The KV cache stores the key and value tensors produced by each attention layer for every token processed so far. On each decode step the model reads past K/V from the cache and computes only the current token's pair, so the projection work per step drops from O(sequence length), i.e. re-running the whole prefix, to O(1); the attention read itself still scales with context length.
| Concept | What it solves | Cost |
|---|---|---|
| KV cache | Redundant attention recomputation | VRAM: grows with context Ă— batch Ă— layers |
| Continuous batching | GPU idle time between requests | Slightly more scheduling complexity |
| PagedAttention | VRAM fragmentation from unequal sequence lengths | Requires vLLM (or compatible runtime) |
Cache memory formula (the leading 2 accounts for keys and values):
cache_size = 2 × n_layers × n_heads × dim_head × context_len × batch_size × bytes_per_element
A typical 7B model (32 layers, 32 heads, head dim 128) at FP16 with a 2048-token context needs about 1 GB of KV cache per sequence; at batch 16 that is roughly 17 GB, before counting the weights.
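A quick sanity check of the formula in code. The dimensions below assume a Llama-7B-style architecture (32 layers, 32 heads, head dim 128); substitute your model's actual config.

```python
# Back-of-the-envelope KV cache size from the formula above.
def kv_cache_bytes(n_layers, n_heads, dim_head, context_len, batch_size, bytes_per_element):
    # Factor of 2: one tensor for keys, one for values.
    return 2 * n_layers * n_heads * dim_head * context_len * batch_size * bytes_per_element

size = kv_cache_bytes(
    n_layers=32, n_heads=32, dim_head=128,   # assumed Llama-7B-style dims
    context_len=2048, batch_size=16,
    bytes_per_element=2,                     # FP16
)
print(f"{size / 1024**3:.1f} GiB")           # ~16.0 GiB (~17.2 GB)
```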
| Step | What happens |
|---|---|
| Prompt (prefill) | K/V computed once for all prompt tokens, stored in cache |
| Token n+1 | Read past K/V from cache; compute new K/V only |
| Token n+2 | Cache grows by one entry per step |
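A minimal sketch of the flow above, using a toy single-head attention in NumPy. Shapes, weights, and function names are made up for illustration; real models cache K/V per layer and per head.

```python
import numpy as np

dim = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

def prefill(prompt_embeds):
    # Compute K/V for every prompt token once and keep them as the cache.
    return prompt_embeds @ W_k, prompt_embeds @ W_v          # (n, dim), (n, dim)

def decode_step(x, k_cache, v_cache):
    # Only the new token's K/V are computed; past K/V come from the cache.
    q, k_new, v_new = x @ W_q, x @ W_k, x @ W_v
    k_cache = np.vstack([k_cache, k_new])                    # cache grows by one row
    v_cache = np.vstack([v_cache, v_new])
    scores = k_cache @ q / np.sqrt(dim)                      # attention still reads every row
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache

k_cache, v_cache = prefill(rng.standard_normal((10, dim)))   # 10-token prompt
out, k_cache, v_cache = decode_step(rng.standard_normal(dim), k_cache, v_cache)
print(out.shape, k_cache.shape)                              # (64,) (11, 64)
```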
Continuous batching evicts finished sequences immediately and admits a waiting request into the freed slot, so the GPU stays busy. PagedAttention manages the cache in fixed-size blocks, like virtual-memory pages, so blocks freed by one sequence are instantly reusable by another.
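A toy block allocator in the spirit of PagedAttention: the KV cache is a pool of fixed-size blocks, and each sequence holds a block table instead of one contiguous region. Class and method names here are illustrative assumptions, not vLLM's internals.

```python
class BlockPool:
    def __init__(self, n_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(n_blocks))        # free physical cache blocks
        self.tables = {}                         # seq_id -> list of block ids
        self.lengths = {}                        # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve cache space for one more token of this sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_tokens == 0:           # last block full (or first token)
            if not self.free:
                raise MemoryError("cache exhausted; preempt or queue the request")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: return its blocks to the pool immediately."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

pool = BlockPool(n_blocks=4, block_tokens=2)
for _ in range(3):
    pool.append_token("req-A")                   # 3 tokens span 2 blocks
pool.release("req-A")                            # blocks go straight back to the pool
print(len(pool.free))                            # 4
```

The same release-and-refill idea is what continuous batching exploits at the scheduler level: as soon as a request finishes, its cache blocks and its batch slot are handed to the next waiting request.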