KV Cache & Batching

Act 4 · Mastery · ~4 min

Theory

KV cache stores the key and value tensors produced by each attention layer for each token. On every decode step the model reads past K/V from the cache and computes only the current token's pair, cutting per-step K/V computation from O(sequence length) to O(1); the attention lookup itself still scans all cached entries.
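To make the mechanism concrete, here is a minimal single-head decode sketch in NumPy; the dimensions and weight matrices are invented for illustration, not taken from any real model.

```python
import numpy as np

d = 64                                  # head dimension (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = rng.standard_normal((3, d, d))

k_cache, v_cache = [], []               # grow by one entry per decoded token

def decode_step(x):
    """One decode step for a single head: new K/V only, past K/V from cache."""
    q = x @ W_q                         # query for the current token
    k_cache.append(x @ W_k)             # O(1) new K/V work per step
    v_cache.append(x @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # read all cached entries
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())   # softmax over all past positions
    return (w / w.sum()) @ V            # context vector for the new token

for _ in range(4):
    out = decode_step(rng.standard_normal(d))
```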

| Concept | What it solves | Cost |
| --- | --- | --- |
| KV cache | Redundant attention recomputation | VRAM: grows with context × batch × layers |
| Continuous batching | GPU idle time between requests | Slightly more scheduling complexity |
| PagedAttention | VRAM fragmentation from unequal sequence lengths | Requires vLLM (or a compatible runtime) |

Cache memory formula:

cache size = 2 Ă— n_layers Ă— n_heads Ă— dim_head Ă— context_len Ă— batch_size Ă— bytes_per_param

With typical 7B dimensions (32 layers, 32 heads of size 128), FP16, a 2048-token context, and batch 16, that comes to 16 GiB for the KV cache alone, more than the model's FP16 weights (~14 GB).
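The arithmetic behind that figure, sketched in Python; the 32-layer/32-head/128-dim values are typical LLaMA-7B-class dimensions, assumed here rather than read from any config.

```python
def kv_cache_bytes(n_layers, n_heads, dim_head, context_len, batch_size, bytes_per_param):
    # 2x for the key tensor plus the value tensor
    return 2 * n_layers * n_heads * dim_head * context_len * batch_size * bytes_per_param

# Assumed 7B (LLaMA-style) dimensions: 32 layers, 32 heads of size 128, FP16 (2 bytes)
size = kv_cache_bytes(n_layers=32, n_heads=32, dim_head=128,
                      context_len=2048, batch_size=16, bytes_per_param=2)
print(f"{size / 2**30:.1f} GiB")  # -> 16.0 GiB
```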

Prompt (prefill): K/V computed once, stored in cache
Token n+1: read past K/V from cache; compute new K/V only
Token n+2: cache grows by one entry per step

Decode steps with KV cache: only the new token is computed fresh.
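A sketch of that prefill/decode split, again with invented single-head shapes: prefill fills the cache for the whole prompt in one pass, and each decode step appends exactly one entry.

```python
import numpy as np

d = 64
rng = np.random.default_rng(1)
W_k, W_v = rng.standard_normal((2, d, d))
k_cache, v_cache = [], []

# Prefill: K/V for every prompt token computed once, in one batched pass
prompt = rng.standard_normal((10, d))        # embeddings of 10 prompt tokens
k_cache.extend(prompt @ W_k)
v_cache.extend(prompt @ W_v)

# Decode: each step appends exactly one K/V entry
for step in range(3):
    x = rng.standard_normal(d)               # embedding of the newest token
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    print(f"after token n+{step + 1}: {len(k_cache)} cached entries")
```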

Continuous batching evicts finished sequences immediately and fills the freed slot with the next waiting request, so the GPU stays busy. PagedAttention manages the cache in fixed-size pages, like virtual memory, so freed pages are instantly reusable.
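A toy sketch of both ideas together, with invented sizes and no real vLLM API: a fixed-size page pool in the spirit of PagedAttention, and a continuous-batching loop that evicts finished sequences and refills slots every step rather than once per batch.

```python
from collections import deque
import random

PAGE_TOKENS = 16                        # tokens per fixed-size cache page (invented)
free_pages = deque(range(64))           # page pool; freed pages go straight back on

class Seq:
    def __init__(self, remaining):
        self.remaining = remaining      # decode steps until this request finishes
        self.pages = []                 # page table: logical block -> physical page
        self.length = 0

    def append_token(self):
        if self.length % PAGE_TOKENS == 0:   # current page full: allocate a new one
            self.pages.append(free_pages.popleft())
        self.length += 1
        self.remaining -= 1

random.seed(0)
waiting = deque(Seq(random.randint(5, 40)) for _ in range(20))
running, MAX_BATCH = [], 8

while waiting or running:
    # Continuous batching: refill freed slots on every step
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())
    for seq in running:
        seq.append_token()              # one decode step per running sequence
    for seq in [s for s in running if s.remaining <= 0]:
        free_pages.extend(seq.pages)    # evict immediately; pages reusable at once
        running.remove(seq)

print(f"done; {len(free_pages)} pages free")   # all 64 pages back in the pool
```

The free list is the whole trick: because every page is the same size, a page freed by one sequence can back any other sequence's next block, so unequal sequence lengths no longer fragment VRAM.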