Inference Serving (vLLM)
Theory
Why naive deployment fails: a single GPU processes the prompt (prefill) quickly, then decodes one token per forward pass. Without batching, each request occupies the GPU exclusively, and utilization drops to 20–30% under concurrent load.
vLLM's two key innovations:
| Mechanism | What it does |
|---|---|
| Continuous batching | New requests join the active batch mid-generation; GPU stays saturated |
| PagedAttention | KV cache stored in non-contiguous pages (like OS virtual memory); eliminates fragmentation |
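A minimal offline sketch of the engine side, assuming vLLM is installed and a small placeholder model (`facebook/opt-125m`) fits on the GPU. The `gpu_memory_utilization` and `max_num_seqs` knobs bound the paged KV-cache pool and how many sequences are batched per scheduler step:

```python
from vllm import LLM, SamplingParams

# Placeholder model for illustration; swap in your own checkpoint.
llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.90,  # fraction of VRAM for weights + paged KV cache
    max_num_seqs=256,             # upper bound on sequences batched per step
)

sampling = SamplingParams(temperature=0.7, max_tokens=64)

# All prompts are scheduled together; the engine interleaves prefill and decode
# internally, so the GPU stays busy even with mixed prompt lengths.
prompts = [
    "Explain continuous batching in one sentence.",
    "What problem does PagedAttention solve?",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```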
Request lifecycle:
- Request arrives
- Prefill: prompt tokens are processed
- Decode (batched): tokens are generated; new requests join continuously
OpenAI-compatible API — vLLM starts a server at /v1/chat/completions. Existing OpenAI clients need only a base_url change.
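A client-side sketch, assuming the server was started with something like `vllm serve meta-llama/Llama-3.1-8B-Instruct` on localhost port 8000 (model name and port are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # ignored unless the server enforces a key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```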
Deployment options:
- Bare `vllm serve` command on any CUDA host
- FastAPI wrapper adding auth, routing, or RAG context injection (see the sketch after this list)
- SageMaker HuggingFace endpoint (managed autoscaling)
- Kubernetes deployment with GPU node selectors
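As a sketch of the FastAPI wrapper option (route names, environment variables, and the upstream URL are assumptions), a thin proxy that checks a bearer token and forwards chat requests to the vLLM server:

```python
import os

import httpx
from fastapi import FastAPI, Header, HTTPException, Request

VLLM_URL = os.environ.get("VLLM_URL", "http://localhost:8000/v1/chat/completions")
API_TOKEN = os.environ.get("API_TOKEN", "change-me")  # placeholder shared secret

app = FastAPI()


@app.post("/v1/chat/completions")
async def proxy_chat(request: Request, authorization: str = Header(default="")):
    # Reject requests that do not carry the expected bearer token.
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="invalid token")

    payload = await request.json()
    # This is also the place to inject RAG context into payload["messages"].
    async with httpx.AsyncClient(timeout=120.0) as client:
        upstream = await client.post(VLLM_URL, json=payload)
    return upstream.json()
```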
Metrics to instrument: requests/sec, time-to-first-token (TTFT), tokens/sec. These feed LLMOps alerting — the topic covered next.
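A rough client-side way to measure TTFT and decode throughput, assuming the same OpenAI-compatible endpoint as above (the server also exposes Prometheus counters at /metrics, which is the usual source for production dashboards). Streaming reveals when the first token arrives:

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
tokens = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "List three uses of continuous batching."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token
        tokens += 1                               # ~1 token per streamed chunk

elapsed = time.perf_counter() - start
if first_token_at is not None:
    ttft = first_token_at - start
    print(f"TTFT: {ttft:.3f}s")
    print(f"decode tokens/sec: {tokens / (elapsed - ttft):.1f}")
```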