Inference Serving (vLLM)
Theory
Why naive deployment fails: a single GPU processes the prompt (prefill) quickly, then decodes one token per forward pass. Without batching, each request occupies the GPU exclusively, and utilization drops to 20–30% under concurrent load.
vLLM's two key innovations:
| Mechanism | What it does |
|---|---|
| Continuous batching | New requests join the active batch mid-generation; GPU stays saturated |
| PagedAttention | KV cache stored in non-contiguous pages (like OS virtual memory); eliminates fragmentation |
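A minimal offline sketch of the engine side, assuming vLLM is installed and a small placeholder model (`facebook/opt-125m`) fits on the GPU. The `gpu_memory_utilization` and `max_num_seqs` knobs bound the paged KV-cache pool and how many sequences are batched per scheduler step:

```python
from vllm import LLM, SamplingParams

# Placeholder model for illustration; swap in your own checkpoint.
llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.90,  # fraction of VRAM for weights + paged KV cache
    max_num_seqs=256,             # upper bound on sequences batched per step
)

sampling = SamplingParams(temperature=0.7, max_tokens=64)

# All prompts are scheduled together; the engine interleaves prefill and decode
# internally, so the GPU stays busy even with mixed prompt lengths.
prompts = [
    "Explain continuous batching in one sentence.",
    "What problem does PagedAttention solve?",
]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```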
Request lifecycle:
- Request arrives
- Prefill: prompt tokens are processed
- Decode (batched): tokens are generated; new requests join continuously
OpenAI-compatible API — vLLM starts a server at /v1/chat/completions. Existing OpenAI clients need only a base_url change.
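A client-side sketch, assuming the server was started with something like `vllm serve meta-llama/Llama-3.1-8B-Instruct` on localhost port 8000 (model name and port are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # ignored unless the server enforces a key
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```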
Deployment options:
- Bare `vllm serve` command on any CUDA host
- FastAPI wrapper adding auth, routing, or RAG context injection (see the sketch after this list)
- SageMaker HuggingFace endpoint (managed autoscaling)
- Kubernetes deployment with GPU node selectors
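As a sketch of the FastAPI wrapper option (route names, environment variables, and the upstream URL are assumptions), a thin proxy that checks a bearer token and forwards chat requests to the vLLM server:

```python
import os

import httpx
from fastapi import FastAPI, Header, HTTPException, Request

VLLM_URL = os.environ.get("VLLM_URL", "http://localhost:8000/v1/chat/completions")
API_TOKEN = os.environ.get("API_TOKEN", "change-me")  # placeholder shared secret

app = FastAPI()


@app.post("/v1/chat/completions")
async def proxy_chat(request: Request, authorization: str = Header(default="")):
    # Reject requests that do not carry the expected bearer token.
    if authorization != f"Bearer {API_TOKEN}":
        raise HTTPException(status_code=401, detail="invalid token")

    payload = await request.json()
    # This is also the place to inject RAG context into payload["messages"].
    async with httpx.AsyncClient(timeout=120.0) as client:
        upstream = await client.post(VLLM_URL, json=payload)
    return upstream.json()
```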
Metrics to instrument: requests/sec, time-to-first-token (TTFT), tokens/sec. These feed LLMOps alerting — the topic covered next.
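A rough client-side way to measure TTFT and decode throughput, assuming the same OpenAI-compatible endpoint as above (the server also exposes Prometheus counters at /metrics, which is the usual source for production dashboards). Streaming reveals when the first token arrives:

```python
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
tokens = 0

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "List three uses of continuous batching."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token
        tokens += 1                               # ~1 token per streamed chunk

elapsed = time.perf_counter() - start
if first_token_at is not None:
    ttft = first_token_at - start
    print(f"TTFT: {ttft:.3f}s")
    print(f"decode tokens/sec: {tokens / (elapsed - ttft):.1f}")
```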