RAG Fundamentals
Theory
RAG adds a retrieval step between a user's query and the LLM's response. Instead of answering from training memory alone, the model is supplied with relevant documents pulled from an external store.
Query (user question) → Embed (vector) → Retrieve (top-k chunks) → Augment (prompt + context) → Generate (response)
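The Embed and Retrieve stages can be sketched with a toy deterministic embedding and cosine-similarity ranking. The hash-based `embed` function is a stand-in for a real embedding model, and all names here are illustrative:

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy embedding: hash each token into a fixed-size, L2-normalized vector.
    A real system would call an embedding model instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are pre-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k chunks ranked by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "RAG adds a retrieval step before generation.",
    "Fine-tuning adapts model style and format.",
    "Top-k chunks are injected into the prompt.",
]
print(retrieve("what does RAG retrieval do", chunks, k=2))
```

A production store would index precomputed embeddings and use approximate nearest-neighbor search rather than scoring every chunk per query.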
The three steps:
| Step | What happens |
|---|---|
| Retrieve | Query embedding compared to indexed chunks; top-k returned |
| Augment | Retrieved chunks injected into the prompt as context |
| Generate | LLM is steered to answer from supplied context, leaning less on memory |
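The Augment step from the table reduces to prompt construction. A minimal sketch, with an illustrative template (real systems tune the wording, ordering, and instructions):

```python
def augment(query: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into the prompt as numbered context blocks,
    then instruct the model to answer only from that context."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = augment(
    "What does RAG change?",
    ["RAG adds a retrieval step before generation."],
)
print(prompt)
```

The numbered context blocks also enable source attribution: the model can be asked to cite `[1]`, `[2]`, etc. in its answer.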
RAG vs. fine-tuning for factual knowledge: RAG can be updated without retraining, supports source attribution, and is cheaper to refresh. Fine-tuning suits style or format adaptation, not rapidly changing facts.
Limits: retrieval quality is the binding constraint; irrelevant or empty results propagate into ungrounded generation. The context window caps how many chunks can be injected, and every request pays added retrieval latency.
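The context-window cap means ranked chunks must be trimmed to a token budget before injection. A minimal sketch, using a whitespace split as a crude stand-in for a real tokenizer:

```python
def fit_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks (assumed pre-ranked by relevance) until the token
    budget would be exceeded, then stop."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        n = len(chunk.split())  # crude token count
        if used + n > max_tokens:
            break
        kept.append(chunk)
        used += n
    return kept

ranked = ["a b c", "d e f g", "h"]
print(fit_to_budget(ranked, 7))  # keeps the first two chunks (3 + 4 tokens)
```

Stopping at the first over-budget chunk keeps the highest-ranked context; alternatives such as skipping oversized chunks or truncating the last one trade simplicity for fuller windows.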
RAG reduces but does not eliminate hallucination. Next: chunking and vector search — the variables that determine retrieval quality.