RAG Fundamentals
Theory
RAG adds a retrieval step between a user's query and the LLM's response. Instead of answering from training memory alone, the model is supplied with relevant documents pulled from an external store.
Query (user question) → Embed (vector) → Retrieve (top-k chunks) → Augment (prompt + context) → Generate (response)
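The Embed and Retrieve stages can be sketched with a toy deterministic embedding and cosine-similarity ranking. The hash-based `embed` function is a stand-in for a real embedding model, and all names here are illustrative:

```python
import hashlib
import math

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy embedding: hash each token into a fixed-size, L2-normalized vector.
    A real system would call an embedding model instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are pre-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the top-k chunks ranked by cosine similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "RAG adds a retrieval step before generation.",
    "Fine-tuning adapts model style and format.",
    "Top-k chunks are injected into the prompt.",
]
print(retrieve("what does RAG retrieval do", chunks, k=2))
```

A production store would index precomputed embeddings and use approximate nearest-neighbor search rather than scoring every chunk per query.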
The three steps:
| Step | What happens |
|---|---|
| Retrieve | Query embedding compared to indexed chunks; top-k returned |
| Augment | Retrieved chunks injected into the prompt as context |
| Generate | LLM is steered to answer from supplied context, leaning less on memory |
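The Augment step from the table reduces to prompt construction. A minimal sketch, with an illustrative template (real systems tune the wording, ordering, and instructions):

```python
def augment(query: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into the prompt as numbered context blocks,
    then instruct the model to answer only from that context."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = augment(
    "What does RAG change?",
    ["RAG adds a retrieval step before generation."],
)
print(prompt)
```

The numbered context blocks also enable source attribution: the model can be asked to cite `[1]`, `[2]`, etc. in its answer.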
RAG vs. fine-tuning for factual knowledge: RAG can be updated without retraining, supports source attribution, and is cheaper to refresh. Fine-tuning suits style or format adaptation, not rapidly changing facts.
Limits: retrieval quality is the binding constraint; irrelevant or empty results propagate into ungrounded generation. The context window caps how many chunks can be injected, and every request pays added retrieval latency.
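The context-window cap means ranked chunks must be trimmed to a token budget before injection. A minimal sketch, using a whitespace split as a crude stand-in for a real tokenizer:

```python
def fit_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep chunks (assumed pre-ranked by relevance) until the token
    budget would be exceeded, then stop."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        n = len(chunk.split())  # crude token count
        if used + n > max_tokens:
            break
        kept.append(chunk)
        used += n
    return kept

ranked = ["a b c", "d e f g", "h"]
print(fit_to_budget(ranked, 7))  # keeps the first two chunks (3 + 4 tokens)
```

Stopping at the first over-budget chunk keeps the highest-ranked context; alternatives such as skipping oversized chunks or truncating the last one trade simplicity for fuller windows.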
RAG reduces but does not eliminate hallucination. Next: chunking and vector search — the variables that determine retrieval quality.