Chunking Strategies
Theory
RAG pipelines index documents as chunks because embedding models have token limits and dense vectors work best over focused passages.
Size trade-off: smaller chunks match queries precisely but risk losing surrounding context; larger chunks carry richer context but dilute the embedding signal. A common starting range is 256–512 tokens; tune against retrieval metrics for your corpus.
| Strategy | How it splits | When to prefer |
|---|---|---|
| Fixed-size | Every N tokens | Simple documents; fast baseline |
| Sentence/paragraph | Natural language boundaries | Prose with clear meaning units |
| Semantic | Embedding-based topic shifts | Long, heterogeneous documents |
| Recursive | Paragraphs → sentences → characters | Structured docs; balances size and structure |
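The recursive strategy in the table above can be sketched in plain Python: try the coarsest separator first, and only fall back to finer ones for pieces that are still too large. This is a minimal illustration (character-based sizing, hypothetical `recursive_chunk` helper), not a production tokenizer-aware splitter.

```python
def recursive_chunk(text, max_chars=1200, separators=("\n\n", ". ", " ")):
    """Split text by the coarsest separator whose pieces fit max_chars,
    recursing with finer separators only for oversized pieces."""
    if len(text) <= max_chars:
        return [text]
    for i, sep in enumerate(separators):
        if sep in text:
            chunks, buf = [], ""
            for piece in text.split(sep):
                candidate = buf + sep + piece if buf else piece
                if len(candidate) <= max_chars:
                    buf = candidate  # greedily pack pieces into one chunk
                    continue
                if buf:
                    chunks.append(buf)
                if len(piece) > max_chars:
                    # piece alone is too big: recurse with finer separators
                    chunks.extend(recursive_chunk(piece, max_chars, separators[i + 1:]))
                    buf = ""
                else:
                    buf = piece
            if buf:
                chunks.append(buf)
            return chunks
    # no separator left: hard character split as a last resort
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Real implementations (e.g. LangChain's `RecursiveCharacterTextSplitter`) follow the same idea but measure size in tokens and add overlap.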
Overlap (10–20% of chunk size) shares tokens at boundaries so a sentence split across chunks appears in both, preserving continuity that retrieval would otherwise miss.
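Overlap amounts to a sliding window whose step is smaller than its width. A minimal sketch, assuming word-level tokens for readability (real pipelines would use a model tokenizer):

```python
def chunk_with_overlap(tokens, chunk_size=256, overlap=32):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap,
    so each boundary region appears in two consecutive chunks."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

words = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
# the last token of each chunk reappears as the first token of the next
```

With 10-20% overlap the index grows proportionally, which is the usual price paid for boundary continuity.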
Metadata per chunk (source file, position, section title) enables filtered retrieval and source citation in the final answer.
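A common shape for carrying that metadata alongside each chunk is a small record type. This is a sketch with hypothetical names (`Chunk`, `make_chunks`); vector stores typically accept such fields as a metadata dict per vector.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source: str       # originating file or URL
    position: int     # chunk index within the document
    section: str = "" # nearest heading, if known

def make_chunks(doc_text, source, section=""):
    """Naive paragraph chunking that attaches provenance to each chunk."""
    paragraphs = [p for p in doc_text.split("\n\n") if p.strip()]
    return [Chunk(text=p, source=source, position=i, section=section)
            for i, p in enumerate(paragraphs)]
```

At query time these fields support filters ("only chunks from `source == 'manual.pdf'`") and let the generator cite where each retrieved passage came from.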
Next: chunks become embeddings in a vector store; vector search determines which chunks surface for a given query.