Hybrid Search & BM25
Theory
Vector search retrieves documents that share meaning with the query, but it has a blind spot: rare tokens. Product codes (PRD-7742-X), drug names, and unusual proper nouns may have weak embeddings because the model saw too few examples of them in training.
BM25 handles this with lexical scoring: term frequency (TF) rewards documents where the term appears often, inverse document frequency (IDF) boosts rare terms over common ones, and length normalization keeps scores comparable across short and long docs.
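To make the three ingredients concrete, here is a minimal sketch of BM25 scoring in plain Python. The function name `bm25_scores`, the whitespace tokenization, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration. The IDF uses the smoothed form with `+1` inside the logarithm, which is one of several variants in use.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query (hypothetical helper).

    Assumes `docs` is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs each term appears at least once.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF: rare terms (low df) get a larger weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long docs need more occurrences to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            # TF with saturation: repeated occurrences help, with diminishing returns.
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Illustrative corpus, lowercased and split on whitespace (an assumption).
docs = [
    "replacement filter for prd-7742-x pump".split(),
    "general pump maintenance guide and pump safety".split(),
    "filter sizing guide".split(),
]
scores = bm25_scores("prd-7742-x pump".split(), docs)
```

In this toy corpus the first document ranks highest because it contains the rare product code, even though the second document mentions "pump" twice.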
Hybrid search fuses both signals. Two common methods:
| Method | How it works | When to use |
|---|---|---|
| Weighted sum | `alpha * vector + (1 - alpha) * bm25`, with both scores normalized to a common range | When you can tune alpha on labeled data |
| RRF (Reciprocal Rank Fusion) | Combines rank positions, not raw scores | Robust default without calibration |
RRF avoids comparing incompatible score scales, which is why it's a common starting point.
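Both fusion methods from the table can be sketched in a few lines. The function names `rrf` and `weighted_sum`, the document IDs, and the sample scores are hypothetical. `k=60` is a commonly used RRF constant, and min-max normalization is only one possible way to bring the two score scales into the same range before a weighted sum.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc IDs (best first)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw scores are never compared.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


def weighted_sum(vector_scores, lexical_scores, alpha=0.5):
    """alpha * vector + (1 - alpha) * bm25 on min-max normalized scores.

    Assumes both inputs are non-empty dicts mapping doc ID to raw score.
    """
    def min_max(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        # If all scores are equal, fall back to 0.0 to avoid dividing by zero.
        return {d: (s - lo) / span if span else 0.0 for d, s in scores.items()}

    v, l = min_max(vector_scores), min_max(lexical_scores)
    # A doc missing from one retriever contributes 0.0 from that side.
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * l.get(d, 0.0)
            for d in set(v) | set(l)}


# Illustrative inputs: two retrievers that partly disagree.
fused_order = rrf([["a", "b", "c"], ["c", "a", "d"]])

combined = weighted_sum(
    {"a": 0.9, "b": 0.8, "c": 0.1},   # e.g. cosine similarities
    {"a": 2.0, "c": 12.0, "d": 7.0},  # e.g. BM25 scores on a different scale
    alpha=0.5,
)
```

Documents that appear near the top of both lists ("a" and "c" here) rise to the top under RRF, and no tuning is required. The weighted sum gives finer control through `alpha`, but its output depends on the normalization chosen.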
Decision guide: use hybrid for technical docs, product catalogs, and exact-term queries. Use vector-only for purely semantic tasks (paraphrase, story similarity).
Next: reranking re-scores the fused shortlist for final precision.