Hybrid Search & BM25

Act 3 · ~4 min

Theory

Vector search retrieves documents that share meaning with the query, but it has a blind spot: rare tokens. Product codes (PRD-7742-X), drug names, and unusual proper nouns may have weak embeddings because the model saw too few examples of them in training.

BM25 handles this with lexical scoring: TF rewards documents where the term appears often (with diminishing returns for repeats), IDF boosts rare terms over common ones, and length normalization keeps scores comparable across short and long docs.
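To make the three ingredients concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace-tokenized, lowercased documents, the commonly used defaults `k1=1.5` and `b=0.75`, and the smoothed, non-negative IDF variant; the function name `bm25_scores` and the sample documents are illustrative, not taken from any particular library.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query (illustrative helper)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many docs contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF: rare terms get a larger weight (smoothed, non-negative form).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: b scales the penalty for longer-than-average docs.
            norm = k1 * (1 - b + b * len(d) / avg_len)
            # TF saturates: repeated occurrences add less and less.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

# Hypothetical mini-corpus: only the first doc contains the product code.
docs = [
    "replacement filter for prd-7742-x pump".split(),
    "generic pump filter sold in bulk".split(),
    "pump maintenance guide covering filter and pump seals".split(),
]
scores = bm25_scores("prd-7742-x filter".split(), docs)
```

In this toy corpus every document contains "filter", so that term contributes little; the rare product code is what separates the first document from the rest.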

Hybrid search fuses both signals. Two common methods:

Method	How it works	When to use
Weighted sum	`alpha * vector + (1 - alpha) * bm25`	When you can tune alpha on labeled data
RRF	Combines rank positions, not raw scores	Robust default without calibration
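The weighted-sum row can be sketched as follows. Because raw vector and BM25 scores live on different scales, this sketch assumes min-max normalization per result list before mixing; that normalization choice, the helper names `min_max` and `weighted_sum`, and the sample scores are all assumptions made for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_sum(vec, lex, alpha=0.5):
    """Fuse as alpha * vector + (1 - alpha) * bm25 on normalized scores."""
    v, l = min_max(vec), min_max(lex)
    # A doc missing from one list contributes 0 from that side.
    return {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * l.get(doc, 0.0)
        for doc in set(v) | set(l)
    }

# Hypothetical scores: cosine similarities vs. unbounded BM25 values.
vec = {"a": 0.82, "b": 0.79, "c": 0.40}
lex = {"b": 12.0, "c": 3.0, "d": 9.0}
fused = weighted_sum(vec, lex, alpha=0.5)
best = max(fused, key=fused.get)
```

With `alpha=0.5`, document `b` comes out on top because it scores well in both lists; raising `alpha` toward 1.0 shifts the result toward the vector ranking.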

RRF avoids comparing incompatible score scales, which is why it's a common starting point.
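A minimal sketch of reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs and using the conventional constant `k=60`; the function name `rrf` and the sample rankings are illustrative.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: each doc earns 1 / (k + rank) per list it appears in."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical outputs of the two retrievers, best match first.
vector_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "d", "c"]
fused_ranking = rrf([vector_ranking, bm25_ranking])
```

Only positions enter the formula, so no score normalization is needed. Documents that appear in both lists (`b`, `c`) outrank those that appear in just one.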

Decision guide: use hybrid for technical docs, product catalogs, and exact-term queries. Use vector-only for purely semantic tasks (paraphrase, story similarity).

Next: reranking re-scores the fused shortlist for final precision.

Application
