Act 3

Application

Hybrid Search & BM25

Act 3 · ~4 min

Theory

Vector search retrieves documents that share meaning with the query, but it has a blind spot: rare tokens. Product codes (PRD-7742-X), drug names, and unusual proper nouns may have weak embeddings because the model saw too few examples of them in training.

BM25 handles this with lexical scoring: term frequency (TF) rewards documents where the term appears often, inverse document frequency (IDF) boosts rare terms over common ones, and length normalization keeps scores comparable across short and long docs.
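To make the three ingredients concrete, here is a minimal sketch of BM25 scoring. It assumes whitespace-tokenized, lowercased documents, the commonly used defaults k1 = 1.5 and b = 0.75, and the non-negative IDF variant; the function name and the sample documents are made up for illustration, and production engines differ in details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query (illustrative sketch)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many docs does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF: rare terms get a larger weight (assumed non-negative variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # TF saturates via k1; b controls how strongly long docs are penalized.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical mini-corpus: only the first doc contains the rare product code.
docs = [
    "replacement filter for model prd-7742-x".split(),
    "general purpose filter for home use".split(),
    "filter filter filter cleaning guide and tips for filter care".split(),
]
scores = bm25_scores("prd-7742-x filter".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The document containing the rare product code should rank first, even though another document repeats the common word "filter" four times: IDF gives the rare token most of the weight, and TF saturation limits what repetition can buy.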

Hybrid search fuses both signals. Two common methods:

| Method | How it works | When to use |
|---|---|---|
| Weighted sum | alpha * vector + (1 - alpha) * bm25 | When you can tune alpha on labeled data |
| RRF | Combines rank positions, not raw scores | Robust default without calibration |

RRF (reciprocal rank fusion) avoids comparing incompatible score scales, which is why it's a common starting point.
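Both fusion methods from the table can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the weighted sum min-max normalizes each score set first (an added step, since raw cosine and BM25 scores live on different scales), RRF uses the commonly cited constant k = 60, and all function names, document ids, and scores are hypothetical.

```python
def weighted_sum(vector_scores, lexical_scores, alpha=0.5):
    """alpha * vector + (1 - alpha) * bm25, on min-max normalized scores."""
    def normalize(scores):
        # Assumption: rescale each signal to [0, 1] so the two are comparable.
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}

    v, lex = normalize(vector_scores), normalize(lexical_scores)
    # A doc missing from one result list contributes 0 for that signal.
    return {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * lex.get(doc, 0.0)
        for doc in set(v) | set(lex)
    }

def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranked list."""
    fused = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    # Return doc ids ordered best-first by fused score.
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical results for one query.
vector_scores = {"a": 0.9, "b": 0.8, "c": 0.3}    # cosine similarities
lexical_scores = {"a": 2.0, "c": 12.0, "d": 1.0}  # raw BM25 scores

balanced = weighted_sum(vector_scores, lexical_scores, alpha=0.5)
lexical_heavy = weighted_sum(vector_scores, lexical_scores, alpha=0.2)
fused_ranking = rrf([["a", "b", "c"], ["c", "a", "d"]])
print(max(balanced, key=balanced.get), max(lexical_heavy, key=lexical_heavy.get))
print(fused_ranking)
```

Note how the weighted-sum winner changes with alpha, which is why that method needs labeled data to tune, while RRF only looks at rank positions and has no scale to calibrate.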

Decision guide: use hybrid search for technical docs, product catalogs, and exact-term queries. Use vector-only search for purely semantic tasks (paraphrase, story similarity).

Next: reranking re-scores the fused shortlist for final precision.