A Field Guide & Benchmark for Similarity Search in RAG


“Facts are everywhere; the magic is fetching the right ones before your LLM starts talking.”
This deep‑dive turns that magic into engineering: a turn‑key benchmark + cookbook that lets you measure, compare, and future‑proof every similarity‑search trick you throw at Retrieval‑Augmented Generation (RAG).


1 Why care? 💡

When a RAG answer goes bad, 4 out of 5 times the retriever served junk.
Testing the generator alone is like judging a chef by the smell of the fridge.
You need unit tests for the fridge.

So we build a harness that scores quality × latency × cost across any corpus—open‑web, legal, medical, or your company wiki.


2 A 30‑second history of retrieval (for context)

| Year | Breakthrough | Ripple Effect |
|---|---|---|
| 1971 | Inverted index | Keyword search hits sub-second. |
| 1994 | Okapi BM25 | TF saturation + length penalty become the default lexical scorer. |
| 2018 | BERT | Cross-encoders beat every classical ranker, but are too slow for a first stage. |
| 2019 | DPR | First dense bi-encoder to beat BM25 in-domain. |
| 2021 | BEIR | Multi-domain zero-shot benchmark → robustness matters. |
| 2022 | MTEB | 58-task leaderboard for raw embeddings. |
| 2024 | Vector DBs add native hybrid fusion | Pinecone, Vespa, Weaviate. |
| 2025 | DAT (Dynamic Alpha Tuning) | Query-adaptive weighting between dense ↔ BM25 goes mainstream. |

3 The Four Retrieval Archetypes

| Style | How it works | Super-power | Kryptonite |
|---|---|---|---|
| Sparse (BM25) | Keywords in an inverted index. | Pin-points rare terms, IDs, citations. | No clue about synonyms. |
| Dense | Embeddings + cosine → ANN search. | Understands meaning and paraphrase. | Needs training; bigger infra. |
| Hybrid | Run both, fuse scores (sketch below). | High recall and semantics. | Twice the plumbing. |
| Late Interaction / Re-rank | ColBERT or cross-encoder on top-k. | Near-perfect precision. | Extra latency. |
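
To make the hybrid row concrete, here is a minimal score-fusion sketch: min-max normalise each retriever's scores, then blend them with a fixed weight α. DAT (the 2025 row above) would pick α per query; everything here, including the function names, is illustrative rather than this benchmark's actual API.

def normalise(scores):
    """Min-max normalise a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def fuse(bm25_scores, dense_scores, alpha=0.5, k=10):
    """Weighted-sum fusion of sparse and dense scores; returns the top-k doc ids."""
    b, d = normalise(bm25_scores), normalise(dense_scores)
    fused = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
             for doc in set(b) | set(d)}
    return sorted(fused, key=fused.get, reverse=True)[:k]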

4 The Scorecard Formula

We compress everything that matters into one number:

\[\text{Score}=\alpha\,\text{nDCG@10}+\beta\,\text{Recall@50}-\gamma\,\frac{\text{P95 latency}}{100\,\text{ms}}-\delta\,\frac{\text{Infra USD}}{1\text{k queries}}\]

  • α, β, γ, δ default to 1 · 0.5 · 0.2 · 0.1; change them per SLA.
  • Cost is run-time spend (vector DB + GPU encode) / QPS.
  • A cheaper or faster system can out-score a slightly smarter one, and that is intentional.
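
As a concrete reference, here is a minimal sketch of the scorecard written as a plain function. The stats keys and the per-1k-query cost unit are assumptions made for illustration; the harness's actual ScoreCard may expose a different interface.

def score_card(stats, alpha=1.0, beta=0.5, gamma=0.2, delta=0.1):
    """Reward quality; penalise tail latency and infra spend."""
    return (alpha * stats["ndcg@10"]
            + beta * stats["recall@50"]
            - gamma * stats["p95_latency_ms"] / 100.0
            - delta * stats["usd_per_1k_queries"])

# Example: a fast, cheap lexical run with modest quality.
score_card({"ndcg@10": 0.33, "recall@50": 0.60,
            "p95_latency_ms": 18, "usd_per_1k_queries": 0.08})   # ≈ 0.59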

5 Datasets: pick your arena 🎯

Minimal friction: every dataset ships as three JSONL files → corpus, queries, qrels.

| Domain | Benchmark | Corpus Size | Why it matters |
|---|---|---|---|
| Open QA | Natural Questions (NQ), TriviaQA | 3–5 GB | Classic zero-shot stress-test. |
| Legal | CaseHOLD (US), COLIEE (JP) | 2 GB | Citations, “shall & hereby”. |
| Biomedical | BioASQ, TREC-COVID | 1 GB | Synonyms everywhere. |
| Finance | FiQA-2018, SEC 10-K sections | 500 MB | Numbers, tickers, jargon. |
| Enterprise | Your wiki / tickets | ? | Real life; easy to add. |

Add custom data by dropping the JSONL triple into datasets/<name>/; zero code changes. An example of the three files follows.
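
For reference, a minimal triple could look like the lines below. The corpus and query fields follow the BEIR convention; the qrels keys are an assumption for illustration, so check whatever loader you plug in.

datasets/<name>/corpus.jsonl   – one document per line
{"_id": "doc1", "title": "FY2023 10-K", "text": "Revenue grew 12% year over year ..."}

datasets/<name>/queries.jsonl  – one query per line
{"_id": "q1", "text": "How much did revenue grow in 2023?"}

datasets/<name>/qrels.jsonl    – one relevance judgement per line
{"query-id": "q1", "corpus-id": "doc1", "score": 1}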


6 Harness ⚙️

# High-level outline – swap any library/DB you like.
# Install: beir, ragas, ares, dat-scorecard (plus your retriever / vector-DB client).
# BM25Retriever, DenseRetriever, DATFusion, evaluate and score_card are the
# harness's own thin wrappers, not third-party imports – implement or swap them freely.

CORP = "datasets/your_corpus"

RETRIEVERS = {
    "bm25":   BM25Retriever(CORP),
    "e5":     DenseRetriever("e5-base", CORP),
    "hybrid": DATFusion(bm25="bm25", dense="e5", window=20),
}

def benchmark(name, retriever):
    stats = evaluate(retriever, CORP)         # quality & latency metrics
    cost = retriever.estimate_cost(qps=10)    # infra $ per 1k queries at 10 QPS
    score = score_card(stats, cost)           # the utility formula from section 4
    print(name, score)

for name, retriever in RETRIEVERS.items():
    benchmark(name, retriever)

Outputs a CSV with nDCG, Recall, Latency, Cost, Utility—ready for your slide deck.


7 Chunking recipes (don’t skip this!)

| Style | When to use | Pros | Cons |
|---|---|---|---|
| Fixed-token (1k) | Static docs, wide domain | Simple, fast | Splits sentences, wastes tokens. |
| Recursive | Blogs, PDFs | Keeps logical blocks | Slightly slower split time. |
| Small-to-Big | Long manuals | Precision of a 256-token lookup; send the 2k-token parent to the LLM | Extra index joins. |
| Semantic (BGE embed Δ) | Chat logs | Minimises redundancy | Needs a vector-clustering step. |

Rule of thumb:
Precision ↑ until ~512 tokens, then starts to fall. Test 256, 512, and 1,024.
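
For the fixed-token recipe, a minimal chunker sketch is below. It counts whitespace-separated words rather than real tokenizer tokens, and the 64-token overlap is a common default, not something this benchmark prescribes.

def chunk_fixed(text, chunk_tokens=512, overlap=64):
    """Greedy fixed-size chunking with overlap (whitespace 'tokens' only)."""
    tokens = text.split()
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks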


8 Sample leaderboard (FiQA‑2018)

| Rank | Retriever | nDCG@10 | Recall@50 | P95 (ms) | Cost ($/kq) | Utility |
|---|---|---|---|---|---|---|
| 🥇 1 | Hybrid (DAT 0.55) | 0.493 | 0.711 | 74 | 0.37 | 0.98 |
| 2 | Dense (E5-L) | 0.462 | 0.688 | 55 | 0.41 | 0.87 |
| 3 | BM25 | 0.332 | 0.601 | 18 | 0.08 | 0.59 |
| 4 | ColBERT v2 + BM25 1000 | 0.502 | 0.726 | 430 | 1.55 | 0.55 |

Zero infra tweaks; defaults everywhere.


9 Extending the harness 🛠️

  • Plug-in metrics: drop in any scorer.py that exposes a .score(stats) method.
  • Rerank stage: wrap any retriever with CrossEncoder("ms-marco-MiniLM-L12") (see the sketch after this list).
  • Multi-hop: pass steps=2 and the harness feeds follow-up sub-queries to the retriever.
  • LLM feedback loop: enable DATFusion to auto-set α per query with a 6B-parameter LLM; adds ~30 ms/query.
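
Here is what the rerank wrapper could look like. Only the CrossEncoder call is a real sentence-transformers API ("cross-encoder/ms-marco-MiniLM-L-12-v2" is the full Hugging Face id for the model named above); the wrapper class and the retrieve() signature are assumptions about the harness's interface.

from sentence_transformers import CrossEncoder

class Reranked:
    """Wrap a first-stage retriever and re-score its top candidates."""
    def __init__(self, base, model="cross-encoder/ms-marco-MiniLM-L-12-v2"):
        self.base = base
        self.cross_encoder = CrossEncoder(model)

    def retrieve(self, query, k=10, candidates=100):
        hits = self.base.retrieve(query, k=candidates)   # assumed (doc_id, text, score) tuples
        scores = self.cross_encoder.predict([(query, text) for _, text, _ in hits])
        reranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
        return [(doc_id, text, float(s)) for (doc_id, text, _), s in reranked[:k]]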

10 Cost & latency modelling

Cost($) = (RAM_GB * $0.001 + vCPU * $0.04 + GPU_hours * $1.2) / 3600

retriever.estimate_cost() uses AWS pcm spot prices. Tweak in infra.yaml.
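
Read literally, the formula above converts hourly unit prices into a per-second burn rate (the divide by 3600). The sketch below shows how estimate_cost() might turn that into the $/kq column; treating GPU_hours as the number of GPUs billed at the hourly rate, and scaling by QPS, are assumptions rather than documented behaviour.

def infra_cost_per_1k_queries(ram_gb, vcpus, gpus, qps):
    """Hourly unit prices -> $ per second of runtime -> $ per 1k queries."""
    usd_per_second = (ram_gb * 0.001 + vcpus * 0.04 + gpus * 1.2) / 3600
    return usd_per_second / qps * 1_000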

Latency is measured end‑to‑end (encode → ANN → top‑k JSON). The harness auto‑spins enough pods to satisfy target QPS and re‑measures.


11 Case studies

  • Legal Advice Bot → BM25 + DATHybrid beat dense by +21 pts utility because citations must match numeric codes.
  • Medical Chat → Bio‑BERT dense alone wins; BM25 added 18 ms and no gain.
  • Internal Support Wiki → Small‑to‑Big chunking + hybrid gave 9 % fewer hallucinations in RAGAS.

12 What’s next? 🔭

  1. Multimodal retrieval – adding image embeddings to the same harness.
  2. Reasoning‑augmented – plug‑in chain‑of‑thought re‑rankers.
  3. Self-optimising RAG – a weekly cron runs the benchmark and opens a PR if utility rises by 5 %.

13 Conclusion

The retriever is not a black box—it’s a dial you can measure and turn.
With this benchmark + scorecard you’ll know exactly which knob improves grounding, by how much, and at what price before your LLM opens its mouth.

Go forth, measure, and may your nDCG rise while your latency shrinks.