fix(eval): apply rerank in the query path + SciFact baseline by jrosskopf · Pull Request #235 · DataZooDE/escurel

jrosskopf · 2026-06-29T09:58:14Z

Summary

Two things, found by actually running the harness from #234:

Bug fix. The harness called Indexer::search directly, which does not run the reranker — the rerank stage lives in the server's tool_search (rerank_hits after search). So the rerank / two_pass_rerank configs were a silent no-op (byte-identical metrics to single_pass / two_pass). Fixed with a search_ranked helper that replicates the server's native-lane path: fetch rerank_candidate_pool(k) → rerank_hits → truncate to k, used by both the sequential and the concurrent-QPS passes. With rerank disabled it is exactly search(q, k).
Regression guard. The smoke test now passes a deterministic ReverseReranker and asserts the rerank config's ranking differs from single_pass — a no-op rerank fails the test.
Committed baseline — docs/eval/baseline-scifact.{md,json}.
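The fetch, rerank, truncate path described above can be sketched roughly as follows. This is a minimal illustration, not the harness code: `SearchHit`, the `Reranker` trait shape, the closure-based `search` argument, and the 4x over-fetch in `rerank_candidate_pool` are all assumptions made for the example; only the function names come from the PR text.

```rust
// Hypothetical stand-ins for the harness types; the real escurel
// signatures may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct SearchHit {
    pub doc_id: String,
    pub score: f32,
}

/// Assumed reranker shape: takes the candidates and returns them reordered.
pub trait Reranker {
    fn rerank(&self, query: &str, hits: Vec<SearchHit>) -> Vec<SearchHit>;
}

/// Assumed pool sizing (4x over-fetch). The real rule is not stated in the PR.
pub fn rerank_candidate_pool(k: usize) -> usize {
    k.saturating_mul(4)
}

/// Fetch a candidate pool, rerank it, then truncate to `k`.
/// With no reranker configured this is exactly `search(query, k)`.
pub fn search_ranked<S>(
    search: S,
    reranker: Option<&dyn Reranker>,
    query: &str,
    k: usize,
) -> Vec<SearchHit>
where
    S: Fn(&str, usize) -> Vec<SearchHit>,
{
    match reranker {
        None => search(query, k),
        Some(r) => {
            // Over-fetch so the reranker can promote hits from below rank k.
            let pool = search(query, rerank_candidate_pool(k));
            let mut ranked = r.rerank(query, pool);
            ranked.truncate(k);
            ranked
        }
    }
}
```

Routing both the sequential and the concurrent-QPS passes through one helper like this keeps the quality and latency numbers on the same code path.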

Baseline (BEIR SciFact, 1000-doc qrels-preserving subsample)

bge-base-en-v1.5 (768-d BERT) + bge-reranker-base, CPU, 300 test queries.

config	nDCG@10	recall@10	recall@100	p50 ms	QPS
single_pass	0.846	0.921	0.993	146	12.5
two_pass	0.847	0.924	0.993	174	10.0
rerank	0.671	0.802	0.993	15104	0.2
two_pass_rerank	0.672	0.803	0.993	15030	0.2
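For readers less familiar with the metric columns, here is a minimal sketch of nDCG@k and recall@k under binary relevance. It assumes the common log2 discount and a qrels set of relevant document IDs; the harness's exact variant (graded gains, tie handling) is not specified in this PR and may differ.

```rust
use std::collections::HashSet;

/// nDCG@k with binary relevance: a relevant doc at 0-based rank i
/// contributes 1 / log2(i + 2); the sum is normalised by the ideal ordering.
pub fn ndcg_at_k(ranked: &[&str], relevant: &HashSet<String>, k: usize) -> f64 {
    let dcg: f64 = ranked
        .iter()
        .copied()
        .take(k)
        .enumerate()
        .filter(|(_, id)| relevant.contains(*id))
        .map(|(i, _)| 1.0 / ((i + 2) as f64).log2())
        .sum();
    // Ideal ranking: all relevant docs (up to k of them) at the top.
    let ideal_hits = relevant.len().min(k);
    let idcg: f64 = (0..ideal_hits)
        .map(|i| 1.0 / ((i + 2) as f64).log2())
        .sum();
    if idcg == 0.0 { 0.0 } else { dcg / idcg }
}

/// recall@k: fraction of the relevant docs that appear in the top k.
pub fn recall_at_k(ranked: &[&str], relevant: &HashSet<String>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let found = ranked
        .iter()
        .copied()
        .take(k)
        .filter(|id| relevant.contains(*id))
        .count();
    found as f64 / relevant.len() as f64
}
```

This also shows why recall@100 can stay at 0.993 while nDCG@10 drops: a reranker that reorders the pool moves relevant documents down the list without removing them from the top 100.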

Findings:

#218 two-pass (RAG: Matryoshka two-pass retrieval): quality-neutral (+0.001 nDCG@10), +28 ms p50. At 1k docs the truncate-on-read coarse pass adds work rather than saving it (the second 128-d HNSW index is the deferred throughput win), as the #218 PR documented.
#215 rerank (RAG: cross-encoder reranker stage after RRF fusion): regresses nDCG@10 (0.846 → 0.671) and latency (~15 s/query on CPU). Cause: the stage scores the 200-char snippet, not the full abstract (rerank_passage uses SearchHit.snippet). The report names two concrete levers: feed fuller passages to the reranker; rerank only on GPU and/or with a much smaller candidate pool.
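The first lever (fuller passages) could look something like the sketch below. It is purely illustrative: the `body` field, the `passage_for_rerank` name, and the `max_chars` cap are hypothetical and not taken from the codebase; the PR only states that the current stage reads `SearchHit.snippet`.

```rust
/// Hypothetical hit shape. Only `snippet` is named in the PR;
/// `body` is an assumed field holding the full passage text.
pub struct SearchHit {
    pub snippet: String,
    pub body: Option<String>,
}

/// Choose the text handed to the cross-encoder: prefer the full body,
/// fall back to the snippet, and cap the length at `max_chars` characters
/// (counted as Unicode scalar values, so multi-byte text is never split).
pub fn passage_for_rerank(hit: &SearchHit, max_chars: usize) -> String {
    let text = hit.body.as_deref().unwrap_or(hit.snippet.as_str());
    text.chars().take(max_chars).collect()
}
```

A cap of this kind trades reranker quality against latency, since cross-encoder cost grows with input length; the right value would have to be measured with the harness.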

The eval did its job — turned "the reranker is wired in" into a measured, falsifiable result.

(1k subsample because candle CPU BERT embedding is ~0.5 docs/s; the harness runs the full corpus unchanged on a BLAS/GPU build. Absolute numbers are indicative; the per-config deltas are the signal.)

Test plan

cargo test -p escurel-eval — metric units + the strengthened smoke test (real DuckDB + HashEmbedder + ReverseReranker, asserting rerank is applied). No model download, no #[ignore].
Full local gate green: fmt, clippy --workspace --all-targets -D warnings (+ -p escurel-eval --features candle,rerank), cargo test --workspace --all-targets, cargo build --workspace --release.
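The idea behind the strengthened smoke test can be illustrated as below. This is a simplified sketch, not the test in the repository: the `Reranker` trait over plain ID lists, the `NoopReranker`, and the `rerank_changes_order` helper are assumptions introduced for the example. Only the `ReverseReranker` name comes from the PR.

```rust
/// Simplified, hypothetical reranker interface over document IDs.
pub trait Reranker {
    fn rerank(&self, query: &str, ids: Vec<String>) -> Vec<String>;
}

/// Deterministic sentinel: reversing any list with at least two distinct
/// entries is guaranteed to change the order.
pub struct ReverseReranker;

impl Reranker for ReverseReranker {
    fn rerank(&self, _query: &str, mut ids: Vec<String>) -> Vec<String> {
        ids.reverse();
        ids
    }
}

/// Models the original bug: the rerank stage is configured but never applied.
pub struct NoopReranker;

impl Reranker for NoopReranker {
    fn rerank(&self, _query: &str, ids: Vec<String>) -> Vec<String> {
        ids
    }
}

/// True when the reranked order differs from the baseline order.
/// A guard built on this fails if the rerank stage is silently skipped.
pub fn rerank_changes_order(reranker: &dyn Reranker, baseline: &[String]) -> bool {
    reranker.rerank("q", baseline.to_vec()).as_slice() != baseline
}
```

Because the sentinel needs no model weights, a check like this can run in the default test suite without a model download.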

🤖 Generated with Claude Code


…baseline

The harness called `Indexer::search` directly, which does NOT run the reranker: the rerank stage lives in the server's `tool_search` (`rerank_hits` after `search`). So the `rerank`/`two_pass_rerank` configs were a silent no-op (byte-identical to single_pass/two_pass).

Fix: a `search_ranked` helper that replicates the server's native-lane path (fetch `rerank_candidate_pool(k)` candidates, apply `rerank_hits`, truncate to `k`), used by both the sequential and the concurrent-QPS passes. With rerank disabled it is exactly `search(q, k)`, so single_pass / two_pass are unchanged.

Regression guard: the smoke test now passes a deterministic `ReverseReranker` and asserts the `rerank` config's ranking DIFFERS from single_pass (nDCG drops under reversal); a no-op rerank fails it.

Baseline: committed `docs/eval/baseline-scifact.{md,json}`, a 1000-doc qrels-preserving SciFact subsample (CPU candle embedding can't do the full 5183 in reasonable time) with bge-base-en-v1.5 + bge-reranker-base.

Findings: two_pass is quality-neutral with a small latency cost (as designed at this corpus size); rerank REGRESSES nDCG@10 (0.846 → 0.671) and latency (~15 s/query on CPU) because it scores the 200-char snippet, not the full abstract. The report names the two concrete levers (fuller passages; GPU / smaller candidate pool).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jrosskopf merged commit 43c8c97 into main Jun 29, 2026
1 check passed

jrosskopf deleted the fix/eval-apply-rerank branch June 29, 2026 10:25

jrosskopf mentioned this pull request Jun 30, 2026

fix(retrieval): rerank scores the full block body, not the 200-char snippet #236

Merged

jrosskopf merged 1 commit into main from fix/eval-apply-rerank
