feat(bench): multi-hop eval harness — A4 iterative-retrieval arm + scoring#40
Stage-1 building blocks for the A1/A3/A4 multi-hop eval (D2). No wiring yet: the dataset loaders, the adapter, and the BenchmarkOpts.iterative flag land next.

- retrieval/iterative.ts: dependency-injected iterative/agentic retrieval (retrieve -> LLM names the bridge -> re-retrieve). recall and proposeNextQuery are injected so the control flow is unit-testable with no OpenAI/Neo4j. A round-robin rank interleave reserves top-K slots for each hop's evidence so the round-2 bridge survives into top-K; cycle, maxRounds, and STOP guards bound the loop.
- multihop/types.ts: normalized MultiHopItem for MuSiQue / 2WikiMultiHopQA / HotpotQA-distractor, one shape so the arms stay dataset-agnostic.
- multihop/scoring.ts: judge-free retrieval metrics: recall@K, all-support@K, and bridge-recall@K (hop > 1 supporting paragraphs), the metric that isolates the dissimilar hop-2 evidence single-shot dense misses. Bridge recall is not-applicable (null) for datasets that do not label hops.

Tests: 13 new (7 iterative control-flow incl. bridge recovery, 6 scoring), all Mac-runnable (type-only core imports, no native bindings). tsc clean.
Building blocks for the A1/A3/A4 multi-hop eval (the "does a working graph beat
iterative retrieval on its home turf" experiment). Bench-only — no runtime change
to any shipped package.
What's here
retrieval/iterative.ts: dependency-injected iterative/agentic retrieval
(retrieve -> LLM names the bridge -> re-retrieve). This is the A4 arm: the cheap
no-graph multi-hop competitor.
recall and proposeNextQuery are injected so the control flow is unit-testable without OpenAI/Neo4j.
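
To make the retrieve -> propose -> re-retrieve loop concrete, here is a minimal sketch of how such a loop could look. It is not the code in this PR: the Hit shape, the function signatures, the option names, and the literal "STOP" sentinel are all assumptions for illustration, and the injected functions are kept synchronous for brevity (real recall and LLM calls would presumably return promises).

```typescript
// Hypothetical shapes; the real types in retrieval/iterative.ts may differ.
interface Hit { id: string; text: string }
type Recall = (query: string, k: number) => Hit[];
type ProposeNextQuery = (question: string, evidence: Hit[]) => string;

// Assumed sentinel the LLM returns when it needs no further hop.
const STOP = "STOP";

// Round-robin rank interleave: take rank 0 of every round, then rank 1, and so
// on, so each hop's best evidence gets a slot before any round's tail does.
function interleave(rounds: Hit[][], k: number): Hit[] {
  const out: Hit[] = [];
  const seen = new Set<string>();
  const depth = Math.max(0, ...rounds.map((r) => r.length));
  for (let rank = 0; rank < depth && out.length < k; rank++) {
    for (const round of rounds) {
      const hit = round[rank];
      if (!hit || seen.has(hit.id)) continue; // short round or duplicate
      seen.add(hit.id);
      out.push(hit);
      if (out.length === k) break;
    }
  }
  return out;
}

function iterativeRetrieve(
  question: string,
  recall: Recall,
  proposeNextQuery: ProposeNextQuery,
  opts: { k: number; maxRounds: number } = { k: 4, maxRounds: 3 },
): Hit[] {
  const rounds: Hit[][] = [];
  const asked = new Set<string>();
  let query = question;
  for (let round = 0; round < opts.maxRounds; round++) {
    asked.add(query);
    rounds.push(recall(query, opts.k));
    if (round + 1 === opts.maxRounds) break; // maxRounds guard: skip the LLM call
    const next = proposeNextQuery(question, rounds.flat()).trim();
    if (next === STOP || asked.has(next)) break; // STOP guard and cycle guard
    query = next;
  }
  return interleave(rounds, opts.k);
}
```

Because both dependencies are plain function parameters, a test can pass an in-memory lookup for recall and a scripted stub for proposeNextQuery, then assert on the merged ranking without touching a model or a database.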
multihop/types.ts and multihop/scoring.ts: normalized MuSiQue / 2WikiMultiHopQA /
HotpotQA item shape + judge-free metrics (recall@K, all-support@K, and
bridge-recall@K, which isolates the dissimilar hop-2 evidence that single-shot
dense retrieval misses).
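
As a rough illustration of the three metrics, the sketch below scores one ranked list against a set of supporting paragraphs. The type and function names are hypothetical, and the definitions are assumptions: recall@K is taken as the fraction of supporting paragraphs in the top K, all-support@K as whether every one of them is there, and bridge-recall@K as the same fraction restricted to paragraphs with hop > 1. The actual multihop/scoring.ts may define them differently.

```typescript
// Hypothetical shapes; the real MultiHopItem in multihop/types.ts may differ.
interface SupportingParagraph {
  id: string;
  hop?: number; // 1-based hop index; absent when the dataset does not label hops
}

interface RetrievalScores {
  recallAtK: number;
  allSupportAtK: boolean;
  bridgeRecallAtK: number | null; // null means not applicable
}

function scoreRetrieval(
  rankedIds: string[],
  support: SupportingParagraph[],
  k: number,
): RetrievalScores {
  const topK = new Set(rankedIds.slice(0, k));
  const hitRate = (ps: SupportingParagraph[]) =>
    ps.filter((p) => topK.has(p.id)).length / ps.length;

  // Bridge recall needs hop labels on every supporting paragraph.
  const labelled = support.every((p) => p.hop !== undefined);
  const bridge = support.filter((p) => (p.hop ?? 1) > 1);

  return {
    // Assumption: an item with no supporting paragraphs scores 0.
    recallAtK: support.length > 0 ? hitRate(support) : 0,
    allSupportAtK: support.length > 0 && support.every((p) => topK.has(p.id)),
    bridgeRecallAtK: labelled && bridge.length > 0 ? hitRate(bridge) : null,
  };
}
```

Returning null rather than 0 for unlabelled items keeps them out of any bridge-recall average, so datasets without hop labels cannot drag the metric down.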
Not yet wired
Dataset loaders, the adapter .run, the BenchmarkOpts.iterative flag, and the arms runner land next. Those need the real datasets and run on the bench host.
Test plan
Mac-runnable (type-only core imports). bench 36/36, tsc clean.