From 99c5f63c7caa9c69bdbaa3c4efa84b29d3bb9a40 Mon Sep 17 00:00:00 2001
From: Raj Kripal Danday
Date: Fri, 15 May 2026 12:04:43 -0700
Subject: [PATCH] docs: LoCoMo retrieval algorithm comparison findings

Adds `papers/locomo-run/retrieval-algorithm-comparison.md` with results
from four retrieval configurations tested on conv-26 (n=199), plus
cross-validation on conv-30 and conv-41.

Uniform scoring shows a small gain on conv-26 (+0.014 F1 for uniform+QE
over baseline, +0.004 over qe-gte) but collapses on conv-41 (-0.181 F1
vs qe-gte). Recommendation: keep the vector+recency blend as default; do
not ship uniform scoring.

Also removes `tests/test_retrieval_uniform.py`, which was an untracked
orphan on main. It imports `_entity_match_score` from `core.retrieval`,
a symbol that does not exist on main (the uniform feature was never
merged). The test fails on collection with an ImportError.
---
 .../retrieval-algorithm-comparison.md | 36 +++++++++++++++++++
 1 file changed, 36 insertions(+)
 create mode 100644 papers/locomo-run/retrieval-algorithm-comparison.md

diff --git a/papers/locomo-run/retrieval-algorithm-comparison.md b/papers/locomo-run/retrieval-algorithm-comparison.md
new file mode 100644
index 0000000..3ea7a75
--- /dev/null
+++ b/papers/locomo-run/retrieval-algorithm-comparison.md
@@ -0,0 +1,36 @@
+# Retrieval Algorithm Comparison: LoCoMo Results
+
+Evaluated four retrieval configurations on the LoCoMo benchmark to determine whether uniform scoring or query expansion (QE) improves over the vector+recency baseline.
+
+## Configurations
+
+| Config | Scoring | QE |
+|---|---|---|
+| cashew-gte-baseline | vector + recency blend | no |
+| cashew-uniform-gte | uniform (no recency weight) | no |
+| cashew-qe-gte | vector + recency blend | yes |
+| cashew-uniform-gte+qe | uniform | yes |
+
+## conv-26 Results (n=199)
+
+| Config | F1 | ex@5 |
+|---|---|---|
+| cashew-gte-baseline | 0.537 | 0.394 |
+| cashew-uniform-gte | 0.544 | 0.402 |
+| cashew-qe-gte | 0.547 | 0.407 |
+| cashew-uniform-gte+qe | 0.551 | 0.413 |
+
+## Cross-Validation
+
+| Conv | n | cashew-qe-gte F1 | cashew-uniform-gte+qe F1 | Delta |
+|---|---|---|---|---|
+| conv-30 | 105 | 0.535 | 0.537 | +0.002 |
+| conv-41 | 193 | 0.464 | 0.283 | -0.181 |
+
+## Recommendation
+
+Keep vector+recency blend as default. Do not ship uniform scoring.
+
+Uniform scoring shows a small gain on conv-26 (+0.007 F1 over baseline without QE; +0.014 over baseline with QE, which is only +0.004 over qe-gte) but collapses on conv-41 (-0.181 F1 vs qe-gte). The gain is not robust across conversations, and the risk of a large regression outweighs the marginal upside.
+
+QE provides a consistent small improvement (+0.010 F1 over baseline on conv-26, roughly flat on conv-30, and stable on conv-41 at 0.464). If QE is ever shipped, it should use the vector+recency blend, not uniform scoring.
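
The deltas quoted in the recommendation follow directly from the tables, so they can be re-derived mechanically when reviewing the numbers. The following is a minimal sketch, assuming the F1 values are copied by hand from the tables above; the dictionary and function names are hypothetical and are not part of the repository or the LoCoMo harness.

```python
# F1 scores copied from the "conv-26 Results" table (hypothetical names).
CONV26_F1 = {
    "cashew-gte-baseline": 0.537,
    "cashew-uniform-gte": 0.544,
    "cashew-qe-gte": 0.547,
    "cashew-uniform-gte+qe": 0.551,
}

# (cashew-qe-gte F1, cashew-uniform-gte+qe F1) from the "Cross-Validation" table.
CROSS_VAL_F1 = {
    "conv-30": (0.535, 0.537),
    "conv-41": (0.464, 0.283),
}


def delta(new: float, old: float) -> float:
    """F1 difference, rounded to the three decimals the tables report."""
    return round(new - old, 3)


def conv26_deltas(scores: dict) -> dict:
    """Separate the uniform-scoring effect from the QE effect on conv-26."""
    base = scores["cashew-gte-baseline"]
    qe = scores["cashew-qe-gte"]
    both = scores["cashew-uniform-gte+qe"]
    return {
        "uniform_vs_baseline": delta(scores["cashew-uniform-gte"], base),
        "qe_vs_baseline": delta(qe, base),
        "uniform+qe_vs_baseline": delta(both, base),
        "uniform+qe_vs_qe": delta(both, qe),
    }


def cross_val_deltas(scores: dict) -> dict:
    """Delta column of the cross-validation table: uniform+QE minus QE."""
    return {conv: delta(uni_qe, qe) for conv, (qe, uni_qe) in scores.items()}


if __name__ == "__main__":
    for name, value in conv26_deltas(CONV26_F1).items():
        print(f"{name}: {value:+.3f}")
    for conv, value in cross_val_deltas(CROSS_VAL_F1).items():
        print(f"{conv}: {value:+.3f}")
```

Splitting the conv-26 comparison this way makes it visible that most of the +0.014 figure is attributable to QE, while the uniform-scoring contribution on top of QE is the smaller share.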