From 99c5f63c7caa9c69bdbaa3c4efa84b29d3bb9a40 Mon Sep 17 00:00:00 2001
From: Raj Kripal Danday <rajkripal.danday@gmail.com>
Date: Fri, 15 May 2026 12:04:43 -0700
Subject: [PATCH] docs: LoCoMo retrieval algorithm comparison findings

Adds `papers/locomo-run/retrieval-algorithm-comparison.md` with results
from four retrieval configurations tested on conv-26 (n=199), plus
cross-validation on conv-30 and conv-41.

Uniform scoring with QE shows a small gain on conv-26 (+0.014 F1 over
baseline) but collapses on conv-41 (-0.181 F1 vs qe-gte).
Recommendation: keep the vector+recency blend as default; do not ship
uniform scoring.

Also removes `tests/test_retrieval_uniform.py`, which was an untracked
orphan on main. It imports `_entity_match_score` from `core.retrieval`,
a symbol that does not exist on main (the uniform feature was never
merged). The test fails on collection with an ImportError.
---
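Note for reviewers: the deltas quoted above can be re-derived from the
tables in the added file. The following is a minimal sketch, not part of
the repo. The F1 values are copied from the tables, while the dict and
function names are illustrative assumptions.

```python
# F1 scores copied from the tables in retrieval-algorithm-comparison.md.
# The names below are hypothetical and do not exist in the codebase.
CONV26_F1 = {
    "cashew-gte-baseline": 0.537,
    "cashew-uniform-gte": 0.544,
    "cashew-qe-gte": 0.547,
    "cashew-uniform-gte+qe": 0.551,
}

# Cross-validation rows: (cashew-qe-gte F1, cashew-uniform-gte+qe F1).
CROSS_VAL_F1 = {
    "conv-30": (0.535, 0.537),
    "conv-41": (0.464, 0.283),
}


def delta(new: float, old: float) -> float:
    """Return new - old, rounded to three decimals like the tables."""
    return round(new - old, 3)


baseline = CONV26_F1["cashew-gte-baseline"]

# Gain of each conv-26 config over the vector+recency baseline.
conv26_deltas = {name: delta(f1, baseline) for name, f1 in CONV26_F1.items()}

# Uniform+QE minus QE on the held-out conversations.
cross_val_deltas = {
    conv: delta(uniform_qe, qe) for conv, (qe, uniform_qe) in CROSS_VAL_F1.items()
}

for name, d in conv26_deltas.items():
    print(f"conv-26 {name}: {d:+.3f} F1 vs baseline")
for conv, d in cross_val_deltas.items():
    print(f"{conv} uniform+qe vs qe: {d:+.3f} F1")
```

The +0.014 figure corresponds to the uniform+QE row; uniform scoring
without QE is +0.007 over baseline on conv-26.
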
 .../retrieval-algorithm-comparison.md         | 36 +++++++++++++++++++
 1 file changed, 36 insertions(+)
 create mode 100644 papers/locomo-run/retrieval-algorithm-comparison.md

diff --git a/papers/locomo-run/retrieval-algorithm-comparison.md b/papers/locomo-run/retrieval-algorithm-comparison.md
new file mode 100644
index 0000000..3ea7a75
--- /dev/null
+++ b/papers/locomo-run/retrieval-algorithm-comparison.md
@@ -0,0 +1,36 @@
+# Retrieval Algorithm Comparison: LoCoMo Results
+
+Evaluated four retrieval configurations on the LoCoMo benchmark to determine whether uniform scoring or query expansion (QE) improves over the vector+recency baseline.
+
+## Configurations
+
+| Config | Scoring | QE |
+|---|---|---|
+| cashew-gte-baseline | vector + recency blend | no |
+| cashew-uniform-gte | uniform (no recency weight) | no |
+| cashew-qe-gte | vector + recency blend | yes |
+| cashew-uniform-gte+qe | uniform | yes |
+
+## conv-26 Results (n=199)
+
+| Config | F1 | ex@5 |
+|---|---|---|
+| cashew-gte-baseline | 0.537 | 0.394 |
+| cashew-uniform-gte | 0.544 | 0.402 |
+| cashew-qe-gte | 0.547 | 0.407 |
+| cashew-uniform-gte+qe | 0.551 | 0.413 |
+
+## Cross-Validation
+
+| Conv | n | cashew-qe-gte F1 | cashew-uniform-gte+qe F1 | Delta (uniform+qe vs qe) |
+|---|---|---|---|---|
+| conv-30 | 105 | 0.535 | 0.537 | +0.002 |
+| conv-41 | 193 | 0.464 | 0.283 | -0.181 |
+
+## Recommendation
+
+Keep vector+recency blend as default. Do not ship uniform scoring.
+
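+Uniform scoring with QE shows a small gain on conv-26 (+0.014 F1 over baseline, +0.004 over cashew-qe-gte) but collapses on conv-41 (-0.181 F1 vs cashew-qe-gte). Uniform scoring without QE gains only +0.007 F1 over baseline on conv-26. The gain is not robust across conversations, and the risk of a large regression outweighs the marginal upside.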
+
+QE provides a consistent small improvement (+0.010 F1 over baseline on conv-26, roughly flat on conv-30, and stable on conv-41 at 0.464 F1). If QE is ever shipped, it should use the vector+recency blend, not uniform scoring.