Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 36 additions & 0 deletions papers/locomo-run/retrieval-algorithm-comparison.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# Retrieval Algorithm Comparison: LoCoMo Results

Evaluated four retrieval configurations on the LoCoMo benchmark to determine whether uniform scoring or query expansion (QE) improves over the vector+recency baseline.

## Configurations

| Config | Scoring | QE |
|---|---|---|
| cashew-gte-baseline | vector + recency blend | no |
| cashew-uniform-gte | uniform (no recency weight) | no |
| cashew-qe-gte | vector + recency blend | yes |
| cashew-uniform-gte+qe | uniform | yes |

## conv-26 Results (n=199)

| Config | F1 | ex@5 |
|---|---|---|
| cashew-gte-baseline | 0.537 | 0.394 |
| cashew-uniform-gte | 0.544 | 0.402 |
| cashew-qe-gte | 0.547 | 0.407 |
| cashew-uniform-gte+qe | 0.551 | 0.413 |

## Cross-Validation

| Conv | n | cashew-qe-gte F1 | cashew-uniform-gte+qe F1 | Delta |
|---|---|---|---|---|
| conv-30 | 105 | 0.535 | 0.537 | +0.002 |
| conv-41 | 193 | 0.464 | 0.283 | -0.181 |

## Recommendation

Keep vector+recency blend as default. Do not ship uniform scoring.

Uniform shows a small gain on conv-26 (+0.014 F1 over baseline) but collapses on conv-41 (-0.181 F1 vs qe-gte). The gain is not robust across conversations. The risk of a large regression outweighs the marginal upside.

QE provides a consistent small improvement (+0.010 F1 on conv-26, roughly flat on conv-30, and stable on conv-41 at 0.464). If QE is ever shipped, it should use the vector+recency blend, not uniform scoring.