alvinmurimi/StaleRAG

StaleRAG

Similarity tells you which memories look relevant, not which one is current.

Retrieval ranks by semantic similarity. That is the right signal for relevance, and it carries no information about time. So when an entity accumulates a history (a person's employers, a club's coaches, a company's owners), every statement about it reads almost the same, and similarity cannot separate the current value from the superseded ones. The system retrieves the right entity and still answers with a stale fact. Relevance and recency are two different signals, and standard retrieval only models the first.

That retrieval is blind to time is not a new observation, and fixing it with a temporal reranking layer is not a new idea either (see the prior art below). What this repo adds is measurement: it isolates where the time-blindness actually bites, on real data, with a contamination control. The result is a boundary, not an average:

| regime | does plain RAG fail? |
|---|---|
| a fact gets one correction that names its entity | no, it serves the current value about 99% of the time |
| an entity has a long succession of timestamped values | yes, about 13% pooled (but mostly in sports data, see below) |

The mechanism matters as much as the rates: this is a ranking failure, not a recall failure. In the tested succession-chain cases the current value is usually retrieved but not ranked at the top, because the statements about the entity are near-identical in wording and similarity has no time axis to break the tie.

The first regime is why some synthetic "memory serves stale facts" demos overstate the problem: their corrections are worded so that they do not name the entity, which makes them artificially hard to retrieve. The 99% figure has its own caveat: it holds for text that names entities cleanly, as ours does. On messy real prose with pronouns and inconsistent naming, single corrections could be harder than 99% suggests. The second regime is a real blind spot that gets worse the longer an agent's memory lives. Exact numbers, the paired significance test, and the contamination check are in RESULTS.md.
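To see why similarity cannot break the tie, consider a toy succession chain. The sketch below uses bag-of-words cosine as a stand-in for a dense embedding model (an assumption made only to keep the example dependency-free), and the player, clubs, and dates are invented for illustration. The statements differ only in the value and the date, and the question contains neither, so every statement receives the same score.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented example chain: same entity, same relation, different value and date.
question = "Which club does Alex Doe play for?"
chain = [
    "Alex Doe joined Rovers in 2015.",
    "Alex Doe joined United in 2019.",
    "Alex Doe joined City in 2023.",
]

scores = [cosine(bow(question), bow(s)) for s in chain]
for statement, score in zip(chain, scores):
    print(f"{score:.3f}  {statement}")
```

A real embedding model will not produce exact ties, but the ordering it does produce among such lines is not driven by the dates.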

Why the result holds up

  • Real data, not synthetic templates. Cases are built from real Wikidata fact-changes, ground-truthed by start-time and end-time qualifiers. No hand-written traps.
  • Contamination controlled. Closed-book runs (same questions, empty memory) score zero, so the open-book scores measure memory use, not the model answering from pretraining. That shuts down the usual "maybe the model already knows it" objection directly. Numbers in RESULTS.md.
  • Two regimes separated, not averaged. Single-correction and succession-chain cases are measured apart. That is where the signal is, and where averaged benchmarks lose it.
  • A baseline fix, not a new idea. Recency reranking is an established technique (see the prior art below). The repo ships a minimal version only to show the measured failure is addressable on the same budget; the contribution is the measurement, not the reranker.
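The ground-truthing in the first bullet can be sketched as follows. This is a minimal illustration, assuming each fact in an entity's history arrives as a record with a value plus the Wikidata start time (P580) and end time (P582) qualifiers as ISO date strings; the record shape and the function name `split_chain` are hypothetical, and how the repo's builder treats ties or missing qualifiers is not shown here.

```python
def split_chain(records):
    """Split one entity's value history into (current, superseded).

    ASSUMPTION: each record is {"value": str, "start": "YYYY-MM-DD",
    "end": "YYYY-MM-DD" or None}. An open-ended record (no end time) is
    preferred as current; otherwise the latest start wins.
    """
    if not records:
        raise ValueError("empty history")
    open_ended = [r for r in records if r["end"] is None]
    pool = open_ended or records
    # ISO dates compare correctly as plain strings.
    current = max(pool, key=lambda r: r["start"])
    superseded = [r for r in records if r is not current]
    return current, superseded


# Invented head-coach history for illustration.
history = [
    {"value": "Coach A", "start": "2012-07-01", "end": "2016-05-30"},
    {"value": "Coach B", "start": "2016-06-01", "end": "2021-05-30"},
    {"value": "Coach C", "start": "2021-06-01", "end": None},
]
current, superseded = split_chain(history)
print("current:", current["value"])
print("stale if served:", [r["value"] for r in superseded])
```

An answer that matches a superseded value rather than the current one is what the README calls a stale fact.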

The baseline reranker

recency_resolver.py is a minimal, dependency-free recency reranker, included as a baseline to show the measured failure is addressable. It is deliberately simple; for a production temporal layer (validity filtering, decay, conflict handling, version-chain dedup), see Emmimal/temporal-rag. You pass it a wide set of candidates with similarity scores and timestamps, and it returns a reordered top-k:

from recency_resolver import temporal_rerank

candidates = [
    {"text": "Acme's CEO is Alice Smith.",      "score": 0.83, "ts": "2019-01-01"},
    {"text": "Acme appointed Bob Jones as CEO.", "score": 0.41, "ts": "2024-06-01"},
    {"text": "Globex's CEO is Carol White.",     "score": 0.55, "ts": "2026-02-01"},
]
top = temporal_rerank("Who is the current CEO of Acme?", candidates, k=2)
# surfaces the recent Acme statement; ignores the newer but unrelated Globex line

It adds the signal similarity lacks. Among the candidates about the queried entity, it reserves part of the budget for the most recent one, found by IDF-weighted term overlap (rare entity names carry the match, common relation words are discounted). No model calls, no entity store. It does need to see the current value among its candidates, which is exactly what a narrow similarity top-k drops, so feed it a wide candidate pool (a large top-M, or the entity's scoped memory), not just your final top-k. In the benchmark it scores the full per-case memory, so treat it as a reranking layer over a wide prefetch rather than a free wrapper on a small top-k.
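The idea in the paragraph above can be sketched in a few lines. This is not the code in recency_resolver.py: `sketch_rerank`, the `scope_ratio` threshold, and the choice to compute IDF from the candidate pool itself are all assumptions made for illustration, and the memory lines are invented.

```python
import math
import re
from collections import Counter


def terms(text):
    """Lowercase word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def sketch_rerank(query, candidates, k=2, scope_ratio=0.6):
    """Reserve one of the k slots for the newest candidate about the queried entity."""
    if k <= 0 or not candidates:
        return []
    # ASSUMPTION: IDF comes from the candidate pool. Terms that appear in
    # every candidate (common relation words) get weight 0.
    n = len(candidates)
    df = Counter(t for c in candidates for t in terms(c["text"]))
    idf = {t: math.log(n / d) for t, d in df.items()}

    q = terms(query)
    overlap = [sum(idf.get(t, 0.0) for t in q & terms(c["text"])) for c in candidates]
    best = max(overlap)
    in_scope = [c for c, o in zip(candidates, overlap) if o > 0 and o >= scope_ratio * best]

    by_score = sorted(candidates, key=lambda c: c["score"], reverse=True)
    if not in_scope:
        return by_score[:k]  # nothing matched the entity: fall back to similarity
    newest = max(in_scope, key=lambda c: c["ts"])  # ISO dates sort as strings
    rest = [c for c in by_score if c is not newest]
    return [newest] + rest[: k - 1]


memory = [
    {"text": "Acme's CEO is Alice Smith.", "score": 0.83, "ts": "2015-01-01"},
    {"text": "Acme's CEO is Bob Jones.", "score": 0.82, "ts": "2019-03-01"},
    {"text": "Acme's CEO is Dana Lee.", "score": 0.81, "ts": "2024-06-01"},
    {"text": "Globex's CEO is Carol White.", "score": 0.55, "ts": "2026-02-01"},
    {"text": "Initech's CEO is Erin Gray.", "score": 0.50, "ts": "2025-01-01"},
]
top = sketch_rerank("Who is the current CEO of Acme?", memory, k=2)
print([c["text"] for c in top])
```

A plain similarity top-2 over this memory returns the two oldest Acme lines, while the sketch puts the 2024 line first and ignores the newer Globex and Initech lines. Pool-level IDF is the weak point of this sketch: if every candidate names the same entity, the entity term gets zero weight and the function falls back to similarity order, so a broader corpus would be a more sensible source for the weights.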

On real succession chains (390 cases across three relations) it cuts pooled stale-fact serving from about 13% to about 4% on the same reader budget (paired: fixed 38 of plain RAG's leaks, introduced 3, p < 0.001). But the effect is domain-dependent: large on the sports relations (clubs, head coaches) and marginal on the non-sports one (heads of government), where plain RAG already gets 94%. So the failure is partly a property of dense, clean sports chains, not universal. Where there is no failure (single corrections), it ties plain RAG. Per-relation numbers are in RESULTS.md.
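The paired counts can be sanity-checked by hand. The sketch below runs an exact two-sided McNemar test (a binomial test on the discordant pairs) using only the numbers quoted above; which paired test RESULTS.md actually reports is not stated here, so treat the choice of test as an assumption.

```python
from math import comb

# Counts quoted in the paragraph above.
fixed = 38        # cases plain RAG got wrong and the reranker got right
introduced = 3    # cases plain RAG got right and the reranker got wrong
cases = 390

# Exact McNemar: under the null, each discordant pair is a fair coin flip.
discordant = fixed + introduced
tail = sum(comb(discordant, i) for i in range(min(fixed, introduced) + 1))
p_two_sided = min(1.0, 2 * tail / 2 ** discordant)

# Net change in stale-fact serving, in percentage points of all cases.
net_points = 100 * (fixed - introduced) / cases

print(f"p = {p_two_sided:.2e}")
print(f"net reduction = {net_points:.1f} points")
```

The net reduction of roughly nine points matches the drop from about 13% to about 4%, and the p-value falls far below the 0.001 threshold.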

Reproduce

# build the succession test for three relations (clubs, head coaches, heads of government), then combine
python harness/build_succession.py --prop P54  --min-len 5 --cases 200 --out data/succ_p54.jsonl
python harness/build_succession.py --prop P286 --min-len 5 --cases 100 --out data/succ_p286.jsonl
python harness/build_succession.py --prop P6   --min-len 4 --cases 90  --out data/succ_p6.jsonl
cat data/succ_p54.jsonl data/succ_p286.jsonl data/succ_p6.jsonl > data/succession.jsonl

# score plain RAG and the resolver (needs a Google Gemini API key in GEMINI_API_KEY; the harness calls
# Gemini's OpenAI-compatible endpoint, so it is a Gemini key, not an OpenAI one)
python harness/run.py --system plain_rag        --data data/succession.jsonl --out results/rag.jsonl
python harness/run.py --system recency_resolver --data data/succession.jsonl --out results/res.jsonl

The harness fixes the reader model and prompt for every system, so the comparison is about retrieval, not the reader. Any system plugs in with two methods, ingest(memories, meta) and recall(question). Results stream to disk per case and the harness retries dropped connections, so a long run survives a network blip. The reader and judge default to gemini-3.5-flash (pinned, not a drifting alias).
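A system adapter with those two methods might look like the sketch below. Only the method names and argument names come from the text above; the data shapes are assumptions (a list of memory strings, a parallel list of metadata dicts with a `ts` field, and `recall` returning a list of memory strings), and the class `NewestMatchSystem` is hypothetical. Check harness/run.py for the actual contract before relying on any of this.

```python
import re


def words(text):
    """Lowercase word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class NewestMatchSystem:
    """Toy adapter: memories sharing a word with the question, newest first."""

    def __init__(self, k=3):
        self.k = k
        self.rows = []

    def ingest(self, memories, meta):
        # ASSUMPTION: memories is a list of strings and meta is a parallel
        # list of dicts, each carrying an ISO timestamp under "ts".
        self.rows = list(zip(memories, meta))

    def recall(self, question):
        # ASSUMPTION: the harness expects a list of memory strings back.
        q = words(question)
        hits = [(m, md) for m, md in self.rows if q & words(m)]
        hits.sort(key=lambda row: row[1]["ts"], reverse=True)
        return [m for m, _ in hits[: self.k]]


system = NewestMatchSystem(k=2)
system.ingest(
    ["Acme's CEO is Alice Smith.", "Acme's CEO is Dana Lee.", "The weather was mild."],
    [{"ts": "2015-01-01"}, {"ts": "2024-06-01"}, {"ts": "2025-01-01"}],
)
print(system.recall("Who leads Acme?"))
```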

What's in the repo

  • recency_resolver.py: a minimal baseline reranking implementation, usable without the rest of the repo.
  • harness/build_succession.py: builds the succession-chain test from real entities with long histories. This is where the failure lives.
  • harness/wikidata_extract.py: builds the broader real-data fact-update suite from Wikidata (QLever), every memory timestamped, no hand-labelling.
  • harness/run.py: the harness, the reference plain-RAG adapter, the resolver adapter, and the judge.
  • harness/generate.py, harness/generate_buried.py: the original synthetic suites. Useful as controlled probes, but the synthetic single-shot cases mostly test the reader; the real-data suites are the point.
  • data/realfact.jsonl: a frozen real-data suite (4,661 cases, 8 relations).

Where this sits in the literature

The core idea and the fix are not new. The contribution is the measurement, and it is worth being precise about that:

  • Time-blindness and the temporal-reranking fix already exist. Emmimal/temporal-rag ("RAG is blind to time") ships a post-retrieval temporal layer with validity filtering, decay, conflict handling, and version-chain dedup, with practitioner write-ups on the same point. Our recency reranker is a minimal baseline next to that, not an advance.
  • Temporal Wikidata QA already exists: TempLAMA (2022) and PAT-Questions (2024) build "who was X's CEO in year Y" from the same start/end qualifiers. The data shape is not new.
  • Memory-conflict evaluation already exists: MemoryAgentBench (2025) tests selective forgetting and conflict resolution, and the "freshness" line of work shows memory systems collapsing on it.
  • Recall benchmarks already exist: LoCoMo and LongMemEval ask whether a system can find what it stored.
  • Memory-poisoning security work (ASB, AgentPoison, PoisonedRAG) measures whether poisoned memory triggers a bad action, a different threat model.

What this adds, and only this: a contamination-controlled measurement on real fact-changes that separates the regime where retrieval already works (single corrections) from the regime where it fails (succession chains), with the mechanism isolated (there, the current value is usually out-ranked by older near-identical lines rather than absent). The temporal-RAG tools above are solutions; this measures, on real data, where the problem they address actually occurs. Benchmarking those tools on this suite is the obvious next step.

Honest limits

  • The failure and the fix are specific to succession chains. For one-off corrections, plain retrieval already suffices and the reranker is unnecessary.
  • The fix is a mitigation, not a cure: it reduces stale-serving, it does not eliminate it, and on a small number of cases it makes things worse.
  • The resolver must see the current value among its candidates. In the benchmark it scores the full per-case memory, so in production it needs a wide top-M prefetch or an entity/time index; it is not a free wrapper on a small top-k.
  • Entity scoping is IDF-weighted term overlap, which assumes that statements name the entity in clean, consistent text. Wikidata provides that; real prose often does not. Expect it to degrade on pronouns ("he joined later that year"), name variants ("Acme" vs "Acme Corp"), or an entity name split from its fact across chunks. It has not been tested on messy multi-topic text, and that is a real gap.
  • The effect is domain-dependent, not uniform: large on the sports relations, marginal on heads of government. How it behaves on corporate, legal, financial, or conversational text is unknown.
  • 390 cases across three relations, one embedding model, one reader. A clean finding, not a mature benchmark paper.
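The entity-scoping limit in the fourth bullet is easy to reproduce. The sketch below uses plain term containment as a simplified stand-in for the IDF-weighted overlap (an assumption; the helper `names_entity` is hypothetical), with invented statements that all describe the same company.

```python
import re


def terms(text):
    """Lowercase word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def names_entity(entity, statement):
    """True only if every term of the entity name appears in the statement."""
    return terms(entity) <= terms(statement)


entity = "Acme Corp"
statements = [
    "Acme Corp appointed Bob Jones as CEO.",  # clean naming
    "Acme appointed Bob Jones as CEO.",       # name variant
    "He joined later that year.",             # pronoun only
]
for s in statements:
    print(names_entity(entity, s), "|", s)
```

Only the first statement is scoped to the entity. The other two would fall outside the reranker's reserved slot even though a human reader would attach them to the same company.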

License

  • Code: Apache-2.0 (see LICENSE)
  • Data: CC-BY-4.0 (see data/)

Contact

alvinmayende@gmail.com

About

A benchmark for outdated retrieval in LLM memory under temporal drift, with a recency-reranking baseline.
