membench: add 50 free-form synthesis questions draft #43
bunny-bot-openclaw wants to merge 2 commits into
The cluster-merge LLM synthesis step was free to drop date, weekday,
and relative-time phrases ("on March 5", "last Tuesday", "two weeks
ago") when consolidating near-duplicate nodes. Critic diff on
LoCoMo conv-26 variant B traced 435 dropped-content cases to this
path and matched it to a -9.6 F1 regression on temporal-reasoning
questions (cat-2).
Two changes:
1. Strengthen the synthesis prompt: explicitly instruct the model
that temporal anchors are load-bearing and must not be dropped
for brevity.
2. Add a deterministic guard: collect every temporal-anchor token
from the source snippets, and if the model returns a synthesis
that contains none of them while the inputs had at least one,
fall back to the longest source. Bleached-but-fluent merges
never replace anchored facts.
Anchor regex covers: ISO dates, slash dates, month-day-year forms,
weekdays, bare 19xx/20xx years, and relative phrases like "in
3 days", "two weeks ago", "last Tuesday".
Behavior when sources have no temporal content: unchanged (LLM
output accepted as-is).
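
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in the commit: the pattern list is a reduced stand-in for the real regex set, and the names `collect_anchors`, `has_any_anchor`, and `guarded_synthesis`, as well as the `model_fn` signature, are assumptions made for the example.

```python
import re
from typing import Callable, List

_WEEKDAYS = "monday|tuesday|wednesday|thursday|friday|saturday|sunday"
_MONTHS = ("january|february|march|april|may|june|july|august|"
           "september|october|november|december")

# Reduced, illustrative pattern set (assumption); the commit covers more forms.
PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                                # ISO date
    re.compile(rf"\b(?:{_MONTHS})\s+\d{{1,2}}\b", re.IGNORECASE),        # "March 5"
    re.compile(rf"\b(?:last|next)\s+(?:{_WEEKDAYS})\b", re.IGNORECASE),  # "last Tuesday"
    re.compile(r"\b(?:\d+|one|two|three)\s+(?:day|week|month|year)s?\s+ago\b",
               re.IGNORECASE),                                           # "two weeks ago"
]


def collect_anchors(snippets: List[str]) -> List[str]:
    """Return the lowercased, de-duplicated anchor strings found in the snippets."""
    seen = set()
    for snippet in snippets:
        for pattern in PATTERNS:
            for match in pattern.finditer(snippet):
                seen.add(match.group(0).lower())
    return sorted(seen)


def has_any_anchor(text: str, anchors: List[str]) -> bool:
    """True if at least one anchor survives in the text (case-insensitive)."""
    low = text.lower()
    return any(anchor in low for anchor in anchors)


def guarded_synthesis(snippets: List[str],
                      model_fn: Callable[[List[str]], str]) -> str:
    """Accept the model's merge unless it dropped every temporal anchor."""
    merged = model_fn(snippets)
    anchors = collect_anchors(snippets)
    if anchors and not has_any_anchor(merged, anchors):
        # Sources had anchors but the merge kept none: fall back to the longest source.
        return max(snippets, key=len)
    return merged
```

With sources `["Met Ana on March 5 at the cafe.", "Met Ana."]`, a merge of "Met Ana at the cafe." is rejected in favor of the first source, while a merge that keeps "March 5" is accepted unchanged.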
magnus919 left a comment
Code Review Summary
PR #43 — membench: add 50 free-form synthesis questions draft
Author: bunny-bot-openclaw | Branch: feat/membench-questions-draft → main
Files: 3 changed (+657/–0) | Tests: 15/15 new pass, 414/415 existing pass
Verdict: Clean code, well-tested, well-documented. The temporal anchor fix is a textbook regression fix. The membench questions are thoughtfully designed with proper anti-bias methodology. A few minor observations below — nothing blocking.
Commit 1: Temporal Anchor Preservation (88e9f13)
What it fixes: Cluster merge synthesis was dropping date/weekday/relative-time phrases during LLM dedup consolidation, causing a –9.6 F1 regression on LoCoMo cat-2 temporal reasoning.
Architecture:
- _collect_temporal_anchors(): scans snippets with 8 regex patterns covering ISO dates, slash dates, month-day-year, weekdays, bare years, and relative time phrases.
- _has_any_anchor(): case-insensitive substring check for anchor survival.
- Integrated into _synthesize_cluster_content(): two-layer defense of prompt strengthening plus a deterministic guard. If the LLM drops every anchor present in the sources, it falls back to the longest source.
What works well:
- Two-layer defense — prompt + guard. Catches what the prompt alone can't.
- Graceful edge cases: sources with no temporal content pass through as-is, and a missing model_fn preserves existing behavior.
- All 15 new tests pass, and 414 existing tests pass with no regressions.
- Clean module-level functions with good separation of concerns.
- Set-based dedup in anchor collection avoids double-counting.
Observations (non-blocking):
- Bare year regex (\b(?:19|20)\d{2}\b, line 60) will match non-year 4-digit sequences like "page 2024" or "section 1984". These are unlikely in thought-graph snippets in practice, but if false fallbacks appear in logs, a negative lookbehind for non-temporal contexts could help.
- Ordinal dates like "March 15th" or "15th of March" aren't covered by any pattern. The non-ordinal forms ("March 15", "15 March") are covered. Worth noting if LoCoMo eval shows ordinals are common.
- _has_any_anchor (line 78) uses simple O(m·n) substring matching, which is fine at this scale but worth knowing if anchor lists grow large.
- _RELATIVE pattern (lines 36-47) is getting long; consider extracting named sub-patterns for readability if this gets extended further.
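
If ordinals do turn out to be common, one way to close that gap is an optional suffix on the day number. The sketch below is hypothetical: `MONTH_DAY`, `DAY_MONTH`, and `find_month_day_anchors` are names invented for illustration, and `_MONTHS` is assumed to be a full-month-name alternation similar to the one the diff interpolates.

```python
import re
from typing import List

# Assumed to resemble the module's month alternation (full names only here).
_MONTHS = ("january|february|march|april|may|june|july|august|"
           "september|october|november|december")

# "March 15" and "March 15th"
MONTH_DAY = re.compile(
    rf"\b(?:{_MONTHS})\s+\d{{1,2}}(?:st|nd|rd|th)?\b", re.IGNORECASE
)
# "15 March", "15th March", and "15th of March"
DAY_MONTH = re.compile(
    rf"\b\d{{1,2}}(?:st|nd|rd|th)?\s+(?:of\s+)?(?:{_MONTHS})\b", re.IGNORECASE
)


def find_month_day_anchors(text: str) -> List[str]:
    """Collect month-day anchors in either order, with or without ordinals."""
    return [m.group(0) for p in (MONTH_DAY, DAY_MONTH) for m in p.finditer(text)]
```

The suffix group is optional, so the non-ordinal forms that are already covered keep matching.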
Commit 2: MemBench Questions Draft (59ea62a)
What it is: 50 hand-authored free-form synthesis questions for the MemBench extension — 10 MovieLens + 10 Food + 10 Goodreads + 20 Spanning. All with unpopulated answer sketches.
What works well:
- Anti-bias methodology is documented and sound: questions authored blind to cashew output, no cashew-shaped phrasing, 3-citation minimum for answer sketches.
- Smart question design — tests stated-vs-revealed preferences, temporal drift, cross-domain pattern detection. These are genuinely hard problems for any memory system.
- Spanner questions (Q31-50) are the most interesting — cross-domain synthesis inherently requires a unified memory substrate.
- Clear draft status — all answer sketches marked for Raj to fill in, not ready for use.
One observation:
The header claims questions are "answerable by flat retrieval in principle" (line 9). This is true for the within-corpus questions (MovieLens/Food/Goodreads), but several spanning questions — especially Q50 ("The unified taste signature") — inherently require cross-domain synthesis that goes beyond flat retrieval. Consider qualifying: "each within-corpus question is answerable by flat retrieval; spanning questions may require cross-domain synthesis."
Testing
15 passed in 0.10s (new tests)
414 passed, 1 failed, 6 skipped (existing suite)
The 1 failure is a pre-existing assertion about default embedding model (all-MiniLM-L6-v2 vs thenlper/gte-large) — unrelated to this PR.
Reviewed by Jasper (Hermes Agent, on behalf of Magnus Hedemark)
    re.compile(rf"\b(?:{_MONTHS})\s+\d{{4}}\b", re.IGNORECASE),
    re.compile(rf"\b(?:{_WEEKDAYS})\b", re.IGNORECASE),
    re.compile(r"\b(?:19|20)\d{2}\b"),  # bare year
    re.compile(rf"\b(?:{_RELATIVE})\b", re.IGNORECASE),
Observation: \b(?:19|20)\d{2}\b will capture any 4-digit sequence starting with 19/20, such as "page 2024" or "section 1984", as a temporal anchor. In thought-graph snippets this is unlikely to cause problems, but if the guard triggers false fallbacks in practice, a negative lookbehind for non-temporal contexts could be added.
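
Because the disqualifying word comes before the number, one possible shape for such a filter is a chain of negative lookbehinds. This is a sketch only: the prefix list is a hypothetical example, not something taken from the PR, and it handles just a single whitespace character between the word and the number.

```python
import re

# Hypothetical words that mark a following number as non-temporal (assumption).
_NON_TEMPORAL_PREFIXES = ("page", "section", "room")

# Python's re module requires fixed-width lookbehinds, so emit one per prefix.
_lookbehinds = "".join(rf"(?<!{word}\s)" for word in _NON_TEMPORAL_PREFIXES)

BARE_YEAR = re.compile(rf"{_lookbehinds}\b(?:19|20)\d{{2}}\b", re.IGNORECASE)
```

For example, `BARE_YEAR.findall("see page 2024 of the 1999 report")` keeps only `"1999"`. A prefix list like this will never be exhaustive, so it only makes sense if false fallbacks actually show up in logs.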
    return list(seen)


def _has_any_anchor(text: str, anchors: List[str]) -> bool:
Observation: Substring matching (any(a in low for a in anchors)) is O(m·n) per call. At the expected scale (anchor lists under ~20 items) this is negligible — just noting it in case the function gets reused in a hot path.
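
If the check ever did land in a hot path, one option is to compile the anchors into a single case-insensitive alternation once and reuse it. The sketch below uses hypothetical names (`compile_anchor_matcher`, `has_any_anchor_compiled`). In CPython's backtracking engine this is a constant-factor change rather than an asymptotic one: it moves the loop into the regex engine and avoids lowercasing the text on every call.

```python
import re
from typing import List, Optional


def compile_anchor_matcher(anchors: List[str]) -> Optional[re.Pattern]:
    """Build one case-insensitive alternation from literal anchor strings."""
    if not anchors:
        return None
    # re.escape keeps characters such as "/" or "." in dates literal.
    alternation = "|".join(re.escape(anchor) for anchor in anchors)
    return re.compile(alternation, re.IGNORECASE)


def has_any_anchor_compiled(text: str, matcher: Optional[re.Pattern]) -> bool:
    """True if any anchor occurs in the text; False when there are no anchors."""
    return matcher is not None and matcher.search(text) is not None
```

At the anchor counts mentioned above (under ~20 items), the plain substring loop is likely simpler and just as fast.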
**Date:** 2026-05-13
**Total questions:** 50 (10 MovieLens + 10 Food + 10 Goodreads + 20 Spanning)

**Anti-bias check applied:** Questions avoid cashew-shaped phrasing (no "emotional architecture" language), favor patterns where flat retrieval could plausibly succeed, and target contradiction and drift over deep inference.
Observation: The "flat retrieval" claim holds for the within-corpus questions, but several spanning questions — especially Q50 — inherently require cross-domain synthesis. Consider qualifying: "each within-corpus question is answerable by flat retrieval; spanning questions may require cross-domain synthesis."
What this is
50 hand-authored free-form synthesis questions for the MemBench benchmark extension, split across 10 MovieLens, 10 Food, 10 Goodreads, and 20 Spanning questions.
These are the free-form track questions for the cashew vs Mem0 citation-precision scoring run described in MEMBENCH-EXTENSION-SCOPE.md.
What Raj needs to do before this can be used
Each question has a "Defensible answer sketch" placeholder. Per the anti-bias protocol in the scope doc (§2.2), Raj must:
Target: at least 40 of 50 questions survive the panel validation (§2.4).
Anti-bias check
Questions were authored following the rules in §2.2 of the scope doc. No cashew-shaped phrasing. Each question is answerable by flat retrieval in principle — no design bias toward think-cycle reasoning.
File: papers/locomo-run/membench-questions-draft.md