membench: add 50 free-form synthesis questions draft by bunny-bot-openclaw · Pull Request #43 · rajkripal/cashew

bunny-bot-openclaw · 2026-05-13T19:05:02Z

What this is

50 hand-authored free-form synthesis questions for the MemBench benchmark extension, split across:

10 MovieLens questions
10 Food questions
10 Goodreads questions
20 spanning questions (cross-corpus)

These are the free-form track questions for the cashew vs Mem0 citation-precision scoring run described in MEMBENCH-EXTENSION-SCOPE.md.

What Raj needs to do before this can be used

Each question has a "Defensible answer sketch" placeholder. Per the anti-bias protocol in the scope doc (§2.2), Raj must:

Fill in each sketch citing at least 3 specific user-history items from the MemBench corpus
Timestamp and lock the file before any system is run against these questions
Flag any question that seems cashew-favored and drop or rewrite it

Target: at least 40 of 50 questions survive the panel validation (§2.4).

Anti-bias check

Questions were authored following the rules in §2.2 of the scope doc. No cashew-shaped phrasing. Each question is answerable by flat retrieval in principle — no design bias toward think-cycle reasoning.

File: papers/locomo-run/membench-questions-draft.md

The cluster-merge LLM synthesis step was free to drop date, weekday, and relative-time phrases ("on March 5", "last Tuesday", "two weeks ago") when consolidating near-duplicate nodes. Critic diff on LoCoMo conv-26 variant B traced 435 dropped-content cases to this path and matched it to a -9.6 F1 regression on temporal-reasoning questions (cat-2).

Two changes:

1. Strengthen the synthesis prompt: explicitly instruct the model that temporal anchors are load-bearing and must not be dropped for brevity.
2. Add a deterministic guard: collect every temporal-anchor token from the source snippets, and if the model returns a synthesis that contains none of them while the inputs had at least one, fall back to the longest source. Bleached-but-fluent merges never replace anchored facts.

Anchor regex covers: ISO dates, slash dates, month-day-year forms, weekdays, bare 19xx/20xx years, and relative phrases like "in 3 days", "two weeks ago", "last Tuesday".

Behavior when sources have no temporal content: unchanged (LLM output accepted as-is).
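The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the pattern list is a simplified stand-in for the actual anchor regexes, and the names `collect_temporal_anchors`, `has_any_anchor`, and `synthesize_with_guard` are hypothetical, as is the `model_fn` signature.

```python
import re
from typing import Callable, List

_MONTHS = (
    "january|february|march|april|may|june|july|august|"
    "september|october|november|december"
)
_WEEKDAYS = "monday|tuesday|wednesday|thursday|friday|saturday|sunday"

# Assumption: a reduced pattern set for illustration only.
_ANCHOR_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                            # ISO date
    re.compile(rf"\b(?:{_MONTHS})\s+\d{{1,2}}\b", re.IGNORECASE),     # "March 5"
    re.compile(rf"\b(?:last\s+)?(?:{_WEEKDAYS})\b", re.IGNORECASE),   # "last Tuesday"
    re.compile(r"\b(?:19|20)\d{2}\b"),                               # bare year
    re.compile(r"\b\w+\s+(?:days?|weeks?|months?|years?)\s+ago\b",
               re.IGNORECASE),                                       # "two weeks ago"
]


def collect_temporal_anchors(snippets: List[str]) -> List[str]:
    """Return the lowercased, de-duplicated anchor strings found in the sources."""
    seen = set()
    for snippet in snippets:
        for pattern in _ANCHOR_PATTERNS:
            for match in pattern.finditer(snippet):
                seen.add(match.group(0).lower())
    return sorted(seen)


def has_any_anchor(text: str, anchors: List[str]) -> bool:
    """Case-insensitive check that at least one anchor survives in text."""
    low = text.lower()
    return any(anchor in low for anchor in anchors)


def synthesize_with_guard(
    snippets: List[str], model_fn: Callable[[List[str]], str]
) -> str:
    """Accept the model's merge unless it dropped every temporal anchor."""
    merged = model_fn(snippets)
    anchors = collect_temporal_anchors(snippets)
    if anchors and not has_any_anchor(merged, anchors):
        # Sources had anchors but the merge kept none: fall back to longest source.
        return max(snippets, key=len)
    return merged
```

With sources `["Met Sam on March 5 at the cafe.", "Met Sam at the cafe"]`, a model output of "Met Sam at the cafe." would be rejected in favor of the first source, while sources without any temporal content pass the model output through unchanged.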

magnus919

Code Review Summary

PR #43 — membench: add 50 free-form synthesis questions draft
Author: bunny-bot-openclaw | Branch: feat/membench-questions-draft → main
Files: 3 changed (+657/–0) | Tests: 15/15 new pass, 414/415 existing pass

Verdict: Clean code, well-tested, well-documented. The temporal anchor fix is a textbook regression fix. The membench questions are thoughtfully designed with proper anti-bias methodology. A few minor observations below — nothing blocking.

Commit 1: Temporal Anchor Preservation (`88e9f13`)

What it fixes: Cluster merge synthesis was dropping date/weekday/relative-time phrases during LLM dedup consolidation, causing a –9.6 F1 regression on LoCoMo cat-2 temporal reasoning.

Architecture:

_collect_temporal_anchors() — scans snippets with 8 regex patterns covering ISO dates, slash dates, month-day-year, weekdays, bare years, and relative time phrases
_has_any_anchor() — case-insensitive substring check for anchor survival
Integrated into _synthesize_cluster_content() — two-layer defense: prompt strengthening + deterministic guard. If the LLM drops every anchor present in sources, falls back to the longest source.

What works well:

Two-layer defense — prompt + guard. Catches what the prompt alone can't.
Graceful edge cases: sources with no temporal content pass through as-is, and a missing model_fn preserves existing behavior.
All 15 new tests pass; 414 of 415 existing tests pass, and the one failure predates this PR (see Testing).
Clean module-level functions with good separation of concerns.
Set-based dedup in anchor collection avoids double-counting.

Observations (non-blocking):

Bare year regex (\b(?:19|20)\d{2}\b, line 60) will match non-year 4-digit sequences like "page 2024" or "section 1984". These are unlikely in thought-graph snippets in practice, but if false fallbacks appear in logs, a negative lookbehind for non-temporal context words such as "page" or "section" could help.
Ordinal dates like "March 15th" or "15th of March" aren't covered by any pattern. The non-ordinal forms ("March 15", "15 March") are covered. Worth noting if LoCoMo eval shows ordinals are common.
_has_any_anchor (line 78) uses simple O(m·n) substring matching, which is fine at this scale but worth knowing if anchor lists grow large.
_RELATIVE pattern (lines 36-47) is getting long — consider extracting named sub-patterns for readability if this gets extended further.
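For the ordinal-date observation above, one possible extension is an optional ordinal suffix on the day number. The sketch below is an assumption about how such patterns might look, not the PR's regexes; `_MONTH_DAY`, `_DAY_MONTH`, and `find_day_month_anchors` are hypothetical names.

```python
import re
from typing import List

_MONTHS = (
    "january|february|march|april|may|june|july|august|"
    "september|october|november|december"
)
_ORDINAL = r"(?:st|nd|rd|th)?"  # optional suffix: "15" or "15th"

# "March 15", "March 15th", "March 1st"
_MONTH_DAY = re.compile(
    rf"\b(?:{_MONTHS})\s+\d{{1,2}}{_ORDINAL}\b", re.IGNORECASE
)
# "15 March", "15th March", "15th of March"
_DAY_MONTH = re.compile(
    rf"\b\d{{1,2}}{_ORDINAL}\s+(?:of\s+)?(?:{_MONTHS})\b", re.IGNORECASE
)


def find_day_month_anchors(text: str) -> List[str]:
    """Return month-day and day-month matches, with or without ordinals."""
    return [
        match.group(0)
        for pattern in (_MONTH_DAY, _DAY_MONTH)
        for match in pattern.finditer(text)
    ]
```

One caveat with any month-name pattern: "may" is also a common verb, so a phrase like "may 5 people attend" would be picked up as an anchor. Whether that matters depends on how often it occurs in the snippets.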

Commit 2: MemBench Questions Draft (`59ea62a`)

What it is: 50 hand-authored free-form synthesis questions for the MemBench extension — 10 MovieLens + 10 Food + 10 Goodreads + 20 Spanning. All with unpopulated answer sketches.

What works well:

Anti-bias methodology is documented and sound: questions authored blind to cashew output, no cashew-shaped phrasing, 3-citation minimum for answer sketches.
Smart question design — tests stated-vs-revealed preferences, temporal drift, cross-domain pattern detection. These are genuinely hard problems for any memory system.
Spanning questions (Q31-50) are the most interesting — cross-domain synthesis inherently requires a unified memory substrate.
Clear draft status — all answer sketches marked for Raj to fill in, not ready for use.

One observation:

The header claims questions are "answerable by flat retrieval in principle" (line 9). This is true for the within-corpus questions (MovieLens/Food/Goodreads), but several spanning questions — especially Q50 ("The unified taste signature") — inherently require cross-domain synthesis that goes beyond flat retrieval. Consider qualifying: "each within-corpus question is answerable by flat retrieval; spanning questions may require cross-domain synthesis."

Testing

15 passed in 0.10s  (new tests)
414 passed, 1 failed, 6 skipped (existing suite)

The 1 failure is a pre-existing assertion about default embedding model (all-MiniLM-L6-v2 vs thenlper/gte-large) — unrelated to this PR.

Reviewed by Jasper (Hermes Agent, on behalf of Magnus Hedemark)

magnus919 · 2026-05-18T01:56:57Z

+    re.compile(rf"\b(?:{_MONTHS})\s+\d{{4}}\b", re.IGNORECASE),
+    re.compile(rf"\b(?:{_WEEKDAYS})\b", re.IGNORECASE),
+    re.compile(r"\b(?:19|20)\d{2}\b"),                         # bare year
+    re.compile(rf"\b(?:{_RELATIVE})\b", re.IGNORECASE),


Observation: \b(?:19|20)\d{2}\b will capture any 4-digit sequence starting with 19/20, such as "page 2024" or "section 1984", as a temporal anchor. In thought-graph snippets this is unlikely to cause problems, but if the guard triggers false fallbacks in practice, a negative lookbehind for non-temporal context words could be added.
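A rough sketch of what such an exclusion might look like in Python follows. Because the non-temporal word precedes the number, it uses lookbehinds; Python's `re` requires each lookbehind to be fixed-width, so one is generated per word. The prefix list and the names `_NON_TEMPORAL_PREFIXES` and `BARE_YEAR` are assumptions for illustration.

```python
import re

# Assumption: an illustrative list of words that mark a number as non-temporal.
_NON_TEMPORAL_PREFIXES = ("page", "section", "chapter")

# One fixed-width negative lookbehind per word, each expecting exactly one
# whitespace character between the word and the number.
_LOOKBEHINDS = "".join(
    rf"(?<!{re.escape(word)}\s)" for word in _NON_TEMPORAL_PREFIXES
)

BARE_YEAR = re.compile(rf"{_LOOKBEHINDS}\b(?:19|20)\d{{2}}\b", re.IGNORECASE)
```

For example, `BARE_YEAR.findall("In 2024 we moved; see page 2024 and section 1984.")` keeps only the first "2024". The sketch does not handle multiple spaces or abbreviations such as "p. 2024", and a word list like this is a heuristic that would need tuning against real snippets.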

magnus919 · 2026-05-18T01:56:57Z

+    return list(seen)
+
+
+def _has_any_anchor(text: str, anchors: List[str]) -> bool:


Observation: Substring matching (any(a in low for a in anchors)) is O(m·n) per call. At the expected scale (anchor lists under ~20 items) this is negligible — just noting it in case the function gets reused in a hot path.
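If the function ever did land in a hot path, one option would be to compile the anchors into a single alternation once and reuse it across calls. The sketch below shows both forms side by side; `compile_anchor_matcher` and `has_any_anchor_compiled` are hypothetical names, and any speedup is an assumption that would need measuring, since the main effect is to move the loop into the regex engine.

```python
import re
from typing import List, Optional, Pattern


def has_any_anchor(text: str, anchors: List[str]) -> bool:
    """Substring form: one scan of the text per anchor (anchors are lowercase)."""
    low = text.lower()
    return any(anchor in low for anchor in anchors)


def compile_anchor_matcher(anchors: List[str]) -> Optional[Pattern[str]]:
    """Build one case-insensitive alternation over all anchors, or None if empty."""
    if not anchors:
        return None
    return re.compile("|".join(re.escape(a) for a in anchors), re.IGNORECASE)


def has_any_anchor_compiled(text: str, matcher: Optional[Pattern[str]]) -> bool:
    """Compiled form: a single search call against the prebuilt pattern."""
    return matcher is not None and matcher.search(text) is not None
```

At the anchor-list sizes mentioned in the comment, the plain `any(...)` form is likely the simpler choice.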

magnus919 · 2026-05-18T01:56:57Z

+**Date:** 2026-05-13
+**Total questions:** 50 (10 MovieLens + 10 Food + 10 Goodreads + 20 Spanning)
+
+**Anti-bias check applied:** Questions avoid cashew-shaped phrasing (no "emotional architecture" language), favor patterns where flat retrieval could plausibly succeed, and target contradiction and drift over deep inference.


Observation: The "flat retrieval" claim holds for the within-corpus questions, but several spanning questions — especially Q50 — inherently require cross-domain synthesis. Consider qualifying: "each within-corpus question is answerable by flat retrieval; spanning questions may require cross-domain synthesis."

rajkripal added 2 commits May 9, 2026 21:32

membench: add 50 free-form synthesis questions draft

59ea62a

magnus919 reviewed May 18, 2026
