membench: add 50 free-form synthesis questions draft #43

Open
bunny-bot-openclaw wants to merge 2 commits into main from feat/membench-questions-draft

@bunny-bot-openclaw

Collaborator

What this is

50 hand-authored free-form synthesis questions for the MemBench benchmark extension, split across:

  • 10 MovieLens questions
  • 10 Food questions
  • 10 Goodreads questions
  • 20 spanning questions (cross-corpus)

These are the free-form track questions for the cashew vs Mem0 citation-precision scoring run described in MEMBENCH-EXTENSION-SCOPE.md.

What Raj needs to do before this can be used

Each question has a "Defensible answer sketch" placeholder. Per the anti-bias protocol in the scope doc (§2.2), Raj must:

  1. Fill in each sketch citing at least 3 specific user-history items from the MemBench corpus
  2. Timestamp and lock the file before any system is run against these questions
  3. Flag any question that seems cashew-favored and drop or rewrite it
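One way the timestamp-and-lock step could be made checkable is to record a content hash alongside the lock time. This is only a sketch: the scope doc may prescribe a different mechanism, and the names `lock_record` and `is_unchanged` are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def lock_record(content: bytes) -> dict:
    """Return a hash plus UTC timestamp for the questions file (hypothetical helper)."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),
        "locked_at": datetime.now(timezone.utc).isoformat(),
    }


def is_unchanged(content: bytes, record: dict) -> bool:
    """Check that the file content still matches the hash stored at lock time."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]
```

Storing the record before any system run would let a later reader confirm that the sketches were not edited after results were seen.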

Target: at least 40 of 50 questions survive the panel validation (§2.4).

Anti-bias check

Questions were authored following the rules in §2.2 of the scope doc. No cashew-shaped phrasing. Each question is answerable by flat retrieval in principle — no design bias toward think-cycle reasoning.

File: papers/locomo-run/membench-questions-draft.md

rajkripal added 2 commits May 9, 2026 21:32
The cluster-merge LLM synthesis step was free to drop date, weekday,
and relative-time phrases ("on March 5", "last Tuesday", "two weeks
ago") when consolidating near-duplicate nodes. Critic diff on
LoCoMo conv-26 variant B traced 435 dropped-content cases to this
path and matched it to a -9.6 F1 regression on temporal-reasoning
questions (cat-2).

Two changes:

1. Strengthen the synthesis prompt: explicitly instruct the model
   that temporal anchors are load-bearing and must not be dropped
   for brevity.

2. Add a deterministic guard: collect every temporal-anchor token
   from the source snippets, and if the model returns a synthesis
   that contains none of them while the inputs had at least one,
   fall back to the longest source. Bleached-but-fluent merges
   never replace anchored facts.

Anchor regex covers: ISO dates, slash dates, month-day-year forms,
weekdays, bare 19xx/20xx years, and relative phrases like "in
3 days", "two weeks ago", "last Tuesday".

Behavior when sources have no temporal content: unchanged (LLM
output accepted as-is).
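The anchor categories listed above can be sketched roughly as follows. This is an illustrative approximation, not the code in core/sleep.py: the vocabulary in `_NUM`, `_UNIT`, and `_RELATIVE` and the function name `collect_temporal_anchors` are assumptions, and the real pattern list may differ.

```python
import re
from typing import List

_MONTHS = ("January|February|March|April|May|June|July|August|"
           "September|October|November|December")
_WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
# Assumed vocabulary for relative phrases; the PR's list is likely longer.
_NUM = r"\d+|one|two|three|four|five|six|seven|eight|nine|ten"
_UNIT = r"days?|weeks?|months?|years?"
_RELATIVE = (
    rf"(?:{_NUM})\s+(?:{_UNIT})\s+ago"
    rf"|in\s+(?:{_NUM})\s+(?:{_UNIT})"
    rf"|(?:last|next|this)\s+(?:{_WEEKDAYS}|week|month|year)"
)

_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                 # ISO date
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),           # slash date
    re.compile(rf"\b(?:{_MONTHS})\s+\d{{1,2}}(?:,\s*\d{{4}})?\b", re.IGNORECASE),
    re.compile(rf"\b(?:{_MONTHS})\s+\d{{4}}\b", re.IGNORECASE),
    re.compile(rf"\b(?:{_WEEKDAYS})\b", re.IGNORECASE),
    re.compile(r"\b(?:19|20)\d{2}\b"),                    # bare year
    re.compile(rf"\b(?:{_RELATIVE})\b", re.IGNORECASE),
]


def collect_temporal_anchors(snippets: List[str]) -> List[str]:
    """Collect every distinct anchor phrase (lowercased) found in the snippets."""
    seen = set()
    for snippet in snippets:
        for pattern in _PATTERNS:
            for match in pattern.finditer(snippet):
                seen.add(match.group(0).lower())
    return sorted(seen)  # sorted only to make the output deterministic
```

Lowercasing at collection time lets a later survival check compare against a lowercased synthesis without re-running the patterns.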

@magnus919 magnus919 left a comment

Contributor

Code Review Summary

PR #43: membench: add 50 free-form synthesis questions draft
Author: bunny-bot-openclaw | Branch: feat/membench-questions-draft → main
Files: 3 changed (+657/–0) | Tests: 15/15 new pass, 414/415 existing pass

Verdict: Clean code, well-tested, well-documented. The temporal anchor fix is a textbook regression fix. The membench questions are thoughtfully designed with proper anti-bias methodology. A few minor observations below — nothing blocking.


Commit 1: Temporal Anchor Preservation (88e9f13)

What it fixes: Cluster merge synthesis was dropping date/weekday/relative-time phrases during LLM dedup consolidation, causing a –9.6 F1 regression on LoCoMo cat-2 temporal reasoning.

Architecture:

  • _collect_temporal_anchors() — scans snippets with 8 regex patterns covering ISO dates, slash dates, month-day-year, weekdays, bare years, and relative time phrases
  • _has_any_anchor() — case-insensitive substring check for anchor survival
  • Integrated into _synthesize_cluster_content() — two-layer defense: prompt strengthening + deterministic guard. If the LLM drops every anchor present in sources, falls back to the longest source.
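The deterministic guard described in this list can be sketched as below. It is a simplified stand-in under stated assumptions: `_ANCHOR` covers only ISO dates and weekdays, and `collect_anchors`, `has_any_anchor`, and `guarded_merge` are hypothetical names rather than the functions in core/sleep.py.

```python
import re
from typing import List

# Assumption: a reduced anchor pattern (ISO dates and weekdays only).
_ANCHOR = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"
    r"|\b(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b",
    re.IGNORECASE,
)


def collect_anchors(sources: List[str]) -> List[str]:
    """Distinct lowercased anchors found across all source snippets."""
    return sorted({m.group(0).lower() for s in sources for m in _ANCHOR.finditer(s)})


def has_any_anchor(text: str, anchors: List[str]) -> bool:
    """Case-insensitive substring check: did at least one anchor survive?"""
    low = text.lower()
    return any(a in low for a in anchors)


def guarded_merge(sources: List[str], synthesized: str) -> str:
    """Accept the LLM synthesis unless it dropped every anchor the sources had."""
    anchors = collect_anchors(sources)
    if anchors and not has_any_anchor(synthesized, anchors):
        return max(sources, key=len)  # fall back to the longest source
    return synthesized
```

Note that the guard only fires when every anchor is lost. A synthesis that keeps one of several anchors is still accepted, which matches the behavior described above.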

What works well:

  • Two-layer defense — prompt + guard. Catches what the prompt alone can't.
  • Graceful edge cases: sources with no temporal content pass through as-is, and a missing model_fn preserves existing behavior.
  • All 15 new tests pass, and the 414 passing existing tests show no new regressions.
  • Clean module-level functions with good separation of concerns.
  • Set-based dedup in anchor collection avoids double-counting.

Observations (non-blocking):

  1. Bare year regex (\b(?:19|20)\d{2}\b, line 60) will match any 4-digit number starting with 19 or 20, including non-years such as "page 2024" or "section 1984". These are unlikely in thought-graph snippets in practice, but if false fallbacks appear in logs, a negative lookbehind for non-temporal contexts could help.

  2. Ordinal dates like "March 15th" or "15th of March" aren't covered by any pattern. The non-ordinal forms ("March 15", "15 March") are covered. Worth noting if LoCoMo eval shows ordinals are common.

  3. _has_any_anchor (line 78) uses simple O(m·n) substring matching, which is fine at this scale but worth knowing if anchor lists grow large.

  4. _RELATIVE pattern (lines 36-47) is getting long — consider extracting named sub-patterns for readability if this gets extended further.
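To illustrate observation 2, a pattern of the general non-ordinal shape fails on "March 15th" because the trailing \b cannot sit between "5" and "t". An optional ordinal suffix closes the gap. The patterns below are a sketch with hypothetical names, not the ones in core/sleep.py.

```python
import re

_MONTHS = ("January|February|March|April|May|June|July|August|"
           "September|October|November|December")
_ORD = r"(?:st|nd|rd|th)?"  # optional ordinal suffix

# Non-ordinal shape, shown for comparison only.
PLAIN_MONTH_DAY = re.compile(rf"\b(?:{_MONTHS})\s+\d{{1,2}}\b", re.IGNORECASE)

# Ordinal-tolerant variants: "March 15th" and "15th of March" / "15 March".
MONTH_DAY = re.compile(rf"\b(?:{_MONTHS})\s+\d{{1,2}}{_ORD}\b", re.IGNORECASE)
DAY_MONTH = re.compile(rf"\b\d{{1,2}}{_ORD}\s+(?:of\s+)?(?:{_MONTHS})\b", re.IGNORECASE)
```

Keeping the suffix in a named fragment such as `_ORD` is also in the spirit of observation 4: small named pieces are easier to extend than one long alternation.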


Commit 2: MemBench Questions Draft (59ea62a)

What it is: 50 hand-authored free-form synthesis questions for the MemBench extension — 10 MovieLens + 10 Food + 10 Goodreads + 20 Spanning. All with unpopulated answer sketches.

What works well:

  • Anti-bias methodology is documented and sound: questions authored blind to cashew output, no cashew-shaped phrasing, 3-citation minimum for answer sketches.
  • Smart question design — tests stated-vs-revealed preferences, temporal drift, cross-domain pattern detection. These are genuinely hard problems for any memory system.
  • Spanning questions (Q31-50) are the most interesting, since cross-domain synthesis inherently requires a unified memory substrate.
  • Clear draft status — all answer sketches marked for Raj to fill in, not ready for use.

One observation:

The header claims questions are "answerable by flat retrieval in principle" (line 9). This is true for the within-corpus questions (MovieLens/Food/Goodreads), but several spanning questions — especially Q50 ("The unified taste signature") — inherently require cross-domain synthesis that goes beyond flat retrieval. Consider qualifying: "each within-corpus question is answerable by flat retrieval; spanning questions may require cross-domain synthesis."


Testing

15 passed in 0.10s  (new tests)
414 passed, 1 failed, 6 skipped (existing suite)

The 1 failure is a pre-existing assertion about default embedding model (all-MiniLM-L6-v2 vs thenlper/gte-large) — unrelated to this PR.


Reviewed by Jasper (Hermes Agent, on behalf of Magnus Hedemark)

Comment thread core/sleep.py
re.compile(rf"\b(?:{_MONTHS})\s+\d{{4}}\b", re.IGNORECASE),
re.compile(rf"\b(?:{_WEEKDAYS})\b", re.IGNORECASE),
re.compile(r"\b(?:19|20)\d{2}\b"), # bare year
re.compile(rf"\b(?:{_RELATIVE})\b", re.IGNORECASE),

Contributor

Observation: \b(?:19|20)\d{2}\b will capture any 4-digit sequence starting with 19/20 (for example "page 2024" or "section 1984") as a temporal anchor. In thought-graph snippets this is unlikely to cause problems, but if the guard triggers false fallbacks in practice, a negative lookbehind for non-temporal contexts could be added.
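Because the disambiguating word precedes the number in these examples, a lookbehind is the natural tool. A minimal sketch, assuming a small hypothetical list of context words (`page`, `section`, `chapter`) followed by exactly one space:

```python
import re

BARE_YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

# Python's re module requires fixed-width lookbehinds, so each context
# word gets its own assertion. The word list is an assumption.
GUARDED_YEAR = re.compile(
    r"(?<!page )(?<!section )(?<!chapter )\b(?:19|20)\d{2}\b",
    re.IGNORECASE,
)
```

This only handles a single space between the context word and the number. Variable whitespace would need a different approach, such as matching the context word as part of the pattern and discarding those matches afterwards.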

Comment thread core/sleep.py
return list(seen)


def _has_any_anchor(text: str, anchors: List[str]) -> bool:

Contributor

Observation: Substring matching (any(a in low for a in anchors)) is O(m·n) per call. At the expected scale (anchor lists under ~20 items) this is negligible — just noting it in case the function gets reused in a hot path.
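If the check ever did move into a hot path, one option would be to precompile the anchors into a single alternation. The sketch below uses hypothetical names (`has_any_anchor_scan`, `build_anchor_matcher`). CPython's regex engine still tries alternatives in turn, so this is not guaranteed to be faster and would need benchmarking before adoption.

```python
import re
from typing import Callable, List


def has_any_anchor_scan(text: str, anchors: List[str]) -> bool:
    """The simple form: one substring search per anchor."""
    low = text.lower()
    return any(a in low for a in anchors)


def build_anchor_matcher(anchors: List[str]) -> Callable[[str], bool]:
    """Compile the anchors once and return a reusable predicate."""
    if not anchors:
        return lambda text: False
    pattern = re.compile("|".join(re.escape(a) for a in anchors), re.IGNORECASE)
    return lambda text: pattern.search(text) is not None
```

For ASCII anchors the two forms return the same answer. The compiled form mainly pays off when the same anchor list is checked against many candidate texts.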

Comment thread papers/locomo-run/membench-questions-draft.md
**Date:** 2026-05-13
**Total questions:** 50 (10 MovieLens + 10 Food + 10 Goodreads + 20 Spanning)

**Anti-bias check applied:** Questions avoid cashew-shaped phrasing (no "emotional architecture" language), favor patterns where flat retrieval could plausibly succeed, and target contradiction and drift over deep inference.

Contributor

Observation: The "flat retrieval" claim holds for the within-corpus questions, but several spanning questions — especially Q50 — inherently require cross-domain synthesis. Consider qualifying: "each within-corpus question is answerable by flat retrieval; spanning questions may require cross-domain synthesis."

3 participants