Memory overhaul: forget() correctness fix (Phase 1, deployed) + graph-measurement harness (Phase 0)#38

Merged
muhammadkh4n merged 9 commits into main from feat/memory-overhaul on Jun 8, 2026

Conversation

@muhammadkh4n

Owner

Summary

Two workstreams from the memory-system overhaul:

  1. Phase 1 — forget() correctness fix (user-facing; already deployed to rexvps prod + verified).
  2. Phase 0 — bench graph-measurement harness (bench-only, default-off → zero runtime/behaviour change).

Phase 1 — forget() actually forgets now (DEPLOYED)

The bug: forget() was inverted. The episode path called recordAccess() (access_count++), which computeScore rewards (accessBoost), so forgetting a memory raised its recall rank. Procedural was a no-op; semantic floored a confidence value no recall path reads. This is the root cause of ~19 sessions of memory-gardening toil (the ~7,200-duplicate resurfacing).
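To make the inversion concrete, here is a minimal TypeScript sketch. The Episode shape and the accessBoost formula are assumptions for illustration only; the PR states that computeScore rewards access_count through accessBoost but does not show either function.

```typescript
// Hypothetical episode shape, not the project's real type.
interface Episode {
  id: string;
  similarity: number;
  accessCount: number;
}

// Assumed boost: grows with accessCount and is capped. The real formula is not shown in the PR.
function accessBoost(accessCount: number): number {
  return Math.min(0.2, 0.02 * accessCount);
}

// Assumed scoring: base similarity plus the access boost.
function computeScore(e: Episode): number {
  return e.similarity + accessBoost(e.accessCount);
}

// The old episode forget path, reduced to its only effect: it recorded an access.
function oldForget(e: Episode): void {
  e.accessCount += 1;
}

const target: Episode = { id: "a", similarity: 0.7, accessCount: 0 };
const before = computeScore(target);
oldForget(target);
const after = computeScore(target);

console.log(after > before); // true: "forgetting" raised the score
```

Under any scoring that increases with access_count, a forget implemented as an access can only push the memory up the ranking, which matches the resurfacing described above.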

The fix: forget = tombstone only. A new markForgotten(ids) storage method stamps a forgotten_at timestamp and touches neither access_count nor confidence. Every recall path gates on forgotten_at IS NULL (cloned from the proven superseded_by IS NULL gate):

  • SQLite (offline/CI path): migration v5 adds forgotten_at + partial indexes; vector + BM25 + deep-recall all gated.
  • PostgREST (production): forgotten_at columns + indexes + engram_mark_forgotten RPC + the gate in all 4 recall RPCs (12 branches). Applied + verified on the live 47K-row prod DB via a BEGIN/ROLLBACK round-trip (forget → absent, access_count unchanged, zero pollution); PostgREST schema cache reloaded.
  • Neo4j (red-team blocker): forgetMemories stamps :Memory.forgottenAt; the spreading-activation Cypher gates traversal on coalesce(n.forgottenAt, n.deletedAt) IS NULL (also prunes paths through forgotten nodes) so forget can't leak through the graph channel. Verified on real Neo4j 5.
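The tombstone contract described above can be sketched with an in-memory store. This is an illustrative model, not the project's storage code: InMemoryStore, Row, and the substring recall are hypothetical, and only the behaviour (stamp forgotten_at, leave access_count alone, gate every recall on the tombstone, return the count of newly stamped rows) is taken from the PR.

```typescript
// Hypothetical row shape mirroring the columns named in the PR.
interface Row {
  id: string;
  text: string;
  accessCount: number;
  forgottenAt: number | null;
}

class InMemoryStore {
  private rows = new Map<string, Row>();

  add(id: string, text: string): void {
    this.rows.set(id, { id, text, accessCount: 0, forgottenAt: null });
  }

  // Tombstone only: stamps forgottenAt and never touches accessCount.
  // Returns the number of rows newly stamped, so a repeated call returns 0.
  markForgotten(ids: string[], now: number = Date.now()): number {
    let stamped = 0;
    for (const id of ids) {
      const row = this.rows.get(id);
      if (row && row.forgottenAt === null) {
        row.forgottenAt = now;
        stamped++;
      }
    }
    return stamped;
  }

  // Stand-in for any recall path: the gate is the analogue of "forgotten_at IS NULL".
  recall(query: string): Row[] {
    return Array.from(this.rows.values()).filter(
      (r) => r.forgottenAt === null && r.text.includes(query),
    );
  }

  get(id: string): Row | undefined {
    return this.rows.get(id);
  }
}

const store = new InMemoryStore();
store.add("a", "deploy notes for rexvps");
store.add("b", "deploy checklist");

const first = store.markForgotten(["a"]); // 1
const second = store.markForgotten(["a"]); // 0, already tombstoned
const hits = store.recall("deploy").map((r) => r.id); // ["b"]
console.log(first, second, hits);
```

Because the gate is a null check on a column that defaults to NULL, rows that were never forgotten behave exactly as before.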

Deploy status: live on rexvps (schema applied + reloaded, /opt/engram MCP service rebuilt/restarted, end-to-end memory_recall confirmed). Schema change is additive + backward-compatible (forgotten_at defaults NULL → no existing row hidden).

Phase 0 — make the graph measurable (bench-only, default-off)

Discovery: both bench adapters scored only recallResult.memories, while graph output lands in the ignored recallResult.associations channel — so graph:true vs graph:false mathematically could not move recall@K. "The graph doesn't help" was a measurement-instrumentation bug, not a verdict.

This PR builds the harness to actually measure it:

  • mergeAssociationsIntoScored + BenchmarkOpts.mergeAssociationsIntoTopK (default false → byte-identical runs) — unions the graph channel into the scored top-K. Scale-safe (gold-id set-membership metric).
  • 4-cell {graph}×{rerank} matrix runner + computeGraphEffect (recall@K lift on the graph-relevant split) + classifyRecallStructure.
  • graphVerdict — symmetric kill criterion (never conclude "kill" from a saturated aggregate; requires null aggregate and flat graphEffect on the graph-visible split with n≥100).
  • requireGraph hard-fail guard (no SQL-only result masquerading as a graph result) + per-unit wipeBenchGraph isolation.
  • Bench correctness fix: adapters now flushPendingWrites() before eval — they never awaited the fire-and-forget graph decomposition, so graph cells recalled against a half-built graph.

Finding (why this is default-off): an empirical run showed LongMemEval-S is saturated — vector+BM25+rerank already hits 100% R@5 on the multi-hop/temporal split with the graph OFF, so recall@K is blind to any graph enhancement here. LoCoMo was abandoned (documented dataset flaws + confused category map). So the graph keep/kill decision is deferred to a fair test (fix the disabled PPR + a multi-hop chaining benchmark like MuSiQue/2Wiki) — out of scope for this PR. The harness is valid tooling for that future test; it ships default-off with no runtime effect.


Behaviour change & risk

| Area | Change | Risk |
|---|---|---|
| forget() | Now correct (was inverted) | Low: already deployed + verified on prod; schema additive/backward-compatible |
| Recall | None | forgotten_at IS NULL is a dormant no-op until something is forgotten |
| Bench (Phase 0) | New tooling, mergeAssociationsIntoTopK default false | None: bench-only, zero runtime change |

Testing

sqlite 106/106 · core 495/495 · postgrest 71/71 · graph 12/12 · bench 23/23; typecheck clean across all touched packages. Phase 1 verified on real Postgres 17 + pgvector, real Neo4j 5, and production rexvps.

Follow-ups (NOT in this PR)

  • The graph "fair test": bind the existing seedActivations map into the spreading-activation Cypher (PPR is currently disabled via an activation=1.0 hardcode) + a MuSiQue/2Wiki adapter.
  • Turn spreading-activation off for default recall (it can only add noise to the precise dense pipeline).

…offline path)

The shipped forget() was inverted: on episodes it called recordAccess()
(access_count++), a term recall ranking REWARDS via accessBoost — so
forgetting a memory RAISED its recall rank; on semantic it floored a
confidence value no recall path reads; procedural was a no-op. Net: forget
did nothing useful or worse, which is why ~7,200 "forgotten" memories
resurfaced across 19 manual gardening sessions.

This lands the offline-testable core of the fix (core contract + SQLite +
PostgREST adapter; the PostgREST schema/RPC + Neo4j gate follow):

- storage.ts: add markForgotten(ids): Promise<number> to Episode/Semantic/
  Procedural storage — a tombstone that sets forgotten_at and touches
  NEITHER access_count NOR confidence.
- memory.ts forget(): rewrite the confirm path to call markForgotten per
  tier; drop the recordAccess/recordAccessAndBoost calls and the confidence
  floor (the forgotten_at tombstone is the single source of truth). Remove
  the now-unused CONFIDENCE_FLOOR.
- sqlite: migration v5 adds forgotten_at (+ partial index) to episodes/
  semantic/procedural; markForgotten impls; AND forgotten_at IS NULL gate
  cloned onto every recall path (vectorSearch + textBoost + the per-store
  hybrid/BM25/vector fallbacks), mirroring the proven superseded_by gate.
- postgrest adapters: markForgotten via table PATCH (schema lands next).
- tests: forget-e2e (recall gate excludes a tombstoned memory while the
  sibling survives; forget removes matched content; access_count NOT bumped;
  confirm=false no-op; idempotent at both levels) — would fail on old code.
  migration test updated to v5 + a forgotten_at column assertion.

sqlite 106/106, core 495/495, typecheck clean.

…ten (Phase 1 production path)

Carry the forget() tombstone into the PostgREST schema so forget removes
content from every recall path on the production (Postgres+pgvector) backend,
matching the SQLite v5 offline path.

- forgotten_at timestamptz on memory_episodes/semantic/procedural, added both
  in the CREATE TABLE bodies and via idempotent ADD COLUMN IF NOT EXISTS so
  re-applying onto an already-provisioned DB actually adds the column.
- engram_mark_forgotten(p_memory_type, p_ids): stamps forgotten_at and touches
  neither access_count nor confidence (writing access_count was the inverted-
  forget bug; flooring confidence was dead). Idempotent, returns rows stamped.
- AND forgotten_at IS NULL gate in all 4 recall RPCs (engram_hybrid_recall,
  engram_recall, engram_text_boost, engram_vector_search) across the episode,
  semantic and procedural branches. Digests are not forgettable.
- partial indexes on tombstoned rows (lockstep with the SQLite v5 indexes).
- EOF post-apply smoke executes every recall RPC + engram_mark_forgotten so a
  missing column or broken gate surfaces at apply time. The dump emits
  functions before tables and relies on check_function_bodies=false, so the
  forgotten_at columns live in the table section and the smoke is the
  call-time guard; there is no migration runner.

Verified on Postgres 17 + pgvector: forget round-trip across all three types,
sibling survives, forgotten row's access_count unchanged. New schema-gate test
pins the predicates; postgrest suite 71/71, typecheck clean.

…through the graph

After the SQL forget gate, a forgotten memory could still surface through graph
spreading activation — and would leak into authoritative recall once
associations are merged into the scored pool. Close the graph channel.

- GraphPort.forgetMemories?(ids): optional, capability-guarded port method.
- NeuralGraph.forgetMemories stamps forgottenAt on :Memory {id} nodes
  (idempotent, returns count). Memory nodes are uniformly :Memory regardless of
  memoryType, so one match covers episode/semantic/procedural ids.
- spreading-activation Cypher gates traversal on
  coalesce(n.forgottenAt, n.deletedAt) IS NULL (NULL-permissive), which also
  prunes paths that pass through a forgotten node.
- Memory.forget() calls graph.forgetMemories after the SQL tombstone,
  capability-guarded and non-fatal: SQL stays the source of truth, and a graph
  hiccup or absent Neo4j never fails the forget.
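The "capability-guarded and non-fatal" ordering above might look roughly like the following. This is a sketch under assumptions: the Promise<number> return type on forgetMemories and the forgetEverywhere helper are hypothetical; the PR only says the port method is optional, returns a count, and must never fail the forget.

```typescript
// Assumed port shape: the method is optional, so callers must check for it.
interface GraphPort {
  forgetMemories?(ids: string[]): Promise<number>;
}

// Hypothetical helper: SQL tombstone first, then a best-effort graph tombstone.
async function forgetEverywhere(
  markForgotten: (ids: string[]) => Promise<number>,
  graph: GraphPort | null,
  ids: string[],
): Promise<{ sqlStamped: number; graphStamped: number | null }> {
  // SQL is the source of truth, so its errors are allowed to propagate.
  const sqlStamped = await markForgotten(ids);

  let graphStamped: number | null = null;
  if (graph?.forgetMemories) {
    try {
      graphStamped = await graph.forgetMemories(ids);
    } catch {
      // A graph failure is swallowed: the SQL tombstone already hides the memory.
      graphStamped = null;
    }
  }
  return { sqlStamped, graphStamped };
}

// A graph that always fails, standing in for an unreachable Neo4j.
const flakyGraph: GraphPort = {
  forgetMemories: async () => {
    throw new Error("neo4j unavailable");
  },
};

forgetEverywhere(async (ids) => ids.length, flakyGraph, ["a", "b"]).then((r) =>
  console.log(r.sqlStamped, r.graphStamped), // 2 null
);
```

The design choice is that the graph tombstone is a second line of defence: losing it degrades graph hygiene, while failing the whole forget would bring back the original user-facing bug.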

Verified on Neo4j 5: spreading-activation 12/12 incl. endpoint exclusion,
path-through pruning, sibling survival, idempotency (1 then 0) and cross-type
semantic gating. core 495/495, sqlite 106/106, typecheck clean.

…red (Phase 0, unit 1)

Discovery #1: both bench adapters scored only recallResult.memories, but graph
spreading-activation output lands in the separate recallResult.associations
channel. So graph:true vs graph:false mathematically could not move recall@K —
"the graph doesn't help" was a measurement-instrumentation bug, not a verdict.

- mergeAssociationsIntoScored(recallResult, flag): when the flag is set, unions
  associations after the memory channel; otherwise returns memories unchanged.
- BenchmarkOpts.mergeAssociationsIntoTopK (default false → byte-identical runs).
- Wired into both LongMemEval (runQuestion) and LoCoMo (evaluateDataset) scoring.
- Score-scale-safe by construction: both adapters score by gold-id set-membership
  over the deduped top-K, not score magnitude. So unioning the graph-relevance-
  ranked associations after the MMR/cross-encoder-ranked memories cannot be
  confounded by the scale mismatch — a gold id is either in the first K deduped
  ids or it is not. Memory-first ordering means associations can only RESCUE a
  gold id the memory channel missed, never displace one.
- associations-visible-to-scored.test.ts: deterministic invariant (no Neo4j, no
  LLM, no dataset) — gold present in the scored pool with merge ON, absent OFF.
  This is the "associations-visible" invariant the symmetric kill criterion
  (later unit) depends on.
- Adds packages/bench/vitest.config.ts so the gate test runs under turbo/CI.
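The scale-safety argument for the merge can be illustrated with a small sketch. The types, the topKIds helper, and the "any gold id in the top K" hit rule are assumptions for illustration; only the memory-first union and the set-membership scoring come from the commit message.

```typescript
// Hypothetical minimal shapes for the two recall channels.
interface Scored {
  id: string;
  score: number;
}
interface RecallResult {
  memories: Scored[];
  associations: Scored[];
}

// Memory-first union when the flag is set; otherwise the memory channel unchanged.
function mergeAssociationsIntoScored(r: RecallResult, merge: boolean): Scored[] {
  return merge ? r.memories.concat(r.associations) : r.memories;
}

// First K distinct ids in list order. Scores are never compared across channels.
function topKIds(scored: Scored[], k: number): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const s of scored) {
    if (out.length >= k) break;
    if (!seen.has(s.id)) {
      seen.add(s.id);
      out.push(s.id);
    }
  }
  return out;
}

// Assumed hit rule: a question counts as recalled if any gold id is in the top K.
function hitAtK(scored: Scored[], gold: Set<string>, k: number): boolean {
  return topKIds(scored, k).some((id) => gold.has(id));
}

const result: RecallResult = {
  memories: [
    { id: "m1", score: 0.9 },
    { id: "m2", score: 0.8 },
  ],
  // Graph scores live on a different scale; that is harmless for set membership.
  associations: [
    { id: "g1", score: 12.5 },
    { id: "m1", score: 3.0 },
  ],
};
const gold = new Set(["g1"]);

const hitOff = hitAtK(mergeAssociationsIntoScored(result, false), gold, 3); // false
const hitOn = hitAtK(mergeAssociationsIntoScored(result, true), gold, 3); // true
console.log(hitOff, hitOn);
```

Note that when the memory channel already fills all K slots, the appended associations never reach the scored window, which is the "rescue, never displace" property.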

Default off → zero behaviour change to existing runs. bench typecheck clean, 4/4.

Encodes the red-team rule that prevents the historical mistake: concluding
"kill the graph" from the saturated LongMemEval-S aggregate (~98.8% recall@5,
where nothing has headroom to move). graphVerdict() demands POSITIVE evidence
of no-effect before a kill:

- kill ONLY when the primary aggregate delta is null/negative AND graphEffect is
  flat (≤ epsilon) on a graph-visible split with n ≥ 100, and the
  associations-visible invariant is green. Never the aggregate alone.
- keep as soon as graphEffect clears epsilon on a powered, visible split — even
  when the saturated aggregate is flat or negative.
- insufficient_power when underpowered (n<100), when the invariant is red, or in
  the ambiguous aggregate-positive-but-flat-effect case.
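The three rules above can be written as a small pure function. This is a sketch, not the shipped graphVerdict: the input shape is hypothetical, "null/negative aggregate" is read here as aggregateDelta <= 0, and "flat" as graphEffect <= epsilon. The constants 100 and 0.005 are the MIN_POWER_N and DEFAULT_EPSILON values quoted in the review summary.

```typescript
type Verdict = "keep" | "kill" | "insufficient_power";

// Hypothetical input shape for illustration.
interface VerdictInput {
  aggregateDelta: number; // primary aggregate recall@K delta, graph on minus off
  graphEffect: number; // recall@K lift on the graph-visible split
  n: number; // size of that split
  invariantGreen: boolean; // associations-visible invariant status
}

const MIN_POWER_N = 100;
const DEFAULT_EPSILON = 0.005;

function graphVerdict(input: VerdictInput, epsilon: number = DEFAULT_EPSILON): Verdict {
  // No decision without a trustworthy, sufficiently large split.
  if (!input.invariantGreen || input.n < MIN_POWER_N) return "insufficient_power";
  // Positive evidence on the split wins, even when the saturated aggregate is flat or negative.
  if (input.graphEffect > epsilon) return "keep";
  // Kill needs both signals: flat effect on the split and a null or negative aggregate.
  if (input.aggregateDelta <= 0) return "kill";
  // Aggregate positive but effect flat: ambiguous.
  return "insufficient_power";
}

console.log(
  graphVerdict({ aggregateDelta: -0.01, graphEffect: 0.03, n: 150, invariantGreen: true }),
); // "keep"
```

The asymmetry is deliberate in this reading: "keep" needs one positive signal on a powered split, while "kill" needs every condition to line up.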

Pure and deterministic; 6 table-driven cases pin the asymmetry, the power gate,
and the invariant dependency. bench typecheck clean, 10/10.

…units 3-4)

createBenchMemory now returns a {memory, config, graphActuallyWired} handle
instead of a bare Memory, so graph cells can reach the graph handle and hard-
fail when it is absent.

- requireGraph(handle): throws if a graph cell runs without a real bench Neo4j,
  killing the silent SQL-only fallback that would otherwise report a SQL delta
  as a graph result (the "graph was never measured" trap). Lives in a
  dependency-light bench-memory-handle module so the guard + types are
  unit-testable without loading the ONNX native binding.
- wipeBenchGraph wired into LongMemEval runQuestion and LoCoMo runConversation
  before ingest: Neo4j is a shared external process (unlike the per-call fresh
  :memory: SQLite), so each question/conversation must start with a clean graph
  or the previous unit's nodes pollute spreading activation. (wipeBenchGraph
  existed but was called nowhere.)
- Migrated all 5 createBenchMemory callers to destructure the handle.
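A hard-fail guard of this kind can be very small. The sketch below assumes a simplified handle: the field names memory, config, and graphActuallyWired come from the commit message, while the config contents and the error text are hypothetical.

```typescript
// Simplified handle; the real config type is not shown in the PR.
interface BenchMemoryHandle<M> {
  memory: M;
  config: { graph: boolean; rerank: boolean }; // assumed fields
  graphActuallyWired: boolean;
}

// Throws instead of letting a graph cell silently run SQL-only.
function requireGraph<M>(handle: BenchMemoryHandle<M>): void {
  if (!handle.graphActuallyWired) {
    throw new Error(
      "graph cell requested but no bench Neo4j is wired; refusing to report a SQL-only result as a graph result",
    );
  }
}

const unwired: BenchMemoryHandle<object> = {
  memory: {},
  config: { graph: true, rerank: false },
  graphActuallyWired: false,
};

let guardMessage = "";
try {
  requireGraph(unwired);
} catch (err) {
  guardMessage = err instanceof Error ? err.message : String(err);
}
console.log(guardMessage.length > 0); // true: the guard fired
```

Keeping the guard in a module with no native dependencies is what lets it be unit-tested without loading the ONNX binding, as the commit message notes.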

bench typecheck clean, 12/12.

…/3 gate filter (Phase 0, units 6-8b)

- classifyRecallStructure: deterministic label {lookup, multi_hop, temporal,
  aggregation} from dataset signals (LoCoMo category, LongMemEval ability) with
  a gold-cardinality + temporal-token heuristic fallback. GRAPH_RELEVANT =
  {multi_hop, temporal} — the split where spreading activation should help.
- computeGraphEffect: recall@K(merge ON) − recall@K(merge OFF) on the
  graph-relevant split (or the stronger graph-visible split when per-question
  graphCouldContribute is supplied). This is the scale-independent lift that
  feeds graphVerdict; an empty split returns zero effect, so with the n<100
  power gate no decision is ever fabricated.
- LoCoMo categories filter (BenchmarkOpts.categories): score only the requested
  categories (e.g. [2,3]) while ingesting the corpus whole — filters the metric,
  not the graph the recall traverses. Canonical category map locked from
  judge-adapter: 1=single_hop 2=multi_hop 3=temporal 4=open_domain 5=adversarial.
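The lift computation described above can be sketched as follows. The Outcome shape, the result shape, and the rule for choosing between the two splits (use the graph-visible split whenever any outcome carries graphCouldContribute) are assumptions for illustration; the labels and GRAPH_RELEVANT membership come from the commit message.

```typescript
type Structure = "lookup" | "multi_hop" | "temporal" | "aggregation";

const GRAPH_RELEVANT: ReadonlySet<Structure> = new Set<Structure>(["multi_hop", "temporal"]);

// Hypothetical per-question outcome: one paired hit flag per merge setting.
interface Outcome {
  id: string;
  structure: Structure;
  hitOn: boolean; // recall@K hit with merge ON
  hitOff: boolean; // recall@K hit with merge OFF
  graphCouldContribute?: boolean;
}

interface GraphEffect {
  n: number;
  recallOn: number;
  recallOff: number;
  effect: number;
}

function computeGraphEffect(outcomes: Outcome[]): GraphEffect {
  // Assumed rule: prefer the stronger graph-visible split when the flag is supplied.
  const hasVisibility = outcomes.some((o) => o.graphCouldContribute !== undefined);
  const split = hasVisibility
    ? outcomes.filter((o) => o.graphCouldContribute === true)
    : outcomes.filter((o) => GRAPH_RELEVANT.has(o.structure));

  // An empty split yields zero effect rather than NaN, so nothing is fabricated.
  if (split.length === 0) return { n: 0, recallOn: 0, recallOff: 0, effect: 0 };

  const recallOn = split.filter((o) => o.hitOn).length / split.length;
  const recallOff = split.filter((o) => o.hitOff).length / split.length;
  return { n: split.length, recallOn, recallOff, effect: recallOn - recallOff };
}

const sample: Outcome[] = [
  { id: "q1", structure: "multi_hop", hitOn: true, hitOff: false }, // rescued by the graph
  { id: "q2", structure: "multi_hop", hitOn: true, hitOff: true },
  { id: "q3", structure: "temporal", hitOn: false, hitOff: false },
  { id: "q4", structure: "lookup", hitOn: true, hitOff: true }, // outside the split
];

const sampleEffect = computeGraphEffect(sample);
console.log(sampleEffect.n, sampleEffect.effect.toFixed(3)); // 3 "0.333"
```

With only three questions in the split, a downstream power gate of n >= 100 would still refuse to draw a conclusion from this lift.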

bench typecheck clean, 20/20.

… unit 5)

Completes the Phase 0 measurement harness. compareMatrix runs the 4 cells, each
graph-on cell with mergeAssociationsIntoTopK so the graph channel is visible to
recall@K, and computes graphEffect as the recall@K lift on the graph-relevant
split by pairing each graph-on cell's per-question outcomes against its
same-rerank graph-off sibling (one recall per question — no double-scoring).

- requireGraph hard-fails before any graph cell runs when no bench Neo4j is
  wired, so a SQL-only fallback can never be reported as a graph result.
- extract{LongMemEval,LoCoMo}Outcomes live in a dependency-light matrix-outcomes
  module (no adapter/onnx import) so the pairing + classification stays
  unit-testable without native binaries; chained through computeGraphEffect.
- BaselineProvenance: git HEAD + corpus sha256 + flags + Neo4j-gate-state,
  written to results/gates/graph-eval-baseline.json (gitignore switched to
  results/* + !results/gates/ so the baseline can be committed).
- CLI: --matrix, --require-graph, --categories 2,3.

Pure orchestration unit-tested; the adapter-running wrapper + the live baseline
are validated against the bench runtime on the server. bench typecheck clean, 23/23.

Graph decomposition in memory.ingest() is fire-and-forget (pushed to
_pendingWrites). Neither bench adapter awaited it, so recall ran against a
half-built graph — the graph cells produced empty/sparse associations and
graphEffect was spuriously ~0. This is exactly the measurement bug that makes
"the graph doesn't help" look true when the graph was never given a chance.

Call memory.flushPendingWrites() at the ingest→eval boundary in both adapters
(LoCoMo runConversation after consolidation; LongMemEval runQuestion after
ingest) so the graph is fully built before recall.
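The race described above is easy to reproduce in miniature. SketchMemory below is a hypothetical stand-in, not the project's Memory class: it only models an ingest() that schedules graph work without awaiting it and a flushPendingWrites() that drains the queue.

```typescript
class SketchMemory {
  private pendingWrites: Promise<void>[] = [];
  readonly graphNodes: string[] = [];

  // Fire-and-forget: the graph write is scheduled on a microtask and not awaited.
  ingest(id: string): void {
    const write = Promise.resolve().then(() => {
      this.graphNodes.push(id);
    });
    this.pendingWrites.push(write);
  }

  // Drains every write queued so far; callers await this before evaluating.
  async flushPendingWrites(): Promise<void> {
    const writes = this.pendingWrites;
    this.pendingWrites = [];
    await Promise.all(writes);
  }
}

async function demo(): Promise<{ before: number; after: number }> {
  const memory = new SketchMemory();
  memory.ingest("e1");
  memory.ingest("e2");

  // Evaluating here would see an empty graph: the writes have not run yet.
  const before = memory.graphNodes.length;

  // The ingest-to-eval boundary: wait for the graph to be fully built.
  await memory.flushPendingWrites();
  const after = memory.graphNodes.length;

  return { before, after };
}

demo().then((r) => console.log(r.before, r.after)); // 0 2
```

In the real adapters the gap is much wider than a microtask, since graph decomposition involves external calls, so recall without the flush sees a sparse or empty graph.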

bench typecheck clean, 23/23.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR introduces a memory tombstoning mechanism across all storage backends and a comprehensive 4-cell benchmark matrix for measuring graph and reranking effects. The forgotten mechanism excludes memories from recall while preserving audit history; the matrix evaluation runs paired benchmark configurations to compute graph contribution metrics.

Changes

Memory Tombstoning & Benchmark Matrix Evaluation

| Layer | File(s) | Summary |
|---|---|---|
| Storage adapter contracts for tombstoning | packages/core/src/adapters/storage.ts | EpisodeStorage, SemanticStorage, and ProceduralStorage gain a new markForgotten(ids) method that tombstones memories via forgotten_at without affecting access_count or confidence, idempotent and returning the count of newly tombstoned rows. |
| Neo4j graph tombstone implementation | packages/graph/src/neural-graph.ts, packages/graph/src/spreading-activation.ts, packages/graph/test/spreading-activation.test.ts | NeuralGraph adds forgetMemories(ids) to stamp forgottenAt on Memory nodes; spreading activation filters out forgotten nodes via Cypher path constraints; integration tests verify forgotten memories are excluded from activation results and downstream paths are pruned. |
| SQLite tombstone schema and implementation | packages/sqlite/src/episodes.ts, packages/sqlite/src/semantic.ts, packages/sqlite/src/procedural.ts, packages/sqlite/src/migrations.ts, packages/sqlite/src/adapter.ts, packages/sqlite/test/migrations.test.ts | Schema migration v5 adds forgotten_at column with partial indexes; all storage classes implement markForgotten; search/recall/vector queries are updated to filter forgotten rows; adapter vector-search paths exclude forgotten rows. |
| PostgREST schema and tombstone implementation | packages/postgrest/schema.sql, packages/postgrest/src/episodes.ts, packages/postgrest/src/semantic.ts, packages/postgrest/src/procedural.ts, packages/postgrest/test/forgotten-at-gate.test.ts | Schema adds forgotten_at column and engram_mark_forgotten RPC; all recall/search functions gated with forgotten_at IS NULL; storage classes implement markForgotten; schema tests validate gate predicate counts and column presence across forgettable tables. |
| Core memory forget() implementation with tombstoning | packages/core/src/memory.ts, packages/core/test/retrieval/mock-storage.ts | forget() replaces depreciation with true tombstoning: gates by minimum relevance threshold, filters by optional tier, returns preview when confirm is false, and when confirmed, marks matching items forgotten across storage types and best-effort tombstones Neo4j nodes (non-fatal on errors); mock storage adds markForgotten mocks. |
| Benchmark memory factory refactor to handle wrapper | packages/bench/src/bench-memory-handle.ts, packages/bench/src/memory-factory.ts | createBenchMemory now returns BenchMemoryHandle with memory instance, config (graph and reranker backend), and graphActuallyWired flag; new module exports requireGraph hard-fail guard preventing silent SQL-only fallback. |
| Recall structure classification | packages/bench/src/classification/classify-recall-structure.ts | Deterministic classifier categorizes questions as lookup, multi_hop, temporal, or aggregation using dataset signals (category, ability) and heuristics (gold-id cardinality, temporal regex); exports GRAPH_RELEVANT set identifying graph-benefiting structures. |
| Association merging and recall outcome types | packages/bench/src/merge-associations.ts | Defines BenchRecallResult and BenchScoredMemory types from recall results; mergeAssociationsIntoScored conditionally includes graph-derived associations in top-K pool for visibility. |
| Graph effect and verdict metrics | packages/bench/src/metrics/graph-effect.ts, packages/bench/src/metrics/graph-verdict.ts | computeGraphEffect measures graph contribution via split selection (graph-relevant or graph-visible), computing recall@K on/off and returning delta; graphVerdict gates on invariant, power threshold (MIN_POWER_N=100), and epsilon-based flat-effect detection (DEFAULT_EPSILON=0.005). |
| Matrix comparison runner and outcome extraction | packages/bench/src/runner/compare-matrix.ts, packages/bench/src/runner/matrix-outcomes.ts | compareMatrix runs 4-cell {graph on/off} × {rerank on/off} ablation with optional graph requirement checking; outcome extractors pair on/off predictions by question/QA id, classify structures, and emit outcomes with recallAtK flags for graph-effect computation. |
| Benchmark adapter matrix support | packages/bench/src/locomo/adapter.ts, packages/bench/src/longmemeval/adapter.ts, packages/bench/src/locomo/judge-adapter.ts, packages/bench/src/locomo/forensics/local-recall-sweep.ts, packages/bench/src/longmemeval/forensics/recall-sweep.ts | Adapters destructure memory/config from createBenchMemory handle, conditionally wipe Neo4j graph before ingest, flush pending writes before recall, and merge associations into topK; LoCoMo adds category filtering. |
| CLI matrix execution mode with corpus hashing | packages/bench/bin/engram-bench.ts | Adds --matrix, --require-graph, --categories argument parsing; implements hashCorpus() for SHA-256 fingerprinting; matrix branch captures provenance, runs compareMatrix, prints per-cell results with graph effect, writes baseline JSON to ./results/gates/. |
| Benchmark types and exports | packages/bench/src/types.ts, packages/bench/src/index.ts | Extends BenchmarkOpts with mergeAssociationsIntoTopK and categories flags; introduces MatrixCell, BaselineProvenance, and ComparisonMatrixResult types; expands index re-exports for matrix, classification, and metrics utilities. |
| Forgotten mechanism tests | packages/sqlite/test/forget-e2e.test.ts, packages/postgrest/test/forgotten-at-gate.test.ts | End-to-end SQLite test validates exclusion from recall, idempotency, preview-as-no-op, access-count preservation; PostgREST schema test validates gate predicate counts and column/index presence. |
| Matrix evaluation and metrics tests | packages/bench/test/classify-recall-structure.test.ts, packages/bench/test/graph-effect.test.ts, packages/bench/test/graph-verdict.test.ts, packages/bench/test/matrix-outcomes.test.ts | Comprehensive test coverage for classification, graph-effect computation, verdict gating, and outcome extraction; validates split selection, recall computation, power thresholds, and outcome pairing. |
| Association merging and graph requirement tests | packages/bench/test/associations-visible-to-scored.test.ts, packages/bench/test/require-graph.test.ts | Tests mergeAssociationsIntoScored visibility and ordering; tests requireGraph hard-fail guard. |
| Configuration and tooling | .gitignore, packages/bench/vitest.config.ts | Updates .gitignore to selectively preserve results/gates/ for matrix baselines; adds vitest.config.ts with test discovery and 10s timeout. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • muhammadkh4n/engram#27: Directly related because this PR extends the opt-in bench Neo4j wiring by refactoring createBenchMemory to return a config handle and enforcing graph wipe/flush behavior in adapters when graph is enabled.

Poem

🐰 Forgetting is a gentle art,
Tombstones guard what once lived free,
Four cells ablate the matrix chart,
Measuring when graphs help memory.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately captures the two main workstreams: a deployed forget() correctness fix (Phase 1) and a new graph-measurement harness (Phase 0), clearly identifying the primary changes in the changeset. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing both the forget() fix (bug, solution, deployment status) and Phase 0 bench harness (instrumentation fix, new tooling, default-off behavior). |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@muhammadkh4n muhammadkh4n merged commit 390ac7a into main Jun 8, 2026
1 of 2 checks passed