Memory overhaul: forget() correctness fix (Phase 1, deployed) + graph-measurement harness (Phase 0) by muhammadkh4n · Pull Request #38 · muhammadkh4n/engram

muhammadkh4n · 2026-06-08T08:24:55Z

Summary

Two workstreams from the memory-system overhaul:

- Phase 1: forget() correctness fix (user-facing; already deployed to rexvps prod + verified).
- Phase 0: bench graph-measurement harness (bench-only, default-off → zero runtime/behaviour change).

Phase 1 — `forget()` actually forgets now (DEPLOYED)

The bug: forget() was inverted. The episode path called recordAccess() → access_count++, which computeScore rewards (accessBoost), so forgetting a memory raised its recall rank. Procedural was a no-op; semantic floored a confidence value no recall path reads. This is the root cause of ~19 sessions of memory-gardening toil (the ~7,200-duplicate resurfacing).
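
The inversion can be reproduced in miniature. The sketch below assumes a hypothetical scoring function in which the boost grows with the access count; the real computeScore and accessBoost are not shown in this description, so the formula and the constant are illustrative only.

```typescript
// Hypothetical stand-in for a scored episode row; field names follow the PR text.
interface Row {
  id: string;
  similarity: number;  // base relevance from vector/BM25 search
  accessCount: number; // incremented by recordAccess()
}

// ASSUMPTION: an illustrative boost, not the real accessBoost formula.
function accessBoost(accessCount: number): number {
  return 0.05 * Math.log(1 + accessCount);
}

// ASSUMPTION: a simplified score; the real computeScore has more terms.
function computeScore(row: Row): number {
  return row.similarity + accessBoost(row.accessCount);
}

// The old episode path: "forgetting" recorded an access.
function oldForget(row: Row): void {
  row.accessCount += 1;
}

const target: Row = { id: "a", similarity: 0.8, accessCount: 0 };
const scoreBefore = computeScore(target);
oldForget(target);
const scoreAfter = computeScore(target);
// scoreAfter > scoreBefore: the memory the caller tried to forget now ranks higher.
```

Any boost that increases monotonically with the access count would show the same effect, which is why the fix stops writing access_count on the forget path at all.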

The fix: forget = tombstone only. A new markForgotten(ids) storage method stamps a forgotten_at timestamp and touches neither access_count nor confidence. Every recall path gates on forgotten_at IS NULL (cloned from the proven superseded_by IS NULL gate):

- SQLite (offline/CI path): migration v5 adds forgotten_at + partial indexes; vector + BM25 + deep-recall all gated.
- PostgREST (production): forgotten_at columns + indexes + engram_mark_forgotten RPC + the gate in all 4 recall RPCs (12 branches). Applied + verified on the live 47K-row prod DB via a BEGIN/ROLLBACK round-trip (forget → absent, access_count unchanged, zero pollution); PostgREST schema cache reloaded.
- Neo4j (red-team blocker): forgetMemories stamps :Memory.forgottenAt; the spreading-activation Cypher gates traversal on coalesce(n.forgottenAt, n.deletedAt) IS NULL (also prunes paths through forgotten nodes) so forget can't leak through the graph channel. Verified on real Neo4j 5.
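
The tombstone contract described above (stamp forgotten_at, leave access_count alone, gate every read on the tombstone being absent) can be sketched with an in-memory store. This is a minimal illustration, not the SQLite or PostgREST implementation: `InMemoryEpisodeStore` and its `recall` method are hypothetical, and `markForgotten` is synchronous here although the real storage method returns a Promise<number>.

```typescript
interface EpisodeRow {
  id: string;
  content: string;
  accessCount: number;
  forgottenAt: string | null; // ISO timestamp once tombstoned, otherwise null
}

// Hypothetical in-memory stand-in for one storage tier.
class InMemoryEpisodeStore {
  private rows: EpisodeRow[];

  constructor(rows: EpisodeRow[]) {
    this.rows = rows;
  }

  // Stamps forgottenAt on rows that are not already tombstoned and returns
  // how many rows were newly stamped, so a repeated call returns 0.
  // It deliberately never touches accessCount.
  markForgotten(ids: string[]): number {
    let stamped = 0;
    for (const row of this.rows) {
      if (ids.indexOf(row.id) !== -1 && row.forgottenAt === null) {
        row.forgottenAt = new Date().toISOString();
        stamped += 1;
      }
    }
    return stamped;
  }

  // The recall gate: the in-memory analogue of `forgotten_at IS NULL`.
  recall(query: string): EpisodeRow[] {
    return this.rows.filter(
      (row) => row.forgottenAt === null && row.content.indexOf(query) !== -1,
    );
  }
}

const rows: EpisodeRow[] = [
  { id: "e1", content: "deploy notes", accessCount: 3, forgottenAt: null },
  { id: "e2", content: "deploy checklist", accessCount: 1, forgottenAt: null },
];
const store = new InMemoryEpisodeStore(rows);
const firstCall = store.markForgotten(["e1"]);
const secondCall = store.markForgotten(["e1"]);
const visibleIds = store.recall("deploy").map((row) => row.id);
```

After the two calls, `e1` is hidden from `recall`, its sibling `e2` survives, and `e1` keeps its original access count, which mirrors the round-trip checks listed for the production database.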

Deploy status: live on rexvps (schema applied + reloaded, /opt/engram MCP service rebuilt/restarted, end-to-end memory_recall confirmed). Schema change is additive + backward-compatible (forgotten_at defaults NULL → no existing row hidden).

Phase 0 — make the graph measurable (bench-only, default-off)

Discovery: both bench adapters scored only recallResult.memories, while graph output lands in the ignored recallResult.associations channel — so graph:true vs graph:false mathematically could not move recall@K. "The graph doesn't help" was a measurement-instrumentation bug, not a verdict.

This PR builds the harness to actually measure it:

- mergeAssociationsIntoScored + BenchmarkOpts.mergeAssociationsIntoTopK (default false → byte-identical runs): unions the graph channel into the scored top-K. Scale-safe (gold-id set-membership metric).
- 4-cell {graph}×{rerank} matrix runner + computeGraphEffect (recall@K lift on the graph-relevant split) + classifyRecallStructure.
- graphVerdict: symmetric kill criterion (never conclude "kill" from a saturated aggregate; requires null aggregate and flat graphEffect on the graph-visible split with n≥100).
- requireGraph hard-fail guard (no SQL-only result masquerading as a graph result) + per-unit wipeBenchGraph isolation.
- Bench correctness fix: adapters now call flushPendingWrites() before eval. Previously they never awaited the fire-and-forget graph decomposition, so graph cells recalled against a half-built graph.
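
The merge step in the first item can be sketched as follows. The types and the `hitAtK` helper are hypothetical simplifications (the real bench types are BenchRecallResult and BenchScoredMemory), and treating a question as a hit when any gold id lands in the top-K is an assumption about the metric.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ScoredItem {
  id: string;
  score: number;
}
interface RecallResultLike {
  memories: ScoredItem[];     // ranked by the dense/rerank pipeline
  associations: ScoredItem[]; // ranked by graph relevance
}

// With the flag off, memories are returned unchanged. With it on, associations
// are appended after the memory channel and duplicates are dropped.
function mergeAssociationsIntoScored(
  result: RecallResultLike,
  merge: boolean,
): ScoredItem[] {
  if (!merge) return result.memories;
  const seen: string[] = [];
  const merged: ScoredItem[] = [];
  for (const item of result.memories.concat(result.associations)) {
    if (seen.indexOf(item.id) !== -1) continue;
    seen.push(item.id);
    merged.push(item);
  }
  return merged;
}

// ASSUMPTION: a hit means at least one gold id is among the first K ids.
// Only membership matters, so the two channels' score scales never interact.
function hitAtK(scored: ScoredItem[], goldIds: string[], k: number): boolean {
  const topK = scored.slice(0, k).map((item) => item.id);
  return goldIds.some((gold) => topK.indexOf(gold) !== -1);
}

const recallResult: RecallResultLike = {
  memories: [{ id: "m1", score: 0.9 }, { id: "m2", score: 0.7 }],
  associations: [{ id: "g1", score: 12 }, { id: "m2", score: 8 }],
};
const hitWithMergeOff = hitAtK(mergeAssociationsIntoScored(recallResult, false), ["g1"], 3);
const hitWithMergeOn = hitAtK(mergeAssociationsIntoScored(recallResult, true), ["g1"], 3);
```

In this example the gold id `g1` exists only in the association channel, so it is invisible to the metric with the flag off and visible with it on.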

Finding (why this is default-off): an empirical run showed LongMemEval-S is saturated — vector+BM25+rerank already hits 100% R@5 on the multi-hop/temporal split with the graph OFF, so recall@K is blind to any graph enhancement here. LoCoMo was abandoned (documented dataset flaws + confused category map). So the graph keep/kill decision is deferred to a fair test (fix the disabled PPR + a multi-hop chaining benchmark like MuSiQue/2Wiki) — out of scope for this PR. The harness is valid tooling for that future test; it ships default-off with no runtime effect.

Behaviour change & risk

Area	Change	Risk
`forget()`	Now correct (was inverted)	Low — already deployed + verified on prod; schema additive/backward-compatible
Recall	None	`forgotten_at IS NULL` is a dormant no-op until something is forgotten
Bench (Phase 0)	New tooling, `mergeAssociationsIntoTopK` default false	None — bench-only, zero runtime change

Testing

sqlite 106/106 · core 495/495 · postgrest 71/71 · graph 12/12 · bench 23/23; typecheck clean across all touched packages. Phase 1 verified on real Postgres 17 + pgvector, real Neo4j 5, and production rexvps.

Follow-ups (NOT in this PR)

- The graph "fair test": bind the existing seedActivations map into the spreading-activation Cypher (PPR is currently disabled via an activation=1.0 hardcode) + a MuSiQue/2Wiki adapter.
- Turn spreading-activation off for default recall (it can only add noise to the precise dense pipeline).

…offline path)

The shipped forget() was inverted: on episodes it called recordAccess() (access_count++), a term recall ranking REWARDS via accessBoost, so forgetting a memory RAISED its recall rank; on semantic it floored a confidence value no recall path reads; procedural was a no-op. Net: forget did nothing useful or worse, which is why ~7,200 "forgotten" memories resurfaced across 19 manual gardening sessions.

This lands the offline-testable core of the fix (core contract + SQLite + PostgREST adapter; the PostgREST schema/RPC + Neo4j gate follow):

- storage.ts: add markForgotten(ids): Promise<number> to Episode/Semantic/Procedural storage, a tombstone that sets forgotten_at and touches NEITHER access_count NOR confidence.
- memory.ts forget(): rewrite the confirm path to call markForgotten per tier; drop the recordAccess/recordAccessAndBoost calls and the confidence floor (the forgotten_at tombstone is the single source of truth). Remove the now-unused CONFIDENCE_FLOOR.
- sqlite: migration v5 adds forgotten_at (+ partial index) to episodes/semantic/procedural; markForgotten impls; AND forgotten_at IS NULL gate cloned onto every recall path (vectorSearch + textBoost + the per-store hybrid/BM25/vector fallbacks), mirroring the proven superseded_by gate.
- postgrest adapters: markForgotten via table PATCH (schema lands next).
- tests: forget-e2e (recall gate excludes a tombstoned memory while the sibling survives; forget removes matched content; access_count NOT bumped; confirm=false no-op; idempotent at both levels), which would fail on old code. Migration test updated to v5 + a forgotten_at column assertion.

sqlite 106/106, core 495/495, typecheck clean.

…ten (Phase 1 production path)

Carry the forget() tombstone into the PostgREST schema so forget removes content from every recall path on the production (Postgres+pgvector) backend, matching the SQLite v5 offline path.

- forgotten_at timestamptz on memory_episodes/semantic/procedural, added both in the CREATE TABLE bodies and via idempotent ADD COLUMN IF NOT EXISTS so re-applying onto an already-provisioned DB actually adds the column.
- engram_mark_forgotten(p_memory_type, p_ids): stamps forgotten_at and touches neither access_count nor confidence (writing access_count was the inverted-forget bug; flooring confidence was dead). Idempotent, returns rows stamped.
- AND forgotten_at IS NULL gate in all 4 recall RPCs (engram_hybrid_recall, engram_recall, engram_text_boost, engram_vector_search) across the episode, semantic and procedural branches. Digests are not forgettable.
- partial indexes on tombstoned rows (lockstep with the SQLite v5 indexes).
- EOF post-apply smoke executes every recall RPC + engram_mark_forgotten so a missing column or broken gate surfaces at apply time. The dump emits functions before tables and relies on check_function_bodies=false, so the forgotten_at columns live in the table section and the smoke is the call-time guard; there is no migration runner.

Verified on Postgres 17 + pgvector: forget round-trip across all three types, sibling survives, forgotten row's access_count unchanged. New schema-gate test pins the predicates; postgrest suite 71/71, typecheck clean.

…through the graph

After the SQL forget gate, a forgotten memory could still surface through graph spreading activation, and would leak into authoritative recall once associations are merged into the scored pool. Close the graph channel.

- GraphPort.forgetMemories?(ids): optional, capability-guarded port method.
- NeuralGraph.forgetMemories stamps forgottenAt on :Memory {id} nodes (idempotent, returns count). Memory nodes are uniformly :Memory regardless of memoryType, so one match covers episode/semantic/procedural ids.
- spreading-activation Cypher gates traversal on coalesce(n.forgottenAt, n.deletedAt) IS NULL (NULL-permissive), which also prunes paths that pass through a forgotten node.
- Memory.forget() calls graph.forgetMemories after the SQL tombstone, capability-guarded and non-fatal: SQL stays the source of truth, and a graph hiccup or absent Neo4j never fails the forget.

Verified on Neo4j 5: spreading-activation 12/12 incl. endpoint exclusion, path-through pruning, sibling survival, idempotency (1 then 0) and cross-type semantic gating. core 495/495, sqlite 106/106, typecheck clean.
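
The path-pruning property of that gate can be illustrated without Neo4j. The sketch below is plain breadth-first reachability over a hypothetical in-memory adjacency list, not the weighted spreading activation the Cypher performs; `isLive` mirrors the coalesce(n.forgottenAt, n.deletedAt) IS NULL predicate, and every other name is an assumption made for the example.

```typescript
interface MemoryNode {
  id: string;
  forgottenAt: string | null;
  deletedAt: string | null;
}
type NodeMap = { [id: string]: MemoryNode };
type EdgeMap = { [id: string]: string[] };

// Mirrors coalesce(n.forgottenAt, n.deletedAt) IS NULL.
function isLive(node: MemoryNode): boolean {
  return (node.forgottenAt ?? node.deletedAt) === null;
}

// Hypothetical traversal: returns every node reachable from the seed within
// maxHops, never stepping onto (and therefore never through) a gated node.
function reachable(seedId: string, nodes: NodeMap, edges: EdgeMap, maxHops: number): string[] {
  const visited: string[] = [seedId];
  let frontier: string[] = [seedId];
  for (let hop = 0; hop < maxHops; hop++) {
    const next: string[] = [];
    for (const current of frontier) {
      for (const neighbourId of edges[current] ?? []) {
        const neighbour = nodes[neighbourId];
        if (!neighbour || !isLive(neighbour)) continue; // the gate
        if (visited.indexOf(neighbourId) !== -1) continue;
        visited.push(neighbourId);
        next.push(neighbourId);
      }
    }
    frontier = next;
  }
  return visited.filter((id) => id !== seedId);
}

const nodes: NodeMap = {
  seed: { id: "seed", forgottenAt: null, deletedAt: null },
  a: { id: "a", forgottenAt: null, deletedAt: null },
  b: { id: "b", forgottenAt: null, deletedAt: null },
  c: { id: "c", forgottenAt: null, deletedAt: null },
};
// b is only reachable through a.
const edges: EdgeMap = { seed: ["a", "c"], a: ["b"] };

const beforeForget = reachable("seed", nodes, edges, 2);
nodes["a"].forgottenAt = "2026-06-08T00:00:00Z";
const afterForget = reachable("seed", nodes, edges, 2);
```

Forgetting `a` removes both `a` (endpoint exclusion) and `b` (its only path ran through `a`), while the sibling `c` survives.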

…red (Phase 0, unit 1)

Discovery #1: both bench adapters scored only recallResult.memories, but graph spreading-activation output lands in the separate recallResult.associations channel. So graph:true vs graph:false mathematically could not move recall@K; "the graph doesn't help" was a measurement-instrumentation bug, not a verdict.

- mergeAssociationsIntoScored(recallResult, flag): when the flag is set, unions associations after the memory channel; otherwise returns memories unchanged.
- BenchmarkOpts.mergeAssociationsIntoTopK (default false → byte-identical runs).
- Wired into both LongMemEval (runQuestion) and LoCoMo (evaluateDataset) scoring.
- Score-scale-safe by construction: both adapters score by gold-id set-membership over the deduped top-K, not score magnitude. So unioning the graph-relevance-ranked associations after the MMR/cross-encoder-ranked memories cannot be confounded by the scale mismatch: a gold id is either in the first K deduped ids or it is not. Memory-first ordering means associations can only RESCUE a gold id the memory channel missed, never displace one.
- associations-visible-to-scored.test.ts: deterministic invariant (no Neo4j, no LLM, no dataset): gold present in the scored pool with merge ON, absent OFF. This is the "associations-visible" invariant the symmetric kill criterion (later unit) depends on.
- Adds packages/bench/vitest.config.ts so the gate test runs under turbo/CI.

Default off → zero behaviour change to existing runs. bench typecheck clean, 4/4.

Encodes the red-team rule that prevents the historical mistake: concluding "kill the graph" from the saturated LongMemEval-S aggregate (~98.8% recall@5, where nothing has headroom to move). graphVerdict() demands POSITIVE evidence of no-effect before a kill:

- kill ONLY when the primary aggregate delta is null/negative AND graphEffect is flat (≤ epsilon) on a graph-visible split with n ≥ 100, and the associations-visible invariant is green. Never the aggregate alone.
- keep as soon as graphEffect clears epsilon on a powered, visible split, even when the saturated aggregate is flat or negative.
- insufficient_power when underpowered (n<100), when the invariant is red, or in the ambiguous aggregate-positive-but-flat-effect case.

Pure and deterministic; 6 table-driven cases pin the asymmetry, the power gate, and the invariant dependency. bench typecheck clean, 10/10.
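
The three rules translate into a small pure function. The sketch below is one reading of them, not the shipped graphVerdict: the input shape and field names are hypothetical, "null/negative" is interpreted as an aggregate delta ≤ 0, and the constants are the MIN_POWER_N=100 and DEFAULT_EPSILON=0.005 values quoted in the review walkthrough.

```typescript
type Verdict = "keep" | "kill" | "insufficient_power";

// Hypothetical input shape for illustration.
interface VerdictInput {
  aggregateDelta: number;  // primary aggregate recall@K, graph on minus graph off
  graphEffect: number;     // recall@K lift on the graph-visible split
  splitN: number;          // number of questions in that split
  invariantGreen: boolean; // associations-visible invariant passed
}

const MIN_POWER_N = 100;
const DEFAULT_EPSILON = 0.005;

function graphVerdict(input: VerdictInput, epsilon: number = DEFAULT_EPSILON): Verdict {
  // No decision is possible if the graph channel was never visible to the
  // metric, or if the split is too small to carry a conclusion.
  if (!input.invariantGreen) return "insufficient_power";
  if (input.splitN < MIN_POWER_N) return "insufficient_power";

  // Positive evidence of lift wins even when the aggregate is flat or negative.
  if (input.graphEffect > epsilon) return "keep";

  // The effect is flat here. Kill only when the aggregate agrees;
  // a positive aggregate with a flat effect is ambiguous.
  return input.aggregateDelta <= 0 ? "kill" : "insufficient_power";
}

const killCase = graphVerdict({ aggregateDelta: 0, graphEffect: 0.001, splitN: 150, invariantGreen: true });
const keepCase = graphVerdict({ aggregateDelta: -0.01, graphEffect: 0.02, splitN: 150, invariantGreen: true });
const underpowered = graphVerdict({ aggregateDelta: 0, graphEffect: 0, splitN: 50, invariantGreen: true });
const invariantRed = graphVerdict({ aggregateDelta: 0, graphEffect: 0, splitN: 150, invariantGreen: false });
const ambiguous = graphVerdict({ aggregateDelta: 0.01, graphEffect: 0.001, splitN: 150, invariantGreen: true });
```

The asymmetry is visible in the ordering of the checks: a keep needs only a powered lift, while a kill needs the invariant, the power gate, a flat effect and a non-positive aggregate all at once.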

…units 3-4)

createBenchMemory now returns a {memory, config, graphActuallyWired} handle instead of a bare Memory, so graph cells can reach the graph handle and hard-fail when it is absent.

- requireGraph(handle): throws if a graph cell runs without a real bench Neo4j, killing the silent SQL-only fallback that would otherwise report a SQL delta as a graph result (the "graph was never measured" trap). Lives in a dependency-light bench-memory-handle module so the guard + types are unit-testable without loading the ONNX native binding.
- wipeBenchGraph wired into LongMemEval runQuestion and LoCoMo runConversation before ingest: Neo4j is a shared external process (unlike the per-call fresh :memory: SQLite), so each question/conversation must start with a clean graph or the previous unit's nodes pollute spreading activation. (wipeBenchGraph existed but was called nowhere.)
- Migrated all 5 createBenchMemory callers to destructure the handle.

bench typecheck clean, 12/12.
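
A guard of this kind can be very small. The sketch below assumes a simplified handle shape (the `config` fields and the generic `memory` type are hypothetical) and an error message invented for the example; only the `requireGraph(handle)` name and the three handle keys come from the commit message.

```typescript
// Hypothetical, simplified handle; the real config carries graph and reranker settings.
interface BenchConfigLike {
  graph: boolean;
}
interface BenchMemoryHandleLike<M> {
  memory: M;
  config: BenchConfigLike;
  graphActuallyWired: boolean; // true only when a real bench Neo4j is connected
}

// Called before any graph cell runs: refuse to continue rather than let a
// SQL-only run be reported as a graph result.
function requireGraph<M>(handle: BenchMemoryHandleLike<M>): void {
  if (!handle.graphActuallyWired) {
    throw new Error("graph cell requested but no bench Neo4j is wired");
  }
}

const sqlOnlyHandle: BenchMemoryHandleLike<object> = {
  memory: {},
  config: { graph: true },
  graphActuallyWired: false,
};

let guardFired = false;
try {
  requireGraph(sqlOnlyHandle);
} catch (err) {
  guardFired = true;
}
```

Checking `graphActuallyWired` rather than `config.graph` is the point of the guard: the config records what was requested, while the flag records what was actually connected.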

…/3 gate filter (Phase 0, units 6-8b)

- classifyRecallStructure: deterministic label {lookup, multi_hop, temporal, aggregation} from dataset signals (LoCoMo category, LongMemEval ability) with a gold-cardinality + temporal-token heuristic fallback. GRAPH_RELEVANT = {multi_hop, temporal}, the split where spreading activation should help.
- computeGraphEffect: recall@K(merge ON) − recall@K(merge OFF) on the graph-relevant split (or the stronger graph-visible split when per-question graphCouldContribute is supplied). This is the scale-independent lift that feeds graphVerdict; an empty split returns zero effect, so with the n<100 power gate no decision is ever fabricated.
- LoCoMo categories filter (BenchmarkOpts.categories): score only the requested categories (e.g. [2,3]) while ingesting the corpus whole, which filters the metric, not the graph the recall traverses. Canonical category map locked from judge-adapter: 1=single_hop 2=multi_hop 3=temporal 4=open_domain 5=adversarial.

bench typecheck clean, 20/20.
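
The lift computation can be sketched as below. The `PairedOutcome` and `GraphEffect` shapes are hypothetical, and only the graph-relevant split is modelled (the optional graph-visible split driven by graphCouldContribute is omitted).

```typescript
type RecallStructure = "lookup" | "multi_hop" | "temporal" | "aggregation";

// The split where spreading activation should help.
const GRAPH_RELEVANT: RecallStructure[] = ["multi_hop", "temporal"];

// Hypothetical per-question record: one question, scored with merge off and on.
interface PairedOutcome {
  questionId: string;
  structure: RecallStructure;
  hitOff: boolean; // gold id in top-K with merge OFF
  hitOn: boolean;  // gold id in top-K with merge ON
}

interface GraphEffect {
  n: number;
  recallOn: number;
  recallOff: number;
  effect: number; // recallOn - recallOff
}

function computeGraphEffect(outcomes: PairedOutcome[]): GraphEffect {
  const split = outcomes.filter((o) => GRAPH_RELEVANT.indexOf(o.structure) !== -1);
  const n = split.length;
  // An empty split reports zero effect instead of dividing by zero.
  if (n === 0) return { n: 0, recallOn: 0, recallOff: 0, effect: 0 };
  const recallOn = split.filter((o) => o.hitOn).length / n;
  const recallOff = split.filter((o) => o.hitOff).length / n;
  return { n, recallOn, recallOff, effect: recallOn - recallOff };
}

const outcomes: PairedOutcome[] = [
  { questionId: "q1", structure: "multi_hop", hitOff: false, hitOn: true },
  { questionId: "q2", structure: "temporal", hitOff: true, hitOn: true },
  { questionId: "q3", structure: "multi_hop", hitOff: false, hitOn: true },
  { questionId: "q4", structure: "temporal", hitOff: false, hitOn: false },
  { questionId: "q5", structure: "lookup", hitOff: false, hitOn: true }, // excluded from the split
];
const effect = computeGraphEffect(outcomes);
const emptyEffect = computeGraphEffect([]);
```

Here the split has four questions, recall is 0.25 with merge off and 0.75 with merge on, so the effect is 0.5; the lookup question is ignored even though the merge rescued it.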

… unit 5)

Completes the Phase 0 measurement harness. compareMatrix runs the 4 cells, each graph-on cell with mergeAssociationsIntoTopK so the graph channel is visible to recall@K, and computes graphEffect as the recall@K lift on the graph-relevant split by pairing each graph-on cell's per-question outcomes against its same-rerank graph-off sibling (one recall per question, no double-scoring).

- requireGraph hard-fails before any graph cell runs when no bench Neo4j is wired, so a SQL-only fallback can never be reported as a graph result.
- extract{LongMemEval,LoCoMo}Outcomes live in a dependency-light matrix-outcomes module (no adapter/onnx import) so the pairing + classification stays unit-testable without native binaries; chained through computeGraphEffect.
- BaselineProvenance: git HEAD + corpus sha256 + flags + Neo4j-gate-state, written to results/gates/graph-eval-baseline.json (gitignore switched to results/* + !results/gates/ so the baseline can be committed).
- CLI: --matrix, --require-graph, --categories 2,3.

Pure orchestration unit-tested; the adapter-running wrapper + the live baseline are validated against the bench runtime on the server. bench typecheck clean, 23/23.

Graph decomposition in memory.ingest() is fire-and-forget (pushed to _pendingWrites). Neither bench adapter awaited it, so recall ran against a half-built graph — the graph cells produced empty/sparse associations and graphEffect was spuriously ~0. This is exactly the measurement bug that makes "the graph doesn't help" look true when the graph was never given a chance. Call memory.flushPendingWrites() at the ingest→eval boundary in both adapters (LoCoMo runConversation after consolidation; LongMemEval runQuestion after ingest) so the graph is fully built before recall. bench typecheck clean, 23/23.
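
The failure mode and the fix can be reproduced with a toy class. The sketch below is an assumption-heavy miniature: `SketchMemory`, its fields and `runUnit` are hypothetical, and the deferred write is simulated with a resolved promise rather than a real graph decomposition.

```typescript
// Hypothetical miniature of a memory whose graph writes are fire-and-forget.
class SketchMemory {
  private pendingWrites: Promise<void>[] = [];
  graphNodes: string[] = [];

  // Returns immediately; the graph write is queued and completes later.
  ingest(id: string): void {
    const write = Promise.resolve().then(() => {
      this.graphNodes.push(id);
    });
    this.pendingWrites.push(write);
  }

  // Awaits every queued write so the graph is fully built before recall.
  async flushPendingWrites(): Promise<void> {
    const writes = this.pendingWrites;
    this.pendingWrites = [];
    await Promise.all(writes);
  }
}

async function runUnit(): Promise<{ beforeFlush: number; afterFlush: number }> {
  const memory = new SketchMemory();
  memory.ingest("m1");
  memory.ingest("m2");
  // Evaluating here is the bug: the graph is still empty.
  const beforeFlush = memory.graphNodes.length;
  // The ingest-to-eval boundary: wait for the graph to finish building.
  await memory.flushPendingWrites();
  const afterFlush = memory.graphNodes.length;
  return { beforeFlush, afterFlush };
}
```

Reading `graphNodes` straight after `ingest` sees none of the queued nodes, which is the bench analogue of graph cells producing empty or sparse associations.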

coderabbitai · 2026-06-08T08:25:09Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR introduces a memory tombstoning mechanism across all storage backends and a comprehensive 4-cell benchmark matrix for measuring graph and reranking effects. The forgotten mechanism excludes memories from recall while preserving audit history; the matrix evaluation runs paired benchmark configurations to compute graph contribution metrics.

Changes

Memory Tombstoning & Benchmark Matrix Evaluation

Layer / File(s)	Summary
Storage adapter contracts for tombstoning `packages/core/src/adapters/storage.ts`	`EpisodeStorage`, `SemanticStorage`, and `ProceduralStorage` gain a new `markForgotten(ids)` method that tombstones memories via `forgotten_at` without affecting `access_count` or confidence, idempotent and returning the count of newly tombstoned rows.
Neo4j graph tombstone implementation `packages/graph/src/neural-graph.ts`, `packages/graph/src/spreading-activation.ts`, `packages/graph/test/spreading-activation.test.ts`	`NeuralGraph` adds `forgetMemories(ids)` to stamp `forgottenAt` on Memory nodes; spreading activation filters out forgotten nodes via Cypher path constraints; integration tests verify forgotten memories are excluded from activation results and downstream paths are pruned.
SQLite tombstone schema and implementation `packages/sqlite/src/episodes.ts`, `packages/sqlite/src/semantic.ts`, `packages/sqlite/src/procedural.ts`, `packages/sqlite/src/migrations.ts`, `packages/sqlite/src/adapter.ts`, `packages/sqlite/test/migrations.test.ts`	Schema migration v5 adds `forgotten_at` column with partial indexes; all storage classes implement `markForgotten`; search/recall/vector queries are updated to filter forgotten rows; adapter vector-search paths exclude forgotten rows.
PostgREST schema and tombstone implementation `packages/postgrest/schema.sql`, `packages/postgrest/src/episodes.ts`, `packages/postgrest/src/semantic.ts`, `packages/postgrest/src/procedural.ts`, `packages/postgrest/test/forgotten-at-gate.test.ts`	Schema adds `forgotten_at` column and `engram_mark_forgotten` RPC; all recall/search functions gated with `forgotten_at IS NULL`; storage classes implement `markForgotten`; schema tests validate gate predicate counts and column presence across forgettable tables.
Core memory forget() implementation with tombstoning `packages/core/src/memory.ts`, `packages/core/test/retrieval/mock-storage.ts`	`forget()` replaces depreciation with true tombstoning: gates by minimum relevance threshold, filters by optional tier, returns preview when confirm is false, and when confirmed, marks matching items forgotten across storage types and best-effort tombstones Neo4j nodes (non-fatal on errors); mock storage adds `markForgotten` mocks.
Benchmark memory factory refactor to handle wrapper `packages/bench/src/bench-memory-handle.ts`, `packages/bench/src/memory-factory.ts`	`createBenchMemory` now returns `BenchMemoryHandle` with memory instance, config (graph and reranker backend), and `graphActuallyWired` flag; new module exports `requireGraph` hard-fail guard preventing silent SQL-only fallback.
Recall structure classification `packages/bench/src/classification/classify-recall-structure.ts`	Deterministic classifier categorizes questions as `lookup`, `multi_hop`, `temporal`, or `aggregation` using dataset signals (category, ability) and heuristics (gold-id cardinality, temporal regex); exports `GRAPH_RELEVANT` set identifying graph-benefiting structures.
Association merging and recall outcome types `packages/bench/src/merge-associations.ts`	Defines `BenchRecallResult` and `BenchScoredMemory` types from recall results; `mergeAssociationsIntoScored` conditionally includes graph-derived associations in top-K pool for visibility.
Graph effect and verdict metrics `packages/bench/src/metrics/graph-effect.ts`, `packages/bench/src/metrics/graph-verdict.ts`	`computeGraphEffect` measures graph contribution via split selection (graph-relevant or graph-visible), computing recall@K on/off and returning delta; `graphVerdict` gates on invariant, power threshold (`MIN_POWER_N=100`), and epsilon-based flat-effect detection (`DEFAULT_EPSILON=0.005`).
Matrix comparison runner and outcome extraction `packages/bench/src/runner/compare-matrix.ts`, `packages/bench/src/runner/matrix-outcomes.ts`	`compareMatrix` runs 4-cell {graph on/off} × {rerank on/off} ablation with optional graph requirement checking; outcome extractors pair on/off predictions by question/QA id, classify structures, and emit outcomes with recallAtK flags for graph-effect computation.
Benchmark adapter matrix support `packages/bench/src/locomo/adapter.ts`, `packages/bench/src/longmemeval/adapter.ts`, `packages/bench/src/locomo/judge-adapter.ts`, `packages/bench/src/locomo/forensics/local-recall-sweep.ts`, `packages/bench/src/longmemeval/forensics/recall-sweep.ts`	Adapters destructure memory/config from `createBenchMemory` handle, conditionally wipe Neo4j graph before ingest, flush pending writes before recall, and merge associations into topK; LoCoMo adds category filtering.
CLI matrix execution mode with corpus hashing `packages/bench/bin/engram-bench.ts`	Adds `--matrix`, `--require-graph`, `--categories` argument parsing; implements `hashCorpus()` for SHA-256 fingerprinting; matrix branch captures provenance, runs `compareMatrix`, prints per-cell results with graph effect, writes baseline JSON to `./results/gates/`.
Benchmark types and exports `packages/bench/src/types.ts`, `packages/bench/src/index.ts`	Extends `BenchmarkOpts` with `mergeAssociationsIntoTopK` and `categories` flags; introduces `MatrixCell`, `BaselineProvenance`, and `ComparisonMatrixResult` types; expands index re-exports for matrix, classification, and metrics utilities.
Forgotten mechanism tests `packages/sqlite/test/forget-e2e.test.ts`, `packages/postgrest/test/forgotten-at-gate.test.ts`	End-to-end SQLite test validates exclusion from recall, idempotency, preview-as-no-op, access-count preservation; PostgREST schema test validates gate predicate counts and column/index presence.
Matrix evaluation and metrics tests `packages/bench/test/classify-recall-structure.test.ts`, `packages/bench/test/graph-effect.test.ts`, `packages/bench/test/graph-verdict.test.ts`, `packages/bench/test/matrix-outcomes.test.ts`	Comprehensive test coverage for classification, graph-effect computation, verdict gating, and outcome extraction; validates split selection, recall computation, power thresholds, and outcome pairing.
Association merging and graph requirement tests `packages/bench/test/associations-visible-to-scored.test.ts`, `packages/bench/test/require-graph.test.ts`	Tests `mergeAssociationsIntoScored` visibility and ordering; tests `requireGraph` hard-fail guard.
Configuration and tooling `.gitignore`, `packages/bench/vitest.config.ts`	Updates `.gitignore` to selectively preserve `results/gates/` for matrix baselines; adds `vitest.config.ts` with test discovery and 10s timeout.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

muhammadkh4n/engram#27: Directly related because this PR extends the opt-in bench Neo4j wiring by refactoring createBenchMemory to return a config handle and enforcing graph wipe/flush behavior in adapters when graph is enabled.

Poem

🐰 Forgetting is a gentle art,
Tombstones guard what once lived free,
Four cells ablate the matrix chart,
Measuring when graphs help memory.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately captures the two main workstreams: a deployed forget() correctness fix (Phase 1) and a new graph-measurement harness (Phase 0), clearly identifying the primary changes in the changeset.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing both the forget() fix (bug, solution, deployment status) and Phase 0 bench harness (instrumentation fix, new tooling, default-off behavior).
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


muhammadkh4n added 9 commits June 6, 2026 05:13

muhammadkh4n merged commit 390ac7a into main Jun 8, 2026
1 of 2 checks passed

muhammadkh4n merged 9 commits into main from feat/memory-overhaul
