Memory overhaul: forget() correctness fix (Phase 1, deployed) + graph-measurement harness (Phase 0)#38
Conversation
…offline path) The shipped forget() was inverted: on episodes it called recordAccess() (access_count++), a term recall ranking REWARDS via accessBoost — so forgetting a memory RAISED its recall rank; on semantic it floored a confidence value no recall path reads; procedural was a no-op. Net: forget did nothing useful or worse, which is why ~7,200 "forgotten" memories resurfaced across 19 manual gardening sessions. This lands the offline-testable core of the fix (core contract + SQLite + PostgREST adapter; the PostgREST schema/RPC + Neo4j gate follow): - storage.ts: add markForgotten(ids): Promise<number> to Episode/Semantic/ Procedural storage — a tombstone that sets forgotten_at and touches NEITHER access_count NOR confidence. - memory.ts forget(): rewrite the confirm path to call markForgotten per tier; drop the recordAccess/recordAccessAndBoost calls and the confidence floor (the forgotten_at tombstone is the single source of truth). Remove the now-unused CONFIDENCE_FLOOR. - sqlite: migration v5 adds forgotten_at (+ partial index) to episodes/ semantic/procedural; markForgotten impls; AND forgotten_at IS NULL gate cloned onto every recall path (vectorSearch + textBoost + the per-store hybrid/BM25/vector fallbacks), mirroring the proven superseded_by gate. - postgrest adapters: markForgotten via table PATCH (schema lands next). - tests: forget-e2e (recall gate excludes a tombstoned memory while the sibling survives; forget removes matched content; access_count NOT bumped; confirm=false no-op; idempotent at both levels) — would fail on old code. migration test updated to v5 + a forgotten_at column assertion. sqlite 106/106, core 495/495, typecheck clean.
…ten (Phase 1 production path) Carry the forget() tombstone into the PostgREST schema so forget removes content from every recall path on the production (Postgres+pgvector) backend, matching the SQLite v5 offline path. - forgotten_at timestamptz on memory_episodes/semantic/procedural, added both in the CREATE TABLE bodies and via idempotent ADD COLUMN IF NOT EXISTS so re-applying onto an already-provisioned DB actually adds the column. - engram_mark_forgotten(p_memory_type, p_ids): stamps forgotten_at and touches neither access_count nor confidence (writing access_count was the inverted- forget bug; flooring confidence was dead). Idempotent, returns rows stamped. - AND forgotten_at IS NULL gate in all 4 recall RPCs (engram_hybrid_recall, engram_recall, engram_text_boost, engram_vector_search) across the episode, semantic and procedural branches. Digests are not forgettable. - partial indexes on tombstoned rows (lockstep with the SQLite v5 indexes). - EOF post-apply smoke executes every recall RPC + engram_mark_forgotten so a missing column or broken gate surfaces at apply time. The dump emits functions before tables and relies on check_function_bodies=false, so the forgotten_at columns live in the table section and the smoke is the call-time guard; there is no migration runner. Verified on Postgres 17 + pgvector: forget round-trip across all three types, sibling survives, forgotten row's access_count unchanged. New schema-gate test pins the predicates; postgrest suite 71/71, typecheck clean.
…through the graph
After the SQL forget gate, a forgotten memory could still surface through graph
spreading activation — and would leak into authoritative recall once
associations are merged into the scored pool. Close the graph channel.
- GraphPort.forgetMemories?(ids): optional, capability-guarded port method.
- NeuralGraph.forgetMemories stamps forgottenAt on :Memory {id} nodes
(idempotent, returns count). Memory nodes are uniformly :Memory regardless of
memoryType, so one match covers episode/semantic/procedural ids.
- spreading-activation Cypher gates traversal on
coalesce(n.forgottenAt, n.deletedAt) IS NULL (NULL-permissive), which also
prunes paths that pass through a forgotten node.
- Memory.forget() calls graph.forgetMemories after the SQL tombstone,
capability-guarded and non-fatal: SQL stays the source of truth, and a graph
hiccup or absent Neo4j never fails the forget.
Verified on Neo4j 5: spreading-activation 12/12 incl. endpoint exclusion,
path-through pruning, sibling survival, idempotency (1 then 0) and cross-type
semantic gating. core 495/495, sqlite 106/106, typecheck clean.
…red (Phase 0, unit 1) Discovery #1: both bench adapters scored only recallResult.memories, but graph spreading-activation output lands in the separate recallResult.associations channel. So graph:true vs graph:false mathematically could not move recall@K — "the graph doesn't help" was a measurement-instrumentation bug, not a verdict. - mergeAssociationsIntoScored(recallResult, flag): when the flag is set, unions associations after the memory channel; otherwise returns memories unchanged. - BenchmarkOpts.mergeAssociationsIntoTopK (default false → byte-identical runs). - Wired into both LongMemEval (runQuestion) and LoCoMo (evaluateDataset) scoring. - Score-scale-safe by construction: both adapters score by gold-id set-membership over the deduped top-K, not score magnitude. So unioning the graph-relevance- ranked associations after the MMR/cross-encoder-ranked memories cannot be confounded by the scale mismatch — a gold id is either in the first K deduped ids or it is not. Memory-first ordering means associations can only RESCUE a gold id the memory channel missed, never displace one. - associations-visible-to-scored.test.ts: deterministic invariant (no Neo4j, no LLM, no dataset) — gold present in the scored pool with merge ON, absent OFF. This is the "associations-visible" invariant the symmetric kill criterion (later unit) depends on. - Adds packages/bench/vitest.config.ts so the gate test runs under turbo/CI. Default off → zero behaviour change to existing runs. bench typecheck clean, 4/4.
Encodes the red-team rule that prevents the historical mistake — concluding "kill the graph" from the saturated LongMemEval-S aggregate (~98.8% recall@5, where nothing has headroom to move). graphVerdict() demands POSITIVE evidence of no-effect before a kill: - kill ONLY when the primary aggregate delta is null/negative AND graphEffect is flat (≤ epsilon) on a graph-visible split with n ≥ 100, and the associations-visible invariant is green. Never the aggregate alone. - keep as soon as graphEffect clears epsilon on a powered, visible split — even when the saturated aggregate is flat or negative. - insufficient_power when underpowered (n<100), when the invariant is red, or in the ambiguous aggregate-positive-but-flat-effect case. Pure and deterministic; 6 table-driven cases pin the asymmetry, the power gate, and the invariant dependency. bench typecheck clean, 10/10.
…units 3-4)
createBenchMemory now returns a {memory, config, graphActuallyWired} handle
instead of a bare Memory, so graph cells can reach the graph handle and hard-
fail when it is absent.
- requireGraph(handle): throws if a graph cell runs without a real bench Neo4j,
killing the silent SQL-only fallback that would otherwise report a SQL delta
as a graph result (the "graph was never measured" trap). Lives in a
dependency-light bench-memory-handle module so the guard + types are
unit-testable without loading the ONNX native binding.
- wipeBenchGraph wired into LongMemEval runQuestion and LoCoMo runConversation
before ingest: Neo4j is a shared external process (unlike the per-call fresh
:memory: SQLite), so each question/conversation must start with a clean graph
or the previous unit's nodes pollute spreading activation. (wipeBenchGraph
existed but was called nowhere.)
- Migrated all 5 createBenchMemory callers to destructure the handle.
bench typecheck clean, 12/12.
…/3 gate filter (Phase 0, units 6-8b)
- classifyRecallStructure: deterministic label {lookup, multi_hop, temporal,
aggregation} from dataset signals (LoCoMo category, LongMemEval ability) with
a gold-cardinality + temporal-token heuristic fallback. GRAPH_RELEVANT =
{multi_hop, temporal} — the split where spreading activation should help.
- computeGraphEffect: recall@K(merge ON) − recall@K(merge OFF) on the
graph-relevant split (or the stronger graph-visible split when per-question
graphCouldContribute is supplied). This is the scale-independent lift that
feeds graphVerdict; an empty split returns zero effect, so with the n<100
power gate no decision is ever fabricated.
- LoCoMo categories filter (BenchmarkOpts.categories): score only the requested
categories (e.g. [2,3]) while ingesting the corpus whole — filters the metric,
not the graph the recall traverses. Canonical category map locked from
judge-adapter: 1=single_hop 2=multi_hop 3=temporal 4=open_domain 5=adversarial.
bench typecheck clean, 20/20.
… unit 5)
Completes the Phase 0 measurement harness. compareMatrix runs the 4 cells, each
graph-on cell with mergeAssociationsIntoTopK so the graph channel is visible to
recall@K, and computes graphEffect as the recall@K lift on the graph-relevant
split by pairing each graph-on cell's per-question outcomes against its
same-rerank graph-off sibling (one recall per question — no double-scoring).
- requireGraph hard-fails before any graph cell runs when no bench Neo4j is
wired, so a SQL-only fallback can never be reported as a graph result.
- extract{LongMemEval,LoCoMo}Outcomes live in a dependency-light matrix-outcomes
module (no adapter/onnx import) so the pairing + classification stays
unit-testable without native binaries; chained through computeGraphEffect.
- BaselineProvenance: git HEAD + corpus sha256 + flags + Neo4j-gate-state,
written to results/gates/graph-eval-baseline.json (gitignore switched to
results/* + !results/gates/ so the baseline can be committed).
- CLI: --matrix, --require-graph, --categories 2,3.
Pure orchestration unit-tested; the adapter-running wrapper + the live baseline
are validated against the bench runtime on the server. bench typecheck clean, 23/23.
Graph decomposition in memory.ingest() is fire-and-forget (pushed to _pendingWrites). Neither bench adapter awaited it, so recall ran against a half-built graph — the graph cells produced empty/sparse associations and graphEffect was spuriously ~0. This is exactly the measurement bug that makes "the graph doesn't help" look true when the graph was never given a chance. Call memory.flushPendingWrites() at the ingest→eval boundary in both adapters (LoCoMo runConversation after consolidation; LongMemEval runQuestion after ingest) so the graph is fully built before recall. bench typecheck clean, 23/23.
|
Caution Review failedPull request was closed or merged during review WalkthroughThis PR introduces a memory tombstoning mechanism across all storage backends and a comprehensive 4-cell benchmark matrix for measuring graph and reranking effects. The forgotten mechanism excludes memories from recall while preserving audit history; the matrix evaluation runs paired benchmark configurations to compute graph contribution metrics. ChangesMemory Tombstoning & Benchmark Matrix Evaluation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary
Two workstreams from the memory-system overhaul:
forget()correctness fix (user-facing; already deployed to rexvps prod + verified).Phase 1 —
forget()actually forgets now (DEPLOYED)The bug:
forget()was inverted. The episode path calledrecordAccess()→access_count++, whichcomputeScorerewards (accessBoost), so forgetting a memory raised its recall rank. Procedural was a no-op; semantic floored a confidence value no recall path reads. This is the root cause of ~19 sessions of memory-gardening toil (the ~7,200-duplicate resurfacing).The fix: forget = tombstone only. A new
markForgotten(ids)storage method stamps aforgotten_attimestamp and touches neitheraccess_countnorconfidence. Every recall path gates onforgotten_at IS NULL(cloned from the provensuperseded_by IS NULLgate):forgotten_at+ partial indexes; vector + BM25 + deep-recall all gated.forgotten_atcolumns + indexes +engram_mark_forgottenRPC + the gate in all 4 recall RPCs (12 branches). Applied + verified on the live 47K-row prod DB via aBEGIN/ROLLBACKround-trip (forget → absent,access_countunchanged, zero pollution); PostgREST schema cache reloaded.forgetMemoriesstamps:Memory.forgottenAt; the spreading-activation Cypher gates traversal oncoalesce(n.forgottenAt, n.deletedAt) IS NULL(also prunes paths through forgotten nodes) so forget can't leak through the graph channel. Verified on real Neo4j 5.Deploy status: live on rexvps (schema applied + reloaded,
/opt/engramMCP service rebuilt/restarted, end-to-endmemory_recallconfirmed). Schema change is additive + backward-compatible (forgotten_atdefaults NULL → no existing row hidden).Phase 0 — make the graph measurable (bench-only, default-off)
Discovery: both bench adapters scored only
recallResult.memories, while graph output lands in the ignoredrecallResult.associationschannel — sograph:truevsgraph:falsemathematically could not move recall@K. "The graph doesn't help" was a measurement-instrumentation bug, not a verdict.This PR builds the harness to actually measure it:
mergeAssociationsIntoScored+BenchmarkOpts.mergeAssociationsIntoTopK(default false → byte-identical runs) — unions the graph channel into the scored top-K. Scale-safe (gold-id set-membership metric).{graph}×{rerank}matrix runner +computeGraphEffect(recall@K lift on the graph-relevant split) +classifyRecallStructure.graphVerdict— symmetric kill criterion (never conclude "kill" from a saturated aggregate; requires null aggregate and flat graphEffect on the graph-visible split with n≥100).requireGraphhard-fail guard (no SQL-only result masquerading as a graph result) + per-unitwipeBenchGraphisolation.flushPendingWrites()before eval — they never awaited the fire-and-forget graph decomposition, so graph cells recalled against a half-built graph.Finding (why this is default-off): an empirical run showed LongMemEval-S is saturated — vector+BM25+rerank already hits 100% R@5 on the multi-hop/temporal split with the graph OFF, so recall@K is blind to any graph enhancement here. LoCoMo was abandoned (documented dataset flaws + confused category map). So the graph keep/kill decision is deferred to a fair test (fix the disabled PPR + a multi-hop chaining benchmark like MuSiQue/2Wiki) — out of scope for this PR. The harness is valid tooling for that future test; it ships default-off with no runtime effect.
Behaviour change & risk
forget()forgotten_at IS NULLis a dormant no-op until something is forgottenmergeAssociationsIntoTopKdefault falseTesting
sqlite 106/106 · core 495/495 · postgrest 71/71 · graph 12/12 · bench 23/23; typecheck clean across all touched packages. Phase 1 verified on real Postgres 17 + pgvector, real Neo4j 5, and production rexvps.Follow-ups (NOT in this PR)
seedActivationsmap into the spreading-activation Cypher (PPR is currently disabled via anactivation=1.0hardcode) + a MuSiQue/2Wiki adapter.