Implement progressive merging via epoch-based propagation with union-find by claude[bot] · Pull Request #29 · resolveworks/worldgraph

claude · 2026-03-27T15:13:11Z

Summary

Implement epoch-based progressive merging as described in docs/progressive_merging.md
High-confidence entity merges are committed during propagation (not just post-processing), enriching neighborhoods for subsequent iterations
All 57 original tests pass at every commit; 2 new tests added (59 total)

Closes #25

Implementation

Each commit is a standalone step where all existing tests pass:

1. Move negative dampening outside inner propagation loop: pure behavior-preserving refactor making the inner loop a clean monotone fixpoint
2. Extract propagate_positive() and apply_negative(): split into standalone functions, same API and behavior
3. Add epoch loop with union-find merging: wraps propagate + dampen cycle in an outer epoch loop. Key design decisions:
   - Adjacency lists deduplicated per epoch to prevent inflated evidence from merged entities
   - Name-only seed (not carried-forward confidence) used for negative evidence to prevent circular reinforcement across epochs
   - Best confidence across all epochs preserved per pair, ensuring progressive merging never worsens scores
   - Default parameters (merge_threshold=0.9, max_epochs=5) reproduce previous behavior
4. Enriched-neighborhood test (Layer 2): three articles where A+B merge in epoch 1, then C benefits from the enriched A+B neighborhood (confidence improves from ~0.57 to ~0.78)
5. Cascading false merge test (Layer 3): two unrelated clusters with isomorphic structure stay separate even with progressive merging enabled
6. CLI exposure: --merge-threshold and --max-epochs options added to worldgraph match

Test plan

All 57 original tests pass at every step
New L2 test demonstrates enriched-neighborhood benefit (higher confidence with max_epochs>1 vs max_epochs=1)
New L3 test verifies no cascading false merges across unrelated clusters
match_graphs() API backward-compatible (new params have defaults reproducing current behavior)

🤖 Generated with Claude Code

The negative factor reads from name_seed (fixed) and doesn't depend on anything that changes within the positive fixpoint loop. Moving it to a single post-convergence pass makes the inner loop a clean monotone fixpoint and prepares for epoch-based progressive merging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
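
The control flow described in this commit can be sketched roughly as follows. This is a toy illustration, not the project's code: positive_fixpoint, dampen_once, toy_update and the numeric values are hypothetical, and the real update rule in match.py is not shown in this PR text. The point is only that a negative factor that never changes inside the loop can be applied once after convergence.

```python
from typing import Callable

Pair = tuple[str, str]
Scores = dict[Pair, float]


def positive_fixpoint(seed: Scores, update: Callable[[Pair, Scores], float],
                      tol: float = 1e-6, max_iter: int = 100) -> Scores:
    """Iterate a monotone update until no score moves by more than tol."""
    conf = dict(seed)
    for _ in range(max_iter):
        # Scores can only rise, which is what makes the loop monotone.
        new = {pair: max(conf[pair], update(pair, conf)) for pair in conf}
        delta = max((new[pair] - conf[pair] for pair in conf), default=0.0)
        conf = new
        if delta <= tol:
            break
    return conf


def dampen_once(conf: Scores, neg_factor: Scores) -> Scores:
    """Apply the fixed negative factor a single time, after convergence."""
    return {pair: score * neg_factor.get(pair, 1.0) for pair, score in conf.items()}


def toy_update(pair: Pair, conf: Scores) -> float:
    # Hypothetical rule: (a2, b2) borrows evidence from its matched neighbors.
    if pair == ("a2", "b2"):
        return 0.9 * conf[("a1", "b1")]
    return conf[pair]


seed = {("a1", "b1"): 0.8, ("a2", "b2"): 0.3}
converged = positive_fixpoint(seed, toy_update)
final = dampen_once(converged, {("a2", "b2"): 0.5})
```

Because dampen_once never feeds back into positive_fixpoint, the inner loop's convergence can be reasoned about without considering negative evidence at all.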

Split the positive fixpoint loop and negative dampening pass into standalone functions. propagate_similarity() now calls these in sequence — same API, same behavior. This separation enables the epoch loop in the next step. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Wrap the propagate + dampen cycle in an outer epoch loop. High-confidence merges (>= merge_threshold, default 0.9) are committed between epochs via union-find. Merged entities' neighborhoods are unioned for subsequent epochs, allowing evidence from transitively-matched entities to compound.

Key design decisions:
- Adjacency lists are deduplicated per epoch to prevent inflated evidence from merged entities having multiple copies of structurally identical edges
- Name-only seed (not carried-forward confidence) is used for negative evidence to prevent circular reinforcement across epochs
- Best confidence across all epochs is preserved per pair, ensuring progressive merging never worsens scores vs single-epoch behavior
- With default parameters, existing behavior is preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
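
For readers unfamiliar with the mechanism, committing merges above a threshold through union-find might look like the sketch below. The UnionFind class here is a generic textbook version with path compression, and commit_merges plus the entity IDs are hypothetical; the project's own UnionFind may expose a different interface beyond find.

```python
class UnionFind:
    """Generic disjoint-set structure keyed by entity ID."""

    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: str, b: str) -> str:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
        return ra


def commit_merges(conf: dict[tuple[str, str], float], uf: UnionFind,
                  merge_threshold: float = 0.9) -> list[tuple[str, str]]:
    """Union every pair whose confidence reaches the threshold."""
    merged = []
    for (a, b), score in conf.items():
        if score >= merge_threshold and uf.find(a) != uf.find(b):
            uf.union(a, b)
            merged.append((a, b))
    return merged


uf = UnionFind()
conf = {("a:acme", "b:acme"): 0.95, ("a:acme", "c:meridian"): 0.57}
merged = commit_merges(conf, uf)
```

In this toy run only the 0.95 pair is merged; the 0.57 pair stays separate and would be re-scored against the enriched neighborhood in the next epoch.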

Three articles where A+B merge in epoch 1 (identical names + strong structural overlap). Article C's "Meridian Tech Corp" has moderate name similarity and neighbors split between A-unique (Austin) and B-unique (Volta Systems). Without progressive merging, C sees only pairwise evidence. With progressive merging, the enriched A+B neighborhood provides additional structural paths, producing measurably higher confidence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rges Two unrelated clusters with isomorphic structure (NovaTech/DataVault vs Quantum Labs/ClearSky) stay separate even with progressive merging enabled. Within-cluster merges work correctly while cross-cluster isolation is maintained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add --merge-threshold and --max-epochs options to the match CLI command, passing them through run_matching to the propagation pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude

Requesting changes (can't use --request-changes on own PR)

Dead code: _build_weighted_adjacency and _build_forward_adjacency (match.py:155-208)

These two functions are defined but never called — they were replaced by _build_epoch_adjacency but left in the file. Per CLAUDE.md: "never add backward-compatibility shims, preserve stale signatures, or keep dead code around 'just in case'. Refactor completely."

Delete both functions. No other issues found — the rest of the implementation is clean and correct.

These functions were replaced by _build_epoch_adjacency but left behind. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude

Review: No issues found — this is ready to merge.

Clean decomposition of the propagation loop into propagate_positive / apply_negative, and the epoch-based progressive merging is well-structured. Specific things I verified:

Behavioral equivalence for single-epoch case: Since the negative factor depends only on fixed name_seed and forward_adj, moving it from per-iteration to post-convergence produces identical final values. The convergence criterion difference (dampened vs undampened delta) is negligible.
Adjacency deduplication in _build_epoch_adjacency correctly prevents inflated evidence from merged entities' redundant edges.
Name-only seed for negative evidence prevents circular reinforcement across epochs — the design from docs/progressive_merging.md is faithfully implemented.
best_confidence tracking correctly expands canonical-rep scores to original entity pairs and preserves the best score across epochs, so progressive merging never worsens results.
Return type change (propagate_similarity now returns tuple[Confidence, UnionFind]) is handled correctly at the only call site in match_graphs.
No dead code, no shims, no half-finished refactors — the old _build_weighted_adjacency and _build_forward_adjacency are cleanly replaced by _build_epoch_adjacency.

One minor note (not blocking): _build_epoch_adjacency doesn't skip edges where src == tgt after mapping through union-find (self-loops from merging connected entities). This is unlikely in practice since entities connected by a relation are semantically distinct, but an "if src == tgt: continue" guard would be a clean defensive addition.

When merged entities were previously connected by a relation, mapping both endpoints through union-find produces src == tgt. These self-loops would pollute adjacency lists with spurious neighbors. Filter them out. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
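
A minimal sketch of the self-loop filter, assuming edges arrive as (src, relation, tgt, weight) tuples and that a single weight is stored per neighbor. Both are simplifications for illustration; build_canonical_adjacency and the sample data are hypothetical and do not reproduce _build_epoch_adjacency.

```python
from collections import defaultdict
from typing import Callable, Iterable, NamedTuple


class Neighbor(NamedTuple):
    entity_id: str
    relation: str
    func_weight: float


def build_canonical_adjacency(
    edges: Iterable[tuple[str, str, str, float]],
    find: Callable[[str], str],
) -> dict[str, list[Neighbor]]:
    adjacency: dict[str, list[Neighbor]] = defaultdict(list)
    seen: set[tuple[str, str, str]] = set()
    for src, relation, tgt, weight in edges:
        src, tgt = find(src), find(tgt)
        if src == tgt:
            # Both endpoints collapsed into one entity: drop the self-loop.
            continue
        key = (src, relation, tgt)
        if key in seen:
            # Same canonical edge already recorded: avoid duplicate evidence.
            continue
        seen.add(key)
        adjacency[src].append(Neighbor(tgt, relation, weight))
        adjacency[tgt].append(Neighbor(src, relation, weight))
    return dict(adjacency)


# "b" has been merged into "a"; "x" is untouched.
canon = {"a": "a", "b": "a", "x": "x"}
edges = [
    ("a", "subsidiary_of", "b", 1.0),  # becomes a self-loop after mapping
    ("a", "located_in", "x", 0.8),
    ("b", "located_in", "x", 0.8),     # duplicate of the edge above
]
adj = build_canonical_adjacency(edges, canon.__getitem__)
```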

monneyboi

Replace the epoch-based double loop with a single propagation loop using incremental canonical adjacency.

The current design runs an inner fixpoint loop to convergence, then an outer epoch loop that rebuilds adjacency, re-seeds confidence, and rebuilds pairs between epochs. This is wasteful (most re-convergence work is redundant since only merged entities' neighborhoods changed) and harder to follow than necessary.

The epoch split exists because the docs sketched it that way, but the algorithm doesn't require it. A single loop with inline merging works if we handle adjacency dedup correctly — and we can do that incrementally instead of rebuilding from scratch each epoch.

The key insight: maintain a canonical_adj alongside the UnionFind, updated incrementally on merge.

The adjacency is consumed identically in both positive propagation and negative evidence: adj.get(entity_id, []) → iterate Neighbor(entity_id, relation, func_weight). The consumer doesn't care how the list was built. So on merge of A and B into canonical rep C:

# O(degree_A + degree_B) per merge
combined = canonical_adj[A] + canonical_adj[B]
seen = set()
deduped = []
for nbr in combined:
    canon_nbr = uf.find(nbr.entity_id)
    key = (canon_nbr, nbr.relation)
    if key not in seen:
        seen.add(key)
        deduped.append(Neighbor(canon_nbr, nbr.relation, nbr.func_weight))
canonical_adj[C] = deduped

The inner loop stays flat — same structure as current code, no member iteration, no adjacency rebuild:

for neighbor_a in canonical_adj.get(ca, []):
    for neighbor_b in canonical_adj.get(cb, []):
        nbr_conf = prev_base.get((neighbor_a.entity_id, neighbor_b.entity_id), 0.0)
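
A self-contained version of the merge step above, written as a sketch under a few assumptions: find is any callable that returns the canonical ID, the merged lists are removed from the dict, and neighbors that resolve to the new representative are skipped (the self-loop case raised earlier in this thread). merge_adjacency and the sample IDs are hypothetical.

```python
from typing import Callable, NamedTuple


class Neighbor(NamedTuple):
    entity_id: str
    relation: str
    func_weight: float


def merge_adjacency(
    canonical_adj: dict[str, list[Neighbor]],
    a: str,
    b: str,
    rep: str,
    find: Callable[[str], str],
) -> None:
    """Fold the neighbor lists of a and b into rep in O(degree_a + degree_b)."""
    combined = canonical_adj.pop(a, []) + canonical_adj.pop(b, [])
    seen: set[tuple[str, str]] = set()
    deduped: list[Neighbor] = []
    for nbr in combined:
        canon_nbr = find(nbr.entity_id)
        if canon_nbr == rep:
            continue  # edge between the two merged entities
        key = (canon_nbr, nbr.relation)
        if key not in seen:
            seen.add(key)
            deduped.append(Neighbor(canon_nbr, nbr.relation, nbr.func_weight))
    canonical_adj[rep] = deduped


# b1 was just merged into a1; the two Austin mentions were merged earlier.
parent = {"b1": "a1", "austin_a": "austin", "austin_b": "austin"}


def find(entity_id: str) -> str:
    return parent.get(entity_id, entity_id)


adj = {
    "a1": [Neighbor("austin_a", "located_in", 0.8)],
    "b1": [
        Neighbor("austin_b", "located_in", 0.8),
        Neighbor("volta", "partner_of", 0.6),
        Neighbor("a1", "alias_of", 1.0),
    ],
}
merge_adjacency(adj, "a1", "b1", "a1", find)
```

Note that this only rewrites the merged entity's own list. Entries in other entities' lists that still name b1 resolve correctly only if the consumer also passes them through find.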

Why dedup matters (can't skip it): without dedup, an entity appearing in 5 articles inflates structural evidence from ~0.59 to ~0.99 on the same underlying fact. The exponential sum saturates but not fast enough — the inflation is severe at exactly the scale we're targeting (major entities appearing across many articles).
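
The quoted figures can be reproduced with a little arithmetic, assuming the "exponential sum" has the form 1 - exp(-sum of contributions). That form is an assumption; the PR text does not spell out the aggregation, though the 0.59 and 0.99 figures are consistent with it. saturating_evidence is a hypothetical name.

```python
import math


def saturating_evidence(contributions: list[float]) -> float:
    """Assumed aggregation: 1 - exp(-sum), which saturates toward 1."""
    return 1.0 - math.exp(-sum(contributions))


# Contribution size that yields 0.59 when counted once.
single = -math.log(1.0 - 0.59)

once = saturating_evidence([single])        # one copy of the fact
five = saturating_evidence([single] * 5)    # same fact counted five times

print(f"once={once:.2f} five={five:.2f}")
```

Under this assumption, counting one fact five times is equivalent to 1 - 0.41^5, which is why the score climbs so close to 1 without any new information.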

What this eliminates:

_build_epoch_adjacency (full O(|edges|) rebuild per epoch)
_build_epoch_pairs (full O(n²) rebuild per epoch)
_seed_epoch_confidence (full O(n² × names²) re-seeding per epoch)
The best_confidence tracking and final member-expansion loop
The propagate_positive / apply_negative split into separate functions (they fold back into the single loop)

What this keeps:

UnionFind for entity identity (already have it)
Canonical adjacency for neighborhood identity (new, incremental, O(degree) per merge)
Pairs list and confidence dict keyed by canonical reps, updated on merge
Deduplication of redundant edges between same canonical entities

The epoch-based design ran an inner fixpoint loop to convergence, then an outer epoch loop that rebuilt adjacency, re-seeded confidence, and rebuilt pairs between epochs. This was wasteful and harder to follow.

Replace with a single propagation loop that maintains a canonical_adj alongside the UnionFind, updated incrementally on merge (O(degree) per merge instead of O(|edges|) full rebuild). Positive evidence runs to convergence, then negative dampening is applied once, then merges are committed inline. Enriched neighborhoods compound across merge cycles.

Key changes:
- Remove _build_epoch_adjacency, _build_epoch_pairs, _seed_epoch_confidence
- Remove propagate_positive and apply_negative as separate functions
- Remove max_epochs parameter (single loop handles everything)
- Add _build_adjacency, _build_pairs, _seed_confidence (initial setup)
- Incremental canonical_adj dedup on merge prevents evidence inflation
- Re-seed from name similarity after merge to prevent compounding dampening
- Negative evidence resolves neighbor IDs through UnionFind and treats same-canonical-entity neighbors as perfect matches

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude

Bug: negative evidence uses the wrong functionality direction.

_build_adjacency (line 295) assigns functionality weights matching the old positive-propagation adjacency — source gets func.inverse, target gets func.forward:

adjacency[src].append(Neighbor(tgt, edge.relation, func.inverse))
adjacency[tgt].append(Neighbor(src, edge.relation, func.forward))

But _one_sided_negative reads neighbor_a.func_weight as the negative-evidence weight. Per the algorithm design and docs/negative_evidence.md line 38 ("Negative evidence uses forward functionality"), it needs the opposite direction: forward for outgoing neighbors, inverse for incoming.

The old code had a separate _build_forward_adjacency with the swapped directions specifically for this. The refactor to a single canonical_adj dropped that distinction — negative evidence now weights mismatches by inverse functionality (outgoing) / forward functionality (incoming) instead of forward / inverse. This is semantically wrong: a functional relation (high forward func, e.g. "acquired") should generate strong negative evidence when its target doesn't match, but now it's weighted by inverse func which measures something different (how unique is the source given the target?).

The fix: since _one_sided_negative iterates over adj.get(id_a, []) and reads neighbor.func_weight, you need the weight to reflect the correct direction for negative evidence. Two options:

Store both weights in Neighbor (add a field), use the right one in each context.
Build a parallel neg_adj with swapped directions (like the old _build_forward_adjacency), maintained incrementally alongside canonical_adj.

Option 1 is cleaner since it avoids maintaining two adjacency structures through merges.

The refactor to a single canonical_adj dropped the distinction between positive and negative functionality directions. Negative evidence needs forward func for outgoing neighbors and inverse for incoming (the opposite of positive propagation). Add neg_func_weight to Neighbor so each context uses the correct weight. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
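
A sketch of what storing both weights per Neighbor could look like, following the direction assignments described in the review above. The Functionality type, the edge tuple shape, the sample values and build_adjacency itself are illustrative assumptions, not the project's _build_adjacency.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Functionality(NamedTuple):
    forward: float  # how strongly the source determines the target
    inverse: float  # how strongly the target determines the source


class Neighbor(NamedTuple):
    entity_id: str
    relation: str
    func_weight: float      # read by positive propagation
    neg_func_weight: float  # read by negative evidence


def build_adjacency(
    edges: Iterable[tuple[str, str, str]],
    functionality: dict[str, Functionality],
) -> dict[str, list[Neighbor]]:
    adjacency: dict[str, list[Neighbor]] = defaultdict(list)
    for src, relation, tgt in edges:
        func = functionality[relation]
        # Outgoing neighbor: positive uses inverse, negative uses forward.
        adjacency[src].append(Neighbor(tgt, relation, func.inverse, func.forward))
        # Incoming neighbor: positive uses forward, negative uses inverse.
        adjacency[tgt].append(Neighbor(src, relation, func.forward, func.inverse))
    return dict(adjacency)


functionality = {"acquired": Functionality(forward=0.9, inverse=0.4)}
adj = build_adjacency([("novatech", "acquired", "datavault")], functionality)
```

Keeping both weights on one record means a merge only has to deduplicate a single adjacency structure, which is the reason Option 1 was preferred over a parallel neg_adj.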

claude

No issues found — ready to merge.

The negative-evidence functionality fix is correct: both weights stored per Neighbor, func_weight for positive propagation and neg_func_weight for negative evidence, with the right direction assignments in _build_adjacency. Single-loop progressive merging is clean: positive converges before negative dampening, incremental O(degree) adjacency updates on merge, re-seed prevents dampening compounding across cycles. No dead code or shims remain.

- Remove canon_graphs tracking (dead code: all pairs originate cross-graph, so the len==1 filter never triggers; ra==rb already handles self-pairs)
- Replace pre_merge_canons/pre_merge_graphs with a simple set comprehension
- Extract _remap_confidence helper for the identical confidence/name_seed remapping operations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
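
The commit does not show the helper's body, so the following is only a guess at its shape: re-key a pair-scored dict through the union-find after a merge, dropping pairs that collapse to the same representative. Keeping the maximum on key collisions is an assumption, as are the name remap_confidence and the sample data.

```python
from typing import Callable


def remap_confidence(
    scores: dict[tuple[str, str], float],
    find: Callable[[str], str],
) -> dict[tuple[str, str], float]:
    """Re-key pair scores by canonical representatives."""
    remapped: dict[tuple[str, str], float] = {}
    for (a, b), score in scores.items():
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # the pair is now a single entity
        key = (ra, rb)
        # Assumption: when two old pairs land on one key, keep the higher score.
        remapped[key] = max(remapped.get(key, 0.0), score)
    return remapped


parent = {"b": "a"}  # "b" was merged into "a"


def find(entity_id: str) -> str:
    return parent.get(entity_id, entity_id)


confidence = {("a", "x"): 0.6, ("b", "x"): 0.8, ("a", "b"): 0.95}
remapped = remap_confidence(confidence, find)
```

The same function could be applied to both the confidence dict and the name_seed dict, which is the duplication the commit message says the helper removes.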

claude

No issues found — this is ready to merge.

Verified: functionality weight directions in _build_adjacency correctly assign func_weight (positive) and neg_func_weight (negative) matching the old separate adjacency builders. Incremental adjacency merge dedup is correct (same-relation entries have identical weights). Single-loop design with positive-converge-then-negative-then-merge is sound — re-seeding prevents dampening compounding across cycles. Final expansion to original entity pairs is correct. No dead code or shims.

github-actions bot and others added 6 commits March 27, 2026 15:43

Expose merge_threshold and max_epochs in CLI and run_matching

557e666

Add --merge-threshold and --max-epochs options to the match CLI command, passing them through run_matching to the propagation pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude bot commented Mar 27, 2026

Remove dead code: _build_weighted_adjacency and _build_forward_adjacency

a626765

These functions were replaced by _build_epoch_adjacency but left behind. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude bot commented Mar 27, 2026

monneyboi requested changes Mar 27, 2026

claude bot commented Mar 29, 2026

monneyboi added 2 commits March 29, 2026 19:58

update docs for other convergence proof

db440aa

refactor to new convergence

115b587

monneyboi merged commit 3274843 into main Mar 29, 2026

monneyboi merged 13 commits into main from progressive-merging
