Skip to content

Implement progressive merging via epoch-based propagation with union-find#29

Merged
monneyboi merged 13 commits intomainfrom
progressive-merging
Mar 29, 2026
Merged

Implement progressive merging via epoch-based propagation with union-find#29
monneyboi merged 13 commits intomainfrom
progressive-merging

Conversation

@claude
Copy link
Copy Markdown

@claude claude bot commented Mar 27, 2026

Summary

  • Implement epoch-based progressive merging as described in docs/progressive_merging.md
  • High-confidence entity merges are committed during propagation (not just post-processing), enriching neighborhoods for subsequent iterations
  • All 57 original tests pass at every commit; 2 new tests added (59 total)

Closes #25

Implementation

Each commit is a standalone step where all existing tests pass:

  1. Move negative dampening outside inner propagation loop — pure behavior-preserving refactor making the inner loop a clean monotone fixpoint
  2. Extract propagate_positive() and apply_negative() — split into standalone functions, same API and behavior
  3. Add epoch loop with union-find merging — wraps propagate + dampen cycle in an outer epoch loop. Key design decisions:
    • Adjacency lists deduplicated per epoch to prevent inflated evidence from merged entities
    • Name-only seed (not carried-forward confidence) used for negative evidence to prevent circular reinforcement across epochs
    • Best confidence across all epochs preserved per pair, ensuring progressive merging never worsens scores
    • Default parameters (merge_threshold=0.9, max_epochs=5) reproduce previous behavior
  4. Enriched-neighborhood test (Layer 2) — three articles where A+B merge in epoch 1, then C benefits from the enriched A+B neighborhood (confidence improves from ~0.57 to ~0.78)
  5. Cascading false merge test (Layer 3) — two unrelated clusters with isomorphic structure stay separate even with progressive merging enabled
  6. CLI exposure--merge-threshold and --max-epochs options added to worldgraph match

Test plan

  • All 57 original tests pass at every step
  • New L2 test demonstrates enriched-neighborhood benefit (higher confidence with max_epochs>1 vs max_epochs=1)
  • New L3 test verifies no cascading false merges across unrelated clusters
  • match_graphs() API backward-compatible (new params have defaults reproducing current behavior)

🤖 Generated with Claude Code

github-actions bot and others added 6 commits March 27, 2026 15:43
The negative factor reads from name_seed (fixed) and doesn't depend on
anything that changes within the positive fixpoint loop. Moving it to a
single post-convergence pass makes the inner loop a clean monotone
fixpoint and prepares for epoch-based progressive merging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split the positive fixpoint loop and negative dampening pass into
standalone functions. propagate_similarity() now calls these in
sequence — same API, same behavior. This separation enables the
epoch loop in the next step.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the propagate + dampen cycle in an outer epoch loop. High-confidence
merges (>= merge_threshold, default 0.9) are committed between epochs
via union-find. Merged entities' neighborhoods are unioned for subsequent
epochs, allowing evidence from transitively-matched entities to compound.

Key design decisions:
- Adjacency lists are deduplicated per epoch to prevent inflated evidence
  from merged entities having multiple copies of structurally identical edges
- Name-only seed (not carried-forward confidence) is used for negative
  evidence to prevent circular reinforcement across epochs
- Best confidence across all epochs is preserved per pair, ensuring
  progressive merging never worsens scores vs single-epoch behavior
- With default parameters, existing behavior is preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three articles where A+B merge in epoch 1 (identical names + strong
structural overlap). Article C's "Meridian Tech Corp" has moderate name
similarity and neighbors split between A-unique (Austin) and B-unique
(Volta Systems). Without progressive merging, C sees only pairwise
evidence. With progressive merging, the enriched A+B neighborhood
provides additional structural paths, producing measurably higher
confidence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rges

Two unrelated clusters with isomorphic structure (NovaTech/DataVault vs
Quantum Labs/ClearSky) stay separate even with progressive merging
enabled. Within-cluster merges work correctly while cross-cluster
isolation is maintained.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --merge-threshold and --max-epochs options to the match CLI command,
passing them through run_matching to the propagation pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Author

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes (can't use --request-changes on own PR)

Dead code: _build_weighted_adjacency and _build_forward_adjacency (match.py:155-208)

These two functions are defined but never called — they were replaced by _build_epoch_adjacency but left in the file. Per CLAUDE.md: "never add backward-compatibility shims, preserve stale signatures, or keep dead code around 'just in case'. Refactor completely."

Delete both functions. No other issues found — the rest of the implementation is clean and correct.

These functions were replaced by _build_epoch_adjacency but left behind.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Author

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: No issues found — this is ready to merge.

Clean decomposition of the propagation loop into propagate_positive / apply_negative, and the epoch-based progressive merging is well-structured. Specific things I verified:

  • Behavioral equivalence for single-epoch case: Since the negative factor depends only on fixed name_seed and forward_adj, moving it from per-iteration to post-convergence produces identical final values. The convergence criterion difference (dampened vs undampened delta) is negligible.
  • Adjacency deduplication in _build_epoch_adjacency correctly prevents inflated evidence from merged entities' redundant edges.
  • Name-only seed for negative evidence prevents circular reinforcement across epochs — the design from docs/progressive_merging.md is faithfully implemented.
  • best_confidence tracking correctly expands canonical-rep scores to original entity pairs and preserves the best score across epochs, so progressive merging never worsens results.
  • Return type change (propagate_similarity now returns tuple[Confidence, UnionFind]) is handled correctly at the only call site in match_graphs.
  • No dead code, no shims, no half-finished refactors — the old _build_weighted_adjacency and _build_forward_adjacency are cleanly replaced by _build_epoch_adjacency.

One minor note (not blocking): _build_epoch_adjacency doesn't skip edges where src == tgt after mapping through union-find (self-loops from merging connected entities). This is unlikely in practice since entities connected by a relation are semantically distinct, but a if src == tgt: continue guard would be a clean defensive addition.

When merged entities were previously connected by a relation, mapping
both endpoints through union-find produces src == tgt. These self-loops
would pollute adjacency lists with spurious neighbors. Filter them out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@monneyboi monneyboi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace the epoch-based double loop with a single propagation loop using incremental canonical adjacency.

The current design runs an inner fixpoint loop to convergence, then an outer epoch loop that rebuilds adjacency, re-seeds confidence, and rebuilds pairs between epochs. This is wasteful (most re-convergence work is redundant since only merged entities' neighborhoods changed) and harder to follow than necessary.

The epoch split exists because the docs sketched it that way, but the algorithm doesn't require it. A single loop with inline merging works if we handle adjacency dedup correctly — and we can do that incrementally instead of rebuilding from scratch each epoch.

The key insight: maintain a canonical_adj alongside the UnionFind, updated incrementally on merge.

The adjacency is consumed identically in both positive propagation and negative evidence: adj.get(entity_id, []) → iterate Neighbor(entity_id, relation, func_weight). The consumer doesn't care how the list was built. So on merge of A and B into canonical rep C:

# O(degree_A + degree_B) per merge
combined = canonical_adj[A] + canonical_adj[B]
seen = set()
deduped = []
for nbr in combined:
    canon_nbr = uf.find(nbr.entity_id)
    key = (canon_nbr, nbr.relation)
    if key not in seen:
        seen.add(key)
        deduped.append(Neighbor(canon_nbr, nbr.relation, nbr.func_weight))
canonical_adj[C] = deduped

The inner loop stays flat — same structure as current code, no member iteration, no adjacency rebuild:

for neighbor_a in canonical_adj.get(ca, []):
    for neighbor_b in canonical_adj.get(cb, []):
        nbr_conf = prev_base.get((neighbor_a.entity_id, neighbor_b.entity_id), 0.0)

Why dedup matters (can't skip it): without dedup, an entity appearing in 5 articles inflates structural evidence from ~0.59 to ~0.99 on the same underlying fact. The exponential sum saturates but not fast enough — the inflation is severe at exactly the scale we're targeting (major entities appearing across many articles).

What this eliminates:

  • _build_epoch_adjacency (full O(|edges|) rebuild per epoch)
  • _build_epoch_pairs (full O(n²) rebuild per epoch)
  • _seed_epoch_confidence (full O(n² × names²) re-seeding per epoch)
  • The best_confidence tracking and final member-expansion loop
  • The propagate_positive / apply_negative split into separate functions (they fold back into the single loop)

What this keeps:

  • UnionFind for entity identity (already have it)
  • Canonical adjacency for neighborhood identity (new, incremental, O(degree) per merge)
  • Pairs list and confidence dict keyed by canonical reps, updated on merge
  • Deduplication of redundant edges between same canonical entities

The epoch-based design ran an inner fixpoint loop to convergence, then
an outer epoch loop that rebuilt adjacency, re-seeded confidence, and
rebuilt pairs between epochs. This was wasteful and harder to follow.

Replace with a single propagation loop that maintains a canonical_adj
alongside the UnionFind, updated incrementally on merge (O(degree) per
merge instead of O(|edges|) full rebuild). Positive evidence runs to
convergence, then negative dampening is applied once, then merges are
committed inline. Enriched neighborhoods compound across merge cycles.

Key changes:
- Remove _build_epoch_adjacency, _build_epoch_pairs, _seed_epoch_confidence
- Remove propagate_positive and apply_negative as separate functions
- Remove max_epochs parameter (single loop handles everything)
- Add _build_adjacency, _build_pairs, _seed_confidence (initial setup)
- Incremental canonical_adj dedup on merge prevents evidence inflation
- Re-seed from name similarity after merge to prevent compounding dampening
- Negative evidence resolves neighbor IDs through UnionFind and treats
  same-canonical-entity neighbors as perfect matches

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Author

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: negative evidence uses the wrong functionality direction.

_build_adjacency (line 295) assigns functionality weights matching the old positive-propagation adjacency — source gets func.inverse, target gets func.forward:

adjacency[src].append(Neighbor(tgt, edge.relation, func.inverse))
adjacency[tgt].append(Neighbor(src, edge.relation, func.forward))

But _one_sided_negative reads neighbor_a.func_weight as the negative-evidence weight. Per the algorithm design and docs/negative_evidence.md line 38 ("Negative evidence uses forward functionality"), it needs the opposite direction: forward for outgoing neighbors, inverse for incoming.

The old code had a separate _build_forward_adjacency with the swapped directions specifically for this. The refactor to a single canonical_adj dropped that distinction — negative evidence now weights mismatches by inverse functionality (outgoing) / forward functionality (incoming) instead of forward / inverse. This is semantically wrong: a functional relation (high forward func, e.g. "acquired") should generate strong negative evidence when its target doesn't match, but now it's weighted by inverse func which measures something different (how unique is the source given the target?).

The fix: since _one_sided_negative iterates over adj.get(id_a, []) and reads neighbor.func_weight, you need the weight to reflect the correct direction for negative evidence. Two options:

  1. Store both weights in Neighbor (add a field), use the right one in each context.
  2. Build a parallel neg_adj with swapped directions (like the old _build_forward_adjacency), maintained incrementally alongside canonical_adj.

Option 1 is cleaner since it avoids maintaining two adjacency structures through merges.

The refactor to a single canonical_adj dropped the distinction between
positive and negative functionality directions. Negative evidence needs
forward func for outgoing neighbors and inverse for incoming (the
opposite of positive propagation). Add neg_func_weight to Neighbor so
each context uses the correct weight.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Author

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found — ready to merge.

The negative-evidence functionality fix is correct: both weights stored per Neighbor, func_weight for positive propagation and neg_func_weight for negative evidence, with the right direction assignments in _build_adjacency. Single-loop progressive merging is clean: positive converges before negative dampening, incremental O(degree) adjacency updates on merge, re-seed prevents dampening compounding across cycles. No dead code or shims remain.

- Remove canon_graphs tracking (dead code: all pairs originate cross-graph,
  so the len==1 filter never triggers; ra==rb already handles self-pairs)
- Replace pre_merge_canons/pre_merge_graphs with a simple set comprehension
- Extract _remap_confidence helper for the identical confidence/name_seed
  remapping operations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Author

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found — this is ready to merge.

Verified: functionality weight directions in _build_adjacency correctly assign func_weight (positive) and neg_func_weight (negative) matching the old separate adjacency builders. Incremental adjacency merge dedup is correct (same-relation entries have identical weights). Single-loop design with positive-converge-then-negative-then-merge is sound — re-seeding prevents dampening compounding across cycles. Final expansion to original entity pairs is correct. No dead code or shims.

@monneyboi monneyboi merged commit 3274843 into main Mar 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement progressive merging via epoch-based propagation with union-find

1 participant