Skip to content

Implement progressive merging via epoch-based propagation#26

Closed
claude[bot] wants to merge 3 commits intomainfrom
implement-progressive-merging
Closed

Implement progressive merging via epoch-based propagation#26
claude[bot] wants to merge 3 commits intomainfrom
implement-progressive-merging

Conversation

@claude
Copy link
Copy Markdown

@claude claude bot commented Mar 26, 2026

Summary

  • Refactors propagate_similarity() into propagate_positive() (monotone fixpoint) and apply_negative() (single-pass dampening), wrapped in an epoch loop with union-find merging
  • Between epochs, high-confidence pairs (>0.9) are committed via union-find, enriching neighborhoods for subsequent propagation
  • Canonical adjacency is deduplicated to prevent artificial inflation; shared canonical neighbors contribute confidence=1.0
  • All 54 existing tests pass unchanged; 2 new tests added (56 total)

Test plan

  • All 54 existing tests pass at every commit
  • test_progressive_merging_enriches_neighborhoods — Layer 2: verifies enriched-neighborhood benefit (higher confidence with max_epochs>1 vs max_epochs=1)
  • test_progressive_merging_no_cascading_false_merges — Layer 3: verifies cross-cluster isolation with progressive merging enabled

Closes #25

🤖 Generated with Claude Code

github-actions bot and others added 2 commits March 26, 2026 15:57
Refactor propagate_similarity() into three clean components:
- propagate_positive(): monotone positive fixpoint loop
- apply_negative(): single-pass post-convergence negative dampening
- propagate_similarity(): epoch loop wrapping the above two

Between epochs, high-confidence pairs (above merge_threshold=0.9) are
committed via union-find, enriching neighborhoods for subsequent
propagation. Merged entities' neighborhoods are deduplicated to prevent
artificial inflation of positive evidence. Shared canonical neighbors
(same entity after merging) contribute confidence=1.0 in both positive
propagation and negative evidence checks.

With default parameters, existing behavior is preserved: pairs below 0.9
don't trigger progressive merges, and the epoch loop exits naturally.

Closes #25 (steps 1-3)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Layer 2 test (test_propagation.py):
- test_progressive_merging_enriches_neighborhoods: verifies that after
  A+B merge in epoch 1, C's entity benefits from the enriched neighborhood
  in epoch 2, producing higher confidence than max_epochs=1.

Layer 3 test (test_integration.py):
- test_progressive_merging_no_cascading_false_merges: two unrelated
  clusters with identical entity names and similar structure stay
  separate even after within-cluster progressive merges enrich
  neighborhoods. Negative evidence prevents cross-cluster merging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Author

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

O(n⁴) blowup in canonical pair seeding (match.py lines 536-542)

Inside the epoch loop, each new canonical pair triggers an inner scan of all original pairs to find the best name similarity:

for id_a, id_b in all_pairs:          # O(n²) outer
    ...
    if pair not in seen_canon:
        for orig_a, orig_b in all_pairs:  # O(n²) inner
            ...

This is O(|all_pairs|²) = O(n⁴) per epoch. For the target scale ("continuous feed of all major news outlets"), this will dominate runtime.

Fix: precompute a best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, then look up best_name_per_canon.get(pair, 0.0) inside the canonical-pair loop:

best_name_per_canon: dict[tuple[str, str], float] = {}
for orig_a, orig_b in all_pairs:
    ca, cb = uf.find(orig_a), uf.find(orig_b)
    if ca == cb:
        continue
    cpair = (ca, cb) if ca < cb else (cb, ca)
    best_name_per_canon[cpair] = max(
        best_name_per_canon.get(cpair, 0.0),
        name_seed.get((orig_a, orig_b), 0.0),
    )

Everything else looks correct — the canonical adjacency dedup, the identity check in _one_sided_negative, the epoch-to-epoch carry-forward, and the final confidence mapping all hold up. Tests are well-designed. Just the one quadratic blowup to fix.

Copy link
Copy Markdown
Contributor

@monneyboi monneyboi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the O(n⁴) blowup in canonical pair seeding identified in the previous review.

Inside the epoch loop, each new canonical pair triggers an inner scan of all original pairs to find the best name similarity — this is O(|all_pairs|²) = O(n⁴) per epoch.

Fix: precompute a best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, then look up from it inside the canonical-pair loop. See the previous review comment for the suggested implementation.

Precompute best_name_per_canon map in a single O(n²) pass over all_pairs
at the top of each epoch, replacing the O(n²) inner scan that ran for
each new canonical pair.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Author

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: stale canonical reps when max_epochs is exhausted

When the epoch loop runs all max_epochs iterations AND the final epoch produces merges, the confidence dict becomes miskeyed relative to the union-find state used in the final mapping (lines 622-636).

The issue: merges are committed at lines 609-610 (uf.union(ca, cb)), which can change canonical reps. But confidence was computed before those merges, so its keys use the pre-merge reps. The final mapping at line 632 looks up confidence.get((ca, cb), ...) using post-merge reps from uf.find(), which may not match the keys in confidence.

Concrete example: epoch N (the last) merges canonical reps X and Z, with Z becoming the new rep. For an unmerged pair (X, Y) that had confidence[(X, Y)] = 0.7, the final mapping computes ca = uf.find(...) = Z and looks up confidence[(Z, Y)] — which doesn't exist — returning 0.0 instead of 0.7.

The normal exit path (not new_mergesbreak at line 606) is fine because no merges are committed after the last confidence computation. The bug is only on the range(max_epochs) exhaustion path.

Fix: skip the merge-commit on the final iteration since there's no subsequent epoch to benefit. Something like:

if _epoch == max_epochs - 1:
    break

before line 608. Merged pairs still get picked up by build_match_groups (their confidence is ≥ 0.9, well above the 0.8 grouping threshold), and unmerged pairs keep their correct scores.

@monneyboi
Copy link
Copy Markdown
Contributor

Closing — the canonical-name remapping approach is too hacky. Will implement #3 (multi-label entities) first, then rebuild progressive merging cleanly on top.

@monneyboi monneyboi closed this Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement progressive merging via epoch-based propagation with union-find

1 participant