Implement progressive merging via epoch-based propagation#26
Implement progressive merging via epoch-based propagation#26claude[bot] wants to merge 3 commits intomainfrom
Conversation
Refactor propagate_similarity() into three clean components: - propagate_positive(): monotone positive fixpoint loop - apply_negative(): single-pass post-convergence negative dampening - propagate_similarity(): epoch loop wrapping the above two Between epochs, high-confidence pairs (above merge_threshold=0.9) are committed via union-find, enriching neighborhoods for subsequent propagation. Merged entities' neighborhoods are deduplicated to prevent artificial inflation of positive evidence. Shared canonical neighbors (same entity after merging) contribute confidence=1.0 in both positive propagation and negative evidence checks. With default parameters, existing behavior is preserved: pairs below 0.9 don't trigger progressive merges, and the epoch loop exits naturally. Closes #25 (steps 1-3) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Layer 2 test (test_propagation.py): - test_progressive_merging_enriches_neighborhoods: verifies that after A+B merge in epoch 1, C's entity benefits from the enriched neighborhood in epoch 2, producing higher confidence than max_epochs=1. Layer 3 test (test_integration.py): - test_progressive_merging_no_cascading_false_merges: two unrelated clusters with identical entity names and similar structure stay separate even after within-cluster progressive merges enrich neighborhoods. Negative evidence prevents cross-cluster merging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
O(n⁴) blowup in canonical pair seeding (match.py lines 536-542)
Inside the epoch loop, each new canonical pair triggers an inner scan of all original pairs to find the best name similarity:
for id_a, id_b in all_pairs: # O(n²) outer
...
if pair not in seen_canon:
for orig_a, orig_b in all_pairs: # O(n²) inner
...This is O(|all_pairs|²) = O(n⁴) per epoch. For the target scale ("continuous feed of all major news outlets"), this will dominate runtime.
Fix: precompute a best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, then look up best_name_per_canon.get(pair, 0.0) inside the canonical-pair loop:
best_name_per_canon: dict[tuple[str, str], float] = {}
for orig_a, orig_b in all_pairs:
ca, cb = uf.find(orig_a), uf.find(orig_b)
if ca == cb:
continue
cpair = (ca, cb) if ca < cb else (cb, ca)
best_name_per_canon[cpair] = max(
best_name_per_canon.get(cpair, 0.0),
name_seed.get((orig_a, orig_b), 0.0),
)Everything else looks correct — the canonical adjacency dedup, the identity check in _one_sided_negative, the epoch-to-epoch carry-forward, and the final confidence mapping all hold up. Tests are well-designed. Just the one quadratic blowup to fix.
monneyboi
left a comment
There was a problem hiding this comment.
Fix the O(n⁴) blowup in canonical pair seeding identified in the previous review.
Inside the epoch loop, each new canonical pair triggers an inner scan of all original pairs to find the best name similarity — this is O(|all_pairs|²) = O(n⁴) per epoch.
Fix: precompute a best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, then look up from it inside the canonical-pair loop. See the previous review comment for the suggested implementation.
Precompute best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, replacing the O(n²) inner scan that ran for each new canonical pair. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Bug: stale canonical reps when max_epochs is exhausted
When the epoch loop runs all max_epochs iterations AND the final epoch produces merges, the confidence dict becomes miskeyed relative to the union-find state used in the final mapping (lines 622-636).
The issue: merges are committed at lines 609-610 (uf.union(ca, cb)), which can change canonical reps. But confidence was computed before those merges, so its keys use the pre-merge reps. The final mapping at line 632 looks up confidence.get((ca, cb), ...) using post-merge reps from uf.find(), which may not match the keys in confidence.
Concrete example: epoch N (the last) merges canonical reps X and Z, with Z becoming the new rep. For an unmerged pair (X, Y) that had confidence[(X, Y)] = 0.7, the final mapping computes ca = uf.find(...) = Z and looks up confidence[(Z, Y)] — which doesn't exist — returning 0.0 instead of 0.7.
The normal exit path (not new_merges → break at line 606) is fine because no merges are committed after the last confidence computation. The bug is only on the range(max_epochs) exhaustion path.
Fix: skip the merge-commit on the final iteration since there's no subsequent epoch to benefit. Something like:
if _epoch == max_epochs - 1:
breakbefore line 608. Merged pairs still get picked up by build_match_groups (their confidence is ≥ 0.9, well above the 0.8 grouping threshold), and unmerged pairs keep their correct scores.
|
Closing — the canonical-name remapping approach is too hacky. Will implement #3 (multi-label entities) first, then rebuild progressive merging cleanly on top. |
Summary
propagate_similarity()intopropagate_positive()(monotone fixpoint) andapply_negative()(single-pass dampening), wrapped in an epoch loop with union-find mergingTest plan
test_progressive_merging_enriches_neighborhoods— Layer 2: verifies enriched-neighborhood benefit (higher confidence withmax_epochs>1vsmax_epochs=1)test_progressive_merging_no_cascading_false_merges— Layer 3: verifies cross-cluster isolation with progressive merging enabledCloses #25
🤖 Generated with Claude Code