Implement progressive merging via epoch-based propagation by claude[bot] · Pull Request #26 · resolveworks/worldgraph

claude · 2026-03-26T14:59:41Z

Summary

Refactors propagate_similarity() into propagate_positive() (monotone fixpoint) and apply_negative() (single-pass dampening), wrapped in an epoch loop with union-find merging
Between epochs, high-confidence pairs (>0.9) are committed via union-find, enriching neighborhoods for subsequent propagation
Canonical adjacency is deduplicated to prevent artificial inflation; shared canonical neighbors contribute confidence=1.0
All 54 existing tests pass unchanged; 2 new tests added (56 total)

Test plan

All 54 existing tests pass at every commit
test_progressive_merging_enriches_neighborhoods — Layer 2: verifies enriched-neighborhood benefit (higher confidence with max_epochs>1 vs max_epochs=1)
test_progressive_merging_no_cascading_false_merges — Layer 3: verifies cross-cluster isolation with progressive merging enabled

Closes #25

🤖 Generated with Claude Code

Refactor propagate_similarity() into three clean components: - propagate_positive(): monotone positive fixpoint loop - apply_negative(): single-pass post-convergence negative dampening - propagate_similarity(): epoch loop wrapping the above two Between epochs, high-confidence pairs (above merge_threshold=0.9) are committed via union-find, enriching neighborhoods for subsequent propagation. Merged entities' neighborhoods are deduplicated to prevent artificial inflation of positive evidence. Shared canonical neighbors (same entity after merging) contribute confidence=1.0 in both positive propagation and negative evidence checks. With default parameters, existing behavior is preserved: pairs below 0.9 don't trigger progressive merges, and the epoch loop exits naturally. Closes #25 (steps 1-3) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Layer 2 test (test_propagation.py): - test_progressive_merging_enriches_neighborhoods: verifies that after A+B merge in epoch 1, C's entity benefits from the enriched neighborhood in epoch 2, producing higher confidence than max_epochs=1. Layer 3 test (test_integration.py): - test_progressive_merging_no_cascading_false_merges: two unrelated clusters with identical entity names and similar structure stay separate even after within-cluster progressive merges enrich neighborhoods. Negative evidence prevents cross-cluster merging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude

O(n⁴) blowup in canonical pair seeding (match.py lines 536-542)

Inside the epoch loop, each new canonical pair triggers an inner scan of all original pairs to find the best name similarity:

for id_a, id_b in all_pairs:          # O(n²) outer
    ...
    if pair not in seen_canon:
        for orig_a, orig_b in all_pairs:  # O(n²) inner
            ...

This is O(|all_pairs|²) = O(n⁴) per epoch. For the target scale ("continuous feed of all major news outlets"), this will dominate runtime.

Fix: precompute a best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, then look up best_name_per_canon.get(pair, 0.0) inside the canonical-pair loop:

best_name_per_canon: dict[tuple[str, str], float] = {}
for orig_a, orig_b in all_pairs:
    ca, cb = uf.find(orig_a), uf.find(orig_b)
    if ca == cb:
        continue
    cpair = (ca, cb) if ca < cb else (cb, ca)
    best_name_per_canon[cpair] = max(
        best_name_per_canon.get(cpair, 0.0),
        name_seed.get((orig_a, orig_b), 0.0),
    )

Everything else looks correct — the canonical adjacency dedup, the identity check in _one_sided_negative, the epoch-to-epoch carry-forward, and the final confidence mapping all hold up. Tests are well-designed. Just the one quadratic blowup to fix.

monneyboi

Fix the O(n⁴) blowup in canonical pair seeding identified in the previous review.

Inside the epoch loop, each new canonical pair triggers an inner scan of all original pairs to find the best name similarity — this is O(|all_pairs|²) = O(n⁴) per epoch.

Fix: precompute a best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, then look up from it inside the canonical-pair loop. See the previous review comment for the suggested implementation.

Precompute best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, replacing the O(n²) inner scan that ran for each new canonical pair. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude

Bug: stale canonical reps when max_epochs is exhausted

When the epoch loop runs all max_epochs iterations AND the final epoch produces merges, the confidence dict becomes miskeyed relative to the union-find state used in the final mapping (lines 622-636).

The issue: merges are committed at lines 609-610 (uf.union(ca, cb)), which can change canonical reps. But confidence was computed before those merges, so its keys use the pre-merge reps. The final mapping at line 632 looks up confidence.get((ca, cb), ...) using post-merge reps from uf.find(), which may not match the keys in confidence.

Concrete example: epoch N (the last) merges canonical reps X and Z, with Z becoming the new rep. For an unmerged pair (X, Y) that had confidence[(X, Y)] = 0.7, the final mapping computes ca = uf.find(...) = Z and looks up confidence[(Z, Y)] — which doesn't exist — returning 0.0 instead of 0.7.

The normal exit path (not new_merges → break at line 606) is fine because no merges are committed after the last confidence computation. The bug is only on the range(max_epochs) exhaustion path.

Fix: skip the merge-commit on the final iteration since there's no subsequent epoch to benefit. Something like:

if _epoch == max_epochs - 1:
    break

before line 608. Merged pairs still get picked up by build_match_groups (their confidence is ≥ 0.9, well above the 0.8 grouping threshold), and unmerged pairs keep their correct scores.

monneyboi · 2026-03-27T12:48:54Z

Closing — the canonical-name remapping approach is too hacky. Will implement #3 (multi-label entities) first, then rebuild progressive merging cleanly on top.

github-actions bot and others added 2 commits March 26, 2026 15:57

claude bot commented Mar 26, 2026

View reviewed changes

monneyboi requested changes Mar 26, 2026

View reviewed changes

Fix O(n⁴) blowup in canonical pair name-similarity seeding

8c25e7e

Precompute best_name_per_canon map in a single O(n²) pass over all_pairs at the top of each epoch, replacing the O(n²) inner scan that ran for each new canonical pair. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude bot commented Mar 26, 2026

View reviewed changes

monneyboi closed this Mar 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement progressive merging via epoch-based propagation#26

Implement progressive merging via epoch-based propagation#26
claude[bot] wants to merge 3 commits intomainfrom
implement-progressive-merging

claude bot commented Mar 26, 2026

Uh oh!

claude bot left a comment

Uh oh!

monneyboi left a comment

Uh oh!

claude bot left a comment

Uh oh!

monneyboi commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

claude bot commented Mar 26, 2026

Summary

Test plan

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Uh oh!

monneyboi left a comment

Choose a reason for hiding this comment

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Uh oh!

monneyboi commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant