Merged
16 changes: 8 additions & 8 deletions CLAUDE.md
@@ -38,23 +38,23 @@ The core propagation loop (`match.py`):

1. **Name-similarity seeding** — Soft TF-IDF + Jaro-Winkler seeds the confidence dict before iteration starts. This gives propagation initial signal to work with. See [docs/name_similarity.md](docs/name_similarity.md).

2. **Relation similarity via sentence embeddings** — relation phrase similarity is a continuous multiplier on propagation paths, not a binary gate. "acquired" ↔ "purchased" (~0.85) contributes proportionally; "acquired" ↔ "located in" (~0.1) contributes almost nothing. This replaces the identical-label requirement in standard SF/PARIS.
2. **Relation similarity via sentence embeddings** — relation phrase similarity is thresholded into equivalence classes. "acquired" ↔ "purchased" (above threshold) are treated as equivalent; "acquired" ↔ "located in" (below) are not. The threshold is used consistently for functionality pooling and propagation gating.
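A minimal sketch of the thresholding into equivalence classes. The pairwise scores below are toy values standing in for sentence-embedding cosine similarities, and the 0.7 threshold is illustrative, not the tuned constant:

```python
from itertools import combinations

# Hypothetical pairwise similarities (in the real system these come from
# sentence embeddings of the relation phrases).
SIM = {
    frozenset(("acquired", "purchased")): 0.85,
    frozenset(("acquired", "located in")): 0.10,
    frozenset(("purchased", "located in")): 0.12,
}
THRESHOLD = 0.7  # illustrative

def equivalence_classes(phrases, sim, threshold):
    """Pool relation phrases whose similarity clears the threshold (union-find)."""
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        if sim.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    classes = {}
    for p in phrases:
        classes.setdefault(find(p), set()).add(p)
    return list(classes.values())

# "acquired"/"purchased" pool together; "located in" stays alone.
print(equivalence_classes(["acquired", "purchased", "located in"], SIM, THRESHOLD))
```

Because pooling is a union over above-threshold pairs, the classes are transitive closures — consistent with using one threshold for both functionality pooling and propagation gating.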

3. **Functionality weighting** — global forward and inverse functionality (1/avg_degree), with similar relation phrases pooled. See [docs/functionality.md](docs/functionality.md).
3. **Functionality weighting** — global forward and inverse functionality (1/avg_degree), with equivalent relation phrases pooled. See [docs/functionality.md](docs/functionality.md).

4. **Exponential sum aggregation** — `1 - exp(-λ × Σ strengths)` where each path contributes `rel_sim × min(func_a, func_b) × neighbor_confidence`. Rewards breadth over single strong paths.
4. **Exponential sum aggregation** — `1 - exp(-λ × Σ strengths)` where each path contributes `min(func_a, func_b) × neighbor_confidence`. Rewards breadth over single strong paths.
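A sketch of the aggregation (λ = 1 for illustration), showing how several medium-strength paths outscore one strong path:

```python
import math

def aggregate(strengths, lam=1.0):
    """Exp-sum aggregation: 1 - exp(-lam * sum(strengths)).

    Saturates toward 1.0 as independent paths accumulate, so breadth of
    evidence beats a single strong path, but the score never exceeds 1.
    """
    return 1.0 - math.exp(-lam * sum(strengths))

print(round(aggregate([0.9]), 3))            # one strong path -> 0.593
print(round(aggregate([0.5, 0.5, 0.5]), 3))  # three medium paths -> 0.777
```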

5. **Monotone non-decreasing updates** — confidence only goes up, never down. Preserves convergence guarantees (FLORA-style).
5. **Damped fixed-point iteration** — `new = (1-d)*old + d*computed` where computed integrates positive and negative evidence around the name-similarity seed. Converges via contraction (see [docs/similarity_flooding.md](docs/similarity_flooding.md)).

6. **Unified N-graph matching** — all article graphs merged into one, propagation runs once over all cross-graph pairs. Final grouping via union-find.

### What's not implemented (yet)
7. **Negative evidence** ([docs/negative_evidence.md](docs/negative_evidence.md)) — integrated directly into the single propagation score. Neighbors with confidence < 0.5 contribute negative evidence weighted by forward functionality, pushing the score toward 0. Damped iteration bounds circular reinforcement geometrically.

These are documented in `docs/` with design sketches but no code.
8. **Progressive merging** ([docs/progressive_merging.md](docs/progressive_merging.md)) — high-confidence merges are committed inline during the single propagation loop. Canonical adjacency is updated incrementally on merge (O(degree) per merge), avoiding full adjacency rebuilds. Enriched neighborhoods compound structural evidence across merge cycles.

- **Negative evidence** ([docs/negative_evidence.md](docs/negative_evidence.md)) — the absence of expected neighbor matches should count against entity equivalence. Without this, entities with identical names but different contexts merge incorrectly (see `test_identical_names_different_contexts_no_merge`, `test_similar_names_disjoint_neighborhoods_no_match`). PARIS tried this and abandoned it as too aggressive; we propose a dampened version.
### What's not implemented (yet)

- **Progressive merging** ([docs/progressive_merging.md](docs/progressive_merging.md)) — commit high-confidence merges during propagation and continue with enriched neighborhoods. Currently all merging is post-processing via union-find.
These are documented in `docs/` with design sketches but no code.

- **Local functionality** — FLORA uses per-entity functionality (`1/|targets for this specific source|`), not just global averages. We only compute global.

48 changes: 26 additions & 22 deletions docs/negative_evidence.md
@@ -54,46 +54,49 @@ In knowledge base alignment (PARIS's domain), completeness is somewhat reasonabl

FLORA (Peng et al. 2025) explicitly excludes negation from its framework. The "Simple Positive FIS" (Definition 1) requires all variables to be non-decreasing, which is what makes the Knaster-Tarski convergence proof work. Allowing scores to decrease would break monotonicity and void the convergence guarantee.

## Our approach: dampened negative evidence
By switching from Knaster-Tarski (monotone updates) to Banach (contraction mappings) as our convergence framework, this restriction is lifted — scores can decrease, and negative evidence integrates naturally into each iteration. See [similarity_flooding.md](similarity_flooding.md) for the full theoretical comparison.

## Our approach: integrated negative evidence via damped iteration

We need negative evidence but cannot afford PARIS's brittleness. The key insight is that negative evidence should be **weaker and more selective** than positive evidence, reflecting the fundamental asymmetry in our setting:

- A match between neighbors is *reliable* positive evidence (two articles independently reporting the same fact)
- A *missing* match could mean many things (incomplete coverage, relation phrasing mismatch, extraction error)

### Dampened negative factor
### How it works

Positive and negative evidence are computed together in each propagation step, feeding into a single score per entity pair. For each pair `(a, b)`, we examine all neighbor pairs `(y, y')` connected via similar relations:

For each entity pair `(a, b)`, compute a negative factor:
- **Positive**: if the neighbor pair's confidence is above 0.5 (likely match), it contributes to `pos_strength`, weighted by inverse functionality — matching neighbors of a functional relation are strong evidence FOR the match.
- **Negative**: if the neighbor pair's confidence is below 0.5 (likely non-match), it contributes to `neg_strength`, weighted by forward functionality — a functional relation whose target doesn't match is evidence AGAINST the match.

Both are aggregated via exp-sum and combined with the name-similarity seed:

```
neg(a, b) = PRODUCT_{edge r(a, y)} max(
1 - alpha × fun(r) × PRODUCT_{edge r'(b, y')} (1 - Pr(y ≡ y')),
floor
)
pos_agg = 1 - exp(-λ × pos_strength)
neg_agg = 1 - exp(-λ × neg_strength)

seed = name_similarity(a, b)
computed = seed + pos_agg × (1 - seed) - neg_agg × seed
```

Where:
- `alpha < 1` is a dampening coefficient (e.g. 0.3) that weakens the negative signal relative to PARIS's full-strength version
- `floor` (e.g. 0.5) prevents any single missing match from killing the score entirely
- `fun(r)` is forward functionality — only functional relations generate negative evidence
- The inner product checks whether `y` matches *any* of `b`'s neighbors via similar relations
The seed serves as the baseline. Positive evidence pushes toward 1.0 (proportional to the room above seed), negative evidence pushes toward 0.0 (proportional to the seed itself). With no structural evidence, the score equals the seed. With strong negative evidence and no positive evidence, the score approaches zero.
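The combined update can be sketched as follows (λ and the damping factor `d` are illustrative values, not tuned constants):

```python
import math

def damped_update(old, seed, pos_strength, neg_strength, lam=1.0, d=0.5):
    """One propagation step: integrate positive and negative structural
    evidence around the name-similarity seed, then damp toward the old score."""
    pos_agg = 1.0 - math.exp(-lam * pos_strength)
    neg_agg = 1.0 - math.exp(-lam * neg_strength)
    computed = seed + pos_agg * (1.0 - seed) - neg_agg * seed
    return (1.0 - d) * old + d * computed

# No structural evidence: the score stays anchored at the seed.
print(damped_update(old=0.4, seed=0.4, pos_strength=0.0, neg_strength=0.0))

# Strong negative evidence, no positive: the score decays toward zero.
print(round(damped_update(old=0.4, seed=0.4, pos_strength=0.0, neg_strength=3.0), 3))
```

Note that `computed` stays in [0, 1] by construction: positive evidence can claim at most the room above the seed, negative evidence at most the seed itself.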

The dampening addresses the incompleteness problem: even with high forward functionality and no matching target, the penalty is at most `(1 - alpha × 1.0)` per path, clamped to `floor`.
### The 0.5 threshold as a natural gate

### When to apply
The threshold for contributing positive vs negative evidence is 0.5 — the point of maximum uncertainty. A neighbor pair with confidence 0.6 contributes weak positive evidence. One with confidence 0.1 contributes strong negative evidence. One at exactly 0.5 contributes nothing.

Negative evidence should activate only when there is already positive evidence to temper. If a pair has near-zero positive similarity, negative evidence is irrelevant. Apply as:
This replaces the separate "gate" mechanism from the dual-channel design. There is no need for a separate activation threshold — the 0.5 boundary naturally ensures that negative evidence only affects pairs whose neighbors have meaningful non-match signal.

```
final(a, b) = positive(a, b) × neg(a, b) if positive(a, b) > gate
positive(a, b) otherwise
```
### Self-correcting dynamics

Unlike PARIS's one-shot negative factor, our approach is iterative and self-correcting. Consider two entities whose CEO neighbors initially have low name similarity (0.35). In early iterations the CEO pair's confidence sits below 0.5, so it generates negative evidence for the parent entities. But if the CEO pair has its own structural evidence (e.g. both CEOs graduated from the same university), its confidence rises across iterations. Once it crosses 0.5, it switches from generating negative evidence to generating positive evidence. The damped iteration converges to a consistent assignment.
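A toy simulation of this flip. All numbers are hypothetical: the CEO pair's independent structural support (strength 1.2), the seeds, and the damping `d = 0.2` are chosen only to make the dynamics visible:

```python
import math

def simulate(iters=30, lam=1.0, d=0.2):
    """Two coupled pairs: the parent pair's only structural signal is the CEO
    pair, which contributes negative evidence while below 0.5 and positive
    evidence once above it. The CEO pair has independent support (same
    university), so its confidence rises regardless of the parent."""
    seed = {"ceo": 0.35, "parent": 0.6}
    conf = dict(seed)
    history = []
    for _ in range(iters):
        # CEO pair: fixed positive structural support pulls it above its seed.
        ceo_computed = seed["ceo"] + (1 - math.exp(-lam * 1.2)) * (1 - seed["ceo"])
        # Parent pair: the sign of the CEO contribution depends on 0.5.
        if conf["ceo"] > 0.5:
            par_computed = seed["parent"] + (1 - math.exp(-lam * conf["ceo"])) * (1 - seed["parent"])
        else:
            par_computed = seed["parent"] - (1 - math.exp(-lam * (1 - conf["ceo"]))) * seed["parent"]
        conf["ceo"] = (1 - d) * conf["ceo"] + d * ceo_computed
        conf["parent"] = (1 - d) * conf["parent"] + d * par_computed
        history.append(conf["parent"])
    return conf, history

conf, history = simulate()
# The parent score first dips below its seed (negative evidence from the
# unmatched CEO pair), then recovers above it once the CEO pair crosses 0.5.
print(round(min(history), 3), round(history[-1], 3))
```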

The gate (e.g. 0.3) ensures negative evidence only modulates pairs that are already plausible matches. This prevents wasting computation on the vast majority of pairs that will never match.
This dynamic is impossible with the dual-channel monotone approach, where negative evidence is fixed at seed values to prevent circular reinforcement. Damped iteration allows circular reinforcement, bounded by the contraction property — feedback loops shrink geometrically rather than exploding.

### Convergence implications
### Convergence

Multiplying by a negative factor makes the update non-monotone — a pair's score can decrease between iterations. This breaks FLORA's strict monotone convergence guarantee. The dampening coefficient (`alpha`) and per-path floor bound the magnitude of any single negative adjustment, and scores can never drop below `floor^k` (where k is the number of edges), so they can't collapse to zero. This makes practical convergence likely but not formally guaranteed — oscillation across pairs in circular dependency chains is theoretically possible. If convergence issues arise, `alpha` and `floor` are the tuning knobs.
The damped update `new = (1-α) × old + α × computed` ensures convergence for sparse graphs (see [similarity_flooding.md](similarity_flooding.md) for the full convergence analysis). Negative evidence does not require special treatment — it is part of the same contraction mapping. Each iteration brings the score vector closer to the unique fixed point regardless of whether individual scores go up or down.

### What negative evidence does NOT replace

Expand All @@ -114,3 +117,4 @@ Negative evidence interacts with several other components:

- Suchanek, Abiteboul, Senellart. *PARIS: Probabilistic Alignment of Relations, Instances, and Schema.* VLDB 2011. Section 4 (Equations 4-7), Section 6.3 (experimental evaluation of negative evidence).
- Peng, Bonald, Suchanek. *FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.* 2025. Definition 1 (no-negation constraint), Theorem 1 (convergence requires monotonicity).
- Lizorkin, Velikhov, Grinev, Turdakov. *Accuracy Estimate and Optimization Techniques for SimRank Computation.* PVLDB 2008. (Contraction convergence proof for iterative graph similarity with decay factor.)
53 changes: 29 additions & 24 deletions docs/progressive_merging.md
@@ -38,31 +38,38 @@ This is relevant because it suggests two fundamentally different strategies:

Strategy 2 avoids the convergence issues of progressive merging but misses the enriched-neighborhood benefit.

## Our approach: epoch-based progressive merging
## Our approach: progressive merging within damped iteration

Neither the literature nor standard fixpoint theory provides a clean answer for progressive merging during propagation. We propose a hybrid that preserves most of the convergence properties while gaining the neighborhood enrichment benefit.
We use a single-loop design where propagation runs with damped updates (positive and negative evidence integrated into each step), and merges are committed when the iteration converges. Merged neighborhoods then compound structural evidence in subsequent iterations.

### The mechanism

Divide propagation into **epochs**. Within each epoch, run standard propagation (monotone non-decreasing, convergence guaranteed). Between epochs, commit matches and merge:

```
for epoch in range(max_epochs):
# Phase 1: Standard propagation within the epoch
confidence = propagate_to_convergence(graph, confidence)

# Phase 2: Commit high-confidence merges
for iteration in range(max_iter):
# Damped propagation step (positive + negative in one pass)
for each pair (a, b):
computed = seed + pos_agg(neighbors) × (1 - seed) - neg_agg(neighbors) × seed
confidence(a, b) = (1 - α) × old + α × computed
if not converged:
continue

# Commit high-confidence merges
new_merges = find_merges(confidence, threshold=merge_threshold)
if not new_merges:
break # No new merges → global convergence

# Phase 3: Merge entities in the graph
graph = apply_merges(graph, new_merges)
break # Converged, no new merges → done

# Phase 4: Re-seed confidence for the merged graph
confidence = reseed(graph, confidence, new_merges)
# Update adjacency incrementally, remap pairs/confidence
for a, b in new_merges:
uf.union(a, b)
canonical_adj[uf.find(a)] = dedup(adj[a] + adj[b])
confidence, pairs = remap_to_canonical(confidence, pairs, uf)
```

Key properties:
- **No separate phases**: positive and negative evidence are computed together in each step, not sequentially. See [negative_evidence.md](negative_evidence.md) for how this works.
- **No reseeding**: the damped update naturally anchors to the seed — there is no compounding dampening effect that requires periodic reseeding.
- **Incremental adjacency**: maintaining a `canonical_adj` alongside the UnionFind avoids rebuilding adjacency from scratch each cycle. Each merge costs O(degree), not O(|edges|).
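A sketch of the incremental adjacency update. The entity ids and neighborhoods are hypothetical; the real structures live in `match.py`:

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
        return self.find(a)

def merge(uf, canonical_adj, a, b):
    """Commit a merge: union the entities and combine their neighborhoods.
    Cost is O(deg(a) + deg(b)), independent of total edge count."""
    ra, rb = uf.find(a), uf.find(b)
    if ra == rb:
        return ra
    root = uf.union(ra, rb)
    merged = canonical_adj.pop(ra, set()) | canonical_adj.pop(rb, set())
    merged.discard(a)
    merged.discard(b)  # drop self-loops created by the merge
    canonical_adj[root] = merged
    return root

uf = UnionFind()
adj = {"acme_1": {"ceo_x", "city_y"}, "acme_2": {"ceo_x", "product_z"}}
root = merge(uf, adj, "acme_1", "acme_2")
print(sorted(adj[root]))  # union of both neighborhoods, deduplicated
```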

### What merging means concretely

When entities `a` and `b` are merged into entity `ab`:
@@ -94,25 +94,23 @@ The merge threshold should be conservative. A false merge during propagation is

### Convergence properties

Within each epoch, propagation converges normally (monotone non-decreasing on a finite lattice). Between epochs, merging changes the graph structure, so the overall process is not a standard fixpoint iteration.
Within each convergence cycle, the damped iteration converges via the contraction mapping property (see [similarity_flooding.md](similarity_flooding.md)). Between cycles, merging changes the graph structure, so the overall process is not a single contraction mapping.

However, the process is still well-behaved:
However, the process is well-behaved:
1. **Merges are irreversible**: once committed, entities stay merged. The set of merged entities grows monotonically.
2. **The graph shrinks**: each merge reduces the entity count by one. The process must terminate in at most N-1 merge steps.
3. **Confidence is non-decreasing across epochs**: `max(confidence(a, x), confidence(b, x)) >= confidence(a, x)` for all `x`.
4. **Termination**: if no epoch produces new merges, the process halts.
3. **Within-cycle convergence**: each cycle converges to the unique fixed point of the current graph's contraction mapping. The fixed point changes when merges alter the graph structure.
4. **Termination**: if no cycle produces new merges, the process halts.

This is not a formal convergence guarantee in the FLORA sense (no Knaster-Tarski applies to the cross-epoch dynamics). But the monotonic reduction in entity count provides a strong termination guarantee, and the conservative merge threshold limits cascade risk.
The conservative merge threshold (0.9) limits cascade risk: only very high-confidence pairs are merged, and enriched neighborhoods from those merges are unlikely to create false matches above the same threshold.

### Interaction with negative evidence

Progressive merging and [negative evidence](negative_evidence.md) interact in two ways:

1. **Enriched neighborhoods improve negative evidence quality.** After merging, a combined entity has more edges, which means more opportunities for both positive AND negative evidence. A false match candidate that survived against sparse individual neighborhoods may fail against the richer merged neighborhood.

2. **Negative evidence prevents false progressive merges.** If negative evidence runs within each epoch (before the merge step), it can suppress pairs that had high positive scores but contradictory functional relations. This is a safety mechanism against the cascade risk of progressive merging.

The recommended order within each epoch: propagate positive evidence → apply negative dampening → commit merges above threshold.
2. **Negative evidence prevents false progressive merges.** Because negative evidence is integrated into each propagation step (not applied post-hoc), it suppresses pairs with contradictory functional relations before they ever reach the merge threshold. This is a natural safety mechanism against cascade risk.

## What progressive merging does NOT solve

Expand All @@ -123,5 +128,5 @@ The recommended order within each epoch: propagate positive evidence → apply n
## References

- Suchanek, Abiteboul, Senellart. *PARIS: Probabilistic Alignment of Relations, Instances, and Schema.* VLDB 2011. Section 5.2 (maximal assignment as soft progressive commitment).
- Peng, Bonald, Suchanek. *FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.* 2025. Theorem 1 (convergence requires monotonicity — why merging during propagation is problematic).
- Peng, Bonald, Suchanek. *FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic.* 2025. Theorem 1 (convergence requires monotonicity — contrast with our contraction-based approach).
- Liao, Sabetiansfahani, Bhatt, Ben-Hur. *IsoRankN: Spectral Methods for Global Alignment of Multiple Protein Networks.* Bioinformatics 2009. Sections 2.2-2.4 (star spread as alternative to progressive merging).