Smarter relation deduplication: fuzzy pre-filter + LLM semantic validation #38
Open
Description
Problem
Current deduplication uses fuzz.ratio >= 75, which produces both false positives (semantically opposite relations can match) and false negatives (semantically equivalent relations with different wording don't match).
Example: "ally in the quest" + "companion and fellow traveler" → these describe the same relationship, but fuzzy matching won't catch it reliably.
Proposed architecture
Three-stage pipeline:
1. fuzz.ratio >= ~85-90 → fast pre-filter (raise threshold from 75)
2. LLM semantic check → confirm or reject the candidate merge
3. user clarification → only if LLM returns "unsure"
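A minimal sketch of how the three stages could be wired together, assuming the LLM check and the user clarification step are passed in as callables. The names `similarity`, `dedup_decision`, `llm_check`, and `ask_user` are hypothetical, and `difflib.SequenceMatcher` is used only as a standard-library stand-in for `fuzz.ratio` so the example runs without extra dependencies.

```python
from difflib import SequenceMatcher
from typing import Callable

# Verdicts the LLM stage is expected to return (taken from the proposal).
VERDICTS = {"merge", "keep_both", "unsure"}

# Hypothetical callable type: (existing, candidate) -> verdict string.
Judge = Callable[[str, str], str]


def similarity(a: str, b: str) -> float:
    """Stand-in for fuzz.ratio: a 0-100 similarity score."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedup_decision(
    existing: str,
    candidate: str,
    llm_check: Judge,
    ask_user: Judge,
    threshold: float = 85.0,  # assumed value from the proposed ~85-90 range
) -> str:
    # Stage 1: cheap pre-filter. Below the threshold there is no
    # candidate merge, so both relations are kept without an LLM call.
    if similarity(existing, candidate) < threshold:
        return "keep_both"

    # Stage 2: the LLM confirms or rejects the candidate merge.
    verdict = llm_check(existing, candidate)
    if verdict not in VERDICTS:
        verdict = "unsure"  # treat unexpected replies as "unsure"

    # Stage 3: escalate to the user only when the LLM is unsure.
    if verdict == "unsure":
        return ask_user(existing, candidate)
    return verdict
```

In this sketch the fuzzy score only decides whether the more expensive stages run at all; the merge decision itself always comes from the LLM or the user.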
LLM validation prompt
The LLM should receive full context, not just the relation strings:
Character A: Gimli
Character B: Legolas
Existing relation in DB: "companion met during the Council of Elrond"
New candidate relation: "ally forged through shared battle"
Current profile of Gimli: <arc, background>
Are these describing the same relationship, or distinct aspects worth keeping both?
→ merge / keep_both / unsure
Without context, the LLM judges words. With context, it judges meaning — "ally" and "companion" can be the same for Frodo/Sam but distinct for Aragorn/Boromir.
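The prompt above could be assembled and its reply parsed roughly as follows. This is a sketch under assumptions: `build_validation_prompt` and `parse_verdict` are hypothetical helper names, the profile is passed as a plain string, and the actual LLM client call is left out because the issue does not specify one.

```python
ALLOWED_VERDICTS = ("merge", "keep_both", "unsure")


def build_validation_prompt(
    char_a: str,
    char_b: str,
    existing_relation: str,
    candidate_relation: str,
    profile_a: str,
) -> str:
    """Assemble the full-context prompt described in the proposal."""
    return (
        f"Character A: {char_a}\n"
        f"Character B: {char_b}\n"
        f'Existing relation in DB: "{existing_relation}"\n'
        f'New candidate relation: "{candidate_relation}"\n'
        f"Current profile of {char_a}: {profile_a}\n"
        "Are these describing the same relationship, "
        "or distinct aspects worth keeping both?\n"
        "Answer with exactly one of: merge / keep_both / unsure"
    )


def parse_verdict(reply: str) -> str:
    """Normalize the LLM reply; anything unexpected falls back to 'unsure'."""
    token = reply.strip().lower()
    return token if token in ALLOWED_VERDICTS else "unsure"


prompt = build_validation_prompt(
    "Gimli",
    "Legolas",
    "companion met during the Council of Elrond",
    "ally forged through shared battle",
    "<arc, background>",
)
```

Falling back to "unsure" on an unparseable reply keeps the failure mode conservative: the pair goes to user clarification instead of being merged silently.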
Notes
- Keep fuzzy as a cheap pre-filter, not the decision maker
- LLM call only triggered when fuzzy score is in ambiguous range (~70-90)
- "unsure" escalates to user clarification (existing mechanism)
- Raising the fuzzy threshold alone would already reduce noise