Smarter relation deduplication: fuzzy pre-filter + LLM semantic validation #38

@renaudcepre

Problem

Current deduplication uses fuzz.ratio >= 75, which produces both false positives (semantically opposite relations can match) and false negatives (semantically equivalent relations with different wording don't match).

Example: "ally in the quest" + "companion and fellow traveler" → same relationship, but the wording barely overlaps, so fuzzy matching won't catch it reliably.

Proposed architecture

Three-stage pipeline:

1. fuzz.ratio >= ~85-90  → fast pre-filter (raise threshold from 75)
2. LLM semantic check    → confirm or reject the candidate merge
3. user clarification    → only if LLM returns "unsure"
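
A minimal sketch of the three stages above, to make the control flow concrete. All names here (`Verdict`, `dedupe_decision`, `llm_check`, `ask_user`, `stdlib_ratio`) are hypothetical and not taken from the codebase. The default scorer uses `difflib` as a stand-in so the sketch runs on the standard library alone, and it assumes that pairs below the pre-filter threshold are simply kept as distinct.

```python
from difflib import SequenceMatcher
from enum import Enum
from typing import Callable


class Verdict(str, Enum):
    MERGE = "merge"
    KEEP_BOTH = "keep_both"
    UNSURE = "unsure"


def stdlib_ratio(a: str, b: str) -> float:
    """Stand-in scorer (assumption): 0-100 similarity computed with difflib."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe_decision(
    existing: str,
    candidate: str,
    llm_check: Callable[[str, str], Verdict],
    ask_user: Callable[[str, str], Verdict],
    scorer: Callable[[str, str], float] = stdlib_ratio,
    prefilter_threshold: float = 85.0,  # assumed value from the ~85-90 range
) -> Verdict:
    # Stage 1: cheap pre-filter. Below the threshold the pair is not treated
    # as a merge candidate, so the LLM is never called.
    if scorer(existing, candidate) < prefilter_threshold:
        return Verdict.KEEP_BOTH

    # Stage 2: the LLM confirms or rejects the candidate merge.
    verdict = llm_check(existing, candidate)

    # Stage 3: only an "unsure" answer escalates to the user.
    if verdict is Verdict.UNSURE:
        return ask_user(existing, candidate)
    return verdict
```

The scorer is injected so the existing fuzz.ratio call can be passed in unchanged; `llm_check` and `ask_user` stand for whatever the project already uses for the LLM call and the user clarification step.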

LLM validation prompt

The LLM should receive full context, not just the relation strings:

Character A: Gimli
Character B: Legolas

Existing relation in DB: "companion met during the Council of Elrond"
New candidate relation: "ally forged through shared battle"

Current profile of Gimli: <arc, background>

Are these describing the same relationship, or distinct aspects worth keeping both?
→ merge / keep_both / unsure

Without context, the LLM judges words. With context, it judges meaning — "ally" and "companion" can be the same for Frodo/Sam but distinct for Aragorn/Boromir.
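
One possible way to assemble that prompt and read back the answer, as a sketch only. `RelationCheck`, `build_prompt` and `parse_verdict` are hypothetical names, and treating any unrecognized reply as "unsure" is an assumption, chosen so that a malformed answer escalates to the user instead of silently merging.

```python
from dataclasses import dataclass

VALID_VERDICTS = ("merge", "keep_both", "unsure")


@dataclass
class RelationCheck:
    character_a: str
    character_b: str
    existing_relation: str
    candidate_relation: str
    profile_a: str  # arc and background of character A


def build_prompt(check: RelationCheck) -> str:
    """Render the full-context prompt described above."""
    return (
        f"Character A: {check.character_a}\n"
        f"Character B: {check.character_b}\n"
        "\n"
        f'Existing relation in DB: "{check.existing_relation}"\n'
        f'New candidate relation: "{check.candidate_relation}"\n'
        "\n"
        f"Current profile of {check.character_a}: {check.profile_a}\n"
        "\n"
        "Are these describing the same relationship, "
        "or distinct aspects worth keeping both?\n"
        "Answer with exactly one of: merge / keep_both / unsure"
    )


def parse_verdict(raw: str) -> str:
    """Normalize the LLM reply; anything unexpected falls back to 'unsure'."""
    answer = raw.strip().lower()
    return answer if answer in VALID_VERDICTS else "unsure"
```

Usage with the example from this issue:

```python
check = RelationCheck(
    character_a="Gimli",
    character_b="Legolas",
    existing_relation="companion met during the Council of Elrond",
    candidate_relation="ally forged through shared battle",
    profile_a="<arc, background>",
)
prompt = build_prompt(check)
```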

Notes

  • Keep fuzzy as a cheap pre-filter, not the decision maker
  • LLM call only triggered when fuzzy score is in ambiguous range (~70-90)
  • unsure escalates to user clarification (existing mechanism)
  • Raising the fuzzy threshold alone would already reduce noise
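
The ambiguous-range note could be expressed as a small routing helper. This is a sketch under assumptions: `Route` and `route_by_score` are hypothetical names, the 70 and 90 bounds are taken from the approximate range above and treated as inclusive, and what happens outside the band is deliberately left to the caller since it is not decided here.

```python
from enum import Enum


class Route(Enum):
    BELOW_BAND = "below_band"  # score too low to be a merge candidate
    LLM_CHECK = "llm_check"    # ambiguous: ask the LLM
    ABOVE_BAND = "above_band"  # very high score; handling left to the caller


def route_by_score(score: float, low: float = 70.0, high: float = 90.0) -> Route:
    """Classify a 0-100 fuzzy score relative to the ambiguous band [low, high]."""
    if score < low:
        return Route.BELOW_BAND
    if score > high:
        return Route.ABOVE_BAND
    return Route.LLM_CHECK
```

Keeping the band as parameters makes it easy to tune the bounds once real score distributions are available.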
