feat: hybrid evidence engine — word-number amounts, time/id matching,…#1
Merged
… guarded LLM case_type
Evidence reasoning (the 35-pt core):
- normalize: parse word-number amounts ("five thousand", "panch hajar",
"দুই হাজার"), scale words (5k / ২ লাখ), time-of-day + clock hour, and
referenced merchant/agent/biller ids
- matcher: disambiguate same-amount transfers by time-of-day / id; still
returns null (insufficient_data) when nothing cleanly separates candidates
- classifier: confidence-aware classify(); tighter phishing precision
(strong vs weak signals); broader Banglish/English keyword coverage
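
As a rough illustration of the amount normalization described above, here is a minimal sketch. The vocabularies, the regex, and the function name `extract_amounts` are assumptions made for this example; the PR's actual `normalize.py` word lists and API are not shown on this page.

```python
from __future__ import annotations

import re

# Hypothetical vocabularies; the real word lists in the PR may differ.
_UNITS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "ek": 1, "dui": 2, "tin": 3, "char": 4, "panch": 5,
    "দুই": 2,
}
_SCALES = {
    "k": 1_000, "thousand": 1_000, "hajar": 1_000, "হাজার": 1_000,
    "lakh": 100_000, "লাখ": 100_000,
}
# Map Bangla digits to ASCII so one numeric regex covers both scripts.
_BN_DIGITS = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")
# A number followed by a scale word, e.g. "5k" or "2 lakh".
_DIGIT_SCALE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(k|thousand|hajar|হাজার|lakh|লাখ)(?![a-z])"
)


def extract_amounts(text: str) -> list[int]:
    """Return amounts written as digit+scale or as word+scale pairs."""
    text = text.lower().translate(_BN_DIGITS)
    amounts = [round(float(n) * _SCALES[s]) for n, s in _DIGIT_SCALE.findall(text)]
    # Word-number pairs such as "five thousand" or "panch hajar".
    tokens = [t.strip(".,!?\"'") for t in text.split()]
    for unit, scale in zip(tokens, tokens[1:]):
        if unit in _UNITS and scale in _SCALES:
            amounts.append(_UNITS[unit] * _SCALES[scale])
    return amounts
```

The sketch only handles single unit+scale pairs; compound phrases such as "two lakh five thousand" would need an accumulator and are left out.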
Guarded LLM case_type reclassification (hybrid tail):
- app/llm/classify.py: LLM picks case_type from the fixed enum ONLY for
low-confidence cases with a concrete signal; validated against the enum,
confidence-capped, falls back to the rule answer on any error
- never touches relevant_transaction_id / verdict / severity / safety;
does not fire on vague no-signal complaints (stay 'other')
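
The enum validation and fallback behaviour described above might look roughly like the following sketch. The `CaseType` enum here lists only the two members that appear on this page (the real enum presumably has more), and `guarded_case_type` and `ask_llm` are hypothetical names, not the API of `app/llm/classify.py`. Confidence capping is not modeled.

```python
from enum import Enum
from typing import Callable


class CaseType(str, Enum):
    # Only the two members visible in this PR page; the real enum is larger.
    phishing_or_social_engineering = "phishing_or_social_engineering"
    other = "other"


def guarded_case_type(
    rule_answer: CaseType,
    ask_llm: Callable[[str], str],
    complaint: str,
) -> CaseType:
    """Accept the LLM label only if it is a valid enum value.

    Any exception (timeout, network error, unknown label) falls back to the
    deterministic rule answer.
    """
    try:
        raw = ask_llm(complaint)
        return CaseType(raw.strip().lower())  # ValueError if not in the enum
    except Exception:
        return rule_answer
```

Keeping the fallback inside one `try` block means an off-enum answer and a transport failure are handled identically, so the rule answer is never replaced by anything unvalidated.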
Safety: expand the unauthorized-promise denylist (conversational phrasings)
Observability: per-decision audit-trail log + case_type/verdict counters
Docs: rewrite README (requirement→solution table, decision/path notes,
$0-cost story, colorful Mermaid diagrams); fix stale test counts; PRD §4.1
now matches the implemented matcher
Tests: 102 passing (was 82) incl. word-amount, time/id disambiguation, and
monkeypatched LLM-reclassification guardrail tests; 10/10 samples exact with
LLM on and off
Pull request overview
This PR extends the deterministic evidence engine to better ground complaints to real transactions by extracting richer complaint signals (scaled/word amounts, time-of-day, referenced ids) and using those signals to disambiguate same-amount candidates. It also adds confidence-aware classification plus an optional, rules-gated LLM case_type reclassification path, along with expanded safety promise detection, observability counters/audit logging, updated docs, and new tests.
Changes:
- Add normalization for scaled + word-number amounts, time/day hints, and merchant/agent/biller id references; use id/time signals to disambiguate same-amount matches.
- Introduce confidence-aware deterministic classification and an optional LLM case_type reclassification helper (enum-validated + timeout-capped), with corresponding tests.
- Expand safety unauthorized-promise patterns; add metrics + audit-trail logging; refresh README/PRD and update test counts.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `app/engine/normalize.py` | Adds extraction of scaled/word amounts, time signals, and referenced ids into Normalized. |
| `app/engine/matcher.py` | Adds id/time-based disambiguation logic and allows id as a fallback candidate signal. |
| `app/engine/classifier.py` | Adds classify() returning (CaseType, confidence) and refines keyword groups. |
| `app/llm/classify.py` | New: enum-validated LLM case_type classifier used for low-confidence cases. |
| `app/pipeline.py` | Integrates confidence-aware classify + optional LLM reclassification; adds counters and audit-trail logging. |
| `app/engine/safety.py` | Expands unauthorized promise denylist patterns. |
| `app/config.py` | Adds settings knobs for LLM case_type reclassification thresholds/timeouts. |
| `.env.example` | Documents new env vars for LLM reclassification configuration. |
| `tests/test_matcher.py` | Adds tests for scaled/word amounts, time extraction, and id/time disambiguation behavior. |
| `tests/test_classifier.py` | Adds tests validating confidence levels for strong/weak signals. |
| `tests/test_llm_reclassify.py` | New: offline monkeypatched tests for the LLM reclassification guardrails. |
| `README.md` | Updates architecture/docs, test counts, and adds diagrams and requirement→solution mapping. |
| `prd.md` | Updates matcher specification to reflect the implemented id/time disambiguation behavior. |
Comment on lines +88 to +95

    scored = [(t, _hour_distance(h, norm.time_hour)) for t in candidates if (h := _txn_hour(t)) is not None]
    if scored:
        scored.sort(key=lambda x: x[1])
        best_t, best_d = scored[0]
        second_d = scored[1][1] if len(scored) > 1 else 99
        if best_d <= 2 and second_d - best_d >= 3:
            flags.append("time_match")
            return best_t
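
To make the thresholds in the snippet above easier to reason about, here is a self-contained sketch of the same selection rule. It assumes `_hour_distance` is a circular distance on a 24-hour clock, which the page does not confirm; `hour_distance` and `pick_by_hour` are hypothetical names, and candidates are simplified to `(txn_id, hour)` pairs.

```python
from __future__ import annotations

from typing import Optional


def hour_distance(a: int, b: int) -> int:
    """Circular distance between two clock hours (assumed behaviour)."""
    d = abs(a - b) % 24
    return min(d, 24 - d)


def pick_by_hour(candidates: list[tuple[str, int]], complaint_hour: int) -> Optional[str]:
    """Return the id of the one candidate that is clearly closest in time.

    Mirrors the snippet's rule: the best candidate must be within 2 hours,
    and the runner-up must be at least 3 hours further away.
    """
    scored = sorted((hour_distance(h, complaint_hour), txn_id) for txn_id, h in candidates)
    if not scored:
        return None
    best_d, best_id = scored[0]
    second_d = scored[1][0] if len(scored) > 1 else 99
    if best_d <= 2 and second_d - best_d >= 3:
        return best_id
    return None  # nothing cleanly separates the candidates
```

With candidates at 09:00 and 21:00 and a complaint mentioning 10:00, the first is selected; with candidates at 09:00 and 11:00 the distances tie and the function returns `None`, matching the "insufficient data" behaviour described in the PR summary.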
Comment on lines +312 to +315

    # Merchant / agent / biller identifiers referenced in the complaint text, e.g.
    # "MERCHANT-7821", "AGENT-318", "BILLER-DESCO". Used as a counterparty signal.
    _ID_TOKEN = re.compile(r"\b((?:MERCHANT|AGENT|BILLER|TXN)[-_][A-Z0-9]+)\b", re.IGNORECASE)
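
A short sketch of how this pattern behaves on sample text. The regex is copied from the snippet; the `extract_ids` wrapper and its normalization (upper-casing, `_` to `-`) are assumptions for illustration, loosely modeled on the counterparty normalization shown elsewhere in this review.

```python
from __future__ import annotations

import re

# Pattern copied from the reviewed snippet.
_ID_TOKEN = re.compile(r"\b((?:MERCHANT|AGENT|BILLER|TXN)[-_][A-Z0-9]+)\b", re.IGNORECASE)


def extract_ids(text: str) -> list[str]:
    """Find referenced ids and normalize them to UPPER-HYPHEN form (hypothetical helper)."""
    return [m.upper().replace("_", "-") for m in _ID_TOKEN.findall(text)]


ids = extract_ids("Paid MERCHANT-7821 via agent_318, bill BILLER-DESCO.")
```

Because of `re.IGNORECASE`, lower-case mentions such as `agent_318` are captured too, so some normalization step is needed before comparing against stored counterparty values.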
Comment on lines +67 to +68

    cp = t.counterparty.upper().replace("_", "-")
    return any(cp == i or i in cp or cp in i for i in ids)
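
One property of the substring test above that may be worth checking: a shorter id is contained in a longer one, so `AGENT-3` would match a counterparty of `AGENT-318`. The sketch below reproduces the snippet's logic next to a stricter equality check for comparison. Both function names are hypothetical, and it assumes the ids were already upper-cased and hyphen-normalized.

```python
from __future__ import annotations


def loose_match(counterparty: str, ids: set[str]) -> bool:
    """Same logic as the reviewed snippet: equality or substring either way."""
    cp = counterparty.upper().replace("_", "-")
    return any(cp == i or i in cp or cp in i for i in ids)


def exact_match(counterparty: str, ids: set[str]) -> bool:
    """Stricter alternative: normalized equality only."""
    cp = counterparty.upper().replace("_", "-")
    return cp in ids
```

Whether the looser behaviour is intended depends on how counterparties are stored (for example, with extra prefixes or suffixes), which this page does not show.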
Comment on lines +156 to +159

    "signals": {
        "amounts": norm.amounts,
        "phones": norm.phones,
        "ids": norm.ids,
Comment on lines +63 to 82

    case_type, rule_conf = classifier.classify(norm, req.user_type)

    # Optional LLM reclassification — case_type ONLY, low-confidence cases only,
    # and only when there is a concrete signal (so genuinely vague complaints stay
    # 'other'). Never touches the transaction match, verdict, severity or safety.
    llm_reclassified = False
    if (
        settings.llm_active
        and settings.llm_classify_enabled
        and rule_conf <= settings.classify_confidence_threshold
        and (norm.amounts or norm.phones or norm.ids
             or case_type == CaseType.phishing_or_social_engineering)
    ):
        suggested = await llm_classify_case_type(safe_complaint, settings)
        if suggested is not None and suggested != case_type:
            metrics.incr("classify.llm_override")
            case_type = suggested
            llm_reclassified = True

    match = match_transaction(norm, txns, case_type)
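
The gate in the snippet above can be isolated as a pure predicate, which makes the four conditions easy to test without an LLM. This is a sketch only: `Settings` and `Signals` are hypothetical stand-ins for the project's settings object and `Normalized`, and the default threshold of `0.5` is an assumed value, not one taken from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class Settings:
    # Hypothetical stand-in; field names follow the snippet, defaults are assumed.
    llm_active: bool = True
    llm_classify_enabled: bool = True
    classify_confidence_threshold: float = 0.5


@dataclass
class Signals:
    # Hypothetical stand-in for the Normalized fields used by the gate.
    amounts: list = field(default_factory=list)
    phones: list = field(default_factory=list)
    ids: list = field(default_factory=list)


def should_reclassify(settings: Settings, rule_conf: float, signals: Signals, is_phishing: bool) -> bool:
    """True only for low-confidence cases that carry a concrete signal."""
    has_signal = bool(signals.amounts or signals.phones or signals.ids or is_phishing)
    return (
        settings.llm_active
        and settings.llm_classify_enabled
        and rule_conf <= settings.classify_confidence_threshold
        and has_signal
    )
```

A vague complaint with no amounts, phones, or ids fails the `has_signal` check regardless of confidence, which is how it stays `other` without an LLM call.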