AuvroIslam/SustHackathonPreli

QueueStorm Investigator

An AI/API SupportOps copilot for a digital-finance platform (SUST Codex Hackathon — Preliminary). It reads one customer complaint plus a short transaction-history snippet, investigates what actually happened (the complaint may contradict the data), classifies and routes the case, and drafts a safe reply — never asking for credentials, never promising unauthorized financial action.

  • Endpoints: GET /health, POST /analyze-ticket (+ GET /metrics)
  • Stack: Python 3.12 · FastAPI · Pydantic v2 · httpx · (optional) Redis · Docker + Caddy
  • Design: rules-first — a deterministic engine decides every scored field; a free LLM (Groq) only writes the text fields, behind a code-level safety sanitizer with template fallback.

Status: 103/103 unit tests pass; 10/10 public sample cases correct end-to-end (every scored field — relevant_transaction_id, evidence_verdict, case_type, department, severity, human_review_required — matches the expected output exactly); image ≈ 239 MB; p95 latency ≈ 3 ms on the template path (≈ 1.2 s for the first cold call).


Why rules-first

The judge scores Evidence Reasoning (35) + Safety (20) + API/Schema (15) = 70/100 on fields that must be exact and consistent. Those are computed by deterministic code, so they are correct, instant, free, and immune to prompt injection. The LLM is used only for the Response Quality (10) text, and even that degrades gracefully to safe templates. This is also the most cost-efficient design: the free tier does the language work, and paid models are a rare fallback.

flowchart LR
    R([Request]) --> V["Validate<br/>400 / 422"]
    V --> SI["Strip injected<br/>instructions"]
    SI --> N["Normalize<br/>amount · phone · id · time · lang"]
    N --> RU["RULES — scored fields<br/>match · verdict · case_type<br/>routing · severity · escalation"]
    RU --> GATE{"rule<br/>confident?"}
    GATE -->|"yes · 80–90%"| TX
    GATE -->|"low conf + concrete signal"| LLMC["LLM picks case_type<br/>enum-validated · capped"]
    LLMC --> TX["TEXT fields only<br/>Groq → DeepSeek → templates"]
    TX --> SA["SAFETY sanitizer<br/>runs on every reply"]
    SA --> O([200 JSON])

    classDef io fill:#0f766e,stroke:#5eead4,stroke-width:2px,color:#ecfeff
    classDef pre fill:#1d4ed8,stroke:#93c5fd,stroke-width:2px,color:#eff6ff
    classDef rules fill:#15803d,stroke:#86efac,stroke-width:2px,color:#f0fdf4
    classDef gate fill:#b45309,stroke:#fcd34d,stroke-width:2px,color:#fffbeb
    classDef llm fill:#7e22ce,stroke:#d8b4fe,stroke-width:2px,color:#faf5ff
    classDef safe fill:#b91c1c,stroke:#fca5a5,stroke-width:2px,color:#fef2f2
    class R,O io
    class V,SI,N pre
    class RU rules
    class GATE gate
    class LLMC,TX llm
    class SA safe

The LLM cannot change verdict, transaction match, severity, routing, escalation, or safety. The one place AI reasoning is allowed into a scored field is a narrow, guarded case_type reclassification: when the deterministic keyword classifier is low-confidence (a complaint phrased outside the keyword lists, or a weak-signal phishing mention) and there is a concrete signal (an amount/phone/id), the LLM picks case_type from the fixed enum — validated against the enum, capped in confidence, and falling back to the rule answer on any error. It deliberately does not fire on vague no-signal complaints (those stay other), and it never selects relevant_transaction_id — the matcher decides that deterministically and is free to return null on ambiguity (which the rubric rewards). This is the hybrid sweet spot: rules for the confident 80–90%, LLM only for the genuinely ambiguous tail.

All 10 public sample cases are classified by the rules alone (rule_conf ≥ 0.7 for each); the LLM reclassifier is a safety net for adversarial / novel phrasing the rules haven't seen — exactly the hidden-test cases the judges may add.
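
The guarded reclassification described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `reclassify`, the `llm_vote` callable, and the `PHISHING` enum member are assumptions, while the 0.55 threshold and the 0.7 cap are taken from the Models table in this README.

```python
from enum import Enum
from typing import Callable, Tuple


class CaseType(str, Enum):
    # Only "wrong_transfer" and "other" appear in this README;
    # "phishing" is an illustrative guess at the enum value.
    WRONG_TRANSFER = "wrong_transfer"
    PHISHING = "phishing"
    OTHER = "other"


LOW_CONF_THRESHOLD = 0.55  # rules are "unsure" at or below this value
OVERRIDE_CONF_CAP = 0.7    # confidence is capped when the LLM overrides


def reclassify(
    rule_case: CaseType,
    rule_conf: float,
    has_concrete_signal: bool,
    llm_vote: Callable[[], Tuple[str, float]],  # hypothetical LLM call
) -> Tuple[CaseType, float]:
    """Return (case_type, confidence), letting the LLM vote only in the gated tail."""
    # Gate: confident rules, or vague complaints with no signal, never reach the LLM.
    if rule_conf > LOW_CONF_THRESHOLD or not has_concrete_signal:
        return rule_case, rule_conf
    try:
        label, llm_conf = llm_vote()
        voted = CaseType(label)  # raises ValueError for anything outside the enum
    except Exception:
        # Any error (timeout, bad JSON, unknown label) falls back to the rule answer.
        return rule_case, rule_conf
    return voted, min(llm_conf, OVERRIDE_CONF_CAP)
```

Note that the function never receives the transaction history, so it has no way to influence relevant_transaction_id.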

Who decides what — output-field provenance

Every scored field is owned by deterministic code. The LLM owns only free text, and the one scored field it can influence (case_type) is enum-validated and rules-gated.

flowchart TB
    subgraph G["🟢 Deterministic engine · scored · injection-proof"]
        direction LR
        F1["relevant_transaction_id"]:::r
        F2["evidence_verdict"]:::r
        F3["severity"]:::r
        F4["department"]:::r
        F5["human_review_required"]:::r
        F6["reason_codes · confidence"]:::r
    end
    subgraph H["🟡 Rules-gated, LLM-assisted"]
        F7["case_type<br/>(rule first; LLM only on low confidence,<br/>validated against the enum)"]:::h
    end
    subgraph L["🟣 LLM text → always safety-sanitized"]
        F8["agent_summary"]:::l
        F9["recommended_next_action"]:::l
        F10["customer_reply"]:::l
    end

    classDef r fill:#166534,stroke:#86efac,color:#f0fdf4
    classDef h fill:#a16207,stroke:#fde047,color:#fefce8
    classDef l fill:#6b21a8,stroke:#d8b4fe,color:#faf5ff
    style G fill:#052e16,stroke:#22c55e,color:#dcfce7
    style H fill:#422006,stroke:#eab308,color:#fef9c3
    style L fill:#3b0764,stroke:#a855f7,color:#f3e8ff

Evidence reasoning — how the matcher investigates (the 35-pt core)

The complaint may contradict the data. The matcher grounds the claim in real transactions using layered signals, and returns null rather than guess when nothing cleanly separates candidates (the rubric rewards asking over a wrong dispute).

flowchart TD
    C["Complaint signals<br/>amount · phone · id · time"] --> A{"amount<br/>match?"}
    A -->|"unique"| ONE["single candidate"]:::ok
    A -->|"none"| PH{"phone / id<br/>match?"}
    PH -->|"unique"| ONE
    PH -->|"none"| NULL1["relevant_id = null<br/>insufficient_data"]:::stop
    A -->|"multiple"| DUP{"duplicate pattern?<br/>2× same recipient, secs apart"}
    DUP -->|"yes"| SECOND["pick the 2nd<br/>→ consistent"]:::ok
    DUP -->|"no"| DIS{"time-of-day or id<br/>separates exactly one?"}
    DIS -->|"yes"| ONE
    DIS -->|"no · different recipients"| NULL2["relevant_id = null<br/>insufficient_data<br/>ask to clarify"]:::stop
    ONE --> VERD{"data vs claim"}
    VERD -->|"supports"| CONS["evidence_verdict<br/>= consistent"]:::good
    VERD -->|"contradicts<br/>e.g. established recipient,<br/>status already completed"| INC["evidence_verdict<br/>= inconsistent"]:::warn

    classDef ok fill:#1d4ed8,stroke:#93c5fd,color:#eff6ff
    classDef good fill:#15803d,stroke:#86efac,color:#f0fdf4
    classDef warn fill:#b45309,stroke:#fcd34d,color:#fffbeb
    classDef stop fill:#b91c1c,stroke:#fca5a5,color:#fef2f2
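
The decision flow in the diagram above might look like this in code. It is a simplified sketch under stated assumptions: the `Txn` dataclass, the `match` signature, the note strings, and the 60-second duplicate window are all hypothetical, and the real matcher.py also handles ids and verdicts, which are left out here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class Txn:
    transaction_id: str
    timestamp: datetime
    amount: float
    counterparty: str


def match(
    history: List[Txn],
    amount: Optional[float],
    phone_tail: Optional[str] = None,
    hour: Optional[int] = None,
    dup_window_s: int = 60,  # assumed meaning of "secs apart"
) -> Tuple[Optional[str], str]:
    """Return (relevant_transaction_id or None, short note on why)."""
    same_amount = [t for t in history if amount is not None and t.amount == amount]
    if len(same_amount) == 1:
        return same_amount[0].transaction_id, "unique_amount"

    if not same_amount:
        # No amount match: fall back to the counterparty phone tail.
        by_phone = [t for t in history
                    if phone_tail and t.counterparty.endswith(phone_tail)]
        if len(by_phone) == 1:
            return by_phone[0].transaction_id, "unique_phone"
        return None, "insufficient_data"

    ordered = sorted(same_amount, key=lambda t: t.timestamp)
    # Duplicate pattern: two payments to the same recipient, seconds apart.
    if (len(ordered) == 2
            and ordered[0].counterparty == ordered[1].counterparty
            and (ordered[1].timestamp - ordered[0].timestamp).total_seconds() <= dup_window_s):
        return ordered[1].transaction_id, "duplicate_second"

    # Time-of-day disambiguation: accept only if exactly one candidate fits.
    if hour is not None:
        in_hour = [t for t in ordered if t.timestamp.hour == hour]
        if len(in_hour) == 1:
            return in_hour[0].transaction_id, "time_of_day"

    # Nothing separates the candidates: return null rather than guess.
    return None, "insufficient_data"
```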

How each requirement is solved

A direct map from what the problem statement asks → how this service does it → where it lives in code.

| Requirement | How we solve it | Code |
| --- | --- | --- |
| Pick the right transaction (Evidence, 35 pts) | Layered deterministic matcher: exact amount (digits, Bangla numerals, commas, 5k/২ লাখ, and word-numbers like "five thousand"/"panch hajar") → phone tail → merchant/agent/biller id → time-of-day disambiguation ("around 2pm" → the 14:0x txn). The id is always a real history id or null, never invented. | matcher.py · normalize.py |
| Evidence verdict | Compare the matched transaction against the claim: failed/pending status, established-recipient pattern, already-settled, duplicate pair → consistent / inconsistent / insufficient_data. | matcher.py |
| Don't guess on ambiguity | Multiple same-amount txns to different recipients with no separating signal → return null + a clarifying reply (the rubric rewards asking over a wrong dispute). | matcher.py |
| Classify + route (case_type, department, severity) | Priority-ordered keyword rules first; low-confidence cases get a guarded enum-validated LLM vote on case_type only. | classifier.py · llm/classify.py |
| Escalation (human_review_required) | Deterministic formula validated against all 10 samples (phishing/duplicate/agent always; wrong_transfer w/ id; any inconsistent; high-value). | classifier.py |
| No PIN/OTP request (−15) | Regex denylist blocks requests, allows warnings, and proactively adds "do not share your PIN/OTP". | safety.py |
| No unauthorized refund/reversal (−10) | Blocks promises, substitutes "any eligible amount will be returned through official channels". | safety.py |
| Official channels only (−10) | Strips links, phone numbers, third-party app redirects from replies. | safety.py |
| Prompt injection | Injected instruction-sentences are stripped before analysis; structurally, the rules (not the LLM) decide every scored field. | safety.py · pipeline.py |
| Bangla / Banglish | Numeral + word-number parsing, language detection, reply in the complaint's language, Bangla-native safe templates. | normalize.py · templates.py |
| Malformed / missing input | Tolerant Pydantic input model + controlled 400/422/500 handlers; never crashes, never leaks secrets/stack traces. | schemas.py · main.py |
| Exact schema, echo ticket_id | Strict output Enums + response_model make an invalid shape impossible. | schemas.py |
| Latency < 30 s, p95 target < 5 s | One LLM call per ticket, ~6–8 s hard timeout, concurrency cap, input-hash cache, instant template fast-path on any miss. | gateway.py · pipeline.py |
| Reachable & reproducible | Live HTTPS (Azure VM + Caddy auto-TLS), public Docker image, copy-paste runbook. | RUNBOOK.md |
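
As one example of how a row in the table above translates to code, the escalation formula could be sketched like this. The case-type strings other than wrong_transfer and the high-value threshold are assumptions made for illustration; the README does not state them.

```python
from typing import Optional

# Hypothetical enum values for "phishing/duplicate/agent"; the real names may differ.
ALWAYS_ESCALATE = {"phishing", "duplicate_payment", "agent_issue"}
HIGH_VALUE_BDT = 50_000  # assumed threshold, not taken from the repository


def human_review_required(
    case_type: str,
    evidence_verdict: str,
    relevant_transaction_id: Optional[str],
    amount: Optional[float],
) -> bool:
    """Deterministic escalation: any one of the four conditions is enough."""
    if case_type in ALWAYS_ESCALATE:
        return True
    if case_type == "wrong_transfer" and relevant_transaction_id is not None:
        return True
    if evidence_verdict == "inconsistent":
        return True
    return amount is not None and amount >= HIGH_VALUE_BDT
```

Because the function is pure and takes only already-computed fields, it can be unit-tested against every public sample without touching the LLM.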

Key decisions — the path we took

Each choice below was a deliberate trade-off for this rubric and these constraints (no GPU, <1 GB image, 30 s timeout, free-tier budget):

  • Rules-first, not LLM-first. 70 of 100 points are exact-match fields. Deterministic code makes them correct, instant, free, and injection-proof. An LLM-first design would be slower, costlier, and non-reproducible. → chosen
  • Considered semantic embeddings for classification → rejected. sentence-transformers pulls in PyTorch + weights (≫1 GB) and trips the "no large local model weights / no multi-GB downloads" rule; multilingual models needed for Bangla make it worse. We get the same generalization benefit from a guarded LLM tail at zero image cost. → rejected, replaced
  • Added a low-confidence LLM case_type vote. Pure keywords miss novel phrasing ("money ended up somewhere it shouldn't"). The LLM is invoked only when rules are unsure and a concrete signal exists, picks from the fixed enum, and never selects the transaction (the safe-null behavior is rewarded). → chosen
  • Considered Guardrails AI / Llama Guard → rejected. Llama Guard is an 8B model (breaks the image/GPU limits); Guardrails' re-ask loop adds latency + non-determinism. A deterministic regex sanitizer + template replacement runs sub-millisecond on every reply. → rejected, hand-rolled
  • Three-tier text fallback (Groq → DeepSeek → templates). Guarantees a valid, safe 200 even if every paid/free LLM is down — reliability over flash. → chosen

Documentation

| Doc | Contents |
| --- | --- |
| SYSTEM_DESIGN.md | Architecture, sequence/class/ER UML, reasoning + safety + scaling deep dives (Mermaid diagrams) |
| API_REFERENCE.md | Full request/response schema, enums, status codes, curl examples |
| DEPLOY_AZURE.md | Beginner click-by-click Azure (Student) deployment guide |
| RUNBOOK.md | Copy-paste local Docker + Azure VM (CLI) steps |
| prd.md | Product requirements & design decisions |
| sample_outputs/ | Generated responses for all 10 public sample cases |

API

GET /health → 200

{"status": "ok"}

POST /analyze-ticket

Request (required ticket_id, complaint; the rest optional):

{
  "ticket_id": "TKT-001",
  "complaint": "I sent 5000 taka to a wrong number around 2pm today...",
  "language": "en",
  "channel": "in_app_chat",
  "user_type": "customer",
  "transaction_history": [
    {"transaction_id": "TXN-9101", "timestamp": "2026-04-14T14:08:22Z",
     "type": "transfer", "amount": 5000, "counterparty": "+8801719876543", "status": "completed"}
  ]
}

Response (200):

{
  "ticket_id": "TKT-001",
  "relevant_transaction_id": "TXN-9101",
  "evidence_verdict": "consistent",
  "case_type": "wrong_transfer",
  "severity": "high",
  "department": "dispute_resolution",
  "agent_summary": "Customer reports sending 5000 BDT via TXN-9101 ...",
  "recommended_next_action": "Verify TXN-9101 details with the customer ...",
  "customer_reply": "We have received your request regarding transaction TXN-9101. Please do not share your PIN or OTP ...",
  "human_review_required": true,
  "confidence": 0.9,
  "reason_codes": ["wrong_transfer", "transaction_match", "dispute_initiated"]
}

Status codes: 200 success · 400 malformed JSON / missing required field · 422 valid schema but empty complaint · 500 controlled internal error (never leaks stack traces or secrets). More worked examples in sample_outputs/.
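
The status-code contract above can be illustrated with a small stdlib-only check. This is a sketch of the decision order, not the service's actual FastAPI/Pydantic handlers; the function name `check_request` and the message strings are made up for the example.

```python
import json
from typing import Tuple

REQUIRED_FIELDS = ("ticket_id", "complaint")


def check_request(raw: bytes) -> Tuple[int, str]:
    """Map a raw request body to the status code the API contract describes."""
    try:
        body = json.loads(raw)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return 400, "malformed JSON"
    if not isinstance(body, dict):
        return 400, "malformed JSON"
    for field in REQUIRED_FIELDS:
        if field not in body:
            return 400, f"missing required field: {field}"
    complaint = body["complaint"]
    # Assumption: a non-string complaint is treated like an empty one.
    if not isinstance(complaint, str) or not complaint.strip():
        return 422, "empty complaint"
    return 200, "ok"
```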


Quickstart

Docker (recommended)

docker compose up -d --build        # app on http://localhost:8000 (+ Redis)
curl http://localhost:8000/health   # {"status":"ok"}
python scripts/evaluate.py http://localhost:8000   # runs all 10 samples + checks

Local (no Docker)

python -m venv .venv && . .venv/Scripts/activate     # Windows; use bin/activate on *nix
pip install -r requirements-dev.txt
uvicorn app.main:app --host 0.0.0.0 --port 8000
pytest -q                                            # 103 tests

Enabling the LLM (optional)

The service runs fully without any API key (safe templates). To enable LLM-written text, copy .env.example to .env and set GROQ_API_KEY. Compose reads .env automatically.


Models

| Model | Where it runs | Role | Why |
| --- | --- | --- | --- |
| Rules engine (this repo) | In-process (CPU) | All scored fields: transaction match, evidence verdict, case_type (first pass), department, severity, escalation, safety | Exact, deterministic, free, injection-proof. This is the core of the solution. |
| Groq llama-3.3-70b-versatile | Groq Cloud (free tier), OpenAI-compatible API | Primary writer of agent_summary, recommended_next_action, customer_reply | Strong quality + JSON mode at zero cost; fast hosted inference (no local weights, no GPU). |
| Groq llama-3.3-70b-versatile (same model, separate call) | Groq free tier | Guarded case_type reclassification: fires only when rule confidence ≤ 0.55 AND a concrete signal exists (amount/phone/id or credential mention). Result validated against the enum; never touches relevant_transaction_id. Confidence capped at 0.7 on override. | Reuses the same primary model (free). Safety net for novel phrasing that the keyword classifier doesn't recognize; falls back to the rule answer on any error. |
| Groq llama-3.1-8b-instant | Groq Cloud (free tier) | In-Groq fast fallback | Lower latency if the 70B is slow/limited. |
| DeepSeek deepseek-chat | DeepSeek API (paid, cheap) | Fallback if Groq errors / is rate-limited | Cheap insurance for text quality under burst. |
| OpenAI gpt-4o-mini | OpenAI API (paid) | Off by default; rare hard cases only | Honors the "paid rarely" cost policy. |
| Templates (this repo) | In-process | Final fallback + every safety replacement | Guarantees a valid, safe reply even with no LLM. |

Cost reasoning — baseline is $0. Every scored decision is computed in-process (CPU, free). Text is written by Groq's free tier — one call per uncached ticket; identical tickets are served from cache (no call); a concurrency cap + per-minute guard keep us inside the free RPM/TPM limits; the low-confidence case_type vote reuses that same free tier. Only if Groq errors/rate-limits do we touch the cheap DeepSeek tier; OpenAI is off by default. With no key at all the service is fully functional on deterministic templates. So even at the scenario's ~40k complaints/day, the expected spend is $0 (free tier + cache + template fallback), with cheap DeepSeek as burst insurance only.
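
The "identical tickets are served from cache" behaviour could be implemented along these lines. This is a hedged in-memory sketch: `cache_key`, `TextCache`, and the choice to hash the full request body are assumptions, and the real service may key on a different subset of fields or store entries in Redis.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(ticket: dict) -> str:
    """Hash a canonical JSON form so key order and whitespace do not matter."""
    canonical = json.dumps(ticket, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TextCache:
    """In-memory stand-in for the optional Redis cache."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.llm_calls = 0  # counts how often the generator actually ran

    def get_or_generate(self, ticket: dict, generate: Callable[[dict], str]) -> str:
        key = cache_key(ticket)
        if key not in self._store:
            self.llm_calls += 1
            self._store[key] = generate(ticket)
        return self._store[key]
```

With this shape, a repeated ticket costs one dictionary lookup and zero LLM calls, which is what keeps the expected spend at the free-tier baseline.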


Safety logic

Enforced in code (app/engine/safety.py) on every customer-facing text, regardless of source:

  • No credential requests (−15): blocks "share/provide/enter your OTP/PIN/password"; allows warnings ("do not share your PIN/OTP") and proactively includes one.
  • No unauthorized financial promises (−10): blocks "we will refund/reverse/unblock"; uses the safe phrasing "any eligible amount will be returned through official channels." Internal ops steps ("initiate the reversal flow") are allowed.
  • Official channels only (−10): strips links, phone numbers, and third-party app redirects from replies; recommending the merchant for a refund is allowed.
  • Prompt-injection resistance: injected instruction-sentences are stripped before analysis, and structurally the rules — not the LLM — decide every scored field, so injection cannot alter classification/routing/verdict. On any violation the field is replaced with a guaranteed-safe template.
  • Customer-leaked secret redaction: if the customer pastes their own PIN/OTP/card number into the complaint, those digits are scrubbed before analysis, before the LLM sees them, and from any text echoed back. Regex-based, EN + BN. This is a different concern from the credential-request rule above: that one blocks the bot asking for secrets; this one stops the customer volunteering a secret from leaking into logs, the LLM context, or the reply.
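
A stripped-down version of the first three rules above might look like the following. It is an English-only sketch with illustrative patterns; the names, the violation labels, and the exact regexes are assumptions and are much narrower than a production denylist (no Bangla, no secret redaction).

```python
import re
from typing import List, Tuple

SAFE_WARNING = "Please do not share your PIN or OTP with anyone."
SAFE_PROMISE = "Any eligible amount will be returned through official channels."

# "share/provide/enter ... OTP/PIN/password" inside one sentence.
_CRED_REQUEST = re.compile(
    r"\b(?:share|provide|enter|send|tell)\b[^.!?]*\b(?:otp|pin|password)\b", re.I)
# A negation in the same sentence turns the request into an allowed warning.
_NEGATION = re.compile(r"\b(?:do not|don't|never)\b", re.I)
_PROMISE = re.compile(r"\bwe(?: will|'ll| shall) (?:refund|reverse|unblock)\b", re.I)
_LINK = re.compile(r"https?://\S+|www\.\S+", re.I)
_PHONE = re.compile(r"\+?\d[\d\s-]{8,}\d")


def sanitize_reply(text: str) -> Tuple[str, List[str]]:
    """Return (safe_text, violations) for one customer-facing reply."""
    violations: List[str] = []
    if _LINK.search(text) or _PHONE.search(text):
        violations.append("unofficial_channel")
        text = _PHONE.sub("", _LINK.sub("", text))

    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if _CRED_REQUEST.search(sentence) and not _NEGATION.search(sentence):
            violations.append("credential_request")
            continue  # drop the whole sentence
        if _PROMISE.search(sentence):
            violations.append("unauthorized_promise")
            sentence = SAFE_PROMISE
        kept.append(sentence)

    cleaned = re.sub(r"\s{2,}", " ", " ".join(kept)).strip()
    if "do not share" not in cleaned.lower():
        cleaned = f"{cleaned} {SAFE_WARNING}".strip()  # proactive warning
    return cleaned, violations
```

Working sentence by sentence keeps the logic easy to audit: a sentence is either kept, dropped, or swapped for a fixed safe phrase, so the output can never contain a partially edited unsafe sentence.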

Why a hand-rolled sanitizer (not Guardrails AI / Llama Guard)? The standard libraries are excellent but a poor fit here: Llama Guard is an 8B model (violates the no-GPU / <1 GB-image / multi-GB-download rules), and Guardrails AI's re-ask loop adds latency and non-determinism on a free-tier model. We need the safety layer to be deterministic, dependency-light, and sub-millisecond so it can run on every reply (LLM or template) without risking the 30 s timeout. The regex/keyword denylist plus template-replacement gives that guarantee; the LLM being unable to touch scored fields is the structural backstop.


Testing

pytest -q                                  # 103 unit/integration tests (offline, deterministic)
python scripts/evaluate.py <BASE_URL>      # end-to-end: 10 samples + error codes + latency
python scripts/gen_sample_outputs.py <BASE_URL>   # regenerate sample_outputs/

Deployment

The API (the judged service) runs as live HTTPS on an Azure VM via Docker Compose + Caddy (auto Let's Encrypt on the free *.cloudapp.azure.com DNS label, load-balancing N app replicas) — base URL https://queuestorm-nafiz.centralindia.cloudapp.azure.com. Steps in RUNBOOK.md.

An optional static demo UI (web/index.html) is deployed separately on Vercel (https://queuestorm-nine.vercel.app) purely for manual exploration; it is a single dependency-free HTML page that calls the API from the browser (CORS is enabled on the API for this). It is not part of the judged path — the harness calls the API base URL directly.

flowchart LR
    J["Judge / agent<br/>HTTPS"] --> CADDY
    subgraph VM["☁️ Azure VM · 2 vCPU / 4 GB · Docker Compose"]
        CADDY["Caddy :443<br/>auto-TLS + load-balance"]:::edge
        CADDY --> A1["app · uvicorn"]:::app
        CADDY --> A2["app · uvicorn"]:::app
        CADDY --> AN["app · uvicorn ×N"]:::app
        A1 --> R[("Redis<br/>shared cache + rate-limit<br/>·optional·")]:::store
        A2 --> R
        AN --> R
    end
    A1 -. "free" .-> GROQ["Groq 70B / 8B<br/>primary"]:::free
    A2 -. "on 429 / error" .-> DS["DeepSeek<br/>cheap fallback"]:::paid
    AN -. "opt-in · rare" .-> OAI["OpenAI<br/>off by default"]:::paid
    A1 --> T["templates<br/>final safe fallback"]:::safe

    classDef edge fill:#0f766e,stroke:#5eead4,color:#ecfeff
    classDef app fill:#1d4ed8,stroke:#93c5fd,color:#eff6ff
    classDef store fill:#9f1239,stroke:#fda4af,color:#fff1f2
    classDef free fill:#15803d,stroke:#86efac,color:#f0fdf4
    classDef paid fill:#a16207,stroke:#fde047,color:#fefce8
    classDef safe fill:#6b21a8,stroke:#d8b4fe,color:#faf5ff
    style VM fill:#0b1220,stroke:#38bdf8,color:#e0f2fe

Stateless app replicas scale horizontally behind Caddy; cache + rate-limit state lives in optional Redis (in-memory fallback if absent). Every external dependency degrades gracefully — Groq → DeepSeek → templates — so a valid request always returns a safe 200.
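
The graceful-degradation chain can be expressed as a short loop over providers. In this sketch the provider callables are placeholders (the real gateway uses HTTP clients with a hard timeout), and `write_text` is a hypothetical name; each callable is assumed to raise on timeout, rate limit, or any other failure.

```python
from typing import Callable, Sequence, Tuple

Provider = Tuple[str, Callable[[dict], str]]


def write_text(
    ticket: dict,
    providers: Sequence[Provider],
    template: Callable[[dict], str],
) -> Tuple[str, str]:
    """Try each provider in order; return (text, text_source)."""
    for name, call in providers:
        try:
            text = call(ticket)
        except Exception:
            continue  # timeout, 429, network error: move to the next tier
        if text and text.strip():
            return text, name
    # The deterministic template never fails, so a reply is always produced.
    return template(ticket), "template"
```

A call such as `write_text(ticket, [("groq", groq_call), ("deepseek", deepseek_call)], render_template)` would then yield the text plus a source label comparable to the text_source field in the audit log; the three callable names here are illustrative.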

Audit trail (judge reproducibility)

Every request emits a single structured stdout log line capturing the why behind the decision, so judges can run docker compose logs app | grep '"decision"' to reproduce any answer:

{
  "ticket_id": "TKT-001",
  "case_type": "wrong_transfer",
  "evidence_verdict": "consistent",
  "relevant_transaction_id": "TXN-9101",
  "department": "dispute_resolution",
  "severity": "high",
  "human_review_required": true,
  "match_flags": ["transaction_match"],
  "signals": {
    "amounts": [5000.0],
    "phones": ["1712345678"],
    "ids": [],
    "time_hour": 14,
    "day_offset": 0,
    "language": "en"
  },
  "text_source": "template",
  "rule_confidence": 0.9,
  "llm_reclassified": false,
  "safety_violations": [],
  "injection_detected": false
}

Per-case decision counters (e.g. case_type.wrong_transfer, verdict.inconsistent, human_review.true) are exposed at GET /metrics for the monitoring tie-breaker.

Assumptions & limitations

  • Inputs are synthetic; no real customer/payment data or payment-API integration.
  • relevant_transaction_id is always null or a real id from the provided history — never invented.
  • Matching uses multiple grounded signals — exact amount (incl. Bangla numerals, commas, and scale words like 5k / ২ লাখ), counterparty phone tail, referenced merchant/agent/biller id, transaction type, and time-of-day (e.g. "around 2pm" → the 14:0x transaction). When two same-amount transactions go to different recipients and nothing (time/id) separates them, the service returns insufficient_data with a clarifying reply rather than guessing — guessing a dispute is penalised more than asking.
  • Every decision emits a structured audit-trail log line (matched id, which signals fired, verdict, text source, any safety violation, injection-detected flag) for traceability and debugging under the judge harness; aggregate counts are exposed at /metrics.
  • Groq free-tier rate limits can push some requests to DeepSeek/templates under heavy burst — by design, never an error.
  • Bangla replies are guaranteed safe via templates even when the model's Bangla is imperfect.
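
To make the amount-normalization step concrete, here is a minimal sketch of parsing digits, Bangla numerals, commas, and scale words. The function name and regex are assumptions; word-numbers such as "five thousand" or "panch hajar" are not handled, and a fuller normalizer would remove phone numbers and clock times before this step.

```python
import re
from typing import List

# Bangla digits U+09E6..U+09EF mapped to ASCII 0-9.
_BN_DIGITS = {0x09E6 + i: str(i) for i in range(10)}
_SCALES = {"k": 1_000, "hajar": 1_000, "lakh": 100_000, "লাখ": 100_000}

_AMOUNT = re.compile(
    r"(?<![\w+-])"                 # not inside an id such as TXN-9101 or a +880 phone
    r"(\d+(?:\.\d+)?)(?!\d)"       # the number itself
    r"(?!\s*(?:am|pm)\b)"          # skip clock times like "2pm"
    r"\s*(k\b|hajar|lakh|লাখ)?",   # optional scale word
    re.I,
)


def parse_amounts(text: str) -> List[float]:
    """Extract candidate amounts from a complaint as floats."""
    normalized = text.translate(_BN_DIGITS)
    normalized = re.sub(r"(?<=\d),(?=\d)", "", normalized)  # 5,000 -> 5000
    amounts = []
    for m in _AMOUNT.finditer(normalized):
        value = float(m.group(1))
        if m.group(2):
            value *= _SCALES[m.group(2).lower()]
        amounts.append(value)
    return amounts
```

The resulting floats can then be compared exactly against the amount field of each history entry, which is what allows "around 2pm" to be treated as a time signal rather than an amount.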

About

Rules-first FastAPI "complaint investigator" for digital finance: grounds each ticket in transaction evidence, classifies/routes it, and writes a safe reply — LLM for text only, deterministic safety in code.
