An AI/API SupportOps copilot for a digital-finance platform (SUST Codex Hackathon — Preliminary). It reads one customer complaint plus a short transaction-history snippet, investigates what actually happened (the complaint may contradict the data), classifies and routes the case, and drafts a safe reply — never asking for credentials, never promising unauthorized financial action.
- Endpoints: GET /health, POST /analyze-ticket (+ GET /metrics)
- Stack: Python 3.12 · FastAPI · Pydantic v2 · httpx · (optional) Redis · Docker + Caddy
- Design: rules-first — a deterministic engine decides every scored field; a free LLM (Groq) only writes the text fields, behind a code-level safety sanitizer with template fallback.
- API base URL (the judged endpoint): https://queuestorm-nafiz.centralindia.cloudapp.azure.com
  GET /health → {"status":"ok"} · POST /analyze-ticket → structured JSON
- Demo UI (optional, for humans): https://queuestorm-nine.vercel.app — a thin static page to explore the API by hand. It only calls the API above; the judge harness should target the API base URL, not this page.
Status: 103/103 unit tests pass; 10/10 public sample cases correct end-to-end (every scored field, namely relevant_transaction_id, evidence_verdict, case_type, department, severity, and human_review_required, matches the expected output exactly); image ≈ 239 MB; p95 latency ≈ 3 ms on the template path (≈ 1.2 s for the first cold call).
The judge scores Evidence Reasoning (35) + Safety (20) + API/Schema (15) = 70/100 on fields that must be exact and consistent. Those are computed by deterministic code, so they are correct, instant, free, and immune to prompt injection. The LLM is used only for the Response Quality (10) text — and even that degrades gracefully to safe templates. This is also the most cost-efficient design: the free tier does the language work, paid models are a rare fallback.
flowchart LR
R([Request]) --> V["Validate<br/>400 / 422"]
V --> SI["Strip injected<br/>instructions"]
SI --> N["Normalize<br/>amount · phone · id · time · lang"]
N --> RU["RULES — scored fields<br/>match · verdict · case_type<br/>routing · severity · escalation"]
RU --> GATE{"rule<br/>confident?"}
GATE -->|"yes · 80–90%"| TX
GATE -->|"low conf + concrete signal"| LLMC["LLM picks case_type<br/>enum-validated · capped"]
LLMC --> TX["TEXT fields only<br/>Groq → DeepSeek → templates"]
TX --> SA["SAFETY sanitizer<br/>runs on every reply"]
SA --> O([200 JSON])
classDef io fill:#0f766e,stroke:#5eead4,stroke-width:2px,color:#ecfeff
classDef pre fill:#1d4ed8,stroke:#93c5fd,stroke-width:2px,color:#eff6ff
classDef rules fill:#15803d,stroke:#86efac,stroke-width:2px,color:#f0fdf4
classDef gate fill:#b45309,stroke:#fcd34d,stroke-width:2px,color:#fffbeb
classDef llm fill:#7e22ce,stroke:#d8b4fe,stroke-width:2px,color:#faf5ff
classDef safe fill:#b91c1c,stroke:#fca5a5,stroke-width:2px,color:#fef2f2
class R,O io
class V,SI,N pre
class RU rules
class GATE gate
class LLMC,TX llm
class SA safe
The LLM cannot change verdict, transaction match, severity, routing, escalation, or safety. The one place AI reasoning is allowed into a scored field is a narrow, guarded case_type reclassification: when the deterministic keyword classifier is low-confidence (a complaint phrased outside the keyword lists, or a weak-signal phishing mention) and there is a concrete signal (an amount/phone/id), the LLM picks case_type from the fixed enum — validated against the enum, capped in confidence, and falling back to the rule answer on any error. It deliberately does not fire on vague no-signal complaints (those stay other), and it never selects relevant_transaction_id — the matcher decides that deterministically and is free to return null on ambiguity (which the rubric rewards). This is the hybrid sweet spot: rules for the confident 80–90%, LLM only for the genuinely ambiguous tail.
All 10 public sample cases are classified by the rules alone (rule_conf ≥ 0.7 for each); the LLM reclassifier is a safety net for adversarial / novel phrasing the rules haven't seen — exactly the hidden-test cases the judges may add.
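The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in llm/classify.py: the function name, the `llm_vote` callable, and the idea that the vote returns a `(label, confidence)` pair are assumptions, while the 0.55 threshold and the 0.7 cap are the values quoted in the model table later in this README.

```python
from typing import Callable, Iterable, Tuple

LOW_CONF_THRESHOLD = 0.55  # rules at or below this confidence count as "unsure"
OVERRIDE_CONF_CAP = 0.7    # ceiling applied to confidence when the LLM overrides


def reclassify_case_type(
    rule_case_type: str,
    rule_conf: float,
    has_concrete_signal: bool,
    allowed: Iterable[str],
    llm_vote: Callable[[], Tuple[str, float]],  # hypothetical LLM call
) -> Tuple[str, float, bool]:
    """Return (case_type, confidence, llm_reclassified)."""
    rule_answer = (rule_case_type, rule_conf, False)

    # Gate: confident rules, or no amount/phone/id signal, never reach the LLM.
    if rule_conf > LOW_CONF_THRESHOLD or not has_concrete_signal:
        return rule_answer

    try:
        label, llm_conf = llm_vote()
        label = label.strip().lower()
        llm_conf = float(llm_conf)
    except Exception:
        # Any error (timeout, bad payload) falls back to the rule answer.
        return rule_answer

    # Enum validation: anything outside the fixed set is discarded.
    if label not in set(allowed) or label == rule_case_type:
        return rule_answer

    return label, min(llm_conf, OVERRIDE_CONF_CAP), True
```

Note that the function has no way to return a transaction id, which mirrors the rule that the LLM never selects relevant_transaction_id.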
Every scored field is owned by deterministic code. The LLM owns only free text, and the one scored field it can influence (case_type) is enum-validated and rules-gated.
flowchart TB
subgraph G["🟢 Deterministic engine · scored · injection-proof"]
direction LR
F1["relevant_transaction_id"]:::r
F2["evidence_verdict"]:::r
F3["severity"]:::r
F4["department"]:::r
F5["human_review_required"]:::r
F6["reason_codes · confidence"]:::r
end
subgraph H["🟡 Rules-gated, LLM-assisted"]
F7["case_type<br/>(rule first; LLM only on low confidence,<br/>validated against the enum)"]:::h
end
subgraph L["🟣 LLM text → always safety-sanitized"]
F8["agent_summary"]:::l
F9["recommended_next_action"]:::l
F10["customer_reply"]:::l
end
classDef r fill:#166534,stroke:#86efac,color:#f0fdf4
classDef h fill:#a16207,stroke:#fde047,color:#fefce8
classDef l fill:#6b21a8,stroke:#d8b4fe,color:#faf5ff
style G fill:#052e16,stroke:#22c55e,color:#dcfce7
style H fill:#422006,stroke:#eab308,color:#fef9c3
style L fill:#3b0764,stroke:#a855f7,color:#f3e8ff
The complaint may contradict the data. The matcher grounds the claim in real transactions using layered signals, and returns null rather than guess when nothing cleanly separates candidates (the rubric rewards asking over a wrong dispute).
flowchart TD
C["Complaint signals<br/>amount · phone · id · time"] --> A{"amount<br/>match?"}
A -->|"unique"| ONE["single candidate"]:::ok
A -->|"none"| PH{"phone / id<br/>match?"}
PH -->|"unique"| ONE
PH -->|"none"| NULL1["relevant_id = null<br/>insufficient_data"]:::stop
A -->|"multiple"| DUP{"duplicate pattern?<br/>2× same recipient, secs apart"}
DUP -->|"yes"| SECOND["pick the 2nd<br/>→ consistent"]:::ok
DUP -->|"no"| DIS{"time-of-day or id<br/>separates exactly one?"}
DIS -->|"yes"| ONE
DIS -->|"no · different recipients"| NULL2["relevant_id = null<br/>insufficient_data<br/>ask to clarify"]:::stop
ONE --> VERD{"data vs claim"}
VERD -->|"supports"| CONS["evidence_verdict<br/>= consistent"]:::good
VERD -->|"contradicts<br/>e.g. established recipient,<br/>status already completed"| INC["evidence_verdict<br/>= inconsistent"]:::warn
classDef ok fill:#1d4ed8,stroke:#93c5fd,color:#eff6ff
classDef good fill:#15803d,stroke:#86efac,color:#f0fdf4
classDef warn fill:#b45309,stroke:#fcd34d,color:#fffbeb
classDef stop fill:#b91c1c,stroke:#fca5a5,color:#fef2f2
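The matching half of the flowchart above could look something like the sketch below. It is an illustration under stated assumptions rather than the contents of matcher.py: the `Txn` dataclass, the function name, the note strings, and the 60-second duplicate window are all hypothetical (the diagram only says "secs apart"), and the hour is compared directly against the timestamp as given.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence, Tuple

DUPLICATE_WINDOW_SECONDS = 60  # assumption: the source does not quantify "secs apart"


@dataclass(frozen=True)
class Txn:
    transaction_id: str
    timestamp: str  # ISO 8601, e.g. "2026-04-14T14:08:22Z"
    amount: float
    counterparty: str


def _parse_ts(txn: Txn) -> datetime:
    # Replace the trailing "Z" so older Python versions can parse it too.
    return datetime.fromisoformat(txn.timestamp.replace("Z", "+00:00"))


def match_transaction(
    history: Sequence[Txn],
    amounts: Sequence[float],
    phone_tail: Optional[str] = None,
    hour: Optional[int] = None,
) -> Tuple[Optional[str], str]:
    """Return (relevant_transaction_id or None, note explaining the branch taken)."""
    by_amount = [t for t in history if t.amount in amounts]
    if len(by_amount) == 1:
        return by_amount[0].transaction_id, "amount_unique"

    if not by_amount:
        # No amount match: fall back to the phone tail, else give up.
        by_phone = [t for t in history if phone_tail and t.counterparty.endswith(phone_tail)]
        if len(by_phone) == 1:
            return by_phone[0].transaction_id, "phone_unique"
        return None, "insufficient_data"

    # Several same-amount candidates: check for a duplicate pair first.
    ordered = sorted(by_amount, key=_parse_ts)
    if len(ordered) == 2:
        first, second = ordered
        gap = (_parse_ts(second) - _parse_ts(first)).total_seconds()
        if first.counterparty == second.counterparty and gap <= DUPLICATE_WINDOW_SECONDS:
            return second.transaction_id, "duplicate_second"

    # Otherwise a signal must isolate exactly one candidate.
    if hour is not None:
        at_hour = [t for t in ordered if _parse_ts(t).hour == hour]
        if len(at_hour) == 1:
            return at_hour[0].transaction_id, "time_of_day"
    if phone_tail:
        to_phone = [t for t in ordered if t.counterparty.endswith(phone_tail)]
        if len(to_phone) == 1:
            return to_phone[0].transaction_id, "phone_unique"

    # Nothing separates the candidates: return null rather than guess.
    return None, "insufficient_data"
```

The returned id is always taken from `history` or is `None`, so this shape cannot invent an id. The verdict step (data vs claim) is left out of the sketch.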
A direct map from what the problem statement asks → how this service does it → where it lives in code.
| Requirement | How we solve it | Code |
|---|---|---|
| Pick the right transaction (Evidence, 35 pts) | Layered deterministic matcher: exact amount (digits, Bangla numerals, commas, 5k/২ লাখ, and word-numbers like "five thousand"/"panch hajar") → phone tail → merchant/agent/biller id → time-of-day disambiguation ("around 2pm" → the 14:0x txn). The id is always a real history id or null, never invented. | matcher.py · normalize.py |
| Evidence verdict | Compare the matched transaction against the claim: failed/pending status, established-recipient pattern, already-settled, duplicate pair → consistent / inconsistent / insufficient_data. | matcher.py |
| Don't guess on ambiguity | Multiple same-amount txns to different recipients with no separating signal → return null + a clarifying reply (the rubric rewards asking over a wrong dispute). | matcher.py |
| Classify + route (case_type, department, severity) | Priority-ordered keyword rules first; low-confidence cases get a guarded enum-validated LLM vote on case_type only. | classifier.py · llm/classify.py |
| Escalation (human_review_required) | Deterministic formula validated against all 10 samples (phishing/duplicate/agent always; wrong_transfer w/ id; any inconsistent; high-value). | classifier.py |
| No PIN/OTP request (−15) | Regex denylist blocks requests, allows warnings, and proactively adds "do not share your PIN/OTP". | safety.py |
| No unauthorized refund/reversal (−10) | Blocks promises, substitutes "any eligible amount will be returned through official channels". | safety.py |
| Official channels only (−10) | Strips links, phone numbers, third-party app redirects from replies. | safety.py |
| Prompt injection | Injected instruction sentences are stripped before analysis, and structurally the rules (not the LLM) decide every scored field. | safety.py · pipeline.py |
| Bangla / Banglish | Numeral + word-number parsing, language detection, reply in the complaint's language, Bangla-native safe templates. | normalize.py · templates.py |
| Malformed / missing input | Tolerant Pydantic input model + controlled 400/422/500 handlers: never crashes, never leaks secrets/stack traces. | schemas.py · main.py |
| Exact schema, echo ticket_id | Strict output Enums + response_model make an invalid shape impossible. | schemas.py |
| Latency < 30 s, p95 target < 5 s | One LLM call per ticket, ~6–8 s hard timeout, concurrency cap, input-hash cache, instant template fast-path on any miss. | gateway.py · pipeline.py |
| Reachable & reproducible | Live HTTPS (Azure VM + Caddy auto-TLS), public Docker image, copy-paste runbook. | RUNBOOK.md |
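As an illustration of the amount normalization mentioned in the first table row, here is a minimal sketch covering digits, Bangla numerals, comma grouping, and the 5k / ২ লাখ scale words. It is not the code in normalize.py: the function name and regex are assumptions, word-numbers ("five thousand", "panch hajar") are left out, and bare phone numbers in the text would still be picked up as amounts.

```python
import re
from typing import List

# Map Bangla digits to ASCII so one regex handles both scripts.
_BANGLA_TO_ASCII = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")

# Scale words named in the table; the multipliers are standard (1 lakh = 100,000).
_SCALE = {"k": 1_000.0, "lakh": 100_000.0, "লাখ": 100_000.0}

_AMOUNT_RE = re.compile(
    r"(?<![\w-])"                    # not inside an id such as TXN-9101
    r"(\d[\d,]*(?:\.\d+)?)"          # digits with optional commas and decimals
    r"(?!\d)(?!\s*(?:am|pm)\b)"      # skip clock times such as "2pm"
    r"\s*(k(?![a-z])|lakh|লাখ)?",    # optional scale word
    re.IGNORECASE,
)


def extract_amounts(text: str) -> List[float]:
    """Return every amount mentioned in the text, in order of appearance."""
    ascii_text = text.translate(_BANGLA_TO_ASCII)
    amounts: List[float] = []
    for match in _AMOUNT_RE.finditer(ascii_text):
        value = float(match.group(1).replace(",", ""))
        scale = match.group(2)
        if scale:
            value *= _SCALE[scale.lower()]
        amounts.append(value)
    return amounts
```

For the sample complaint "I sent 5000 taka to a wrong number around 2pm today", this yields a single amount of 5000.0 because the "2pm" token is excluded by the lookahead.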
Each choice below was a deliberate trade-off for this rubric and these constraints (no GPU, <1 GB image, 30 s timeout, free-tier budget):
- Rules-first, not LLM-first. 70 of 100 points are exact-match fields. Deterministic code makes them correct, instant, free, and injection-proof. An LLM-first design would be slower, costlier, and non-reproducible. → chosen
- Considered semantic embeddings for classification → rejected. sentence-transformers pulls in PyTorch + weights (≫1 GB) and trips the "no large local model weights / no multi-GB downloads" rule; multilingual models needed for Bangla make it worse. We get the same generalization benefit from a guarded LLM tail at zero image cost. → rejected, replaced
- Added a low-confidence LLM case_type vote. Pure keywords miss novel phrasing ("money ended up somewhere it shouldn't"). The LLM is invoked only when rules are unsure and a concrete signal exists, picks from the fixed enum, and never selects the transaction (the safe-null behavior is rewarded). → chosen
- Considered Guardrails AI / Llama Guard → rejected. Llama Guard is an 8B model (breaks the image/GPU limits); Guardrails' re-ask loop adds latency + non-determinism. A deterministic regex sanitizer + template replacement runs sub-millisecond on every reply. → rejected, hand-rolled
- Three-tier text fallback (Groq → DeepSeek → templates). Guarantees a valid, safe 200 even if every paid/free LLM is down; reliability over flash. → chosen
| Doc | Contents |
|---|---|
| SYSTEM_DESIGN.md | Architecture, sequence/class/ER UML, reasoning + safety + scaling deep dives (Mermaid diagrams) |
| API_REFERENCE.md | Full request/response schema, enums, status codes, curl examples |
| DEPLOY_AZURE.md | Beginner click-by-click Azure (Student) deployment guide |
| RUNBOOK.md | Copy-paste local Docker + Azure VM (CLI) steps |
| prd.md | Product requirements & design decisions |
| sample_outputs/ | Generated responses for all 10 public sample cases |
{"status": "ok"}

Request (required ticket_id, complaint; the rest optional):
{
"ticket_id": "TKT-001",
"complaint": "I sent 5000 taka to a wrong number around 2pm today...",
"language": "en",
"channel": "in_app_chat",
"user_type": "customer",
"transaction_history": [
{"transaction_id": "TXN-9101", "timestamp": "2026-04-14T14:08:22Z",
"type": "transfer", "amount": 5000, "counterparty": "+8801719876543", "status": "completed"}
]
}

Response (200):
{
"ticket_id": "TKT-001",
"relevant_transaction_id": "TXN-9101",
"evidence_verdict": "consistent",
"case_type": "wrong_transfer",
"severity": "high",
"department": "dispute_resolution",
"agent_summary": "Customer reports sending 5000 BDT via TXN-9101 ...",
"recommended_next_action": "Verify TXN-9101 details with the customer ...",
"customer_reply": "We have received your request regarding transaction TXN-9101. Please do not share your PIN or OTP ...",
"human_review_required": true,
"confidence": 0.9,
"reason_codes": ["wrong_transfer", "transaction_match", "dispute_initiated"]
}

Status codes: 200 success · 400 malformed JSON / missing required field · 422 valid schema but empty complaint · 500 controlled internal error (never leaks stack traces or secrets). More worked examples in sample_outputs/.
docker compose up -d --build # app on http://localhost:8000 (+ Redis)
curl http://localhost:8000/health # {"status":"ok"}
python scripts/evaluate.py http://localhost:8000 # runs all 10 samples + checks

python -m venv .venv && . .venv/Scripts/activate # Windows; use bin/activate on *nix
pip install -r requirements-dev.txt
uvicorn app.main:app --host 0.0.0.0 --port 8000
pytest -q # 103 tests

The service runs fully without any API key (safe templates). To enable LLM-written text, copy .env.example to .env and set GROQ_API_KEY. Compose reads .env automatically.
| Model | Where it runs | Role | Why |
|---|---|---|---|
| Rules engine (this repo) | In-process (CPU) | All scored fields — transaction match, evidence verdict, case_type (first pass), department, severity, escalation, safety | Exact, deterministic, free, injection-proof. This is the core of the solution. |
| Groq llama-3.3-70b-versatile | Groq Cloud (free tier), OpenAI-compatible API | Primary writer of agent_summary, recommended_next_action, customer_reply | Strong quality + JSON mode at zero cost; fast hosted inference (no local weights, no GPU). |
| Groq llama-3.3-70b-versatile (same model, separate call) | Groq free tier | Guarded case_type reclassification: fires only when rule confidence ≤ 0.55 AND a concrete signal exists (amount/phone/id or credential mention). Result validated against the enum; never touches relevant_transaction_id. Confidence capped at 0.7 on override. Reuses the same primary model (free). | Safety net for novel phrasing that the keyword classifier doesn't recognize; falls back to the rule answer on any error. |
| Groq llama-3.1-8b-instant | Groq Cloud (free tier) | In-Groq fast fallback | Lower latency if the 70B is slow/limited. |
| DeepSeek deepseek-chat | DeepSeek API (paid, cheap) | Fallback if Groq errors / is rate-limited | Cheap insurance for text quality under burst. |
| OpenAI gpt-4o-mini | OpenAI API (paid) | Off by default; rare hard cases only | Honors the "paid rarely" cost policy. |
| Templates (this repo) | In-process | Final fallback + every safety replacement | Guarantees a valid, safe reply even with no LLM. |
Cost reasoning — baseline is $0. Every scored decision is computed in-process (CPU, free). Text is written by Groq's free tier — one call per uncached ticket; identical tickets are served from cache (no call); a concurrency cap + per-minute guard keep us inside the free RPM/TPM limits; the low-confidence case_type vote reuses that same free tier. Only if Groq errors/rate-limits do we touch the cheap DeepSeek tier; OpenAI is off by default. With no key at all the service is fully functional on deterministic templates. So even at the scenario's ~40k complaints/day, the expected spend is $0 (free tier + cache + template fallback), with cheap DeepSeek as burst insurance only.
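The "identical tickets are served from cache" step could be sketched along these lines. This is an in-memory illustration only, not the gateway.py implementation: the class and function names are hypothetical, and the choice to key on the whitespace-normalized complaint plus the transaction history (and not on ticket_id) is an assumption.

```python
import hashlib
import json
from typing import Callable, Dict, List


def ticket_cache_key(complaint: str, history: List[dict]) -> str:
    """Hash the normalized complaint and history into a stable cache key."""
    normalized = " ".join(complaint.split()).lower()
    payload = json.dumps(
        {"complaint": normalized, "history": history},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TextCache:
    """Tiny in-memory stand-in for the shared cache; Redis would replace the dict."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.misses = 0  # each miss corresponds to one writer (LLM) call

    def get_or_write(self, complaint: str, history: List[dict], writer: Callable[[], str]) -> str:
        key = ticket_cache_key(complaint, history)
        if key not in self._store:
            self.misses += 1
            self._store[key] = writer()
        return self._store[key]
```

With this shape, a repeated ticket costs a hash and a dictionary lookup, and the writer is invoked once per distinct input.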
Enforced in code (app/engine/safety.py) on every customer-facing text, regardless of source:
- No credential requests (−15): blocks "share/provide/enter your OTP/PIN/password"; allows warnings ("do not share your PIN/OTP") and proactively includes one.
- No unauthorized financial promises (−10): blocks "we will refund/reverse/unblock"; uses the safe phrasing "any eligible amount will be returned through official channels." Internal ops steps ("initiate the reversal flow") are allowed.
- Official channels only (−10): strips links, phone numbers, and third-party app redirects from replies; recommending the merchant for a refund is allowed.
- Prompt-injection resistance: injected instruction-sentences are stripped before analysis, and structurally the rules — not the LLM — decide every scored field, so injection cannot alter classification/routing/verdict. On any violation the field is replaced with a guaranteed-safe template.
- Customer-leaked secret redaction: if the customer pastes their own PIN/OTP/card number into the complaint, those digits are scrubbed before analysis, before the LLM sees them, and from any text echoed back. Regex-based, EN + BN. This is a different concern from the credential-request rule above: that one blocks the bot asking for secrets; this one stops the customer volunteering a secret from leaking into logs, the LLM context, or the reply.
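The secret-redaction rule in the last bullet might be approximated as below. This is a sketch, not the pattern set in app/engine/safety.py: the keyword list, the 20-character gap, the 4 to 8 digit length, and the `[REDACTED]` marker are assumptions, and card numbers are deliberately left out because a naive long-digit pattern would also swallow phone numbers.

```python
import re

_SECRET_RE = re.compile(
    r"(\b(?:pin|otp)\b|পিন|ওটিপি)"  # keyword, English or Bangla (illustrative list)
    r"(\D{0,20}?)"                   # short non-digit gap such as " is "
    r"(\d{4,8})(?!\d)",              # the code; \d also matches Bangla digits
    re.IGNORECASE,
)


def redact_secrets(text: str) -> str:
    """Replace digits that follow a PIN/OTP keyword with a fixed marker."""
    return _SECRET_RE.sub(lambda m: m.group(1) + m.group(2) + "[REDACTED]", text)
```

Amounts and times that are not preceded by a credential keyword pass through untouched, so the matcher still sees the signals it needs.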
Why a hand-rolled sanitizer (not Guardrails AI / Llama Guard)? The standard libraries are excellent but a poor fit here: Llama Guard is an 8B model (violates the no-GPU / <1 GB-image / multi-GB-download rules), and Guardrails AI's re-ask loop adds latency and non-determinism on a free-tier model. We need the safety layer to be deterministic, dependency-light, and sub-millisecond so it can run on every reply (LLM or template) without risking the 30 s timeout. The regex/keyword denylist plus template-replacement gives that guarantee; the LLM being unable to touch scored fields is the structural backstop.
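A denylist-plus-template sanitizer of the kind described here can be sketched in a few lines. The regexes, violation labels, and template wording below are assumptions for illustration (only the two quoted safe phrases come from this README), and the sketch covers credential requests, refund promises, and links but not phone numbers or third-party app redirects.

```python
import re
from typing import List, Tuple

PIN_WARNING = "Please do not share your PIN or OTP with anyone."
SAFE_TEMPLATE = (
    "We have received your request and our team is reviewing it. "
    "Any eligible amount will be returned through official channels. "
    + PIN_WARNING
)

_ASK = re.compile(r"\b(share|provide|enter)\b[^.!?]*\b(otp|pin|password)\b", re.I)
_WARN = re.compile(r"\b(do not|don't|never)\b[^.!?]*\b(share|provide|enter)\b", re.I)
_PROMISE = re.compile(r"\bwe will\s+(refund|reverse|unblock)\b", re.I)
_LINK = re.compile(r"(https?://|www\.)\S+", re.I)


def sanitize_reply(text: str) -> Tuple[str, List[str]]:
    """Return (safe_text, violations); any violation swaps in the template."""
    violations: List[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # A credential mention is fine when the sentence is a warning.
        if _ASK.search(sentence) and not _WARN.search(sentence):
            violations.append("credential_request")
        if _PROMISE.search(sentence):
            violations.append("unauthorized_promise")
    if violations:
        return SAFE_TEMPLATE, violations

    # No violation: strip links and make sure the warning is present.
    cleaned = re.sub(r"\s{2,}", " ", _LINK.sub("", text)).strip()
    if not _WARN.search(cleaned):
        cleaned = f"{cleaned} {PIN_WARNING}"
    return cleaned, violations
```

Because every branch is plain regex work on a short string, the same function can run on LLM output and on templates alike.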
pytest -q # 103 unit/integration tests (offline, deterministic)
python scripts/evaluate.py <BASE_URL> # end-to-end: 10 samples + error codes + latency
python scripts/gen_sample_outputs.py <BASE_URL> # regenerate sample_outputs/

The API (the judged service) runs as live HTTPS on an Azure VM via Docker Compose + Caddy (auto Let's Encrypt on the free *.cloudapp.azure.com DNS label, load-balancing N app replicas); base URL: https://queuestorm-nafiz.centralindia.cloudapp.azure.com. Steps in RUNBOOK.md.
An optional static demo UI (web/index.html) is deployed separately on Vercel (https://queuestorm-nine.vercel.app) purely for manual exploration; it is a single dependency-free HTML page that calls the API from the browser (CORS is enabled on the API for this). It is not part of the judged path — the harness calls the API base URL directly.
flowchart LR
J["Judge / agent<br/>HTTPS"] --> CADDY
subgraph VM["☁️ Azure VM · 2 vCPU / 4 GB · Docker Compose"]
CADDY["Caddy :443<br/>auto-TLS + load-balance"]:::edge
CADDY --> A1["app · uvicorn"]:::app
CADDY --> A2["app · uvicorn"]:::app
CADDY --> AN["app · uvicorn ×N"]:::app
A1 --> R[("Redis<br/>shared cache + rate-limit<br/>·optional·")]:::store
A2 --> R
AN --> R
end
A1 -. "free" .-> GROQ["Groq 70B / 8B<br/>primary"]:::free
A2 -. "on 429 / error" .-> DS["DeepSeek<br/>cheap fallback"]:::paid
AN -. "opt-in · rare" .-> OAI["OpenAI<br/>off by default"]:::paid
A1 --> T["templates<br/>final safe fallback"]:::safe
classDef edge fill:#0f766e,stroke:#5eead4,color:#ecfeff
classDef app fill:#1d4ed8,stroke:#93c5fd,color:#eff6ff
classDef store fill:#9f1239,stroke:#fda4af,color:#fff1f2
classDef free fill:#15803d,stroke:#86efac,color:#f0fdf4
classDef paid fill:#a16207,stroke:#fde047,color:#fefce8
classDef safe fill:#6b21a8,stroke:#d8b4fe,color:#faf5ff
style VM fill:#0b1220,stroke:#38bdf8,color:#e0f2fe
Stateless app replicas scale horizontally behind Caddy; cache + rate-limit state lives in optional Redis (in-memory fallback if absent). Every external dependency degrades gracefully — Groq → DeepSeek → templates — so a valid request always returns a safe 200.
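The Groq → DeepSeek → templates degradation can be expressed as a simple ordered loop. The sketch below is an assumption about shape, not the gateway.py code: `write_with_fallback` and the `(name, writer)` tier pairs are hypothetical, and real writers would carry their own timeouts.

```python
from typing import Callable, Sequence, Tuple

Writer = Callable[[str], str]


def write_with_fallback(
    prompt: str,
    tiers: Sequence[Tuple[str, Writer]],  # e.g. [("groq", ...), ("deepseek", ...)]
    template: Writer,
) -> Tuple[str, str]:
    """Return (text, text_source), trying each tier in order."""
    for name, writer in tiers:
        try:
            text = writer(prompt)
        except Exception:
            continue  # timeout, 429, network error: move on to the next tier
        if isinstance(text, str) and text.strip():
            return text, name
    # The template tier is local and cannot fail on the network.
    return template(prompt), "template"
```

The second element of the result plays the role of the text_source value that appears in the decision log.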
Every request emits a single structured stdout log line capturing the why behind the decision, so judges can docker compose logs app | grep '"decision"' to reproduce any answer:
{
"ticket_id": "TKT-001",
"case_type": "wrong_transfer",
"evidence_verdict": "consistent",
"relevant_transaction_id": "TXN-9101",
"department": "dispute_resolution",
"severity": "high",
"human_review_required": true,
"match_flags": ["transaction_match"],
"signals": {
"amounts": [5000.0],
"phones": ["1712345678"],
"ids": [],
"time_hour": 14,
"day_offset": 0,
"language": "en"
},
"text_source": "template",
"rule_confidence": 0.9,
"llm_reclassified": false,
"safety_violations": [],
"injection_detected": false
}

Per-case decision counters (e.g. case_type.wrong_transfer, verdict.inconsistent, human_review.true) are exposed at GET /metrics for the monitoring tie-breaker.
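A minimal sketch of how one log line and the counters could be produced together follows. The function name, the module-level `METRICS` counter, and the outer "decision" wrapper key (chosen so that the grep above would match) are assumptions; the counter names follow the examples given for /metrics.

```python
import json
import sys
from collections import Counter
from typing import Any, Dict, Optional, TextIO

METRICS: Counter = Counter()  # hypothetical in-process store behind GET /metrics


def record_decision(decision: Dict[str, Any], stream: Optional[TextIO] = None) -> str:
    """Write one JSON log line to stdout and bump the per-case counters."""
    line = json.dumps({"decision": decision}, ensure_ascii=False)
    (stream or sys.stdout).write(line + "\n")

    METRICS[f"case_type.{decision['case_type']}"] += 1
    METRICS[f"verdict.{decision['evidence_verdict']}"] += 1
    METRICS[f"human_review.{str(decision['human_review_required']).lower()}"] += 1
    return line
```

Keeping the log to a single line per request is what makes it easy to filter with grep and parse with standard JSON tools.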
- Inputs are synthetic; no real customer/payment data or payment-API integration.
- relevant_transaction_id is always null or a real id from the provided history; it is never invented.
- Matching uses multiple grounded signals: exact amount (incl. Bangla numerals, commas, and scale words like 5k/২ লাখ), counterparty phone tail, referenced merchant/agent/biller id, transaction type, and time-of-day (e.g. "around 2pm" → the 14:0x transaction). When two same-amount transactions go to different recipients and nothing (time/id) separates them, the service returns insufficient_data with a clarifying reply rather than guessing, because guessing a dispute is penalised more than asking.
- Every decision emits a structured audit-trail log line (matched id, which signals fired, verdict, text source, any safety violation, injection-detected flag) for traceability and debugging under the judge harness; aggregate counts are exposed at /metrics.
- Groq free-tier rate limits can push some requests to DeepSeek/templates under heavy burst; this is by design and never an error.
- Bangla replies are guaranteed safe via templates even when the model's Bangla is imperfect.