A production-grade Text-to-SQL Agent that lets non-technical stakeholders (finance, operations) query complex supply chain databases using natural language, simulate multi-variable financial scenarios, and instantly visualize results, all without writing SQL.
Achieves 91.7% accuracy on a 60-query benchmark across 7 difficulty tiers with 100% schema faithfulness and 99.1% retrieval recall.
Built on the TPC-H Benchmark dataset with custom schema modifications to test real-world ambiguity handling.
- Natural Language Interface: Translate questions like "Show me top 5 suppliers by revenue in Europe" into optimized DuckDB SQL with 5-table joins.
- Hybrid Search Retriever: BM25 + Semantic Search (ChromaDB) with 99.1% table recall; maps vague user terms to specific database columns.
- Rule-Based Clarifier: Detects ambiguous queries (e.g., "show me revenue" → by region? by nation?) and prompts users to clarify before making an API call.
- Self-Healing Execution Loop: 3-attempt retry with error feedback that autonomously fixes SQL syntax errors and regenerates queries.
- Multi-Variable Scenarios: Handle combinations like "What if we increase price by 5% AND reduce discount by 3% for the EUROPE region?"
- Sensitivity Analysis: Automatically tests multiple levels (5%, 10%, 15%, 20%) and plots impact trends.
- Region/Segment Scoping: Simulate changes scoped to specific regions, nations, or customer segments using CASE WHEN in CTEs.
- Scenario Presets: 8 one-click preset scenarios in the sidebar (Pricing, Discounts, Multi-Variable, Sensitivity).
- Dynamic Context Tracking: Presets automatically adapt to your conversation context: if you've been querying ASIA, the presets update to use ASIA. Zero API cost (rule-based regex extraction).
- Scenario History & Comparison: Store up to 10 simulation results per session and compare them side-by-side with bar charts and data tables.
- Non-Destructive: All simulations run via read-only CTEs β zero risk to production data.
- Intelligent Detection: Distinguishes between normal query results and simulation outputs based on column patterns (original_value, simulated_value, difference, pct_change).
- Three Visualization Modes:
  - Sensitivity Analysis: Data table + scenario bar chart + impact trend line
  - Grouped Comparison: Side-by-side bar chart (original vs simulated per group)
  - Single Comparison: Metric cards with delta indicators (Original → Simulated → Impact)
- Standard Queries: Auto-detects time-series → line chart, categorical → bar chart, or renders data tables.
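The column-pattern detection described above can be sketched in a few lines. This is an illustrative stand-in, not the project's actual code; the function name and heuristics are assumptions, and only the simulation column names (`original_value`, `simulated_value`, `difference`, `pct_change`) come from the README itself:

```python
# Hypothetical sketch: pick a visualization mode from result column names.
# The heuristics here are illustrative, not the project's actual logic.
SIM_COLS = {"original_value", "simulated_value", "difference", "pct_change"}

def pick_viz_mode(columns: list[str], n_rows: int) -> str:
    cols = set(columns)
    if SIM_COLS <= cols:                     # simulation output detected
        if "scenario" in cols:               # multiple tested levels present
            return "sensitivity_analysis"
        return "grouped_comparison" if n_rows > 1 else "single_comparison"
    if any("date" in c or "year" in c or "month" in c for c in cols):
        return "line_chart"                  # time-series → line chart
    return "bar_chart"                       # categorical → bar chart

# Example: a per-region simulation result renders as a grouped comparison.
mode = pick_viz_mode(["region", "original_value", "simulated_value",
                      "difference", "pct_change"], n_rows=5)
```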
- Read-Only Access: DuckDB runs in `read_only=True` mode.
- Dual-Layer SQL Guardrails: `sqlparse` statement-type validation + keyword blocklist (DROP, DELETE, INSERT, UPDATE, ALTER, TRUNCATE).
- Hallucination Control: Queries referencing non-existent columns (e.g., "profit", "net margin") are rejected with explanations.
- 100% safety refusal rate on the evaluation (5/5 destructive queries blocked).
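The dual-layer idea is simple to sketch. The project uses `sqlparse` for statement-type validation; this simplified stand-in approximates that layer with a first-keyword check, so everything below is illustrative rather than the actual implementation:

```python
import re

# Keywords the README lists as blocked.
BLOCKLIST = {"DROP", "DELETE", "INSERT", "UPDATE", "ALTER", "TRUNCATE"}

def is_safe_sql(sql: str) -> bool:
    """Layer 1: only SELECT/WITH statements are allowed.
    Layer 2: no blocklisted keyword may appear anywhere in the query."""
    tokens = re.findall(r"[A-Za-z_]+", sql.upper())
    if not tokens or tokens[0] not in {"SELECT", "WITH"}:
        return False
    return BLOCKLIST.isdisjoint(tokens)

assert is_safe_sql("SELECT * FROM customer LIMIT 5")
assert not is_safe_sql("DROP TABLE customer")
assert not is_safe_sql("WITH x AS (SELECT 1) DELETE FROM orders")
```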
- Live metrics panel in the sidebar tracking total queries, cache hit rate, average latency, per-stage timing (retriever, LLM, DB execution), and estimated token usage.
- 60-query benchmark across 7 difficulty tiers (simple select, single join, aggregation, multi-hop, window functions, simulation, safety).
- Result-set comparison against gold SQL: not string matching but actual data comparison, with exact, approximate, fuzzy column, and lenient matching.
- Retrieval evaluation measuring Precision@k, Recall@k, and MRR for the hybrid retriever.
- Faithfulness check validating all generated SQL references only real tables, columns, and valid foreign key joins.
- Failure categorization: every failed query is classified (wrong_result, syntax_error, wrong_tables, empty_result, etc.).
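The core of result-set comparison (as opposed to string matching) might look like the sketch below. It covers only the exact and approximate modes; the fuzzy-column and lenient modes are omitted, and the function name and tolerance are assumptions, not the eval framework's actual code:

```python
import math

def rows_match(pred: list[tuple], gold: list[tuple], tol: float = 1e-6) -> bool:
    """Illustrative result-set comparison: order-insensitive, with
    approximate matching for floats. Not the project's actual eval code."""
    if len(pred) != len(gold):
        return False
    # Canonical ordering so row order does not affect the comparison.
    for p, g in zip(sorted(pred, key=repr), sorted(gold, key=repr)):
        if len(p) != len(g):
            return False
        for a, b in zip(p, g):
            if isinstance(a, float) or isinstance(b, float):
                if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
                    return False
            elif a != b:
                return False
    return True

assert rows_match([(1, 10.0)], [(1, 10.0000001)])   # approximate float match
assert not rows_match([("ASIA", 5)], [("ASIA", 6)])  # wrong value
```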
| Component | Technology |
|---|---|
| LLM | Anthropic Claude Sonnet 4 |
| Orchestration | LangChain |
| Database | DuckDB (OLAP-optimized) |
| Frontend | Streamlit |
| Vector Store | ChromaDB (local persistence) |
| Embeddings | HuggingFace all-MiniLM-L6-v2 (runs on CPU) |
| Containerization | Docker + Docker Compose |
| CI/CD | GitHub Actions → GHCR |
| Testing | pytest + custom eval framework |
```
AI_SQL_Agent/
├── app.py                     # Streamlit frontend (chat, viz, presets, history)
├── Dockerfile                 # Container definition
├── docker-compose.yml         # One-command deployment
├── docker-entrypoint.sh       # Auto-setup: generates DB + vector store if missing
├── requirements.txt           # Pinned dependencies
├── .env                       # API keys (not committed)
│
├── src/
│   ├── agent_graph.py         # Core agent: LLM loop, caching, simulation, safety
│   ├── retriever.py           # Hybrid retriever: BM25 + ChromaDB ensemble
│   ├── clarifier.py           # Rule-based query ambiguity detection
│   └── metrics.py             # Observability: latency, cache hits, token tracking
│
├── eval/
│   ├── benchmark.json         # 60 labeled test queries across 7 tiers
│   ├── accuracy_eval.py       # Result-set comparison with gold SQL
│   ├── retrieval_eval.py      # Precision@k, Recall@k, MRR for retriever
│   ├── faithfulness_check.py  # Schema validation of generated SQL
│   └── results/
│       ├── accuracy_report.json      # Latest accuracy results
│       ├── retrieval_report.json     # Latest retrieval results
│       └── faithfulness_report.json  # Latest faithfulness results
│
├── tests/
│   └── test_agent.py          # pytest unit & integration tests (25 tests)
│
├── scripts/
│   ├── demo_db.py             # Generate TPC-H demo database (SF=0.1)
│   ├── rename_data.py         # Add custom columns to full DB
│   └── check_models.py        # List available API models
│
├── data/
│   ├── sql_examples.json      # Few-shot examples for retriever
│   ├── chroma_db/             # Persisted vector store
│   ├── supply_chain.db        # Full TPC-H database (SF=1)
│   └── sql_agent_demo.db      # Demo database (SF=0.1)
│
└── .github/
    └── workflows/
        └── docker-publish.yml # Auto-build and push image to GHCR
```
Run with Docker:

```bash
git clone https://github.com/ShahaDeven/AI_SQL_Agent.git
cd AI_SQL_Agent

# Add your API key
echo "ANTHROPIC_API_KEY=your_key_here" > .env
# OR: echo "GOOGLE_API_KEY=your_key_here" > .env
# OR: echo "OPENAI_API_KEY=your_key_here" > .env

# Run (auto-generates demo DB on first run)
docker compose up --build

# Open http://localhost:8501
```

Or run locally:

```bash
git clone https://github.com/ShahaDeven/AI_SQL_Agent.git
cd AI_SQL_Agent
python -m venv venv
# Windows: venv\Scripts\activate
# Mac/Linux: source venv/bin/activate
pip install -r requirements.txt

# Add your API key
echo "ANTHROPIC_API_KEY=your_key_here" > .env
# OR: echo "GOOGLE_API_KEY=your_key_here" > .env
# OR: echo "OPENAI_API_KEY=your_key_here" > .env

# Generate demo database (first time only)
python scripts/demo_db.py

# Setup vector store (first time only)
python -c "from src.retriever import setup_vector_db; setup_vector_db()"

# Run
streamlit run app.py
```

| Provider | API Key | Model Used |
|---|---|---|
| Anthropic | `ANTHROPIC_API_KEY` | Claude Sonnet 4 |
| Google | `GOOGLE_API_KEY` | Gemini 2.5 Flash |
| OpenAI | `OPENAI_API_KEY` | GPT-4o Mini |
The agent auto-detects which API key is set. Only one is required.
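Auto-detection of the provider can be as simple as checking environment variables in order. This is a hedged sketch under assumptions, not the project's actual logic; in particular, the priority order and function name are illustrative:

```python
import os

# Priority order is an assumption for illustration only.
PROVIDERS = [
    ("ANTHROPIC_API_KEY", "anthropic"),
    ("GOOGLE_API_KEY", "google"),
    ("OPENAI_API_KEY", "openai"),
]

def detect_provider(env=None) -> str:
    """Return the first provider whose API key is set."""
    env = os.environ if env is None else env
    for var, name in PROVIDERS:
        if env.get(var):
            return name
    raise RuntimeError("No API key found; set one of "
                       + ", ".join(v for v, _ in PROVIDERS))

# Example with an explicit mapping instead of the real environment:
assert detect_provider({"GOOGLE_API_KEY": "xyz"}) == "google"
```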
Basic Analytics:
"What is the total revenue per region?"
"List the top 3 customers in the AUTOMOBILE segment."
Complex Reasoning (Multi-Hop):
"Who is the supplier with the most parts in the region with the lowest revenue?"
What-If Simulation:
"What if we increased the discount by 10%? How would that affect total revenue?"
Multi-Variable + Scoped:
"What if we increased the price by 5% AND reduced the discount by 3% for the EUROPE region?"
Sensitivity Analysis:
"How sensitive is total revenue to discount changes? Test 5%, 10%, 15%, and 20%."
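The read-only CTE mechanism behind these what-if queries can be demonstrated end to end. The sketch below uses the stdlib `sqlite3` in place of DuckDB so it is self-contained; the table and column names mirror TPC-H's `lineitem`, the data is made up, and "increase the discount by 10%" is interpreted here as adding 10 percentage points:

```python
import sqlite3

# Tiny in-memory stand-in for the TPC-H lineitem table (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (extendedprice REAL, discount REAL)")
conn.executemany("INSERT INTO lineitem VALUES (?, ?)",
                 [(100.0, 0.05), (200.0, 0.10)])

# Read-only what-if: raise discount by 10 percentage points inside a CTE,
# leaving the base table untouched.
sql = """
WITH simulated AS (
    SELECT extendedprice,
           discount,
           MIN(discount + 0.10, 1.0) AS sim_discount
    FROM lineitem
)
SELECT SUM(extendedprice * (1 - discount))     AS original_value,
       SUM(extendedprice * (1 - sim_discount)) AS simulated_value
FROM simulated
"""
original, simulated = conn.execute(sql).fetchone()
print(original, simulated)  # 275.0 245.0
```

Region or segment scoping works the same way, except the CTE wraps the adjustment in a `CASE WHEN` on the region column so only matching rows are modified.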
Evaluated on a 60-query benchmark across 7 difficulty tiers with result-set comparison against gold SQL.
| Tier | Queries | Accuracy |
|---|---|---|
| Simple Select | 10 | 100% |
| Single Join | 10 | 100% |
| Aggregation | 10 | 100% |
| Multi-Hop Reasoning | 10 | 90% |
| Window Functions | 10 | 70% |
| Simulation (What-If) | 5 | 80% |
| Safety Refusals | 5 | 100% |
| Overall | 60 | 91.7% |
| Metric | Score |
|---|---|
| Table Recall@3 | 99.1% |
| Column Recall@3 | 85.9% |
| MRR | 0.94 |
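For reference, Mean Reciprocal Rank averages 1/rank of the first relevant item over all queries. A small illustrative computation with made-up retrieval results (not taken from the eval framework):

```python
def mrr(ranked_results: list[list[str]], gold: list[set[str]]) -> float:
    """MRR = mean over queries of 1 / rank of the first relevant item,
    contributing 0 when no relevant item is retrieved."""
    total = 0.0
    for retrieved, relevant in zip(ranked_results, gold):
        for rank, item in enumerate(retrieved, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Two toy queries: first hit at rank 1, second at rank 2 → (1 + 0.5) / 2
assert mrr([["orders", "customer"], ["part", "supplier"]],
           [{"orders"}, {"supplier"}]) == 0.75
```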
| Metric | Score |
|---|---|
| Faithfulness Rate | 100% (55/55) |
| Hallucinated Tables | 0 |
| Hallucinated Columns | 0 |
| Invalid Joins | 0 |
| Metric | Value |
|---|---|
| Avg Latency | 0.67s |
| Safety Refusal Rate | 100% (5/5) |
| Unit Tests | 25/25 passed |
```bash
# Run evaluations yourself
pytest tests/test_agent.py -v        # Unit tests
python eval/accuracy_eval.py         # 60-query accuracy benchmark
python eval/retrieval_eval.py        # Retrieval precision/recall
python eval/faithfulness_check.py    # Schema faithfulness validation
```

The application is fully containerized with an auto-setup entrypoint that generates the TPC-H database and ChromaDB vector store on first run.
```bash
# Build and run
docker compose up --build

# Or pull the pre-built image
docker pull ghcr.io/shahadeven/ai_sql_agent:latest
```

GitHub Actions CI/CD: Every push to `main` automatically builds and publishes a Docker image to GitHub Container Registry (GHCR).
- Core Text-to-SQL Agent with hybrid retrieval
- Self-healing execution pipeline with 3-attempt retry
- What-If scenario simulator with CTE injection
- Auto-visualizer with pattern detection
- Rule-based query clarification system
- Observability dashboard (latency, cache, tokens)
- Multi-variable scenarios + sensitivity analysis
- Dynamic scenario presets with context tracking
- Scenario history & comparison
- Docker containerization with auto-setup entrypoint
- GitHub Actions CI/CD → GHCR
- 60-query evaluation framework (accuracy + retrieval + faithfulness)
- Live demo deployment (Streamlit Cloud / HF Spaces)
- CI-integrated eval (fail build if accuracy < 90%)
