A comprehensive, reproducible benchmark framework for evaluating LLM-powered SQL agents with multi-dimensional scoring, hallucination detection, and standardized A2A protocol support.
- Overview
- What Makes Text-2-SQL Agent Unique
- Architecture
- Quick Start
- Benchmark Design Quality
- Evaluation Methodology
- Error Category Metrics
- Docker Deployment
- Robust Error Handling & Logging
- Reproducibility
- Resource Requirements
- API Reference
- Innovation & Impact
Text-2-SQL Agent is a standardized SQL benchmark designed to evaluate AI agents' ability to generate correct, efficient, and safe SQL queries. Unlike simple pass/fail benchmarks, Text-2-SQL Agent provides:
- 7-Dimensional Scoring: Correctness, Efficiency, Safety, Completeness, Semantic Accuracy, Best Practices, Plan Quality
- Hallucination Detection: Pre-execution detection of phantom tables, columns, and invalid functions
- Error Category Analysis: Detailed breakdown of failure modes (schema errors, syntax errors, planning errors)
- Multi-Dialect Support: SQLite, DuckDB, PostgreSQL, BigQuery
- A2A Protocol: Standardized agent-to-agent communication for reproducible tournaments
Who it's for:
- AI Researchers: Evaluate and compare SQL generation models
- LLM Developers: Benchmark text-to-SQL capabilities
- Enterprise Teams: Assess agents for production readiness
- AgentBeats Platform Users: Run standardized tournaments
Traditional SQL benchmarks only check if queries execute and return correct results. Text-2-SQL Agent evaluates how the SQL was generated:
| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Correctness (35%) | Result matches expected output | Core functionality |
| Efficiency (15%) | Query execution time | Production readiness |
| Safety (20%) | No hallucinations, valid syntax | Reliability |
| Completeness (10%) | All expected data returned | Data quality |
| Semantic Accuracy (10%) | Values match, not just row counts | Precision |
| Best Practices (5%) | No SELECT *, proper JOINs | Code quality |
| Plan Quality (5%) | Efficient execution plan | Database optimization |
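For reference, the overall score is the weighted sum of the seven dimension scores. The sketch below is a minimal illustration; only the weights come from the table above, and the per-dimension inputs are hypothetical:

```python
# Minimal sketch: combine the seven dimension scores into an overall score.
# Weights are from the table above; the input scores are hypothetical.
WEIGHTS = {
    "correctness": 0.35, "efficiency": 0.15, "safety": 0.20,
    "completeness": 0.10, "semantic_accuracy": 0.10,
    "best_practices": 0.05, "plan_quality": 0.05,
}

def overall_score(scores: dict) -> float:
    """Weighted sum of per-dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)

print(overall_score({
    "correctness": 1.0, "efficiency": 0.8, "safety": 1.0, "completeness": 1.0,
    "semantic_accuracy": 0.9, "best_practices": 0.5, "plan_quality": 0.7,
}))  # ≈ 0.92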
Text-2-SQL Agent detects phantom identifiers before execution, preventing cryptic database errors:
Query: SELECT fake_column FROM nonexistent_table
┌─────────────────────────────────────────────────────────────┐
│ HALLUCINATION DETECTED │
├─────────────────────────────────────────────────────────────┤
│ ⚠ Phantom Table: nonexistent_table │
│ ⚠ Phantom Column: fake_column │
│ Hallucination Score: 0.0 (severe) │
└─────────────────────────────────────────────────────────────┘
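The detector itself is internal to the validation layer, but the core check can be sketched with sqlglot, the parser named in the architecture below: extract every table and column identifier from the AST and compare against a schema snapshot. The schema dict and return shape here are illustrative assumptions, not the project's API:

```python
import sqlglot
from sqlglot import exp

# Illustrative schema snapshot (not the project's API).
SCHEMA = {"customers": {"id", "name", "city"}, "orders": {"id", "customer_id", "total"}}

def detect_hallucinations(sql: str, dialect: str = "sqlite") -> dict:
    """Flag tables/columns referenced in the query but absent from the schema."""
    tree = sqlglot.parse_one(sql, read=dialect)
    known_columns = set().union(*SCHEMA.values())
    phantom_tables = {t.name for t in tree.find_all(exp.Table) if t.name not in SCHEMA}
    phantom_columns = {c.name for c in tree.find_all(exp.Column) if c.name not in known_columns}
    return {"phantom_tables": phantom_tables, "phantom_columns": phantom_columns}

print(detect_hallucinations("SELECT fake_column FROM nonexistent_table"))
# → {'phantom_tables': {'nonexistent_table'}, 'phantom_columns': {'fake_column'}}
```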
Inspired by research on SQL generation failures, Text-2-SQL Agent classifies errors into actionable categories:
┌─────────────────────────────────────────────────────────────┐
│ ERROR DISTRIBUTION │
├─────────────────────────────────────────────────────────────┤
│ Schema Errors (54.3%) │
│ ├── Wrong Schema Linking: 27.6% │
│ ├── Wrong Column: 16.6% │
│ └── Wrong Table: 10.1% │
│ │
│ Analysis Errors (60.7%) │
│ ├── Erroneous Data Analysis: 35.5% │
│ ├── Incorrect Planning: 17.7% │
│ └── Incorrect Data Calculation: 7.5% │
│ │
│ SQL Errors (37.5%) │
│ ├── Condition Filter Error: 11.5% │
│ ├── Dialect/Function Error: 10.3% │
│ ├── Join Error: 8.3% │
│ └── Syntax Error: 7.4% │
└─────────────────────────────────────────────────────────────┘
Text-2-SQL Agent implements the AgentBeats A2A protocol for reproducible agent tournaments:
┌─────────────────────────────────────────────────────────────┐
│ TOURNAMENT FLOW │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ Tasks ┌─────────────────────────┐ │
│ │ GREEN AGENT │ ───────────▶ │ PURPLE AGENTS │ │
│ │ (Evaluator) │ │ ┌─────────┐ ┌────────┐ │ │
│ │ │ ◀─────────── │ │ Agent A │ │Agent B │ │ │
│ │ Text-2-SQL │ SQL │ │ (GPT-4) │ │(Gemini)│ │ │
│ └─────────────┘ Responses │ └─────────┘ └────────┘ │ │
│ │ └─────────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ ARTIFACT │ │
│ │ Rankings, Scores, Error Metrics, Recommendations │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ TEXT-2-SQL AGENT EVALUATION SYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ SQL EXECUTOR (Multi-Dialect) │ │
│ │ ┌───────────┐ ┌───────────┐ ┌────────────┐ ┌───────────┐ │ │
│ │ │ SQLite │ │ DuckDB │ │ PostgreSQL │ │ BigQuery │ │ │
│ │ │ (default) │ │(analytics)│ │ (prod) │ │ (cloud) │ │ │
│ │ └───────────┘ └───────────┘ └────────────┘ └───────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ VALIDATION LAYER │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │ │
│ │ │ SQL Parser │ │ Hallucination │ │ Schema Validator │ │ │
│ │ │ • sqlglot AST │ │ Detector │ │ • Table existence │ │ │
│ │ │ • Multi-dialect │ │ • Phantom tables│ │ • Column validation │ │ │
│ │ │ • Transpilation │ │ • Phantom cols │ │ • Type checking │ │ │
│ │ └─────────────────┘ │ • Invalid funcs │ └─────────────────────┘ │ │
│ │ └─────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ ERROR CLASSIFIER │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │Schema Errors│ │Analysis Err │ │ SQL Errors │ │Other Errors │ │ │
│ │ │• Wrong table│ │• Bad plan │ │• Syntax │ │• Prompt len │ │ │
│ │ │• Wrong col │ │• Bad calc │ │• Join error │ │• Ext. knowl │ │ │
│ │ │• Bad linking│ │• Data error │ │• Filter err │ │ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ 7-DIMENSIONAL SCORING ENGINE │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Correctness │ │ Efficiency │ │ Safety │ │ Semantic │ │ │
│ │ │ 35% │ │ 15% │ │ 20% │ │ Accuracy │ │ │
│ │ │ │ │ Adaptive │ │ Weighted │ │ 10% │ │ │
│ │ └─────────────┘ │ Thresholds │ │ Hallucin. │ └─────────────┘ │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │Completeness │ │ Best │ │ Plan │ │ │
│ │ │ 10% │ │ Practices │ │ Quality │ │ │
│ │ │ │ │ 5% │ │ 5% │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
| Component | Purpose | Key Features |
|---|---|---|
| SQL Executor | Execute queries across databases | Connection pooling, timeout handling, error recovery |
| Validation Layer | Pre-execution analysis | AST parsing, schema validation, hallucination detection |
| Error Classifier | Categorize failures | Pattern matching, confidence scoring, actionable insights |
| Scoring Engine | Multi-dimensional evaluation | Weighted scoring, adaptive thresholds, presets |
| Artifact Builder | Generate reports | Rankings, metrics, visualizations |
# Clone repository
git clone https://github.com/ashcastelinocs124/text-2-sql-agent.git
cd text-2-sql-agent
# Create environment file
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY (for Gemini)
# Start all services (Green Agent + Purple Agent)
docker-compose -f docker-compose.agentbeats.yml up -d
# Verify services are running
curl http://localhost:8001/health # Green Agent (Evaluator)
curl http://localhost:8080/health # Purple Agent (SQL Generator)
# Run a tournament
curl -X POST http://localhost:8001/assess \
-H "Content-Type: application/json" \
-d '{
"participants": {
"gemini": "http://agentx-purple-gemini:8080"
},
"config": {
"task_count": 10,
"difficulty": ["easy", "medium"],
"scorer_preset": "default"
}
}'
# Stop services
docker-compose -f docker-compose.agentbeats.yml down
# Clone repository
git clone https://github.com/ashcastelinocs124/text-2-sql-agent.git
cd text-2-sql-agent
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run benchmark locally
python run_benchmark.py --output results/
# Run with specific difficulties
python run_benchmark.py --difficulty easy,medium,hard --output results/
from agentx import SQLExecutor, ExecutorConfig
# Create executor (SQLite, zero setup)
executor = SQLExecutor(ExecutorConfig(dialect="sqlite"))
# Execute and evaluate a query
result = executor.process_query("SELECT * FROM customers WHERE city = 'NYC'")
print(f"Status: {result.overall_status}")
print(f"Score: {result.scores.overall:.2%}")
print(f"Hallucination Score: {result.scores.hallucination_score:.2%}")
executor.close()
Text-2-SQL Agent provides 27+ SQL tasks across 4 difficulty levels:
| Difficulty | Tasks | Skills Tested |
|---|---|---|
| Easy (10) | Basic SELECT, WHERE, LIMIT, COUNT | Schema understanding, simple filters |
| Medium (10) | JOINs, Subqueries, GROUP BY, CASE | Multi-table reasoning, aggregation |
| Hard (4) | Window functions, CTEs, Ranking | Advanced SQL, analytical queries |
| Enterprise (30) | Star schema, SCD, Sessionization, Cohorts | Real-world data warehouse patterns |
Tasks are modeled on actual production SQL patterns:
-- Enterprise: Star Schema Analysis (Realistic data warehouse pattern)
SELECT dp.category, ds.region, SUM(sf.quantity * sf.unit_price) as revenue
FROM sales_fact sf
JOIN dim_product dp ON sf.product_id = dp.product_id
JOIN dim_store ds ON sf.store_id = ds.store_id
GROUP BY dp.category, ds.region
ORDER BY revenue DESC
-- Hard: Sessionization with 30-minute timeout (User behavior analytics)
WITH time_diffs AS (
SELECT user_id, event_timestamp,
CASE WHEN (julianday(event_timestamp) -
julianday(LAG(event_timestamp) OVER (...))) * 24 * 60 > 30
THEN 1 ELSE 0 END as new_session
FROM user_events
)
SELECT user_id, SUM(new_session) OVER (...) as session_id FROM time_diffs
| Capability | How Text-2-SQL Agent Tests It |
|---|---|
| Schema Understanding | Agents must correctly identify tables/columns from schema |
| Multi-step Reasoning | Complex queries require planning (CTEs, subqueries) |
| Dialect Awareness | Function validation per database dialect |
| Error Recovery | Hallucination detection before execution |
| Best Practices | Code quality scoring (avoiding anti-patterns) |
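On dialect awareness specifically, sqlglot's transpilation (listed under the SQL Parser in the architecture above) can rewrite a query between the supported dialects. This standalone snippet illustrates the mechanism, not the project's validator:

```python
import sqlglot

# Illustrative: transpile a SQLite query to other supported dialects.
# sqlglot.transpile returns a list with one rewritten statement per input.
sql = "SELECT name, strftime('%Y', created_at) AS year FROM customers LIMIT 5"
for target in ("duckdb", "postgres"):
    print(target, "=>", sqlglot.transpile(sql, read="sqlite", write=target)[0])
```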
Text-2-SQL Agent is designed to avoid tasks solvable by simple heuristics. Here's how our tasks compare:
| Aspect | ❌ Trivial Benchmark | ✅ Text-2-SQL Agent Task |
|---|---|---|
| Query | `SELECT * FROM users` | `SELECT name, total_spent FROM customers WHERE id IN (SELECT customer_id FROM orders GROUP BY customer_id HAVING SUM(total) > 500)` |
| Reasoning | Template matching | Multi-step planning required |
| Schema | Single table | 19-table enterprise schema with relationships |
| Evaluation | Binary pass/fail | 7-dimensional nuanced scoring |
| Error Insight | "Query failed" | "Wrong column: used user_name instead of name (schema_error)" |
| Heuristic | Why It Fails |
|---|---|
| "Just use SELECT *" | Best practices score penalizes SELECT * |
| "Match keywords to columns" | Enterprise schema has similar column names across tables |
| "Use first table mentioned" | Star schema requires correct fact-dimension joins |
| "Copy SQL patterns" | Sessionization, SCD queries require semantic understanding |
| "Ignore schema relationships" | Foreign key joins are required for correct results |
❌ Trivial (WikiSQL-style):
-- Question: "How many employees are there?"
SELECT COUNT(*) FROM employees
-- Solvable by: keyword matching "count" → COUNT(*)
✅ Text-2-SQL Agent Medium:
-- Question: "Find customers who have placed orders with a total greater than 100"
SELECT * FROM customers
WHERE id IN (SELECT customer_id FROM orders WHERE total > 100)
-- Requires: Understanding two tables, subquery planning, correct join logic
✅ Text-2-SQL Agent Enterprise:
-- Question: "Calculate cohort retention: for each monthly cohort,
-- show what percentage of customers made purchases in subsequent months"
WITH first_purchase AS (
SELECT customer_id, strftime('%Y-%m', MIN(order_date)) as cohort_month
FROM orders_fact GROUP BY customer_id
),
monthly_activity AS (
SELECT o.customer_id, f.cohort_month,
(strftime('%Y', o.order_date) - strftime('%Y', f.cohort_month || '-01')) * 12 +
(strftime('%m', o.order_date) - strftime('%m', f.cohort_month || '-01')) as months_since
FROM orders_fact o JOIN first_purchase f ON o.customer_id = f.customer_id
)
SELECT cohort_month, months_since,
COUNT(DISTINCT customer_id) as active_customers,
ROUND(COUNT(DISTINCT customer_id) * 100.0 /
FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (PARTITION BY cohort_month ORDER BY months_since), 2) as retention_rate
FROM monthly_activity
GROUP BY cohort_month, months_since
-- Requires: CTEs, window functions, date arithmetic, multi-step aggregation
| Difficulty | Avg Tables | Avg Joins | Uses CTEs | Uses Window Funcs | Subqueries |
|---|---|---|---|---|---|
| Easy | 1.0 | 0 | 0% | 0% | 10% |
| Medium | 1.8 | 0.8 | 20% | 0% | 40% |
| Hard | 2.5 | 1.5 | 60% | 80% | 30% |
| Enterprise | 3.2 | 2.4 | 70% | 60% | 50% |
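Statistics like these can be computed directly from the gold queries' ASTs. The sketch below (illustrative, not the project's tooling) counts the same structural features with sqlglot:

```python
import sqlglot
from sqlglot import exp

def complexity_stats(sql: str, dialect: str = "sqlite") -> dict:
    """Count the structural features tallied in the table above."""
    tree = sqlglot.parse_one(sql, read=dialect)
    return {
        "tables": len(list(tree.find_all(exp.Table))),   # CTE references count too
        "joins": len(list(tree.find_all(exp.Join))),
        "ctes": len(list(tree.find_all(exp.CTE))),
        "window_funcs": len(list(tree.find_all(exp.Window))),
        "subqueries": len(list(tree.find_all(exp.Subquery))),
    }

print(complexity_stats(
    "WITH t AS (SELECT customer_id FROM orders GROUP BY customer_id) "
    "SELECT c.name FROM customers c JOIN t ON c.id = t.customer_id"
))  # {'tables': 3, 'joins': 1, 'ctes': 1, 'window_funcs': 0, 'subqueries': 0}
```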
Each dimension has clear, objective criteria:
# Correctness: Exact result matching with tolerance
correctness_score = comparator.compare(
actual_results,
expected_results,
numeric_tolerance=1e-6,
ignore_row_order=True,
)
# Safety: Weighted hallucination severity
safety_score = 1.0 - (
0.4 * (1 if phantom_tables else 0) + # Severe
0.35 * (1 if phantom_columns else 0) + # Moderate
0.25 * (1 if invalid_functions else 0) # Minor
)
# Efficiency: Adaptive thresholds based on complexity
efficiency_score = score_execution_time(
time_ms=result.execution_time_ms,
thresholds=get_thresholds_for_complexity(query_complexity)
)
| Preset | Use Case | Weights Modified |
|---|---|---|
| default | Balanced evaluation | Standard weights |
| strict | Production safety | +Safety, +Correctness |
| performance | Query optimization | +Efficiency, +Plan Quality |
| quality | Code review | +Best Practices |
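One plausible implementation of presets, sketched under assumed bump values (not the shipped weights): add to the base weights, then renormalize so they still sum to 1.

```python
# Hypothetical sketch of scorer presets: bump selected weights, then renormalize.
BASE = {"correctness": 0.35, "efficiency": 0.15, "safety": 0.20, "completeness": 0.10,
        "semantic_accuracy": 0.10, "best_practices": 0.05, "plan_quality": 0.05}

PRESET_BUMPS = {  # bump values assumed for illustration
    "default": {},
    "strict": {"safety": 0.10, "correctness": 0.10},
    "performance": {"efficiency": 0.10, "plan_quality": 0.05},
    "quality": {"best_practices": 0.10},
}

def preset_weights(preset: str) -> dict:
    raw = {k: v + PRESET_BUMPS[preset].get(k, 0.0) for k, v in BASE.items()}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}  # renormalize to sum to 1

print(preset_weights("strict")["safety"])  # ≈ 0.25, i.e. (0.20 + 0.10) / 1.20
```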
All evaluation is fully automated with no manual intervention (a runnable miniature of the pipeline follows this list):
- Parse SQL using sqlglot AST
- Validate against schema snapshot
- Detect hallucinations before execution
- Execute with timeout and error handling
- Compare results with expected output
- Score across all 7 dimensions
- Classify any errors into categories
- Generate artifact with metrics
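The miniature below chains those steps end to end with sqlite3 and sqlglot. It is illustrative only: the schema, sample data, and the partial scoring are assumptions, not the project's internals.

```python
import sqlite3
import sqlglot
from sqlglot import exp

# Runnable miniature of the pipeline above: parse → validate → execute →
# compare → score. Schema, data, and scoring slice are illustrative.
SCHEMA = {"customers": {"id", "name", "city"}}

def evaluate(sql: str, expected: list) -> dict:
    tree = sqlglot.parse_one(sql, read="sqlite")                         # step 1
    phantoms = {t.name for t in tree.find_all(exp.Table)} - set(SCHEMA)  # steps 2-3
    if phantoms:
        return {"status": "hallucination", "phantom_tables": phantoms, "score": 0.0}
    conn = sqlite3.connect(":memory:")                                   # step 4
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'NYC')")
    rows = conn.execute(sql).fetchall()
    conn.close()
    correct = sorted(rows) == sorted(expected)                           # step 5
    score = 0.35 * correct + 0.20  # steps 6-8: correctness + safety slice only
    return {"status": "ok" if correct else "wrong_result", "score": score}

print(evaluate("SELECT name FROM customers WHERE city = 'NYC'", [("Ada",)]))
# {'status': 'ok', 'score': 0.55}
```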
Text-2-SQL Agent classifies errors to provide actionable debugging insights:
| Category | Subcategory | Description |
|---|---|---|
| Schema Error | `wrong_table` | Table doesn't exist in schema |
| Schema Error | `wrong_column` | Column doesn't exist in table |
| Schema Error | `wrong_schema_linking` | Incorrect foreign key / alias usage |
| Analysis Error | `erroneous_data_analysis` | Logic error in query |
| Analysis Error | `incorrect_planning` | Missing GROUP BY, wrong aggregation |
| Analysis Error | `incorrect_data_calculation` | Division by zero, numeric overflow |
| SQL Error | `syntax_error` | Invalid SQL syntax |
| SQL Error | `join_error` | Ambiguous/invalid JOIN conditions |
| SQL Error | `condition_filter_error` | Invalid WHERE/HAVING clauses |
| SQL Error | `dialect_function_error` | Function not supported in dialect |
| Other | `excessive_prompt_length` | Context limit exceeded |
| Other | `misunderstanding_external_knowledge` | Domain knowledge error |
{
"error_metrics_summary": {
"total_tasks": 27,
"successful_tasks": 19,
"failed_tasks": 8,
"success_rate": 70.4,
"category_percentages": {
"schema_error": 62.5,
"sql_error": 25.0,
"analysis_error": 12.5
},
"subcategory_percentages": {
"wrong_column": 37.5,
"wrong_table": 25.0,
"syntax_error": 25.0,
"incorrect_planning": 12.5
}
}
}
| Image | Description | Size |
|---|---|---|
| `ghcr.io/ashcastelinocs124/agentx-green` | SQL Benchmark Evaluator (A2A Protocol) | ~600MB |
| `ghcr.io/ashcastelinocs124/agentx-purple` | LLM SQL Generator | ~400MB |
# docker-compose.agentbeats.yml
services:
agentx-green:
image: ghcr.io/ashcastelinocs124/agentx-green:latest
ports:
- "9009:9009"
command: ["--host", "0.0.0.0", "--port", "9009", "--dialect", "sqlite"]
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:9009/.well-known/agent.json')"]
interval: 10s
timeout: 5s
retries: 3
agentx-purple-gemini:
image: ghcr.io/ashcastelinocs124/agentx-purple:latest
ports:
- "8080:8080"
environment:
- GOOGLE_API_KEY=${GOOGLE_API_KEY}
command: ["--host", "0.0.0.0", "--port", "8080", "--llm", "gemini"]
healthcheck:
test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8080/.well-known/agent.json')"]
interval: 10s
timeout: 5s
      retries: 3
# Build Green Agent (A2A Protocol Compatible)
docker build --platform linux/amd64 -f docker/Dockerfile.green -t agentx-green .
# Build Purple Agent
docker build --platform linux/amd64 -f docker/Dockerfile.purple -t agentx-purple .
# Multi-platform build for GHCR
docker buildx build --platform linux/amd64 -f docker/Dockerfile.green -t ghcr.io/ashcastelinocs124/agentx-green:latest --push .
docker buildx build --platform linux/amd64 -f docker/Dockerfile.purple -t ghcr.io/ashcastelinocs124/agentx-purple:latest --push .
Green Agent:
- A2A Protocol Support: Now implements the standard A2A protocol for AgentBeats compatibility
- New Entrypoint: Uses `entrypoint_green_a2a.py` with `A2AStarletteApplication`
- Port Change: Default port changed from 8001 to 9009
- Dependencies: Added `a2a-sdk[http-server]>=0.3.20` and `uvicorn>=0.38.0`
- Health Check: Uses Python urllib instead of curl (compatible with slim images)
Purple Agent:
- Agent Card Support: Added `/.well-known/agent-card.json` endpoint for AgentBeats
- Health Check: Uses Python urllib for compatibility
- Multi-LLM Support: Gemini (default) and OpenAI via environment variables
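A deployed agent can be checked from the host the same way the container healthcheck does, by fetching its agent card with urllib. The host and port here assume the compose mapping above (Green Agent on 9009):

```python
import json
import urllib.request

# Fetch the A2A agent card, mirroring the container healthcheck command.
# Host/port assume the docker-compose mapping above (Green Agent on 9009).
with urllib.request.urlopen("http://localhost:9009/.well-known/agent.json", timeout=5) as resp:
    card = json.load(resp)

print(card.get("name"), card.get("version"))  # available fields depend on the card
```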
Text-2-SQL Agent implements production-grade resilience patterns to ensure reliable operation during agent tournaments and benchmarks.
The circuit breaker prevents cascading failures when agents become unavailable:
from a2a.resilience import CircuitBreaker, CircuitState
# Circuit breaker automatically opens after 3 failures
breaker = CircuitBreaker(
failure_threshold=3, # Open after 3 consecutive failures
recovery_timeout=30.0, # Wait 30s before testing recovery
half_open_max_calls=1, # Allow 1 test call in half-open state
)
# Usage in agent communication
if breaker.can_execute():
try:
result = await send_task_to_agent(endpoint, task)
breaker.record_success()
except Exception:
breaker.record_failure()
# After 3 failures, circuit opens → fast-fail for 30s
else:
    raise CircuitOpenError(f"Agent {endpoint} unavailable")
Circuit States:
┌─────────┐ 3 failures ┌─────────┐ 30s timeout ┌───────────┐
│ CLOSED │ ───────────▶ │ OPEN │ ────────────▶ │ HALF-OPEN │
│(normal) │ │(reject) │ │ (test) │
└─────────┘ └─────────┘ └───────────┘
▲ │
│ success │
└───────────────────────────────────────────────────┘
from a2a.resilience import ResilientHTTPClient
# HTTP client with automatic retry and circuit breaker
client = ResilientHTTPClient(
circuit_failure_threshold=3,
circuit_recovery_timeout=30.0,
)
# Automatically retries with exponential backoff (1s, 2s, 4s)
response = await client.request(
"POST",
"http://agent:8080/generate",
operation_type="sql_generation", # 60s timeout for LLM calls
json={"question": "...", "schema": {...}}
)
Timeout Configuration:
| Operation Type | Default Timeout | Rationale |
|---|---|---|
| `health_check` | 5s | Quick liveness/readiness probes |
| `sql_generation` | 60s | LLM generation can be slow |
| `schema_fetch` | 10s | Database schema operations |
| `default` | 30s | Standard operations |
Kubernetes-compatible liveness and readiness probes:
from a2a.health import HealthChecker, HealthStatus
# Initialize health checker with agent and executor
checker = HealthChecker(
agent=green_agent,
executor=sql_executor,
version="1.0.0",
)
# Liveness probe (quick, <100ms)
liveness = await checker.check_liveness()
# Returns: {"status": "healthy", "checks": [{"name": "process", "status": "pass"}]}
# Readiness probe (thorough)
readiness = await checker.check_readiness()
# Checks: tasks_loaded, database, llm_client, custom checks
Health Response Example:
{
"status": "healthy",
"timestamp": "2025-01-15T01:25:00Z",
"version": "1.0.0",
"checks": [
{"name": "tasks_loaded", "status": "pass", "duration_ms": 0.5, "message": "27 tasks loaded"},
{"name": "database", "status": "pass", "duration_ms": 1.2, "message": "Database accessible"},
{"name": "llm_client", "status": "pass", "duration_ms": 0.3, "message": "LLM client configured (gemini)"}
]
}
| Error Type | Recovery Strategy |
|---|---|
| Network Timeout | Retry 3x with exponential backoff |
| Agent Unavailable | Circuit breaker → fast-fail → auto-recover |
| Database Error | Log, mark task failed, continue evaluation |
| LLM API Error | Retry with backoff, degrade gracefully |
| Invalid SQL | Classify error, score as 0, provide feedback |
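The "retry 3x with exponential backoff" entry in the table is the standard doubling pattern; a generic sketch (not the ResilientHTTPClient internals) looks like this:

```python
import time

def retry_with_backoff(operation, retries: int = 3, base_delay: float = 1.0):
    """Call operation; on failure retry up to `retries` times, sleeping 1s, 2s, 4s."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Usage (hypothetical callable): retry_with_backoff(lambda: fetch_agent_card(endpoint))
```

When diagnosing these recovery paths, the logging configuration below raises verbosity on the relevant components.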
import logging
# Enable detailed logging for debugging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
# Component-specific log levels
logging.getLogger('a2a.server').setLevel(logging.DEBUG) # API requests
logging.getLogger('a2a.resilience').setLevel(logging.INFO) # Circuit breaker events
logging.getLogger('a2a.health').setLevel(logging.DEBUG) # Health checks
logging.getLogger('agentx.executor').setLevel(logging.INFO)  # SQL execution
Sample Log Output:
2025-01-15 01:25:00,123 - a2a.server - INFO - Starting assessment abc123 with 2 participants
2025-01-15 01:25:00,456 - a2a.resilience - INFO - Circuit CLOSED for agent1:8080
2025-01-15 01:25:01,789 - agentx.executor - INFO - Executed query in 2.3ms, 5 rows returned
2025-01-15 01:25:02,012 - a2a.resilience - WARN - Failure 1/3 for agent2:8081
2025-01-15 01:25:05,345 - a2a.resilience - WARN - Circuit OPEN for agent2:8081 (3 failures)
2025-01-15 01:25:35,678 - a2a.resilience - INFO - Circuit HALF-OPEN for agent2:8081, testing...
Text-2-SQL Agent ensures reproducible benchmark runs through:
| Feature | Implementation |
|---|---|
| Deterministic Task Order | Tasks loaded from JSON in fixed order |
| Fixed Seed Data | Same sample data for all runs |
| Schema Snapshots | Captured before evaluation |
| Versioned Tasks | Task definitions in version control |
| Same-Tasks Mode | All agents receive identical tasks in tournaments |
# Same tasks file, same difficulty, same output
python run_benchmark.py \
--tasks tasks/gold_queries/sqlite/basic_queries.json \
--difficulty easy,medium \
--output results/
# Results will be identical across runs (for same agent)
# Configure tournament in scenario.toml
[scenario]
name = "SQL Benchmark Tournament"
[config]
same_tasks = true # All agents get same tasks
difficulty = ["easy", "medium"]
task_count = 10
scorer_preset = "default"| Resource | Green Agent | Purple Agent |
|---|---|---|
| CPU | 1 core | 1 core |
| Memory | 512MB | 1GB |
| Disk | 200MB | 200MB |
| Time/Task | ~100ms | ~2-5s (LLM dependent) |
| Resource | Recommendation |
|---|---|
| CPU | 2 cores |
| Memory | 2GB |
| Timeout | 30s per query |
| Concurrent Agents | Up to 5 in parallel |
┌─────────────────────────────────────────────────────────────┐
│ BENCHMARK: 27 tasks, SQLite, default scorer │
├─────────────────────────────────────────────────────────────┤
│ Total Time: 0.08 seconds │
│ Tasks/Second: 337.5 │
│ Avg Execution Time: 0.02ms per query │
│ Memory Peak: 150MB │
└─────────────────────────────────────────────────────────────┘
Green Agent (Evaluator):
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/schema` | Get database schema |
| `GET` | `/.well-known/agent.json` | A2A agent descriptor |
| `POST` | `/assess` | Run tournament assessment |
Purple Agent (SQL Generator):
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/.well-known/agent.json` | A2A agent descriptor |
| `POST` | `/generate` | Generate SQL from question |
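A direct call to the Purple Agent's `/generate` endpoint might look like the sketch below. The payload keys mirror the question/schema shape used in the resilience example; the `sql` response field is an assumption:

```python
import json
import urllib.request

# Ask the Purple Agent (SQL generator) for a query; payload keys mirror the
# question/schema shape used earlier. The "sql" response field is assumed.
payload = json.dumps({
    "question": "How many customers are in NYC?",
    "schema": {"customers": ["id", "name", "city"]},
}).encode()

req = urllib.request.Request(
    "http://localhost:8080/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=60) as resp:
    print(json.load(resp).get("sql"))
```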
POST /assess
{
"participants": {
"agent_id": "http://agent-endpoint:port"
},
"config": {
"task_count": 10,
"difficulty": ["easy", "medium", "hard"],
"scorer_preset": "default",
"same_tasks": true,
"parallel_evaluation": true
}
}
Response:
{
"status": "completed",
"artifact": {
"assessment_id": "abc123",
"rankings": [
{"rank": 1, "participant_id": "agent1", "overall_score": 0.92}
],
"participants": {
"agent1": {
"scores": {
"overall": 0.92,
"correctness": 0.95,
"efficiency": 0.88,
"safety": 1.0
}
}
},
"error_metrics_summary": {
"category_percentages": {"schema_error": 50.0},
"subcategory_percentages": {"wrong_column": 50.0}
}
}
}
- Multi-Dimensional SQL Scoring: First benchmark to evaluate 7 dimensions beyond correctness
- Pre-Execution Hallucination Detection: Catches phantom identifiers before database errors
- Error Category Classification: Research-backed categorization of SQL generation failures
- A2A Protocol Integration: Standardized tournament interface for agent comparison
| Gap in Existing Benchmarks | Text-2-SQL Agent Solution |
|---|---|
| Binary pass/fail only | 7-dimensional nuanced scoring |
| No hallucination tracking | Pre-execution phantom detection |
| Cryptic database errors | Actionable error categorization |
| No code quality assessment | Best practices dimension |
| Single dialect focus | Multi-dialect support |
- Model Comparison: Benchmark GPT-4 vs Gemini vs Claude on SQL generation
- Regression Testing: Track model quality over versions
- Production Readiness: Assess safety and efficiency for deployment
- Research: Analyze error patterns in SQL generation
| Benchmark | Focus | Text-2-SQL Agent Complement |
|---|---|---|
| Spider | Schema understanding | + Safety & efficiency scoring |
| BIRD | Complex queries | + Hallucination detection |
| WikiSQL | Simple queries | + Enterprise patterns |
MIT License - See LICENSE for details.
Contributions welcome! Please see our contribution guidelines.
If you use Text-2-SQL Agent in your research, please cite:
@software{text2sqlagent2025,
title = {Text-2-SQL Agent: Enterprise SQL Benchmark for A2A Agents},
author = {Dalmia, Keshav},
year = {2025},
url = {https://github.com/ashcastelinocs124/text-2-sql-agent}
}