This directory contains the reproducible benchmark harness and fixtures for measuring Isartor's deflection rate and latency characteristics.
Three benchmark tracks are available:
| Track | Script | Use case |
|---|---|---|
| General harness | `run.py` | FAQ / mixed-workload deflection rate via `/api/chat` |
| Claude Code 3-way | `claude_code_benchmark.py` | Baseline vs cold vs warm comparison for Claude Code agent sessions via `/v1/messages` |
| Claude Code + Copilot | `claude_code_benchmark.py` | With-vs-without-Isartor comparison for Claude Code agent sessions via `/v1/messages` |
# 1. Start the Isartor server (requires a running instance)
# Configure a known API key used for the X-API-Key header (defaults to "changeme").
docker run -p 8080:8080 \
-e ISARTOR__GATEWAY_API_KEY=changeme \
ghcr.io/isartor-ai/isartor:latest
# or (from source):
ISARTOR__GATEWAY_API_KEY=changeme cargo run --release
# 2. Ensure the benchmark harness uses the same API key (for the X-API-Key header)
export ISARTOR_API_KEY=changeme
# 3. Run both built-in fixtures and save results (make shortcut)
make benchmark
# 4. Run without a server (dry-run / smoke-test)
make benchmark-dry-run
# 5. Run a single fixture with a custom request limit
python3 benchmarks/run.py \
--url http://localhost:8080 \
--input benchmarks/fixtures/faq_loop.jsonl \
--requests 500
# 6. Generate the with/without-Isartor ROI report (live data)
make report
# 7. Generate the ROI report offline using simulated data
make report-dry-run
# 8. Run the Claude Code / Qwen 2.5 Coder 7B Layer 2 benchmark
# (requires the Qwen sidecar stack: cd docker && docker compose -f docker-compose.qwen-benchmark.yml up --build)
make benchmark-qwen

Prerequisites:

- Python 3.10+ (uses the built-in `urllib` module — no `pip install` needed)
- A running Isartor instance (or use `--dry-run` for offline validation)
| File | Prompts | Description |
|---|---|---|
| `fixtures/faq_loop.jsonl` | 1,000 | Simulates a repetitive FAQ / agent-loop workload. Covers returns, shipping, billing, account management, and more — all with semantic rephrasings. Designed to stress L1a (exact cache) and L1b (semantic cache). |
| `fixtures/diverse_tasks.jsonl` | 500 | Genuine variety: code generation, summarisation, Q&A, data extraction, and creative writing. Represents a realistic mixed workload with lower deflection than the FAQ loop corpus. |
| `fixtures/claude_code_todo_app.jsonl` | 20 | Deterministic Claude Code workload — exact prompts for building a TypeScript todo app in six phases. Used by the Claude Code benchmark harness. |
Each file is in JSONL format — one JSON object per line:
{"prompt": "What is your return policy?"}
{"prompt": "How do returns work?"}The claude_code_todo_app.jsonl fixture carries extra metadata fields:
{"phase": "scaffold", "step": 1, "prompt": "Initialize a new TypeScript Node.js project…"}
{"phase": "backend", "step": 4, "prompt": "Define the Todo TypeScript interface…"}The claude_code_benchmark.py harness provides a deterministic, phase-based
benchmark for measuring how Isartor deflects Claude Code traffic when an AI
agent implements a TypeScript todo app end-to-end.
| Phase | Steps | Description |
|---|---|---|
| `scaffold` | 1–3 | package.json, tsconfig.json, directory structure |
| `backend` | 4–9 | TypeScript types, in-memory store, Express CRUD routes, middleware, app, server |
| `frontend` | 10–12 | index.html, app.js (vanilla JS), styles.css |
| `tests` | 13–15 | Jest config, unit tests, integration tests (supertest) |
| `docker` | 16–18 | Multi-stage Dockerfile, docker-compose.yml, .env.example, .gitignore |
| `polish` | 19–20 | README.md, final review prompt |
| Scenario | Description | Expected Deflection |
|---|---|---|
| `baseline` | Control path — prompts go straight to L3 (cloud), no Isartor cache. Documents the raw LLM cost. | ~0% |
| `cold` | First pass through Isartor with an empty cache. Prompts go mostly to L3; the cache is seeded. | ~3% (L2 only) |
| `warm` | Second pass through Isartor with a seeded cache. L1a exact hits dominate because the same prompts are replayed. | ~82% |
The warm-cache deflection rate is high because all 20 prompts are identical between the cold and warm runs — Isartor's L1a exact cache uses a SHA-256 hash of the full prompt and returns sub-millisecond responses on replay.
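For reference, a minimal sketch of that exact-match keying (illustrative only; Isartor's real implementation may normalise the prompt first):

```python
import hashlib

def exact_cache_key(prompt: str) -> str:
    # SHA-256 over the full prompt text: identical prompts map to the same key,
    # so replaying the same fixture against a warm cache yields guaranteed L1a hits.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

# The same prompt always produces the same key; one changed character produces a new one.
assert exact_cache_key("How do returns work?") == exact_cache_key("How do returns work?")
assert exact_cache_key("How do returns work?") != exact_cache_key("How do returns work")
```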
# All three scenarios — dry-run (no server required):
make benchmark-claude-code-dry-run
# All three scenarios against a live Isartor instance:
ISARTOR_URL=http://localhost:8080 \
ISARTOR_API_KEY=changeme \
make benchmark-claude-code
# Single scenario:
python3 benchmarks/claude_code_benchmark.py --scenario warm \
--url http://localhost:8080 --api-key changeme
# Post results to a GitHub issue:
GH_TOKEN=ghp_xxx \
GH_REPO=isartor-ai/Isartor \
GH_ISSUE=42 \
python3 benchmarks/claude_code_benchmark.py --all-scenarios \
--url http://localhost:8080 --api-key changeme
# Capture Isartor /health + /debug/stats snapshots before and after each run:
python3 benchmarks/claude_code_benchmark.py --all-scenarios --snapshot \
--url http://localhost:8080 --api-key changeme

All CLI flags also honour environment variables:
| Env variable | CLI flag | Default |
|---|---|---|
| `ISARTOR_URL` | `--url` | `http://localhost:8080` |
| `ISARTOR_API_KEY` | `--api-key` | `changeme` |
| `GH_TOKEN` | `--gh-token` | (empty) |
| `GH_REPO` | `--repo` | (empty) |
| `GH_ISSUE` | `--issue` | (empty) |
| `ISARTOR_TIMEOUT` | `--timeout` | 120 |
Loaded 20 tasks from claude_code_todo_app.jsonl
Phases: backend, docker, frontend, polish, scaffold, tests
════════════════════════════════════════════════════════════
Running scenario: COLD
════════════════════════════════════════════════════════════
─────────────────────────────────────────────────────────
Scenario : COLD
Tasks : 20
L1a (exact) : 0 (0.0%)
L1b (semantic) : 0 (0.0%)
L2 (SLM) : 1 (5.0%)
L3 (cloud) : 19 (95.0%)
Errors : 0
Deflection : 5.0%
P50 latency : 820.0 ms
...
════════════════════════════════════════════════════════════
Running scenario: WARM
════════════════════════════════════════════════════════════
Scenario : WARM
Tasks : 20
L1a (exact) : 16 (80.0%)
L1b (semantic) : 2 (10.0%)
L2 (SLM) : 1 (5.0%)
L3 (cloud) : 1 (5.0%)
Errors : 0
Deflection : 95.0%
P50 latency : 0.4 ms
...
The validate_todo_app.py script checks whether a generated todo app meets
the acceptance criteria from the benchmark task packet.
# Static checks only (no compiler / runner needed):
python3 benchmarks/validate_todo_app.py --app-dir ./output/todo-app
# Static checks + TypeScript compilation + Jest test suite:
python3 benchmarks/validate_todo_app.py --app-dir ./output/todo-app \
--compile --run-tests
# With API smoke tests against a running app:
python3 benchmarks/validate_todo_app.py --app-dir ./output/todo-app \
--api-url http://localhost:3000
# Machine-readable JSON output:
python3 benchmarks/validate_todo_app.py --app-dir ./output/todo-app \
--json-output ./validation_results.json
# Make shortcut:
APP_DIR=./output/todo-app make validate-todo-app
APP_DIR=./output/todo-app VALIDATE_ARGS="--compile --run-tests" make validate-todo-app

Checks performed:
| Category | Checks |
|---|---|
| File presence | 18 required + 3 optional files |
| `package.json` | Valid JSON · expected scripts · dependencies · devDependencies |
| `tsconfig.json` | Valid JSON · strict mode · outDir set |
| Source structure | Todo type definition · store exports · router methods · CORS · /health |
| Test files | Unit tests cover all store functions · integration tests cover all endpoints |
| Frontend | HTML structure · JS fetch/CRUD calls · styles |
| Docker | Multi-stage Dockerfile · docker-compose services/ports |
| TypeScript compile | tsc --noEmit (optional, requires npm install) |
| Jest suite | npm test (optional) |
| API smoke tests | All CRUD endpoints + 400/404 error cases (optional) |
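As an illustration of the static checks, a minimal sketch of a package.json check (the required script names here are assumptions, not the validator's exact list):

```python
import json
from pathlib import Path

REQUIRED_SCRIPTS = {"build", "start", "test"}  # assumed list, for illustration only

def check_package_json(app_dir: str) -> list[str]:
    """Return human-readable failures for the package.json checks."""
    pkg_path = Path(app_dir) / "package.json"
    if not pkg_path.exists():
        return ["package.json is missing"]
    try:
        pkg = json.loads(pkg_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"package.json is not valid JSON: {exc}"]
    failures = []
    missing = REQUIRED_SCRIPTS - set(pkg.get("scripts", {}))
    if missing:
        failures.append(f"missing scripts: {', '.join(sorted(missing))}")
    if not pkg.get("dependencies"):
        failures.append("no dependencies declared")
    return failures
```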
usage: run.py [-h] [--url URL] [--api-key KEY] [--input INPUT]
[--requests REQUESTS] [--all] [--output OUTPUT] [--dry-run]
Isartor Benchmark Harness
options:
-h, --help show this help message and exit
--url URL Base URL of the running Isartor instance
(default: $ISARTOR_URL or http://localhost:8080)
--api-key KEY Value for the X-API-Key header; must match the server's
gateway_api_key setting
(default: $ISARTOR_API_KEY or 'changeme')
--input INPUT Path to a JSONL fixture file to benchmark
--requests REQUESTS Limit the number of prompts to send (0 = all)
--all Run all built-in fixtures and write results
--output OUTPUT Path for the results JSON file
(default: benchmarks/results/latest.json)
--dry-run Simulate responses locally (no server required).
Useful for CI validation and smoke-testing.
Both --url and --api-key honour environment variables:
ISARTOR_URL=http://localhost:3000 \
ISARTOR_API_KEY=mysecret \
python3 benchmarks/run.py --all

-- faq_loop --
Total requests : 1000
L1a (exact) : 412 (41.2%)
L1b (semantic) : 189 (18.9%)
L2 (SLM) : 0 ( 0.0%)
L3 (cloud) : 399 (39.9%)
Errors : 0
Deflection rate: 60.1%
P50 latency : 0.3 ms
P95 latency : 820.0 ms
P99 latency : 950.0 ms
Cost saved : $0.1503 ($0.000150/req)
### faq_loop
| Layer | Hits | % of Traffic | Avg Latency (p50) |
|--------------------|--------|--------------|-------------------|
| L1a (exact) | 412 | 41.2% | 0.3 ms |
| L1b (semantic) | 189 | 18.9% | 3.1 ms |
| L2 (SLM) | 0 | 0.0% | - |
| L3 (cloud) | 399 | 39.9% | 820.0 ms |
| **Total deflected**| **601** | **60.1%** | |
| **Cost saved** | | | **$0.000150/req** |
> Overall latency — P50: 0.3 ms | P95: 820.0 ms | P99: 950.0 ms
Note: Overall P50 is sub-millisecond because >60% of requests are served from cache (L1a/L1b). P95 and P99 reflect cloud latency once those percentiles fall into the L3 (cloud) bucket.

Cost formula: 601 deflected × 50 tokens × $0.000005/token = $0.1503 total saved. Divided across all 1,000 requests → $0.1503 ÷ 1000 = $0.000150/req (the per-request figure shown in the table).
- L1a (exact) — request matched an exact (SHA-256) cache entry
- L1b (semantic) — request matched a semantically similar cached entry (cosine similarity)
- L2 (SLM) — resolved by the local Small Language Model without a cloud call
- L3 (cloud) — forwarded to the configured external LLM provider
- Deflection rate — percentage of requests resolved by L1a + L1b + L2 (cloud cost avoided)
- Cost saved — estimated USD saved per request using the gpt-4o input token price (`$0.000005/token × avg_prompt_tokens × deflected_requests`)
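Expressed as code, the deflection-rate definition is simply (a trivial sketch using the hit counts printed by the harness):

```python
def deflection_rate(l1a_hits: int, l1b_hits: int, l2_hits: int, total_requests: int) -> float:
    # Fraction of requests resolved locally, i.e. without an L3 cloud call.
    return (l1a_hits + l1b_hits + l2_hits) / total_requests

print(f"{deflection_rate(412, 189, 0, 1000):.1%}")  # the faq_loop sample above: 60.1%
```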
Every response from Isartor includes two observability headers:
| Header | Values | Description |
|---|---|---|
| `X-Isartor-Layer` | `l1a`, `l1b`, `l2`, `l3` | Which layer resolved the request |
| `X-Isartor-Deflected` | `true`, `false` | Whether the cloud call was avoided |
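A quick way to inspect these headers is to send one request with the Python standard library and read them off the response. The /api/chat payload below is an assumption; adjust it to your deployment's request schema:

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/api/chat",
    data=json.dumps({"prompt": "What is your return policy?"}).encode("utf-8"),
    headers={"X-API-Key": "changeme", "Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("X-Isartor-Layer"))      # e.g. "l1a"
    print(resp.headers.get("X-Isartor-Deflected"))  # "true" or "false"
```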
The harness estimates cloud cost saved using the following formula:
tokens_saved = avg_prompt_tokens × (L1a_hits + L1b_hits + L2_hits)
cost_saved = tokens_saved × 0.000005 # gpt-4o input rate, USD/token
cost_per_req = cost_saved / total_requests
- `avg_prompt_tokens` defaults to 50 (a conservative estimate for typical FAQ / agent traffic).
- The gpt-4o input rate of `$0.000005/token` is the public OpenAI pricing as of the benchmark baseline.
- Only input tokens are counted; output tokens are not estimated.
The Claude Code benchmark uses avg_prompt_tokens = 120 and the Claude Sonnet
input rate of $0.000003/token to reflect the longer, more detailed prompts
typical of agentic coding workflows.
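The same estimate as a runnable sketch (defaults as documented above; pass the Claude Code values to reproduce that variant):

```python
def estimate_cost_saved(l1a_hits: int, l1b_hits: int, l2_hits: int, total_requests: int,
                        avg_prompt_tokens: int = 50,
                        usd_per_token: float = 0.000005) -> dict:
    # Only input tokens are counted, matching the harness methodology above.
    tokens_saved = avg_prompt_tokens * (l1a_hits + l1b_hits + l2_hits)
    cost_saved = tokens_saved * usd_per_token
    return {
        "tokens_saved": tokens_saved,
        "cost_saved_usd": round(cost_saved, 4),
        "cost_per_req_usd": round(cost_saved / total_requests, 6),
    }

# faq_loop sample above (601 deflected requests out of 1,000):
print(estimate_cost_saved(412, 189, 0, 1000))
# Claude Code variant: longer prompts, Claude Sonnet input rate:
print(estimate_cost_saved(16, 2, 1, 20, avg_prompt_tokens=120, usd_per_token=0.000003))
```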
After --all is used, results are written to benchmarks/results/latest.json:
{
"timestamp": "2025-01-15T10:23:00Z",
"isartor_version": "0.1.0",
"hardware": "4-core x86_64, 8 GB RAM, no GPU",
"fixtures": {
"faq_loop": {
"total_requests": 1000,
"deflection_rate": 0.712,
"l1a_hits": 423,
"l1b_hits": 214,
"l2_hits": 75,
"l3_hits": 288,
"l1a_rate": 0.423,
"l1b_rate": 0.214,
"l2_rate": 0.075,
"l3_rate": 0.288,
"error_count": 0,
"p50_ms": 0.4,
"p95_ms": 820.0,
"p99_ms": 950.0,
"l1a_p50_ms": 0.35,
"l1b_p50_ms": 3.1,
"l2_p50_ms": 130.0,
"l3_p50_ms": 820.0,
"tokens_saved": 35600,
"cost_saved_usd": 0.178,
"cost_per_req_usd": 0.000178
},
"diverse_tasks": {
"total_requests": 500,
"deflection_rate": 0.38,
"l1a_hits": 90,
"l1b_hits": 100,
"l2_hits": 0,
"l3_hits": 310,
"l1a_rate": 0.18,
"l1b_rate": 0.20,
"l2_rate": 0.0,
"l3_rate": 0.62,
"error_count": 0,
"p50_ms": 820.0,
"p95_ms": 1050.0,
"p99_ms": 1200.0,
"l1a_p50_ms": 0.35,
"l1b_p50_ms": 3.2,
"l2_p50_ms": null,
"l3_p50_ms": 820.0,
"tokens_saved": 9500,
"cost_saved_usd": 0.0475,
"cost_per_req_usd": 0.000095
}
}
}

The Claude Code benchmark writes per-scenario files and an aggregated summary:
benchmarks/results/claude_code_cold_20260101T120000Z.json
benchmarks/results/claude_code_warm_20260101T120015Z.json
benchmarks/results/claude_code_summary.json
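A minimal sketch of consuming the aggregated latest.json artifact programmatically (field names as in the example above):

```python
import json

# Read the aggregated results file and print a one-line summary per fixture.
with open("benchmarks/results/latest.json") as fh:
    results = json.load(fh)

for name, stats in results["fixtures"].items():
    print(f"{name}: {stats['deflection_rate']:.1%} deflected, "
          f"p50 {stats['p50_ms']} ms, saved ${stats['cost_saved_usd']:.4f}")
```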
Any engineer can reproduce results in under 10 minutes:
# Clone and build
git clone https://github.com/isartor-ai/Isartor.git && cd Isartor
cargo build --release
# Start Isartor with default settings (exact + semantic cache enabled).
# The gateway_api_key defaults to "changeme"; the harness will use the
# same value via $ISARTOR_API_KEY so authentication passes automatically.
# For L3 (cloud) requests, configure an Azure OpenAI backend:
ISARTOR__CACHE_MODE=both \
ISARTOR__LLM_PROVIDER=azure \
ISARTOR__EXTERNAL_LLM_URL=https://<resource>.openai.azure.com \
ISARTOR__EXTERNAL_LLM_API_KEY=<your-azure-key> \
ISARTOR__AZURE_DEPLOYMENT_ID=gpt-4o-mini \
ISARTOR__AZURE_API_VERSION=2024-08-01-preview \
./target/release/isartor &
sleep 5 # wait for the server to start
# Run the full benchmark suite (ISARTOR_API_KEY defaults to 'changeme')
make benchmark
# or equivalently:
# ISARTOR_API_KEY=changeme python3 benchmarks/run.py --url http://localhost:8080 --all

If you have configured a custom API key, export it before running:
export ISARTOR__GATEWAY_API_KEY=your-secret-key # server
export ISARTOR_API_KEY=your-secret-key # harness
make benchmark

Hardware: 4-core CPU, 8 GB RAM, no GPU. Results will vary based on hardware and whether L2 SLM inference is enabled.
The .github/workflows/benchmark.yml workflow runs automatically on every PR
targeting main. It:
- Builds Isartor from source.
- Starts the server and waits for it to become healthy.
- Runs both fixture corpora through the harness.
- Posts a formatted result table as a PR comment (updated on subsequent pushes).
- Uploads `benchmarks/results/ci_run.json` as a workflow artifact.
A validate-harness job also runs in dry-run mode on every push to confirm
the harness itself is functioning correctly without requiring a live server.
The CI workflow routes L3 (cloud) requests through Azure OpenAI. Add the following secret to your repository (Settings → Secrets and variables → Actions → New repository secret):
| Secret name | Value |
|---|---|
| `AZURE_OPENAI_API_KEY` | Your Azure OpenAI key |
Without this secret the server cannot reach the Azure backend and L3 requests will return 502 errors. L1a/L1b cache hits are unaffected.
Additional fixtures used by the Claude Code, Copilot, and Qwen Layer 2 benchmarks:

| File | Prompts | Description |
|---|---|---|
| `fixtures/claude_code_todo_app.jsonl` | 58 | Deterministic TypeScript todo-app coding workload for the Claude Code three-way benchmark. Includes unique implementation prompts, semantic variants, and exact repeats. |
| `fixtures/claude_code_tasks.jsonl` | 388 | Realistic Claude Code / Copilot workload: coding questions, algorithm explanations, Rust/Go/Python/TypeScript patterns, DevOps tasks, and architecture concepts. Designed to stress Layer 2 (SLM) with a Qwen 2.5 Coder 7B sidecar. Run via `make benchmark-qwen`. |
| `fixtures/claude_code_todo.jsonl` | 105 | Realistic Claude Code session prompts for building a React TypeScript todo application. Covers component scaffolding, custom hooks, tests, routing, and tooling. Designed to model the repetitive, cache-friendly patterns of an AI-assisted coding workflow. |
The .github/workflows/roi-report.yml workflow generates the full ROI report:
- Runs a dry-run benchmark (or uses existing results).
- Generates `benchmarks/results/roi_report.json` and `benchmarks/results/roi_report.md`.
- Uploads both files as workflow artifacts.
- Posts the Markdown report to the configured GitHub issue.
To trigger the ROI report manually:
gh workflow run roi-report.yml \
--field issue_number=<your-issue-number> \
--field dry_run=false
benchmarks/claude_code_benchmark.py runs a three-way comparison benchmark
that quantifies the ROI of routing Claude Code through Isartor.
- Baseline — without Isartor: every prompt goes directly to the cloud LLM. No local deflection. All latency is cloud-round-trip latency.
- Isartor cold cache: first pass through Isartor with an empty cache. Novel prompts fall through to L2 (Qwen) or L3 (cloud). Only exact duplicate prompts within the run hit L1a.
- Isartor warm cache: second pass with the cache populated from the cold run. Previously-seen prompts are deflected locally at L1a or L1b.
The report.py script produces a full with-vs-without-Isartor comparison from
existing benchmark data.
# From live benchmark results:
make report
# Offline (dry-run simulation):
make report-dry-run
# From a specific results file:
python3 benchmarks/report.py --input benchmarks/results/ci_run.json

Outputs:
| File | Description |
|---|---|
| `benchmarks/results/roi_report.json` | Machine-readable artifact (schema v1) |
| `benchmarks/results/roi_report.md` | Human-readable Markdown report |
The report covers:
- With vs without comparison — cloud token usage, cost, and latency for each scenario
- L1/L2/L3 layer breakdown — hit counts, rates, and per-layer p50 latencies
- Token distribution — separate input and output token estimates per layer
- Cost reduction — estimated USD savings based on public gpt-4o pricing
- Latency delta — P50 latency improvement from cache deflection
- Error / interruption resilience — deflected requests are immune to cloud outages
- L2 SLM justification — when the local SLM sidecar adds value vs falls through
- Methodology and assumptions — all estimates clearly labelled
benchmarks/claude_code_benchmark.py runs a dedicated comparison benchmark that
quantifies the ROI of routing Claude Code through Isartor compared to a direct
cloud path.
- Case A — without Isartor: every prompt goes directly to the cloud LLM provider (Anthropic API or GitHub Copilot-backed endpoint). All latency is cloud-round-trip latency and all tokens consume cloud quota.
- Case B — with Isartor (Qwen 2.5 Coder 7B as L2): prompts route through the full Isartor deflection stack: L1a exact cache → L1b semantic cache → L2 Qwen (llama.cpp) → L3 cloud.
Reported metrics for each case:
| Metric | Description |
|---|---|
| L1a / L1b / L2 / L3 hits | Requests resolved at each layer |
| Deflection rate | % of requests that avoided cloud |
| P50 / P95 / P99 latency | Overall and per-layer latencies |
| Est. cloud tokens avoided | Input + output tokens saved by deflection |
| Est. cost saving | USD reduction using Claude 3.5 Sonnet pricing |
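As an illustration of the cost metric, a sketch of the per-case cloud-cost estimate. The Sonnet input rate matches the figure documented earlier; the output rate and per-request token counts are assumptions:

```python
def estimate_cloud_cost(requests_to_cloud: int,
                        avg_input_tokens: int = 120,     # documented Claude Code default
                        avg_output_tokens: int = 300,    # assumption
                        input_rate: float = 0.000003,    # Claude Sonnet input, USD/token
                        output_rate: float = 0.000015) -> float:  # assumption
    """Estimated cloud spend for the requests that were not deflected."""
    per_request = avg_input_tokens * input_rate + avg_output_tokens * output_rate
    return requests_to_cloud * per_request

case_a = estimate_cloud_cost(58)   # without Isartor: every request reaches the cloud
case_b = estimate_cloud_cost(12)   # with Isartor: only the L3 share (warm sample below)
print(f"estimated saving: ${case_a - case_b:.4f}")
```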
# Dry-run — no server needed, CI-safe, deterministic output:
make benchmark-claude-code-dry-run
# Live three-way benchmark against a running Isartor instance:
ISARTOR_URL=http://localhost:8080 ISARTOR_API_KEY=changeme \
make benchmark-claude-code
# Full end-to-end with auto-start (requires Qwen GGUF + llama-server):
GITHUB_TOKEN=ghp_... \
./scripts/run_claude_code_benchmark.sh

# Claude Code + Copilot comparison (dry-run):
make benchmark-claude-copilot-dry-run
# Case B only against a live Isartor instance:
python3 benchmarks/claude_code_benchmark.py --case B \
--isartor-url http://localhost:8080
# Full comparison with a real Anthropic API key (Case A) and live Isartor (Case B):
ANTHROPIC_API_KEY=sk-ant-... \
python3 benchmarks/claude_code_benchmark.py --compare \
--isartor-url http://localhost:8080 \
--api-key changeme
# Full end-to-end orchestration (downloads model, starts sidecar + Isartor):
GITHUB_TOKEN=ghp_... \
./scripts/run_claude_code_benchmark.sh --compare \
--start-llama-server \
--start-isartor

# 1. Download the Qwen 2.5 Coder 7B GGUF (~4.7 GB):
./scripts/download_qwen_gguf.sh
# 2. Start llama-server:
llama-server \
--model models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--host 127.0.0.1 --port 8090 \
--ctx-size 4096 --n-predict 512
# 3. Start Isartor with Qwen as Layer 2:
ISARTOR__ENABLE_SLM_ROUTER=true \
ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8090/v1 \
./target/release/isartor up

The .github/workflows/claude-code-benchmark.yml workflow runs the three-way
benchmark on demand and posts progress + final results as GitHub issue comments.
Trigger:
gh workflow run claude-code-benchmark.yml \
-f issue_number=<N> \
-f dry_run=true

Issue comment sequence:
- 🚀 Benchmark started
- ⚙️ Environment setup complete (live mode)
- ✅ Baseline run completed
- ✅ Isartor cold cache run completed
- ✅ Isartor warm cache run completed
- 📊 Final results table
| File | Description |
|---|---|
| `results/claude_code_benchmark.json` | Machine-readable three-way results |
| `results/claude_code_benchmark_report.md` | Human-readable Markdown report |
usage: claude_code_benchmark.py [-h] [--three-way] [--scenario {baseline,cold,warm}]
[--dry-run] [--isartor-url URL] [--api-key KEY]
[--direct-url URL] [--direct-api-key KEY]
[--input FILE] [--requests N]
[--output FILE] [--report FILE]
Options:
--three-way Run all three scenarios and generate a comparison report
--scenario {baseline,cold,warm}
Run a single scenario
--dry-run Simulate responses locally — no server required (CI-safe)
--isartor-url URL Isartor base URL (default: $ISARTOR_URL or http://localhost:8080)
--api-key KEY Isartor X-API-Key (default: $ISARTOR_API_KEY or 'changeme')
--direct-url URL Direct API URL for baseline (default: $ANTHROPIC_BASE_URL)
--direct-api-key KEY API key for baseline (default: $ANTHROPIC_API_KEY)
--input FILE JSONL fixture file (default: fixtures/claude_code_todo_app.jsonl)
--requests N Limit number of prompts per scenario (0 = all)
--output FILE JSON results file (default: results/claude_code_benchmark.json)
--report FILE Markdown report file (default: results/claude_code_benchmark_report.md)
-- Baseline — without Isartor --
Total requests : 58
L3 (cloud) : 58 (100.0%)
Deflection rate: 0.0% (no local deflection — every request hits cloud)
P50 latency : 1421.0 ms
Est. cloud cost: $0.3132 ($0.005400/req)
-- Isartor cold cache — with Qwen L2 --
Total requests : 58
L1a (exact) : 6 (10.3%)
L1b (semantic) : 3 ( 5.2%)
L2 (Qwen) : 6 (10.3%)
L3 (cloud) : 43 (74.1%)
Deflection rate: 25.9%
P50 latency : 1454.8 ms
Est. cloud cost: $0.2322 ($0.004003/req)
-- Isartor warm cache — with Qwen L2 --
Total requests : 58
L1a (exact) : 21 (36.2%)
L1b (semantic) : 9 (15.5%)
L2 (Qwen) : 5 ( 8.6%)
L3 (cloud) : 23 (39.7%)
Deflection rate: 60.3%
P50 latency : 7.4 ms
Est. cloud cost: $0.1242 ($0.002141/req)
The harness enforces these criteria and exits non-zero if any fail:
| Criterion | Threshold |
|---|---|
| Warm deflection rate | ≥ 60 % |
| Cold deflection rate | ≥ 10 % |
| Error rate (any scenario) | < 5 % |
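A minimal sketch of such an acceptance gate (thresholds as in the table; the result field names are assumptions):

```python
import sys

MIN_WARM_DEFLECTION = 0.60
MIN_COLD_DEFLECTION = 0.10
MAX_ERROR_RATE = 0.05

def enforce_acceptance(results: dict) -> None:
    """Exit non-zero if any criterion fails; 'results' is keyed by scenario name."""
    failures = []
    if results["warm"]["deflection_rate"] < MIN_WARM_DEFLECTION:
        failures.append("warm deflection below 60%")
    if results["cold"]["deflection_rate"] < MIN_COLD_DEFLECTION:
        failures.append("cold deflection below 10%")
    for name, scenario in results.items():
        if scenario["error_rate"] >= MAX_ERROR_RATE:
            failures.append(f"{name}: error rate at or above 5%")
    if failures:
        print("Acceptance criteria failed: " + "; ".join(failures))
        sys.exit(1)
```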
docker/docker-compose.qwen-sidecar.yml defines the Layer 2 sidecar:
- Model: `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` (Q4_K_M, ~4.4 GB)
- Served by: `ghcr.io/ggml-org/llama.cpp:server`
- Port: 8081 (OpenAI-compatible `/v1/chat/completions`)
- Environment overrides: `QWEN_CTX_SIZE`, `QWEN_N_GPU_LAYERS`, `QWEN_N_THREADS`, `QWEN_PORT`
Smoke test:
curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-coder-7b","messages":[{"role":"user","content":"hi"}],"max_tokens":16}'benchmarks/roi_report.py reads benchmarks/results/claude_code_latest.json
and emits:
- `benchmarks/results/claude_code_roi_report.md` — human-readable report
- `benchmarks/results/claude_code_roi_<timestamp>.json` — machine-readable artifact
Key assumptions (edit roi_report.py to match your environment):
| Assumption | Default |
|---|---|
| gpt-4o input price | $0.000005 / token |
| gpt-4o output price | $0.000015 / token |
| Avg input tokens / request | 75 |
| Avg output tokens / response | 300 |
| Monthly request volume | 50,000 |
| Isartor self-hosting cost | $50 / month |
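Using these defaults, the monthly ROI arithmetic looks roughly like this (the deflection rate is a placeholder; substitute your measured value):

```python
# Defaults from the assumptions table above.
INPUT_PRICE = 0.000005     # gpt-4o input, USD/token
OUTPUT_PRICE = 0.000015    # gpt-4o output, USD/token
AVG_INPUT_TOKENS = 75
AVG_OUTPUT_TOKENS = 300
MONTHLY_REQUESTS = 50_000
SELF_HOSTING_COST = 50.0   # USD/month

def monthly_net_savings(deflection_rate: float) -> float:
    cost_per_request = AVG_INPUT_TOKENS * INPUT_PRICE + AVG_OUTPUT_TOKENS * OUTPUT_PRICE
    avoided_cloud_spend = MONTHLY_REQUESTS * deflection_rate * cost_per_request
    return avoided_cloud_spend - SELF_HOSTING_COST

print(f"${monthly_net_savings(0.60):,.2f}/month net at 60% deflection")
```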
usage: claude_code_benchmark.py [-h] [--case {A,B}] [--compare] [--dry-run]
[--isartor-url URL] [--api-key KEY]
[--direct-url URL] [--direct-api-key KEY]
[--input FILE] [--requests N]
[--output FILE] [--report FILE]
Options:
--case {A,B} Run a single case: A (without Isartor) or B (with Isartor)
--compare Run both cases and generate a comparison report
--dry-run Simulate responses locally — no server needed (CI-safe)
--isartor-url URL Isartor base URL (default: $ISARTOR_URL or http://localhost:8080)
--api-key KEY Isartor X-API-Key (default: $ISARTOR_API_KEY or 'changeme')
--direct-url URL Direct API URL for Case A (default: $ANTHROPIC_BASE_URL)
--direct-api-key KEY API key for Case A (default: $ANTHROPIC_API_KEY)
--input FILE JSONL fixture file (default: fixtures/claude_code_todo_app.jsonl)
--requests N Limit number of prompts (0 = all)
--output FILE JSON results file (default: results/claude_code_copilot.json)
--report FILE Markdown report file (default: results/claude_code_copilot_report.md)
| File | Description |
|---|---|
| `results/claude_code_copilot.json` | Machine-readable results (both cases) |
| `results/claude_code_copilot_report.md` | Human-readable Markdown comparison report |
-- Case A — without Isartor --
Total requests : 58
L3 (cloud) : 58 (100.0%)
Deflection rate: 0.0% (no local deflection — every request hits cloud)
P50 latency : 1408.5 ms
Est. cloud cost: $0.3132 ($0.005400/req)
-- Case B — with Isartor (Qwen L2) --
Total requests : 58
L1a (exact) : 20 (34.5%)
L1b (semantic) : 15 (25.9%)
L2 (Qwen) : 11 (19.0%)
L3 (cloud) : 12 (20.7%)
Deflection rate: 79.3%
P50 latency : 5.9 ms
Est. cloud cost: $0.0648 ($0.001117/req)
Note on Case A control path: Claude Code's native behavior routes requests through GitHub Copilot's infrastructure, which does not expose per-request layer or deflection metadata. The Case A baseline therefore uses a direct Anthropic API call (or a simulated all-L3 distribution in dry-run mode) as the nearest defensible control. This is documented explicitly in the benchmark report so comparisons are reproducible and transparent.
The scenario runner replays claude_code_todo_app.jsonl prompts through two code-generation scenarios:
- Baseline: Claude Code + Copilot (direct, no Isartor)
- Isartor: Claude Code + Isartor + Copilot
For each scenario, the runner:
- Creates a fresh git-initialized workspace
- Sends each fixture prompt to the target endpoint
- Extracts code from LLM responses and writes files into the workspace
- Commits generated code
- Runs `validate_todo_app.py` structural checks
- Produces output artifacts
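As an illustration of the extraction step above, a minimal sketch of pulling fenced code blocks out of an LLM response and writing them into the workspace (hypothetical; the runner's actual parsing may differ):

```python
import re
from pathlib import Path

# Matches fenced code blocks, with an optional language tag after the opening fence.
_FENCE = re.compile(r"```[\w.+-]*\n(.*?)```", re.DOTALL)

def extract_code_blocks(response_text: str) -> list[str]:
    """Return the body of every fenced code block in an LLM response."""
    return [block.rstrip() + "\n" for block in _FENCE.findall(response_text)]

def write_to_workspace(workspace: Path, relative_path: str, code: str) -> None:
    """Write one extracted block into the scenario workspace before committing."""
    target = workspace / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(code)
```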
| File | Description |
|---|---|
| `tokens.json` | Per-prompt and aggregate token accounting with layer breakdown |
| `code.diff` | Unified diff between baseline and Isartor workspaces |
| `summary.md` | Markdown comparison report |
| `full_results.json` | Complete JSON results for programmatic consumption |
# Dry-run (no API calls, simulated responses)
make scenario-run-dry
# Live run with Copilot baseline
ISARTOR_API_KEY=changeme COPILOT_KEY=ghp_xxx make scenario-run
# Direct invocation with options
python3 benchmarks/scenario_runner.py \
--dry-run \
--requests 10 \
--output-dir benchmarks/results/my_run
# Run only one scenario
python3 benchmarks/scenario_runner.py --isartor-only --api-key changeme
python3 benchmarks/scenario_runner.py --baseline-only --copilot-token ghp_xxx

| Variable | Description |
|---|---|
| `ISARTOR_URL` | Gateway URL (default: http://localhost:8080) |
| `ISARTOR_API_KEY` | Gateway API key |
| `COPILOT_KEY` | GitHub token for Copilot baseline |
| `DIRECT_LLM_URL` | Alternative baseline URL (Anthropic-compatible) |
| `DIRECT_LLM_API_KEY` | API key for alternative baseline |