This directory contains the reproducible benchmark harness and fixtures for measuring Isartor's deflection rate and latency characteristics.
Three benchmark tracks are available:
| Track | Script | Use case |
|---|---|---|
| General harness | `run.py` | FAQ / mixed-workload deflection rate via `/api/chat` |
| Claude Code 3-way | `claude_code_benchmark.py` | Baseline vs cold vs warm comparison for Claude Code agent sessions via `/v1/messages` |
| Claude Code + Copilot | `claude_code_benchmark.py` | With-vs-without-Isartor comparison for Claude Code agent sessions via `/v1/messages` |
# 1. Start the Isartor server (requires a running instance)
# Configure a known API key used for the X-API-Key header (defaults to "changeme").
docker run -p 8080:8080 \
-e ISARTOR__GATEWAY_API_KEY=changeme \
ghcr.io/isartor-ai/isartor:latest
# or (from source):
ISARTOR__GATEWAY_API_KEY=changeme cargo run --release
# 2. Ensure the benchmark harness uses the same API key (for the X-API-Key header)
export ISARTOR_API_KEY=changeme
# 3. Run both built-in fixtures and save results (make shortcut)
make benchmark
# 4. Run without a server (dry-run / smoke-test)
make benchmark-dry-run
# 5. Run a single fixture with a custom request limit
python3 benchmarks/run.py \
--url http://localhost:8080 \
--input benchmarks/fixtures/faq_loop.jsonl \
--requests 500
# 6. Generate the with/without-Isartor ROI report (live data)
make report
# 7. Generate the ROI report offline using simulated data
make report-dry-run
# 8. Run the Claude Code / Qwen 2.5 Coder 7B Layer 2 benchmark
# (requires the Qwen sidecar stack: cd docker && docker compose -f docker-compose.qwen-benchmark.yml up --build)
make benchmark-qwen

Prerequisites:

- Python 3.10+ (uses the built-in `urllib` module — no `pip install` needed)
- A running Isartor instance (or use `--dry-run` for offline validation)
| File | Prompts | Description |
|---|---|---|
| `fixtures/faq_loop.jsonl` | 1,000 | Simulates a repetitive FAQ / agent-loop workload. Covers returns, shipping, billing, account management, and more — all with semantic rephrasings. Designed to stress L1a (exact cache) and L1b (semantic cache). |
| `fixtures/diverse_tasks.jsonl` | 500 | Genuine variety: code generation, summarisation, Q&A, data extraction, and creative writing. Represents a realistic mixed workload with lower deflection than the FAQ loop corpus. |
| `fixtures/claude_code_todo_app.jsonl` | 20 | Deterministic Claude Code workload — exact prompts for building a TypeScript todo app in six phases. Used by the Claude Code benchmark harness. |
Each file is in JSONL format — one JSON object per line:
{"prompt": "What is your return policy?"}
{"prompt": "How do returns work?"}The claude_code_todo_app.jsonl fixture carries extra metadata fields:
{"phase": "scaffold", "step": 1, "prompt": "Initialize a new TypeScript Node.js project…"}
{"phase": "backend", "step": 4, "prompt": "Define the Todo TypeScript interface…"}The claude_code_benchmark.py harness provides a deterministic, phase-based
benchmark for measuring how Isartor deflects Claude Code traffic when an AI
agent implements a TypeScript todo app end-to-end.
| Phase | Steps | Description |
|---|---|---|
| `scaffold` | 1–3 | package.json, tsconfig.json, directory structure |
| `backend` | 4–9 | TypeScript types, in-memory store, Express CRUD routes, middleware, app, server |
| `frontend` | 10–12 | index.html, app.js (vanilla JS), styles.css |
| `tests` | 13–15 | Jest config, unit tests, integration tests (supertest) |
| `docker` | 16–18 | Multi-stage Dockerfile, docker-compose.yml, .env.example, .gitignore |
| `polish` | 19–20 | README.md, final review prompt |
| Scenario | Description | Expected Deflection |
|---|---|---|
| `baseline` | Control path — prompts go straight to L3 (cloud), no Isartor cache. Documents the raw LLM cost. | ~0% |
| `cold` | First pass through Isartor with an empty cache. Prompts go mostly to L3; the cache is seeded. | ~3% (L2 only) |
| `warm` | Second pass through Isartor with a seeded cache. L1a exact hits dominate because the same prompts are replayed. | ~82% |
The warm-cache deflection rate is high because all 20 prompts are identical between the cold and warm runs — Isartor's L1a exact cache uses a SHA-256 hash of the full prompt and returns sub-millisecond responses on replay.
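For reference, a minimal sketch of that exact-match keying (illustrative only; Isartor's real implementation may normalise the prompt first):

```python
import hashlib

def exact_cache_key(prompt: str) -> str:
    # SHA-256 over the full prompt text: identical prompts map to the same key,
    # so replaying the same fixture against a warm cache yields guaranteed L1a hits.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

# The same prompt always produces the same key; one changed character produces a new one.
assert exact_cache_key("How do returns work?") == exact_cache_key("How do returns work?")
assert exact_cache_key("How do returns work?") != exact_cache_key("How do returns work")
```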
# All three scenarios — dry-run (no server required):
make benchmark-claude-code-dry-run
# All three scenarios against a live Isartor instance:
ISARTOR_URL=http://localhost:8080 \
ISARTOR_API_KEY=changeme \
make benchmark-claude-code
# Single scenario:
python3 benchmarks/claude_code_benchmark.py --scenario warm \
--url http://localhost:8080 --api-key changeme
# Post results to a GitHub issue:
GH_TOKEN=ghp_xxx \
GH_REPO=isartor-ai/Isartor \
GH_ISSUE=42 \
python3 benchmarks/claude_code_benchmark.py --all-scenarios \
--url http://localhost:8080 --api-key changeme
# Capture Isartor /health + /debug/stats snapshots before and after each run:
python3 benchmarks/claude_code_benchmark.py --all-scenarios --snapshot \
--url http://localhost:8080 --api-key changeme

All CLI flags also honour environment variables:
| Env variable | CLI flag | Default |
|---|---|---|
| `ISARTOR_URL` | `--url` | `http://localhost:8080` |
| `ISARTOR_API_KEY` | `--api-key` | `changeme` |
| `GH_TOKEN` | `--gh-token` | (empty) |
| `GH_REPO` | `--repo` | (empty) |
| `GH_ISSUE` | `--issue` | (empty) |
| `ISARTOR_TIMEOUT` | `--timeout` | 120 |
Loaded 20 tasks from claude_code_todo_app.jsonl
Phases: backend, docker, frontend, polish, scaffold, tests
════════════════════════════════════════════════════════════
Running scenario: COLD
════════════════════════════════════════════════════════════
─────────────────────────────────────────────────────────
Scenario : COLD
Tasks : 20
L1a (exact) : 0 (0.0%)
L1b (semantic) : 0 (0.0%)
L2 (SLM) : 1 (5.0%)
L3 (cloud) : 19 (95.0%)
Errors : 0
Deflection : 5.0%
P50 latency : 820.0 ms
...
════════════════════════════════════════════════════════════
Running scenario: WARM
════════════════════════════════════════════════════════════
Scenario : WARM
Tasks : 20
L1a (exact) : 16 (80.0%)
L1b (semantic) : 2 (10.0%)
L2 (SLM) : 1 (5.0%)
L3 (cloud) : 1 (5.0%)
Errors : 0
Deflection : 95.0%
P50 latency : 0.4 ms
...
The validate_todo_app.py script checks whether a generated todo app meets
the acceptance criteria from the benchmark task packet.
# Static checks only (no compiler / runner needed):
python3 benchmarks/validate_todo_app.py --app-dir ./output/todo-app
# Static checks + TypeScript compilation + Jest test suite:
python3 benchmarks/validate_todo_app.py --app-dir ./output/todo-app \
--compile --run-tests
# With API smoke tests against a running app:
python3 benchmarks/validate_todo_app.py --app-dir ./output/todo-app \
--api-url http://localhost:3000
# Machine-readable JSON output:
python3 benchmarks/validate_todo_app.py --app-dir ./output/todo-app \
--json-output ./validation_results.json
# Make shortcut:
APP_DIR=./output/todo-app make validate-todo-app
APP_DIR=./output/todo-app VALIDATE_ARGS="--compile --run-tests" make validate-todo-app

Checks performed:
| Category | Checks |
|---|---|
| File presence | 18 required + 3 optional files |
| `package.json` | Valid JSON · expected scripts · dependencies · devDependencies |
| `tsconfig.json` | Valid JSON · strict mode · outDir set |
| Source structure | Todo type definition · store exports · router methods · CORS · /health |
| Test files | Unit tests cover all store functions · integration tests cover all endpoints |
| Frontend | HTML structure · JS fetch/CRUD calls · styles |
| Docker | Multi-stage Dockerfile · docker-compose services/ports |
| TypeScript compile | tsc --noEmit (optional, requires npm install) |
| Jest suite | npm test (optional) |
| API smoke tests | All CRUD endpoints + 400/404 error cases (optional) |
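As an illustration of the static checks, a minimal sketch of a package.json check (the required script names here are assumptions, not the validator's exact list):

```python
import json
from pathlib import Path

REQUIRED_SCRIPTS = {"build", "start", "test"}  # assumed list, for illustration only

def check_package_json(app_dir: str) -> list[str]:
    """Return human-readable failures for the package.json checks."""
    pkg_path = Path(app_dir) / "package.json"
    if not pkg_path.exists():
        return ["package.json is missing"]
    try:
        pkg = json.loads(pkg_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"package.json is not valid JSON: {exc}"]
    failures = []
    missing = REQUIRED_SCRIPTS - set(pkg.get("scripts", {}))
    if missing:
        failures.append(f"missing scripts: {', '.join(sorted(missing))}")
    if not pkg.get("dependencies"):
        failures.append("no dependencies declared")
    return failures
```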
usage: run.py [-h] [--url URL] [--api-key KEY] [--input INPUT]
[--requests REQUESTS] [--all] [--output OUTPUT] [--dry-run]
Isartor Benchmark Harness
options:
-h, --help show this help message and exit
--url URL Base URL of the running Isartor instance
(default: $ISARTOR_URL or http://localhost:8080)
--api-key KEY Value for the X-API-Key header; must match the server's
gateway_api_key setting
(default: $ISARTOR_API_KEY or 'changeme')
--input INPUT Path to a JSONL fixture file to benchmark
--requests REQUESTS Limit the number of prompts to send (0 = all)
--all Run all built-in fixtures and write results
--output OUTPUT Path for the results JSON file
(default: benchmarks/results/latest.json)
--dry-run Simulate responses locally (no server required).
Useful for CI validation and smoke-testing.
Both --url and --api-key honour environment variables:
ISARTOR_URL=http://localhost:3000 \
ISARTOR_API_KEY=mysecret \
python3 benchmarks/run.py --all

-- faq_loop --
Total requests : 1000
L1a (exact) : 412 (41.2%)
L1b (semantic) : 189 (18.9%)
L2 (SLM) : 0 ( 0.0%)
L3 (cloud) : 399 (39.9%)
Errors : 0
Deflection rate: 60.1%
P50 latency : 0.3 ms
P95 latency : 820.0 ms
P99 latency : 950.0 ms
Cost saved : $0.1503 ($0.000150/req)
### faq_loop
| Layer | Hits | % of Traffic | Avg Latency (p50) |
|--------------------|--------|--------------|-------------------|
| L1a (exact) | 412 | 41.2% | 0.3 ms |
| L1b (semantic) | 189 | 18.9% | 3.1 ms |
| L2 (SLM) | 0 | 0.0% | - |
| L3 (cloud) | 399 | 39.9% | 820.0 ms |
| **Total deflected**| **601** | **60.1%** | |
| **Cost saved** | | | **$0.000150/req** |
> Overall latency — P50: 0.3 ms | P95: 820.0 ms | P99: 950.0 ms
Note: Overall P50 is sub-millisecond because >60% of requests are served from cache (L1a/L1b). P95 and P99 reflect cloud latency once those percentiles fall into the L3 (cloud) bucket.

Cost formula: 601 deflected × 50 tokens × $0.000005/token = $0.1503 total saved. Divided across all 1,000 requests → $0.1503 ÷ 1000 = $0.000150/req (the per-request figure shown in the table).
- L1a (exact) — request matched an exact (SHA-256) cache entry
- L1b (semantic) — request matched a semantically similar cached entry (cosine similarity)
- L2 (SLM) — resolved by the local Small Language Model without a cloud call
- L3 (cloud) — forwarded to the configured external LLM provider
- Deflection rate — percentage of requests resolved by L1a + L1b + L2 (cloud cost avoided)
- Cost saved — estimated USD saved per request using the gpt-4o input token price (`$0.000005/token × avg_prompt_tokens × deflected_requests`)
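Expressed as code, the deflection-rate definition is simply (a trivial sketch using the hit counts printed by the harness):

```python
def deflection_rate(l1a_hits: int, l1b_hits: int, l2_hits: int, total_requests: int) -> float:
    # Fraction of requests resolved locally, i.e. without an L3 cloud call.
    return (l1a_hits + l1b_hits + l2_hits) / total_requests

print(f"{deflection_rate(412, 189, 0, 1000):.1%}")  # the faq_loop sample above: 60.1%
```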
Every response from Isartor includes two observability headers:
| Header | Values | Description |
|---|---|---|
| `X-Isartor-Layer` | `l1a`, `l1b`, `l2`, `l3` | Which layer resolved the request |
| `X-Isartor-Deflected` | `true`, `false` | Whether the cloud call was avoided |
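A quick way to inspect these headers is to send one request with the Python standard library and read them off the response. The /api/chat payload below is an assumption; adjust it to your deployment's request schema:

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:8080/api/chat",
    data=json.dumps({"prompt": "What is your return policy?"}).encode("utf-8"),
    headers={"X-API-Key": "changeme", "Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("X-Isartor-Layer"))      # e.g. "l1a"
    print(resp.headers.get("X-Isartor-Deflected"))  # "true" or "false"
```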
The harness estimates cloud cost saved using the following formula:
tokens_saved = avg_prompt_tokens × (L1a_hits + L1b_hits + L2_hits)
cost_saved = tokens_saved × 0.000005 # gpt-4o input rate, USD/token
cost_per_req = cost_saved / total_requests
- `avg_prompt_tokens` defaults to 50 (a conservative estimate for typical FAQ / agent traffic).
- The gpt-4o input rate of `$0.000005/token` is the public OpenAI pricing as of the benchmark baseline.
- Only input tokens are counted; output tokens are not estimated.
The Claude Code benchmark uses avg_prompt_tokens = 120 and the Claude Sonnet
input rate of $0.000003/token to reflect the longer, more detailed prompts
typical of agentic coding workflows.
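The same estimate as a runnable sketch (defaults as documented above; pass the Claude Code values to reproduce that variant):

```python
def estimate_cost_saved(l1a_hits: int, l1b_hits: int, l2_hits: int, total_requests: int,
                        avg_prompt_tokens: int = 50,
                        usd_per_token: float = 0.000005) -> dict:
    # Only input tokens are counted, matching the harness methodology above.
    tokens_saved = avg_prompt_tokens * (l1a_hits + l1b_hits + l2_hits)
    cost_saved = tokens_saved * usd_per_token
    return {
        "tokens_saved": tokens_saved,
        "cost_saved_usd": round(cost_saved, 4),
        "cost_per_req_usd": round(cost_saved / total_requests, 6),
    }

# faq_loop sample above (601 deflected requests out of 1,000):
print(estimate_cost_saved(412, 189, 0, 1000))
# Claude Code variant: longer prompts, Claude Sonnet input rate:
print(estimate_cost_saved(16, 2, 1, 20, avg_prompt_tokens=120, usd_per_token=0.000003))
```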
After --all is used, results are written to benchmarks/results/latest.json:
{
"timestamp": "2025-01-15T10:23:00Z",
"isartor_version": "0.1.0",
"hardware": "4-core x86_64, 8 GB RAM, no GPU",
"fixtures": {
"faq_loop": {
"total_requests": 1000,
"deflection_rate": 0.712,
"l1a_hits": 423,
"l1b_hits": 214,
"l2_hits": 75,
"l3_hits": 288,
"l1a_rate": 0.423,
"l1b_rate": 0.214,
"l2_rate": 0.075,
"l3_rate": 0.288,
"error_count": 0,
"p50_ms": 0.4,
"p95_ms": 820.0,
"p99_ms": 950.0,
"l1a_p50_ms": 0.35,
"l1b_p50_ms": 3.1,
"l2_p50_ms": 130.0,
"l3_p50_ms": 820.0,
"tokens_saved": 35600,
"cost_saved_usd": 0.178,
"cost_per_req_usd": 0.000178
},
"diverse_tasks": {
"total_requests": 500,
"deflection_rate": 0.38,
"l1a_hits": 90,
"l1b_hits": 100,
"l2_hits": 0,
"l3_hits": 310,
"l1a_rate": 0.18,
"l1b_rate": 0.20,
"l2_rate": 0.0,
"l3_rate": 0.62,
"error_count": 0,
"p50_ms": 820.0,
"p95_ms": 1050.0,
"p99_ms": 1200.0,
"l1a_p50_ms": 0.35,
"l1b_p50_ms": 3.2,
"l2_p50_ms": null,
"l3_p50_ms": 820.0,
"tokens_saved": 9500,
"cost_saved_usd": 0.0475,
"cost_per_req_usd": 0.000095
}
}
}

The Claude Code benchmark writes per-scenario files and an aggregated summary:
benchmarks/results/claude_code_cold_20260101T120000Z.json
benchmarks/results/claude_code_warm_20260101T120015Z.json
benchmarks/results/claude_code_summary.json
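A minimal sketch of consuming the aggregated latest.json artifact programmatically (field names as in the example above):

```python
import json

# Read the aggregated results file and print a one-line summary per fixture.
with open("benchmarks/results/latest.json") as fh:
    results = json.load(fh)

for name, stats in results["fixtures"].items():
    print(f"{name}: {stats['deflection_rate']:.1%} deflected, "
          f"p50 {stats['p50_ms']} ms, saved ${stats['cost_saved_usd']:.4f}")
```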
Any engineer can reproduce results in under 10 minutes:
# Clone and build
git clone https://github.com/isartor-ai/Isartor.git && cd Isartor
cargo build --release
# Start Isartor with default settings (exact + semantic cache enabled).
# The gateway_api_key defaults to "changeme"; the harness will use the
# same value via $ISARTOR_API_KEY so authentication passes automatically.
# For L3 (cloud) requests, configure an Azure OpenAI backend:
ISARTOR__CACHE_MODE=both \
ISARTOR__LLM_PROVIDER=azure \
ISARTOR__EXTERNAL_LLM_URL=https://<resource>.openai.azure.com \
ISARTOR__EXTERNAL_LLM_API_KEY=<your-azure-key> \
ISARTOR__AZURE_DEPLOYMENT_ID=gpt-4o-mini \
ISARTOR__AZURE_API_VERSION=2024-08-01-preview \
./target/release/isartor &
sleep 5 # wait for the server to start
# Run the full benchmark suite (ISARTOR_API_KEY defaults to 'changeme')
make benchmark
# or equivalently:
# ISARTOR_API_KEY=changeme python3 benchmarks/run.py --url http://localhost:8080 --all

If you have configured a custom API key, export it before running:
export ISARTOR__GATEWAY_API_KEY=your-secret-key # server
export ISARTOR_API_KEY=your-secret-key # harness
make benchmark

Hardware: 4-core CPU, 8 GB RAM, no GPU. Results will vary based on hardware and whether L2 SLM inference is enabled.
The .github/workflows/benchmark.yml workflow runs automatically on every PR
targeting main. It:
- Builds Isartor from source.
- Starts the server and waits for it to become healthy.
- Runs both fixture corpora through the harness.
- Posts a formatted result table as a PR comment (updated on subsequent pushes).
- Uploads `benchmarks/results/ci_run.json` as a workflow artifact.
A validate-harness job also runs in dry-run mode on every push to confirm
the harness itself is functioning correctly without requiring a live server.
The CI workflow routes L3 (cloud) requests through Azure OpenAI. Add the following secret to your repository (Settings → Secrets and variables → Actions → New repository secret):
| Secret name | Value |
|---|---|
| `AZURE_OPENAI_API_KEY` | Your Azure OpenAI key |
Without this secret the server cannot reach the Azure backend and L3 requests will return 502 errors. L1a/L1b cache hits are unaffected.
Additional fixtures used by the Claude Code, Copilot, and Qwen Layer 2 benchmarks:

| File | Prompts | Description |
|---|---|---|
| `fixtures/claude_code_todo_app.jsonl` | 58 | Deterministic TypeScript todo-app coding workload for the Claude Code three-way benchmark. Includes unique implementation prompts, semantic variants, and exact repeats. |
| `fixtures/claude_code_tasks.jsonl` | 388 | Realistic Claude Code / Copilot workload: coding questions, algorithm explanations, Rust/Go/Python/TypeScript patterns, DevOps tasks, and architecture concepts. Designed to stress Layer 2 (SLM) with a Qwen 2.5 Coder 7B sidecar. Run via `make benchmark-qwen`. |
| `fixtures/claude_code_todo.jsonl` | 105 | Realistic Claude Code session prompts for building a React TypeScript todo application. Covers component scaffolding, custom hooks, tests, routing, and tooling. Designed to model the repetitive, cache-friendly patterns of an AI-assisted coding workflow. |
The .github/workflows/roi-report.yml workflow generates the full ROI report:
- Runs a dry-run benchmark (or uses existing results).
- Generates `benchmarks/results/roi_report.json` and `benchmarks/results/roi_report.md`.
- Uploads both files as workflow artifacts.
- Posts the Markdown report to the configured GitHub issue.
To trigger the ROI report manually:
gh workflow run roi-report.yml \
--field issue_number=<your-issue-number> \
--field dry_run=false
benchmarks/claude_code_benchmark.py runs a three-way comparison benchmark
that quantifies the ROI of routing Claude Code through Isartor.
- Baseline — without Isartor: every prompt goes directly to the cloud LLM. No local deflection. All latency is cloud-round-trip latency.
- Isartor cold cache: first pass through Isartor with an empty cache. Novel prompts fall through to L2 (Qwen) or L3 (cloud). Only exact duplicate prompts within the run hit L1a.
- Isartor warm cache: second pass with the cache populated from the cold run. Previously-seen prompts are deflected locally at L1a or L1b.
The report.py script produces a full with-vs-without-Isartor comparison from
existing benchmark data.
# From live benchmark results:
make report
# Offline (dry-run simulation):
make report-dry-run
# From a specific results file:
python3 benchmarks/report.py --input benchmarks/results/ci_run.json

Outputs:
| File | Description |
|---|---|
| `benchmarks/results/roi_report.json` | Machine-readable artifact (schema v1) |
| `benchmarks/results/roi_report.md` | Human-readable Markdown report |
The report covers:
- With vs without comparison — cloud token usage, cost, and latency for each scenario
- L1/L2/L3 layer breakdown — hit counts, rates, and per-layer p50 latencies
- Token distribution — separate input and output token estimates per layer
- Cost reduction — estimated USD savings based on public gpt-4o pricing
- Latency delta — P50 latency improvement from cache deflection
- Error / interruption resilience — deflected requests are immune to cloud outages
- L2 SLM justification — when the local SLM sidecar adds value vs falls through
- Methodology and assumptions — all estimates clearly labelled
benchmarks/claude_code_benchmark.py runs a dedicated comparison benchmark that
quantifies the ROI of routing Claude Code through Isartor compared to a direct
cloud path.
- Case A — without Isartor: every prompt goes directly to the cloud LLM provider (Anthropic API or GitHub Copilot-backed endpoint). All latency is cloud-round-trip latency and all tokens consume cloud quota.
- Case B — with Isartor (Qwen 2.5 Coder 7B as L2): prompts route through the full Isartor deflection stack: L1a exact cache → L1b semantic cache → L2 Qwen (llama.cpp) → L3 cloud.
Reported metrics for each case:
| Metric | Description |
|---|---|
| L1a / L1b / L2 / L3 hits | Requests resolved at each layer |
| Deflection rate | % of requests that avoided cloud |
| P50 / P95 / P99 latency | Overall and per-layer latencies |
| Est. cloud tokens avoided | Input + output tokens saved by deflection |
| Est. cost saving | USD reduction using Claude 3.5 Sonnet pricing |
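As an illustration of the cost metric, a sketch of the per-case cloud-cost estimate. The Sonnet input rate matches the figure documented earlier; the output rate and per-request token counts are assumptions:

```python
def estimate_cloud_cost(requests_to_cloud: int,
                        avg_input_tokens: int = 120,     # documented Claude Code default
                        avg_output_tokens: int = 300,    # assumption
                        input_rate: float = 0.000003,    # Claude Sonnet input, USD/token
                        output_rate: float = 0.000015) -> float:  # assumption
    """Estimated cloud spend for the requests that were not deflected."""
    per_request = avg_input_tokens * input_rate + avg_output_tokens * output_rate
    return requests_to_cloud * per_request

case_a = estimate_cloud_cost(58)   # without Isartor: every request reaches the cloud
case_b = estimate_cloud_cost(12)   # with Isartor: only the L3 share (warm sample below)
print(f"estimated saving: ${case_a - case_b:.4f}")
```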
# Dry-run — no server needed, CI-safe, deterministic output:
make benchmark-claude-code-dry-run
# Live three-way benchmark against a running Isartor instance:
ISARTOR_URL=http://localhost:8080 ISARTOR_API_KEY=changeme \
make benchmark-claude-code
# Full end-to-end with auto-start (requires Qwen GGUF + llama-server):
GITHUB_TOKEN=ghp_... \
./scripts/run_claude_code_benchmark.sh

# Claude Code + Copilot comparison (dry-run):
make benchmark-claude-copilot-dry-run
# Case B only against a live Isartor instance:
python3 benchmarks/claude_code_benchmark.py --case B \
--isartor-url http://localhost:8080
# Full comparison with a real Anthropic API key (Case A) and live Isartor (Case B):
ANTHROPIC_API_KEY=sk-ant-... \
python3 benchmarks/claude_code_benchmark.py --compare \
--isartor-url http://localhost:8080 \
--api-key changeme
# Full end-to-end orchestration (downloads model, starts sidecar + Isartor):
GITHUB_TOKEN=ghp_... \
./scripts/run_claude_code_benchmark.sh --compare \
--start-llama-server \
--start-isartor

# 1. Download the Qwen 2.5 Coder 7B GGUF (~4.7 GB):
./scripts/download_qwen_gguf.sh
# 2. Start llama-server:
llama-server \
--model models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
--host 127.0.0.1 --port 8090 \
--ctx-size 4096 --n-predict 512
# 3. Start Isartor with Qwen as Layer 2:
ISARTOR__ENABLE_SLM_ROUTER=true \
ISARTOR__LAYER2__SIDECAR_URL=http://127.0.0.1:8090/v1 \
./target/release/isartor up

The .github/workflows/claude-code-benchmark.yml workflow runs the three-way
benchmark on demand and posts progress + final results as GitHub issue comments.
Trigger:
gh workflow run claude-code-benchmark.yml \
-f issue_number=<N> \
-f dry_run=true

Issue comment sequence:
- 🚀 Benchmark started
- ⚙️ Environment setup complete (live mode)
- ✅ Baseline run completed
- ✅ Isartor cold cache run completed
- ✅ Isartor warm cache run completed
- 📊 Final results table
| File | Description |
|---|---|
| `results/claude_code_benchmark.json` | Machine-readable three-way results |
| `results/claude_code_benchmark_report.md` | Human-readable Markdown report |
usage: claude_code_benchmark.py [-h] [--three-way] [--scenario {baseline,cold,warm}]
[--dry-run] [--isartor-url URL] [--api-key KEY]
[--direct-url URL] [--direct-api-key KEY]
[--input FILE] [--requests N]
[--output FILE] [--report FILE]
Options:
--three-way Run all three scenarios and generate a comparison report
--scenario {baseline,cold,warm}
Run a single scenario
--dry-run Simulate responses locally — no server required (CI-safe)
--isartor-url URL Isartor base URL (default: $ISARTOR_URL or http://localhost:8080)
--api-key KEY Isartor X-API-Key (default: $ISARTOR_API_KEY or 'changeme')
--direct-url URL Direct API URL for baseline (default: $ANTHROPIC_BASE_URL)
--direct-api-key KEY API key for baseline (default: $ANTHROPIC_API_KEY)
--input FILE JSONL fixture file (default: fixtures/claude_code_todo_app.jsonl)
--requests N Limit number of prompts per scenario (0 = all)
--output FILE JSON results file (default: results/claude_code_benchmark.json)
--report FILE Markdown report file (default: results/claude_code_benchmark_report.md)
-- Baseline — without Isartor --
Total requests : 58
L3 (cloud) : 58 (100.0%)
Deflection rate: 0.0% (no local deflection — every request hits cloud)
P50 latency : 1421.0 ms
Est. cloud cost: $0.3132 ($0.005400/req)
-- Isartor cold cache — with Qwen L2 --
Total requests : 58
L1a (exact) : 6 (10.3%)
L1b (semantic) : 3 ( 5.2%)
L2 (Qwen) : 6 (10.3%)
L3 (cloud) : 43 (74.1%)
Deflection rate: 25.9%
P50 latency : 1454.8 ms
Est. cloud cost: $0.2322 ($0.004003/req)
-- Isartor warm cache — with Qwen L2 --
Total requests : 58
L1a (exact) : 21 (36.2%)
L1b (semantic) : 9 (15.5%)
L2 (Qwen) : 5 ( 8.6%)
L3 (cloud) : 23 (39.7%)
Deflection rate: 60.3%
P50 latency : 7.4 ms
Est. cloud cost: $0.1242 ($0.002141/req)
The harness enforces these criteria and exits non-zero if any fail:
| Criterion | Threshold |
|---|---|
| Warm deflection rate | ≥ 60 % |
| Cold deflection rate | ≥ 10 % |
| Error rate (any scenario) | < 5 % |
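A minimal sketch of such an acceptance gate (thresholds as in the table; the result field names are assumptions):

```python
import sys

MIN_WARM_DEFLECTION = 0.60
MIN_COLD_DEFLECTION = 0.10
MAX_ERROR_RATE = 0.05

def enforce_acceptance(results: dict) -> None:
    """Exit non-zero if any criterion fails; 'results' is keyed by scenario name."""
    failures = []
    if results["warm"]["deflection_rate"] < MIN_WARM_DEFLECTION:
        failures.append("warm deflection below 60%")
    if results["cold"]["deflection_rate"] < MIN_COLD_DEFLECTION:
        failures.append("cold deflection below 10%")
    for name, scenario in results.items():
        if scenario["error_rate"] >= MAX_ERROR_RATE:
            failures.append(f"{name}: error rate at or above 5%")
    if failures:
        print("Acceptance criteria failed: " + "; ".join(failures))
        sys.exit(1)
```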
docker/docker-compose.qwen-sidecar.yml defines the Layer 2 sidecar:
- Model: `Qwen/Qwen2.5-Coder-7B-Instruct-GGUF` (Q4_K_M, ~4.4 GB)
- Served by: `ghcr.io/ggml-org/llama.cpp:server`
- Port: 8081 (OpenAI-compatible `/v1/chat/completions`)
- Environment overrides: `QWEN_CTX_SIZE`, `QWEN_N_GPU_LAYERS`, `QWEN_N_THREADS`, `QWEN_PORT`
Smoke test:
curl http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen2.5-coder-7b","messages":[{"role":"user","content":"hi"}],"max_tokens":16}'benchmarks/roi_report.py reads benchmarks/results/claude_code_latest.json
and emits:
- `benchmarks/results/claude_code_roi_report.md` — human-readable report
- `benchmarks/results/claude_code_roi_<timestamp>.json` — machine-readable artifact
Key assumptions (edit roi_report.py to match your environment):
| Assumption | Default |
|---|---|
| gpt-4o input price | $0.000005 / token |
| gpt-4o output price | $0.000015 / token |
| Avg input tokens / request | 75 |
| Avg output tokens / response | 300 |
| Monthly request volume | 50,000 |
| Isartor self-hosting cost | $50 / month |
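Using these defaults, the monthly ROI arithmetic looks roughly like this (the deflection rate is a placeholder; substitute your measured value):

```python
# Defaults from the assumptions table above.
INPUT_PRICE = 0.000005     # gpt-4o input, USD/token
OUTPUT_PRICE = 0.000015    # gpt-4o output, USD/token
AVG_INPUT_TOKENS = 75
AVG_OUTPUT_TOKENS = 300
MONTHLY_REQUESTS = 50_000
SELF_HOSTING_COST = 50.0   # USD/month

def monthly_net_savings(deflection_rate: float) -> float:
    cost_per_request = AVG_INPUT_TOKENS * INPUT_PRICE + AVG_OUTPUT_TOKENS * OUTPUT_PRICE
    avoided_cloud_spend = MONTHLY_REQUESTS * deflection_rate * cost_per_request
    return avoided_cloud_spend - SELF_HOSTING_COST

print(f"${monthly_net_savings(0.60):,.2f}/month net at 60% deflection")
```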
usage: claude_code_benchmark.py [-h] [--case {A,B}] [--compare] [--dry-run]
[--isartor-url URL] [--api-key KEY]
[--direct-url URL] [--direct-api-key KEY]
[--input FILE] [--requests N]
[--output FILE] [--report FILE]
Options:
--case {A,B} Run a single case: A (without Isartor) or B (with Isartor)
--compare Run both cases and generate a comparison report
--dry-run Simulate responses locally — no server needed (CI-safe)
--isartor-url URL Isartor base URL (default: $ISARTOR_URL or http://localhost:8080)
--api-key KEY Isartor X-API-Key (default: $ISARTOR_API_KEY or 'changeme')
--direct-url URL Direct API URL for Case A (default: $ANTHROPIC_BASE_URL)
--direct-api-key KEY API key for Case A (default: $ANTHROPIC_API_KEY)
--input FILE JSONL fixture file (default: fixtures/claude_code_todo_app.jsonl)
--requests N Limit number of prompts (0 = all)
--output FILE JSON results file (default: results/claude_code_copilot.json)
--report FILE Markdown report file (default: results/claude_code_copilot_report.md)
| File | Description |
|---|---|
| `results/claude_code_copilot.json` | Machine-readable results (both cases) |
| `results/claude_code_copilot_report.md` | Human-readable Markdown comparison report |
-- Case A — without Isartor --
Total requests : 58
L3 (cloud) : 58 (100.0%)
Deflection rate: 0.0% (no local deflection — every request hits cloud)
P50 latency : 1408.5 ms
Est. cloud cost: $0.3132 ($0.005400/req)
-- Case B — with Isartor (Qwen L2) --
Total requests : 58
L1a (exact) : 20 (34.5%)
L1b (semantic) : 15 (25.9%)
L2 (Qwen) : 11 (19.0%)
L3 (cloud) : 12 (20.7%)
Deflection rate: 79.3%
P50 latency : 5.9 ms
Est. cloud cost: $0.0648 ($0.001117/req)
Note on Case A control path: Claude Code's native behavior routes requests through GitHub Copilot's infrastructure, which does not expose per-request layer or deflection metadata. The Case A baseline therefore uses a direct Anthropic API call (or a simulated all-L3 distribution in dry-run mode) as the nearest defensible control. This is documented explicitly in the benchmark report so comparisons are reproducible and transparent.
The scenario runner replays claude_code_todo_app.jsonl prompts through two code-generation scenarios:
- Baseline: Claude Code + Copilot (direct, no Isartor)
- Isartor: Claude Code + Isartor + Copilot
For each scenario, the runner:
- Creates a fresh git-initialized workspace
- Sends each fixture prompt to the target endpoint
- Extracts code from LLM responses and writes files into the workspace
- Commits generated code
- Runs `validate_todo_app.py` structural checks
- Produces output artifacts
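As an illustration of the extraction step above, a minimal sketch of pulling fenced code blocks out of an LLM response and writing them into the workspace (hypothetical; the runner's actual parsing may differ):

```python
import re
from pathlib import Path

# Matches fenced code blocks, with an optional language tag after the opening fence.
_FENCE = re.compile(r"```[\w.+-]*\n(.*?)```", re.DOTALL)

def extract_code_blocks(response_text: str) -> list[str]:
    """Return the body of every fenced code block in an LLM response."""
    return [block.rstrip() + "\n" for block in _FENCE.findall(response_text)]

def write_to_workspace(workspace: Path, relative_path: str, code: str) -> None:
    """Write one extracted block into the scenario workspace before committing."""
    target = workspace / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(code)
```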
| File | Description |
|---|---|
| `tokens.json` | Per-prompt and aggregate token accounting with layer breakdown |
| `code.diff` | Unified diff between baseline and Isartor workspaces |
| `summary.md` | Markdown comparison report |
| `full_results.json` | Complete JSON results for programmatic consumption |
# Dry-run (no API calls, simulated responses)
make scenario-run-dry
# Live run with Copilot baseline
ISARTOR_API_KEY=changeme COPILOT_KEY=ghp_xxx make scenario-run
# Direct invocation with options
python3 benchmarks/scenario_runner.py \
--dry-run \
--requests 10 \
--output-dir benchmarks/results/my_run
# Run only one scenario
python3 benchmarks/scenario_runner.py --isartor-only --api-key changeme
python3 benchmarks/scenario_runner.py --baseline-only --copilot-token ghp_xxx

| Variable | Description |
|---|---|
| `ISARTOR_URL` | Gateway URL (default: http://localhost:8080) |
| `ISARTOR_API_KEY` | Gateway API key |
| `COPILOT_KEY` | GitHub token for Copilot baseline |
| `DIRECT_LLM_URL` | Alternative baseline URL (Anthropic-compatible) |
| `DIRECT_LLM_API_KEY` | API key for alternative baseline |