242 changes: 98 additions & 144 deletions README.md
@@ -1,146 +1,100 @@
# LiteLLM Benchmarking System

## Purpose

This project provides a local-first benchmarking system for comparing provider, model, harness, and harness-configuration performance through a shared LiteLLM proxy.

The system is built for interactive terminal agents and IDE agents that can be pointed at a custom inference base URL. The benchmark application does not own the harness runtime. It owns session registration, correlation, collection, normalization, storage, reporting, and dashboards.

## What the system answers

The completed system should make it easy to answer questions such as:

- Which provider and model combination is fastest for the same task card and harness?
- How does Claude Code compare with Codex, OpenCode, OpenHands, Gemini-oriented clients, or other agent harnesses when routed through the same local proxy?
- Does a harness configuration change improve TTFT, total latency, output throughput, error rate, or cache behavior?
- Does a provider-specific routing change improve session-level performance?
- How much variance exists between repeated sessions of the same benchmark variant?

## Recommended local stack

Use Docker Compose for infrastructure and `uv` for the benchmark application.

Infrastructure services:

- LiteLLM proxy
- PostgreSQL
- Prometheus
- Grafana

Benchmark application capabilities:

- config loading and validation
- experiment, variant, and session registry
- session credential issuance
- harness env rendering
- LiteLLM request collection and normalization
- Prometheus metric collection and rollups
- query API and exports
- dashboards and reports

## Core design choices

1. LiteLLM is the single shared proxy and routing layer.
2. Every interactive benchmark session gets a benchmark-owned session ID.
3. Session correlation is built around a session-scoped proxy credential plus benchmark tags.
4. The project stores canonical benchmark records in a project-owned database.
5. LiteLLM and Prometheus are telemetry sources, not the canonical query model.
6. Prompt and response content are disabled by default.
7. The benchmark application stays harness-agnostic in its core path.
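The session-credential idea in points 2 and 3 can be sketched as follows. This is a minimal illustration of building a payload for LiteLLM's key-generation endpoint; the exact field names (`key_alias`, `metadata`) follow LiteLLM's key-management API and should be verified against your proxy version.

```python
import json
import uuid


def issue_session_credential(session_id: str) -> dict:
    """Build a LiteLLM /key/generate payload for a session-scoped key.

    The correlation key travels in the key's metadata so every request
    made with this credential can be matched back to the session.
    """
    return {
        "key_alias": f"bench-session-{session_id}",
        "metadata": {
            "benchmark_session_id": session_id,  # correlation key
            "source": "benchmark-core",
        },
    }


payload = issue_session_credential(str(uuid.uuid4()))
print(json.dumps(payload, indent=2))
```

Because the credential is scoped to one session, revoking it after the run also bounds the blast radius of a leaked key.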

## Primary workflow

1. Define providers, harness profiles, variants, experiments, and task cards in versioned config files.
2. Create a benchmark session for a chosen variant and task card.
3. The session manager issues a session-scoped proxy credential and renders the exact environment snippet for the selected harness.
4. Launch the harness manually and use it interactively against the local LiteLLM proxy.
5. LiteLLM emits request data and Prometheus metrics while the benchmark app captures benchmark metadata.
6. Collectors normalize request- and session-level data into the project database.
7. Reports and dashboards compare sessions, variants, providers, models, and harnesses.
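Step 3's "environment snippet" can be sketched like this. The variable names (`OPENAI_BASE_URL`, `OPENAI_API_KEY`) are illustrative assumptions — each harness documents its own base-URL and key variables, and the real renderer is harness-profile driven.

```python
def render_harness_env(base_url: str, session_key: str) -> str:
    """Render the environment snippet a harness needs to hit the proxy.

    Variable names are hypothetical; a real harness profile would
    supply the names its harness actually reads.
    """
    lines = [
        f"export OPENAI_BASE_URL={base_url}",    # point the harness at the local proxy
        f"export OPENAI_API_KEY={session_key}",  # session-scoped LiteLLM key
    ]
    return "\n".join(lines)


snippet = render_harness_env("http://localhost:4000", "sk-session-abc123")
print(snippet)
```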

## Repository layout

```text
.
├── AGENTS.md
├── README.md
├── pyproject.toml
├── Makefile
├── docker-compose.yml
├── .env.example
├── configs/
│   ├── litellm/
│   ├── prometheus/
│   ├── grafana/
│   ├── providers/
│   ├── harnesses/
│   ├── variants/
│   ├── experiments/
│   └── task-cards/
├── dashboards/
├── docs/
│   ├── architecture.md
│   ├── benchmark-methodology.md
│   ├── config-and-contracts.md
│   ├── data-model-and-observability.md
│   ├── implementation-plan.md
│   ├── references.md
│   └── security-and-operations.md
├── skills/
│   └── convert-tasks-to-linear/
│       └── SKILL.md
├── src/
│   ├── benchmark_core/
│   ├── cli/
│   ├── collectors/
│   ├── reporting/
│   └── api/
└── tests/
```
# Benchmark Core

A harness-agnostic benchmarking system for comparing providers, models, and harnesses through a local LiteLLM proxy.

## Architecture

- **LiteLLM as the single inference gateway** - all benchmark traffic routes through the local proxy
- **Session-scoped correlation** - every session gets unique correlation keys for traffic matching
- **Canonical data model** - normalized storage for cross-harness comparisons

## Project Structure

```
src/
├── benchmark_core/            # Core domain logic
│   ├── models.py              # Canonical domain models
│   ├── config.py              # Pydantic settings
│   ├── db/
│   │   ├── connection.py      # SQLAlchemy async engine
│   │   └── models.py          # ORM models with FKs
│   ├── repositories/          # Data access layer (9 repositories)
│   └── services/              # Business logic layer
├── collectors/                # Data ingestion
│   ├── litellm_collector.py
│   ├── normalizer.py
│   ├── rollups.py
│   └── prometheus_collector.py
migrations/                    # Alembic migrations
tests/                         # Unit and integration tests
```

## Canonical Entities

- `provider` - Upstream inference provider definition
- `harness_profile` - How a harness is configured to talk to the proxy
- `variant` - Benchmarkable combination of provider/model/harness
- `experiment` - Named comparison grouping
- `task_card` - Benchmark task definition
- `session` - Interactive benchmark execution
- `request` - Normalized LLM call
- `metric_rollup` - Derived latency/throughput metrics
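The relationships among these entities can be sketched with plain dataclasses. Field names here are illustrative, not the project's actual `models.py` definitions; the point is the foreign-key-style linkage from session to variant.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Variant:
    """A benchmarkable provider/model/harness combination (illustrative fields)."""
    provider: str
    model: str
    harness_profile: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class Session:
    """One interactive benchmark execution, tied to a variant and task card."""
    variant_id: str
    task_card: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


variant = Variant("openrouter", "claude-sonnet", "claude-code-default")
session = Session(variant.id, "refactor-small-module")
```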

## Quick Start

```bash
# Install dependencies
pip install -e ".[dev]"

# Run migrations
alembic upgrade head

# Run tests
pytest tests/ -v
```

## Documentation map

- `AGENTS.md`
  - persistent project context for coding agents
  - architectural invariants
  - delivery and testing rules
- `docs/architecture.md`
  - system components
  - data flow
  - deployment boundaries
- `docs/benchmark-methodology.md`
  - how to run comparable interactive benchmark sessions
  - metric definitions and confounder controls
- `docs/config-and-contracts.md`
  - config schemas
  - session and CLI contracts
  - normalization contracts
- `docs/data-model-and-observability.md`
  - canonical entities
  - storage model
  - derived metrics
- `docs/security-and-operations.md`
  - local security posture
  - redaction, retention, and secrets
  - operator safeguards
- `docs/implementation-plan.md`
  - parent issues and sub-issues
  - Definition of Ready information
  - acceptance criteria and test plans
- `docs/references.md`
  - external references that shaped the design
- `skills/convert-tasks-to-linear/SKILL.md`
  - reusable instructions for converting a markdown implementation plan into Linear parent issues and sub-issues

## MVP success criteria

The MVP is complete when a developer can:

1. start LiteLLM, Postgres, Prometheus, and Grafana locally with one command
2. validate provider, harness profile, variant, experiment, and task-card configs
3. create a session for a specific benchmark variant
4. receive a session-specific environment snippet for a chosen harness
5. run the harness interactively against the proxy
6. collect and normalize request- and session-level data into the benchmark database
7. view live metrics in Grafana and historical comparisons in the benchmark app
8. export structured comparison results for providers, models, harnesses, and harness configurations

## Database Schema

All tables use UUID primary keys, with foreign keys linking the entities below:

- `providers` - Inference providers
- `harness_profiles` - Harness connection configs
- `variants` - Provider + model + harness combinations
- `experiments` - Named comparison groups
- `task_cards` - Benchmark work definitions
- `sessions` - Interactive execution records
- `requests` - Normalized LLM calls
- `metric_rollups` - Aggregated statistics
- `artifacts` - Exported bundles
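The UUID-primary-key plus enforced-foreign-key shape can be demonstrated with stdlib `sqlite3` (the real schema lives in the SQLAlchemy models and Alembic migrations; table and column names here are a simplified sketch):

```python
import sqlite3
import uuid

# In-memory sketch of the providers -> variants FK relationship.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs this opt-in
conn.execute("CREATE TABLE providers (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE variants (
           id TEXT PRIMARY KEY,
           provider_id TEXT NOT NULL REFERENCES providers(id),
           model TEXT NOT NULL
       )"""
)

provider_id = str(uuid.uuid4())
conn.execute("INSERT INTO providers VALUES (?, ?)", (provider_id, "openrouter"))
conn.execute(
    "INSERT INTO variants VALUES (?, ?, ?)",
    (str(uuid.uuid4()), provider_id, "claude-sonnet"),
)

# FK enforcement: a variant pointing at an unknown provider is rejected.
try:
    conn.execute(
        "INSERT INTO variants VALUES (?, ?, ?)",
        (str(uuid.uuid4()), "missing-provider", "gpt-4o"),
    )
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```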

## Collectors

### LiteLLM Collector

Ingests raw request records from LiteLLM:
- Duplicate detection via `litellm_call_id`
- Correlation key extraction from tags
- Missing field diagnostics
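A minimal sketch of the deduplication and missing-field diagnostics, assuming raw records are dicts carrying a `litellm_call_id` key (the real collector's record shape and diagnostics are richer):

```python
def ingest(raw_records: list, seen: set) -> list:
    """Accept raw LiteLLM records, skipping duplicates and flagging gaps."""
    accepted = []
    for rec in raw_records:
        call_id = rec.get("litellm_call_id")
        if call_id is None:
            print("diagnostic: record missing litellm_call_id")
            continue
        if call_id in seen:
            continue  # duplicate delivery, already ingested
        seen.add(call_id)
        accepted.append(rec)
    return accepted


seen_ids: set = set()
batch = [
    {"litellm_call_id": "a1", "model": "claude-sonnet"},
    {"litellm_call_id": "a1", "model": "claude-sonnet"},  # duplicate
    {"model": "gpt-4o"},  # missing call id -> diagnostic
]
accepted = ingest(batch, seen_ids)
print(len(accepted))  # 1
```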

### Request Normalizer

Maps raw requests to canonical format:
- Session/variant joining
- Canonical field validation
- Unmapped row surfacing
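The session join and unmapped-row surfacing can be sketched like this, assuming the correlation key lands in the raw record's metadata (field names are illustrative):

```python
def normalize(raw: dict, sessions_by_key: dict):
    """Join a raw request to its session via correlation key.

    Returns (canonical_row, None) on success, (None, raw) when the
    row cannot be mapped, so unmapped rows are surfaced rather than dropped.
    """
    key = raw.get("metadata", {}).get("benchmark_session_id")
    session = sessions_by_key.get(key)
    if session is None:
        return None, raw
    canonical = {
        "session_id": session["id"],
        "variant_id": session["variant_id"],
        "model": raw["model"],
        "latency_ms": raw["latency_ms"],
    }
    return canonical, None


sessions = {"s-123": {"id": "s-123", "variant_id": "v-1"}}
ok, _ = normalize(
    {"metadata": {"benchmark_session_id": "s-123"},
     "model": "claude-sonnet", "latency_ms": 850},
    sessions,
)
_, unmapped = normalize(
    {"metadata": {}, "model": "gpt-4o", "latency_ms": 400}, sessions
)
```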

### Metric Rollups

Computes aggregated statistics:
- Request-level: latency, TTFT, tokens/sec
- Session-level: request_count, success_rate, median/p95 latency
- Variant-level: session_count, session_success_rate
- Experiment-level: variant comparison

## Configuration

See `docs/config-and-contracts.md` for configuration schema.

## License

Internal use only.
43 changes: 43 additions & 0 deletions alembic.ini
@@ -0,0 +1,43 @@
# A generic, single database configuration.

[alembic]
script_location = migrations
prepend_sys_path = .
version_path_separator = os
sqlalchemy.url = postgresql+asyncpg://postgres:postgres@localhost:5432/benchmark

[post_write_hooks]

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
Binary file added migrations/__pycache__/env.cpython-312.pyc
Binary file not shown.
70 changes: 70 additions & 0 deletions migrations/env.py
@@ -0,0 +1,70 @@
"""Alembic environment configuration for async migrations."""
import asyncio
from logging.config import fileConfig

from alembic import context
from sqlalchemy import pool
from sqlalchemy.engine import Connection
from sqlalchemy.ext.asyncio import async_engine_from_config

from benchmark_core.db.connection import Base
from benchmark_core.db.models import ( # noqa: F401 - imported for model registration
ArtifactModel,
ExperimentModel,
HarnessProfileModel,
MetricRollupModel,
ProviderModel,
RequestModel,
SessionModel,
TaskCardModel,
VariantModel,
)

config = context.config

if config.config_file_name is not None:
fileConfig(config.config_file_name)

target_metadata = Base.metadata


def run_migrations_offline() -> None:
"""Run migrations in 'offline' mode."""
url = config.get_main_option("sqlalchemy.url")
context.configure(
url=url,
target_metadata=target_metadata,
literal_binds=True,
dialect_opts={"paramstyle": "named"},
)
with context.begin_transaction():
context.run_migrations()


def do_run_migrations(connection: Connection) -> None:
context.configure(connection=connection, target_metadata=target_metadata)
with context.begin_transaction():
context.run_migrations()


async def run_async_migrations() -> None:
"""Run migrations in async mode."""
connectable = async_engine_from_config(
config.get_section(config.config_ini_section, {}),
prefix="sqlalchemy.",
poolclass=pool.NullPool,
)
async with connectable.connect() as connection:
await connection.run_sync(do_run_migrations)
await connectable.dispose()


def run_migrations_online() -> None:
"""Run migrations in 'online' mode."""
asyncio.run(run_async_migrations())


if context.is_offline_mode():
run_migrations_offline()
else:
run_migrations_online()
26 changes: 26 additions & 0 deletions migrations/script.py.mako
@@ -0,0 +1,26 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    ${downgrades if downgrades else "pass"}