242 changes: 98 additions & 144 deletions README.md
@@ -1,146 +1,100 @@
# LiteLLM Benchmarking System

## Purpose

This project provides a local-first benchmarking system for comparing provider, model, harness, and harness-configuration performance through a shared LiteLLM proxy.

The system is built for interactive terminal agents and IDE agents that can be pointed at a custom inference base URL. The benchmark application does not own the harness runtime. It owns session registration, correlation, collection, normalization, storage, reporting, and dashboards.

## What the system answers

The completed system should make it easy to answer questions such as:

- Which provider and model combination is fastest for the same task card and harness?
- How does Claude Code compare with Codex, OpenCode, OpenHands, Gemini-oriented clients, or other agent harnesses when routed through the same local proxy?
- Does a harness configuration change improve TTFT, total latency, output throughput, error rate, or cache behavior?
- Does a provider-specific routing change improve session-level performance?
- How much variance exists between repeated sessions of the same benchmark variant?

## Recommended local stack

Use Docker Compose for infrastructure and `uv` for the benchmark application.

Infrastructure services:

- LiteLLM proxy
- PostgreSQL
- Prometheus
- Grafana

Benchmark application capabilities:

- config loading and validation
- experiment, variant, and session registry
- session credential issuance
- harness env rendering
- LiteLLM request collection and normalization
- Prometheus metric collection and rollups
- query API and exports
- dashboards and reports

## Core design choices

1. LiteLLM is the single shared proxy and routing layer.
2. Every interactive benchmark session gets a benchmark-owned session ID.
3. Session correlation is built around a session-scoped proxy credential plus benchmark tags.
4. The project stores canonical benchmark records in a project-owned database.
5. LiteLLM and Prometheus are telemetry sources, not the canonical query model.
6. Prompt and response content are disabled by default.
7. The benchmark application stays harness-agnostic in its core path.
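The session-credential idea in points 2 and 3 can be sketched as follows. This is a minimal illustration of building a payload for LiteLLM's key-generation endpoint; the exact field names (`key_alias`, `metadata`) follow LiteLLM's key-management API and should be verified against your proxy version.

```python
import json
import uuid


def issue_session_credential(session_id: str) -> dict:
    """Build a LiteLLM /key/generate payload for a session-scoped key.

    The correlation key travels in the key's metadata so every request
    made with this credential can be matched back to the session.
    """
    return {
        "key_alias": f"bench-session-{session_id}",
        "metadata": {
            "benchmark_session_id": session_id,  # correlation key
            "source": "benchmark-core",
        },
    }


payload = issue_session_credential(str(uuid.uuid4()))
print(json.dumps(payload, indent=2))
```

Because the credential is scoped to one session, revoking it after the run also bounds the blast radius of a leaked key.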

## Primary workflow

1. Define providers, harness profiles, variants, experiments, and task cards in versioned config files.
2. Create a benchmark session for a chosen variant and task card.
3. The session manager issues a session-scoped proxy credential and renders the exact environment snippet for the selected harness.
4. Launch the harness manually and use it interactively against the local LiteLLM proxy.
5. LiteLLM emits request data and Prometheus metrics while the benchmark app captures benchmark metadata.
6. Collectors normalize request- and session-level data into the project database.
7. Reports and dashboards compare sessions, variants, providers, models, and harnesses.
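Step 3's "environment snippet" can be sketched like this. The variable names (`OPENAI_BASE_URL`, `OPENAI_API_KEY`) are illustrative assumptions — each harness documents its own base-URL and key variables, and the real renderer is harness-profile driven.

```python
def render_harness_env(base_url: str, session_key: str) -> str:
    """Render the environment snippet a harness needs to hit the proxy.

    Variable names are hypothetical; a real harness profile would
    supply the names its harness actually reads.
    """
    lines = [
        f"export OPENAI_BASE_URL={base_url}",    # point the harness at the local proxy
        f"export OPENAI_API_KEY={session_key}",  # session-scoped LiteLLM key
    ]
    return "\n".join(lines)


snippet = render_harness_env("http://localhost:4000", "sk-session-abc123")
print(snippet)
```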

## Repository layout

```text
.
├── AGENTS.md
├── README.md
├── pyproject.toml
├── Makefile
├── docker-compose.yml
├── .env.example
├── configs/
│   ├── litellm/
│   ├── prometheus/
│   ├── grafana/
│   ├── providers/
│   ├── harnesses/
│   ├── variants/
│   ├── experiments/
│   └── task-cards/
├── dashboards/
├── docs/
│   ├── architecture.md
│   ├── benchmark-methodology.md
│   ├── config-and-contracts.md
│   ├── data-model-and-observability.md
│   ├── implementation-plan.md
│   ├── references.md
│   └── security-and-operations.md
├── skills/
│   └── convert-tasks-to-linear/
│       └── SKILL.md
├── src/
│   ├── benchmark_core/
│   ├── cli/
│   ├── collectors/
│   ├── reporting/
│   └── api/
└── tests/
```
# Benchmark Core

A harness-agnostic benchmarking system for comparing providers, models, and harnesses through a local LiteLLM proxy.

## Architecture

- **LiteLLM as the single inference gateway** - all benchmark traffic routes through the local proxy
- **Session-scoped correlation** - every session gets unique correlation keys for traffic matching
- **Canonical data model** - normalized storage for cross-harness comparisons

## Project Structure

```
src/
├── benchmark_core/            # Core domain logic
│   ├── models.py              # Canonical domain models
│   ├── config.py              # Pydantic settings
│   ├── db/
│   │   ├── connection.py      # SQLAlchemy async engine
│   │   └── models.py          # ORM models with FKs
│   ├── repositories/          # Data access layer (9 repositories)
│   └── services/              # Business logic layer
├── collectors/                # Data ingestion
│   ├── litellm_collector.py
│   ├── normalizer.py
│   ├── rollups.py
│   └── prometheus_collector.py
migrations/                    # Alembic migrations
tests/                         # Unit and integration tests
```

## Canonical Entities

- `provider` - Upstream inference provider definition
- `harness_profile` - How a harness is configured to talk to the proxy
- `variant` - Benchmarkable combination of provider/model/harness
- `experiment` - Named comparison grouping
- `task_card` - Benchmark task definition
- `session` - Interactive benchmark execution
- `request` - Normalized LLM call
- `metric_rollup` - Derived latency/throughput metrics
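The relationships among these entities can be sketched with plain dataclasses. Field names here are illustrative, not the project's actual `models.py` definitions; the point is the foreign-key-style linkage from session to variant.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Variant:
    """A benchmarkable provider/model/harness combination (illustrative fields)."""
    provider: str
    model: str
    harness_profile: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class Session:
    """One interactive benchmark execution, tied to a variant and task card."""
    variant_id: str
    task_card: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


variant = Variant("openrouter", "claude-sonnet", "claude-code-default")
session = Session(variant.id, "refactor-small-module")
```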

## Quick Start

```bash
# Install dependencies
pip install -e ".[dev]"

# Run migrations
alembic upgrade head

# Run tests
pytest tests/ -v
```

## Documentation map

- `AGENTS.md`
  - persistent project context for coding agents
  - architectural invariants
  - delivery and testing rules
- `docs/architecture.md`
  - system components
  - data flow
  - deployment boundaries
- `docs/benchmark-methodology.md`
  - how to run comparable interactive benchmark sessions
  - metric definitions and confounder controls
- `docs/config-and-contracts.md`
  - config schemas
  - session and CLI contracts
  - normalization contracts
- `docs/data-model-and-observability.md`
  - canonical entities
  - storage model
  - derived metrics
- `docs/security-and-operations.md`
  - local security posture
  - redaction, retention, and secrets
  - operator safeguards
- `docs/implementation-plan.md`
  - parent issues and sub-issues
  - Definition of Ready information
  - acceptance criteria and test plans
- `docs/references.md`
  - external references that shaped the design
- `skills/convert-tasks-to-linear/SKILL.md`
  - reusable instructions for converting a markdown implementation plan into Linear parent issues and sub-issues

## MVP success criteria

The MVP is complete when a developer can:

1. start LiteLLM, Postgres, Prometheus, and Grafana locally with one command
2. validate provider, harness profile, variant, experiment, and task-card configs
3. create a session for a specific benchmark variant
4. receive a session-specific environment snippet for a chosen harness
5. run the harness interactively against the proxy
6. collect and normalize request- and session-level data into the benchmark database
7. view live metrics in Grafana and historical comparisons in the benchmark app
8. export structured comparison results for providers, models, harnesses, and harness configurations

## Database Schema

All tables use UUID primary keys, with foreign keys linking the entities below:

- `providers` - Inference providers
- `harness_profiles` - Harness connection configs
- `variants` - Provider + model + harness combinations
- `experiments` - Named comparison groups
- `task_cards` - Benchmark work definitions
- `sessions` - Interactive execution records
- `requests` - Normalized LLM calls
- `metric_rollups` - Aggregated statistics
- `artifacts` - Exported bundles
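The UUID-primary-key plus enforced-foreign-key shape can be demonstrated with stdlib `sqlite3` (the real schema lives in the SQLAlchemy models and Alembic migrations; table and column names here are a simplified sketch):

```python
import sqlite3
import uuid

# In-memory sketch of the providers -> variants FK relationship.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs this opt-in
conn.execute("CREATE TABLE providers (id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE variants (
           id TEXT PRIMARY KEY,
           provider_id TEXT NOT NULL REFERENCES providers(id),
           model TEXT NOT NULL
       )"""
)

provider_id = str(uuid.uuid4())
conn.execute("INSERT INTO providers VALUES (?, ?)", (provider_id, "openrouter"))
conn.execute(
    "INSERT INTO variants VALUES (?, ?, ?)",
    (str(uuid.uuid4()), provider_id, "claude-sonnet"),
)

# FK enforcement: a variant pointing at an unknown provider is rejected.
try:
    conn.execute(
        "INSERT INTO variants VALUES (?, ?, ?)",
        (str(uuid.uuid4()), "missing-provider", "gpt-4o"),
    )
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```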

## Collectors

### LiteLLM Collector

Ingests raw request records from LiteLLM:
- Duplicate detection via `litellm_call_id`
- Correlation key extraction from tags
- Missing field diagnostics
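A minimal sketch of the deduplication and missing-field diagnostics, assuming raw records are dicts carrying a `litellm_call_id` key (the real collector's record shape and diagnostics are richer):

```python
def ingest(raw_records: list, seen: set) -> list:
    """Accept raw LiteLLM records, skipping duplicates and flagging gaps."""
    accepted = []
    for rec in raw_records:
        call_id = rec.get("litellm_call_id")
        if call_id is None:
            print("diagnostic: record missing litellm_call_id")
            continue
        if call_id in seen:
            continue  # duplicate delivery, already ingested
        seen.add(call_id)
        accepted.append(rec)
    return accepted


seen_ids: set = set()
batch = [
    {"litellm_call_id": "a1", "model": "claude-sonnet"},
    {"litellm_call_id": "a1", "model": "claude-sonnet"},  # duplicate
    {"model": "gpt-4o"},  # missing call id -> diagnostic
]
accepted = ingest(batch, seen_ids)
print(len(accepted))  # 1
```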

### Request Normalizer

Maps raw requests to canonical format:
- Session/variant joining
- Canonical field validation
- Unmapped row surfacing
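The session join and unmapped-row surfacing can be sketched like this, assuming the correlation key lands in the raw record's metadata (field names are illustrative):

```python
def normalize(raw: dict, sessions_by_key: dict):
    """Join a raw request to its session via correlation key.

    Returns (canonical_row, None) on success, (None, raw) when the
    row cannot be mapped, so unmapped rows are surfaced rather than dropped.
    """
    key = raw.get("metadata", {}).get("benchmark_session_id")
    session = sessions_by_key.get(key)
    if session is None:
        return None, raw
    canonical = {
        "session_id": session["id"],
        "variant_id": session["variant_id"],
        "model": raw["model"],
        "latency_ms": raw["latency_ms"],
    }
    return canonical, None


sessions = {"s-123": {"id": "s-123", "variant_id": "v-1"}}
ok, _ = normalize(
    {"metadata": {"benchmark_session_id": "s-123"},
     "model": "claude-sonnet", "latency_ms": 850},
    sessions,
)
_, unmapped = normalize(
    {"metadata": {}, "model": "gpt-4o", "latency_ms": 400}, sessions
)
```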

### Metric Rollups

Computes aggregated statistics:
- Request-level: latency, TTFT, tokens/sec
- Session-level: request_count, success_rate, median/p95 latency
- Variant-level: session_count, session_success_rate
- Experiment-level: variant comparison

## Configuration

See `docs/config-and-contracts.md` for configuration schema.

## License

Internal use only.
43 changes: 43 additions & 0 deletions alembic.ini
@@ -0,0 +1,43 @@
# A generic, single database configuration.

[alembic]
script_location = migrations
prepend_sys_path = .
version_path_separator = os
sqlalchemy.url = postgresql+asyncpg://postgres:postgres@localhost:5432/benchmark

[post_write_hooks]

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
Binary file added migrations/__pycache__/env.cpython-312.pyc
Binary file not shown.
70 changes: 70 additions & 0 deletions migrations/env.py
@@ -0,0 +1,70 @@
"""Alembic environment configuration for async migrations."""
import asyncio
from logging.config import fileConfig

from alembic import context
from sqlalchemy import pool
from sqlalchemy.engine import Connection
from sqlalchemy.ext.asyncio import async_engine_from_config

from benchmark_core.db.connection import Base
from benchmark_core.db.models import ( # noqa: F401 - imported for model registration
ArtifactModel,
ExperimentModel,
HarnessProfileModel,
MetricRollupModel,
ProviderModel,
RequestModel,
SessionModel,
TaskCardModel,
VariantModel,
)

config = context.config

if config.config_file_name is not None:
fileConfig(config.config_file_name)

target_metadata = Base.metadata


def run_migrations_offline() -> None:
"""Run migrations in 'offline' mode."""
url = config.get_main_option("sqlalchemy.url")
context.configure(
url=url,
target_metadata=target_metadata,
literal_binds=True,
dialect_opts={"paramstyle": "named"},
)
with context.begin_transaction():
context.run_migrations()


def do_run_migrations(connection: Connection) -> None:
context.configure(connection=connection, target_metadata=target_metadata)
with context.begin_transaction():
context.run_migrations()


async def run_async_migrations() -> None:
"""Run migrations in async mode."""
connectable = async_engine_from_config(
config.get_section(config.config_ini_section, {}),
prefix="sqlalchemy.",
poolclass=pool.NullPool,
)
async with connectable.connect() as connection:
await connection.run_sync(do_run_migrations)
await connectable.dispose()


def run_migrations_online() -> None:
"""Run migrations in 'online' mode."""
asyncio.run(run_async_migrations())


if context.is_offline_mode():
run_migrations_offline()
else:
run_migrations_online()
26 changes: 26 additions & 0 deletions migrations/script.py.mako
@@ -0,0 +1,26 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}

"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    ${downgrades if downgrades else "pass"}