247 changes: 247 additions & 0 deletions AGENTS.md
@@ -0,0 +1,247 @@
# AGENTS.md — Agent-Diff Developer Guide

## Project Overview

Agent-Diff is a benchmarking platform for evaluating AI agents that interact with
real-world SaaS APIs (Slack, Linear, Box, Google Calendar). It provides **isolated,
reproducible environments** backed by PostgreSQL schema cloning.

## Architecture

```
┌──────────────────────────┐ ┌──────────────────────┐
│ Evaluation Client │ │ Agent Sandbox │
│ (prime eval / SDK) │──────▶│ (Docker container) │
│ │ │ │
│ 1. initEnv │ │ Runs agent code │
│ 2. startRun │ │ Makes API calls ──┐ │
│ 3. evaluateRun │ └────────────────────┼─┘
│ 4. getResults │ │
└──────────┬───────────────┘ │
│ │
▼ ▼
┌──────────────────────────────────────────────────────────┐
│ AgentDiff Backend (FastAPI/Starlette) │
│ │
│ Platform API (/api/platform/*) │
│ - initEnv, startRun, evaluateRun, diffRun │
│ - Template & test suite management │
│ │
│ Service APIs (/api/env/{env_id}/services/{service}/*) │
│ - Box REST API replica (/services/box/2.0/*) │
│ - Slack API replica (/services/slack/*) │
│ - Linear GraphQL replica (/services/linear/*) │
│ - Calendar API replica (/services/calendar/*) │
│ │
│ Middleware: │
│ PlatformMiddleware → API key auth for platform calls │
│ IsolationMiddleware → per-env DB session + auth │
└──────────────────────────────────────────────────────────┘
```

## Environment Lifecycle

### 1. Create an Isolated Environment (initEnv)

Every evaluation starts by creating an isolated copy of a template database schema.

**Via SDK (Python):**
```python
from agent_diff import AgentDiff

client = AgentDiff(
api_key="ad_live_sk_...",
base_url="https://api.agentdiff.dev", # or http://localhost:8000
)

env = client.init_env(
templateService="box", # "box" | "linear" | "slack" | "calendar"
templateName="box_default", # name of the seeded template
impersonateUserId="27512847635", # user ID from the seed data
)
# env.environmentId → hex string, e.g. "824d0c408eeb42368f20e24d2d9f03c3"
# env.environmentUrl → "/api/env/{env_id}/services/box"
```

**Via curl:**
```bash
curl -X POST https://api.agentdiff.dev/api/platform/initEnv \
-H "X-API-Key: ad_live_sk_..." \
-H "Content-Type: application/json" \
-d '{
"templateService": "box",
"templateName": "box_default",
"impersonateUserId": "27512847635"
}'
```

**What happens internally:**
1. `templateManager.resolve_init_template()` finds the template by service+name
2. `CoreIsolationEngine.create_environment()` clones the template PostgreSQL schema
3. A new `state_<uuid>` schema is created with all tables and data copied
4. A `RunTimeEnvironment` record is stored in the meta schema with TTL
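
Step 3 is conceptually just "copy every table from the template schema into a fresh `state_<uuid>` schema". A minimal sketch of that idea, assuming a plain SQLAlchemy engine (the `clone_template_schema` helper is illustrative only, not the actual `CoreIsolationEngine` code):

```python
# Illustrative sketch only -- the real logic lives in CoreIsolationEngine.create_environment().
import uuid

from sqlalchemy import create_engine, text


def clone_template_schema(database_url: str, template_schema: str) -> str:
    """Copy every table (structure + rows) from template_schema into a new state_<uuid> schema."""
    new_schema = f"state_{uuid.uuid4().hex}"
    engine = create_engine(database_url)
    with engine.begin() as conn:
        conn.execute(text(f'CREATE SCHEMA "{new_schema}"'))
        tables = conn.execute(
            text(
                "SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = :schema AND table_type = 'BASE TABLE'"
            ),
            {"schema": template_schema},
        ).scalars().all()
        for table in tables:
            # Recreate the table definition, then bulk-copy the seed rows.
            conn.execute(text(
                f'CREATE TABLE "{new_schema}"."{table}" '
                f'(LIKE "{template_schema}"."{table}" INCLUDING ALL)'
            ))
            conn.execute(text(
                f'INSERT INTO "{new_schema}"."{table}" '
                f'SELECT * FROM "{template_schema}"."{table}"'
            ))
    return new_schema
```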

### 2. Make API Calls Against the Environment

Once the environment is created, API calls go to the service replica endpoints:

```
Base URL: {base_url}/api/env/{env_id}/services/{service}

Box: /api/env/{env_id}/services/box/2.0/search?query=fomc
Linear: /api/env/{env_id}/services/linear/graphql
Slack: /api/env/{env_id}/services/slack/conversations.list
Calendar: /api/env/{env_id}/services/calendar/calendars/{calendarId}/events
```
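
For example, a raw HTTP call against the Box replica of an environment could look like this sketch (it assumes the service routes accept the same `X-API-Key` header as the platform API, and reuses the environment ID from the `initEnv` example above):

```python
import httpx

BASE_URL = "https://api.agentdiff.dev"
ENV_ID = "824d0c408eeb42368f20e24d2d9f03c3"  # env.environmentId returned by initEnv

resp = httpx.get(
    f"{BASE_URL}/api/env/{ENV_ID}/services/box/2.0/search",
    params={"query": "fomc"},
    headers={"X-API-Key": "ad_live_sk_..."},
)
resp.raise_for_status()
print(resp.json())
```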

Each request goes through `IsolationMiddleware`, which:
1. Validates the API key via the control plane (`get_principal_id`)
2. Looks up the environment in meta DB to get impersonate_user_id
3. Opens a DB session scoped to the environment's `state_<uuid>` schema
4. Passes the request to the service route handler
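
On the handler side, everything the middleware resolves is available on `request.state`. A minimal sketch of a route handler consuming it (the `whoami` endpoint is hypothetical, not part of any replica):

```python
from starlette.requests import Request
from starlette.responses import JSONResponse


async def whoami(request: Request) -> JSONResponse:
    # All of these are set by IsolationMiddleware before the handler runs.
    session = request.state.db_session            # SQLAlchemy session bound to state_<uuid>
    env_id = request.state.environment_id
    user_id = request.state.impersonate_user_id
    return JSONResponse({"environment_id": env_id, "impersonating": user_id})
```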

### 3. Start a Run & Evaluate

```python
run = client.start_run(envId=env.environmentId)
# ... agent makes API calls that modify the environment ...
result = client.evaluate_run(runId=run.runId, expectedOutput={...})
results = client.get_results_for_run(runId=run.runId)
```

### 4. Cleanup

```python
client.delete_env(envId=env.environmentId)
```

## Available Templates

| Service | Template Name | Impersonate User ID |
|----------|-------------------|----------------------------------------|
| box | box_default | 27512847635 |
| linear | linear_default | 2790a7ee-fde0-4537-9588-e233aa5a68d1 |
| slack | slack_default | U01AGENBOT9 |
| calendar | calendar_base | (varies by seed) |

## Writing Tests

### Integration Tests (in-process, no HTTP server)

Tests create environments via `core_isolation_engine.create_environment()` and
wire up an `AsyncClient` with middleware that injects the DB session:

```python
import pytest_asyncio
from httpx import ASGITransport, AsyncClient
from starlette.applications import Starlette


@pytest_asyncio.fixture
async def box_client(test_user_id, core_isolation_engine, session_manager, environment_handler):
    # Clone the box_default template into a fresh, isolated schema for this test.
    env_result = core_isolation_engine.create_environment(
        template_schema="box_default",
        ttl_seconds=3600,
        created_by=test_user_id,
        impersonate_user_id="27512847635",
    )

    # Stand-in for IsolationMiddleware: inject the env-scoped DB session and identity.
    async def add_db_session(request, call_next):
        with session_manager.with_session_for_environment(env_result.environment_id) as session:
            request.state.db_session = session
            request.state.environment_id = env_result.environment_id
            request.state.impersonate_user_id = "27512847635"
            request.state.impersonate_email = None
            response = await call_next(request)
            return response

    from src.services.box.api.routes import routes as box_routes

    app = Starlette(routes=box_routes)
    app.middleware("http")(add_db_session)

    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        yield client

    environment_handler.drop_schema(env_result.schema_name)
```
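
A test consuming this fixture might look like the sketch below (the `/2.0/search` path and the `entries` field are assumed from the Box replica URLs above; `pytest.mark.asyncio` assumes the usual pytest-asyncio configuration):

```python
import pytest


@pytest.mark.asyncio
async def test_box_search_returns_entries(box_client):
    resp = await box_client.get("/2.0/search", params={"query": "fomc"})
    assert resp.status_code == 200
    assert "entries" in resp.json()
```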

### Running Tests

```bash
cd backend
# Requires DATABASE_URL in .env or environment
pytest tests/performance/test_box_bench_perf.py -v -s
pytest tests/integration/ -v
```

## Running Evaluations Locally

```bash
# 1. Activate the bench environment's venv
source third_party/prime-environments/environments/agent_diff_bench/.venv/bin/activate

# 2. Install the environment package
cd third_party/prime-environments/environments/agent_diff_bench
uv pip install -e .

# 3. Run evaluation (from the agent_diff_bench directory)
uv run prime eval run agent-diff-bench \
-m "openai/gpt-5-mini" \
-n 5 -r 3 -s \
-a '{"agentdiff_api_key": "ad_live_sk_..."}'
```

Results are saved to: `third_party/prime-environments/environments/agent_diff_bench/eval_results/`

## Database Seeding

Templates are seeded from JSON files in `backend/seeds/` (Docker) or `examples/` (local).

Seed scripts in `backend/utils/`:
- `seed_box_template.py` — creates box_default, box_base templates
- `seed_linear_template.py` — creates linear_default, linear_base, linear_expanded
- `seed_slack_template.py` — creates slack_default, slack_bench_default
- `seed_calendar_template.py` — creates calendar_base
- `seed_tests.py` — loads test suite JSON files

On Railway, seeding runs automatically on deploy when the `SEED=true` env var is set.
The Dockerfile startup script runs Alembic migrations then all seed scripts.

## Performance Profiling

Performance-critical paths are instrumented with `[PERF]` log lines:

- **Middleware**: `[PERF] GET /api/env/.../services/box/... total=Xms auth=Xms meta_db=Xms handler=Xms`
- **Box operations**: `[PERF] search_content TOTAL=Xms`, `[PERF] get_folder_by_id(...) time=Xms`
- **Box schema**: `[PERF] File._get_path_collection depth=N time=Xms`
- **Calendar**: `[PERF] Calendar events_list took Xms`

Filter with: `grep "\[PERF\]"` in Railway logs.
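
For a quick offline summary, something like this sketch works on saved log output (illustrative only; the regex assumes the exact `service=... handler=Nms` format emitted by `IsolationMiddleware`):

```python
# Aggregate per-service handler timings from [PERF] log lines piped on stdin.
import re
import sys
from collections import defaultdict

PERF_RE = re.compile(r"\[PERF\] .*service=(\S+) .*handler=(\d+)ms")

timings: dict[str, list[int]] = defaultdict(list)
for line in sys.stdin:
    match = PERF_RE.search(line)
    if match:
        timings[match.group(1)].append(int(match.group(2)))

for service, ms in sorted(timings.items()):
    print(f"{service}: n={len(ms)} avg={sum(ms) / len(ms):.0f}ms max={max(ms)}ms")
```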

## Key Directories

```
backend/
src/
platform/ # Platform API (initEnv, runs, evaluation)
services/
box/ # Box API replica
slack/ # Slack API replica
linear/ # Linear API replica
calendar/ # Calendar API replica
tests/
integration/ # Full-stack integration tests
performance/ # Performance/benchmark tests
validation/ # API parity tests
unit/ # Unit tests
utils/ # Seed scripts
seeds/ # Seed data JSON files (for Docker)

sdk/agent-diff-python/ # Python SDK (agent_diff package)

examples/
box/ # Box seed data + test suites
linear/ # Linear seed data + test suites
slack/ # Slack seed data + test suites
calendar/ # Calendar seed data

third_party/prime-environments/environments/agent_diff_bench/
agent_diff_bench.py # Entry point for prime eval
src/environment.py # Environment setup (initEnv, startRun, etc.)
```
23 changes: 22 additions & 1 deletion backend/src/platform/api/middleware.py
@@ -1,6 +1,7 @@
from __future__ import annotations

import logging
import time

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
@@ -86,6 +87,8 @@ async def dispatch(self, request: Request, call_next) -> Response:
if not path.startswith("/api/env/"):
return await call_next(request)

t_total_start = time.perf_counter()

try:
path_after_prefix = path[len("/api/env/") :]
env_id = path_after_prefix.split("/")[0] if path_after_prefix else ""
@@ -106,8 +109,11 @@
status_code=status.HTTP_401_UNAUTHORIZED,
)

t_auth_start = time.perf_counter()
principal_id = await get_principal_id(api_key_hdr, action="api_request")
t_auth_ms = (time.perf_counter() - t_auth_start) * 1000

t_meta_start = time.perf_counter()
with self.session_manager.with_meta_session() as meta_session:
request.state.principal_id = principal_id

@@ -125,11 +131,26 @@
logger.debug(
f"Could not load impersonation data for env {env_id}: {e}"
)
t_meta_ms = (time.perf_counter() - t_meta_start) * 1000

t_handler_start = time.perf_counter()
with self.session_manager.with_session_for_environment(env_id) as session:
request.state.db_session = session
request.state.environment_id = env_id
return await call_next(request)
response = await call_next(request)
t_handler_ms = (time.perf_counter() - t_handler_start) * 1000

t_total_ms = (time.perf_counter() - t_total_start) * 1000
# Extract service from path for easier log filtering
parts = path_after_prefix.split("/")
service_name = parts[2] if len(parts) > 2 else "unknown"
logger.info(
f"[PERF] {request.method} {path} | service={service_name} "
f"total={t_total_ms:.0f}ms auth={t_auth_ms:.0f}ms "
f"meta_db={t_meta_ms:.0f}ms handler={t_handler_ms:.0f}ms "
f"status={response.status_code}"
)
return response

except PermissionError as exc:
return JSONResponse(
@@ -0,0 +1,57 @@
"""Add composite indexes for calendar event queries

Adds composite indexes on calendar_events to optimize the most common
query patterns: time-range filtering, status filtering, and sync-token
incremental queries.

Revision ID: a1b2c3d4e5f6
Revises: merge_heads_20260130
Create Date: 2026-02-11 12:00:00.000000

"""

from typing import Sequence, Union

from alembic import op


# revision identifiers, used by Alembic.
revision: str = "a1b2c3d4e5f6"
down_revision: Union[str, None] = "merge_heads_20260130"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
# Composite index for the most common list_events query pattern:
# WHERE calendar_id = X AND status != 'cancelled' AND start_datetime < Y
op.create_index(
"ix_event_cal_status_start",
"calendar_events",
["calendar_id", "status", "start_datetime"],
unique=False,
)

# Composite index for time-range queries (list_events with timeMin/timeMax, freebusy):
# WHERE calendar_id = X AND start_datetime >= Y AND end_datetime <= Z
op.create_index(
"ix_event_cal_start_end",
"calendar_events",
["calendar_id", "start_datetime", "end_datetime"],
unique=False,
)

# Composite index for sync-token incremental queries:
# WHERE calendar_id = X AND updated_at > Y
op.create_index(
"ix_event_cal_updated",
"calendar_events",
["calendar_id", "updated_at"],
unique=False,
)


def downgrade() -> None:
op.drop_index("ix_event_cal_updated", table_name="calendar_events")
op.drop_index("ix_event_cal_start_end", table_name="calendar_events")
op.drop_index("ix_event_cal_status_start", table_name="calendar_events")