247 changes: 247 additions & 0 deletions AGENTS.md
@@ -0,0 +1,247 @@
# AGENTS.md — Agent-Diff Developer Guide

## Project Overview

Agent-Diff is a benchmarking platform for evaluating AI agents that interact with
real-world SaaS APIs (Slack, Linear, Box, Google Calendar). It provides **isolated,
reproducible environments** backed by PostgreSQL schema cloning.

## Architecture

```
┌──────────────────────────┐ ┌──────────────────────┐
│ Evaluation Client │ │ Agent Sandbox │
│ (prime eval / SDK) │──────▶│ (Docker container) │
│ │ │ │
│ 1. initEnv │ │ Runs agent code │
│ 2. startRun │ │ Makes API calls ──┐ │
│ 3. evaluateRun │ └────────────────────┼─┘
│ 4. getResults │ │
└──────────┬───────────────┘ │
│ │
▼ ▼
┌──────────────────────────────────────────────────────────┐
│ AgentDiff Backend (FastAPI/Starlette) │
│ │
│ Platform API (/api/platform/*) │
│ - initEnv, startRun, evaluateRun, diffRun │
│ - Template & test suite management │
│ │
│ Service APIs (/api/env/{env_id}/services/{service}/*) │
│ - Box REST API replica (/services/box/2.0/*) │
│ - Slack API replica (/services/slack/*) │
│ - Linear GraphQL replica (/services/linear/*) │
│ - Calendar API replica (/services/calendar/*) │
│ │
│ Middleware: │
│ PlatformMiddleware → API key auth for platform calls │
│ IsolationMiddleware → per-env DB session + auth │
└──────────────────────────────────────────────────────────┘
```

## Environment Lifecycle

### 1. Create an Isolated Environment (initEnv)

Every evaluation starts by creating an isolated copy of a template database schema.

**Via SDK (Python):**
```python
from agent_diff import AgentDiff

client = AgentDiff(
api_key="ad_live_sk_...",
base_url="https://api.agentdiff.dev", # or http://localhost:8000
)

env = client.init_env(
templateService="box", # "box" | "linear" | "slack" | "calendar"
templateName="box_default", # name of the seeded template
impersonateUserId="27512847635", # user ID from the seed data
)
# env.environmentId → hex string, e.g. "824d0c408eeb42368f20e24d2d9f03c3"
# env.environmentUrl → "/api/env/{env_id}/services/box"
```

**Via curl:**
```bash
curl -X POST https://api.agentdiff.dev/api/platform/initEnv \
-H "X-API-Key: ad_live_sk_..." \
-H "Content-Type: application/json" \
-d '{
"templateService": "box",
"templateName": "box_default",
"impersonateUserId": "27512847635"
}'
```

**What happens internally:**
1. `templateManager.resolve_init_template()` finds the template by service+name
2. `CoreIsolationEngine.create_environment()` clones the template PostgreSQL schema
3. A new `state_<uuid>` schema is created with all tables and data copied
4. A `RunTimeEnvironment` record is stored in the meta schema with TTL
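
Step 3 is conceptually just "copy every table from the template schema into a fresh `state_<uuid>` schema". A minimal sketch of that idea, assuming a plain SQLAlchemy engine (the `clone_template_schema` helper is illustrative only, not the actual `CoreIsolationEngine` code):

```python
# Illustrative sketch only -- the real logic lives in CoreIsolationEngine.create_environment().
import uuid

from sqlalchemy import create_engine, text


def clone_template_schema(database_url: str, template_schema: str) -> str:
    """Copy every table (structure + rows) from template_schema into a new state_<uuid> schema."""
    new_schema = f"state_{uuid.uuid4().hex}"
    engine = create_engine(database_url)
    with engine.begin() as conn:
        conn.execute(text(f'CREATE SCHEMA "{new_schema}"'))
        tables = conn.execute(
            text(
                "SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = :schema AND table_type = 'BASE TABLE'"
            ),
            {"schema": template_schema},
        ).scalars().all()
        for table in tables:
            # Recreate the table definition, then bulk-copy the seed rows.
            conn.execute(text(
                f'CREATE TABLE "{new_schema}"."{table}" '
                f'(LIKE "{template_schema}"."{table}" INCLUDING ALL)'
            ))
            conn.execute(text(
                f'INSERT INTO "{new_schema}"."{table}" '
                f'SELECT * FROM "{template_schema}"."{table}"'
            ))
    return new_schema
```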

### 2. Make API Calls Against the Environment

Once the environment is created, API calls go to the service replica endpoints:

```
Base URL: {base_url}/api/env/{env_id}/services/{service}

Box: /api/env/{env_id}/services/box/2.0/search?query=fomc
Linear: /api/env/{env_id}/services/linear/graphql
Slack: /api/env/{env_id}/services/slack/conversations.list
Calendar: /api/env/{env_id}/services/calendar/calendars/{calendarId}/events
```
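
For example, a raw HTTP call against the Box replica of an environment could look like this sketch (it assumes the service routes accept the same `X-API-Key` header as the platform API, and reuses the environment ID from the `initEnv` example above):

```python
import httpx

BASE_URL = "https://api.agentdiff.dev"
ENV_ID = "824d0c408eeb42368f20e24d2d9f03c3"  # env.environmentId returned by initEnv

resp = httpx.get(
    f"{BASE_URL}/api/env/{ENV_ID}/services/box/2.0/search",
    params={"query": "fomc"},
    headers={"X-API-Key": "ad_live_sk_..."},
)
resp.raise_for_status()
print(resp.json())
```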

Each request goes through `IsolationMiddleware`, which:
1. Validates the API key via the control plane (`get_principal_id`)
2. Looks up the environment in meta DB to get impersonate_user_id
3. Opens a DB session scoped to the environment's `state_<uuid>` schema
4. Passes the request to the service route handler
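
On the handler side, everything the middleware resolves is available on `request.state`. A minimal sketch of a route handler consuming it (the `whoami` endpoint is hypothetical, not part of any replica):

```python
from starlette.requests import Request
from starlette.responses import JSONResponse


async def whoami(request: Request) -> JSONResponse:
    # All of these are set by IsolationMiddleware before the handler runs.
    session = request.state.db_session            # SQLAlchemy session bound to state_<uuid>
    env_id = request.state.environment_id
    user_id = request.state.impersonate_user_id
    return JSONResponse({"environment_id": env_id, "impersonating": user_id})
```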

### 3. Start a Run & Evaluate

```python
run = client.start_run(envId=env.environmentId)
# ... agent makes API calls that modify the environment ...
result = client.evaluate_run(runId=run.runId, expectedOutput={...})
results = client.get_results_for_run(runId=run.runId)
```

### 4. Cleanup

```python
client.delete_env(envId=env.environmentId)
```

## Available Templates

| Service | Template Name | Impersonate User ID |
|----------|-------------------|----------------------------------------|
| box | box_default | 27512847635 |
| linear | linear_default | 2790a7ee-fde0-4537-9588-e233aa5a68d1 |
| slack | slack_default | U01AGENBOT9 |
| calendar | calendar_base | (varies by seed) |

## Writing Tests

### Integration Tests (in-process, no HTTP server)

Tests create environments via `core_isolation_engine.create_environment()` and
wire up an `AsyncClient` with middleware that injects the DB session:

```python
import pytest_asyncio
from httpx import ASGITransport, AsyncClient
from starlette.applications import Starlette


@pytest_asyncio.fixture
async def box_client(test_user_id, core_isolation_engine, session_manager, environment_handler):
    # Clone the box_default template into a fresh, isolated schema for this test.
    env_result = core_isolation_engine.create_environment(
        template_schema="box_default",
        ttl_seconds=3600,
        created_by=test_user_id,
        impersonate_user_id="27512847635",
    )

    # Stand-in for IsolationMiddleware: inject the env-scoped DB session and identity.
    async def add_db_session(request, call_next):
        with session_manager.with_session_for_environment(env_result.environment_id) as session:
            request.state.db_session = session
            request.state.environment_id = env_result.environment_id
            request.state.impersonate_user_id = "27512847635"
            request.state.impersonate_email = None
            response = await call_next(request)
            return response

    from src.services.box.api.routes import routes as box_routes

    app = Starlette(routes=box_routes)
    app.middleware("http")(add_db_session)

    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        yield client

    environment_handler.drop_schema(env_result.schema_name)
```
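
A test consuming this fixture might look like the sketch below (the `/2.0/search` path and the `entries` field are assumed from the Box replica URLs above; `pytest.mark.asyncio` assumes the usual pytest-asyncio configuration):

```python
import pytest


@pytest.mark.asyncio
async def test_box_search_returns_entries(box_client):
    resp = await box_client.get("/2.0/search", params={"query": "fomc"})
    assert resp.status_code == 200
    assert "entries" in resp.json()
```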

### Running Tests

```bash
cd backend
# Requires DATABASE_URL in .env or environment
pytest tests/performance/test_box_bench_perf.py -v -s
pytest tests/integration/ -v
```

## Running Evaluations Locally

```bash
# 1. Activate the bench environment's venv
source third_party/prime-environments/environments/agent_diff_bench/.venv/bin/activate

# 2. Install the environment package
cd third_party/prime-environments/environments/agent_diff_bench
uv pip install -e .

# 3. Run evaluation (from the agent_diff_bench directory)
uv run prime eval run agent-diff-bench \
-m "openai/gpt-5-mini" \
-n 5 -r 3 -s \
-a '{"agentdiff_api_key": "ad_live_sk_..."}'
```

Results are saved to: `third_party/prime-environments/environments/agent_diff_bench/eval_results/`

## Database Seeding

Templates are seeded from JSON files in `backend/seeds/` (Docker) or `examples/` (local).

Seed scripts in `backend/utils/`:
- `seed_box_template.py` — creates box_default, box_base templates
- `seed_linear_template.py` — creates linear_default, linear_base, linear_expanded
- `seed_slack_template.py` — creates slack_default, slack_bench_default
- `seed_calendar_template.py` — creates calendar_base
- `seed_tests.py` — loads test suite JSON files

On Railway, seeding runs automatically on deploy when the `SEED=true` env var is set.
The Dockerfile startup script runs Alembic migrations then all seed scripts.

## Performance Profiling

Performance-critical paths are instrumented with `[PERF]` log lines:

- **Middleware**: `[PERF] GET /api/env/.../services/box/... total=Xms auth=Xms meta_db=Xms handler=Xms`
- **Box operations**: `[PERF] search_content TOTAL=Xms`, `[PERF] get_folder_by_id(...) time=Xms`
- **Box schema**: `[PERF] File._get_path_collection depth=N time=Xms`
- **Calendar**: `[PERF] Calendar events_list took Xms`

Filter with: `grep "\[PERF\]"` in Railway logs.
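
For a quick offline summary, something like this sketch works on saved log output (illustrative only; the regex assumes the exact `service=... handler=Nms` format emitted by `IsolationMiddleware`):

```python
# Aggregate per-service handler timings from [PERF] log lines piped on stdin.
import re
import sys
from collections import defaultdict

PERF_RE = re.compile(r"\[PERF\] .*service=(\S+) .*handler=(\d+)ms")

timings: dict[str, list[int]] = defaultdict(list)
for line in sys.stdin:
    match = PERF_RE.search(line)
    if match:
        timings[match.group(1)].append(int(match.group(2)))

for service, ms in sorted(timings.items()):
    print(f"{service}: n={len(ms)} avg={sum(ms) / len(ms):.0f}ms max={max(ms)}ms")
```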

## Key Directories

```
backend/
src/
platform/ # Platform API (initEnv, runs, evaluation)
services/
box/ # Box API replica
slack/ # Slack API replica
linear/ # Linear API replica
calendar/ # Calendar API replica
tests/
integration/ # Full-stack integration tests
performance/ # Performance/benchmark tests
validation/ # API parity tests
unit/ # Unit tests
utils/ # Seed scripts
seeds/ # Seed data JSON files (for Docker)

sdk/agent-diff-python/ # Python SDK (agent_diff package)

examples/
box/ # Box seed data + test suites
linear/ # Linear seed data + test suites
slack/ # Slack seed data + test suites
calendar/ # Calendar seed data

third_party/prime-environments/environments/agent_diff_bench/
agent_diff_bench.py # Entry point for prime eval
src/environment.py # Environment setup (initEnv, startRun, etc.)
```
23 changes: 22 additions & 1 deletion backend/src/platform/api/middleware.py
@@ -1,6 +1,7 @@
from __future__ import annotations

import logging
import time

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request
@@ -86,6 +87,8 @@ async def dispatch(self, request: Request, call_next) -> Response:
if not path.startswith("/api/env/"):
return await call_next(request)

t_total_start = time.perf_counter()

try:
path_after_prefix = path[len("/api/env/") :]
env_id = path_after_prefix.split("/")[0] if path_after_prefix else ""
@@ -106,8 +109,11 @@
status_code=status.HTTP_401_UNAUTHORIZED,
)

t_auth_start = time.perf_counter()
principal_id = await get_principal_id(api_key_hdr, action="api_request")
t_auth_ms = (time.perf_counter() - t_auth_start) * 1000

t_meta_start = time.perf_counter()
with self.session_manager.with_meta_session() as meta_session:
request.state.principal_id = principal_id

@@ -125,11 +131,26 @@
logger.debug(
f"Could not load impersonation data for env {env_id}: {e}"
)
t_meta_ms = (time.perf_counter() - t_meta_start) * 1000

t_handler_start = time.perf_counter()
with self.session_manager.with_session_for_environment(env_id) as session:
request.state.db_session = session
request.state.environment_id = env_id
return await call_next(request)
response = await call_next(request)
t_handler_ms = (time.perf_counter() - t_handler_start) * 1000

t_total_ms = (time.perf_counter() - t_total_start) * 1000
# Extract service from path for easier log filtering
parts = path_after_prefix.split("/")
service_name = parts[2] if len(parts) > 2 else "unknown"
logger.info(
f"[PERF] {request.method} {path} | service={service_name} "
f"total={t_total_ms:.0f}ms auth={t_auth_ms:.0f}ms "
f"meta_db={t_meta_ms:.0f}ms handler={t_handler_ms:.0f}ms "
f"status={response.status_code}"
)
return response

except PermissionError as exc:
return JSONResponse(
@@ -0,0 +1,57 @@
"""Add composite indexes for calendar event queries

Adds composite indexes on calendar_events to optimize the most common
query patterns: time-range filtering, status filtering, and sync-token
incremental queries.

Revision ID: a1b2c3d4e5f6
Revises: merge_heads_20260130
Create Date: 2026-02-11 12:00:00.000000

"""

from typing import Sequence, Union

from alembic import op


# revision identifiers, used by Alembic.
revision: str = "a1b2c3d4e5f6"
down_revision: Union[str, None] = "merge_heads_20260130"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
# Composite index for the most common list_events query pattern:
# WHERE calendar_id = X AND status != 'cancelled' AND start_datetime < Y
op.create_index(
"ix_event_cal_status_start",
"calendar_events",
["calendar_id", "status", "start_datetime"],
unique=False,
)

# Composite index for time-range queries (list_events with timeMin/timeMax, freebusy):
# WHERE calendar_id = X AND start_datetime >= Y AND end_datetime <= Z
op.create_index(
"ix_event_cal_start_end",
"calendar_events",
["calendar_id", "start_datetime", "end_datetime"],
unique=False,
)

# Composite index for sync-token incremental queries:
# WHERE calendar_id = X AND updated_at > Y
op.create_index(
"ix_event_cal_updated",
"calendar_events",
["calendar_id", "updated_at"],
unique=False,
)


def downgrade() -> None:
op.drop_index("ix_event_cal_updated", table_name="calendar_events")
op.drop_index("ix_event_cal_start_end", table_name="calendar_events")
op.drop_index("ix_event_cal_status_start", table_name="calendar_events")