factory-bench

An LLM evaluation harness for MLE-focused research tasks, inspired by SWE-bench. Each task runs in an isolated sandbox, is graded by a tiered composite judge, and is held to token, wall-time, and tool-call budgets. The harness is currently at milestone M2: native tool calling, multi-service Compose sandboxes (e.g. Postgres sidecars), suites with bounded fan-out, and per-run cost accounting.

┌───────────┐  RunOptions   ┌────────────────────┐  ┌─────────────────────┐
│   CLI     ├──────────────►│   Orchestrator     ├─►│ DockerRunner /      │
│ mle-eval  │               │  (per-task driver) │  │ ComposeRunner       │
└───────────┘               └─────────┬──────────┘  └─────────┬───────────┘
      ▲                               │                       │
      │ result.json,                  ▼                       ▼
      │ report.md                ┌──────────┐         ┌──────────────────┐
      │                          │  Agent   │◄────────┤ Sandboxed agent  │
      │                          │  loop    │  tools  │ container        │
      │                          └────┬─────┘         └──────────────────┘
      │                               │
      │                               ▼
      │                         ┌──────────────┐
      └─────────────────────────┤   Judge      │
                                │  (composite) │
                                └──────────────┘

Quickstart

Prerequisites

Python 3.10+
Docker Desktop (or any modern Docker engine). The harness expects docker version to succeed without sudo.

OpenAI API key in a local .env file:

cp .env.example .env
# then edit .env to set OPENAI_API_KEY=sk-...

Install (with uv, recommended)

uv sync               # creates .venv and installs everything from uv.lock
uv run mle-eval --help

Install (with pip)

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
mle-eval --help

Smoke test

mle-eval list-tasks
mle-eval pricing show

If mle-eval list-tasks prints the four built-in tasks (hello_repo_fix, pandas_transform, postgres_migration, sklearn_classifier), you're set.

Running a task

mle-eval run \
  --task tasks/pandas_transform \
  --model openai/gpt-4o-mini

This will:

Snapshot the task's workspace/ into a fresh runs/<run_id>/workspace/.
Spin up the sandbox (a single Docker container for simple tasks; a Compose project with sidecar services for tasks that declare an environment.services block such as postgres_migration).
Drive the agent loop against OpenAI's Chat Completions tool-calling API, accounting tokens and dollars per call.
Run the composite judge: deterministic checks first (file presence, exit codes, metric values, SQL checks), then optional reference_output diffs and LLM-rubric scoring.

Persist artifacts under runs/<run_id>/:

runs/2026-05-13T20-40-21Z__pandas_transform__openai_gpt_4o_mini__a3f0/
├── manifest.snapshot.yaml      # the task as resolved
├── transcript.jsonl            # every agent turn + tool call
├── workspace_final/            # post-run workspace state
├── grader_outputs/             # per-check outputs
└── result.json                 # score, passed, tokens, cost, kill_reason

Each run is also indexed in runs/runs.sqlite for cross-run queries. To inspect a single run:

mle-eval show-run <run_id>
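
For ad-hoc analysis outside the CLI, the per-run result.json files can also be read directly. The sketch below is not part of mle-eval: summarize_runs is a hypothetical helper, and it assumes result.json is a flat object with "score", "passed", and "cost" keys, guessed from the field names in the listing above. The real file may name or nest them differently.

```python
import json
from pathlib import Path


def summarize_runs(runs_root: str) -> dict:
    """Aggregate the result.json files found one level below runs_root.

    Key names ("score", "passed", "cost") are assumptions, not a documented schema.
    """
    results = []
    for result_path in sorted(Path(runs_root).glob("*/result.json")):
        with result_path.open() as fh:
            results.append(json.load(fh))

    total = len(results)
    passed = sum(1 for r in results if r.get("passed"))
    cost = sum(float(r.get("cost", 0.0)) for r in results)
    score_sum = sum(float(r.get("score", 0.0)) for r in results)
    avg_score = score_sum / total if total else 0.0
    return {"total": total, "passed": passed, "avg_score": avg_score, "cost": cost}
```

Calling summarize_runs("runs") would return counts and averages across every run directory that contains a result.json.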

Running a suite

A suite is a YAML manifest that fans a set of tasks out across a bounded thread pool, with shared concurrency, model, and budget defaults.

mle-eval suite list
mle-eval suite run suites/m2_core.yaml \
  --model openai/gpt-4o-mini \
  --concurrency 2

Output:

suite_run_id: 2026-05-13T20-40-21Z__suite__m2_core__f60e
total: 3  passed: 3
avg_score: 1.000  cost_usd: 0.0021
artifacts: runs/2026-05-13T20-40-21Z__suite__m2_core__f60e

The suite directory contains a report.md with per-task results and per-difficulty / per-domain rollups.

Architecture

factory-bench's source of truth for architecture is the LikeC4 model under docs/architecture/. The diagrams below are exported from that model; regenerate them with npm run --prefix docs/architecture build:png (see docs/architecture/README.md).

Landscape

The system in context: a human researcher drives the CLI, which calls out to OpenAI for completions and to Docker for sandboxing. Future MCP servers (rendered with a dashed red border per the #m3_planned lifecycle tag) will provide extra remote tools.

Containers

Inside factory-bench, five containers cooperate: the CLI, the harness process (orchestrator + agents + judges + sandbox glue), run artifacts on disk, pricing data, and the sandbox runtime (the actual Docker containers spawned per run).

Subsystems

The harness process is the main Python codebase under src/mle_eval_harness/. It has six internal subsystems, each owning its own module tree:

1. Orchestrator (`orchestrator.py`)

The per-task driver. Given a RunOptions it loads the task, prepares the workspace, builds the runner (Docker vs Compose), wires the cost tracker and budget enforcer, drives the agent loop, then invokes the composite judge. Writes the final result.json and updates runs.sqlite.

2. Agents (`agents/`)

A pluggable LLM agent loop. Two families of production agents:

NativeToolAgent — uses OpenAI's native tool_calls field; recommended for any model that supports it.
ReactAgent / SimpleAgent — text-based tool-call protocols for models without native support.

Selection is driven by tools.style in the task manifest (native | json_text). The agents/factory.py module is the dispatch entry point.
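
Dispatch on tools.style might look roughly like the sketch below. The two classes are empty placeholders, build_agent is a hypothetical name, and both the "native" default and the choice of ReactAgent for json_text are assumptions. The actual entry point in agents/factory.py may differ.

```python
class NativeToolAgent:
    """Placeholder standing in for the native tool-calling agent."""


class ReactAgent:
    """Placeholder standing in for a text-protocol agent."""


# Maps the manifest's tools.style value to an agent class.
_AGENTS_BY_STYLE = {
    "native": NativeToolAgent,
    "json_text": ReactAgent,
}


def build_agent(manifest: dict):
    """Instantiate the agent selected by manifest["tools"]["style"]."""
    # Falling back to "native" when the key is absent is an assumption.
    style = manifest.get("tools", {}).get("style", "native")
    try:
        agent_cls = _AGENTS_BY_STYLE[style]
    except KeyError:
        raise ValueError(f"unknown tools.style: {style!r}") from None
    return agent_cls()
```

A lookup table keeps the set of valid styles in one place, so an unsupported value fails loudly before any sandbox is started.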

3. Judges (`judges/`)

Tiered evaluation, executed in this order:

Deterministic — file_present, command_exit_zero, metric_check, sql_check. Cheap, reproducible. Gating checks here can short-circuit the whole evaluation.
Reference-output — diff stdout against a known-good fixture with configurable numeric / string tolerance.
LLM rubric — a separate judge model scores free-form work against a rubric prompt. Cached on (rubric, transcript, judge_model) hash via judges/cache.py.
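
One way to derive a cache key for the LLM-rubric tier is sketched below. The function name and the choice of SHA-256 over a JSON-encoded list are assumptions; judges/cache.py may hash its inputs differently.

```python
import hashlib
import json


def rubric_cache_key(rubric: str, transcript: str, judge_model: str) -> str:
    """Return a stable hex digest for one (rubric, transcript, judge_model) triple."""
    # Encoding the fields as a JSON list keeps their boundaries unambiguous,
    # so ("ab", "c") and ("a", "bc") do not produce the same payload.
    payload = json.dumps([rubric, transcript, judge_model])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the judge model is part of the key, switching judge models invalidates earlier cached scores instead of silently reusing them.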

judges/composite.py orchestrates the tiers; weights determine the final score; gating checks force-fail when violated.
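
The interaction between weights and gating checks can be illustrated with a minimal sketch. CheckResult, composite_score, and pass_threshold are hypothetical names, and zeroing the score on a gating failure is an assumption; the real composite judge may report a partial score alongside the forced failure.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    score: float           # expected to lie in [0, 1]
    weight: float = 1.0
    gating: bool = False   # a failed gating check fails the whole run


def composite_score(results: list, pass_threshold: float = 1.0) -> tuple:
    """Combine check results into a (score, passed) pair."""
    # Any gating check that did not fully pass force-fails the run.
    if any(r.gating and r.score < 1.0 for r in results):
        return 0.0, False

    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0, False

    # Otherwise the final score is the weight-normalised average.
    score = sum(r.score * r.weight for r in results) / total_weight
    return score, score >= pass_threshold
```

For example, a gating file_present check with score 1.0 and weight 1 combined with a metric_check of score 0.5 and weight 3 gives (1.0 + 1.5) / 4 = 0.625.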

4. Sandbox (`sandbox/`)

The runtime layer the agent's tools execute against. Three implementations of the SandboxRunner Protocol:

LocalSubprocessRunner — no isolation, dev-only (MLE_EVAL_SANDBOX=local).
DockerRunner — one container per task with the workspace bind-mounted. The default.
ComposeRunner — Docker Compose with multiple services and a user-defined network. Auto-selected when a task declares environment.services or non-empty network.allow.
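
The selection rule in the list above can be sketched as a small function. This is an illustration only: select_runner is a hypothetical name, the manifest is assumed to be a parsed dict with network.allow at the top level, and giving MLE_EVAL_SANDBOX=local precedence over the manifest is an assumption.

```python
def select_runner(manifest: dict, env: dict) -> str:
    """Return the name of the runner that would handle this task."""
    # Dev-only override; its precedence over the manifest is assumed here.
    if env.get("MLE_EVAL_SANDBOX") == "local":
        return "LocalSubprocessRunner"

    has_services = bool(manifest.get("environment", {}).get("services"))
    has_allowlist = bool(manifest.get("network", {}).get("allow"))
    if has_services or has_allowlist:
        return "ComposeRunner"

    # Single container with the workspace bind-mounted: the default.
    return "DockerRunner"
```

An empty network.allow list is falsy, so it falls through to DockerRunner, matching the "non-empty" wording above.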

Path inputs from agent tools (/workspace/...) are remapped to the host workspace and validated to prevent traversal outside the sandbox.
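
A minimal sketch of that remap-and-validate step, using only pathlib, is shown below. resolve_sandbox_path is a hypothetical name, and rejecting paths that do not start with /workspace (including relative ones) is an assumption rather than documented behaviour.

```python
from pathlib import Path, PurePosixPath

SANDBOX_ROOT = PurePosixPath("/workspace")


def resolve_sandbox_path(agent_path: str, host_workspace: Path) -> Path:
    """Map a /workspace/... path from the agent onto the host workspace."""
    try:
        relative = PurePosixPath(agent_path).relative_to(SANDBOX_ROOT)
    except ValueError:
        raise PermissionError(f"path is outside /workspace: {agent_path}") from None

    host_root = host_workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # containment check below sees the real target location.
    candidate = host_root.joinpath(*relative.parts).resolve()
    if not candidate.is_relative_to(host_root):
        raise PermissionError(f"path escapes the workspace: {agent_path}")
    return candidate
```

Checking containment after resolution, rather than scanning the string for "..", also catches symlinks inside the workspace that point outside it.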

5. Storage (`runs/`)

Per-run artifacts on disk plus a per-runs-root SQLite index (runs/runs.sqlite) with tables runs, suite_runs, and check_results. The CLI's show-run and suite commands read from this index. The schema is additive-only across milestones.
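
A cross-run query against the index could look like the sketch below. Only the table name runs comes from the description above; every column name is hypothetical, and the sketch builds its own in-memory database so it does not depend on the real schema.

```python
import sqlite3

# Hypothetical columns for illustration; the real runs table may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    run_id   TEXT PRIMARY KEY,
    task_id  TEXT NOT NULL,
    model    TEXT NOT NULL,
    score    REAL,
    passed   INTEGER,
    cost_usd REAL
);
"""


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open the index and create the table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def pass_rate_by_task(conn: sqlite3.Connection) -> dict:
    """Return {task_id: (pass_rate, run_count)} across all indexed runs."""
    rows = conn.execute(
        "SELECT task_id, AVG(passed), COUNT(*) FROM runs "
        "GROUP BY task_id ORDER BY task_id"
    ).fetchall()
    return {task_id: (rate, count) for task_id, rate, count in rows}
```

CREATE TABLE IF NOT EXISTS makes re-running the schema script harmless, which is one simple way to keep table creation compatible with an additive-only policy.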

6. Suites (`suites/`)

Suite manifest loader + a ThreadPoolExecutor-based runner with per-task isolation and graceful per-failure handling. The report writer emits a Markdown summary with rollups by difficulty and domain.
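
The bounded fan-out with per-failure handling can be sketched with the standard library alone. run_suite and the shape of the returned dict are hypothetical; run_task stands in for whatever callable drives a single task.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_suite(task_ids, run_task, concurrency=2):
    """Run run_task(task_id) for every task, at most `concurrency` at a time."""
    results = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(run_task, task_id): task_id for task_id in task_ids}
        for future in as_completed(futures):
            task_id = futures[future]
            try:
                results[task_id] = {"ok": True, "result": future.result()}
            except Exception as exc:
                # One failing task is recorded and does not abort the others.
                results[task_id] = {"ok": False, "error": repr(exc)}
    return results
```

max_workers is what bounds the fan-out: extra tasks wait in the executor's queue until a worker thread is free.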

Cross-cutting

Budgets (budgets/) — per-run caps on tokens, tool calls, and wall time. The orchestrator passes a BudgetEnforcer into the agent loop; breaches return a budget_exceeded:<reason> signal that the orchestrator records as kill_reason.
Cost (cost/) — pulls token prices from pricing/model_prices.yaml. The CostTracker records per-tool-call usage and rolls up at run end.
Tools (tools/) — built-in tools (bash, python, text_editor, git_diff) plus JSON Schema definitions for the native tool-call API. MCP-server-backed remote tools are planned (#m3_planned).
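
The budget signal described under Budgets above can be illustrated with a small enforcer. This is a sketch under assumptions: the class layout, the method names, and the reason strings after budget_exceeded: (tokens, tool_calls, wall_time) are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BudgetEnforcer:
    max_tokens: int
    max_tool_calls: int
    max_wall_seconds: float
    tokens_used: int = 0
    tool_calls_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Add usage from one agent turn to the running totals."""
        self.tokens_used += tokens
        self.tool_calls_used += tool_calls

    def check(self) -> Optional[str]:
        """Return a budget_exceeded:<reason> signal, or None if within budget."""
        if self.tokens_used > self.max_tokens:
            return "budget_exceeded:tokens"
        if self.tool_calls_used > self.max_tool_calls:
            return "budget_exceeded:tool_calls"
        if time.monotonic() - self.started_at > self.max_wall_seconds:
            return "budget_exceeded:wall_time"
        return None
```

An agent loop would call record() after each model response and stop as soon as check() returns a non-None value, which the orchestrator could then store as kill_reason.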

Dynamic view: single task run

The sequence below shows what happens during mle-eval run. The suite-run sequence is similar but wrapped in a ThreadPoolExecutor with bounded fan-out, and persists an extra row in the suite_runs table.

Project layout

factory-bench/
├── src/mle_eval_harness/        # the harness package
│   ├── cli.py                   # Typer CLI: run, suite, list-tasks, ...
│   ├── orchestrator.py          # per-task driver
│   ├── agents/                  # LLM agent loops (native + json_text)
│   ├── judges/                  # composite judge + tier handlers
│   ├── sandbox/                 # Local / Docker / Compose runners
│   ├── suites/                  # suite manifest + parallel runner
│   ├── runs/                    # run store + sqlite index
│   ├── budgets/                 # token / wall / tool-call enforcement
│   ├── cost/                    # pricing table + cost tracker
│   ├── tasks/                   # task manifest schema + loader
│   └── tools/                   # built-in tools + native schemas
│
├── tasks/                       # task definitions (one folder each)
│   ├── hello_repo_fix/          # M1 smoke test (text-based protocol)
│   ├── pandas_transform/        # easy, reference_output graded
│   ├── sklearn_classifier/      # medium, metric_check graded
│   └── postgres_migration/      # hard, compose + sql_check graded
│
├── suites/                      # suite manifests (each groups a set of tasks)
│   └── m2_core.yaml
│
├── pricing/                     # token prices per model
│   └── model_prices.yaml
│
├── tests/                       # 43 unit + integration tests
│
├── docs/architecture/           # LikeC4 source of truth
│   ├── specification.c4
│   ├── model.c4
│   ├── views/*.c4
│   └── img/                     # exported PNGs (used by this README)
│
├── runs/                        # gitignored — per-run artifacts + sqlite
├── pyproject.toml
├── uv.lock
└── README.md

Development

Tests

uv run pytest -q                                # all 43 tests
uv run pytest -q tests/test_native_agent.py     # one file

Tests use scripted LLM clients (no real API calls), in-memory sandboxes where possible, and tmp dirs for sqlite — they finish in ~1 second.

Linting

uv run ruff check src tests
uv run ruff format src tests

The repo is ruff-clean and we keep it that way. Lint config lives under [tool.ruff] in pyproject.toml.

Working with the architecture spec

The .c4 files under docs/architecture/ are the source of truth for the system's structure. When you make non-trivial code changes, update the model first, then mirror the change in code:

cd docs/architecture
npx --yes likec4 validate         # CI-friendly schema check
npx --yes likec4 start            # interactive viewer at localhost:5173

To regenerate the PNGs in this README:

cd docs/architecture
npx --yes likec4 export png --output img --flat \
  --filter landscape --filter containers --filter dynamic_single_run --seq

See docs/architecture/README.md for the spec workflow.

Adding a new task

A task is a folder under tasks/ containing:

task.yaml — manifest (id, environment, agent, tools, budget, grader).
workspace/ — files seeded into the sandbox before the agent runs.
Optional grader fixtures (reference outputs, rubrics, expected metrics, init SQL).

The four shipped tasks are good references for the common shapes (simple, reference-graded, metric-graded, compose-graded). Before scoring a new task, validate its manifest with mle-eval describe --task tasks/<your_task>.
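
A quick structural pre-check of a task folder can be sketched as below. It is not a substitute for the CLI's validation: both helpers are hypothetical, and treating all six manifest keys as required is an assumption drawn from the list above.

```python
from pathlib import Path

# Top-level keys named in the task.yaml description; assumed to all be required.
REQUIRED_MANIFEST_KEYS = ("id", "environment", "agent", "tools", "budget", "grader")


def check_task_folder(task_dir: Path) -> list:
    """Return a list of structural problems found in a task folder."""
    problems = []
    if not (task_dir / "task.yaml").is_file():
        problems.append("missing task.yaml")
    if not (task_dir / "workspace").is_dir():
        problems.append("missing workspace/")
    return problems


def missing_manifest_keys(manifest: dict) -> list:
    """Return the required keys absent from an already-parsed manifest."""
    return [key for key in REQUIRED_MANIFEST_KEYS if key not in manifest]
```

Both helpers return lists rather than raising, so a caller can report every problem in one pass.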

Roadmap

M3 (planned) — Real MCP-server-backed tools (fetch, code-docs, vendor APIs) with budget integration. The #m3_planned components in docs/architecture/ preview the surface.
More tasks across domains: prompt engineering, RAG eval, agent trajectories with multi-turn debugging.
A small web dashboard over runs.sqlite for cross-run analysis.

License

TBD. Internal-only for now.
