An LLM evaluation harness for MLE-focused research tasks, inspired by SWE-bench. Each task runs in an isolated sandbox, is graded by a tiered composite judge, and is held to token, wall-time, and tool-call budgets. The harness is currently at milestone M2: native tool calling, multi-service compose sandboxes (e.g. Postgres sidecars), suites with bounded fan-out, and per-run cost accounting.
```
┌───────────┐  RunOptions   ┌────────────────────┐  ┌─────────────────────┐
│    CLI    ├──────────────►│    Orchestrator    ├─►│   DockerRunner /    │
│ mle-eval  │               │ (per-task driver)  │  │   ComposeRunner     │
└───────────┘               └─────────┬──────────┘  └─────────┬───────────┘
      ▲                               │                       │
      │ result.json,                  ▼                       ▼
      │ report.md                ┌──────────┐         ┌──────────────────┐
      │                          │  Agent   │◄────────┤ Sandboxed agent  │
      │                          │  loop    │  tools  │ container        │
      │                          └────┬─────┘         └──────────────────┘
      │                               │
      │                               ▼
      │                         ┌──────────────┐
      └─────────────────────────┤    Judge     │
                                │ (composite)  │
                                └──────────────┘
```
- Python 3.10+
- Docker Desktop (or any modern Docker engine). The harness expects `docker version` to succeed without `sudo`.
- OpenAI API key in a local `.env` file:

  ```bash
  cp .env.example .env   # then edit .env to set OPENAI_API_KEY=sk-...
  ```
Install (with uv, recommended)

```bash
uv sync                  # creates .venv and installs everything from uv.lock
uv run mle-eval --help
```

Install (with pip)

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
mle-eval --help
```

To verify the install:

```bash
mle-eval list-tasks
mle-eval pricing show
```

If `mle-eval list-tasks` prints the four built-in tasks (`hello_repo_fix`, `pandas_transform`, `postgres_migration`, `sklearn_classifier`), you're set.
```bash
mle-eval run \
  --task tasks/pandas_transform \
  --model openai/gpt-4o-mini
```

This will:

- Snapshot the task's `workspace/` into a fresh `runs/<run_id>/workspace/`.
- Spin up the sandbox (a single Docker container for simple tasks; a Compose project with sidecar services for tasks that declare an `environment.services` block, such as `postgres_migration`).
- Drive the agent loop against OpenAI's Chat Completions tool-calling API, accounting for tokens and dollars per call.
- Run the composite judge: deterministic checks first (file presence, exit codes, metric values, SQL checks), then optional `reference_output` diffs and LLM-rubric scoring.
- Persist artifacts under `runs/<run_id>/`:

  ```
  runs/2026-05-13T20-40-21Z__pandas_transform__openai_gpt_4o_mini__a3f0/
  ├── manifest.snapshot.yaml   # the task as resolved
  ├── transcript.jsonl         # every agent turn + tool call
  ├── workspace_final/         # post-run workspace state
  ├── grader_outputs/          # per-check outputs
  └── result.json              # score, passed, tokens, cost, kill_reason
  ```
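Because the run artifacts are plain files, they can be inspected without the CLI. The sketch below collects a few fields from every `result.json` under a runs root. The function name `summarize_runs` is hypothetical, and the JSON key names (`score`, `passed`, `kill_reason`) are assumed from the comment in the artifact listing; the real file may use different keys.

```python
import json
from pathlib import Path


def summarize_runs(runs_root: Path) -> list[dict]:
    """Collect a few headline fields from each <runs_root>/<run_id>/result.json."""
    rows = []
    for result_file in sorted(runs_root.glob("*/result.json")):
        data = json.loads(result_file.read_text())
        rows.append(
            {
                "run_id": result_file.parent.name,
                # Key names are assumptions; .get() tolerates missing ones.
                "score": data.get("score"),
                "passed": data.get("passed"),
                "kill_reason": data.get("kill_reason"),
            }
        )
    return rows
```

For example, `summarize_runs(Path("runs"))` would return one dict per run directory that contains a `result.json`.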
Each run is also indexed in runs/runs.sqlite for cross-run queries.
```bash
mle-eval show-run <run_id>
```

A suite is a YAML manifest that fans a set of tasks out across a bounded thread pool, with shared concurrency, model, and budget defaults.
```bash
mle-eval suite list
mle-eval suite run suites/m2_core.yaml \
  --model openai/gpt-4o-mini \
  --concurrency 2
```

Output:

```
suite_run_id: 2026-05-13T20-40-21Z__suite__m2_core__f60e
total: 3 passed: 3
avg_score: 1.000 cost_usd: 0.0021
artifacts: runs/2026-05-13T20-40-21Z__suite__m2_core__f60e
```
The suite directory contains a report.md with per-task results and
per-difficulty / per-domain rollups.
factory-bench's source of truth for architecture is the
LikeC4 model under docs/architecture/.
The diagrams below are exported from that model; regenerate them with
npm run --prefix docs/architecture build:png (see
docs/architecture/README.md).
The system in context: a human researcher drives the CLI, which calls
out to OpenAI for completions and to Docker for sandboxing. Future MCP
servers (rendered with a dashed red border per the #m3_planned
lifecycle tag) will provide extra remote tools.
Inside factory-bench, five containers cooperate: the CLI, the
harness process (orchestrator + agents + judges + sandbox glue),
run artifacts on disk, pricing data, and the sandbox runtime
(the actual Docker containers spawned per run).
The harness process is the main Python codebase under
src/mle_eval_harness/. It has six internal subsystems, each owning
its own module tree:
The orchestrator (`orchestrator.py`) is the per-task driver. Given a `RunOptions`, it loads the task, prepares
the workspace, builds the runner (Docker or Compose), wires up the cost
tracker and budget enforcer, drives the agent loop, then invokes the
composite judge. It writes the final `result.json` and updates
`runs.sqlite`.
The agents subsystem (`agents/`) is a pluggable LLM agent loop. It ships two kinds of production agent:

- `NativeToolAgent`: uses OpenAI's native `tool_calls` field; recommended for any model that supports it.
- `ReactAgent` / `SimpleAgent`: text-based tool-call protocols for models without native support.

Selection is driven by tools.style in the task manifest
(native | json_text). The agents/factory.py module is the
dispatch entry point.
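A minimal sketch of how dispatch on `tools.style` might look. The stand-in classes, the `build_agent` function name, the mapping of `json_text` to `ReactAgent`, and the `native` default are all assumptions for illustration; the real `agents/factory.py` may differ.

```python
from dataclasses import dataclass


@dataclass
class NativeToolAgent:
    """Stand-in for the agent that uses the native tool_calls field."""
    model: str


@dataclass
class ReactAgent:
    """Stand-in for a text-based tool-call agent."""
    model: str


# Hypothetical registry: maps tools.style values to agent classes.
_AGENTS_BY_STYLE = {
    "native": NativeToolAgent,
    "json_text": ReactAgent,
}


def build_agent(manifest: dict, model: str):
    """Pick an agent class from the manifest's tools.style (assumed default: native)."""
    style = manifest.get("tools", {}).get("style", "native")
    try:
        agent_cls = _AGENTS_BY_STYLE[style]
    except KeyError:
        raise ValueError(f"unknown tools.style: {style!r}") from None
    return agent_cls(model=model)
```

Keeping the mapping in one table means a new protocol style only needs a new entry, and an unknown style fails loudly at load time rather than midway through a run.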
The judges subsystem (`judges/`) performs tiered evaluation, in this order:

- Deterministic: `file_present`, `command_exit_zero`, `metric_check`, `sql_check`. Cheap and reproducible. Gating checks here can short-circuit the whole evaluation.
- Reference-output: diff stdout against a known-good fixture with configurable numeric / string tolerance.
- LLM rubric: a separate judge model scores free-form work against a rubric prompt. Results are cached on a hash of `(rubric, transcript, judge_model)` via `judges/cache.py`.

judges/composite.py orchestrates the tiers; weights determine the
final score; gating checks force-fail when violated.
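The weighting and gating rule can be sketched as follows. This is an illustration under stated assumptions, not the code in `judges/composite.py`: the `CheckResult` fields, the `composite_score` name, the pass threshold, and the choice to return a score of 0.0 on a gating failure are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    score: float          # in [0.0, 1.0]
    weight: float = 1.0
    gating: bool = False  # a failed gating check force-fails the run


def composite_score(
    results: list[CheckResult], pass_threshold: float = 1.0
) -> tuple[float, bool]:
    """Return (weighted score, passed) for a list of check results."""
    # Gating: any gating check below full marks fails the run outright.
    if any(r.gating and r.score < 1.0 for r in results):
        return 0.0, False
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0, False
    # Otherwise the score is the weight-normalized average of check scores.
    score = sum(r.score * r.weight for r in results) / total_weight
    return score, score >= pass_threshold
```

With a gating `file_present` check at 1.0 (weight 1) and a `metric_check` at 0.5 (weight 3), the weighted score is (1.0 + 1.5) / 4 = 0.625.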
The sandbox subsystem (`sandbox/`) is the runtime layer that the agent's tools execute against. There are three
implementations of the `SandboxRunner` Protocol:

- `LocalSubprocessRunner`: no isolation, dev-only (`MLE_EVAL_SANDBOX=local`).
- `DockerRunner`: one container per task with the workspace bind-mounted. The default.
- `ComposeRunner`: Docker Compose with multiple services and a user-defined network. Auto-selected when a task declares `environment.services` or a non-empty `network.allow`.

Path inputs from agent tools (/workspace/...) are remapped to the
host workspace and validated to prevent traversal outside the sandbox.
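One way to implement the remapping and traversal check is sketched below with `pathlib`. The function name `to_host_path` and the exact error messages are hypothetical; the harness's own validation may be stricter or structured differently.

```python
from pathlib import Path, PurePosixPath

SANDBOX_ROOT = PurePosixPath("/workspace")


def to_host_path(sandbox_path: str, host_workspace: Path) -> Path:
    """Map a /workspace/... path to the host workspace, rejecting escapes."""
    try:
        rel = PurePosixPath(sandbox_path).relative_to(SANDBOX_ROOT)
    except ValueError:
        raise ValueError(f"path must be under {SANDBOX_ROOT}: {sandbox_path}") from None
    root = host_workspace.resolve()
    candidate = (root / rel).resolve()
    # resolve() collapses ".." segments and follows symlinks, so a path
    # that leaves the workspace by either route fails this containment check.
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes the workspace: {sandbox_path}")
    return candidate
```

Resolving before comparing is the important step: a purely textual prefix check would accept `/workspace/../etc/passwd`.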
The run store (`runs/`) holds per-run artifacts on disk plus a SQLite index per runs root
(`runs/runs.sqlite`) with tables `runs`, `suite_runs`, and
`check_results`. The CLI's `show-run` and `suite` commands read from
this index. The schema is additive-only across milestones.
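To illustrate what "additive-only" means in practice, the sketch below builds an in-memory table and then extends it. The column names and sample rows are invented for the example and are not the real `runs` schema; only the table name comes from the text above.

```python
import sqlite3

# Hypothetical, simplified schema with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (run_id TEXT PRIMARY KEY, task TEXT, score REAL, passed INTEGER)"
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [("r1", "pandas_transform", 1.0, 1), ("r2", "sklearn_classifier", 0.4, 0)],
)

# Additive-only change: a later milestone adds a nullable column, so rows
# and queries written against the older schema keep working unchanged.
conn.execute("ALTER TABLE runs ADD COLUMN cost_usd REAL")


def pass_rate(connection: sqlite3.Connection) -> float:
    """Fraction of indexed runs that passed (a query unaffected by the new column)."""
    (rate,) = connection.execute("SELECT AVG(passed) FROM runs").fetchone()
    return rate
```

Existing rows simply read `NULL` for the new column, which is why adding columns is safe while renaming or dropping them is not.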
The suites subsystem (`suites/`) combines a suite manifest loader with a `ThreadPoolExecutor`-based runner that
isolates each task and handles per-task failures gracefully. The report writer
emits a Markdown summary with rollups by difficulty and domain.
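Bounded fan-out with per-task failure handling can be sketched with the standard library alone. The `run_suite` and `run_one` names and the shape of the result dicts are assumptions for illustration, not the harness's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_suite(task_ids, run_one, concurrency=2):
    """Run each task in a bounded pool; a crash in one task does not stop the rest."""
    results = {}
    # max_workers caps how many tasks run at once (the "bounded fan-out").
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(run_one, task_id): task_id for task_id in task_ids}
        for future in as_completed(futures):
            task_id = futures[future]
            try:
                results[task_id] = {"status": "ok", "result": future.result()}
            except Exception as exc:  # record the failure and keep going
                results[task_id] = {"status": "error", "error": repr(exc)}
    return results
```

Calling `future.result()` inside the `try` is what surfaces an exception raised in a worker thread, so each failure is captured against its own task rather than aborting the suite.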
Three supporting modules cut across these subsystems:

- Budgets (`budgets/`): per-run caps on tokens, tool calls, and wall time. The orchestrator passes a `BudgetEnforcer` into the agent loop; breaches return a `budget_exceeded:<reason>` signal that the orchestrator records as `kill_reason`.
- Cost (`cost/`): pulls token prices from `pricing/model_prices.yaml`. The `CostTracker` records per-tool-call usage and rolls up at run end.
- Tools (`tools/`): built-in tools (`bash`, `python`, `text_editor`, `git_diff`) plus JSON Schema definitions for the native tool-call API. MCP-server-backed remote tools are planned (`#m3_planned`).

The sequence below shows what happens during mle-eval run. The
suite-run sequence is similar but wrapped in a ThreadPoolExecutor
with bounded fan-out, and persists an extra suite_run row.
```
factory-bench/
├── src/mle_eval_harness/        # the harness package
│   ├── cli.py                   # Typer CLI: run, suite, list-tasks, ...
│   ├── orchestrator.py          # per-task driver
│   ├── agents/                  # LLM agent loops (native + json_text)
│   ├── judges/                  # composite judge + tier handlers
│   ├── sandbox/                 # Local / Docker / Compose runners
│   ├── suites/                  # suite manifest + parallel runner
│   ├── runs/                    # run store + sqlite index
│   ├── budgets/                 # token / wall / tool-call enforcement
│   ├── cost/                    # pricing table + cost tracker
│   ├── tasks/                   # task manifest schema + loader
│   └── tools/                   # built-in tools + native schemas
│
├── tasks/                       # task definitions (one folder each)
│   ├── hello_repo_fix/          # M1 smoke test (text-based protocol)
│   ├── pandas_transform/        # easy, reference_output graded
│   ├── sklearn_classifier/      # medium, metric_check graded
│   └── postgres_migration/      # hard, compose + sql_check graded
│
├── suites/                      # suite manifests (compose tasks)
│   └── m2_core.yaml
│
├── pricing/                     # token prices per model
│   └── model_prices.yaml
│
├── tests/                       # 43 unit + integration tests
│
├── docs/architecture/           # LikeC4 source of truth
│   ├── specification.c4
│   ├── model.c4
│   ├── views/*.c4
│   └── img/                     # exported PNGs (used by this README)
│
├── runs/                        # gitignored: per-run artifacts + sqlite
├── pyproject.toml
├── uv.lock
└── README.md
```
```bash
uv run pytest -q                              # all 43 tests
uv run pytest -q tests/test_native_agent.py   # one file
```

Tests use scripted LLM clients (no real API calls), in-memory sandboxes where possible, and tmp dirs for sqlite, so they finish in about 1 second.

```bash
uv run ruff check src tests
uv run ruff format src tests
```

The repo is ruff-clean and we keep it that way. Lint config lives under
`[tool.ruff]` in `pyproject.toml`.
The .c4 files under docs/architecture/ are the source of truth for
the system's structure. When you make non-trivial code changes, update
the model first, then mirror the change in code:

```bash
cd docs/architecture
npx --yes likec4 validate   # CI-friendly schema check
npx --yes likec4 start      # interactive viewer at localhost:5173
```

To regenerate the PNGs in this README:

```bash
cd docs/architecture
npx --yes likec4 export png --output img --flat \
  --filter landscape --filter containers --filter dynamic_single_run --seq
```

See `docs/architecture/README.md` for the spec workflow.
A task is a folder under tasks/ containing:

- `task.yaml`: manifest (id, environment, agent, tools, budget, grader).
- `workspace/`: files seeded into the sandbox before the agent runs.
- Optional grader fixtures (reference outputs, rubrics, expected metrics, init SQL).

The four shipped tasks are good references for the common shapes
(simple, reference-graded, metric-graded, compose-graded). Before scoring
a new task, validate its manifest with
`mle-eval describe --task tasks/<your_task>`.
- M3 (planned): real MCP-server-backed tools (`fetch`, code-docs, vendor APIs) with budget integration. The `#m3_planned` components in `docs/architecture/` preview the surface.
- More tasks across domains: prompt engineering, RAG eval, agent trajectories with multi-turn debugging.
- A small web dashboard over `runs.sqlite` for cross-run analysis.

License: TBD. Internal-only for now.


