GitHub: https://github.com/runablehq/runable-agent-benchmark
This benchmark compares two ways of using pi for a long-horizon product task:
- Single-agent workflow — one agent owns the full product context, builds the app, then applies a cross-cutting edit.
- Subagent / multi-agent workflow — several isolated role-based agents build pieces of the app, then a coordinator applies the same edit.
The benchmark exists to support the Runable 2.0 architecture thesis:
For long-horizon outcome generation, a single agent with strong context, artifact state, and composable CLI-style tools is often more reliable than a workflow/subagent system that fragments context across handoffs.
Runable 1.0 was closer to a workflow/multi-agent approach. Runable 2.0 is designed around a single-agent outcome loop.
Most agent demos test the first generation of an artifact. That is not enough.
Real users do not just ask an agent to build once. They come back and say:
- "Add approvals to every asset."
- "Make this match my brand deck."
- "Update the app without breaking the existing workflow."
- "Add reporting across campaigns, proposals, and assets."
The edit-after-build phase is where architecture differences show up.
A multi-agent workflow may look good during the initial build, but later edits expose:
- context loss,
- unclear ownership,
- inconsistent assumptions,
- duplicated abstractions,
- more tool-call boundaries,
- more integration failures,
- higher cost to reconstruct intent.
A single agent has a better chance of preserving the product intent, data model, UI assumptions, and artifact state across the full lifecycle.
Both variants receive the same task:
Build a Vite + React + TypeScript app called RunOps Studio.
The app is an SMB operations workspace with:
- dashboard,
- brand profile,
- campaign board,
- asset library,
- ad performance analytics,
- proposal builder,
- client report preview,
- local-first persistence,
- realistic seed data,
- polished SaaS UI,
- npm run build validation.
After the app is built, apply a cross-cutting product change:
Client Approval Center
The edit must add approval state, reviewer comments, audit trail, filters, and approval actions across campaigns, assets, and proposal sections.
This edit intentionally touches multiple parts of the product. It tests whether the agent preserved enough context to modify the system coherently instead of rewriting or breaking it.
From this directory:
python3 run_benchmark.py --runs 1 --model anthropic/claude-sonnet-4-5 --thinking high
Or use your default configured pi model:
python3 run_benchmark.py --runs 1
Run only one variant:
python3 run_benchmark.py --mode single
python3 run_benchmark.py --mode subagent
Skip validation if you only want to collect agent/tool metrics:
python3 run_benchmark.py --skip-validation
Dry run:
python3 run_benchmark.py --dry-run
Outputs are written to:
results/<timestamp>/
Each run includes:
- generated app directories,
- pi JSONL logs,
- prompts used,
- npm install/build logs,
- summary.json,
- SUMMARY.md.
The script records:
| Metric | Meaning |
|---|---|
| Pi invocations | Number of separate agent sessions. More sessions means more context handoffs. |
| Total seconds | Wall-clock time spent in pi invocations. |
| Tool calls | Total tool calls made by pi. |
| Tool errors | Tool calls that ended with errors. |
| Turns | Number of assistant turns. |
| Final build | Whether npm run build passed after the edit. |
| Edit changed files | How many files changed during the edit phase. |
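As an illustration of the "Edit changed files" metric, the sketch below shows one way such a count could be derived: hash every file before and after the edit phase and compare the two snapshots. This is an assumption about the general approach, not a description of what run_benchmark.py actually does. The helper names `snapshot` and `changed_files` and the demo file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 digest of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Assumption: dependency folders are excluded so npm install noise is not counted.
        if path.is_file() and "node_modules" not in relative.parts:
            digests[relative.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified between two snapshots."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))


# Demo on a throwaway directory standing in for a generated app.
with tempfile.TemporaryDirectory() as tmp:
    app = Path(tmp)
    (app / "src").mkdir()
    (app / "src" / "App.tsx").write_text("export const App = () => null;\n")
    (app / "src" / "types.ts").write_text("export type Campaign = { id: string };\n")
    before = snapshot(app)

    # Simulated edit phase: one file modified, one file added.
    (app / "src" / "types.ts").write_text("export type Campaign = { id: string; approval: string };\n")
    (app / "src" / "Approvals.tsx").write_text("export const Approvals = () => null;\n")
    after = snapshot(app)

    demo_changes = changed_files(before, after)

print(demo_changes)  # ['src/Approvals.tsx', 'src/types.ts']
```

A raw file count says nothing about whether the changes were coherent, which is one reason the generated apps still need to be inspected by hand.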
Manual review is still important. The benchmark measures reliability and cost proxies, but product quality should also be inspected.
The hypothesis is that the single-agent workflow should generally have:
- fewer context handoffs,
- fewer total tool calls,
- fewer integration mismatches,
- less duplicated or conflicting architecture,
- cleaner edit behavior,
- better preservation of product intent,
- lower cost per successful outcome.
The subagent workflow may sometimes parallelize or produce useful isolated work, but it pays a handoff tax. Each subagent only knows what was written to files or passed in the prompt. Rich product context, intent, taste, and design constraints are repeatedly compressed.
Every tool call is a possible failure point.
If a workflow requires 100 narrow tool calls, even a low per-call failure rate compounds into meaningful end-to-end failure risk. Long chains also increase:
- latency,
- context bloat,
- tool result parsing overhead,
- recovery complexity,
- model confusion.
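The compounding claim above can be made concrete with a short calculation. The sketch assumes each tool call fails independently with the same probability and that nothing is retried; both are simplifications. The 1% rate is an illustrative figure, not a measurement from this benchmark.

```python
def chain_success_probability(per_call_failure_rate: float, calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming independent failures."""
    return (1.0 - per_call_failure_rate) ** calls


# Illustrative only: 1% per-call failure rate, no retries.
for calls in (10, 100, 250):
    p = chain_success_probability(0.01, calls)
    print(f"{calls:>3} calls at 1% failure each: {p:.1%} chance of a run with no failures")
```

Under these assumptions, a 100-call chain finishes without a single failure only about 37% of the time. Real agents usually recover from individual errors, so this is the chance of hitting no failure at all. It is not the chance that the task succeeds. Recovery still costs extra turns, latency, and context.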
Runable's architecture deliberately pushes many tools toward CLI-like composable primitives.
Instead of forcing the agent to make dozens of tiny calls, the agent can issue a larger command that composes work in the environment, similar to how engineers use Unix tools.
This is one of the reasons Runable 2.0 moved away from rigid multi-agent workflows.
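To illustrate the difference between narrow calls and one composed command, the sketch below finds files that mention "status" in two ways. The first is a sequence of list and read steps, each standing in for a separate tool call. The second is a single shell pipeline. The file names and the call-counting scheme are hypothetical, and the example assumes a POSIX shell with grep and sort available. It is not taken from the Runable harness.

```python
import subprocess
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "campaigns.ts").write_text("export const status = 'draft';\n")
    (src / "assets.ts").write_text("export const status = 'ready';\n")
    (src / "theme.ts").write_text("export const accent = '#333';\n")

    # Narrow style: one call to list the directory, then one call per file to read it.
    narrow_calls = 1
    narrow_hits = []
    for path in sorted(src.iterdir()):
        narrow_calls += 1
        if "status" in path.read_text():
            narrow_hits.append(path.name)

    # Composed style: a single command does the search and ordering in the environment.
    result = subprocess.run(
        "grep -l status *.ts | sort",
        shell=True, cwd=src, capture_output=True, text=True, check=True,
    )
    composed_calls = 1
    composed_hits = result.stdout.split()

print(narrow_calls, narrow_hits)      # 4 ['assets.ts', 'campaigns.ts']
print(composed_calls, composed_hits)  # 1 ['assets.ts', 'campaigns.ts']
```

Both approaches return the same answer, but the narrow style needs a number of calls that grows with the number of files, while the composed command stays at one.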
Model labs are training frontier models heavily on:
- coding,
- shell commands,
- file operations,
- terminal workflows,
- browser/computer control,
- long-horizon software tasks.
Runable's harness is built to take advantage of that trend.
If models become better at coding and shell-style execution, Runable's environment becomes more powerful. Better models do not commoditize the harness; they increase the leverage of the harness.
This is different from systems that expose many narrow JSON tools. In those systems, the model must repeatedly choose among schemas and pass compressed context through tool inputs. In Runable, the agent can compose larger chunks of work inside a familiar computer-like environment.
Pros of the subagent / multi-agent workflow:
- Can divide responsibilities.
- May work for isolated tasks.
- Can create a sense of specialization.
Cons:
- Context is fragmented.
- Each handoff compresses intent.
- Subagents may make inconsistent assumptions.
- The coordinator must reconstruct global state.
- Long edits require rediscovering decisions.
- More sessions and tool boundaries increase failure surface.
Pros of the single-agent workflow:
- One owner of product intent.
- Less handoff loss.
- Better continuity across build and edit.
- Easier to preserve artifact state.
- Better fit for cross-cutting edits.
- More aligned with CLI/shell/coding strengths of modern models.
Cons:
- Requires strong context management.
- Requires good compaction/memory/artifact state.
- Requires powerful tools that can compose meaningful work.
Runable 2.0 chooses the single-agent path because outcome quality depends more on preserved intent and coherent artifact state than on artificially splitting work into roles.
Do not claim a result before running the benchmark.
After running, use the generated SUMMARY.md and inspect both app outputs. The questions that matter most for an investor memo are:
- Did the single-agent version use fewer tool calls?
- Did it pass build more reliably?
- Did the edit touch fewer files or integrate more cleanly?
- Did the subagent version show duplicated abstractions or inconsistent state?
- Did the single-agent version preserve product intent better?
Suggested memo wording after successful runs:
In our pi benchmark, the single-agent workflow completed the same complex app + cross-cutting edit with fewer context handoffs and fewer tool-call boundaries than the subagent workflow. This supports the Runable 2.0 architecture decision: for outcome work, the key bottleneck is not dividing tasks among agents, but preserving intent, artifact state, and execution continuity across the full lifecycle.
The latest completed run is summarized in BENCHMARK_RESULT.md.
Headline result:
| Variant | Pi invocations | Total sec | Tool calls | Final build | Edit changed files |
|---|---|---|---|---|---|
| Single agent | 2 | 662.284 | 93 | PASS | 6 |
| Subagent workflow | 5 | 1186.555 | 252 | PASS | 13 |
Single-agent advantage:
- 60% fewer pi invocations.
- 44% less wall-clock time.
- 63% fewer tool calls.
- 54% fewer files touched during the cross-cutting edit.
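The percentages above can be reproduced from the headline table. The sketch copies the table values verbatim and computes the relative reduction for each metric. The dictionary layout and the `percent_reduction` helper are only for illustration and do not reflect the schema of summary.json.

```python
# (single agent, subagent workflow) values copied from the headline table.
headline = {
    "Pi invocations": (2, 5),
    "Total sec": (662.284, 1186.555),
    "Tool calls": (93, 252),
    "Edit changed files": (6, 13),
}


def percent_reduction(single: float, subagent: float) -> float:
    """How much lower the single-agent value is, as a percentage of the subagent value."""
    return (1.0 - single / subagent) * 100.0


reductions = {metric: percent_reduction(*values) for metric, values in headline.items()}
for metric, reduction in reductions.items():
    print(f"{metric}: {reduction:.0f}% lower for the single-agent run")
```

This prints 60%, 44%, 63%, and 54%, matching the list above.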
benchmark/
  README.md
  BENCHMARK_RESULT.md
  run_benchmark.py
  results/    # generated after runs