Runable 2.0 Agent Architecture Benchmark

GitHub: https://github.com/runablehq/runable-agent-benchmark

This benchmark compares two ways of using pi for a long-horizon product task:

  1. Single-agent workflow — one agent owns the full product context, builds the app, then applies a cross-cutting edit.
  2. Subagent / multi-agent workflow — several isolated role-based agents build pieces of the app, then a coordinator applies the same edit.

The benchmark exists to support the Runable 2.0 architecture thesis:

For long-horizon outcome generation, a single agent with strong context, artifact state, and composable CLI-style tools is often more reliable than a workflow/subagent system that fragments context across handoffs.

Runable 1.0 was closer to a workflow/multi-agent approach. Runable 2.0 is designed around a single-agent outcome loop.


Why this benchmark matters

Most agent demos test the first generation of an artifact. That is not enough.

Real users do not just ask an agent to build once. They come back and say:

  • "Add approvals to every asset."
  • "Make this match my brand deck."
  • "Update the app without breaking the existing workflow."
  • "Add reporting across campaigns, proposals, and assets."

The edit-after-build phase is where architecture differences show up.

A multi-agent workflow may look good during the initial build, but later edits expose:

  • context loss,
  • unclear ownership,
  • inconsistent assumptions,
  • duplicated abstractions,
  • more tool-call boundaries,
  • more integration failures,
  • higher cost to reconstruct intent.

A single agent has a better chance of preserving the product intent, data model, UI assumptions, and artifact state across the full lifecycle.


Benchmark task

Both variants receive the same task:

Build phase

Build a Vite + React + TypeScript app called RunOps Studio.

The app is an SMB operations workspace with:

  • dashboard,
  • brand profile,
  • campaign board,
  • asset library,
  • ad performance analytics,
  • proposal builder,
  • client report preview,
  • local-first persistence,
  • realistic seed data,
  • polished SaaS UI,
  • npm run build validation.

Edit phase

After the app is built, apply a cross-cutting product change:

Client Approval Center

The edit must add approval state, reviewer comments, audit trail, filters, and approval actions across campaigns, assets, and proposal sections.

This edit intentionally touches multiple parts of the product. It tests whether the agent preserved enough context to modify the system coherently instead of rewriting or breaking it.


How to run

From this directory:

python3 run_benchmark.py --runs 1 --model anthropic/claude-sonnet-4-5 --thinking high

Or use your default configured pi model:

python3 run_benchmark.py --runs 1

Run only one variant:

python3 run_benchmark.py --mode single
python3 run_benchmark.py --mode subagent

Skip validation if you only want to collect agent/tool metrics:

python3 run_benchmark.py --skip-validation

Dry run:

python3 run_benchmark.py --dry-run

Outputs are written to:

results/<timestamp>/

Each run includes:

  • generated app directories,
  • pi JSONL logs,
  • prompts used,
  • npm install/build logs,
  • summary.json,
  • SUMMARY.md.

Metrics collected

The script records:

| Metric | Meaning |
| --- | --- |
| Pi invocations | Number of separate agent sessions. More sessions means more context handoffs. |
| Total seconds | Wall-clock time spent in pi invocations. |
| Tool calls | Total tool calls made by pi. |
| Tool errors | Tool calls that ended with errors. |
| Turns | Number of assistant turns. |
| Final build | Whether npm run build passed after the edit. |
| Edit changed files | How many files changed during the edit phase. |

Manual review is still important. The benchmark measures reliability and cost proxies, but product quality should also be inspected.
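
As an illustration of the "Edit changed files" metric, one way to count changed files is to hash the generated app directory before and after the edit phase and compare the two snapshots. This is a minimal sketch under that assumption; run_benchmark.py may compute the metric differently, and the function names and the skipped directory names here are hypothetical.

```python
import hashlib
from pathlib import Path

# Directories assumed to hold dependencies or build output, not agent edits.
SKIPPED_DIRS = {"node_modules", "dist", ".git"}


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 digest of its contents."""
    digests: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or SKIPPED_DIRS.intersection(relative.parts):
            continue
        digests[relative.as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that were added, removed, or modified between two snapshots."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))
```

Taking one snapshot right after the build phase and another right after the edit phase, then calling changed_files on the pair, gives a count that includes added and deleted files as well as modified ones.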


What we expect to see

The hypothesis is that the single-agent workflow should generally have:

  • fewer context handoffs,
  • fewer total tool calls,
  • fewer integration mismatches,
  • less duplicated or conflicting architecture,
  • cleaner edit behavior,
  • better preservation of product intent,
  • lower cost per successful outcome.

The subagent workflow may sometimes parallelize or produce useful isolated work, but it pays a handoff tax. Each subagent only knows what was written to files or passed in the prompt. Rich product context, intent, taste, and design constraints are repeatedly compressed.


Why tool-call count matters

Every tool call is a possible failure point.

If a workflow requires 100 narrow tool calls, even a low per-call failure rate compounds into meaningful end-to-end failure risk. Long chains also increase:

  • latency,
  • context bloat,
  • tool result parsing overhead,
  • recovery complexity,
  • model confusion.
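
The compounding claim above can be made concrete with a small calculation. This is a minimal sketch that assumes every tool call fails independently with the same probability; the 1% rate is an illustrative assumption, not a value measured by the benchmark, and the function name is hypothetical.

```python
def chain_failure_probability(per_call_failure: float, calls: int) -> float:
    """Probability that at least one of `calls` independent tool calls fails."""
    return 1.0 - (1.0 - per_call_failure) ** calls


if __name__ == "__main__":
    # Illustrative 1% per-call failure rate (an assumption, not benchmark data).
    for calls in (10, 50, 100):
        risk = chain_failure_probability(0.01, calls)
        print(f"{calls:>3} calls -> {risk:.1%} chance of at least one failure")
```

Under these assumptions, 100 calls at a 1% failure rate leave roughly a 63% chance that at least one call fails. Real tool errors are often recoverable and rarely independent, so this is an intuition for why call count matters, not a prediction of end-to-end failure.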

Runable's architecture deliberately pushes many tools toward CLI-like composable primitives.

Instead of forcing the agent to make dozens of tiny calls, the agent can issue a larger command that composes work in the environment, similar to how engineers use Unix tools.

This is one of the reasons Runable 2.0 moved away from rigid multi-agent workflows.


Why CLI-style tools are a model-lab tailwind

Model labs are training frontier models heavily on:

  • coding,
  • shell commands,
  • file operations,
  • terminal workflows,
  • browser/computer control,
  • long-horizon software tasks.

Runable's harness is built to take advantage of that trend.

If models become better at coding and shell-style execution, Runable's environment becomes more powerful. Better models do not commoditize the harness; they increase the leverage of the harness.

This is different from systems that expose many narrow JSON tools. In those systems, the model must repeatedly choose among schemas and pass compressed context through tool inputs. In Runable, the agent can compose larger chunks of work inside a familiar computer-like environment.


Single agent vs subagents: architecture lesson

Subagent / multi-agent workflow

Pros:

  • Can divide responsibilities.
  • May work for isolated tasks.
  • Can create a sense of specialization.

Cons:

  • Context is fragmented.
  • Each handoff compresses intent.
  • Subagents may make inconsistent assumptions.
  • The coordinator must reconstruct global state.
  • Long edits require rediscovering decisions.
  • More sessions and tool boundaries increase failure surface.

Single-agent workflow

Pros:

  • One owner of product intent.
  • Less handoff loss.
  • Better continuity across build and edit.
  • Easier to preserve artifact state.
  • Better fit for cross-cutting edits.
  • More aligned with CLI/shell/coding strengths of modern models.

Cons:

  • Requires strong context management.
  • Requires good compaction/memory/artifact state.
  • Requires powerful tools that can compose meaningful work.

Runable 2.0 chooses the single-agent path because outcome quality depends more on preserved intent and coherent artifact state than on artificially splitting work into roles.


How to use results in the memo

Do not claim a result before running the benchmark.

After running, use the generated SUMMARY.md and inspect both app outputs. The questions that matter most for an investor-ready memo are:

  1. Did the single-agent version use fewer tool calls?
  2. Did it pass build more reliably?
  3. Did the edit touch fewer files or integrate more cleanly?
  4. Did the subagent version show duplicated abstractions or inconsistent state?
  5. Did the single-agent version preserve product intent better?

Suggested memo wording after successful runs:

In our pi benchmark, the single-agent workflow completed the same complex app + cross-cutting edit with fewer context handoffs and fewer tool-call boundaries than the subagent workflow. This validates the Runable 2.0 architecture decision: for outcome work, the key bottleneck is not dividing tasks among agents, but preserving intent, artifact state, and execution continuity across the full lifecycle.


Latest run

The latest completed run is summarized in BENCHMARK_RESULT.md.

Headline result:

| Variant | Pi invocations | Total sec | Tool calls | Final build | Edit changed files |
| --- | --- | --- | --- | --- | --- |
| Single agent | 2 | 662.284 | 93 | PASS | 6 |
| Subagent workflow | 5 | 1186.555 | 252 | PASS | 13 |

Single-agent advantage:

  • 60% fewer pi invocations.
  • 44% less wall-clock time.
  • 63% fewer tool calls.
  • 54% fewer files touched during the cross-cutting edit.
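
The percentages above follow directly from the headline numbers. A minimal sketch of the arithmetic, using only the values reported for this run (the function and variable names are hypothetical and are not taken from run_benchmark.py):

```python
def percent_reduction(single: float, subagent: float) -> float:
    """Percent by which the single-agent value is lower than the subagent value."""
    return (1 - single / subagent) * 100


# (single agent, subagent workflow) pairs copied from the headline result.
headline = {
    "pi invocations": (2, 5),
    "total seconds": (662.284, 1186.555),
    "tool calls": (93, 252),
    "edit changed files": (6, 13),
}

if __name__ == "__main__":
    for metric, (single, subagent) in headline.items():
        print(f"{metric}: {percent_reduction(single, subagent):.0f}% lower")
```

Note that the time figure is a reduction in elapsed seconds: the single-agent run took about 56% of the subagent run's wall-clock time.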

Files

benchmark/
  README.md
  BENCHMARK_RESULT.md
  run_benchmark.py
  results/                 # generated after runs
