Releases: cipherfoxie/agent-bench

v0.1.0 — Initial release

18 Jun 19:36

First tagged release of agent-bench: measure whether an MCP server or skill actually makes your coding agent better, on the models you run.

What it does

  • Runs a coding agent (opencode headless) through the same tasks twice, once with and once without an enhancement (MCP server, skill prompt, or model setting), as explicit A/B arms
  • Scores every run with deterministic gates (build + typecheck + frozen fact checklists), no LLM judge anywhere
  • Records per run: hard-gate success, tool calls, in/out tokens, wallclock, and three objective edit-quality KPIs (diff minimality vs reference patch, regression-freedom, lint cleanliness)
  • First-class support for self-hosted / any OpenAI-compatible models
  • Adding a new tool to benchmark is one config entry
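
The scoring idea above can be sketched in a few lines of Python. This is a minimal illustration and not agent-bench's actual code: the `RunResult` fields, the arm names `baseline` and `enhanced`, and the helpers `passes_hard_gate`, `summarize`, and `compare` are all hypothetical names chosen for this example. It assumes a run counts as a success only when the build, the typecheck, and every item on the frozen fact checklist pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One agent run. Field names are illustrative, not agent-bench's schema."""
    arm: str            # hypothetical arm label: "baseline" or "enhanced"
    build_ok: bool
    typecheck_ok: bool
    facts_passed: int   # frozen checklist items satisfied
    facts_total: int
    tool_calls: int
    tokens_in: int
    tokens_out: int
    wallclock_s: float


def passes_hard_gate(run: RunResult) -> bool:
    """Deterministic pass/fail: every gate must hold, with no partial credit."""
    return (
        run.build_ok
        and run.typecheck_ok
        and run.facts_passed == run.facts_total
    )


def summarize(runs: list[RunResult], arm: str) -> dict[str, float]:
    """Average the recorded metrics over all runs of one arm."""
    arm_runs = [r for r in runs if r.arm == arm]
    if not arm_runs:
        raise ValueError(f"no runs recorded for arm {arm!r}")
    n = len(arm_runs)
    return {
        "success_rate": sum(passes_hard_gate(r) for r in arm_runs) / n,
        "mean_tool_calls": sum(r.tool_calls for r in arm_runs) / n,
        "mean_tokens": sum(r.tokens_in + r.tokens_out for r in arm_runs) / n,
        "mean_wallclock_s": sum(r.wallclock_s for r in arm_runs) / n,
    }


def compare(runs: list[RunResult]) -> dict[str, float]:
    """Per-metric difference, enhanced minus baseline."""
    base = summarize(runs, "baseline")
    enh = summarize(runs, "enhanced")
    return {key: enh[key] - base[key] for key in base}
```

Because every gate is a plain boolean check, scoring the same run twice always gives the same verdict, which is the property an LLM judge cannot guarantee. A positive `success_rate` delta together with a negative `mean_tokens` delta would suggest the enhancement helps; mixed signs are the kind of situational result the findings below describe.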

Published findings so far

  • Serena (semantic MCP tools): SITUATIONAL — a guardrail, not a turbo
  • caveman (token-savings skill): delivered roughly a third of its claimed ~75% token savings, and still cost money on every Claude model tested

Negative results are published on purpose. Full write-ups: https://sovgrid.org/blog/agent-bench-pillar/