Releases · cipherfoxie/agent-bench

First tagged release of agent-bench: measure whether an MCP server or skill actually makes your coding agent better, on the models you run.

What it does

Runs a coding agent (opencode headless) through the same tasks twice — with and without an enhancement (MCP server, skill prompt, model setting) — as explicit A/B arms
Scores every run with deterministic gates (build + typecheck + frozen fact checklists), no LLM judge anywhere
Records per run: hard-gate success, tool calls, input/output tokens, wall-clock time, and three objective edit-quality KPIs (diff minimality vs reference patch, regression-freedom, lint cleanliness)
First-class support for self-hosted / any OpenAI-compatible models
Adding a new tool to benchmark is one config entry
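
To make the A/B idea above concrete, here is a minimal sketch of how per-run records from two arms could be compared. It is an illustration only, not agent-bench's actual schema or code: the names `RunRecord`, `summarize`, and `compare`, the arm labels `baseline` and `enhanced`, and the field names are all assumptions, and the edit-quality KPIs are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One agent run on one task in one arm (hypothetical schema)."""
    task: str
    arm: str              # assumed labels: "baseline" or "enhanced"
    build_ok: bool        # deterministic gate: build succeeded
    typecheck_ok: bool    # deterministic gate: typecheck succeeded
    facts_ok: bool        # deterministic gate: frozen fact checklist satisfied
    tool_calls: int
    tokens_in: int
    tokens_out: int
    wallclock_s: float

    @property
    def passed(self) -> bool:
        # A run counts as a success only if every hard gate passed.
        return self.build_ok and self.typecheck_ok and self.facts_ok


def summarize(runs: list[RunRecord], arm: str) -> dict[str, float]:
    """Average the recorded metrics over all runs of a single arm."""
    arm_runs = [r for r in runs if r.arm == arm]
    if not arm_runs:
        raise ValueError(f"no runs recorded for arm {arm!r}")
    n = len(arm_runs)
    return {
        "pass_rate": sum(r.passed for r in arm_runs) / n,
        "mean_tokens": sum(r.tokens_in + r.tokens_out for r in arm_runs) / n,
        "mean_tool_calls": sum(r.tool_calls for r in arm_runs) / n,
        "mean_wallclock_s": sum(r.wallclock_s for r in arm_runs) / n,
    }


def compare(runs: list[RunRecord],
            baseline: str = "baseline",
            enhanced: str = "enhanced") -> dict[str, float]:
    """Return enhanced-minus-baseline deltas for each summarized metric."""
    base = summarize(runs, baseline)
    enh = summarize(runs, enhanced)
    return {key: enh[key] - base[key] for key in base}
```

A positive `pass_rate` delta together with negative token and wall-clock deltas would suggest the enhancement helps on those tasks. With only a handful of runs per arm, such deltas are noisy and should be read with caution.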

Published findings so far

Serena (semantic MCP tools): SITUATIONAL — a guardrail, not a turbo
caveman (token-savings skill): saved ~⅓ of the claimed ~75%, and cost money on every Claude model tested

Negative results are published on purpose. Full write-ups: https://sovgrid.org/blog/agent-bench-pillar/
