Releases: cipherfoxie/agent-bench
v0.1.0 — Initial release
First tagged release of agent-bench: measure whether an MCP server or skill actually makes your coding agent better, on the models you run.
What it does
- Runs a coding agent (opencode headless) through the same tasks twice — with and without an enhancement (MCP server, skill prompt, model setting) — as explicit A/B arms
- Scores every run with deterministic gates (build + typecheck + frozen fact checklists), no LLM judge anywhere
- Records per run: hard-gate success, tool calls, input/output tokens, wall-clock time, and three objective edit-quality KPIs (diff minimality vs reference patch, regression-freedom, lint cleanliness)
- First-class support for self-hosted / any OpenAI-compatible models
- Adding a new tool to benchmark is one config entry
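As a rough illustration of the A/B scoring described above, the sketch below models one run record, a deterministic hard gate, and a per-arm comparison. It is not agent-bench's actual code or schema: `RunRecord`, its field names, the arm labels `"baseline"` and `"enhanced"`, and the helpers `summarize` and `compare` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One agent run. Field names are illustrative, not agent-bench's schema."""
    arm: str            # hypothetical arm label: "baseline" or "enhanced"
    build_ok: bool
    typecheck_ok: bool
    facts_passed: int   # items satisfied on the frozen fact checklist
    facts_total: int
    tool_calls: int
    tokens_in: int
    tokens_out: int
    wallclock_s: float

    @property
    def hard_gate(self) -> bool:
        # Deterministic pass/fail: every gate must hold, no LLM judge involved.
        return (
            self.build_ok
            and self.typecheck_ok
            and self.facts_passed == self.facts_total
        )


def summarize(runs: list[RunRecord], arm: str) -> dict[str, float]:
    """Aggregate one arm: hard-gate success rate plus mean cost metrics."""
    selected = [r for r in runs if r.arm == arm]
    if not selected:
        raise ValueError(f"no runs recorded for arm {arm!r}")
    return {
        "success_rate": mean(1.0 if r.hard_gate else 0.0 for r in selected),
        "mean_tool_calls": mean(r.tool_calls for r in selected),
        "mean_tokens": mean(r.tokens_in + r.tokens_out for r in selected),
        "mean_wallclock_s": mean(r.wallclock_s for r in selected),
    }


def compare(
    runs: list[RunRecord],
    baseline: str = "baseline",
    enhanced: str = "enhanced",
) -> dict[str, float]:
    """Return enhanced-minus-baseline deltas for every summary metric."""
    base = summarize(runs, baseline)
    enh = summarize(runs, enhanced)
    return {key: enh[key] - base[key] for key in base}
```

Under these assumptions, a positive `success_rate` delta together with negative token and wall-clock deltas would mean the enhancement helped; a delta near zero on success with higher token counts would point the other way.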
Published findings so far
- Serena (semantic MCP tools): SITUATIONAL — a guardrail, not a turbo
- caveman (token-savings skill): delivered only about ⅓ of the claimed ~75% token savings, and was a net cost on every Claude model tested
Negative results are published on purpose. Full write-ups: https://sovgrid.org/blog/agent-bench-pillar/