A shared infrastructure for evaluating how well AI coding agents perform on real software engineering tasks.
Why: How do you know if your agent actually did a good job? There's no community standard for evaluating agent quality — different teams use different ad-hoc proxies (did it pass CI? did the PM accept the PR? did it complete the task?). A shared eval harness would let members:
- Compare agents/objectives/frameworks objectively
- Share test suites and benchmark tasks
- Track performance over time as models improve
- Make evidence-based decisions about which tools to standardise on
What it could look like:
- A repo (agenticsnz/eval) with:
  - A standardised set of benchmark tasks (e.g. "fix this bug", "implement this feature", "write tests for this module")
  - A runner that scores agent outputs against objective criteria
  - A shared results dashboard or leaderboard
  - Integration with popular agent frameworks
  - Both automated metrics and human rubric scoring
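To make the "benchmark task" and "automated metrics plus human rubric" ideas concrete, here is one possible shape for the data, as a rough sketch only. Nothing here exists in agenticsnz/eval yet: the class names, field names, the 1-5 rubric scale, and the 50/50 weighting are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkTask:
    """One benchmark task. All field names are hypothetical."""
    task_id: str
    kind: str           # e.g. "bugfix", "feature", "tests"
    prompt: str         # instruction handed to the agent
    check_command: str  # command whose exit code decides pass/fail


@dataclass
class TaskResult:
    """Outcome of one agent attempt at one task."""
    task_id: str
    checks_passed: bool  # automated metric
    # Human rubric scores, e.g. {"readability": 4}; assumed scale is 1-5.
    rubric_scores: dict[str, int] = field(default_factory=dict)


def combined_score(result: TaskResult,
                   automated_weight: float = 0.5,
                   rubric_max: int = 5) -> float:
    """Blend the automated check and the human rubric into a 0-1 score.

    The weighting is an assumption; a real harness might report the two
    numbers separately instead of blending them.
    """
    automated = 1.0 if result.checks_passed else 0.0
    if not result.rubric_scores:
        # No human review yet: fall back to the automated signal alone.
        return automated
    rubric = sum(result.rubric_scores.values()) / (
        len(result.rubric_scores) * rubric_max
    )
    return automated_weight * automated + (1 - automated_weight) * rubric
```

Keeping the rubric optional means a task can be scored as soon as the automated check runs, with human scores folded in later.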
Related reading:
- SWE-bench (software engineering evals based on real GitHub issues)
- BFCL (Berkeley Function-Calling Leaderboard; function and tool calling evals)
- LiveCodeBench (code generation evals)
How to start: Define a small set of agreed-upon benchmark tasks, pick one agent framework to target first, and build a minimal working runner that outputs scores.
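A first runner could be very small. The sketch below assumes an agent is just a callable that receives a prompt and a working directory, and that each task is judged by the exit code of a check command. That interface, the function names, and the toy tasks are all hypothetical, not an existing framework's API.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical agent interface: edit files in `workdir` to satisfy `prompt`.
Agent = Callable[[str, Path], None]


def run_task(agent: Agent, prompt: str, check_cmd: list[str],
             timeout_s: int = 60) -> bool:
    """Run the agent in a fresh temp dir, then run the check command there."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        agent(prompt, workdir)
        try:
            proc = subprocess.run(check_cmd, cwd=workdir,
                                  capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # a hung check counts as a failure
        return proc.returncode == 0


def run_suite(agent: Agent,
              tasks: dict[str, tuple[str, list[str]]]) -> dict[str, bool]:
    """Map each task id to whether the agent's output passed its check."""
    return {tid: run_task(agent, prompt, cmd)
            for tid, (prompt, cmd) in tasks.items()}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


def toy_agent(prompt: str, workdir: Path) -> None:
    """Stand-in agent that only 'solves' the one prompt it recognises."""
    if prompt == "Write 2+2 to answer.txt":
        (workdir / "answer.txt").write_text("4\n")


# The check exits 0 only if answer.txt exists and contains "4".
CHECK = [sys.executable, "-c",
         "import pathlib, sys; "
         "sys.exit(0 if pathlib.Path('answer.txt').read_text().strip() == '4' else 1)"]

TASKS = {
    "add": ("Write 2+2 to answer.txt", CHECK),
    "unsolved": ("Write 3+3 to answer.txt", CHECK),
}

if __name__ == "__main__":
    results = run_suite(toy_agent, TASKS)
    print(results, f"pass rate: {pass_rate(results):.2f}")
```

Targeting a real agent framework would then mean writing one adapter function with the `Agent` signature, while the tasks and scoring stay unchanged.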