Agent Eval Harness — Shared Benchmark Infrastructure

Shared infrastructure for evaluating how well AI coding agents perform on real software engineering tasks.

**Why:** How do you know whether your agent actually did a good job? There is no community standard for evaluating agent quality. Teams rely on different ad-hoc proxies: did it pass CI, did the PM accept the PR, did it complete the task? A shared eval harness would let members:
- Compare agents/objectives/frameworks objectively
- Share test suites and benchmark tasks
- Track performance over time as models improve
- Make evidence-based decisions about which tools to standardise on

**What it could look like:**
- A repo (`agenticsnz/eval`) with:
  - A standardised set of benchmark tasks (e.g. "fix this bug", "implement this feature", "write tests for this module")
  - A runner that scores agent outputs against objective criteria
  - A shared results dashboard or leaderboard
- Integration with popular agent frameworks
- Both automated metrics and human rubric scoring
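As a rough sketch of how a task definition and scoring could fit together, the snippet below models a benchmark task with a list of automated checks and blends the automated score with an optional human rubric score. Every name here (`BenchmarkTask`, `automated_score`, `combined_score`, the `auto_weight` default) is hypothetical and only illustrates the idea; nothing in it is an agreed design for `agenticsnz/eval`.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class BenchmarkTask:
    """Hypothetical task record; field names are illustrative assumptions."""
    task_id: str
    kind: str  # e.g. "bugfix", "feature", "tests"
    prompt: str
    # Each check takes the agent's output and returns True if a criterion is met.
    checks: list[Callable[[str], bool]] = field(default_factory=list)


def automated_score(task: BenchmarkTask, output: str) -> float:
    """Fraction of the task's automated checks that the output passes."""
    if not task.checks:
        return 0.0
    passed = sum(1 for check in task.checks if check(output))
    return passed / len(task.checks)


def combined_score(auto: float, rubric: Optional[float], auto_weight: float = 0.7) -> float:
    """Blend an automated score with a human rubric score (both in [0, 1]).

    If no human score is available, fall back to the automated score alone.
    The 0.7 default weight is an arbitrary placeholder, not a recommendation.
    """
    if rubric is None:
        return auto
    return auto_weight * auto + (1 - auto_weight) * rubric
```

Keeping the checks as plain callables means a task can mix cheap string checks with heavier ones (such as running a test suite) without changing the scoring code.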

**Related reading:**
- SWE-bench (software engineering evals)
- BFCL (Berkeley Function-Calling Leaderboard: function/tool-calling evals)
- LiveCodeBench

**How to start:** Define a small set of agreed-upon benchmark tasks, pick one agent framework to target first, and build a minimal working runner that outputs scores.
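A minimal runner along these lines might look like the sketch below. It assumes an agent can be wrapped as a plain function from prompt to output text, and that each task is a dict with `id`, `prompt`, and `check` keys; both are simplifying assumptions made for illustration, not part of any existing framework's API.

```python
import json
from typing import Callable

# Assumption: an agent is any callable that maps a task prompt to output text.
Agent = Callable[[str], str]


def run_benchmark(agent: Agent, tasks: list[dict]) -> dict:
    """Run the agent on every task and report per-task results and a pass rate."""
    results = {}
    for task in tasks:
        try:
            output = agent(task["prompt"])
            passed = bool(task["check"](output))
        except Exception:
            # An agent crash or a failing check counts as a failed task.
            passed = False
        results[task["id"]] = passed
    total = len(results)
    pass_rate = sum(results.values()) / total if total else 0.0
    return {"pass_rate": pass_rate, "results": results}


if __name__ == "__main__":
    # Toy stand-in for a real agent, used only to exercise the runner.
    def toy_agent(prompt: str) -> str:
        return "def add(a, b):\n    return a + b"

    demo_tasks = [
        {"id": "implement-add", "prompt": "Implement add(a, b).",
         "check": lambda out: "return a + b" in out},
        {"id": "write-tests", "prompt": "Write tests for add.",
         "check": lambda out: "assert" in out},
    ]
    print(json.dumps(run_benchmark(toy_agent, demo_tasks), indent=2))
```

Emitting the report as JSON keeps the runner decoupled from whatever dashboard or leaderboard ends up consuming the scores.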
