Agent Eval Harness — Shared Benchmark Infrastructure #5

@waylonkenning

A shared infrastructure for evaluating how well AI coding agents perform on real software engineering tasks.

Why: How do you know whether your agent actually did a good job? There is no community standard for evaluating agent quality, so teams rely on different ad-hoc proxies (did it pass CI? did the PM accept the PR? did it complete the task?). A shared eval harness would let members:

  • Compare agents/objectives/frameworks objectively
  • Share test suites and benchmark tasks
  • Track performance over time as models improve
  • Make evidence-based decisions about which tools to standardise on

What it could look like:

  • A repo (agenticsnz/eval) with:
    • A standardised set of benchmark tasks (e.g. "fix this bug", "implement this feature", "write tests for this module")
    • A runner that scores agent outputs against objective criteria
    • A shared results dashboard or leaderboard
  • Integration with popular agent frameworks
  • Both automated metrics and human rubric scoring
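
One way the task set and the mix of automated and human scoring could be represented is sketched below. Everything here is an assumption for illustration: the `BenchmarkTask` and `TaskResult` field names, the 1 to 5 rubric scale, and the 0.3 rubric weight are hypothetical and would need to be agreed on by the group.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkTask:
    """One benchmark task. All field names are hypothetical."""
    task_id: str
    kind: str           # e.g. "bugfix", "feature", "tests"
    prompt: str         # instruction handed to the agent
    check_command: str  # command whose exit code gives the automated pass/fail


@dataclass
class TaskResult:
    """Outcome of one agent run on one task."""
    task_id: str
    automated_pass: bool  # objective criterion, e.g. the check command exited 0
    rubric_scores: dict[str, int] = field(default_factory=dict)  # human scores, 1-5 each


def combined_score(result: TaskResult, rubric_weight: float = 0.3) -> float:
    """Blend the automated result and the human rubric into a score in [0, 1]."""
    automated = 1.0 if result.automated_pass else 0.0
    if not result.rubric_scores:
        # No human review yet: fall back to the automated result alone.
        return automated
    mean = sum(result.rubric_scores.values()) / len(result.rubric_scores)
    rubric = (mean - 1) / 4  # map the assumed 1..5 scale onto 0..1
    return (1 - rubric_weight) * automated + rubric_weight * rubric
```

Keeping the automated result and the rubric scores as separate fields means a leaderboard could report them individually as well as blended, which avoids baking one weighting into the stored data.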

Related reading:

  • SWE-bench (software engineering evals)
  • BFCL (Berkeley Function-Calling Leaderboard; function/tool-calling evals)
  • LiveCodeBench

How to start: Define a small set of agreed-upon benchmark tasks, pick one agent framework to target first, and build a minimal working runner that outputs scores.
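
A minimal runner along these lines might look like the sketch below. It is not tied to any real agent framework: the `Agent` callable signature, the task dictionary keys (`id`, `prompt`, `files`, `check`), and the toy `fixing_agent` and `noop_agent` are all hypothetical stand-ins. Each task is scored by a single check command run inside a throwaway workspace.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical agent interface: receives a prompt and edits files in the workspace.
Agent = Callable[[str, Path], None]

# Hypothetical task format; a real suite would likely load these from files.
TASKS = [
    {
        "id": "fix-add",
        "prompt": "add() in calc.py returns the wrong result; fix it.",
        "files": {"calc.py": "def add(a, b):\n    return a - b\n"},
        "check": [sys.executable, "-c", "from calc import add; assert add(2, 3) == 5"],
    },
]


def run_task(agent: Agent, task: dict) -> bool:
    """Run one task in a temporary workspace; True if the check command exits 0."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        for name, content in task["files"].items():
            (workspace / name).write_text(content)
        agent(task["prompt"], workspace)
        proc = subprocess.run(task["check"], cwd=workspace,
                              capture_output=True, timeout=60)
        return proc.returncode == 0


def run_suite(agent: Agent, tasks: list[dict]) -> dict:
    """Run every task and report per-task results plus the overall pass rate."""
    results = {t["id"]: run_task(agent, t) for t in tasks}
    pass_rate = sum(results.values()) / len(results) if results else 0.0
    return {"results": results, "pass_rate": pass_rate}


def fixing_agent(prompt: str, workspace: Path) -> None:
    """Toy agent that writes a hard-coded fix; stands in for a real framework."""
    (workspace / "calc.py").write_text("def add(a, b):\n    return a + b\n")


def noop_agent(prompt: str, workspace: Path) -> None:
    """Toy agent that changes nothing, so the check should fail."""


if __name__ == "__main__":
    print(run_suite(fixing_agent, TASKS))
    print(run_suite(noop_agent, TASKS))
```

Scoring by exit code keeps the runner independent of any test framework, so a task's check could be a unit test run, a linter, or a custom script. A real harness would also need sandboxing and resource limits before running untrusted agent output.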
