Eval: add retry/iteration support for flakiness tracking #54

@olaservo

Description

Problem

LLM evals are inherently non-deterministic. Running each eval once yields a binary pass/fail result but no signal about reliability. A task that passes 2 out of 3 runs is very different from one that passes 3 out of 3.

Currently:

  • Single-run only (no `--iterations` flag)
  • No pass-rate tracking
  • No aggregate statistics across repeated runs

Suggestion

Add optional multi-run support:

  • `--iterations=N` flag (default: 1 for backwards compatibility)
  • Track pass/fail per iteration
  • Report aggregate: `3/3 passed (100%)` or `2/3 passed (67%)`
  • Save aggregate results alongside individual run logs

Even N=3 would provide a meaningful reliability signal without excessive cost.
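The multi-run behavior above could be sketched roughly as follows. This is a hypothetical illustration, not the actual implementation: `runEval`, `IterationResult`, and `AggregateResult` are assumed names standing in for whatever the real single-run entry point and result types in `evals/eval.ts` / `evals/lib/metrics.ts` turn out to be.

```typescript
// One row per iteration, plus an aggregate summary across all runs.
interface IterationResult {
  iteration: number;
  passed: boolean;
}

interface AggregateResult {
  passed: number;
  total: number;
  passRate: number; // fraction in [0, 1]
}

// Run the (hypothetical) single-run eval function N times and aggregate.
// Default of 1 preserves current single-run behavior.
async function runWithIterations(
  runEval: () => Promise<boolean>,
  iterations: number = 1,
): Promise<{ runs: IterationResult[]; aggregate: AggregateResult }> {
  const runs: IterationResult[] = [];
  for (let i = 1; i <= iterations; i++) {
    runs.push({ iteration: i, passed: await runEval() });
  }
  const passed = runs.filter((r) => r.passed).length;
  return {
    runs,
    aggregate: { passed, total: iterations, passRate: passed / iterations },
  };
}

// Render the aggregate in the proposed report format, e.g. "2/3 passed (67%)".
function formatAggregate(a: AggregateResult): string {
  return `${a.passed}/${a.total} passed (${Math.round(a.passRate * 100)}%)`;
}
```

Keeping the per-iteration rows alongside the aggregate means individual run logs can still be saved unchanged, with the summary written next to them.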

Files

  • `evals/eval.ts` (parseArgs, main loop)
  • `evals/lib/metrics.ts` (aggregate result type)
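For the `parseArgs` change, a minimal sketch of parsing the proposed `--iterations=N` flag might look like this; the flag name comes from the proposal, while the parsing helper and its error message are assumptions for illustration:

```typescript
// Hypothetical helper: extract --iterations=N from argv, defaulting to 1
// so existing single-run invocations behave exactly as before.
function parseIterations(argv: string[]): number {
  const arg = argv.find((a) => a.startsWith("--iterations="));
  if (!arg) return 1; // backwards compatible: single run
  const n = Number.parseInt(arg.split("=")[1], 10);
  if (!Number.isInteger(n) || n < 1) {
    throw new Error(`--iterations must be a positive integer, got: ${arg}`);
  }
  return n;
}
```

Rejecting zero or negative values early keeps the main loop free of edge-case handling.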
