Problem
LLM evals are inherently non-deterministic. Running each eval once yields a single pass/fail result but no signal about reliability. A task that passes 2 out of 3 runs is very different from one that passes 3 out of 3.
Currently:
- Single-run only (no `--iterations` flag)
- No pass-rate tracking
- No aggregate statistics across repeated runs
Suggestion
Add optional multi-run support:
- `--iterations=N` flag (default: 1 for backwards compatibility)
- Track pass/fail per iteration
- Report aggregate: `3/3 passed (100%)` or `2/3 passed (67%)`
- Save aggregate results alongside individual run logs
Even N=3 would provide a meaningful reliability signal without excessive cost.
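A minimal sketch of what the aggregate type and report formatting could look like. The names (`IterationResult`, `AggregateResult`, `formatAggregate`) are illustrative assumptions, not the repo's existing API:

```typescript
// Hypothetical shapes for per-iteration and aggregate results.
interface IterationResult {
  iteration: number;
  passed: boolean;
}

interface AggregateResult {
  total: number;
  passed: number;
  passRate: number; // 0..1
}

// Collapse per-iteration outcomes into one aggregate record.
function aggregate(results: IterationResult[]): AggregateResult {
  const passed = results.filter((r) => r.passed).length;
  return { total: results.length, passed, passRate: passed / results.length };
}

// Render the report line in the `2/3 passed (67%)` format suggested above.
function formatAggregate(agg: AggregateResult): string {
  return `${agg.passed}/${agg.total} passed (${Math.round(agg.passRate * 100)}%)`;
}
```

The aggregate record could be serialized alongside the individual run logs so reliability trends are queryable later.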
Files
- `evals/eval.ts` (parseArgs, main loop)
- `evals/lib/metrics.ts` (aggregate result type)
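A rough sketch of how the flag and main loop change could look in `evals/eval.ts`. `parseArgs` is Node's built-in (`node:util`), but `runOnce` and the exact option wiring are assumptions about the codebase, not its actual structure:

```typescript
import { parseArgs } from "node:util";

// Parse --iterations from argv, defaulting to 1 so existing single-run
// invocations behave exactly as before.
function parseIterations(argv: string[]): number {
  const { values } = parseArgs({
    args: argv,
    options: { iterations: { type: "string", default: "1" } },
    strict: false, // tolerate the eval CLI's other flags
  });
  const n = Number(values.iterations);
  return Number.isInteger(n) && n >= 1 ? n : 1;
}

// Run one eval N times and collect pass/fail per iteration.
// runOnce is a placeholder for whatever executes a single eval run.
async function runIterations(
  n: number,
  runOnce: () => Promise<boolean>,
): Promise<boolean[]> {
  const outcomes: boolean[] = [];
  for (let i = 0; i < n; i++) {
    outcomes.push(await runOnce()); // sequential; keeps logs per-run
  }
  return outcomes;
}
```

Running sequentially keeps each iteration's log separate; parallelizing is a possible follow-up if cost/latency allows.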