feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1 #4
Merged
marceloceccon merged 1 commit on May 26, 2026
The bench was measuring the wrong thing. Self-reported confidence is a
meta-signal, not a quality signal — and on every panel in this repo it
was further corrupted by the upstream `extractJudgeConfidence` silent
50 default (fixed in ai-consensus-core 0.11.1). The "consensus loses
11/12, costs 40× tokens for nothing" verdict the old bench produced
was a measurement artifact, not a result about the panel.
This change adds a held-out LLM-as-judge rubric evaluator. Pass
`--evaluator-model` + `--evaluator-provider`; when the panel declares
a rubric the bench scores both the consensus synthesis and the
single-model baseline against that rubric using a third model that
is on neither side of the comparison. Rubrics measure answer quality
against named criteria the panel commits to (for architecture_v2:
quantification, single-recommendation, reversibility-weighing,
tripwire-specificity, failure-mode-realism).
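A panel-declared rubric could look roughly like the sketch below. Only the names `RubricCriterion`, the optional `rubric` field, and the five criterion ids come from this change; the field names (`id`, `description`, `maxScore`), the 0-10 scale, the `PresetSketch` type, and both helper functions are assumptions made for illustration.

```typescript
// Hypothetical shapes. The source names `RubricCriterion` and an optional
// `rubric` field on the preset; the fields `id`, `description`, and
// `maxScore` are assumptions made for this sketch.
interface RubricCriterion {
  readonly id: string;
  readonly description: string;
  readonly maxScore: number;
}

interface PresetSketch {
  readonly id: string;
  readonly rubric?: readonly RubricCriterion[];
}

// Criterion ids come from the commit message. The descriptions and the
// 0-10 scale are placeholders, not the shipped rubric text.
const architectureV2Sketch: PresetSketch = {
  id: "architecture_v2",
  rubric: [
    { id: "quantification", description: "placeholder", maxScore: 10 },
    { id: "single-recommendation", description: "placeholder", maxScore: 10 },
    { id: "reversibility-weighing", description: "placeholder", maxScore: 10 },
    { id: "tripwire-specificity", description: "placeholder", maxScore: 10 },
    { id: "failure-mode-realism", description: "placeholder", maxScore: 10 },
  ],
};

// Rubric scoring is skipped unless the panel declares at least one criterion.
function hasRubric(preset: PresetSketch): boolean {
  return (preset.rubric?.length ?? 0) > 0;
}

// Normalise a raw total to 0-100 so rubrics of different sizes are comparable.
function normalizeTotal(rawTotal: number, rubric: readonly RubricCriterion[]): number {
  const max = rubric.reduce((sum, c) => sum + c.maxScore, 0);
  return max === 0 ? 0 : (rawTotal / max) * 100;
}
```

Keeping the rubric on the preset means a panel without one is simply skipped by the evaluator path, which is what makes adding rubrics to other panels additive.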
On architecture_v2 (4 cases × 3 runs × seed 42), with grok-4.3 as
both judge and baseline and claude-opus-4-5 as the held-out evaluator:
- Self-reported (consensus score vs. baseline confidence):
consensus 60.0 vs. baseline 75.4 → Δ −15.4 (consensus "loses")
- Held-out rubric:
consensus 83.3 vs. baseline 48.0 → Δ +35.3, 12/12 runs (100%)
Same 12 runs, opposite verdicts. The rubric measures quality directly;
self-reported confidence does not track quality (judge confidence on
these 12 runs is μ=66.9, about 16 points below the consensus rubric
score of 83.3).
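The paired comparison above can be sketched as a small aggregation over per-run rubric scores. The four output names match the report metrics added by this change; the `PairedRun` shape, the use of `null` for a failed evaluation, and the strict "greater than" rule for counting a win are assumptions, not the bench's confirmed behaviour.

```typescript
// One paired rubric result per run, with scores already normalised to 0-100.
// A null score stands for a failed evaluation (an assumption of this sketch).
interface PairedRun {
  readonly consensus: number | null;
  readonly baseline: number | null;
}

interface RubricSummary {
  readonly consensusRubricNormalizedMean: number;
  readonly baselineRubricNormalizedMean: number;
  readonly consensusBeatsBaselineRubricRate: number;
  readonly rubricRunsCounted: number;
}

function summarizeRubric(runs: readonly PairedRun[]): RubricSummary | null {
  const consensus: number[] = [];
  const baseline: number[] = [];
  let wins = 0;
  for (const run of runs) {
    // Only runs where both sides were scored count toward the paired metrics.
    if (run.consensus === null || run.baseline === null) continue;
    consensus.push(run.consensus);
    baseline.push(run.baseline);
    // Assumption: a tie is not a win for consensus.
    if (run.consensus > run.baseline) wins += 1;
  }
  const counted = consensus.length;
  if (counted === 0) return null;
  const mean = (xs: readonly number[]): number =>
    xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    consensusRubricNormalizedMean: mean(consensus),
    baselineRubricNormalizedMean: mean(baseline),
    consensusBeatsBaselineRubricRate: wins / counted,
    rubricRunsCounted: counted,
  };
}
```

With the reported means, the rubric delta is 83.3 − 48.0 = +35.3 and the self-reported delta is 60.0 − 75.4 = −15.4, which is the sign flip the paragraph describes.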
Headline implementation surface:
- New module: src/benchmark/rubric.ts. Builds a structured JSON-emitting
prompt for the evaluator, parses with a tolerant bracket-scanning JSON
extractor (no regex backtracking), validates with zod, clamps scores
into range, returns RubricEvaluation with errorMessage on any failure
path. Never throws — a rubric eval failure is data quality, not a
suite-fatal error (same contract baseline.ts uses).
- New Preset field: rubric?: readonly RubricCriterion[]. Panel-declared
because criteria are domain-specific. architecture_v2 ships a 5-
criterion rubric; adding rubrics to other panels is purely additive.
- New CLI flags: --evaluator-model, --evaluator-provider. Validated
together (must be passed as a pair). CLI warns when evaluator coincides
with baseline (self-grading) or with judge (same brain producing and
grading the consensus output) — the held-out contract is the bench's
only guarantee that the comparison isn't self-graded.
- New report metrics: consensusRubricNormalizedMean,
baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate,
rubricRunsCounted. Markdown report grows a "Held-out rubric" block
and 3 per-case table columns (Rubric C, Rubric B, Δ rubric).
- Dep bump: ai-consensus-core ^0.10.0 → ^0.11.1. 0.11.1 fixes the
silent judge-confidence parser-contract bug — judge confidence now
reports a real distribution (μ=66.9, σ=5.2) instead of μ=50.0, σ=0.0.
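The parsing path described for `rubric.ts` in the list above can be sketched as follows. This is not the module's code: the function names, the `{"scores": {...}}` reply shape, the 0-10 range, and the synchronous `callEvaluator` callback are assumptions, the schema check is hand-written to keep the example dependency-free where the real module uses zod, and treating a missing criterion as a schema mismatch is a guess at one reasonable policy.

```typescript
// Find the first balanced {...} span that parses as JSON. String and escape
// state are tracked so braces inside string values are ignored. No regex.
function extractFirstJsonObject(text: string): unknown {
  for (let start = text.indexOf("{"); start !== -1; start = text.indexOf("{", start + 1)) {
    let depth = 0;
    let inString = false;
    let escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        if (escaped) escaped = false;
        else if (ch === "\\") escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') inString = true;
      else if (ch === "{") depth++;
      else if (ch === "}") {
        depth--;
        if (depth === 0) {
          try {
            return JSON.parse(text.slice(start, i + 1));
          } catch {
            break; // balanced but not valid JSON; try the next "{"
          }
        }
      }
    }
  }
  return undefined; // nothing parseable found
}

interface RubricEvaluation {
  readonly scores: Readonly<Record<string, number>>;
  readonly errorMessage?: string;
}

const clampScore = (n: number, lo: number, hi: number): number =>
  Math.min(hi, Math.max(lo, n));

// Never throws: every failure path is reported through errorMessage.
function evaluateWithRubric(
  callEvaluator: () => string, // hypothetical; a real caller would be async
  criteria: readonly string[],
  maxScore = 10,
): RubricEvaluation {
  try {
    const parsed = extractFirstJsonObject(callEvaluator());
    if (typeof parsed !== "object" || parsed === null) {
      return { scores: {}, errorMessage: "unparseable evaluator content" };
    }
    const raw = (parsed as Record<string, unknown>)["scores"];
    if (typeof raw !== "object" || raw === null) {
      return { scores: {}, errorMessage: "schema mismatch: missing scores object" };
    }
    const scores: Record<string, number> = {};
    for (const id of criteria) {
      const value = (raw as Record<string, unknown>)[id];
      if (typeof value !== "number" || !Number.isFinite(value)) {
        return { scores: {}, errorMessage: `schema mismatch: no numeric score for ${id}` };
      }
      scores[id] = clampScore(value, 0, maxScore); // clamp into range
    }
    return { scores };
  } catch (err) {
    return { scores: {}, errorMessage: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning a value with `errorMessage` instead of throwing lets the runner record a failed evaluation as a data-quality gap for that run and continue with the rest of the suite.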
Test surface:
- 16 new rubric.ts unit tests (happy path, fenced + prose-wrapped JSON,
score clamping, missing-criterion handling, all three failure modes:
caller throws, unparseable content, schema mismatch).
- 3 new runner integration tests (no-evaluator short-circuits, paired
rubric metrics populated, eval failure captured into errorMessage
without aborting the suite).
- 2 new format tests (rubric block + per-case columns render; ERR cells
on failed evals).
- Existing 320 tests pass unchanged. Total: 341/341. Coverage:
statements 79.4%, branches 66.1%, functions 86.4%, lines 80.9% — all
above thresholds.
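The per-case rendering that the format tests cover might look like the following sketch. The column order (Rubric C, Rubric B, Δ rubric) and the `ERR` cell come from this change; the `CaseRubric` shape, one-decimal formatting, the ASCII minus sign, and the rule that Δ is `ERR` when either side failed are assumptions.

```typescript
// Hypothetical per-case result: null marks a side whose rubric eval failed.
interface CaseRubric {
  readonly caseId: string;
  readonly consensus: number | null;
  readonly baseline: number | null;
}

// A failed evaluation renders as ERR rather than a misleading number.
function scoreCell(score: number | null): string {
  return score === null ? "ERR" : score.toFixed(1);
}

// Returns the three cells in column order: Rubric C, Rubric B, delta.
function rubricCells(row: CaseRubric): [string, string, string] {
  let delta = "ERR";
  if (row.consensus !== null && row.baseline !== null) {
    const d = row.consensus - row.baseline;
    delta = (d > 0 ? "+" : "") + d.toFixed(1); // explicit sign on wins
  }
  return [scoreCell(row.consensus), scoreCell(row.baseline), delta];
}

// Render one markdown table row for the per-case section of the report.
function rubricRow(row: CaseRubric): string {
  return `| ${row.caseId} | ${rubricCells(row).join(" | ")} |`;
}
```

A delta cannot be computed from a half-scored pair, so propagating `ERR` into that column keeps the table honest about which runs were actually compared.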
Docs:
- New README section "Quality benchmark (held-out evaluator)" — headline
finding, per-case Δ-rubric table, methodology, exact reproduction
command, honest caveats (N=12, single panel, real cost).
- CHANGELOG [Unreleased] entry detailing the feature, the dep bump, and
what it changes for downstream callers.
- New gitignore patterns for bench-*.json and bench-*.log artifacts.
- Bench CLI help text gains an --evaluator-model example.
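The flag handling described under "New CLI flags" (pair validation plus the two held-out warnings) can be sketched as a pure function. The flag names and the two warning conditions come from this change; the `ModelRef` and `EvaluatorCheck` types, the function name, and the message wording are hypothetical.

```typescript
interface ModelRef {
  readonly provider: string;
  readonly model: string;
}

// Parsed values of --evaluator-model and --evaluator-provider, if given.
interface EvaluatorFlags {
  readonly evaluatorModel?: string;
  readonly evaluatorProvider?: string;
}

type EvaluatorCheck =
  | { readonly kind: "none" } // no evaluator requested: existing report path
  | { readonly kind: "error"; readonly message: string }
  | { readonly kind: "ok"; readonly evaluator: ModelRef; readonly warnings: readonly string[] };

function checkEvaluatorFlags(
  flags: EvaluatorFlags,
  judge: ModelRef,
  baseline: ModelRef,
): EvaluatorCheck {
  const { evaluatorModel, evaluatorProvider } = flags;
  if (evaluatorModel === undefined && evaluatorProvider === undefined) {
    return { kind: "none" };
  }
  // The two flags are only valid as a pair.
  if (evaluatorModel === undefined || evaluatorProvider === undefined) {
    return {
      kind: "error",
      message: "--evaluator-model and --evaluator-provider must be passed together",
    };
  }
  const evaluator: ModelRef = { provider: evaluatorProvider, model: evaluatorModel };
  const same = (a: ModelRef, b: ModelRef): boolean =>
    a.provider === b.provider && a.model === b.model;
  // Warn, rather than fail, when the evaluator is not actually held out.
  const warnings: string[] = [];
  if (same(evaluator, baseline)) {
    warnings.push("evaluator coincides with baseline (self-grading)");
  }
  if (same(evaluator, judge)) {
    warnings.push("evaluator coincides with judge (same model producing and grading)");
  }
  return { kind: "ok", evaluator, warnings };
}
```

Returning warnings as data, instead of printing inside the check, keeps the function easy to unit-test and leaves the decision about how to surface them to the CLI layer.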
No breaking changes. Callers who don't pass --evaluator-model see the
existing report unchanged (with the side effect that judge confidence
now reports a real number rather than the silent 50).
Summary
The bench was measuring the wrong thing. On `architecture_v2` (4 cases × 3 runs × seed 42), with `grok-4.3` as both judge and baseline and `claude-opus-4-5` as a held-out evaluator:
- Self-reported (consensus score vs. baseline confidence): consensus 60.0 vs. baseline 75.4 → Δ −15.4
- Held-out rubric (`claude-opus-4-5`, blind): consensus 83.3 vs. baseline 48.0 → Δ +35.3
Consensus wins on the held-out rubric in 12/12 runs (100%). On the same 12 runs, the self-reported-confidence metric says consensus loses 11/12. The two metrics invert. Without a quality measurement, the bench was producing a measurement artifact, not a result about the panel.
This PR adds a held-out LLM-as-judge rubric evaluator. Pass `--evaluator-model` + `--evaluator-provider`; when the panel declares a `rubric`, both consensus and baseline outputs are scored blind by a third model (neither side of the comparison). `architecture_v2` ships a 5-criterion rubric (quantification, single-recommendation, reversibility-weighing, tripwire-specificity, failure-mode-realism). Adding a rubric to any other panel is purely additive.
Also bumps `ai-consensus-core` to `^0.11.1`, which fixes the silent `extractJudgeConfidence` 50 default that previously made judge-confidence numbers in this repo's bench reports meaningless. Judge confidence on this 12-run suite now reports a real distribution (μ=66.9, σ=5.2) instead of (μ=50.0, σ=0.0).
See the new README section "Quality benchmark — held-out evaluator" for the full methodology, per-case Δ-rubric table, exact reproduction command, and honest caveats.
Per-case Δ rubric (12 runs, seed=42)
- arch-microservices-day-one
- arch-event-sourcing-billing
- arch-sync-vs-async-fanout
- arch-db-multi-tenant
Baseline scored 28/100 on every `event-sourcing-billing` run: a reproducible single-model blind spot that the panel surfaces every time.
Implementation surface
- New module `src/benchmark/rubric.ts`: structured JSON-emitting evaluator prompt, tolerant bracket-scanning JSON extractor (no regex backtracking), zod schema validation, score clamping, sentinel `errorMessage` on any failure path. Never throws: a rubric eval failure is data quality, not a suite-fatal error (matches the contract `baseline.ts` uses).
- New `Preset.rubric?: readonly RubricCriterion[]` field. Panel-declared because criteria are domain-specific.
- New CLI flags `--evaluator-model` + `--evaluator-provider` (validated as a pair). CLI warns when the evaluator coincides with baseline (self-grading) or judge (same brain producing and grading the consensus output). The held-out contract is the bench's only guarantee that the comparison isn't self-graded.
- New report metrics: `consensusRubricNormalizedMean`, `baselineRubricNormalizedMean`, `consensusBeatsBaselineRubricRate`, `rubricRunsCounted`. Markdown report grows a "Held-out rubric" block and 3 per-case table columns (Rubric C / Rubric B / Δ rubric).
- Dep bump: `ai-consensus-core` `^0.10.0` → `^0.11.1` (upstream `JUDGE_CONFIDENCE` directive contract fix, entropyvortex/consensus-core#4).
Test plan
- `npm run check` clean: typecheck, lint, format, tests, coverage all green
- … (`bench --quick --evaluator-model claude-opus-4-5 --evaluator-provider anthropic`) produces real rubric scores.
- … `--seed 42`.
- … `--evaluator-model` without `--evaluator-provider` and vice versa.
Notes for the reviewer
- … `main` after "feat(v0.12): expert panels, benchmarking suite, persistent project memory" #3 merges, or merge into the v0.12 branch first.
- No breaking changes. Callers who don't pass `--evaluator-model` see the existing report unchanged, with the side effect that judge confidence now reports a real number rather than the silent 50.
- New `.gitignore` patterns (`bench-*.json`, `bench-*.log`) so reproduction outputs don't trip future format checks.