feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1 #5
Merged
…e 0.11.1
The bench was measuring the wrong thing. Self-reported confidence is a
meta-signal, not a quality signal — and on every panel in this repo it
was further corrupted by the upstream `extractJudgeConfidence` silent
50 default (fixed in ai-consensus-core 0.11.1). The "consensus loses
11/12, costs 40× tokens for nothing" verdict the old bench produced
was a measurement artifact, not a result about the panel.
This change adds a held-out LLM-as-judge rubric evaluator. Pass
`--evaluator-model` and `--evaluator-provider` together; when the panel
declares a rubric, the bench scores both the consensus synthesis and
the single-model baseline against that rubric using a third model that
is on neither side of the comparison. Rubrics measure answer quality
against named criteria the panel commits to (for architecture_v2:
quantification, single-recommendation, reversibility-weighing,
tripwire-specificity, failure-mode-realism).
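A minimal sketch of how per-criterion scores could roll up into a single 0-100 figure. The criterion ids come from the list above; the `RubricCriterion` fields, the 0-5 scale, and the rule that a skipped criterion counts as 0 are assumptions for illustration, not the shapes used in src/benchmark/rubric.ts.

```typescript
// Hypothetical shape: the PR names RubricCriterion but does not show its fields.
interface RubricCriterion {
  readonly id: string;
  readonly maxScore: number;
}

// Criterion ids from architecture_v2; the 0-5 scale is an assumption.
const architectureRubric: readonly RubricCriterion[] = [
  { id: "quantification", maxScore: 5 },
  { id: "single-recommendation", maxScore: 5 },
  { id: "reversibility-weighing", maxScore: 5 },
  { id: "tripwire-specificity", maxScore: 5 },
  { id: "failure-mode-realism", maxScore: 5 },
];

// Clamp each raw score into [0, maxScore], then express the total on a 0-100 scale.
// A criterion the evaluator skipped (or scored with a non-number) counts as 0 here.
function normalizedRubricScore(
  rubric: readonly RubricCriterion[],
  rawScores: Readonly<Record<string, number>>,
): number {
  let earned = 0;
  let possible = 0;
  for (const criterion of rubric) {
    const raw = rawScores[criterion.id];
    const value = typeof raw === "number" && Number.isFinite(raw) ? raw : 0;
    earned += Math.min(Math.max(value, 0), criterion.maxScore);
    possible += criterion.maxScore;
  }
  return possible === 0 ? 0 : (earned * 100) / possible;
}
```

Normalizing to 0-100 keeps rubric scores on the same scale as the confidence numbers they are compared against, whatever per-criterion scale a panel picks.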
On architecture_v2 (4 cases × 3 runs = 12 runs, seed 42), with grok-4.3 as
both judge and baseline and claude-opus-4-5 as the held-out evaluator:
- Self-reported (consensus score vs. baseline confidence):
consensus 60.0 vs. baseline 75.4 → Δ −15.4 (consensus "loses")
- Held-out rubric:
consensus 83.3 vs. baseline 48.0 → Δ +35.3, 12/12 runs (100%)
Same 12 runs, opposite verdicts. The rubric measures quality directly;
self-reported confidence does not track quality (judge confidence on
these 12 runs averages μ=66.9, about 16 points below the consensus
rubric mean of 83.3).
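The paired comparison above (two means, a Δ, and an "N of M runs" win count) can be sketched as follows. The `PairedRun` and `PairedSummary` names, and the choice to drop runs where either side has no rubric score, are assumptions; the real runner may count failed evals differently.

```typescript
// Hypothetical record for one run; undefined marks a failed rubric evaluation.
interface PairedRun {
  readonly consensusRubric: number | undefined;
  readonly baselineRubric: number | undefined;
}

interface PairedSummary {
  readonly consensusMean: number;
  readonly baselineMean: number;
  readonly delta: number; // consensusMean - baselineMean
  readonly consensusWinRate: number; // fraction of counted runs where consensus scored higher
  readonly runsCounted: number;
}

// Summarize only the runs where both sides were scored (assumed policy).
function summarizePairedRuns(runs: readonly PairedRun[]): PairedSummary | undefined {
  const scored = runs.flatMap((run) =>
    run.consensusRubric !== undefined && run.baselineRubric !== undefined
      ? [{ consensus: run.consensusRubric, baseline: run.baselineRubric }]
      : [],
  );
  if (scored.length === 0) return undefined;

  const mean = (values: readonly number[]): number =>
    values.reduce((sum, value) => sum + value, 0) / values.length;

  const consensusMean = mean(scored.map((run) => run.consensus));
  const baselineMean = mean(scored.map((run) => run.baseline));
  const wins = scored.filter((run) => run.consensus > run.baseline).length;

  return {
    consensusMean,
    baselineMean,
    delta: consensusMean - baselineMean,
    consensusWinRate: wins / scored.length,
    runsCounted: scored.length,
  };
}
```

Reporting the win rate alongside the mean Δ matters because a large mean gap can come from a few outlier runs, while 12/12 says the direction held on every run.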
Headline implementation surface:
- New module: src/benchmark/rubric.ts. Builds a structured JSON-emitting
prompt for the evaluator, parses with a tolerant bracket-scanning JSON
extractor (no regex backtracking), validates with zod, clamps scores
into range, returns RubricEvaluation with errorMessage on any failure
path. Never throws — a rubric eval failure is data quality, not a
suite-fatal error (same contract baseline.ts uses).
- New Preset field: rubric?: readonly RubricCriterion[]. Panel-declared
because criteria are domain-specific. architecture_v2 ships a
5-criterion rubric; adding rubrics to other panels is purely additive.
- New CLI flags: --evaluator-model, --evaluator-provider. Validated
together (must be passed as a pair). CLI warns when evaluator coincides
with baseline (self-grading) or with judge (same brain producing and
grading the consensus output) — the held-out contract is the bench's
only guarantee that the comparison isn't self-graded.
- New report metrics: consensusRubricNormalizedMean,
baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate,
rubricRunsCounted. Markdown report grows a "Held-out rubric" block
and 3 per-case table columns (Rubric C, Rubric B, Δ rubric).
- Dep bump: ai-consensus-core ^0.10.0 → ^0.11.1. 0.11.1 fixes the
silent judge-confidence parser-contract bug — judge confidence now
reports a real distribution (μ=66.9, σ=5.2) instead of μ=50.0, σ=0.0.
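To illustrate the "bracket-scanning extractor, clamp, never throw" contract described for rubric.ts, here is a rough sketch. It is not the module's code: all names (`findMatchingBrace`, `parseFirstJsonObject`, `parseEvaluatorOutput`, `ParsedScores`) are hypothetical, the payload is assumed to be a flat object of criterion id to number, and hand-written checks stand in for the zod validation so the block has no dependencies.

```typescript
// Return the index of the "}" that closes the "{" at `start`, or -1 if none does.
// String literals and backslash escapes are tracked so braces inside strings are ignored.
function findMatchingBrace(text: string, start: number): number {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return i;
    }
  }
  return -1;
}

// Try each "{" as a candidate start; return the first balanced span that parses.
function parseFirstJsonObject(text: string): Record<string, unknown> | undefined {
  for (let start = text.indexOf("{"); start !== -1; start = text.indexOf("{", start + 1)) {
    const end = findMatchingBrace(text, start);
    if (end === -1) continue;
    try {
      return JSON.parse(text.slice(start, end + 1)) as Record<string, unknown>;
    } catch {
      // Not valid JSON from this brace; move on to the next candidate.
    }
  }
  return undefined;
}

// Hypothetical result type: failures are returned as data, never thrown.
type ParsedScores =
  | { readonly ok: true; readonly scores: Readonly<Record<string, number>> }
  | { readonly ok: false; readonly errorMessage: string };

// Parse evaluator output, reject non-numeric scores, clamp the rest into [0, maxScore].
function parseEvaluatorOutput(content: string, maxScore: number): ParsedScores {
  const parsed = parseFirstJsonObject(content);
  if (parsed === undefined) {
    return { ok: false, errorMessage: "no parseable JSON object in evaluator output" };
  }
  const scores: Record<string, number> = {};
  for (const [id, raw] of Object.entries(parsed)) {
    if (typeof raw !== "number" || !Number.isFinite(raw)) {
      return { ok: false, errorMessage: `schema mismatch: "${id}" is not a finite number` };
    }
    scores[id] = Math.min(Math.max(raw, 0), maxScore);
  }
  return { ok: true, scores };
}
```

The scan is a single character loop per candidate brace, so fenced or prose-wrapped output is handled without any regular expression. The third failure mode listed in the tests, a throwing evaluator call, would be covered by wrapping the call site in the same return-an-error pattern.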
Test surface:
- 16 new rubric.ts unit tests (happy path, fenced + prose-wrapped JSON,
score clamping, missing-criterion handling, all three failure modes:
caller throws, unparseable content, schema mismatch).
- 3 new runner integration tests (no-evaluator short-circuits, paired
rubric metrics populated, eval failure captured into errorMessage
without aborting the suite).
- 2 new format tests (rubric block + per-case columns render; ERR cells
on failed evals).
- Existing 320 tests pass unchanged. Total: 341/341. Coverage:
statements 79.4%, branches 66.1%, functions 86.4%, lines 80.9% — all
above thresholds.
Docs:
- New README section "Quality benchmark (held-out evaluator)" — headline
finding, per-case Δ-rubric table, methodology, exact reproduction
command, honest caveats (N=12, single panel, real cost).
- CHANGELOG [Unreleased] entry detailing the feature, the dep bump, and
what it changes for downstream callers.
- New gitignore patterns for bench-*.json and bench-*.log artifacts.
- Bench CLI help text gains an --evaluator-model example.
No breaking changes. Callers who don't pass --evaluator-model see the
existing report unchanged (with the side effect that judge confidence
now reports a real number rather than the silent 50).
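The flag handling described above (both evaluator flags or neither, warnings when the evaluator coincides with baseline or judge, and an unchanged report when no evaluator is given) could look roughly like this. `ModelRef`, `EvaluatorCheck`, `resolveEvaluator`, and the decision to throw on an unpaired flag are assumptions; the actual CLI may report the error differently.

```typescript
// Hypothetical model identifier: provider plus model name.
interface ModelRef {
  readonly provider: string;
  readonly model: string;
}

interface EvaluatorCheck {
  readonly evaluator: ModelRef | undefined; // undefined: rubric scoring is skipped
  readonly warnings: readonly string[];
}

function resolveEvaluator(
  flags: { readonly evaluatorModel?: string; readonly evaluatorProvider?: string },
  judge: ModelRef,
  baseline: ModelRef,
): EvaluatorCheck {
  const { evaluatorModel, evaluatorProvider } = flags;

  // Neither flag: keep the existing report, no rubric block.
  if (evaluatorModel === undefined && evaluatorProvider === undefined) {
    return { evaluator: undefined, warnings: [] };
  }
  // Exactly one flag: the pair is incomplete (throwing here is an assumption).
  if (evaluatorModel === undefined || evaluatorProvider === undefined) {
    throw new Error("--evaluator-model and --evaluator-provider must be passed together");
  }

  const evaluator: ModelRef = { provider: evaluatorProvider, model: evaluatorModel };
  const same = (a: ModelRef, b: ModelRef): boolean =>
    a.provider === b.provider && a.model === b.model;

  // Warn rather than fail: a non-held-out run is still a run, just a weaker one.
  const warnings: string[] = [];
  if (same(evaluator, baseline)) {
    warnings.push("evaluator coincides with baseline (self-grading)");
  }
  if (same(evaluator, judge)) {
    warnings.push("evaluator coincides with judge (same model producing and grading the consensus)");
  }
  return { evaluator, warnings };
}
```

Returning `undefined` for the evaluator when neither flag is set is what lets the rest of the pipeline short-circuit and keep the pre-existing report shape.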
Re-target of #4, which merged into the `feat/asi-grade-foundation-v0.12` branch rather than `main`. Single commit (1e7dc63): the rubric work, unchanged.

See #4 for the full description, evidence tables, methodology, and test plan. A short recap:

… claude-opus-4-5, blind)

Consensus wins on the held-out rubric in 12/12 runs (100%). Same 12 runs, the self-reported confidence metric says consensus loses 11/12. The two metrics invert; the held-out rubric measures answer quality, self-reported confidence measures the model's opinion of itself.

Also bumps `ai-consensus-core` `^0.10.0` → `^0.11.1` (upstream `JUDGE_CONFIDENCE` directive contract fix, entropyvortex/consensus-core#4).

Test plan
- `npm run check` clean: typecheck, lint, format, tests, coverage all green on this commit
- … `--seed 42`

Notes
- … 1e7dc63), just a corrected target.
- … `feat/asi-grade-foundation-v0.12` branch can be deleted; main will be at parity with it.