feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1 by marceloceccon · Pull Request #5 · entropyvortex/ai-consensus-mcp

marceloceccon · 2026-05-26T12:27:23Z

Re-target of #4, which merged into the feat/asi-grade-foundation-v0.12 branch rather than main. Single commit (1e7dc63) — the rubric work, unchanged.

See #4 for the full description, evidence tables, methodology, and test plan. A short recap:

| Metric (architecture_v2, 4 cases × 3 runs, seed=42) | Consensus | Baseline | Δ |
| --- | --- | --- | --- |
| Self-reported (consensus score vs. baseline confidence) | 60.0 | 75.4 | −15.4 |
| Held-out rubric (judged by `claude-opus-4-5`, blind) | 83.3 | 48.0 | +35.3 |

Consensus wins on the held-out rubric in 12/12 runs (100%). On the same 12 runs, the self-reported confidence metric says consensus loses 11/12. The two metrics invert: the held-out rubric measures answer quality, whereas self-reported confidence measures the model's opinion of itself.
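To make the inversion concrete, here is a minimal sketch of how a mean Δ and a win rate could be computed per metric over the same paired runs. The `PairedRun` shape, the `compare` helper, and the sample numbers are assumptions for illustration; they are not the bench's actual types or data.

```typescript
// Hypothetical shape of one paired run; field names are illustrative only.
interface PairedRun {
  consensusSelf: number;   // consensus score, 0-100
  baselineSelf: number;    // baseline self-reported confidence, 0-100
  consensusRubric: number; // held-out rubric score for the consensus answer, 0-100
  baselineRubric: number;  // held-out rubric score for the baseline answer, 0-100
}

interface Verdict {
  meanDelta: number; // mean(consensus) - mean(baseline)
  winRate: number;   // fraction of runs where consensus strictly beats baseline
}

function mean(xs: readonly number[]): number {
  if (xs.length === 0) return NaN;
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// `pick` selects which (consensus, baseline) pair of a run to compare.
function compare(
  runs: readonly PairedRun[],
  pick: (r: PairedRun) => [number, number],
): Verdict {
  const pairs = runs.map(pick);
  const meanDelta = mean(pairs.map(([c]) => c)) - mean(pairs.map(([, b]) => b));
  const wins = pairs.filter(([c, b]) => c > b).length;
  return { meanDelta, winRate: runs.length === 0 ? NaN : wins / runs.length };
}

// Made-up numbers, chosen only to show two metrics disagreeing on the same runs.
const runs: PairedRun[] = [
  { consensusSelf: 60, baselineSelf: 75, consensusRubric: 85, baselineRubric: 50 },
  { consensusSelf: 62, baselineSelf: 74, consensusRubric: 80, baselineRubric: 45 },
];

const selfReported = compare(runs, (r) => [r.consensusSelf, r.baselineSelf]);
const rubric = compare(runs, (r) => [r.consensusRubric, r.baselineRubric]);

console.log(selfReported); // { meanDelta: -13.5, winRate: 0 }
console.log(rubric);       // { meanDelta: 35, winRate: 1 }
```

Both verdicts come from the same runs; only the pair of columns being compared changes, which is why the choice of metric decides the headline.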

Also bumps ai-consensus-core ^0.10.0 → ^0.11.1 (upstream JUDGE_CONFIDENCE directive contract fix, entropyvortex/consensus-core#4).
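As a rough illustration of why a silent default matters, the sketch below contrasts a parser that falls back to 50 with one that surfaces the failure. It is not the upstream `extractJudgeConfidence` code: the directive format (`JUDGE_CONFIDENCE: <n>`), the function names, and the sample judge outputs are all assumptions.

```typescript
// Assumed directive format, for illustration only: a line like "JUDGE_CONFIDENCE: 72".
const DIRECTIVE = /JUDGE_CONFIDENCE:\s*(\d{1,3})\b/;

// Anti-pattern: a parse failure is indistinguishable from a genuine score of 50.
function parseWithSilentDefault(text: string): number {
  const m = DIRECTIVE.exec(text);
  return m ? Math.min(100, Number(m[1])) : 50;
}

// Alternative: return undefined so the caller can exclude or flag the run.
function parseStrict(text: string): number | undefined {
  const m = DIRECTIVE.exec(text);
  return m ? Math.min(100, Number(m[1])) : undefined;
}

// Population mean and standard deviation.
function meanAndStd(xs: readonly number[]): { mean: number; std: number } {
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance = xs.reduce((a, x) => a + (x - mean) * (x - mean), 0) / xs.length;
  return { mean, std: Math.sqrt(variance) };
}

// Hypothetical judge outputs whose wording does not match the expected directive.
const outputs = ["Confidence: 72", "confidence=61", "I am fairly sure (68/100)"];

const silent = meanAndStd(outputs.map(parseWithSilentDefault));
const strict = outputs
  .map(parseStrict)
  .filter((x): x is number => x !== undefined);

console.log(silent);        // { mean: 50, std: 0 }
console.log(strict.length); // 0
```

With the silent default, every mismatch collapses to 50, so the distribution looks like a real but uninformative signal; the strict variant makes the contract violation visible instead.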

Test plan

- `npm run check` clean: typecheck, lint, format, tests, coverage all green on this commit
- 341/341 tests pass
- Coverage above all thresholds (stmts 79.4%, branches 66.1%, fns 86.4%, lines 80.9%)
- End-to-end full suite (12 runs) reported in the new README section, reproduces deterministically with `--seed 42`

Notes

No code differences from #4: same commit (1e7dc63), just a corrected target.
After this merges, the feat/asi-grade-foundation-v0.12 branch can be deleted; main will be at parity with it.

…e 0.11.1

The bench was measuring the wrong thing. Self-reported confidence is a meta-signal, not a quality signal, and on every panel in this repo it was further corrupted by the upstream `extractJudgeConfidence` silent 50 default (fixed in ai-consensus-core 0.11.1). The "consensus loses 11/12, costs 40× tokens for nothing" verdict the old bench produced was a measurement artifact, not a result about the panel.

This change adds a held-out LLM-as-judge rubric evaluator. Pass `--evaluator-model` + `--evaluator-provider`; when the panel declares a rubric, the bench scores both the consensus synthesis and the single-model baseline against that rubric using a third model that is on neither side of the comparison. Rubrics measure answer quality against named criteria the panel commits to (for architecture_v2: quantification, single-recommendation, reversibility-weighing, tripwire-specificity, failure-mode-realism).

On architecture_v2 (4 cases × 3 runs × seed 42), with grok-4.3 as both judge and baseline and claude-opus-4-5 as the held-out evaluator:

- Self-reported (consensus score vs. baseline confidence): consensus 60.0 vs. baseline 75.4 → Δ −15.4 (consensus "loses")
- Held-out rubric: consensus 83.3 vs. baseline 48.0 → Δ +35.3, 12/12 runs (100%)

Same 12 runs, opposite verdicts. The rubric measures quality directly; self-reported confidence does not track quality (judge confidence on these 12 runs is μ=66.9, under-estimating actual rubric score by ~16 points).

Headline implementation surface:

- New module: src/benchmark/rubric.ts. Builds a structured JSON-emitting prompt for the evaluator, parses with a tolerant bracket-scanning JSON extractor (no regex backtracking), validates with zod, clamps scores into range, returns RubricEvaluation with errorMessage on any failure path. Never throws: a rubric eval failure is data quality, not a suite-fatal error (same contract baseline.ts uses).
- New Preset field: `rubric?: readonly RubricCriterion[]`. Panel-declared because criteria are domain-specific. architecture_v2 ships a 5-criterion rubric; adding rubrics to other panels is purely additive.
- New CLI flags: `--evaluator-model`, `--evaluator-provider`. Validated together (must be passed as a pair). CLI warns when evaluator coincides with baseline (self-grading) or with judge (same brain producing and grading the consensus output); the held-out contract is the bench's only guarantee that the comparison isn't self-graded.
- New report metrics: consensusRubricNormalizedMean, baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate, rubricRunsCounted. Markdown report grows a "Held-out rubric" block and 3 per-case table columns (Rubric C, Rubric B, Δ rubric).
- Dep bump: ai-consensus-core ^0.10.0 → ^0.11.1. 0.11.1 fixes the silent judge-confidence parser-contract bug; judge confidence now reports a real distribution (μ=66.9, σ=5.2) instead of μ=50.0, σ=0.0.

Test surface:

- 16 new rubric.ts unit tests (happy path, fenced + prose-wrapped JSON, score clamping, missing-criterion handling, all three failure modes: caller throws, unparseable content, schema mismatch).
- 3 new runner integration tests (no-evaluator short-circuits, paired rubric metrics populated, eval failure captured into errorMessage without aborting the suite).
- 2 new format tests (rubric block + per-case columns render; ERR cells on failed evals).
- Existing 320 tests pass unchanged. Total: 341/341.

Coverage: statements 79.4%, branches 66.1%, functions 86.4%, lines 80.9%, all above thresholds.

Docs:

- New README section "Quality benchmark (held-out evaluator)": headline finding, per-case Δ-rubric table, methodology, exact reproduction command, honest caveats (N=12, single panel, real cost).
- CHANGELOG [Unreleased] entry detailing the feature, the dep bump, and what it changes for downstream callers.
- New gitignore patterns for bench-*.json and bench-*.log artifacts.
- Bench CLI help text gains an `--evaluator-model` example.

No breaking changes. Callers who don't pass `--evaluator-model` see the existing report unchanged (with the side effect that judge confidence now reports a real number rather than the silent 50).
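The parse path described for src/benchmark/rubric.ts (bracket-scanning extraction, validation, clamping, never throwing) might look roughly like the sketch below. This is not the module's code: `RubricResult`, `extractFirstJsonObject`, `parseRubricReply`, and the 0-10 score range are assumptions, and hand-written checks stand in for the zod schema to keep the example dependency-free. Treating a missing criterion as a failure is also an assumption.

```typescript
// Hypothetical result shape; the real RubricEvaluation type may differ.
interface RubricResult {
  scores: Record<string, number>;
  errorMessage?: string;
}

// Return the first balanced {...} span, tracking string and escape state so
// braces inside JSON strings are ignored. Single linear pass, no regex.
// Simplification: a stray "{" in the surrounding prose would be picked up first.
function extractFirstJsonObject(text: string): string | undefined {
  const start = text.indexOf("{");
  if (start < 0) return undefined;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return undefined; // unbalanced braces
}

function clamp(n: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, n));
}

// Never throws: every failure path becomes an errorMessage on the result.
function parseRubricReply(
  reply: string,
  criteria: readonly string[],
  maxScore = 10, // assumed range; the real rubric scale is not stated here
): RubricResult {
  try {
    const json = extractFirstJsonObject(reply);
    if (json === undefined) return { scores: {}, errorMessage: "no JSON object found" };
    const record = JSON.parse(json) as Record<string, unknown>;
    const scores: Record<string, number> = {};
    for (const id of criteria) {
      const v = record[id];
      if (typeof v !== "number" || !Number.isFinite(v)) {
        return { scores: {}, errorMessage: `schema mismatch: bad score for ${id}` };
      }
      scores[id] = clamp(v, 0, maxScore);
    }
    return { scores };
  } catch (err) {
    return { scores: {}, errorMessage: err instanceof Error ? err.message : String(err) };
  }
}

const criteria = ["quantification", "reversibility-weighing"];
const ok = parseRubricReply(
  'Here is my evaluation: {"quantification": 12, "reversibility-weighing": 7} Hope that helps.',
  criteria,
);
const bad = parseRubricReply("I could not evaluate this answer.", criteria);

console.log(ok.scores);        // { quantification: 10, 'reversibility-weighing': 7 }
console.log(bad.errorMessage); // no JSON object found
```

Returning an error field instead of throwing lets a runner record the failed evaluation and keep going, which matches the "data quality, not a suite-fatal error" contract stated above.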

marceloceccon merged commit 30ed79b into main May 26, 2026
0 of 5 checks passed

marceloceccon merged 1 commit into main from feat/bench-rubric-eval
