feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1 by marceloceccon · Pull Request #4 · entropyvortex/ai-consensus-mcp

marceloceccon · 2026-05-26T12:04:03Z

Summary

The bench was measuring the wrong thing. On architecture_v2 (4 cases × 3 runs × seed 42), with grok-4.3 as both judge and baseline and claude-opus-4-5 as a held-out evaluator:

Metric	Consensus	Baseline	Δ
Self-reported (consensus score vs. baseline confidence)	60.0	75.4	−15.4
Held-out rubric (judged by `claude-opus-4-5`, blind)	83.3	48.0	+35.3

Consensus wins on the held-out rubric in 12/12 runs (100%). Same 12 runs, the self-reported-confidence metric says consensus loses 11/12. The two metrics invert. Without a quality measurement, the bench was producing a measurement artifact, not a result about the panel.

This PR adds a held-out LLM-as-judge rubric evaluator. Pass --evaluator-model + --evaluator-provider; when the panel declares a rubric, both consensus and baseline outputs are scored blind by a third model (neither side of the comparison). architecture_v2 ships a 5-criterion rubric (quantification, single-recommendation, reversibility-weighing, tripwire-specificity, failure-mode-realism). Adding a rubric to any other panel is purely additive.

Also bumps ai-consensus-core to ^0.11.1 — fixes the silent extractJudgeConfidence 50 default that previously made judge-confidence numbers in this repo's bench reports meaningless. Judge confidence on this 12-run suite now reports a real distribution (μ=66.9, σ=5.2) instead of (μ=50.0, σ=0.0).

See the new README section "Quality benchmark — held-out evaluator" for the full methodology, per-case Δ-rubric table, exact reproduction command, and honest caveats.

Per-case Δ rubric (12 runs, seed=42)

Case	Runs	Mean Δ
`arch-microservices-day-one`	+36, +28, +32	+32
`arch-event-sourcing-billing`	+44, +52, +56	+51
`arch-sync-vs-async-fanout`	+24, +40, +40	+35
`arch-db-multi-tenant`	+12, +32, +28	+24

Baseline scored 28/100 on every event-sourcing-billing run — a reproducible single-model blind spot that the panel surfaces every time.

Implementation surface

New module src/benchmark/rubric.ts: structured JSON-emitting evaluator prompt, tolerant bracket-scanning JSON extractor (no regex backtracking), zod schema validation, score clamping, sentinel errorMessage on any failure path. Never throws — a rubric eval failure is data quality, not a suite-fatal error (matches the contract baseline.ts uses).
New Preset.rubric?: readonly RubricCriterion[] field. Panel-declared because criteria are domain-specific.
New CLI flags --evaluator-model + --evaluator-provider (validated as a pair). CLI warns when the evaluator coincides with baseline (self-grading) or judge (same brain producing and grading the consensus output) — the held-out contract is the bench's only guarantee that the comparison isn't self-graded.
New report metrics: consensusRubricNormalizedMean, baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate, rubricRunsCounted. Markdown report grows a "Held-out rubric" block and 3 per-case table columns (Rubric C / Rubric B / Δ rubric).
Bumps ai-consensus-core ^0.10.0 → ^0.11.1 (upstream JUDGE_CONFIDENCE directive contract fix, entropyvortex/consensus-core#4).

Test plan

npm run check clean — typecheck, lint, format, tests, coverage all green
341/341 tests pass (+25 from this branch: 16 rubric unit + 3 runner integration + 2 format + restore existing on new shape).
Coverage above all thresholds (stmts 79.4%, branches 66.1%, fns 86.4%, lines 80.9%).
End-to-end quick bench (bench --quick --evaluator-model claude-opus-4-5 --evaluator-provider anthropic) produces real rubric scores.
End-to-end full suite (12 runs, ~$1) reported in README — reproduces deterministically with --seed 42.
CLI warning logic exercised manually: warns when evaluator coincides with judge or baseline; rejects --evaluator-model without --evaluator-provider and vice versa.

Notes for the reviewer

Stacks on top of feat(v0.12): expert panels, benchmarking suite, persistent project memory #3. Re-target to main after feat(v0.12): expert panels, benchmarking suite, persistent project memory #3 merges, or merge into the v0.12 branch first.
No breaking changes. Callers who don't pass --evaluator-model see the existing report unchanged — with the side effect that judge confidence now reports a real number rather than the silent 50.
Bench JSON artifacts and progress logs added to .gitignore (bench-*.json, bench-*.log) so reproduction outputs don't trip future format checks.

…e 0.11.1 The bench was measuring the wrong thing. Self-reported confidence is a meta-signal, not a quality signal — and on every panel in this repo it was further corrupted by the upstream `extractJudgeConfidence` silent 50 default (fixed in ai-consensus-core 0.11.1). The "consensus loses 11/12, costs 40× tokens for nothing" verdict the old bench produced was a measurement artifact, not a result about the panel. This change adds a held-out LLM-as-judge rubric evaluator. Pass `--evaluator-model` + `--evaluator-provider`; when the panel declares a rubric the bench scores both the consensus synthesis and the single-model baseline against that rubric using a third model that is on neither side of the comparison. Rubrics measure answer quality against named criteria the panel commits to (for architecture_v2: quantification, single-recommendation, reversibility-weighing, tripwire-specificity, failure-mode-realism). On architecture_v2 (4 cases × 3 runs × seed 42), with grok-4.3 as both judge and baseline and claude-opus-4-5 as the held-out evaluator: - Self-reported (consensus score vs. baseline confidence): consensus 60.0 vs. baseline 75.4 → Δ −15.4 (consensus "loses") - Held-out rubric: consensus 83.3 vs. baseline 48.0 → Δ +35.3, 12/12 runs (100%) Same 12 runs, opposite verdicts. The rubric measures quality directly; self-reported confidence does not track quality (judge confidence on these 12 runs is μ=66.9, under-estimating actual rubric score by ~16 points). Headline implementation surface: - New module: src/benchmark/rubric.ts. Builds a structured JSON-emitting prompt for the evaluator, parses with a tolerant bracket-scanning JSON extractor (no regex backtracking), validates with zod, clamps scores into range, returns RubricEvaluation with errorMessage on any failure path. Never throws — a rubric eval failure is data quality, not a suite-fatal error (same contract baseline.ts uses). - New Preset field: rubric?: readonly RubricCriterion[]. Panel-declared because criteria are domain-specific. architecture_v2 ships a 5- criterion rubric; adding rubrics to other panels is purely additive. - New CLI flags: --evaluator-model, --evaluator-provider. Validated together (must be passed as a pair). CLI warns when evaluator coincides with baseline (self-grading) or with judge (same brain producing and grading the consensus output) — the held-out contract is the bench's only guarantee that the comparison isn't self-graded. - New report metrics: consensusRubricNormalizedMean, baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate, rubricRunsCounted. Markdown report grows a "Held-out rubric" block and 3 per-case table columns (Rubric C, Rubric B, Δ rubric). - Dep bump: ai-consensus-core ^0.10.0 → ^0.11.1. 0.11.1 fixes the silent judge-confidence parser-contract bug — judge confidence now reports a real distribution (μ=66.9, σ=5.2) instead of μ=50.0, σ=0.0. Test surface: - 16 new rubric.ts unit tests (happy path, fenced + prose-wrapped JSON, score clamping, missing-criterion handling, all three failure modes: caller throws, unparseable content, schema mismatch). - 3 new runner integration tests (no-evaluator short-circuits, paired rubric metrics populated, eval failure captured into errorMessage without aborting the suite). - 2 new format tests (rubric block + per-case columns render; ERR cells on failed evals). - Existing 320 tests pass unchanged. Total: 341/341. Coverage: statements 79.4%, branches 66.1%, functions 86.4%, lines 80.9% — all above thresholds. Docs: - New README section "Quality benchmark (held-out evaluator)" — headline finding, per-case Δ-rubric table, methodology, exact reproduction command, honest caveats (N=12, single panel, real cost). - CHANGELOG [Unreleased] entry detailing the feature, the dep bump, and what it changes for downstream callers. - New gitignore patterns for bench-*.json and bench-*.log artifacts. - Bench CLI help text gains an --evaluator-model example. No breaking changes. Callers who don't pass --evaluator-model see the existing report unchanged (with the side effect that judge confidence now reports a real number rather than the silent 50).

marceloceccon merged commit 796f71d into feat/asi-grade-foundation-v0.12 May 26, 2026

marceloceccon mentioned this pull request May 26, 2026

feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1 #5

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1#4

feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1#4
marceloceccon merged 1 commit into
feat/asi-grade-foundation-v0.12from
feat/bench-rubric-eval

marceloceccon commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marceloceccon commented May 26, 2026

Summary

Per-case Δ rubric (12 runs, seed=42)

Implementation surface

Test plan

Notes for the reviewer

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant