feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1 #5

Merged
marceloceccon merged 1 commit into main from feat/bench-rubric-eval
May 26, 2026

@marceloceccon

Re-target of #4, which merged into the feat/asi-grade-foundation-v0.12 branch rather than main. Single commit (1e7dc63) — the rubric work, unchanged.

See #4 for the full description, evidence tables, methodology, and test plan. A short recap:

| Metric (architecture_v2, 4 cases × 3 runs, seed=42) | Consensus | Baseline | Δ |
|---|---|---|---|
| Self-reported (consensus score vs. baseline confidence) | 60.0 | 75.4 | −15.4 |
| Held-out rubric (judged by claude-opus-4-5, blind) | 83.3 | 48.0 | +35.3 |

Consensus wins on the held-out rubric in 12/12 runs (100%). On the same 12 runs, the self-reported confidence metric says consensus loses 11/12. The two metrics invert: the held-out rubric measures answer quality, while self-reported confidence measures the model's opinion of itself.

Also bumps ai-consensus-core ^0.10.0 → ^0.11.1 (upstream JUDGE_CONFIDENCE directive contract fix, entropyvortex/consensus-core#4).

Test plan

  • npm run check clean — typecheck, lint, format, tests, coverage all green on this commit
  • 341/341 tests pass
  • Coverage above all thresholds (stmts 79.4%, branches 66.1%, fns 86.4%, lines 80.9%)
  • End-to-end full suite (12 runs) is reported in the new README section and reproduces deterministically with --seed 42

Notes

…e 0.11.1

The bench was measuring the wrong thing. Self-reported confidence is a
meta-signal, not a quality signal — and on every panel in this repo it
was further corrupted by the upstream `extractJudgeConfidence` silent
50 default (fixed in ai-consensus-core 0.11.1). The "consensus loses
11/12, costs 40× tokens for nothing" verdict the old bench produced
was a measurement artifact, not a result about the panel.

This change adds a held-out LLM-as-judge rubric evaluator. Pass
`--evaluator-model` + `--evaluator-provider`; when the panel declares
a rubric the bench scores both the consensus synthesis and the
single-model baseline against that rubric using a third model that
is on neither side of the comparison. Rubrics measure answer quality
against named criteria the panel commits to (for architecture_v2:
quantification, single-recommendation, reversibility-weighing,
tripwire-specificity, failure-mode-realism).
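
A minimal sketch of how the two flags and the "neither side of the
comparison" requirement could be checked. This is an illustration under
assumptions, not the bench's code: `ModelRef`, `EvaluatorFlags`,
`resolveEvaluator`, and `heldOutWarnings` are hypothetical names, and the
rule that the flags must arrive as a pair is taken from the CLI notes in
this change.

```typescript
// Hypothetical shapes: the change names the flags, not the types behind them.
interface ModelRef {
  provider: string;
  model: string;
}

interface EvaluatorFlags {
  evaluatorModel?: string;
  evaluatorProvider?: string;
}

// Returns the evaluator reference, or undefined when neither flag was passed
// (rubric evaluation is then skipped). Throws when only one half is present.
function resolveEvaluator(flags: EvaluatorFlags): ModelRef | undefined {
  const { evaluatorModel, evaluatorProvider } = flags;
  if (evaluatorModel === undefined && evaluatorProvider === undefined) {
    return undefined;
  }
  if (evaluatorModel === undefined || evaluatorProvider === undefined) {
    throw new Error(
      "--evaluator-model and --evaluator-provider must be passed together",
    );
  }
  return { provider: evaluatorProvider, model: evaluatorModel };
}

const sameModel = (a: ModelRef, b: ModelRef): boolean =>
  a.provider === b.provider && a.model === b.model;

// Collects warnings rather than failing: an evaluator that overlaps with the
// judge or the baseline still runs, but the comparison is no longer held-out.
function heldOutWarnings(
  evaluator: ModelRef,
  judge: ModelRef,
  baseline: ModelRef,
): string[] {
  const warnings: string[] = [];
  if (sameModel(evaluator, baseline)) {
    warnings.push("evaluator coincides with baseline (self-grading)");
  }
  if (sameModel(evaluator, judge)) {
    warnings.push("evaluator coincides with judge (same model produces and grades)");
  }
  return warnings;
}
```

With the models used in this change (grok-4.3 as judge and baseline,
claude-opus-4-5 as evaluator) such a check would return no warnings.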

On architecture_v2 (4 cases × 3 runs, seed 42), with grok-4.3 as
both judge and baseline and claude-opus-4-5 as the held-out evaluator:

  - Self-reported (consensus score vs. baseline confidence):
      consensus 60.0 vs. baseline 75.4 → Δ −15.4 (consensus "loses")
  - Held-out rubric:
      consensus 83.3 vs. baseline 48.0 → Δ +35.3, 12/12 runs (100%)

Same 12 runs, opposite verdicts. The rubric measures quality directly;
self-reported confidence does not track quality (judge confidence on
these 12 runs is μ=66.9, under-estimating the consensus rubric score
of 83.3 by ~16 points).
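
The headline numbers are per-run aggregates. A sketch of how such a
summary could be computed, using the report metric names this change
introduces. The `RunScores` shape and the rule that runs with a failed
evaluation are excluded from the count are assumptions made for
illustration.

```typescript
// One entry per benchmark run. Field names here are hypothetical.
interface RunScores {
  consensusRubric: number | null; // normalized score, null if the eval failed
  baselineRubric: number | null;
}

interface RubricSummary {
  consensusRubricNormalizedMean: number;
  baselineRubricNormalizedMean: number;
  consensusBeatsBaselineRubricRate: number; // fraction of counted runs, 0..1
  rubricRunsCounted: number;
}

function summarizeRubric(runs: readonly RunScores[]): RubricSummary | null {
  // Assumption: only runs where both sides were scored are counted.
  const paired = runs.filter(
    (r): r is { consensusRubric: number; baselineRubric: number } =>
      r.consensusRubric !== null && r.baselineRubric !== null,
  );
  if (paired.length === 0) return null;

  const mean = (xs: number[]): number =>
    xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const wins = paired.filter((r) => r.consensusRubric > r.baselineRubric).length;

  return {
    consensusRubricNormalizedMean: mean(paired.map((r) => r.consensusRubric)),
    baselineRubricNormalizedMean: mean(paired.map((r) => r.baselineRubric)),
    consensusBeatsBaselineRubricRate: wins / paired.length,
    rubricRunsCounted: paired.length,
  };
}
```

The reported Δ is then the difference of the two means, for example
83.3 − 48.0 = +35.3 on the held-out rubric.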

Headline implementation surface:

- New module: src/benchmark/rubric.ts. Builds a structured JSON-emitting
  prompt for the evaluator, parses with a tolerant bracket-scanning JSON
  extractor (no regex backtracking), validates with zod, clamps scores
  into range, returns RubricEvaluation with errorMessage on any failure
  path. Never throws — a rubric eval failure is data quality, not a
  suite-fatal error (same contract baseline.ts uses).
- New Preset field: rubric?: readonly RubricCriterion[]. Panel-declared
  because criteria are domain-specific. architecture_v2 ships a 5-
  criterion rubric; adding rubrics to other panels is purely additive.
- New CLI flags: --evaluator-model, --evaluator-provider. Validated
  together (must be passed as a pair). CLI warns when evaluator coincides
  with baseline (self-grading) or with judge (same brain producing and
  grading the consensus output) — the held-out contract is the bench's
  only guarantee that the comparison isn't self-graded.
- New report metrics: consensusRubricNormalizedMean,
  baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate,
  rubricRunsCounted. Markdown report grows a "Held-out rubric" block
  and 3 per-case table columns (Rubric C, Rubric B, Δ rubric).
- Dep bump: ai-consensus-core ^0.10.0 → ^0.11.1. 0.11.1 fixes the
  silent judge-confidence parser-contract bug — judge confidence now
  reports a real distribution (μ=66.9, σ=5.2) instead of μ=50.0, σ=0.0.
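
The tolerant parsing path described for src/benchmark/rubric.ts can be
sketched as below. This is not the module's code. The function names, the
`scores` field, the 0..10 range, and the choice to treat a missing
criterion as a failure are all assumptions. The module validates with zod;
this sketch substitutes a hand-written check so it stays dependency-free.

```typescript
// Scans for the first balanced {...} span that parses as JSON. Tracks string
// literals so braces inside strings do not change the nesting depth.
function extractFirstJsonObject(text: string): unknown {
  for (
    let start = text.indexOf("{");
    start !== -1;
    start = text.indexOf("{", start + 1)
  ) {
    let depth = 0;
    let inString = false;
    let escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        if (escaped) escaped = false;
        else if (ch === "\\") escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') inString = true;
      else if (ch === "{") depth++;
      else if (ch === "}" && --depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1));
        } catch {
          break; // balanced but not valid JSON: try the next "{"
        }
      }
    }
  }
  return undefined;
}

interface RubricEvaluation {
  scores: Record<string, number>;
  errorMessage?: string;
}

// Never throws: every failure path is reported through errorMessage.
function parseRubricReply(
  reply: string,
  criteria: readonly string[],
  maxScore = 10, // assumed scale
): RubricEvaluation {
  const parsed = extractFirstJsonObject(reply);
  if (typeof parsed !== "object" || parsed === null) {
    return { scores: {}, errorMessage: "unparseable content: no JSON object found" };
  }
  const raw = (parsed as { scores?: unknown }).scores;
  if (typeof raw !== "object" || raw === null) {
    return { scores: {}, errorMessage: "schema mismatch: missing scores object" };
  }
  const scores: Record<string, number> = {};
  for (const id of criteria) {
    const value = (raw as Record<string, unknown>)[id];
    if (typeof value !== "number" || !Number.isFinite(value)) {
      return { scores: {}, errorMessage: `schema mismatch: no numeric score for ${id}` };
    }
    scores[id] = Math.min(maxScore, Math.max(0, value)); // clamp into range
  }
  return { scores };
}
```

Returning a value with `errorMessage` set, instead of throwing, lets the
runner record the failed evaluation and continue with the rest of the
suite.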

Test surface:

- 16 new rubric.ts unit tests (happy path, fenced + prose-wrapped JSON,
  score clamping, missing-criterion handling, all three failure modes:
  caller throws, unparseable content, schema mismatch).
- 3 new runner integration tests (no-evaluator short-circuits, paired
  rubric metrics populated, eval failure captured into errorMessage
  without aborting the suite).
- 2 new format tests (rubric block + per-case columns render; ERR cells
  on failed evals).
- Existing 320 tests pass unchanged. Total: 341/341. Coverage:
  statements 79.4%, branches 66.1%, functions 86.4%, lines 80.9% — all
  above thresholds.

Docs:

- New README section "Quality benchmark (held-out evaluator)" — headline
  finding, per-case Δ-rubric table, methodology, exact reproduction
  command, honest caveats (N=12, single panel, real cost).
- CHANGELOG [Unreleased] entry detailing the feature, the dep bump, and
  what it changes for downstream callers.
- New gitignore patterns for bench-*.json and bench-*.log artifacts.
- Bench CLI help text gains an --evaluator-model example.

No breaking changes. Callers who don't pass --evaluator-model see the
existing report unchanged (with the side effect that judge confidence
now reports a real number rather than the silent 50).
@marceloceccon merged commit 30ed79b into main on May 26, 2026
0 of 5 checks passed