Skip to content

feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1#4

Merged
marceloceccon merged 1 commit into
feat/asi-grade-foundation-v0.12from
feat/bench-rubric-eval
May 26, 2026
Merged

feat(bench): held-out rubric evaluator + dep bump to ai-consensus-core 0.11.1#4
marceloceccon merged 1 commit into
feat/asi-grade-foundation-v0.12from
feat/bench-rubric-eval

Conversation

@marceloceccon

Copy link
Copy Markdown
Member

Summary

The bench was measuring the wrong thing. On architecture_v2 (4 cases × 3 runs × seed 42), with grok-4.3 as both judge and baseline and claude-opus-4-5 as a held-out evaluator:

Metric Consensus Baseline Δ
Self-reported (consensus score vs. baseline confidence) 60.0 75.4 −15.4
Held-out rubric (judged by claude-opus-4-5, blind) 83.3 48.0 +35.3

Consensus wins on the held-out rubric in 12/12 runs (100%). Same 12 runs, the self-reported-confidence metric says consensus loses 11/12. The two metrics invert. Without a quality measurement, the bench was producing a measurement artifact, not a result about the panel.

This PR adds a held-out LLM-as-judge rubric evaluator. Pass --evaluator-model + --evaluator-provider; when the panel declares a rubric, both consensus and baseline outputs are scored blind by a third model (neither side of the comparison). architecture_v2 ships a 5-criterion rubric (quantification, single-recommendation, reversibility-weighing, tripwire-specificity, failure-mode-realism). Adding a rubric to any other panel is purely additive.

Also bumps ai-consensus-core to ^0.11.1 — fixes the silent extractJudgeConfidence 50 default that previously made judge-confidence numbers in this repo's bench reports meaningless. Judge confidence on this 12-run suite now reports a real distribution (μ=66.9, σ=5.2) instead of (μ=50.0, σ=0.0).

See the new README section "Quality benchmark — held-out evaluator" for the full methodology, per-case Δ-rubric table, exact reproduction command, and honest caveats.

Per-case Δ rubric (12 runs, seed=42)

Case Runs Mean Δ
arch-microservices-day-one +36, +28, +32 +32
arch-event-sourcing-billing +44, +52, +56 +51
arch-sync-vs-async-fanout +24, +40, +40 +35
arch-db-multi-tenant +12, +32, +28 +24

Baseline scored 28/100 on every event-sourcing-billing run — a reproducible single-model blind spot that the panel surfaces every time.

Implementation surface

  • New module src/benchmark/rubric.ts: structured JSON-emitting evaluator prompt, tolerant bracket-scanning JSON extractor (no regex backtracking), zod schema validation, score clamping, sentinel errorMessage on any failure path. Never throws — a rubric eval failure is data quality, not a suite-fatal error (matches the contract baseline.ts uses).
  • New Preset.rubric?: readonly RubricCriterion[] field. Panel-declared because criteria are domain-specific.
  • New CLI flags --evaluator-model + --evaluator-provider (validated as a pair). CLI warns when the evaluator coincides with baseline (self-grading) or judge (same brain producing and grading the consensus output) — the held-out contract is the bench's only guarantee that the comparison isn't self-graded.
  • New report metrics: consensusRubricNormalizedMean, baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate, rubricRunsCounted. Markdown report grows a "Held-out rubric" block and 3 per-case table columns (Rubric C / Rubric B / Δ rubric).
  • Bumps ai-consensus-core ^0.10.0^0.11.1 (upstream JUDGE_CONFIDENCE directive contract fix, entropyvortex/consensus-core#4).

Test plan

  • npm run check clean — typecheck, lint, format, tests, coverage all green
  • 341/341 tests pass (+25 from this branch: 16 rubric unit + 3 runner integration + 2 format + restore existing on new shape).
  • Coverage above all thresholds (stmts 79.4%, branches 66.1%, fns 86.4%, lines 80.9%).
  • End-to-end quick bench (bench --quick --evaluator-model claude-opus-4-5 --evaluator-provider anthropic) produces real rubric scores.
  • End-to-end full suite (12 runs, ~$1) reported in README — reproduces deterministically with --seed 42.
  • CLI warning logic exercised manually: warns when evaluator coincides with judge or baseline; rejects --evaluator-model without --evaluator-provider and vice versa.

Notes for the reviewer

…e 0.11.1

The bench was measuring the wrong thing. Self-reported confidence is a
meta-signal, not a quality signal — and on every panel in this repo it
was further corrupted by the upstream `extractJudgeConfidence` silent
50 default (fixed in ai-consensus-core 0.11.1). The "consensus loses
11/12, costs 40× tokens for nothing" verdict the old bench produced
was a measurement artifact, not a result about the panel.

This change adds a held-out LLM-as-judge rubric evaluator. Pass
`--evaluator-model` + `--evaluator-provider`; when the panel declares
a rubric the bench scores both the consensus synthesis and the
single-model baseline against that rubric using a third model that
is on neither side of the comparison. Rubrics measure answer quality
against named criteria the panel commits to (for architecture_v2:
quantification, single-recommendation, reversibility-weighing,
tripwire-specificity, failure-mode-realism).

On architecture_v2 (4 cases × 3 runs × seed 42), with grok-4.3 as
both judge and baseline and claude-opus-4-5 as the held-out evaluator:

  - Self-reported (consensus score vs. baseline confidence):
      consensus 60.0 vs. baseline 75.4 → Δ −15.4 (consensus "loses")
  - Held-out rubric:
      consensus 83.3 vs. baseline 48.0 → Δ +35.3, 12/12 runs (100%)

Same 12 runs, opposite verdicts. The rubric measures quality directly;
self-reported confidence does not track quality (judge confidence on
these 12 runs is μ=66.9, under-estimating actual rubric score by ~16
points).

Headline implementation surface:

- New module: src/benchmark/rubric.ts. Builds a structured JSON-emitting
  prompt for the evaluator, parses with a tolerant bracket-scanning JSON
  extractor (no regex backtracking), validates with zod, clamps scores
  into range, returns RubricEvaluation with errorMessage on any failure
  path. Never throws — a rubric eval failure is data quality, not a
  suite-fatal error (same contract baseline.ts uses).
- New Preset field: rubric?: readonly RubricCriterion[]. Panel-declared
  because criteria are domain-specific. architecture_v2 ships a 5-
  criterion rubric; adding rubrics to other panels is purely additive.
- New CLI flags: --evaluator-model, --evaluator-provider. Validated
  together (must be passed as a pair). CLI warns when evaluator coincides
  with baseline (self-grading) or with judge (same brain producing and
  grading the consensus output) — the held-out contract is the bench's
  only guarantee that the comparison isn't self-graded.
- New report metrics: consensusRubricNormalizedMean,
  baselineRubricNormalizedMean, consensusBeatsBaselineRubricRate,
  rubricRunsCounted. Markdown report grows a "Held-out rubric" block
  and 3 per-case table columns (Rubric C, Rubric B, Δ rubric).
- Dep bump: ai-consensus-core ^0.10.0 → ^0.11.1. 0.11.1 fixes the
  silent judge-confidence parser-contract bug — judge confidence now
  reports a real distribution (μ=66.9, σ=5.2) instead of μ=50.0, σ=0.0.

Test surface:

- 16 new rubric.ts unit tests (happy path, fenced + prose-wrapped JSON,
  score clamping, missing-criterion handling, all three failure modes:
  caller throws, unparseable content, schema mismatch).
- 3 new runner integration tests (no-evaluator short-circuits, paired
  rubric metrics populated, eval failure captured into errorMessage
  without aborting the suite).
- 2 new format tests (rubric block + per-case columns render; ERR cells
  on failed evals).
- Existing 320 tests pass unchanged. Total: 341/341. Coverage:
  statements 79.4%, branches 66.1%, functions 86.4%, lines 80.9% — all
  above thresholds.

Docs:

- New README section "Quality benchmark (held-out evaluator)" — headline
  finding, per-case Δ-rubric table, methodology, exact reproduction
  command, honest caveats (N=12, single panel, real cost).
- CHANGELOG [Unreleased] entry detailing the feature, the dep bump, and
  what it changes for downstream callers.
- New gitignore patterns for bench-*.json and bench-*.log artifacts.
- Bench CLI help text gains an --evaluator-model example.

No breaking changes. Callers who don't pass --evaluator-model see the
existing report unchanged (with the side effect that judge confidence
now reports a real number rather than the silent 50).
@marceloceccon marceloceccon merged commit 796f71d into feat/asi-grade-foundation-v0.12 May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant