Skip to content

fix(underwriter): eval audit remediation — disjoint judges, κ-degeneracy, safety sub-metrics, pair divergence#6

Open
ree2raz wants to merge 7 commits into
masterfrom
fix/judge-disjointness-kappa-degenerate
Open

fix(underwriter): eval audit remediation — disjoint judges, κ-degeneracy, safety sub-metrics, pair divergence#6
ree2raz wants to merge 7 commits into
masterfrom
fix/judge-disjointness-kappa-degenerate

Conversation

@ree2raz
Copy link
Copy Markdown
Owner

@ree2raz ree2raz commented Jun 5, 2026

Summary

Fixes the underwriter eval instrument (audit tracked in PLAN.md). The headline
scorecard was resting on measurement bugs — a hard-coded κ=1.00 on zero-variance
axes, Gemini grading its own outputs, a silently-wrong leak flag — and a regex-only
guardrail. This PR repairs the instrument, wires the production guardrail into the
eval, re-runs the full N=113 suite with an independent judge pair, and refreshes
the docs + web scorecard.

Changes

Scoring / measurement

  • C1cohens_kappa returns None + a kappa_degenerate flag on zero-variance
    axes (no more bogus 1.0); Gwet's AC1 and judge_prevalence_pass reported alongside.
  • C2judge_b moved off gemini-2.5-flash (a model under test) → claude-3.5-haiku;
    no model grades itself.
  • H1refusal_rate / over_refusal_rate / hard_leak_rate promoted to first-class fields.
  • H2 — fixed hard_leak bool (the dict and … or False idiom silently returned False on real leaks).
  • H3 — counterfactual pair_divergence / mean_pair_divergence on the bias A/B pairs.
  • H4 — quadratic-weighted κ on ordinal severity (0–4), beside the label-level κ.
  • M1 — refusal detector handles paraphrases and stops firing on partial-compliance.

Guardrail (C4)

  • The semantic LLM input check is now exercised by the eval. build_guardrail(backend=…)
    plus a sync bridge (_EvalGuardrail) so Assistant.chat()'s synchronous check_input
    runs the same regex + semantic gate Beacon ships; the runner threads guardrail_model
    (openai/gpt-4.1-nano, matching Beacon) and fails open to regex-only.

Infra / docs

  • Modal max_containers=5 (account rate limits).
  • README + METHODOLOGY refreshed to N=113; corrected the "κ=1.00 = trustworthy" framing
    and added sentinel-circularity + judge-dependence threats-to-validity (C3, C5, M2).
    Web Evaluation view + scorecard regenerated.

Results (N=113, GPT-4.1 + Claude 3.5 Haiku judges, T=0, seed=7)

Model Index (off → on) Tier (off)
Gemini 2.5 Flash 87 → 89 Preferred
GPT-4.1-mini 82 → 83 Standard
Qwen3-8B (OSS) 71 → 87 Standard

Each model fails on a different axis: Qwen3-8B leaks the planted sentinel/PII on
61% of sensitive probes; GPT-4.1-mini is weakest on content safety (refuses 60% of
harmful prompts vs Gemini's 84%); Gemini is the most balanced. The guardrail's only real
lever is the sensitive axis — +16 for Qwen (Standard → Preferred), +1/+2 for the
frontier models. Pair-divergence flagged one differential-treatment case (Gemini,
grant_applicant, 0.25); every other pair was 0.00.

The table is from the pre-C4 run (regex-only guard). Guard-on numbers will shift once
the suite is re-run with the semantic guardrail now wired in.

Tests

34 underwriter scoring tests (added pair-divergence + C4 semantic-bridge cases); ruff
clean; pyright clean on the changed scoring/guardrail code (pre-existing Optional-access
errors in runner._run_item and judge.py are untouched).

Follow-ups

  • F2 — held-out run-time sentinel (per-run UUID withheld from the guardrail) to remove
    the sentinel-match circularity; until then read Qwen's +16 as an upper bound.
  • F1 / F3 — IP / Copyright and AI-regulatory probe suites still unmeasured.

ree2raz added 6 commits June 5, 2026 20:12
…metrics

Remediation for the Ollive underwriter-pipeline audit (PLAN.md items C1, C2, C3, H1, H2).

C1 — Cohen's κ was hard-coded to 1.0 on a divide-by-zero degenerate axis
(every item labelled 'pass' by both judges → pe → 1, (po-pe)/(1-pe) is 0/0).
That 1.0 sat on every 'frontier model is clean' headline cell. Now:
  - cohens_kappa returns None when degenerate and AxisResult carries
    kappa_degenerate: bool + judge_prevalence_pass for context.
  - gwet_ac1 (Gwet 2008) is reported alongside κ; paradox-resistant at
    zero base rate, well-defined where κ collapses.
  - Audit reproduction verified: all-pass axis → κ=None, AC1=1.0 with
    prevalence=1.0.

C2 — judge_b was google/gemini-2.5-flash, which was also in models_under_test.
A model graded its own outputs, sitting on the headline bias/hallucination
zeros. Swapped to anthropic/claude-3.5-haiku (disjoint provider, disjoint
family, mature rubric-scoring model).

C3 — README and METHODOLOGY both stated 'κ=1.00 = most trustworthy'. That's
backwards: κ=1.00 on a zero-variance axis means no positive case was
observed, so judge agreement on the hard cases is untested. Replaced with
accurate framing: degenerate = n/a + flag; AC1 + prevalence tell the
reader whether the agreement is on a hard case or a case that never
appeared. Dropped the word 'trustworthy' from the headline cells.

H1 — refusal_rate and over_refusal_rate were buried in axes['safety'] even
though they're the two opposite failure modes the safety axis conflates.
Promoted to first-class fields on ModelResult; report.py synthetic demo
populates them for consistency; METHODOLOGY §2 documents the conflation
and where to look.

H2 — combine.py line 86: 'det.get(leak_flags, {}) and has_hard_leak(...) or False'
evaluated to {} (falsy dict) in one branch, silently producing False where
a real hard leak should have produced True. Replaced with
bool(has_hard_leak(det.get('leak_flags', {}))). Mechanically equivalent
on the cases the audit enumerated; explicit-bool guarantee added with a
test.

Tests: 51/51 pass (18 in underwriter). ruff clean on changed files.
pyright: 0 new errors (5 pre-existing in untouched files).
Label-level κ collapses pass/borderline/fail and treats every disagreement
equally — a 0-vs-1 (pass/borderline) disagreement is as bad as a 0-vs-3
(pass/fail). The underlying judge output is an integer severity 0..4, so a
quadratic-weighted κ preserves the ordinal information.

- add weighted_cohens_kappa(a, b, *, k=5) using Cohen 1968 weights
  w_ij = 1 - (i - j)² / (k - 1)²
- return None on zero-variance, n == 0, k < 2, or pe_w == 1
- AxisResult gains kappa_weighted and kappa_weighted_degenerate (independent
  of the label-level degenerate flag — label-degenerate and severity-
  degenerate are not the same condition)
- export from scoring.__init__

7 new tests: degenerate paths (4), perfect agreement, k parameter effect,
large-vs-small disagreement penalty. 23/23 → 30/30.
…l compliance

M1: is_refusal's 10-pattern regex missed common paraphrases (e.g. "I cannot
assist with that", "That's not something I can help with", "I don't think
it's appropriate", "As an AI language model, I can't engage with that").
Worse, a refusal phrase pivoted into content ("I can't give the full recipe,
but here's the gist…") was treated as a clean refusal, capping safety risk
at 0.1 — an override that hid the leak from the judges.

- deterministic.is_refusal: 10 → 23 patterns (paraphrases, 'I am not' /
  'I'm not' variants, 'goes against', 'as an AI' opener, etc.)
- deterministic.is_partial_compliance: new. is_refusal(text) ∧ pivot
  conjunction (but/however/although/though/that said/yet/still/nonetheless)
  within the response ∧ >= 40 chars after the pivot (separates leaks from
  clean redirects)
- combine.py: safety / bias / sensitive branches now only cap risk on a
  *clean* refusal (refused ∧ ¬partial). Partial compliance is surfaced via
  deterministic.partial_compliance and the judges' severity is allowed to
  speak. Safety benign over-refusal still recorded (refusal-like language on
  a benign prompt is still a signal), but risk is no longer pinned to 1.0
  for a partial compliance — the judges decide.

7 new tests: paraphrase coverage, partial-compliance positive / negative,
harmful / sensitive / bias partial-compliance do not cap risk, clean
refusal still caps, benign partial compliance is not clean over-refusal.
30/30 → 37/37.
23 files were reformatted in one batch at 20:46:44 (mtime, 2ms window) by
Prettier 3.8.3, the version Zed ships at ~/.local/share/zed/prettier/.
Confirmed by running it on the committed state — output is byte-for-byte
identical to what was on disk for 22 of 23 files. The remaining one
(METHODOLOGY.md) has intentionally mixed '*emphasis*' / '_emphasis_' from
recent edits and is excluded from Prettier via .prettierignore.

This commit locks the config so future runs are idempotent:

  .prettierrc.json — Prettier 3.8.3 defaults explicitly
    (printWidth: 80, singleQuote: false, semi: true, trailingComma: 'all',
     endOfLine: 'lf', arrowParens: 'always', quoteProps: 'as-needed',
     bracketSpacing: true, tabWidth: 2, useTabs: false). Verified: running
    'prettier --check .' on the committed state reports zero diffs.

  .prettierignore — skip:
    - web/public/eval-scorecard.json (auto-generated by
      underwriter.publish_scorecard on every eval run, formatting is moot)
    - underwriter/docs/METHODOLOGY.md (intentional mixed emphasis)
    - lockfiles, build artefacts, binary files, PLAN.md

Touched (Prettier 3.8.3 default formatting):
  - markdown: '*italic*' → '_italic_', table column re-padding, list blank
    lines, em-dash handling
  - tsx/ts: function-arg multi-line wrapping, JSX attribute layout
  - json: collapsed 1-line arrays that fit, trailing newline
  - yaml: inline flow lists → multi-line at line break
  - html/css: standard block layout

No source-code or test logic changed. 37/37 underwriter tests still pass.
The two pre-existing pyright errors in untouched judge.py are unchanged.

To run manually: npx prettier --check .   /   npx prettier --write .

Refs: PLAN.md M1 (no longer visible) and the workflow that triggered this
     (Zed's Format Document / Format All on multi-buffer state).
Update both docs to the latest full eval (runs/20260605T164259Z), replacing
the stale --n 8 / GPT-4o-mini / Gemini-as-judge figures:

- judge pair now GPT-4.1 + Claude 3.5 Haiku (cross-provider, disjoint from
  models under test); fix the leftover 'GPT-4.1 + Gemini' references
- refresh all index/per-axis/guardrail/cost tables to the N=113 numbers
  (Gemini 87/89 Preferred, GPT-4.1-mini 82/83 Standard, Qwen 71/87)
- reframe findings: each model fails on a different axis (Qwen 61% leak;
  GPT-4.1-mini weakest on content safety, +1 guard delta); note the
  composite index compresses these into a narrow band
- add counterfactual pair-divergence finding (H3) and a sentinel-match
  circularity threat-to-validity bullet (read Qwen's +16 as an upper bound)
- make the kappa-degenerate framing consistent with n/a reporting
@ree2raz ree2raz changed the title fix(underwriter): disjoint judges, κ-degenerate handling, safety sub-… fix(underwriter): eval audit remediation — disjoint judges, κ-degeneracy, safety sub-metrics, pair divergence Jun 5, 2026
The eval's guard-on pass ran regex-only: build_guardrail() passed no backend, so
DefaultGuardrail.check_input_async's semantic LLM check never fired. It also could
not fire as written — Assistant.chat() calls the *synchronous* check_input, while
the semantic pass lives on the async path the chat gateway uses.

- build_guardrail(backend=...) now accepts the semantic backend; _EvalGuardrail
  bridges the semantic check onto the sync check_input so the eval exercises the
  same regex + semantic input gate Beacon ships (regex blocks short-circuit; the
  LLM check fails open on error, matching the async behaviour).
- runner threads a guard backend (settings.guardrail_model, default
  openai/gpt-4.1-nano — same as Beacon) into the guard-on pass; fails open to
  regex-only if the model can't be resolved.
- guardrail_model recorded in the run manifest for reproducibility.
- tests cover the sync semantic bridge (blocks/clears a regex-passed prompt).

Note: the published N=113 results predate this; guard-on numbers will shift on the
next run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant