eval(v5): regex+Haiku-WARN cascade scorer + cross-judge check (frozen labels) #15
Merged
- score_sycophancy_cascade.py: regex_BLOCK U Haiku_WARN on regex-negatives, vs construction gold, bootstrap CI, per-mode recall, control FP
- results/v5/: cascade recovers all missed modes (BrokenMath/SyConBench/ELEPHANT 0->1.0), F1=1.0 BUT flagged circular/optimistic (judge labels not independent of gold; synthetic positives; small controls)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- score_sycophancy_xjudge.py: on regex-negative cases, Sonnet(reference) vs Haiku(cheap WARN) agreement + Cohen kappa
- result: kappa=1.0 both corpora -> cheap Haiku suffices for WARN tier (cost win); BUT universal 1.0 = corpus saturates (too unambiguous to discriminate judge quality / validate real-world precision)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
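For readers unfamiliar with the agreement statistic named in the commit above, here is a minimal sketch of Cohen's kappa over two judges' frozen labels. It is not the code in score_sycophancy_xjudge.py; the function name `cohen_kappa`, the label strings `"WARN"`/`"PASS"`, and the toy label lists are all assumptions made for illustration.

```python
from collections import Counter
import math


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected by chance from each rater's marginals.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label for every item:
        # chance agreement is already total, so kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Toy, hypothetical labels (not taken from results/v5/).
sonnet = ["WARN", "WARN", "PASS", "PASS"]
haiku_same = ["WARN", "WARN", "PASS", "PASS"]
haiku_one_off = ["WARN", "PASS", "PASS", "PASS"]

kappa_same = cohen_kappa(sonnet, haiku_same)        # full agreement
kappa_one_off = cohen_kappa(sonnet, haiku_one_off)  # one disagreement in four
kappa_single_class = cohen_kappa(["WARN"] * 4, ["WARN"] * 4)  # undefined case
```

The single-class case is worth handling explicitly: when every item gets the same label from both judges, kappa carries no information about judge quality, which is in the same spirit as the saturation caveat in the commit message.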
Cascade evaluation behind the llm-dark-patterns v5 WARN tier (companion: waitdeadai/llm-dark-patterns feature/v5-cascade-haiku-tier). Deterministic: the judge labels are frozen and there is no API call in the scoring path, so a re-run produces zero delta.
- evaluation/score_sycophancy_cascade.py: regex_BLOCK ∪ Haiku_WARN on regex-negatives vs construction gold; bootstrap CI; per-mode recall; control FP.
- evaluation/score_sycophancy_xjudge.py: less-circular cross-judge check (Sonnet reference vs cheap Haiku on regex-negatives).
- results/v5/: cascade recovers all missed modes (BrokenMath/SyConBench/ELEPHANT 0→1.0); F1=1.0 flagged circular/optimistic, not a production metric; cross-judge κ=1.0 → cheap Haiku suffices BUT corpus saturates (too unambiguous to validate real-world precision).

Honest framing throughout; the real number needs human gold or a non-synthetic test set.
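As a rough illustration of the scoring described above, the sketch below combines a regex BLOCK decision with a frozen Haiku WARN label on regex-negatives, then computes precision/recall/F1, per-mode recall, control false positives, and a percentile bootstrap CI on F1. It is not the code in evaluation/score_sycophancy_cascade.py: every function name, the mode names, and the five-case toy corpus are hypothetical, and the percentile bootstrap is one common choice that the PR does not confirm.

```python
import random


def cascade_predict(regex_block, haiku_warn):
    """Flag a case if the regex BLOCKs it, else fall back to the frozen Haiku label.

    haiku_warn may be None for regex-positive cases, since the judge is only
    consulted on regex-negatives.
    """
    return [bool(r) or bool(h) for r, h in zip(regex_block, haiku_warn)]


def prf(gold, pred):
    """Precision, recall, and F1 for boolean gold/pred lists."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def recall_by_mode(gold, pred, modes):
    """Recall computed separately for each mode, over gold-positive cases only."""
    hits, totals = {}, {}
    for g, p, m in zip(gold, pred, modes):
        if g:
            totals[m] = totals.get(m, 0) + 1
            hits[m] = hits.get(m, 0) + int(p)
    return {m: hits[m] / totals[m] for m in totals}


def control_false_positives(gold, pred):
    """Number of gold-negative (control) cases that were flagged."""
    return sum((not g) and p for g, p in zip(gold, pred))


def bootstrap_f1_ci(gold, pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on F1, resampling cases with replacement."""
    rng = random.Random(seed)  # fixed seed keeps the re-run deterministic
    n = len(gold)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(prf([gold[i] for i in idx], [pred[i] for i in idx])[2])
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy, hypothetical corpus: three positives and two controls.
gold = [True, True, True, False, False]
modes = ["mode_a", "mode_b", "mode_b", "control", "control"]
regex_block = [True, False, False, False, False]
haiku_warn = [None, True, True, False, False]  # frozen labels, no API call

cascade_pred = cascade_predict(regex_block, haiku_warn)
regex_only = prf(gold, regex_block)
cascade = prf(gold, cascade_pred)
regex_modes = recall_by_mode(gold, regex_block, modes)
cascade_modes = recall_by_mode(gold, cascade_pred, modes)
control_fp = control_false_positives(gold, cascade_pred)
ci_lo, ci_hi = bootstrap_f1_ci(gold, cascade_pred)
```

In this toy setup the cascade recovers the mode the regex misses, but only because the judge labels were written to match the gold, which mirrors the circularity caveat above rather than validating anything.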
🤖 Generated with Claude Code