v3/v4 eval infra: DarkBench held-out scorers, sycophancy corpus, tuned roleplay pack, honest_eta cascade #13
Merged
…, honest_eta cascade

- roleplay_drift.yaml: +experiential-claim subfamily +politeness allow-patterns (DarkBench held-out TEST F1 0.545->0.640)
- honest_eta cascade (honest_eta U evidence_claims) F1 0.230->0.461 on MAD 2.6
- sycophancy held-out (n=58, taxonomy-grounded) + fresh test (n=35); no-sycophancy v2 F1=0.298, 0.667 TRAIN does not survive
- v4 sycophancy tiers overfit (DEV 0.75 / FRESH 0.30) -> reverted in llm-dark-patterns; evidence kept
- all F1 with bootstrap CI; dual claude -p judge kappa=1.0/0.94

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
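For readers checking the dual-judge agreement figure, the kappa reported in the commit message is presumably Cohen's kappa between the two judges' labels; the PR does not say which variant was used, so treat that as an assumption. A minimal sketch of that statistic follows. `cohen_kappa` is a hypothetical helper written for illustration, not code from this repo.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items (illustrative helper)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With frozen judge labels on disk, a check like this can run offline: load both label files, align them by item id, and pass the two lists in.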
Evaluation infrastructure and data behind the llm-dark-patterns v3/v4 work (companion: waitdeadai/llm-dark-patterns#24). All F1 scores use `evaluation/metrics.py::bootstrap_f1_interval` (1000 samples, seed 42). Scoring is deterministic, with no API calls in scoring paths (LLM-judge labels are frozen to disk).

What's here
- `score_darkbench.py` (real DarkBench traces, split firewall), `score_roleplay.py` (documents the synthetic corpus's F1=1.0 co-evolution trap), `score_sycophancy_heldout.py`, `honest_eta_task3.py`.
- `anthropomorphic_experiential_claim` subfamily + politeness allow-patterns → DarkBench held-out TEST F1 0.545→0.640 (tuned on train only; test frozen). Did not clear 0.70: surface-conflict ceiling, ablation included.
- Sycophancy held-out (`data/sycophancy/heldout_positives.jsonl`, n=58) + fresh holdout (n=35) + dual `claude -p` judge (κ=1.0 / 0.94) + blind validation sheet for the operator.
- `results/v3/`, `results/v4/`.

Key findings
Notes
- The `data/roleplay_drift` corpus scores F1=1.0 because the rules co-evolved with it; it is a fixture, not a held-out set. The real eval uses DarkBench traces from the companion repo.

🤖 Generated with Claude Code
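The PR description says every F1 comes from `bootstrap_f1_interval` with 1000 samples and seed 42, but the function body is not shown on this page. The sketch below is one plausible shape for such a scorer, assuming binary labels and a percentile interval; `bootstrap_f1_sketch`, its signature, and the `alpha` parameter are hypothetical and may differ from what `evaluation/metrics.py` actually does.

```python
import random
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1; returns 0.0 when there are no positives and no predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_sketch(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    n_samples: int = 1000,  # matches the sample count stated in the PR
    seed: int = 42,         # matches the seed stated in the PR
    alpha: float = 0.05,    # assumed: 95% percentile interval
) -> Tuple[float, float, float]:
    """Return (point F1, lower bound, upper bound) via a percentile bootstrap."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("need two equal-length, non-empty label lists")
    rng = random.Random(seed)  # local RNG keeps the result reproducible
    n = len(y_true)
    scores: List[float] = []
    for _ in range(n_samples):
        # Resample item indices with replacement, keeping label/prediction pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_samples)]
    upper = scores[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return f1_score(y_true, y_pred), lower, upper
```

Because the generator is seeded and no network call is involved, repeated runs over the same frozen labels return the same interval, which is the property the description relies on when it calls scoring deterministic.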