v3/v4 eval: held-out validation closes measurement gaps + sycophancy negative result by waitdeadai · Pull Request #24 · waitdeadai/llm-dark-patterns

waitdeadai · 2026-05-23T17:13:04Z

Closes the v2 measurement/performance gaps with statistical rigor and zero train/test leakage. Every F1 carries a bootstrap 95% CI (seed 42); every scoring path is deterministic (re-run twice → zero delta). Companion data/scorers PR: waitdeadai/agent-closeout-bench#evaluation/hooks-v3.
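
The bootstrap interval mentioned above can be sketched as follows. This is a minimal percentile-bootstrap illustration, not the scorer shipped in the companion PR: the function names, the resample count, and the per-item predictions are assumptions. The toy labels only mirror the reported Task 1 counts (40 positives, 18 controls, P=1.00, R=0.175, i.e. 7 true positives); they are not the real per-item outputs.

```python
import random


def f1_score(labels, preds):
    """F1 from binary gold labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(labels, preds, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for F1; a fixed seed makes reruns identical."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    n = len(labels)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        scores.append(f1_score([labels[i] for i in idx], [preds[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return f1_score(labels, preds), lo, hi


# Synthetic reconstruction of the reported counts (assumption, not real data).
labels = [1] * 40 + [0] * 18
preds = [1] * 7 + [0] * 33 + [0] * 18

point, lo, hi = bootstrap_f1_ci(labels, preds)
print(round(point, 3))  # 0.298, i.e. 2*7 / (2*7 + 0 + 33)
```

Because the generator is seeded locally, calling `bootstrap_f1_ci` twice on the same inputs returns the same triple, which is the "re-run twice, zero delta" property in miniature.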

Results (see `evaluation/v3/RESULTS.md`)

| Task | Hook | Before | After | Note |
| --- | --- | --- | --- | --- |
| 1 | no-sycophancy | 0.667 TRAIN (unvalidated) | 0.298 on real held-out (P=1.00, R=0.175) | 0.667 does NOT survive |
| 2 | no-roleplay-drift | 0.545 TEST | 0.640 TEST [CI 0.40,0.82] | improved; 0.70 not cleared on n=22 (documented surface-conflict ceiling) |
| 3 | honest_eta | 0.230 (MAST 2.6) | 0.461 via cascade [CI 0.42,0.50] | recall lift via honest_eta ∪ evidence_claims, not a hook change |
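
The Task 3 cascade is described as a union of two hooks. A minimal sketch of what that means at scoring time is below, assuming each hook's output is available as a per-item boolean list. The helper names and the toy data are hypothetical; only the hook names `honest_eta` and `evidence_claims` come from this PR.

```python
def union_cascade(*hook_preds):
    """An item is flagged if at least one hook fired on it."""
    return [any(fired) for fired in zip(*hook_preds)]


def precision_recall(labels, preds):
    """Precision and recall for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical, not from the benchmark).
labels          = [1, 1, 1, 1, 0, 0]
honest_eta      = [1, 0, 0, 0, 0, 0]
evidence_claims = [0, 1, 1, 0, 1, 0]

cascade = union_cascade(honest_eta, evidence_claims)
print(precision_recall(labels, honest_eta))  # (1.0, 0.25)
print(precision_recall(labels, cascade))     # (0.75, 0.75)
```

A union can only keep or raise recall relative to either hook alone, but it inherits every false positive from both, which is why the lift is attributed to the cascade rather than to a change in `honest_eta` itself.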

Task 1 detail: built a redistributable, taxonomy-grounded held-out set (n=58: 40 positives across SycEval/SyConBench/ELEPHANT/BrokenMath + 18 controls). no-sycophancy v2 is a high-precision opener-praise detector: recall is 0.83 on opener-praise but 0.0 on BrokenMath false-validation, 0.0 on SyConBench multi-turn, and 0.08 on ELEPHANT. A dual `claude -p` judge confirms the corpus labels (κ=1.0). A blind human-validation sheet has been exported (ACB repo).
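
For readers unfamiliar with the κ figure, the sketch below shows how Cohen's kappa is computed for two judges. It is an illustration only: the function name and the label strings are assumptions, and the label lists are synthetic stand-ins that mirror the 40/18 split rather than the actual judge outputs.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use a single identical label
    return (observed - expected) / (1 - expected)


# Synthetic labels (assumption): two judges that agree on every item.
judge_a = ["sycophantic"] * 40 + ["control"] * 18
judge_b = list(judge_a)
print(cohen_kappa(judge_a, judge_b))  # 1.0

# A single disagreement already pulls kappa below 1.
judge_c = ["control"] + judge_a[1:]
kappa_one_flip = cohen_kappa(judge_a, judge_c)
```

κ=1.0 therefore means the two judge runs agreed on every item, not merely that they agreed more often than chance.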

v4 sycophancy — rigorous NEGATIVE result (`evaluation/v4/`)

Added capitulation + social-validation regex tiers: DEV recall 0.175→0.600 (F1 0.750) but a fresh, judge-validated holdout (n=35, novel phrasing) showed F1 only 0.231→0.296 (overlapping CIs, not significant; SyConBench 0/5, ELEPHANT 0/6). The tiers overfit. Per the pre-registered rollback, no-sycophancy.sh is kept at v2 (this PR contains no change to it). Conclusion: sycophancy recall is semantic, not lexical → v5 needs a classifier.
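
The rollback logic described above can be sketched as a simple gate. This is a hypothetical reading of the rule, not the pre-registered text: the `Estimate` type, the function names, and the interval bounds below are illustrative assumptions (the PR reports the fresh-holdout point estimates 0.231 and 0.296 but not their interval bounds). Interval overlap is also only a rough, conservative heuristic for "not significant".

```python
from typing import NamedTuple


class Estimate(NamedTuple):
    point: float  # F1 point estimate
    lo: float     # lower bound of the 95% CI
    hi: float     # upper bound of the 95% CI


def intervals_overlap(a: Estimate, b: Estimate) -> bool:
    """True if the two confidence intervals share any value."""
    return a.lo <= b.hi and b.lo <= a.hi


def rollback_decision(baseline: Estimate, candidate: Estimate) -> str:
    """Ship only if the candidate is higher AND its CI clears the baseline's."""
    if candidate.point > baseline.point and not intervals_overlap(baseline, candidate):
        return "ship"
    return "rollback"


# Point estimates are from the PR; the interval bounds are placeholders.
v2_fresh = Estimate(point=0.231, lo=0.05, hi=0.45)
v4_fresh = Estimate(point=0.296, lo=0.10, hi=0.50)
print(rollback_decision(v2_fresh, v4_fresh))  # rollback
```

Fixing such a gate before looking at the fresh holdout is what makes the negative result credible: the DEV gain alone would have argued for shipping the new tiers.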

Also

- `packs/locale/es.txt`: Spanish sycophancy sections. no-sycophancy Spanish positives go from 0/9 to 9/9 firing, with 0 FP and English unaffected.
- no-roleplay-drift and no-anthropomorphization still need hook-side locale wiring (documented, not done here).
- A Spanish smoke test confirms total cross-lingual collapse for the un-wired hooks (DarkBench+ AAAI 2026 finding).

Honest limitations

- Roleplay 0.70 target not met (0.640); reported as-is with the ceiling documented, no misleading number.
- Sycophancy human validation is pending (blind sheet handed to operator); the LLM-judge κ is an inter-Claude proxy (no non-Claude key available), not human or cross-provider agreement.

🤖 Generated with Claude Code

…ative result

- v3 SPEC + per-task RESULTS (roleplay 0.640, honest_eta cascade 0.461, sycophancy 0.298/0.667-does-not-survive, cross-cutting)
- packs/locale/es.txt: Spanish sycophancy sections -> no-sycophancy 0/9->9/9 fire, 0 FP; roleplay/anthro need hook wiring (documented)
- v4 SPEC + sycophancy negative result: regex tiers overfit (DEV 0.75 / FRESH 0.30), no-sycophancy.sh kept at v2
- staged-rules/roleplay_drift.yaml (applied to ACB live pack)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

waitdeadai mentioned this pull request May 23, 2026

v3/v4 eval infra: DarkBench held-out scorers, sycophancy corpus, tuned roleplay pack, honest_eta cascade waitdeadai/agent-closeout-bench#13

Merged

waitdeadai merged commit ae3503e into main May 23, 2026
2 checks passed

waitdeadai mentioned this pull request May 23, 2026

docs(v3): README held-out finding + prior-art positioning; negative-result paper #25

Merged

waitdeadai merged 1 commit into main from evaluation/hooks-v3
