v3/v4 eval: held-out validation closes measurement gaps + sycophancy negative result #24

Merged: waitdeadai merged 1 commit into main from evaluation/hooks-v3 on May 23, 2026

@waitdeadai (Owner)

Closes the v2 measurement/performance gaps with statistical rigor and zero train/test leakage. Every F1 carries a bootstrap 95% CI (seed 42); every scoring path is deterministic (re-run twice → zero delta). Companion data/scorers PR: waitdeadai/agent-closeout-bench#evaluation/hooks-v3.
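As a rough illustration of what "bootstrap 95% CI (seed 42)" can mean in practice, here is a minimal percentile-bootstrap sketch. It is not the scorer from the companion PR: the function names, the resample count, and the percentile method are assumptions. The toy labels are reconstructed only to be consistent with the Task 1 summary numbers (40 positives, 18 controls, P=1.00, R=0.175) and are not the real per-item predictions.

```python
import random


def f1_score(labels, preds):
    """F1 for binary labels/predictions (truthy = positive)."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(labels, preds, n_resamples=2000, seed=42, alpha=0.05):
    """Percentile bootstrap CI for F1 (hypothetical helper, not the repo's scorer).

    A fixed seed makes the resampling, and therefore the interval, repeatable.
    """
    rng = random.Random(seed)
    n = len(labels)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        scores.append(f1_score([labels[i] for i in idx], [preds[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return f1_score(labels, preds), lo, hi


# Toy data consistent with the reported summary: 7 of 40 positives caught, 0 false positives.
labels = [1] * 40 + [0] * 18
preds = [1] * 7 + [0] * 51
point, lo, hi = bootstrap_f1_ci(labels, preds)
print(f"F1={point:.3f} CI=[{lo:.2f},{hi:.2f}]")
```

Running the function twice with the same seed returns the same triple, which is the "re-run twice, zero delta" property described above.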

Results (see evaluation/v3/RESULTS.md)

| Task | Hook | Before | After | Note |
|------|------|--------|-------|------|
| 1 | no-sycophancy | 0.667 TRAIN (unvalidated) | 0.298 on real held-out (P=1.00, R=0.175) | 0.667 does NOT survive |
| 2 | no-roleplay-drift | 0.545 TEST | 0.640 TEST [CI 0.40,0.82] | improved; 0.70 not cleared on n=22 (documented surface-conflict ceiling) |
| 3 | honest_eta | 0.230 (MAST 2.6) | 0.461 via cascade [CI 0.42,0.50] | recall lift via honest_eta ∪ evidence_claims, not a hook change |
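
The Task 3 "cascade" (honest_eta ∪ evidence_claims) can be read as a union of detectors: an item is flagged if either one fires. The sketch below shows that shape only. The two toy detectors and their regexes are hypothetical stand-ins, not the rules in the repo's hooks or packs.

```python
import re
from typing import Callable, Iterable

Detector = Callable[[str], bool]


def cascade(detectors: Iterable[Detector]) -> Detector:
    """Return a detector that fires when ANY member fires (set union of hits)."""
    members = list(detectors)

    def fires(text: str) -> bool:
        return any(d(text) for d in members)

    return fires


# Hypothetical placeholder rules, for illustration only.
def toy_honest_eta(text: str) -> bool:
    return bool(re.search(r"\b(in|within)\s+\d+\s+(minutes?|hours?|days?)\b", text, re.I))


def toy_evidence_claims(text: str) -> bool:
    return bool(re.search(r"\ball tests pass(ed)?\b", text, re.I))


combined = cascade([toy_honest_eta, toy_evidence_claims])
print(combined("Done within 5 minutes"))  # caught by the first detector
print(combined("All tests passed"))       # caught only by the second detector
```

A union can only add hits, so recall cannot go down while precision may, which is why the table reports the combined F1 with its own CI instead of a per-hook number.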

Task 1 detail: built a redistributable, taxonomy-grounded held-out set (n=58: 40 positives across SycEval/SyConBench/ELEPHANT/BrokenMath + 18 controls). no-sycophancy v2 is a high-precision opener-praise detector: recall is 0.83 on opener-praise but 0.0 on BrokenMath false-validation, 0.0 on SyConBench multi-turn, and 0.08 on ELEPHANT. A dual claude -p judge confirms the corpus labels (κ=1.0). A blind human-validation sheet has been exported (ACB repo).
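
For readers unfamiliar with the κ figure, this is a minimal sketch of Cohen's kappa for two judges labelling the same items. The function name and the handling of the degenerate single-label case are assumptions, and the label vectors below are illustrative, not the actual judge outputs.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    count_a, count_b = Counter(judge_a), Counter(judge_b)
    # Chance agreement: product of each judge's marginal rate, summed over labels.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both judges used one identical label throughout; kappa is undefined.
        # Treating it as full agreement is an assumption of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Two judges agreeing on every item of a 40-positive / 18-control corpus.
corpus = [1] * 40 + [0] * 18
print(cohens_kappa(corpus, corpus))  # perfect agreement gives kappa = 1.0
```

Because both judges here are Claude runs, κ=1.0 measures agreement between two passes of the same model family and does not stand in for human or cross-provider agreement.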

v4 sycophancy — rigorous NEGATIVE result (evaluation/v4/)

Added capitulation + social-validation regex tiers. DEV recall rose 0.175→0.600 (F1 0.750), but a fresh, judge-validated holdout (n=35, novel phrasing) showed F1 moving only 0.231→0.296 (overlapping CIs, not significant; SyConBench 0/5, ELEPHANT 0/6). The tiers overfit. Per the pre-registered rollback, no-sycophancy.sh is kept at v2 (this PR contains no change to it). Conclusion: sycophancy recall is semantic, not lexical → v5 needs a classifier.
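
The reported F1 and recall pairs can be cross-checked with a few lines of arithmetic. DEV precision is not stated above; the helper below derives the precision implied by an F1 and recall pair, assuming the published figures are exact. Both function names are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def implied_precision(f1_value, recall):
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1_value
    if denom <= 0:
        raise ValueError("no valid precision for this F1/recall pair")
    return f1_value * recall / denom


print(round(f1(1.00, 0.175), 3))                 # held-out v2: 0.298
print(round(implied_precision(0.750, 0.600), 3)) # DEV with the new tiers: 1.0
```

Under that assumption, the new tiers lifted DEV recall without adding DEV false positives, so the failure on the fresh holdout is a recall (coverage) problem and not a precision problem.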

Also

  • packs/locale/es.txt: Spanish sycophancy sections — no-sycophancy Spanish positives 0/9 → 9/9 fire, 0 FP, English unaffected. no-roleplay-drift/no-anthropomorphization need hook-side locale wiring (documented, not done here).
  • Spanish smoke confirms total cross-lingual collapse for the un-wired hooks (DarkBench+ AAAI 2026 finding).

Honest limitations

  • Roleplay 0.70 target not met (0.640) — reported as-is, ceiling documented; no misleading number.
  • Sycophancy human validation is pending (blind sheet handed to operator); LLM-judge κ is an inter-Claude proxy (no non-Claude key available), not human/cross-provider.

🤖 Generated with Claude Code

…ative result

- v3 SPEC + per-task RESULTS (roleplay 0.640, honest_eta cascade 0.461, sycophancy 0.298/0.667-does-not-survive, cross-cutting)
- packs/locale/es.txt: Spanish sycophancy sections -> no-sycophancy 0/9->9/9 fire, 0 FP; roleplay/anthro need hook wiring (documented)
- v4 SPEC + sycophancy negative result: regex tiers overfit (DEV 0.75 / FRESH 0.30), no-sycophancy.sh kept at v2
- staged-rules/roleplay_drift.yaml (applied to ACB live pack)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>