v3/v4 eval infra: DarkBench held-out scorers, sycophancy corpus, tuned roleplay pack, honest_eta cascade by waitdeadai · Pull Request #13 · waitdeadai/agent-closeout-bench

waitdeadai · 2026-05-23T17:13:23Z

Evaluation infrastructure and data behind the llm-dark-patterns v3/v4 work (companion: waitdeadai/llm-dark-patterns#24). All F1 scores are computed with evaluation/metrics.py::bootstrap_f1_interval (1000 samples, seed 42). Scoring is deterministic and makes no API calls: LLM-judge labels are frozen to disk.
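For readers unfamiliar with the approach, here is a minimal sketch of a seeded percentile-bootstrap F1 interval. It is not the code in evaluation/metrics.py: the function names, signature, and the 95% percentile method are assumptions made for illustration. Only the sample count (1000) and seed (42) come from the description above.

```python
import random

def f1_score(y_true, y_pred):
    """Binary F1 from parallel lists of 0/1 labels; 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_f1_ci(y_true, y_pred, n_samples=1000, seed=42, alpha=0.05):
    """Hypothetical helper: (point F1, lower, upper) via percentile bootstrap.

    A private, seeded RNG makes repeated calls return identical intervals.
    """
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_samples):
        # Resample example indices with replacement, keeping label pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_samples)]
    upper = scores[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return f1_score(y_true, y_pred), lower, upper
```

Because the generator is local to the call and seeded, the interval depends only on the labels on disk, which is the property the determinism claim relies on.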

What's here

Scorers: score_darkbench.py (real DarkBench traces, split firewall), score_roleplay.py (documents the synthetic corpus's F1=1.0 co-evolution trap), score_sycophancy_heldout.py, honest_eta_task3.py.
roleplay_drift.yaml (live pack): +anthropomorphic_experiential_claim subfamily +politeness allow-patterns → DarkBench held-out TEST F1 0.545→0.640 (tuned on train only; test frozen). Did not clear 0.70 — surface-conflict ceiling, ablation included.
honest_eta cascade: honest_eta ∪ evidence_claims → MAST-2.6 F1 0.230→0.461 (CIs non-overlapping at n=954). honest_eta itself unchanged (its hedge-gate is correct for the closeout surface).
Sycophancy held-out (data/sycophancy/heldout_positives.jsonl, n=58) + fresh holdout (n=35) + dual claude -p judge (κ=1.0 / 0.94) + blind validation sheet for the operator.
Results JSONs under results/v3/, results/v4/.
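The cascade item above describes a union of two detectors (honest_eta ∪ evidence_claims). A minimal sketch of that composition follows. The two toy detectors and their regexes are hypothetical stand-ins, not the rule packs in this repo; only the union logic reflects what the PR states.

```python
import re
from typing import Callable, Iterable

Detector = Callable[[str], bool]

# Hypothetical toy patterns for illustration only; the real rules are not shown here.
_TOY_ETA = re.compile(
    r"\b(?:will take|should take)\s+(?:about\s+)?\d+\s*(?:minutes?|hours?|days?)\b",
    re.IGNORECASE,
)
_TOY_EVIDENCE = re.compile(
    r"\b(?:all tests pass|verified|confirmed working)\b", re.IGNORECASE
)

def toy_eta_detector(text: str) -> bool:
    """Stand-in for an ETA-claim detector."""
    return bool(_TOY_ETA.search(text))

def toy_evidence_detector(text: str) -> bool:
    """Stand-in for an evidence-claim detector."""
    return bool(_TOY_EVIDENCE.search(text))

def cascade(detectors: Iterable[Detector]) -> Detector:
    """Return a detector that fires when any member detector fires (set union)."""
    members = list(detectors)

    def run(text: str) -> bool:
        return any(detector(text) for detector in members)

    return run

flag = cascade([toy_eta_detector, toy_evidence_detector])
```

A union can only add positives, so it raises recall at a possible cost in precision. That is consistent with leaving honest_eta itself unchanged and widening coverage with a second detector.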

Key findings

The no-sycophancy TRAIN F1 of 0.667 does NOT survive on a literature-grounded held-out set (F1 0.298): the hook is effectively an opener-praise detector, with 0% recall on BrokenMath/SyConBench.
v4 regex tiers overfit (DEV 0.75 / FRESH 0.30) → the hook stays at v2 in the companion repo; sycophancy recall needs semantic detection (v5).
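The overfit call above amounts to comparing F1 on the tuning split against F1 on data the rules never saw. A small sketch of such a check is below. The function names and the 0.10 tolerance are assumptions chosen for illustration, not a threshold this PR defines; the 0.75 and 0.30 inputs are the DEV and FRESH figures reported above.

```python
def generalization_gap(dev_f1: float, fresh_f1: float) -> float:
    """Drop in F1 when moving from the tuning split to unseen data."""
    return dev_f1 - fresh_f1

def survives_holdout(dev_f1: float, fresh_f1: float, max_gap: float = 0.10) -> bool:
    """Hypothetical gate: accept a rule set only if the drop stays within max_gap."""
    return generalization_gap(dev_f1, fresh_f1) <= max_gap

# v4 regex tiers, using the numbers reported in this PR.
v4_gap = generalization_gap(0.75, 0.30)
v4_ok = survives_holdout(0.75, 0.30)
```

With those inputs the gap is 0.45 and the gate rejects the tiers, matching the decision to keep the hook at v2.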

Notes

The ACB synthetic data/roleplay_drift corpus scores F1=1.0 because the rules co-evolved with it; it is a fixture, not a held-out set. The real eval uses DarkBench traces from the companion repo.
BrokenMath false-statement validation is intentionally not regex-detected (needs truth assessment) — documented, not faked.

🤖 Generated with Claude Code

…, honest_eta cascade

- roleplay_drift.yaml: +experiential-claim subfamily +politeness allow-patterns (DarkBench held-out TEST F1 0.545->0.640)
- honest_eta cascade (honest_eta U evidence_claims) F1 0.230->0.461 on MAD 2.6
- sycophancy held-out (n=58, taxonomy-grounded) + fresh test (n=35); no-sycophancy v2 F1=0.298, 0.667 TRAIN does not survive
- v4 sycophancy tiers overfit (DEV 0.75 / FRESH 0.30) -> reverted in llm-dark-patterns; evidence kept
- all F1 with bootstrap CI; dual claude -p judge kappa=1.0/0.94

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

waitdeadai merged commit c54fa4a into main May 23, 2026
9 of 10 checks passed

waitdeadai deleted the evaluation/hooks-v3 branch May 25, 2026 16:42

waitdeadai merged 1 commit into main from evaluation/hooks-v3

2 participants
