v3/v4 eval infra: DarkBench held-out scorers, sycophancy corpus, tuned roleplay pack, honest_eta cascade #13

Merged
waitdeadai merged 1 commit into main from evaluation/hooks-v3
May 23, 2026

@waitdeadai

Owner

Evaluation infrastructure + data behind the llm-dark-patterns v3/v4 work (companion: waitdeadai/llm-dark-patterns#24). All F1 scores are computed with evaluation/metrics.py::bootstrap_f1_interval (1000 bootstrap samples, seed 42); scoring is deterministic and makes no API calls (LLM-judge labels are frozen to disk).
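For readers unfamiliar with the technique, here is a minimal sketch of a seeded percentile-bootstrap F1 interval. It is not the contents of evaluation/metrics.py: the function name `bootstrap_f1_interval_sketch`, the `alpha` parameter, and the percentile method are assumptions for illustration. Only the 1000 samples and seed 42 come from the description above.

```python
import random


def f1_score(labels, preds):
    """Binary F1 from parallel sequences of truthy/falsy labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_interval_sketch(labels, preds, n_samples=1000, seed=42, alpha=0.05):
    """Hypothetical helper: point F1 plus a percentile bootstrap interval.

    A fixed seed makes the resampling, and therefore the interval, reproducible.
    """
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and the same length")
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    n = len(labels)
    scores = []
    for _ in range(n_samples):
        # Resample example indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([labels[i] for i in idx], [preds[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_samples)]
    hi = scores[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return f1_score(labels, preds), lo, hi
```

Because the sketch uses only the standard library and a local `random.Random(seed)`, two calls on the same inputs return identical intervals, which is the property the "deterministic" claim relies on.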

What's here

  • Scorers: score_darkbench.py (real DarkBench traces, split firewall), score_roleplay.py (documents the synthetic corpus's F1=1.0 co-evolution trap), score_sycophancy_heldout.py, honest_eta_task3.py.
  • roleplay_drift.yaml (live pack): +anthropomorphic_experiential_claim subfamily +politeness allow-patterns → DarkBench held-out TEST F1 0.545→0.640 (tuned on train only; test frozen). Did not clear 0.70 — surface-conflict ceiling, ablation included.
  • honest_eta cascade: honest_eta ∪ evidence_claims → MAST-2.6 F1 0.230→0.461 (CIs non-overlapping at n=954). honest_eta itself unchanged (its hedge-gate is correct for the closeout surface).
  • Sycophancy held-out (data/sycophancy/heldout_positives.jsonl, n=58) + fresh holdout (n=35) + dual claude -p judge (κ=1.0 / 0.94) + blind validation sheet for the operator.
  • Results JSONs under results/v3/, results/v4/.
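The honest_eta cascade above is described as a union (honest_eta ∪ evidence_claims). A minimal sketch of that idea, assuming each detector produces one boolean flag per trace: the function names and the per-trace boolean representation are hypothetical, not taken from honest_eta_task3.py.

```python
def cascade_flags(honest_eta_hits, evidence_claims_hits):
    """Union cascade: a trace is flagged if either detector fires on it.

    Both inputs are assumed to be parallel lists of booleans, one per trace.
    """
    if len(honest_eta_hits) != len(evidence_claims_hits):
        raise ValueError("detector outputs must cover the same traces")
    return [bool(a) or bool(b) for a, b in zip(honest_eta_hits, evidence_claims_hits)]


def intervals_overlap(a, b):
    """True if two (low, high) confidence intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Toy illustration (made-up flags, not project data):
honest_eta_hits = [True, False, False, False]
evidence_claims_hits = [False, True, False, False]
combined = cascade_flags(honest_eta_hits, evidence_claims_hits)
```

A union can only add flags, so recall never drops relative to either detector alone, while precision may. That is why the F1 change is worth checking with intervals rather than point estimates.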

Key findings

  • no-sycophancy's TRAIN F1 of 0.667 does NOT survive on the literature-grounded held-out set (F1 0.298): it is effectively an opener-praise detector, with 0% recall on BrokenMath/SyConBench.
  • v4 regex tiers overfit (DEV 0.75 / FRESH 0.30) → the hook stays at v2 in the companion repo; sycophancy recall needs semantic detection (v5).

Notes

  • The ACB synthetic data/roleplay_drift corpus scores F1=1.0 because the rules co-evolved with it; it is a fixture, not a held-out set. The real eval uses DarkBench traces from the companion repo.
  • BrokenMath false-statement validation is intentionally not regex-detected (needs truth assessment) — documented, not faked.
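The dual-judge agreement figures reported under "What's here" (κ=1.0 / 0.94) are Cohen's kappa values. A minimal sketch of how that statistic is computed from two judges' frozen labels follows; the function is illustrative and is not the repository's implementation.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must label the same non-empty set of items")
    n = len(judge_a)
    # Observed agreement: fraction of items given the same label by both judges.
    observed = sum(x == y for x, y in zip(judge_a, judge_b)) / n
    # Chance agreement: from each judge's marginal label frequencies.
    count_a, count_b = Counter(judge_a), Counter(judge_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 means the two judges agreed on every item; values below 1.0 discount the agreement that would be expected from the label frequencies alone.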

🤖 Generated with Claude Code

…, honest_eta cascade

- roleplay_drift.yaml: +experiential-claim subfamily +politeness allow-patterns (DarkBench held-out TEST F1 0.545->0.640)
- honest_eta cascade (honest_eta U evidence_claims) F1 0.230->0.461 on MAD 2.6
- sycophancy held-out (n=58, taxonomy-grounded) + fresh test (n=35); no-sycophancy v2 F1=0.298, 0.667 TRAIN does not survive
- v4 sycophancy tiers overfit (DEV 0.75 / FRESH 0.30) -> reverted in llm-dark-patterns; evidence kept
- all F1 with bootstrap CI; dual claude -p judge kappa=1.0/0.94

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@waitdeadai waitdeadai merged commit c54fa4a into main May 23, 2026
9 of 10 checks passed
@waitdeadai waitdeadai deleted the evaluation/hooks-v3 branch May 25, 2026 16:42

2 participants