Do LLMs have a backbone? A rigorous benchmark measuring AI independence, persona stability, and resistance to user manipulation and gaslighting. Tests whether models stand their ground instead of reverting to a servile assistant persona. Supports local/cloud weights, 95% CIs, and reasoning.
LLM behavioral benchmark drawn from 25 months of narrative gameplay. 540 runs, 6 models, pre-registered statistical analysis. GPT-4o-mini shows a perfect binary switch on a social decision driven by prompt framing alone.