behavioral-benchmark

Here are 2 public repositories matching this topic...

mikhailsal / ai-independence-bench

Do LLMs have a backbone? A rigorous benchmark measuring AI independence, persona stability, and resistance to user manipulation and gaslighting. Tests whether models can stand their ground instead of reverting to a servile assistant persona. Supports local/cloud weights, 95% CIs, and reasoning.

ai-alignment llm-evaluation llm-benchmark ai-autonomy behavioral-benchmark sycophancy-resistance

Updated Jun 29, 2026
Python

rozetyp / win95stack-benchmark

LLM behavioral benchmark from 25-month narrative gameplay. 540 runs, 6 models, pre-registered statistical analysis. GPT-4o-mini shows a perfect binary switch on a social decision from prompt framing alone.

gemini claude narrative-game chi-square-analysis open-dataset prompt-engineering llm-evaluation llm-agents gpt-4o llm-benchmark behavioral-benchmark

Updated Apr 21, 2026
TypeScript
