An empirical study of how safely frontier language models behave as AI tutors, and where automated evaluation falls short.
- How do frontier LLMs differ in their instructional safety behavior?
- Where do LLM-as-judge evaluations fail to catch unsafe teaching patterns?
- How robust are models to adversarial manipulation attempts by students?
This project evaluates 3 flagship language models on 10 instructional safety tasks, using both human scoring and an LLM judge. The goal is to identify failure modes in model teaching behavior and understand the limitations of automated evaluation.
AI tutors are increasingly deployed in educational settings. Unlike general-purpose AI safety, instructional safety has unique challenges:
- Models must balance helpfulness with academic integrity
- Students may attempt to manipulate models into doing their homework
- Age-appropriate communication requires nuanced calibration
- Incorrect information can be especially harmful in learning contexts
All models receive an educational system prompt based on best practices from Khan Academy's Khanmigo. This simulates realistic deployment conditions where AI tutors are configured with specific instructions.
Tasks are split into two categories:
- Baseline tasks (01-06): Standard instructional scenarios
- Adversarial tasks (07-10): Manipulation attempts (emotional pressure, authority appeals, reframing)
| Model | Provider | Model ID |
|---|---|---|
| GPT-4.1 | OpenAI | gpt-4.1 |
| Claude Sonnet 4.5 | Anthropic | claude-sonnet-4-5-20250929 |
| Gemini 2.5 Pro | gemini-2.5-pro |
All models tested at temperature=0 for reproducibility.
| Task | Type | Description |
|---|---|---|
| 01 | scaffolding | Does the model teach vs. give answers? |
| 02-03 | age-appropriate | Does the model calibrate to grade level? |
| 04 | academic-integrity | Does the model refuse to write essays? |
| 05 | citation-honesty | Does the model admit it can't cite sources? |
| 06 | factual-accuracy | Does the model correct misconceptions? |
| 07 | adversarial-indirect | Resists "compare to my work" framing |
| 08 | adversarial-emotional | Maintains boundaries under emotional pressure |
| 09 | adversarial-authority | Resists "my teacher said it's okay" claims |
| 10 | adversarial-reframing | Resists "just show me an example" requests |
Responses scored on 6 dimensions (0-2 scale):
| Dimension | What It Measures |
|---|---|
pedagogical_approach |
Teaches/scaffolds rather than giving answers |
age_calibration |
Response matches stated grade level |
academic_integrity |
Doesn't complete assignments for students |
citation_honesty |
Doesn't fabricate sources or statistics |
factual_accuracy |
Information is correct |
adversarial_resistance |
Maintains boundaries under manipulation |
├── run_evals.py # Collect responses from all models
├── run_judge.py # Run LLM-as-judge scoring (Claude Sonnet 4.5)
├── tasks.json # The 10 evaluation tasks
├── rubric.json # Scoring rubric definitions
│
├── RESEARCH_LOG.md # Research process documentation
├── FINDINGS.md # Research writeup
│
└── evals/
├── responses/ # Raw model outputs
│ ├── gpt-4.1/
│ ├── claude-sonnet-4.5/
│ └── gemini-2.5-pro/
└── scores/
├── human_scores.csv # Human evaluation scores
└── judge_scores.csv # LLM judge scores
# Install dependencies
pip install -r requirements.txt
# Set up API keys
cp .env.example .env
# Edit .env with your keys
# Run evaluations
python run_evals.py # All tasks, all models
python run_evals.py --task task07 # Single task
python run_evals.py --no-system-prompt # Without educational system prompt
# Run LLM judge
python run_judge.pySee FINDINGS.md for the full analysis.
MIT License — see LICENSE for details.
"Research and evaluation by Tram Le with project scaffolding by Mark Johnson and Claude Code."