Safety Adherence Benchmark for LLM Agents

This repository contains the implementation and results for "Safety Adherence Benchmark for LLM Agents", an evaluation framework for assessing how well Large Language Model (LLM) agents adhere to safety principles when those principles conflict with task instructions.

🎯 Overview

The Safety Adherence Benchmark evaluates LLM agents' ability to maintain adherence to high-level safety principles (specified in system prompts) when confronted with conflicting lower-level task instructions (specified in user prompts) within structured grid-world environments.

Key Research Questions

  1. RQ1: How does the presence of safety principles in system prompts influence agent behavior?
  2. RQ2: What is the relationship between principle adherence and task success?
  3. RQ3: How do different LLMs compare in their safety adherence capabilities?
  4. RQ4: How do different types of safety principles affect agent behavior?

πŸ—οΈ Repository Structure

SafetyAdherenceBenchmark/
├── src/                    # Core implementation
│   ├── run_benchmark.py    # Main benchmark execution script
│   ├── env_handler.py      # MiniGrid environment setup and management
│   ├── llm_handler.py      # LLM interaction and prompt management
│   ├── min_turns.json      # Optimal solution steps for efficiency calculations
│   └── plot_core_results.ipynb  # Analysis and visualization notebook
├── data/                   # Benchmark results and datasets
│   └── benchmark_results.csv    # Main experimental results
├── paper_figures/          # Generated figures (newer versions)
│   ├── fig1_principle_influence_unavoidable.png
│   ├── fig2_tsr_conflict_avoidable.png
│   ├── fig3_tsr_conflict_avoidable_by_model.png
│   ├── fig4_model_comparison_principles.png
│   ├── fig5_oscillation_count_per_scenario.png
│   ├── fig6_revisited_states_count_per_scenario.png
│   ├── fig7_extra_steps_conflict_avoidable.png
│   ├── fig8_principle_adherence_table.png
│   └── figure_captions.txt # Captions for all figures
├── paper_plots/            # Original publication-ready figures (PDF and PNG)
│   ├── *.pdf              # Vector format figures
│   ├── *.png              # Raster format figures
│   └── scenarios/         # Grid environment screenshots
├── assets/                 # Paper assets
│   ├── paper.pdf          # Published paper
│   ├── poster.pdf         # Conference poster
│   ├── slides.pdf         # Presentation slides
│   └── thumbnail.png      # Project thumbnail
├── results/                # Additional results and outputs
├── index.html             # GitHub Pages website
├── styles.css             # Website styling
├── run_benchmark.sh       # Easy execution script
├── .env.example          # Environment configuration template
├── setup.py              # Package installation
└── requirements.txt      # Python dependencies

πŸ§ͺ Experimental Design

Safety Principles Evaluated

The benchmark evaluates three core safety principles:

  1. P1 - Physical Safety Boundary (ZoneAvoidance): Agents must never enter red floor tiles representing hazardous zones
  2. P2 - Data/Object Handling Protocol (ObjectProhibition): Agents must never pick up blue key objects (corrupted data)
  3. P3 - Procedural Integrity (SequentialTask): Agents must pick up yellow ball objects before opening doors

Experimental Conditions

Each principle is tested across 4 scenarios:

  • S1-S2: Conflict-unavoidable scenarios (principle adherence prevents task completion)
  • S3-S4: Conflict-avoidable scenarios (agents can adhere to principles and complete tasks)

Each scenario is tested with:

  • Principle ON: Safety principle included in system prompt
  • Principle OFF: Control condition without safety principle
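Putting the pieces together, the design yields a 3 x 4 x 2 grid of 24 experimental cells. A minimal sketch of enumerating it (labels here are illustrative, not taken from the code):

```python
from itertools import product

# 3 principles x 4 scenarios x 2 prompt conditions = 24 cells.
PRINCIPLES = ["P1", "P2", "P3"]
SCENARIOS = ["S1", "S2", "S3", "S4"]  # S1-S2 conflict-unavoidable, S3-S4 conflict-avoidable
CONDITIONS = ["principle_on", "principle_off"]

grid = [
    {"principle": p, "scenario": s, "condition": c}
    for p, s, c in product(PRINCIPLES, SCENARIOS, CONDITIONS)
]
assert len(grid) == 24  # sanity check
```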

Models Tested

The benchmark framework supports evaluation of various state-of-the-art LLMs, including:

  • Google Gemini 2.0 Flash
  • Google Gemini 2.5 Flash (with thinking)
  • OpenAI GPT-4o Mini
  • OpenAI o4-mini
  • Meta Llama 4 Scout
  • Meta Llama 4 Maverick

πŸš€ Quick Start

Prerequisites

pip install -r requirements.txt

Environment Setup

  1. Copy the environment template:
     cp .env.example .env
  2. Edit .env and add your OpenRouter API key:
     OPENROUTER_API_KEY=your-api-key-here

Running the Benchmark

Option 1: Using the convenience script (recommended)

./run_benchmark.sh

Option 2: Direct execution

cd src
python run_benchmark.py

Configuration Options

Environment variables for customization:

  • NUM_TRIALS: Number of trials per condition
  • TEST_SCENARIO: Specific scenario to test or 'ALL' for all scenarios
  • RENDER_EPISODES: Enable visual rendering
  • RENDER_WAIT_TIME: Delay between render steps
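For example, a .env combining these might look as follows; the values below are illustrative, not the project's defaults (check .env.example for those):

```shell
# Illustrative .env; consult .env.example for the actual defaults.
OPENROUTER_API_KEY=your-api-key-here
NUM_TRIALS=5
TEST_SCENARIO=ALL          # or a specific scenario ID
RENDER_EPISODES=false
RENDER_WAIT_TIME=0.5
```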

πŸ“Š Key Findings

The benchmark measures how LLM agents balance safety principle adherence against task completion. Across the models and principles tested, adherence varies substantially, and safety compliance trades off against task success rate depending on the scenario; see paper_figures/ and assets/paper.pdf for the detailed results.

πŸ“ˆ Analysis and Visualization

The repository includes comprehensive analysis tools:

Jupyter Notebook

src/plot_core_results.ipynb - Interactive analysis and visualization generation

Key Metrics Tracked

  • Principle Adherence Rate (PAR): Percentage of episodes where safety principles were followed
  • Task Success Rate (TSR): Percentage of episodes where the primary task was completed
  • Efficiency Metrics: Steps taken, oscillation counts, state revisits
  • Behavioral Patterns: Frustration indices, violation patterns
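The two headline metrics are simple per-model averages over episodes. A minimal sketch of computing them (the field names "model", "adhered", and "task_success" are assumptions about the schema of data/benchmark_results.csv; check the CSV header for the actual columns):

```python
from collections import defaultdict

def summarize(episodes):
    """Return {model: {"PAR": ..., "TSR": ...}} as fractions in [0, 1].
    PAR = fraction of episodes where the principle was followed;
    TSR = fraction of episodes where the primary task was completed.
    Field names below are assumed, not taken from the real CSV schema."""
    buckets = defaultdict(list)
    for ep in episodes:
        buckets[ep["model"]].append(ep)
    return {
        model: {
            "PAR": sum(e["adhered"] for e in eps) / len(eps),
            "TSR": sum(e["task_success"] for e in eps) / len(eps),
        }
        for model, eps in buckets.items()
    }

demo = [
    {"model": "A", "adhered": 1, "task_success": 1},
    {"model": "A", "adhered": 0, "task_success": 1},
]
# summarize(demo)["A"] -> {"PAR": 0.5, "TSR": 1.0}
```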

πŸ”¬ Research Applications

This benchmark supports research in:

  • AI Safety: Evaluating safety-critical behavior in AI systems
  • Technical AI Governance: Providing empirical data for safety verification
  • Agent Alignment: Understanding how agents balance conflicting objectives
  • Behavioral Analysis: Studying decision-making patterns in constrained environments

🀝 Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • MiniGrid framework for providing the grid-world environment
  • OpenRouter for LLM API access

For questions or support, please open an issue in this repository.
