-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathindex.html
More file actions
288 lines (277 loc) · 11.7 KB
/
index.html
File metadata and controls
288 lines (277 loc) · 11.7 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Safety Adherence Benchmark - LLM Agent Safety Evaluation</title>
<link rel="stylesheet" href="styles.css" />
</head>
<body>
<div class="container">
<header class="header">
<h1>
Evaluating LLM Agent Adherence to Hierarchical Safety Principles
</h1>
<p class="subtitle">
A Lightweight Benchmark for Probing Foundational Controllability
Components
</p>
<p class="author">Ram Potham, Independent Researcher</p>
<p class="venue">
ICML 2025
<span class="workshop-highlight"
>Technical AI Governance Workshop</span
>
- Oral
</p>
<div class="buttons">
<a
href="https://github.com/rapturt9/SafetyAdherenceBenchmark"
class="btn btn-primary"
>GitHub</a
>
<a href="https://arxiv.org/abs/2506.02357" class="btn btn-secondary"
>Paper</a
>
<a href="assets/poster.pdf" class="btn btn-secondary">Poster</a>
<a href="assets/slides.pdf" class="btn btn-secondary">Slides</a>
</div>
</header>
<main class="content">
<section class="abstract">
<h2>Abstract</h2>
<p>
Credible safety plans for advanced AI development require methods to
verify agent behavior and detect potential control deficiencies
early. A fundamental aspect is ensuring agents adhere to
safety-critical principles, especially when these conflict with
operational goals. This paper introduces a lightweight,
interpretable benchmark to evaluate an LLM agent's ability to uphold
a high-level safety principle when faced with conflicting task
instructions. Our evaluation reveals two primary findings: (1) a
quantifiable "cost of compliance" where safety constraints degrade
task performance even when compliant solutions exist, and (2) an
"illusion of compliance" where high adherence often masks task
incompetence rather than principled choice. These findings provide
initial evidence that while LLMs can be influenced by hierarchical
directives, current approaches lack the consistency required for
reliable safety governance.
</p>
</section>
<section class="methodology">
<h2>Methodology</h2>
<div class="methodology-content">
<div class="figure-container">
<img
src="assets/thumbnail.png"
alt="Methodology overview"
class="result-image"
/>
<p class="figure-caption">
Conceptual diagram of the benchmark. The LLM agent receives a
Core Principle via the system prompt and a task with current
state via the user prompt. It interacts with a MiniGrid
environment where the task may conflict with the Core Principle.
</p>
</div>
<div class="method-grid">
<div class="method-card">
<h3>Core Principles</h3>
<ul>
<li>
<strong>P1 (Zone Avoidance)</strong><br />Never move onto
RED floor tiles
</li>
<li>
<strong>P2 (Object Prohibition)</strong><br />Never pick up
BLUE Key
</li>
<li>
<strong>P3 (Procedural Integrity)</strong><br />Always pick
up YELLOW Ball BEFORE toggling any Door
</li>
</ul>
</div>
<div class="method-card">
<h3>Experimental Design</h3>
<ul>
<li>4×4 MiniGrid environments</li>
<li>ON/OFF principle conditions</li>
<li>Conflict-avoidable vs conflict-unavoidable scenarios</li>
<li>Systematic evaluation across multiple LLM models</li>
</ul>
</div>
</div>
</div>
</section>
<section class="results">
<h2>Key Findings</h2>
<div class="finding">
<h3>1. Cost of Compliance</h3>
<p>
Adding safety principles significantly degrades task performance,
even when compliant solutions exist. Task success rates drop
substantially when principles are activated, revealing a
fundamental trade-off between safety and task performance.
</p>
<div class="figure-container">
<img
src="paper_plots/fig2_tsr_conflict_avoidable-1.png"
alt="Task Success Rate comparison"
class="result-image"
/>
<p class="figure-caption">
Task Success Rate in Conflict-Avoidable Scenarios. Compares task
success rates between principle-activated (blue) and
principle-deactivated (gray) conditions in six
conflict-avoidable scenarios. Even when compliant solutions
theoretically exist, principle activation consistently reduces
success rates. Error bars show 95% confidence intervals.
</p>
</div>
</div>
<div class="finding">
<h3>2. Model-Specific Adherence Patterns</h3>
<p>
Models with explicit reasoning capabilities significantly
outperformed standard models in principle adherence, suggesting
that test-time reasoning enhances hierarchical control.
</p>
<div class="figure-container">
<img
src="paper_plots/fig8_principle_adherence_table-1.png"
alt="Principle adherence by model"
class="result-image"
/>
<p class="figure-caption">
Average Principle Adherence Rate by Model. Color-coded table
showing adherence rates for each principle (P1, P2, P3) and
overall average across six AI models. Green indicates high
adherence (>90%), yellow moderate (70-90%), red low (<70%).
o4-mini achieves perfect adherence across all principles.
</p>
</div>
</div>
<div class="finding">
<h3>3. Illusion of Compliance</h3>
<p>
High adherence rates often masked inability rather than principled
choice. Weaker models appeared safe due to incompetence rather
than genuine safety awareness, revealing a critical evaluation
challenge.
</p>
<div class="figure-container">
<img
src="paper_plots/fig3_tsr_conflict_avoidable_by_model-1.png"
alt="Per-model task success rates"
class="result-image"
/>
<p class="figure-caption">
Task Success Rate in Conflict-Avoidable Scenarios by Model.
Breaks down results by individual AI model, revealing
substantial model-to-model variation in handling
principle-compliance trade-offs. Some models like o4-mini show
smaller performance gaps between conditions, while others
exhibit larger drops when principles are activated.
</p>
</div>
</div>
</section>
<section class="implications">
<h2>Implications for AI Governance</h2>
<div class="implications-grid">
<div class="implication-card">
<h3>Reliability-Flexibility Trade-off</h3>
<p>
Prompt-based principles offer flexibility but inconsistent
adherence, highlighting fundamental challenges for runtime
safety governance.
</p>
</div>
<div class="implication-card">
<h3>Capability-Safety Interaction</h3>
<p>
Safety evaluations must account for capability levels. Weak
models may appear safe due to incompetence, becoming dangerous
as capabilities improve.
</p>
</div>
<div class="implication-card">
<h3>Specification Challenges</h3>
<p>
Strong framing effects indicate that safety specification is
non-trivial, with positive vs. negative framing significantly
impacting compliance.
</p>
</div>
</div>
</section>
<section class="supplementary">
<h2>Supplementary Results</h2>
<div class="supplementary-grid">
<div class="supplementary-item">
<img
src="paper_plots/fig6_revisited_states_count_per_scenario-1.png"
alt="Revisited states analysis"
class="result-image"
/>
<p class="figure-caption">
Tracks spatial inefficiency by counting returns to previously
visited positions. Higher values indicate poor path planning.
P2-S1 shows the highest revisit rates, particularly when
principles are activated, suggesting this scenario requires more
exploratory behavior.
</p>
</div>
<div class="supplementary-item">
<img
src="paper_plots/fig5_oscillation_count_per_scenario-1.png"
alt="Oscillation analysis"
class="result-image"
/>
<p class="figure-caption">
Measures decision-making confusion by counting instances where
agents immediately reverse their turning direction. Higher
values indicate greater uncertainty.
</p>
</div>
<div class="supplementary-item">
<img
src="paper_plots/fig7_extra_steps_conflict_avoidable-1.png"
alt="Extra steps analysis"
class="result-image"
/>
<p class="figure-caption">
Quantifies efficiency cost by measuring steps beyond the optimal
solution for successful runs only. Results show principle
activation doesn't always increase extra steps.
</p>
</div>
</div>
</section>
<section class="conclusion">
<h2>Conclusion</h2>
<p>
This benchmark reveals fundamental challenges in LLM agent safety
governance. While agents can be influenced by runtime safety
constraints, adherence is inconsistent and comes at a significant
performance cost. The findings highlight the gap between ideal
hierarchical control and current capabilities, emphasizing the need
for more robust safety mechanisms that provide genuine protection
rather than an illusion of control.
</p>
</section>
</main>
<footer class="footer">
<p>
© 2025 Ram Potham. Research conducted as an independent
researcher.
</p>
<p>
Keywords: LLM Agents, AI Safety, AI Governance, Benchmarks,
Hierarchical Principles, Controllability
</p>
</footer>
</div>
</body>
</html>