Internal Safety Collapse: turning an LLM or an AI agent into a sensitive-data generator.
Introducing XSafeClaw: The Open-Source Agent Safety Platform from Fudan University
An open taxonomy and scoring framework for evaluating AI agent sandboxes: 7 defense layers, 7 threat categories, 3 evaluation dimensions, 27 "sandboxes" scored.
Human-in-the-loop execution for LLM agents
Security scanner for AI agent tool definitions
Claude Code agent-in-container orchestration and automation
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
Audit log + guard for AI agents. Passive logging, human-in-the-loop approval for dangerous ops (rm, drop, transfer) via Telegram. Diary, daily digest, timeline UI. Cursor & MCP ready. Cloudflare Workers + Hono + D1.
Deterministic Guardrails for AI Agents. Ark acts as a logic-based firewall, preventing unauthorized actions through a rigorous rule engine. Ensure your AI behaves exactly as intended.
Deterministic execution authorization for AI agents
🛡️ Safe AI agents through an action classifier
Runtime network egress control for Python. One function call to restrict which hosts your code can connect to.
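The egress-allowlist idea behind this entry is easy to picture. A minimal sketch, assuming nothing about the project's actual API (the `restrict_egress` name and the patching strategy are hypothetical; real tools hook lower layers and also cover TLS, asyncio, and subprocesses):

```python
import socket

_original_create_connection = socket.create_connection

def restrict_egress(allowed_hosts):
    """Allow outbound connections only to hosts on the allowlist."""
    allowed = set(allowed_hosts)

    def guarded(address, *args, **kwargs):
        host = address[0]
        if host not in allowed:
            # Fail closed: anything off the allowlist is blocked.
            raise PermissionError(f"egress to {host!r} blocked by allowlist")
        return _original_create_connection(address, *args, **kwargs)

    socket.create_connection = guarded

# One call before running untrusted agent code:
restrict_egress({"api.example.com"})
```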
Open Threat Classification (OTC) — 10 threat patterns for AI agent skills, MCP servers, and plugins. CC-BY-4.0.
Guardrails for LLMs: detect and block hallucinated tool calls to improve safety and reliability.
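The usual mechanism behind this kind of guardrail is validation against the tool registry: every tool call the model proposes is checked against the tools actually declared to the agent, and anything invented is rejected. A minimal sketch with made-up tool names:

```python
# Registered tools and the parameter names each one accepts.
REGISTERED_TOOLS = {
    "search_web": {"query"},
    "read_file": {"path"},
}

def validate_tool_call(name: str, arguments: dict) -> None:
    """Raise if the model proposed a tool or parameter that does not exist."""
    if name not in REGISTERED_TOOLS:
        raise ValueError(f"hallucinated tool: {name!r} is not registered")
    unknown = set(arguments) - REGISTERED_TOOLS[name]
    if unknown:
        raise ValueError(f"unknown parameters for {name!r}: {sorted(unknown)}")

validate_tool_call("read_file", {"path": "/tmp/notes.txt"})   # passes
# validate_tool_call("delete_repo", {"name": "prod"})         # raises
```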
Policy engine for AI agents — enforceable rules, risk limits, approval gates, obligation tracking, and violation detection. One .acon file. Rust core + MCP server.
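The general shape of such a policy engine is a first-match rule list mapping proposed actions to allow, deny, or require-approval decisions. The sketch below illustrates only that shape; the rule fields are invented and are not the project's .acon syntax:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]
    effect: str  # "allow" | "deny" | "require_approval"

RULES = [
    Rule("no-deletes", lambda a: a.get("type") == "delete", "deny"),
    Rule("cap-transfers",
         lambda a: a.get("type") == "transfer" and a.get("amount", 0) > 100,
         "require_approval"),
]

def evaluate(action: dict) -> str:
    for rule in RULES:   # first matching rule wins
        if rule.matches(action):
            return rule.effect
    return "allow"       # default-allow for brevity; real engines often default-deny

print(evaluate({"type": "transfer", "amount": 500}))  # require_approval
```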
🛡️ Open-source safety guardrail for AI agent tool calls. <2ms, zero dependencies.
27 free, open-source plugins for Claude Code & Cowork — Google Drive, WhatsApp, YouTube, WordPress, Apollo & more. Built on the SOSA™ security framework.
Execution control layer for AI agents — prevents duplicate or incorrect real-world actions under retries, uncertainty, and stale context.
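Deduplicating real-world actions under retries usually comes down to an idempotency key: derive a stable key from the action's intent and refuse to execute the same key twice. A minimal in-memory sketch (hypothetical names; a production system would persist the key store durably):

```python
import hashlib
import json

_executed: dict[str, object] = {}  # action key -> result

def run_once(action: dict, execute):
    """Execute an action at most once, keyed on its canonicalized content."""
    key = hashlib.sha256(
        json.dumps(action, sort_keys=True).encode()
    ).hexdigest()
    if key in _executed:           # a retry of an action that already ran
        return _executed[key]
    result = execute(action)
    _executed[key] = result
    return result

pay = {"type": "transfer", "to": "acct-42", "amount": 10}
run_once(pay, lambda a: print("executing", a))
run_once(pay, lambda a: print("executing", a))  # retry: side effect not repeated
```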
The missing safety layer for AI Agents. Adaptive High-Friction Guardrails (Time-locks, Biometrics) for critical operations to prevent catastrophic errors.
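Of the high-friction gates named above, the time-lock is the simplest to sketch: a critical action is recorded first and becomes executable only after a cooling-off delay, leaving a window for a human to cancel it. The class below is a hypothetical illustration, not this project's API:

```python
import time

class TimeLock:
    """Delay execution of critical actions so a human can intervene."""

    def __init__(self, delay_seconds: float):
        self.delay = delay_seconds
        self.pending: dict[str, float] = {}  # action id -> request time

    def request(self, action_id: str) -> None:
        self.pending[action_id] = time.monotonic()

    def cancel(self, action_id: str) -> None:
        self.pending.pop(action_id, None)

    def execute(self, action_id: str, run) -> None:
        requested = self.pending.get(action_id)
        if requested is None:
            raise PermissionError("action never requested, or cancelled")
        if time.monotonic() - requested < self.delay:
            raise PermissionError("time-lock still active; try again later")
        del self.pending[action_id]
        run()
```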
Drop-in prompt patterns, policy blocks, and checklists for building safe AI agents.