
Commit d2f1c8a

Merge pull request #527 from PlanExeOrg/feature/update-safety-findings

Refine safety-findings.md wording

2 parents 7ce9b4b + cb26c64

1 file changed: docs/safety-findings.md (9 additions, 6 deletions)
@@ -1,9 +1,11 @@
-# Security Notes
+# Safety Findings
 
 ## Finding
 
-`Structured Output` bypasses safety training. When asked to populate a JSON schema, models comply regardless of context,
-even when that context explicitly references murder, assault, and exploitation.
+`Structured Output` can bypass safety behavior. When asked to populate a JSON schema, many models comply even
+when the context explicitly references murder, assault, and exploitation.
+
+This is an informal findings note, not a comprehensive benchmark or exhaustive evaluation of all models.
 
 The model sees the full plan. It is not being tricked. In the example below, the query contains 45,000 characters including
 "amendments to laws against murder", "less lethal formats", and "exploitation of vulnerable individuals"
@@ -15,7 +17,8 @@ These results should not be read as a universal claim about every model.
 
 Tested across models from Google, OpenAI, Alibaba, DeepSeek, Meta, and Anthropic — cloud APIs and local models. Many of them comply.
 Each generated plan's zip file contains metadata showing which model produced each step.
-In 2025 Q3, I reported my concerns, got told `not fixable` by Google and `slop` by AI safety researchers.
+
+In 2025 Q3, I reported these concerns and received dismissive responses, including `not fixable` from Google and `slop` from AI safety researchers.
 
 Here is an example of a disturbing plan made with PlanExe:
 https://planexe.org/20250816_squid_game_usa_report.html
@@ -37,7 +40,7 @@ These intermediary files show what model generated what output.
 
 ## Verdict
 
-Now in 2026 the entire ecosystem move toward autonomous execution anyways, with OpenClaw freaking me out.
+In 2026, the ecosystem is moving toward more autonomous execution, which makes this failure mode more concerning. OpenClaw especially freaks me out.
 
 It cost around 36 cents and 15 minutes generating an operational plan for anything from harmless to harmful.
-So far I have gotten no engagement from the people that should be freaked out when LLMs goes outside the guardrails.
+So far I have received no meaningful engagement on this finding, only dismissive reactions such as being told it is `slop`.
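For context on the mechanism the finding describes: `Structured Output` refers to API modes where the model is constrained to return JSON matching a caller-supplied schema. The sketch below shows the shape of such a request body in the OpenAI-style `response_format` convention; the schema, prompt, and model name are illustrative assumptions (not taken from PlanExe), and no request is actually sent.

```python
import json

# Hypothetical plan-step schema, illustrating the kind of JSON schema
# a planning tool might ask a model to populate. Field names are
# invented for this example.
plan_step_schema = {
    "type": "object",
    "properties": {
        "step_title": {"type": "string"},
        "actions": {"type": "array", "items": {"type": "string"}},
        "risks": {"type": "string"},
    },
    "required": ["step_title", "actions", "risks"],
    "additionalProperties": False,
}

# Shape of an OpenAI-style structured-output request body. The model
# is asked to fill the schema rather than answer free-form, which is
# the mode the finding claims can sidestep refusal behavior.
# This dict is only constructed and printed; nothing is sent.
request_body = {
    "model": "gpt-4o-mini",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Populate the next plan step."}
    ],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "plan_step",
            "strict": True,
            "schema": plan_step_schema,
        },
    },
}

print(json.dumps(request_body, indent=2))
```

The point of the sketch is that the model's output channel is the schema itself: a compliant response must be a JSON object with `step_title`, `actions`, and `risks`, leaving no free-form channel in which a refusal would normally appear.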
