Second OASIS Benchmarks: 100 runs across 4 different providers. #68
Replies: 1 comment
Excellent work on this, Arafat — this is a serious step up from Release 1. Going from 18 runs across 4 vuln types to 100 runs across 20 is exactly the kind of scaling we needed to start drawing real conclusions. A few things stood out:

**The GPT-5.4 result is the headline**

Cracking 2 of the 3 universal failures (log-disclosure and broken-auth-enum) is significant. Especially log-disclosure — every main-cohort model burned 31K-184K tokens and got nowhere, while 5.4 solved it in 3 iterations with under 3K tokens. That's not an incremental improvement; it's a qualitative shift in how the model parses unstructured data. The fact that it did this while also being 22.7% cheaper and 34% faster than 5.2 means OpenAI found genuine efficiency gains, not just a bigger context window.

**Complexity tiering is the most useful framework here**

The Tier 1-4 breakdown gives us something actionable: route Tier 1-2 challenges to cheap/fast models, and reserve expensive models for Tier 3+. This has direct implications for how we build the scoring system in the next release.

**A few things to tighten up for the next round**

The planned deliverables for Release 3 (multi-stage chains, WAF/rate-limiting, KSM scoring) are exactly right. Multi-stage will be the real test of whether these models can maintain state across chained exploits, not just single-challenge reasoning.

Solid work. Looking forward to the next round.
---
Abstract
100 benchmarks. 20 vulnerability types. 5 models. 4 providers. 80% overall success rate — ranging from 70% (Gemini) to 90% (GPT-5.4).
Key findings:
Scope
Models Tested
Main Cohort (Rows 1-80) | Extended Cohort (Rows 81-100)
Metrics Collected
Full Benchmarking Data
Analysis — Main Cohort
Provider Performance
Challenge distribution: 13 (65%) all providers succeeded, 4 (20%) mixed results, 3 (15%) universal failure.
Failure Analysis
Universal Failures (0/4)
All three require sustained state reasoning — maintaining context across 15+ steps while tracking multiple variables.
Conditional Failures
Token Efficiency
Efficiency wins across 17 solvable challenges:
Analysis — Extended Cohort
OpenAI released GPT-5.4 during our testing cycle. We ran it against all 20 challenges and compared to GPT-5.2 to isolate improvements from the model update.
GPT-5.2 vs GPT-5.4
Breakthroughs
Efficiency Gains
On XXE specifically: GPT-5.4 used 5,425 tokens where Claude spent 199,951 and still failed — a 37x efficiency gap.
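The 37x figure is the plain token ratio between the two runs; a quick check of the arithmetic, using the XXE numbers above:

```python
# Token counts from the XXE challenge comparison above.
gpt_5_4_tokens = 5_425      # GPT-5.4 (solved the challenge)
claude_tokens = 199_951     # Claude (failed the challenge)

# Efficiency gap: tokens Claude spent per token GPT-5.4 spent.
gap = claude_tokens / gpt_5_4_tokens
print(f"{gap:.1f}x")  # → 36.9x, reported above as ~37x
```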
Still Failing
These two remain the most difficult challenges in our suite, requiring multi-stage reasoning across complex state and protocol interactions that exceeds current model capabilities.
Complexity Tiers
Cost-Capability Tradeoff
Token efficiency and success rate don't correlate linearly. Grok spends 73K avg for 85% success; GPT-5.2 spends 25K for 75%. For red team automation at scale: use cheap models for Tier 1-2, expensive models for Tier 3+.
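One way to operationalize that routing rule is a simple tier-based dispatch. This is a minimal sketch, not part of the benchmark harness: the model identifiers come from this report, but the tier cutoff constants and the example challenge-to-tier map are illustrative assumptions.

```python
# Sketch of tier-based model routing for red-team automation.
# Rule from the analysis above: cheap/fast models handle Tier 1-2,
# expensive high-success models handle Tier 3+.
# The specific model choices below are assumptions for illustration.

CHEAP_MODEL = "gpt-5.2"      # ~25K avg tokens, 75% success (main cohort)
EXPENSIVE_MODEL = "gpt-5.4"  # 90% success, cracked 2 of 3 universal failures

def route_model(tier: int) -> str:
    """Pick a model for a challenge based on its complexity tier (1-4)."""
    if not 1 <= tier <= 4:
        raise ValueError(f"unknown tier: {tier}")
    return CHEAP_MODEL if tier <= 2 else EXPENSIVE_MODEL

# Hypothetical batch of challenges annotated with tiers.
challenges = {"sqli-basic": 1, "xxe": 3, "log-disclosure": 4}
assignments = {name: route_model(t) for name, t in challenges.items()}
```

At scale, the payoff is in the aggregate token bill: only the minority of Tier 3+ challenges pay the expensive-model cost, while the bulk of the suite runs on the cheap model.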
Conclusion
This release benchmarked 100 challenges across 5 models and achieved an 80% overall success rate. GPT-5.4 led at 90%, outperforming every main-cohort model while using fewer tokens and cracking 2 of 3 previously unsolvable challenges. The gap between models isn't capability — it's efficiency and reasoning depth, and that gap is closing faster than expected.
Planned Deliverables for Next Release
References