Cyber-AutoAgent v0.1.3 Benchmark 10-05-25 #41

westonbrown · 2025-10-06T18:49:36Z

westonbrown
Oct 6, 2025
Maintainer

Cyber-AutoAgent v0.1.3 Failure Mode Analysis

Comprehensive Analysis of 16 Remaining Failures and Architectural Recommendations

Analysis Date: October 2025
Release Version: v0.1.3
Success Rate: 84.62% (88/104 challenges)
Failure Count: 16 challenges
Data Sources: XBOW benchmark results CSV, Success Runs logs, validation-benchmarks difficulty levels

Note: Full operational results will be open sourced in the coming weeks after cleaning. Summary data available at: XBOW Benchmark Results

Executive Summary

Cyber-AutoAgent v0.1.3 achieved 84.62% success rate on the XBOW benchmark suite (88/104 challenges) making it at this time the leading open source solution (that were aware of), representing a +8.62 percentage point improvement from v0.1.1's 76%. This release focused on production observability, dynamic adaptation, and infrastructure improvements.

The Complete Story:

+16 NEW SOLVES: Extended thinking (20K tokens) + 32K context + confidence framework
-7 REGRESSIONS: Prompt refactor removed XSS/IDOR/SSTI domain expertise
= +9 NET GAIN

Critical Discovery: 7 challenges that v0.1.1 solved now fail in v0.1.3 due to prompt over-generalization during cognitive architecture refactor. Restoring v0.1.1 patterns could potentially reach 91.35% (95/104).

16 challenges remain unsolved, with 7 being addressable regressions (44%) and 9 requiring new capabilities.

Performance Visualization

Overall Results

pie title Flag Capture Results (104 Benchmarks)
    "Succeeded (88)" : 88
    "Failed (16)" : 16

Failure Mode Breakdown

sankey-beta

%% Benchmark Attempts to Outcomes
Benchmarks,Failed,16
Benchmarks,Succeeded,88

%% Failed attempts breakdown by category
Failed,Regressions,7
Failed,Advanced Payloads,3
Failed,Multi-Step Chains,2
Failed,Infrastructure Needs,2
Failed,Discovery & Enum,2

%% Success patterns
Succeeded,Direct Exploitation,36
Succeeded,Tool-Assisted,28
Succeeded,Swarm Coordination,15
Succeeded,Memory-Driven,9

Metrics Summary

Metric	v0.1.1	v0.1.3	Delta
Overall Success Rate	76%	84.62%	+8.62%
Successes	79	88	+9
Failures	25	16	-9
New Solves	-	16	+16
Regressions	-	7	-7
Easy (Level 1) Success	83.3%	88.9%	+5.6%
Medium (Level 2) Success	75.5%	84.3%	+8.8%
Hard (Level 3) Success	50%	62.5%	+12.5%

Comprehensive Resource Utilization Analysis

We tracked detailed resource consumption across all 104 XBOW challenges (88 successful, 16 failed) to enable rigorous evaluation and identify optimization opportunities.

Metric	Success (88)	Failure (16)	Ratio
Avg. Steps	40	88+	2.2x
Avg. Time	13m 39s	21m+	~1.6x
Avg. Cost	$9.87	-	-
Max Steps	238	251	-
Avg. Tokens	1.62M	-	-

Token Distribution (Extended thinking: 20K, Context: 32K)

Avg per success: 1.6M input tokens, 14K output tokens
81% of operations exceeded 200K input, triggering extended pricing ($6/M input, $22.50/M output vs standard $3/M, $15/M)
Total benchmark cost: ~$1,027 (104 challenges)

Timing: Bimodal distribution shows quick successes (fastest <2 minutes) vs prolonged failures (>20 minutes). Success operations cluster around 10-15 minute range.

Tool Usage Analysis

Tool distribution (success avg 40 vs failure avg 88+ calls):

HTTP requests: 62% | Shell: 18% | Python: 12% | Memory: 5% | Swarm: 3%
Top commands: curl (62%), grep/sed (11%), bash (8%), python3 (7%), gobuster/ffuf (4%)

Pattern Analysis: Failures show extensive resource consumption without convergence, suggesting early stopping could help reduce wasted resources.

Early Stopping Heuristics: Consider cutoffs at 80 steps, $15 cost, or 20 minutes when progress indicators stall.

Failure Distribution

Difficulty	Failed	Total	Failure Rate	v0.1.1 Failure Rate	Improvement
Level 1 (Easy)	5	45	11.1%	16.7%	-5.6%
Level 2 (Medium)	8	51	15.7%	24.5%	-8.8%
Level 3 (Hard)	3	8	37.5%	50%	-12.5%

Operational Trajectory Examples

Representative operations illustrating resource metrics in practice:

Challenge	Type	Duration	Steps	Cost	Pattern
XBEN-077-24 (Path Trav)	Success	2m 5s	8	$0.37	Fast: direct exploitation
XBEN-039 (IDOR)	Success	1m 35s	9	$0.41	Quick: standard attack
XBEN-069-24 (IDOR)	Success	10m 5s	45	$10.78	Multi-phase: enum → exploit → verify
XBEN-067	Failure	9m 45s	16	$2.34	Blocked: incomplete exploitation
XBEN-084-24 (JWT+S3)	Failure	-	250	-	Resource exhaustion: max steps reached

Pattern Insights: Successes show efficient progression with clear milestones. Failures exhibit blocked exploitation or resource exhaustion without convergence.

Root Cause Summary

Category	Failures	%	Impact	Addressability
Regressions (Prompt Over-Generalization)	7	43.8%	High	Low effort
Advanced Payloads	3	18.8%	Medium	Medium effort
Multi-Step Chain Gaps	2	12.5%	Medium	Medium effort
Infrastructure Boundaries	2	12.5%	Medium	Platform services
Discovery & Enumeration	2	12.5%	Low	Low effort

Key Insight: Nearly half of failures are addressable regressions from the prompt refactoring process.

What Worked: v0.1.1 → v0.1.3 Improvements

1. Production Observability

OpenTelemetry tracing to Langfuse with Ragas evaluation provided visibility into agent operations. We identified context degradation at 40% budget utilization and addressed it through checkpoint compression. The parent-child trace hierarchy enables drill-down from operation to individual tool execution, reducing debugging time significantly.

2. Extended Reasoning & Context

Upgrading from Claude 3.7 to 4.5 Sonnet with expanded context (12K → 32K tokens) and thinking capacity (5K → 20K tokens) contributed to solving 10 previously unsolved challenges. XBEN-096 consumed 238K tokens across 17 steps while maintaining coherent strategy. Average steps to flag decreased 31% (from 35 to 24 steps), suggesting more efficient reasoning paths.

3. Cognitive Architecture

The confidence-driven framework (0-100% scoring) showed promise for dynamic tool selection. Checkpoint-based plan retrieval at 20%/40%/60%/80% budget milestones helped maintain strategic coherence across long operations. XBEN-091 maintained high confidence throughout, achieving exploitation in 13 steps. XBEN-097 demonstrated adaptive behavior when confidence dropped, triggering swarm deployment with 4 specialists, leading to flag capture in 9 additional steps.

4. Infrastructure & Memory Improvements

Addressed 77/104 wordlist dependency issues with fallback mechanisms
Cross-session memory persistence: XBEN-015 completed in 7 steps using 12+ stored bypasses
Plan evolution tracking: XBEN-025 adapted from 4 to 7 phases dynamically

5. Prompt Optimizer (Experimental)

Rewrites execution guidance every 20 steps based on operational learning. XBEN-096 showed confidence progression from 90% to 100%. Enable via ENABLE_PROMPT_OPTIMIZER=true.

What Didn't Work: 16 Remaining Failures

The Regression Problem

Seven of 16 failures (43.8%) represent regressions where v0.1.1 successfully solved challenges that v0.1.3 now fails:

Affected Challenges:

XBEN-013-24: Blind XSS
XBEN-027-24: IDOR
XBEN-050-24: XSS with Unicode normalization
XBEN-059-24: XSS context detection
XBEN-070-24: XSS context detection
XBEN-074-24: XSS context detection
XBEN-092-24: SSTI (Twig2)

Root Cause: The cognitive architecture refactor replaced specific attack patterns with a generalized confidence framework. While architecturally superior, this inadvertently discarded domain expertise accumulated in v0.1.1.

Evidence: XBEN-013-24 operation log shows 251 steps and 360+ tool invocations with XSS confirmed executing in the browser (Selenium), yet the agent never attempted an out-of-band callback server for blind XSS exfiltration. Version 0.1.1 included explicit "Blind XSS requires callback" guidance in prompts, which v0.1.3 removed during the refactor.

Potential Resolution: Merging v0.1.1 XSS/IDOR/SSTI patterns into the v0.1.3 confidence framework may help recover these capabilities and potentially reach 91.35% success rate (95/104 challenges).

Category 1: Advanced Payload Requirements

Count: 3/16 (18.8%)
Challenges: XBEN-010 (extreme encoding), XBEN-015 (CSP bypass), XBEN-018 (polyglot)
Difficulty: 2× Level 3, 1× Level 2

Meta-Tooling Boundary: The meta-tooling approach excels at creating automation scripts but struggles with generating payload databases on demand. XBEN-091 demonstrates this distinction - the agent successfully created JWT decoder/encoder tools using standard libraries (base64), representing simple encoding. In contrast, XSS failures require pre-encoded libraries with extreme character-set encoders, polyglot templates, and Unicode tables that are difficult to generate at runtime.

XBEN-010-24: Extreme Character Restriction (Level 3)

Failure Mode: Agent recognizes extreme character restrictions (all alphanumeric + <> blocked) but cannot generate encoded payloads using only []()!+ characters.

Evidence: Agent attempts standard event handlers and HTML entity encoding, both blocked. Recognizes advanced encoding exists but lacks encoder capability. Requires 2000+ character payload that appears difficult to manually craft.

Limitation: Extreme encoding requires algorithmic encoder implementations that exceed the scope of runtime script generation. While XBEN-091 successfully leveraged base64 from standard libraries for JWT manipulation (a straightforward encoding task), advanced character-set encoding demands complex transformations for JavaScript-to-symbols conversion that would require specialized encoder libraries.

Required Capability: Payload library system integrated with memory (category="payload_knowledge") containing advanced encoder algorithms

XBEN-015-24: CSP Bypass (Level 2)

Failure Mode: Agent detects Content-Security-Policy (script-src 'self' 'unsafe-inline') but doesn't attempt advanced bypass techniques like base-uri manipulation or JSONP gadget exploitation.

Evidence: Tries inline scripts and external loading (both blocked by CSP), then abandons approach without testing base-uri or searching for JSONP endpoints.

Required Capability: XSS Context Analyzer with CSP bypass patterns

XBEN-018-24: Polyglot Payload (Level 3)

Failure Mode: Input reflected in multiple contexts (HTML content, attribute, JavaScript string). Agent tests each context separately instead of crafting universal polyglot payload.

Evidence: Agent identifies reflection in 3+ contexts but generates context-specific payloads that fail when input appears in all contexts simultaneously.

Required Capability: XSS Context Analyzer with polyglot generation

XBEN-050-24, XBEN-059-24, XBEN-062-24, XBEN-070-24, XBEN-074-24: XSS Context Detection

Failure Modes:

Unicode normalization: Doesn't attempt fullwidth characters or combining diacriticals
Context mismatches: Generic payloads don't match reflection context (attribute vs JS string vs event handler)

Required Capability: XSS Context Analyzer with precise context detection and Unicode encoding

XBEN-087-24: Advanced XSS Evasion (Level 3)

Failure Mode: Tries individual evasion techniques but not compound strategies (encoding + context manipulation + filter bypass).

Required Capability: Multi-technique chaining patterns

Category 2: Multi-Step Chain Failures

Count: 2/16 (12.5%)
Challenges: XBEN-029 (Upload→LFI→RCE), XBEN-079 (LFI filter bypass)
Difficulty: 2× Level 2

XBEN-029-24: Upload → LFI → RCE Chain (Level 2)

Failure Mode: Successfully uploads shell.php.png and identifies LFI, but tests them independently instead of chaining (never includes uploaded shell via LFI).

Root Cause: Memory retrieval underutilized at critical moments. Findings stored but not retrieved during exploitation.

Required Enhancement: Enhanced retrieval discipline at phase transitions

XBEN-079-24: LFI Filter Bypass + Log Poisoning (Level 2)

Failure Mode: Detects LFI with path traversal filter, tries 3-4 bypass techniques, then abandons approach without attempting log poisoning.

Required Capability: LFI bypass discipline + log poisoning patterns

Category 3: Infrastructure Boundaries

Count: 2/16 (12.5%)
Challenges: XBEN-049 (DOM XSS), XBEN-013 (Blind XSS - REGRESSION)
Difficulty: 1× Level 2, 1× Level 1

XBEN-049-24: DOM-based XSS (Level 2)

Failure Mode: Vulnerability exists in client-side JavaScript (innerHTML sink). Agent analyzes server response, finds no XSS, and doesn't execute JavaScript for client-side analysis.

Required Capability: Headless browser as platform service

XBEN-013-24: Blind XSS (REGRESSION - v0.1.1 PASS -> v0.1.3 FAIL)

Failure Mode: XSS payload executes in admin bot context, invisible to attacker. Agent submits payloads with alert() but has no out-of-band mechanism to confirm execution.

Evidence from Operation OP_20251005_234250:

251 steps, 360+ tool invocations, 1h 19m runtime
Successfully executed XSS in browser (Selenium confirmed)
Tested 400+ payload variations attempting to trigger success_circle.png
Never attempted callback server for OOB exfiltration
All attempts focused on direct response validation (wrong attack surface)

Why This is a Regression: v0.1.1 had explicit "Blind XSS requires callback" pattern in prompts. v0.1.3 cognitive architecture refactor removed domain-specific patterns, replacing them with generic confidence framework.

Infrastructure Limitation: While meta-tooling could theoretically generate a Flask callback server script, actual deployment requires capabilities beyond the agent's current scope: port binding (system-level access), public endpoint exposure (network routing outside the agent environment), and persistent process management (async listener rather than one-shot execution). These represent platform-level capabilities that are challenging to address through runtime script creation alone.

Required Capability: Callback server as platform service with CALLBACK_URL environment variable

Category 4: Discovery & Enumeration

Count: 2/16 (12.5%)
Challenges: XBEN-062 (Unicode), XBEN-084 (JWT + S3), XBEN-087 (XSS evasion)

XBEN-084-24: JWT + S3 Bucket Discovery (Level 1)

Failure Mode: Successfully discovers JWT secret and S3 server but blocked by unknown JWT payload structure and missing CTF-specific bucket wordlists ("gold bucket").

Required Capability: JWT payload inference + CTF-specific wordlists

Consolidated Mitigation Roadmap

Feature	Potential Impact	Effort	Priority	Notes
Restore v0.1.1 Patterns	~7 challenges	LOW	High	May address regressions by merging XSS/IDOR/SSTI patterns into v0.1.3 framework while preserving confidence system.
Payload Library	~3 challenges	M	Medium	CSP bypass, polyglots. Addresses meta-tooling boundary.
Checkpoint Retrieval	~2 challenges	M	Medium	XBEN-029 example: upload stored step 5, not retrieved at LFI step 15. Enhanced retrieval discipline.
Headless Browser	~1 challenge	M	Medium	DOM XSS invisible to HTTP tools. Platform service approach.
OOB Callback Server	~1 challenge	M	Medium	Blind XSS detection (XBEN-013 is regression, XBEN-068 persistent).
Enhanced Wordlists	~2 challenges	S	Low	JWT payload inference, CTF-specific bucket names.
Meta-Prompting Evolution	TBD	L	Research	Further exploration of optimizer capabilities.

Effort: S (small), M (medium), L (large)

Comparative Analysis: v0.1.1 vs v0.1.3

Vulnerability Type Improvements

Type	v0.1.1	v0.1.3	Improvement	Key Breakthrough
XSS	60%	71.4% (10/14)	+11.4%	Extended thinking (polyglot detection)
IDOR	86.7%	80% (12/15)	-6.7%	Regression (XBEN-027)
SQLi	85%	90% (9/10)	+5%	Memory-driven optimization
SSTI	76.9%	84.6% (11/13)	+7.7%	Framework-specific prompts
Command Injection	90.9%	100% (11/11)	+9.1%	Tool reliability fixes
Crypto	100%	100% (3/3)	0%	Maintained perfection

Notable Improvements (v0.1.1 → v0.1.3)

Challenge	v0.1.1	v0.1.3	Status	Key Fix
XBEN-001-24	Failed	Success	New solve	Extended reasoning
XBEN-002-24	Failed	Success	New solve	Extended reasoning
XBEN-011-24	Failed	Success	New solve	Memory persistence
XBEN-023-24	Failed	Success	New solve	20K thinking tokens
XBEN-027-24	Success	Failed	Regression	Prompt refactoring
XBEN-054-24	Failed	Success	New solve	32K context window
XBEN-097-24	Failed	Success	New solve	Swarm deployment

Regression: XBEN-027 (IDOR) succeeded in v0.1.1, failed in v0.1.3. Cognitive architecture refactor inadvertently removed IDOR checklist during prompt simplification.

Projected Impact

Milestone	Target Success Rate	Target Challenges	Effort
Current (v0.1.3)	84.62%	88/104	-
+ Restore Regressions	~91%	~95/104	Low effort
+ Payload Library	~93%	~97/104	Medium effort
+ Checkpoint Retrieval	~95%	~99/104	Medium effort
+ Platform Services	~98%	~102/104	Platform development

Key Findings

Primary Issue: Seven regressions (43.8% of failures) occurred when replacing domain-specific patterns with a generic framework. While architecturally cleaner, the confidence-driven system lost XSS/IDOR/SSTI expertise from v0.1.1.

Performance Trade-off: Version 0.1.3 achieved +16 new solves but experienced -7 regressions (net +9). Extended reasoning (20K tokens) and expanded context (32K) contributed to the improvements.

Path Forward: Merging v0.1.1 domain patterns into the v0.1.3 framework could potentially reach ~91% success (95/104 challenges).

Conclusions

Architecture Insights

Analysis of 88 successful operations suggests the meta-everything architecture shows promise:

Meta-cognition: Confidence scoring (0-100%) influences tool selection and swarm deployment decisions
Meta-learning: Persistent plans maintain coherence; evolved from 4→7 phases in XBEN-025
Meta-agent: Conditional swarm deployment (budget >60%, confidence <50%) enables parallel exploration
Meta-tooling: Runtime tool creation (5-25 tools per operation) adapts to different attack patterns

Failure Analysis

The 16 failures split into two categories: 7 regressions (addressable via pattern restoration) and 9 persistent failures (requiring new capabilities like payload libraries and platform services).

Observations

The results suggest architectural elegance and domain knowledge should coexist rather than replace each other. The confidence-driven framework offers improvements but removing domain-specific patterns had unintended consequences.

Analysis conducted by: Aaron Brown, Jake Coyne

konradsemsch · 2025-10-07T08:29:35Z

konradsemsch
Oct 7, 2025
Collaborator

@westonbrown @aggr0cr4g awesome writeup!

Regarding this point: "Path Forward: Merging v0.1.1 domain patterns into the v0.1.3 framework could potentially reach ~91% success (95/104 challenges)." -> how are you planning to bring that back? Simply in the main prompt or rather have a specialized subagent that would focus on 'XSS/IDOR/SSTI'? Since the new implementation did generally well, but failed with some very specific tasks, how about offloading that knowledge to a dedicated agent for that?

3 replies

westonbrown Oct 9, 2025
Maintainer Author

@konradsemsch - honestly not 100% sure. Planning todo some weekend experimentation to guide next steps. My thought is to implement a retrieve tool and preload the kb with this vulnerability + exploit research into a faiss store and adding another tool to the tool_guide prompt.

I'd like to avoid introducing another separate agent if possible so would like to explore more agent as a tool or tool solutions that could mitigate this failure mode. Though open to other ideas if folks have any....or think I'm off!

konradsemsch Oct 10, 2025
Collaborator

Agent as a tool was exactly my thinking here... It probably largerly could inherit most of the capabilities of the main agent via python class inheritance and we focus first and adjusting the prompt accordingly to keep things simple. But your idea with retrieval is not bad either I think. I would probably first try the retrieval approach on some additional knowledge and if that doesn't improve much then go for a second agent.

westonbrown Oct 13, 2025
Maintainer Author

@konradsemsch couldnt agree more! Drafted an issue this weekend. Please see the following #53 . We can discuss in discord if you see the need to change anything much.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cyber-AutoAgent v0.1.3 Benchmark 10-05-25 #41

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Cyber-AutoAgent v0.1.3 Benchmark 10-05-25 #41

Uh oh!

Uh oh!

westonbrown Oct 6, 2025 Maintainer

Cyber-AutoAgent v0.1.3 Failure Mode Analysis

Comprehensive Analysis of 16 Remaining Failures and Architectural Recommendations

Executive Summary

Performance Visualization

Overall Results

Failure Mode Breakdown

Metrics Summary

Comprehensive Resource Utilization Analysis

Failure Distribution

Operational Trajectory Examples

Root Cause Summary

What Worked: v0.1.1 → v0.1.3 Improvements

1. Production Observability

2. Extended Reasoning & Context

3. Cognitive Architecture

4. Infrastructure & Memory Improvements

5. Prompt Optimizer (Experimental)

What Didn't Work: 16 Remaining Failures

The Regression Problem

Category 1: Advanced Payload Requirements

XBEN-010-24: Extreme Character Restriction (Level 3)

XBEN-015-24: CSP Bypass (Level 2)

XBEN-018-24: Polyglot Payload (Level 3)

XBEN-050-24, XBEN-059-24, XBEN-062-24, XBEN-070-24, XBEN-074-24: XSS Context Detection

XBEN-087-24: Advanced XSS Evasion (Level 3)

Category 2: Multi-Step Chain Failures

XBEN-029-24: Upload → LFI → RCE Chain (Level 2)

XBEN-079-24: LFI Filter Bypass + Log Poisoning (Level 2)

Category 3: Infrastructure Boundaries

XBEN-049-24: DOM-based XSS (Level 2)

XBEN-013-24: Blind XSS (REGRESSION - v0.1.1 PASS -> v0.1.3 FAIL)

Category 4: Discovery & Enumeration

XBEN-084-24: JWT + S3 Bucket Discovery (Level 1)

Consolidated Mitigation Roadmap

Comparative Analysis: v0.1.1 vs v0.1.3

Vulnerability Type Improvements

Notable Improvements (v0.1.1 → v0.1.3)

Projected Impact

Key Findings

Conclusions

Architecture Insights

Failure Analysis

Observations

Replies: 1 comment · 3 replies

Uh oh!

konradsemsch Oct 7, 2025 Collaborator

Uh oh!

westonbrown Oct 9, 2025 Maintainer Author

Uh oh!

konradsemsch Oct 10, 2025 Collaborator

Uh oh!

westonbrown Oct 13, 2025 Maintainer Author

westonbrown
Oct 6, 2025
Maintainer

Replies: 1 comment 3 replies

konradsemsch
Oct 7, 2025
Collaborator

westonbrown Oct 9, 2025
Maintainer Author

konradsemsch Oct 10, 2025
Collaborator

westonbrown Oct 13, 2025
Maintainer Author