Arkel Deep Think experiment

An adaptive multi-agent research system inspired by Gemini 2.5's Deep Think capability, built with CrewAI. The system uses an intelligent planner to determine the optimal research structure automatically, explores multiple research paths in parallel or sequentially, and synthesizes the findings into comprehensive, actionable solutions.

Overview

The Deep Think system uses a 3-phase adaptive multi-agent architecture:

PHASE 1: Intelligent Planning

  • Planner analyzes problem complexity
  • Determines optimal number of research paths (2-6)
  • Defines approach and tools for each path
  • Outputs structured research plan (JSON)

PHASE 2: Dynamic Agent Creation

  • Creates specialized researchers based on plan
  • Assigns appropriate tools to each
  • Configures synthesizer and writer

PHASE 3: Execution & Synthesis

  • Researchers explore their assigned paths
  • Synthesizer performs meta-analysis
  • Writer produces publication-quality response
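
The three phases above can be sketched as a single control flow. The snippet below is a dependency-free illustration with the LLM agent calls stubbed out; all function names (`plan`, `research`, `deep_think_sketch`) are illustrative, not the repository's actual API.

```python
def plan(query: str) -> dict:
    """Phase 1 (stub): assess complexity and emit a structured research plan."""
    num_paths = 2 if len(query) < 40 else 4  # stand-in for the LLM's complexity assessment
    return {"num_paths": num_paths,
            "paths": [{"path_id": i + 1} for i in range(num_paths)]}


def research(path: dict, query: str) -> str:
    """Phases 2-3 (stub): one dynamically created researcher explores one path."""
    return f"findings for path {path['path_id']} on: {query}"


def deep_think_sketch(query: str) -> dict:
    research_plan = plan(query)                                      # Phase 1: planning
    reports = [research(p, query) for p in research_plan["paths"]]   # Phase 3: execution (sequential here)
    result = " | ".join(reports)                                     # synthesis + final writing (stubbed)
    return {"research_plan": research_plan, "result": result}
```

In the real system each stub is an agent call, and the execution step may fan out in parallel; the return shape mirrors the `result['research_plan']` / `result['result']` keys used in the Usage section.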

Architecture

graph TD
    Start([User Query]) --> Phase1["🧠 PHASE 1: PLANNING"]
  
    Phase1 --> Planner[Strategic Planner Agent]
    Planner -->|Analyzes Complexity| Decision{Problem Complexity?}
  
    Decision -->|Simple| Plan2["Research Plan 2-3 Paths"]
    Decision -->|Moderate| Plan3["Research Plan 3-4 Paths"]
    Decision -->|Complex| Plan4["Research Plan 4-5 Paths"]
    Decision -->|Very Complex| Plan5["Research Plan 5-6 Paths"]
  
    Plan2 --> Phase2
    Plan3 --> Phase2
    Plan4 --> Phase2
    Plan5 --> Phase2
  
    Phase2["🔧 PHASE 2: AGENT CREATION"] --> Create["Create N Researchers + Synthesizer + Writer"]
    Create -->|Assign Tools| Tools["Each researcher gets: - Web Search (if needed) - Code Execution (if needed)"]

    Tools --> Phase3["🔬 PHASE 3: EXECUTION"]
  
    Phase3 --> R1[Researcher #1<br/>Path-specific research]
    Phase3 --> R2[Researcher #2<br/>Path-specific research]
    Phase3 --> R3[Researcher #3<br/>Path-specific research]
    Phase3 --> RN[Researcher #N<br/>Path-specific research]
  
    R1 --> Synthesis[Synthesizer & Critic]
    R2 --> Synthesis
    R3 --> Synthesis
    RN --> Synthesis
  
    Synthesis -->|Meta-Analysis| Writer[Final Response Writer]
    Writer --> Result(["📄 Publication-Quality Response 1500-2500 words"])
  
    style Phase1 fill:#1e3a8a,stroke:#3b82f6,stroke-width:3px
    style Phase2 fill:#166534,stroke:#22c55e,stroke-width:3px
    style Phase3 fill:#7c2d12,stroke:#f97316,stroke-width:3px
    style Decision fill:#4c1d95,stroke:#a78bfa

How It Works

Phase 1: Intelligent Planning 🎯

The Strategic Planner agent analyzes your query and makes critical decisions:

1. Complexity Assessment

  • Evaluates: scope, interdisciplinarity, uncertainty, trade-offs, validation needs
  • Classifies: Simple | Moderate | Complex | Very Complex

2. Determines Number of Paths

  • Simple: 2-3 paths
  • Moderate: 3-4 paths
  • Complex: 4-5 paths
  • Very Complex: 5-6 paths

3. Designs Each Path

For each research path, the planner specifies:

  • approach: Descriptive name (e.g., "Theoretical Architecture Analysis")
  • key_questions: 2-4 specific research questions
  • needs_web_search: true/false - does this path need web research?
  • needs_code_execution: true/false - does this path need code validation?
  • rationale: Why this approach is essential

4. Outputs JSON Plan

{
    "num_paths": 4,
    "complexity_assessment": "Complex",
    "reasoning": "Multi-faceted problem requiring diverse approaches",
    "paths": [
        {
            "path_id": 1,
            "approach": "Theoretical Architecture Analysis",
            "key_questions": ["Q1", "Q2", "Q3"],
            "needs_web_search": true,
            "needs_code_execution": false,
            "rationale": "Establish theoretical foundations"
        },
        ...
    ]
}
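
For downstream code that consumes the planner's output, the JSON above can be mirrored with a typed schema. The sketch below is illustrative (the TypedDicts and `validate_plan` helper are not part of the repository); field names follow the JSON shown.

```python
from typing import List, TypedDict


class ResearchPath(TypedDict):
    path_id: int
    approach: str
    key_questions: List[str]
    needs_web_search: bool
    needs_code_execution: bool
    rationale: str


class ResearchPlan(TypedDict):
    num_paths: int
    complexity_assessment: str
    reasoning: str
    paths: List[ResearchPath]


def validate_plan(plan: dict) -> None:
    """Fail fast if the planner's JSON is internally inconsistent."""
    assert 2 <= plan["num_paths"] <= 6, "planner should emit 2-6 paths"
    assert len(plan["paths"]) == plan["num_paths"], "path count mismatch"
```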

Example for "build a recommendation system at scale":

  • Path 1: Theoretical Architecture Analysis (web search)
  • Path 2: Performance Benchmarking (web search + code execution)
  • Path 3: Technology Stack Comparison (web search)
  • Path 4: Cost & Operational Analysis (web search)

Phase 2: Dynamic Agent Creation 🔧

Based on the plan, the system creates:

  1. N Researcher Agents (2-6 based on complexity)

    • Each configured with path-specific instructions
    • Tools assigned per path requirements:
      • Web search if needs_web_search: true
      • Code execution if needs_code_execution: true
    • Backstory includes assigned questions and approach
  2. Synthesizer Agent (adapted to N paths)

  3. Final Writer Agent (adapted to N paths)
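
The per-path tool-assignment rule can be sketched in plain Python. Agents are represented as dicts here so the logic runs without CrewAI; the real system builds `crewai.Agent` instances instead, and the helper names are illustrative.

```python
def build_researcher(path: dict) -> dict:
    """Assign tools per the path's declared requirements."""
    tools = []
    if path.get("needs_web_search"):
        tools.append("gemini_online_search")
    if path.get("needs_code_execution"):
        tools.append("code_interpreter")
    return {"role": f"Researcher #{path['path_id']}",
            "questions": path.get("key_questions", []),
            "tools": tools}


def build_crew(plan: dict) -> list:
    """N researchers plus a synthesizer and a writer, as described above."""
    researchers = [build_researcher(p) for p in plan["paths"]]
    # The synthesizer and writer work on text only, so they get no research tools.
    return researchers + [{"role": "Synthesizer & Critic", "tools": []},
                          {"role": "Final Response Writer", "tools": []}]
```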

Phase 3: Exploratory Research 🔍

Each Researcher Agent (#1 through #N) conducts a deep investigation of its assigned path, following a rigorous 3-phase protocol:

Phase 1: Understanding & Scoping

  • Extract assigned path from strategic plan
  • Internalize research questions and success criteria
  • Develop a clear hypothesis

Phase 2: Deep Investigation

  • Web Search (Gemini Online): Find authoritative sources, recent research, best practices, case studies
  • Code Execution (Sequential mode only): Implement proofs-of-concept, run experiments, validate claims
  • Critical Analysis: Examine theoretical foundations, practical implications, scalability, trade-offs

Phase 3: Synthesis & Evaluation

  • Compile findings into coherent narrative
  • Acknowledge gaps and uncertainties
  • Quantify confidence scores

Each researcher produces an 800-1500 word report including:

  • Refined hypothesis and methodology
  • Evidence-based findings with citations
  • Solution architecture (if applicable)
  • Strengths, weaknesses, optimal use cases
  • Confidence scores (overall, evidence quality, feasibility)

Phase 3b: Critical Synthesis 📊

The Synthesizer & Critic agent receives all research reports and performs rigorous meta-analysis:

Part 1: Individual Path Evaluation

For each path, evaluate:

  • Evidence quality (sources, rigor) β†’ Score /10
  • Technical merit (soundness, scalability) β†’ Score /10
  • Practical viability (complexity, resources) β†’ Score /10
  • Risk analysis β†’ Low/Medium/High
  • Applicability scope
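
One way to combine the three /10 scores into a path ranking is a simple sum, sketched below. In the actual system the synthesizer weighs these dimensions qualitatively via the LLM, so this helper is purely illustrative.

```python
def rank_paths(scores: dict) -> list:
    """Rank paths by the sum of their evidence/technical/viability scores (each /10)."""
    def total(s):
        return s["evidence"] + s["technical"] + s["viability"]
    return sorted(scores, key=lambda name: total(scores[name]), reverse=True)
```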

Part 2: Comparative Analysis

  • Performance comparison across metrics
  • Cost-benefit analysis
  • Complementarity analysis (hybrid solutions)
  • Differential advantages

Part 3: Strategic Recommendation

  • Primary recommendation with confidence level
  • Justification (minimum 5 objective criteria)
  • Implementation priorities
  • Alternative scenarios
  • Validation strategy

Part 4: Critical Gaps & Uncertainties

  • Unanswered questions
  • Areas of disagreement
  • Needed additional investigation

Output: 1000-1800 word meta-analysis

Phase 3c: Final Writing ✍️

The Final Response Writer transforms all research into a publication-quality response (1500-2500 words) with 8 mandatory sections:

  1. 📋 Executive Summary: Recommended solution, key factors, confidence level
  2. 🔍 Research Methodology: Paths explored, research methods used
  3. ✅ Detailed Solution: Architecture, implementation blueprint, code examples
  4. 💡 Justification & Rationale: Why this solution, comparison table, trade-offs
  5. ⚠️ Critical Considerations: Prerequisites, limitations, risk mitigation, success metrics
  6. 🚀 Implementation Roadmap: Immediate/short-term/long-term steps
  7. 🎯 Alternative Scenarios: If constraints change...
  8. 📚 Key Takeaways: Main insights

Usage

Basic Usage

from deep_think import deep_think

# The planner automatically determines the optimal number of researchers
query = "What is the best approach to build a scalable ML inference system?"
result = deep_think(query, parallel=True)

print(f"Complexity: {result['research_plan']['complexity_assessment']}")
print(f"Paths explored: {result['num_researchers']}")
print(f"\nResult:\n{result['result']}")

Advanced Usage

from deep_think import DeepThinkCrew

# Create crew with custom LLM model
crew = DeepThinkCrew(
    llm_model="anthropic/claude-sonnet-4.5",
    parallel_research=True
)

# Execute deep thinking
result = crew.think("Your complex query here")

# Access planning metadata
plan = result['research_plan']
print(f"Complexity: {plan['complexity_assessment']}")
print(f"Reasoning: {plan['reasoning']}")
print(f"Paths: {plan['num_paths']}")

for i, path in enumerate(plan['paths'], 1):
    print(f"\nPath {i}: {path['approach']}")
    print(f"  Tools: web_search={path['needs_web_search']}, code={path['needs_code_execution']}")
    print(f"  Rationale: {path['rationale']}")

# Execution metadata
print(f"\nExecution: {result['execution_time_minutes']} minutes")
print(f"Agents: {result['agents_count']}, Tasks: {result['tasks_count']}")
print(f"Tokens: {result['token_usage']['total_tokens']:,}")

print(f"\nFinal response:\n{result['result']}")

Configuration Options

DeepThinkCrew Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| llm_model | str | "openrouter/anthropic/claude-sonnet-4.5" | LLM model used by all agents |
| parallel_research | bool | True | Run research tasks in parallel (True) or sequentially (False) |

Note: num_researchers is now determined automatically by the planner!

deep_think() Helper Function Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | Required | The complex question or problem to solve |
| model | str | "openrouter/anthropic/claude-sonnet-4.5" | LLM model to use |
| parallel | bool | True | Enable parallel execution |

Note: Number of researchers is now adaptive - the planner decides based on complexity!

Parallel vs Sequential Mode

Parallel Mode (parallel=True) - Default

  • ✅ Faster: All researchers work simultaneously
  • ✅ Higher throughput: Complete in ~1/N time
  • ⚠️ Rate limits: May hit API rate limits with many researchers
  • Best for: Quick iterations, read-only research, when rate limits aren't a concern

Sequential Mode (parallel=False)

  • ✅ Reliable: No rate limit issues
  • ✅ Consistent: Predictable execution flow
  • ⚠️ Slower: Researchers work one after another
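
The trade-off can be illustrated with stdlib concurrency. This is not the repository's implementation (CrewAI toggles the behavior via `async_execution` on research tasks); it is a minimal sketch of the fan-out pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def run_research(paths, worker, parallel=True):
    """Run `worker` over all paths, either all at once or one after another."""
    if parallel:
        # All researchers work simultaneously; results come back in path order.
        with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
            return list(pool.map(worker, paths))
    # Sequential mode: predictable flow, no bursty API traffic.
    return [worker(p) for p in paths]
```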

Tools Available to Researchers

1. Gemini Online Search (Always Available)

  • Purpose: Real-time web search via Gemini 2.5 Pro's online capability (Gemini 2.5 Pro called with "google_search": {} as a tool)
  • Advantages: Better results than traditional search APIs (SerpAPI, Serper)
  • Capabilities:
    • Find authoritative sources
    • Access recent research and developments
    • Gather industry best practices
    • Identify real-world implementations
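
As a hedged sketch, a search call routed through OpenRouter might be shaped as below. The model name and `:online` suffix come from the `SEARCH_MODEL` value in this README's environment setup; the payload-building helper itself is illustrative, and actually sending it would use the `openai` client against the configured base URL.

```python
def build_search_request(query: str, model: str = "google/gemini-2.5-pro:online") -> dict:
    """Build an OpenAI-style chat payload for a web-grounded search call."""
    return {
        "model": model,  # the `:online` suffix asks the provider to ground the call with web search
        "messages": [{"role": "user",
                      "content": f"Find authoritative, recent sources on: {query}"}],
    }
```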

2. Code Interpreter

  • Purpose: Execute Python code for validation and experimentation
  • Capabilities:
    • Implement proof-of-concept examples
    • Run computational experiments
    • Test algorithms and performance
    • Generate concrete data

Key Features

⚡ Adaptive Intelligence

  • Auto-scaling: 2 paths for simple problems, up to 6 for very complex ones
  • Smart tool allocation: Web search and code execution assigned per-path
  • Complexity assessment: Automatic evaluation of problem difficulty
  • Transparent planning: Full visibility into the research strategy

🎯 Deep Intelligence

  • Structured thinking: Each agent follows rigorous protocols (3-phase research, 4-part synthesis, 8-section final response)
  • Evidence-based: Every claim backed by research, citations, and data
  • Quantified confidence: Multiple scoring dimensions for transparency

🔄 Adaptive Exploration

  • Intelligent planning: Planner automatically determines optimal number of paths (2-6)
  • Dynamic tool assignment: Each researcher gets only the tools they need
  • Parallel or sequential: Choose execution mode based on your constraints
  • Complexity-aware: System scales research depth to problem complexity

📊 Comprehensive Analysis

  • Planning: Systematic problem decomposition
  • Research: 800-1500 words per path
  • Synthesis: 1000-1800 word meta-analysis
  • Final output: 1500-2500 word publication-quality response

πŸ› οΈ Production-Ready

  • Detailed implementation blueprints
  • Production-quality code examples (not toy code)
  • Success metrics and validation strategies
  • Risk mitigation plans

🎓 Intellectually Honest

  • Acknowledges gaps and uncertainties
  • Discusses limitations and trade-offs
  • Provides alternative scenarios
  • Presents counterarguments

🧩 Adaptive Planning Benefits

Example Planning Decisions

| Query Type | Complexity | Paths | Reasoning |
|---|---|---|---|
| "Best Python web framework?" | Simple | 2-3 | Straightforward comparison, known options |
| "Design microservices architecture" | Moderate | 3-4 | Multiple dimensions, trade-offs to consider |
| "Build real-time ML system at scale" | Complex | 4-5 | Interdisciplinary, many constraints |
| "Research novel AI safety approach" | Very Complex | 5-6 | Cutting-edge, requires exhaustive exploration |

Tool Allocation Intelligence

The planner decides which tools each path needs:

Simple query: "Compare Redis vs Memcached"
β†’ Path 1: Theoretical comparison (web search only)
β†’ Path 2: Practical benchmarks (web search only)
β†’ Both paths: No code execution needed (well-documented topic)

Complex query: "Design distributed caching with custom eviction"
β†’ Path 1: Architecture patterns (web search)
β†’ Path 2: Algorithm validation (web search + code execution)
β†’ Path 3: Performance modeling (web search + code execution)
β†’ Path 4: Production considerations (web search)

Output Quality Standards

Every Deep Think response includes:

  • ✅ 1500-2500 words of substantive content
  • ✅ Confidence scores for recommendations
  • ✅ Evidence citations from authoritative sources
  • ✅ Code examples with detailed comments
  • ✅ Comparison tables evaluating alternatives
  • ✅ Implementation roadmap with timelines
  • ✅ Success metrics for validation
  • ✅ Risk analysis with mitigation strategies
  • ✅ Alternative scenarios for different constraints
  • ✅ Research plan transparency (NEW: see exactly how the planner decided)

When to Use Deep Think

✅ Ideal Use Cases

  • Complex architectural decisions: "What's the best database architecture for..."
  • Multi-faceted problems: "How should we approach building..."
  • Research synthesis: "What are the latest approaches to..."
  • Technology evaluation: "Compare approaches for implementing..."
  • Strategic planning: "What strategy should we use for..."

⚠️ Not Ideal For

  • Simple factual questions (use regular LLM)
  • Quick lookups (use search directly)
  • Real-time conversational chat (too slow)
  • Questions with single obvious answers

Environment Setup

Prerequisites

pip install crewai crewai-tools langchain-openai python-dotenv openai fastapi uvicorn

Environment Variables

Create a .env file:

OPENAI_API_KEY=your_openrouter_api_key
OPENAI_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL=anthropic/claude-sonnet-4.5
SEARCH_MODEL=google/gemini-2.5-pro:online

API Server (Testing/Evaluation)

Deep Think includes an OpenAI-compatible API server for easy testing and integration.

Starting the Server

python deep_think.py --server

The server will start on http://localhost:8002 with the following endpoints:

  • POST /v1/chat/completions - Main chat completions endpoint (OpenAI-compatible)
  • GET /v1/models - List available models
  • GET /health - Health check
  • GET /docs - Interactive API documentation (Swagger UI)

Using the API

With OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8002/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="deep-think",
    messages=[
        {"role": "user", "content": "What's the best approach to build a scalable ML pipeline?"}
    ]
)

print(response.choices[0].message.content)

# Access planning metadata. The server returns it as an extra field on
# `usage`; with the OpenAI v1 SDK (pydantic response models), extra fields
# are exposed via `model_extra` rather than dict-style `.get()`.
metadata = (response.usage.model_extra or {}).get('deep_think_metadata', {})
print(f"\nResearchers used: {metadata.get('num_researchers')}")
print(f"Complexity: {metadata.get('complexity_assessment')}")

With cURL

curl -X POST http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deep-think",
    "messages": [
      {"role": "user", "content": "Compare database technologies for high-traffic apps"}
    ]
  }'

Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "deep-think" | Model name (always "deep-think") |
| messages | list | Required | OpenAI-format messages |
| stream | bool | False | Not supported (must be False) |
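
For reference, a minimal OpenAI-compatible response envelope might look like the sketch below. The standard field names follow the OpenAI chat-completions format; the `deep_think_metadata` extension and the exact shape of this server's real response are assumptions, not the repository's code.

```python
import time


def make_completion_response(content: str, metadata: dict) -> dict:
    """Shape a chat-completions-style response with Deep Think's extra metadata."""
    return {
        "id": "chatcmpl-deep-think",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "deep-think",
        "choices": [{"index": 0, "finish_reason": "stop",
                     "message": {"role": "assistant", "content": content}}],
        # Token counts are placeholders; `deep_think_metadata` is this
        # project's (assumed) extension to the standard usage object.
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0,
                  "deep_think_metadata": metadata},
    }
```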

Performance Characteristics

| Complexity | Typical Paths | Parallel Time | Best For |
|---|---|---|---|
| Simple | 2-3 | 6-8 min | Quick analysis |
| Moderate | 3-4 | 8-12 min | Balanced coverage |
| Complex | 4-5 | 12-18 min | Deep analysis |
| Very Complex | 5-6 | 18-26 min | Exhaustive research |

Times vary based on query complexity and LLM response times

Planning Phase: +30-90 seconds for complexity analysis and research plan generation

Troubleshooting

Rate Limit Errors

Solution: Use sequential mode (parallel=False)

Empty or Short Responses

Issue: Agents are not following the detailed instructions.

Solution: Use Claude Sonnet 4.5 or an equivalent high-capability model.

Technical Details

  • Framework: CrewAI with custom 3-phase task orchestration
  • Planning: JSON-based adaptive research plan generation
  • Search Tool: Gemini 2.5 Pro Online (Gemini 2.5 Pro with "google_search": {} as tool)
  • Code Execution: Python code interpreter
  • LLM Backend: OpenAI API compatible
  • Process: Sequential with async_execution on tasks (configurable)
  • Adaptive Logic: Complexity assessment β†’ dynamic agent creation β†’ tool allocation

Benchmarks

We evaluate Deep Think against challenging mathematics problems from the AIME 2025 benchmark using evalscope.

AIME 2025 Results (no tools)

Dataset: American Invitational Mathematics Examination 2025

Task: Solve challenging high school mathematics competition problems with step-by-step solutions

Problems: 30 total (15 from AIME2025-I, 15 from AIME2025-II)

| Model | Overall Accuracy | AIME2025-I | AIME2025-II |
|---|---|---|---|
| Claude Sonnet 4.5 (Deep Think) | 36.66% | 40.0% | 33.33% |
| Claude Sonnet 4.5 (baseline) | 30.0% | 33.33% | 26.67% |

  • 🎯 +6.66pp improvement over baseline Claude Sonnet 4.5

Replicating the Results

To replicate the results:

Make sure the API server is running first:

python deep_think.py --server

Install evalscope:

pip install evalscope

Then run the evaluation:

evalscope eval \
 --model deep-think \
 --api-url http://127.0.0.1:8002/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets aime25 \
 --timeout 900000

Built with ❤️ by the Arkel team
