Arkel Deep Think experiment

An adaptive multi-agent research system inspired by Gemini 2.5's Deep Think capability, built with CrewAI. The system uses an intelligent planner to determine the optimal research structure automatically, explores multiple research paths in parallel or sequentially, and synthesizes the findings into comprehensive, actionable solutions.

Overview

The Deep Think system uses a 3-phase adaptive multi-agent architecture:

PHASE 1: Intelligent Planning

  • Planner analyzes problem complexity
  • Determines optimal number of research paths (2-6)
  • Defines approach and tools for each path
  • Outputs structured research plan (JSON)

PHASE 2: Dynamic Agent Creation

  • Creates specialized researchers based on plan
  • Assigns appropriate tools to each
  • Configures synthesizer and writer

PHASE 3: Execution & Synthesis

  • Researchers explore their assigned paths
  • Synthesizer performs meta-analysis
  • Writer produces publication-quality response
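
The three phases above can be sketched as a single control flow. The snippet below is a dependency-free illustration with the LLM agent calls stubbed out; all function names (`plan`, `research`, `deep_think_sketch`) are illustrative, not the repository's actual API.

```python
def plan(query: str) -> dict:
    """Phase 1 (stub): assess complexity and emit a structured research plan."""
    num_paths = 2 if len(query) < 40 else 4  # stand-in for the LLM's complexity assessment
    return {"num_paths": num_paths,
            "paths": [{"path_id": i + 1} for i in range(num_paths)]}


def research(path: dict, query: str) -> str:
    """Phases 2-3 (stub): one dynamically created researcher explores one path."""
    return f"findings for path {path['path_id']} on: {query}"


def deep_think_sketch(query: str) -> dict:
    research_plan = plan(query)                                      # Phase 1: planning
    reports = [research(p, query) for p in research_plan["paths"]]   # Phase 3: execution (sequential here)
    result = " | ".join(reports)                                     # synthesis + final writing (stubbed)
    return {"research_plan": research_plan, "result": result}
```

In the real system each stub is an agent call, and the execution step may fan out in parallel; the return shape mirrors the `result['research_plan']` / `result['result']` keys used in the Usage section.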

Architecture

graph TD
    Start([User Query]) --> Phase1["🧠 PHASE 1: PLANNING"]
  
    Phase1 --> Planner[Strategic Planner Agent]
    Planner -->|Analyzes Complexity| Decision{Problem Complexity?}
  
    Decision -->|Simple| Plan2["Research Plan 2-3 Paths"]
    Decision -->|Moderate| Plan3["Research Plan 3-4 Paths"]
    Decision -->|Complex| Plan4["Research Plan 4-5 Paths"]
    Decision -->|Very Complex| Plan5["Research Plan 5-6 Paths"]
  
    Plan2 --> Phase2
    Plan3 --> Phase2
    Plan4 --> Phase2
    Plan5 --> Phase2
  
    Phase2["🔧 PHASE 2: AGENT CREATION"] --> Create["Create N Researchers + Synthesizer + Writer"]
    Create -->|Assign Tools| Tools["Each researcher gets: - Web Search (if needed) - Code Execution (if needed)"]

    Tools --> Phase3["🔬 PHASE 3: EXECUTION"]
  
    Phase3 --> R1[Researcher #1<br/>Path-specific research]
    Phase3 --> R2[Researcher #2<br/>Path-specific research]
    Phase3 --> R3[Researcher #3<br/>Path-specific research]
    Phase3 --> RN[Researcher #N<br/>Path-specific research]
  
    R1 --> Synthesis[Synthesizer & Critic]
    R2 --> Synthesis
    R3 --> Synthesis
    RN --> Synthesis
  
    Synthesis -->|Meta-Analysis| Writer[Final Response Writer]
    Writer --> Result(["📄 Publication-Quality Response 1500-2500 words"])
  
    style Phase1 fill:#1e3a8a,stroke:#3b82f6,stroke-width:3px
    style Phase2 fill:#166534,stroke:#22c55e,stroke-width:3px
    style Phase3 fill:#7c2d12,stroke:#f97316,stroke-width:3px
    style Decision fill:#4c1d95,stroke:#a78bfa

How It Works

Phase 1: Intelligent Planning 🎯

The Strategic Planner agent analyzes your query and makes critical decisions:

1. Complexity Assessment

  • Evaluates: scope, interdisciplinarity, uncertainty, trade-offs, validation needs
  • Classifies: Simple | Moderate | Complex | Very Complex

2. Determines Number of Paths

  • Simple: 2-3 paths
  • Moderate: 3-4 paths
  • Complex: 4-5 paths
  • Very Complex: 5-6 paths

3. Designs Each Path

For each research path, the planner specifies:

  • approach: Descriptive name (e.g., "Theoretical Architecture Analysis")
  • key_questions: 2-4 specific research questions
  • needs_web_search: true/false - does this path need web research?
  • needs_code_execution: true/false - does this path need code validation?
  • rationale: Why this approach is essential

4. Outputs JSON Plan

{
    "num_paths": 4,
    "complexity_assessment": "Complex",
    "reasoning": "Multi-faceted problem requiring diverse approaches",
    "paths": [
        {
            "path_id": 1,
            "approach": "Theoretical Architecture Analysis",
            "key_questions": ["Q1", "Q2", "Q3"],
            "needs_web_search": true,
            "needs_code_execution": false,
            "rationale": "Establish theoretical foundations"
        },
        ...
    ]
}
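
For downstream code that consumes the planner's output, the JSON above can be mirrored with a typed schema. The sketch below is illustrative (the TypedDicts and `validate_plan` helper are not part of the repository); field names follow the JSON shown.

```python
from typing import List, TypedDict


class ResearchPath(TypedDict):
    path_id: int
    approach: str
    key_questions: List[str]
    needs_web_search: bool
    needs_code_execution: bool
    rationale: str


class ResearchPlan(TypedDict):
    num_paths: int
    complexity_assessment: str
    reasoning: str
    paths: List[ResearchPath]


def validate_plan(plan: dict) -> None:
    """Fail fast if the planner's JSON is internally inconsistent."""
    assert 2 <= plan["num_paths"] <= 6, "planner should emit 2-6 paths"
    assert len(plan["paths"]) == plan["num_paths"], "path count mismatch"
```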

Example for "build a recommendation system at scale":

  • Path 1: Theoretical Architecture Analysis (web search)
  • Path 2: Performance Benchmarking (web search + code execution)
  • Path 3: Technology Stack Comparison (web search)
  • Path 4: Cost & Operational Analysis (web search)

Phase 2: Dynamic Agent Creation 🔧

Based on the plan, the system creates:

  1. N Researcher Agents (2-6 based on complexity)

    • Each configured with path-specific instructions
    • Tools assigned per path requirements:
      • Web search if needs_web_search: true
      • Code execution if needs_code_execution: true
    • Backstory includes assigned questions and approach
  2. Synthesizer Agent (adapted to N paths)

  3. Final Writer Agent (adapted to N paths)
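
The per-path tool-assignment rule can be sketched in plain Python. Agents are represented as dicts here so the logic runs without CrewAI; the real system builds `crewai.Agent` instances instead, and the helper names are illustrative.

```python
def build_researcher(path: dict) -> dict:
    """Assign tools per the path's declared requirements."""
    tools = []
    if path.get("needs_web_search"):
        tools.append("gemini_online_search")
    if path.get("needs_code_execution"):
        tools.append("code_interpreter")
    return {"role": f"Researcher #{path['path_id']}",
            "questions": path.get("key_questions", []),
            "tools": tools}


def build_crew(plan: dict) -> list:
    """N researchers plus a synthesizer and a writer, as described above."""
    researchers = [build_researcher(p) for p in plan["paths"]]
    # The synthesizer and writer work on text only, so they get no research tools.
    return researchers + [{"role": "Synthesizer & Critic", "tools": []},
                          {"role": "Final Response Writer", "tools": []}]
```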

Phase 3: Exploratory Research 🔍

Each Researcher Agent (#1 through #N) conducts a deep investigation of its assigned path, following a rigorous 3-phase protocol:

Phase 1: Understanding & Scoping

  • Extract assigned path from strategic plan
  • Internalize research questions and success criteria
  • Develop a clear hypothesis

Phase 2: Deep Investigation

  • Web Search (Gemini Online): Find authoritative sources, recent research, best practices, case studies
  • Code Execution (Sequential mode only): Implement proofs-of-concept, run experiments, validate claims
  • Critical Analysis: Examine theoretical foundations, practical implications, scalability, trade-offs

Phase 3: Synthesis & Evaluation

  • Compile findings into coherent narrative
  • Acknowledge gaps and uncertainties
  • Quantify confidence scores

Each researcher produces an 800-1500 word report including:

  • Refined hypothesis and methodology
  • Evidence-based findings with citations
  • Solution architecture (if applicable)
  • Strengths, weaknesses, optimal use cases
  • Confidence scores (overall, evidence quality, feasibility)

Phase 3b: Critical Synthesis 📊

The Synthesizer & Critic agent receives all research reports and performs rigorous meta-analysis:

Part 1: Individual Path Evaluation

For each path, evaluate:

  • Evidence quality (sources, rigor) β†’ Score /10
  • Technical merit (soundness, scalability) β†’ Score /10
  • Practical viability (complexity, resources) β†’ Score /10
  • Risk analysis β†’ Low/Medium/High
  • Applicability scope
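
One way to combine the three /10 scores into a path ranking is a simple sum, sketched below. In the actual system the synthesizer weighs these dimensions qualitatively via the LLM, so this helper is purely illustrative.

```python
def rank_paths(scores: dict) -> list:
    """Rank paths by the sum of their evidence/technical/viability scores (each /10)."""
    def total(s):
        return s["evidence"] + s["technical"] + s["viability"]
    return sorted(scores, key=lambda name: total(scores[name]), reverse=True)
```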

Part 2: Comparative Analysis

  • Performance comparison across metrics
  • Cost-benefit analysis
  • Complementarity analysis (hybrid solutions)
  • Differential advantages

Part 3: Strategic Recommendation

  • Primary recommendation with confidence level
  • Justification (minimum 5 objective criteria)
  • Implementation priorities
  • Alternative scenarios
  • Validation strategy

Part 4: Critical Gaps & Uncertainties

  • Unanswered questions
  • Areas of disagreement
  • Needed additional investigation

Output: 1000-1800 word meta-analysis

Phase 3c: Final Writing ✍️

The Final Response Writer transforms all research into a publication-quality response (1500-2500 words) with 8 mandatory sections:

  1. 📋 Executive Summary: Recommended solution, key factors, confidence level
  2. 🔍 Research Methodology: Paths explored, research methods used
  3. ✅ Detailed Solution: Architecture, implementation blueprint, code examples
  4. 💡 Justification & Rationale: Why this solution, comparison table, trade-offs
  5. ⚠️ Critical Considerations: Prerequisites, limitations, risk mitigation, success metrics
  6. 🚀 Implementation Roadmap: Immediate/short-term/long-term steps
  7. 🎯 Alternative Scenarios: If constraints change...
  8. 📚 Key Takeaways: Main insights

Usage

Basic Usage

from deep_think import deep_think

# The planner automatically determines the optimal number of researchers
query = "What is the best approach to build a scalable ML inference system?"
result = deep_think(query, parallel=True)

print(f"Complexity: {result['research_plan']['complexity_assessment']}")
print(f"Paths explored: {result['num_researchers']}")
print(f"\nResult:\n{result['result']}")

Advanced Usage

from deep_think import DeepThinkCrew

# Create crew with custom LLM model
crew = DeepThinkCrew(
    llm_model="anthropic/claude-sonnet-4.5",
    parallel_research=True
)

# Execute deep thinking
result = crew.think("Your complex query here")

# Access planning metadata
plan = result['research_plan']
print(f"Complexity: {plan['complexity_assessment']}")
print(f"Reasoning: {plan['reasoning']}")
print(f"Paths: {plan['num_paths']}")

for i, path in enumerate(plan['paths'], 1):
    print(f"\nPath {i}: {path['approach']}")
    print(f"  Tools: web_search={path['needs_web_search']}, code={path['needs_code_execution']}")
    print(f"  Rationale: {path['rationale']}")

# Execution metadata
print(f"\nExecution: {result['execution_time_minutes']} minutes")
print(f"Agents: {result['agents_count']}, Tasks: {result['tasks_count']}")
print(f"Tokens: {result['token_usage']['total_tokens']:,}")

print(f"\nFinal response:\n{result['result']}")

Configuration Options

DeepThinkCrew Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| llm_model | str | "openrouter/anthropic/claude-sonnet-4.5" | LLM model used by all agents |
| parallel_research | bool | True | Run research tasks in parallel (True) or sequentially (False) |

Note: num_researchers is now determined automatically by the planner!

deep_think() Helper Function Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | Required | The complex question or problem to solve |
| model | str | "openrouter/anthropic/claude-sonnet-4.5" | LLM model to use |
| parallel | bool | True | Enable parallel execution |

Note: Number of researchers is now adaptive - the planner decides based on complexity!

Parallel vs Sequential Mode

Parallel Mode (parallel=True) - Default

  • ✅ Faster: All researchers work simultaneously
  • ✅ Higher throughput: Complete in ~1/N time
  • ⚠️ Rate limits: May hit API rate limits with many researchers
  • Best for: Quick iterations, read-only research, when rate limits aren't a concern

Sequential Mode (parallel=False)

  • ✅ Reliable: No rate limit issues
  • ✅ Consistent: Predictable execution flow
  • ⚠️ Slower: Researchers work one after another
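
The trade-off can be illustrated with stdlib concurrency. This is not the repository's implementation (CrewAI toggles the behavior via `async_execution` on research tasks); it is a minimal sketch of the fan-out pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def run_research(paths, worker, parallel=True):
    """Run `worker` over all paths, either all at once or one after another."""
    if parallel:
        # All researchers work simultaneously; results come back in path order.
        with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
            return list(pool.map(worker, paths))
    # Sequential mode: predictable flow, no bursty API traffic.
    return [worker(p) for p in paths]
```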

Tools Available to Researchers

1. Gemini Online Search (Always Available)

  • Purpose: Real-time web search via Gemini 2.5 Pro's online capability (Gemini 2.5 Pro called with "google_search": {} as a tool)
  • Advantages: Better results than traditional search APIs (SerpAPI, Serper)
  • Capabilities:
    • Find authoritative sources
    • Access recent research and developments
    • Gather industry best practices
    • Identify real-world implementations
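
As a hedged sketch, a search call routed through OpenRouter might be shaped as below. The model name and `:online` suffix come from the `SEARCH_MODEL` value in this README's environment setup; the payload-building helper itself is illustrative, and actually sending it would use the `openai` client against the configured base URL.

```python
def build_search_request(query: str, model: str = "google/gemini-2.5-pro:online") -> dict:
    """Build an OpenAI-style chat payload for a web-grounded search call."""
    return {
        "model": model,  # the `:online` suffix asks the provider to ground the call with web search
        "messages": [{"role": "user",
                      "content": f"Find authoritative, recent sources on: {query}"}],
    }
```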

2. Code Interpreter

  • Purpose: Execute Python code for validation and experimentation
  • Capabilities:
    • Implement proof-of-concept examples
    • Run computational experiments
    • Test algorithms and performance
    • Generate concrete data

Key Features

⚡ Adaptive Intelligence

  • Auto-scaling: 2 paths for simple problems, up to 6 for very complex ones
  • Smart tool allocation: Web search and code execution assigned per-path
  • Complexity assessment: Automatic evaluation of problem difficulty
  • Transparent planning: Full visibility into the research strategy

🎯 Deep Intelligence

  • Structured thinking: Each agent follows rigorous protocols (3-phase research, 4-part synthesis, 8-section final response)
  • Evidence-based: Every claim backed by research, citations, and data
  • Quantified confidence: Multiple scoring dimensions for transparency

🔄 Adaptive Exploration

  • Intelligent planning: Planner automatically determines optimal number of paths (2-6)
  • Dynamic tool assignment: Each researcher gets only the tools they need
  • Parallel or sequential: Choose execution mode based on your constraints
  • Complexity-aware: System scales research depth to problem complexity

📊 Comprehensive Analysis

  • Planning: Systematic problem decomposition
  • Research: 800-1500 words per path
  • Synthesis: 1000-1800 word meta-analysis
  • Final output: 1500-2500 word publication-quality response

πŸ› οΈ Production-Ready

  • Detailed implementation blueprints
  • Production-quality code examples (not toy code)
  • Success metrics and validation strategies
  • Risk mitigation plans

🎓 Intellectually Honest

  • Acknowledges gaps and uncertainties
  • Discusses limitations and trade-offs
  • Provides alternative scenarios
  • Presents counterarguments

🧩 Adaptive Planning Benefits

Example Planning Decisions

| Query Type | Complexity | Paths | Reasoning |
|---|---|---|---|
| "Best Python web framework?" | Simple | 2-3 | Straightforward comparison, known options |
| "Design microservices architecture" | Moderate | 3-4 | Multiple dimensions, trade-offs to consider |
| "Build real-time ML system at scale" | Complex | 4-5 | Interdisciplinary, many constraints |
| "Research novel AI safety approach" | Very Complex | 5-6 | Cutting-edge, requires exhaustive exploration |

Tool Allocation Intelligence

The planner decides which tools each path needs:

Simple query: "Compare Redis vs Memcached"
β†’ Path 1: Theoretical comparison (web search only)
β†’ Path 2: Practical benchmarks (web search only)
β†’ Both paths: No code execution needed (well-documented topic)

Complex query: "Design distributed caching with custom eviction"
β†’ Path 1: Architecture patterns (web search)
β†’ Path 2: Algorithm validation (web search + code execution)
β†’ Path 3: Performance modeling (web search + code execution)
β†’ Path 4: Production considerations (web search)

Output Quality Standards

Every Deep Think response includes:

  • ✅ 1500-2500 words of substantive content
  • ✅ Confidence scores for recommendations
  • ✅ Evidence citations from authoritative sources
  • ✅ Code examples with detailed comments
  • ✅ Comparison tables evaluating alternatives
  • ✅ Implementation roadmap with timelines
  • ✅ Success metrics for validation
  • ✅ Risk analysis with mitigation strategies
  • ✅ Alternative scenarios for different constraints
  • ✅ Research plan transparency (NEW: see exactly how the planner decided)

When to Use Deep Think

✅ Ideal Use Cases

  • Complex architectural decisions: "What's the best database architecture for..."
  • Multi-faceted problems: "How should we approach building..."
  • Research synthesis: "What are the latest approaches to..."
  • Technology evaluation: "Compare approaches for implementing..."
  • Strategic planning: "What strategy should we use for..."

⚠️ Not Ideal For

  • Simple factual questions (use regular LLM)
  • Quick lookups (use search directly)
  • Real-time conversational chat (too slow)
  • Questions with single obvious answers

Environment Setup

Prerequisites

pip install crewai crewai-tools langchain-openai python-dotenv openai fastapi uvicorn

Environment Variables

Create a .env file:

OPENAI_API_KEY=your_openrouter_api_key
OPENAI_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL=anthropic/claude-sonnet-4.5
SEARCH_MODEL=google/gemini-2.5-pro:online

API Server (Testing/Evaluation)

Deep Think includes an OpenAI-compatible API server for easy testing and integration.

Starting the Server

python deep_think.py --server

The server will start on http://localhost:8002 with the following endpoints:

  • POST /v1/chat/completions - Main chat completions endpoint (OpenAI-compatible)
  • GET /v1/models - List available models
  • GET /health - Health check
  • GET /docs - Interactive API documentation (Swagger UI)

Using the API

With OpenAI Python SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8002/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="deep-think",
    messages=[
        {"role": "user", "content": "What's the best approach to build a scalable ML pipeline?"}
    ]
)

print(response.choices[0].message.content)

# Access planning metadata. The server returns it as an extra field on
# `usage`; with the OpenAI v1 SDK (pydantic response models), extra fields
# are exposed via `model_extra` rather than dict-style `.get()`.
metadata = (response.usage.model_extra or {}).get('deep_think_metadata', {})
print(f"\nResearchers used: {metadata.get('num_researchers')}")
print(f"Complexity: {metadata.get('complexity_assessment')}")

With cURL

curl -X POST http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deep-think",
    "messages": [
      {"role": "user", "content": "Compare database technologies for high-traffic apps"}
    ]
  }'

Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "deep-think" | Model name (always "deep-think") |
| messages | list | Required | OpenAI-format messages |
| stream | bool | False | Not supported (must be False) |
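
For reference, a minimal OpenAI-compatible response envelope might look like the sketch below. The standard field names follow the OpenAI chat-completions format; the `deep_think_metadata` extension and the exact shape of this server's real response are assumptions, not the repository's code.

```python
import time


def make_completion_response(content: str, metadata: dict) -> dict:
    """Shape a chat-completions-style response with Deep Think's extra metadata."""
    return {
        "id": "chatcmpl-deep-think",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": "deep-think",
        "choices": [{"index": 0, "finish_reason": "stop",
                     "message": {"role": "assistant", "content": content}}],
        # Token counts are placeholders; `deep_think_metadata` is this
        # project's (assumed) extension to the standard usage object.
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0,
                  "deep_think_metadata": metadata},
    }
```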

Performance Characteristics

| Complexity | Typical Paths | Parallel Time | Best For |
|---|---|---|---|
| Simple | 2-3 | 6-8 min | Quick analysis |
| Moderate | 3-4 | 8-12 min | Balanced coverage |
| Complex | 4-5 | 12-18 min | Deep analysis |
| Very Complex | 5-6 | 18-26 min | Exhaustive research |

Times vary based on query complexity and LLM response times

Planning Phase: +30-90 seconds for complexity analysis and research plan generation

Troubleshooting

Rate Limit Errors

Solution: Use sequential mode (parallel=False)

Empty or Short Responses

Issue: Agents are not following the detailed instructions.

Solution: Use Claude Sonnet 4.5 or an equivalent high-capability model.

Technical Details

  • Framework: CrewAI with custom 3-phase task orchestration
  • Planning: JSON-based adaptive research plan generation
  • Search Tool: Gemini 2.5 Pro Online (Gemini 2.5 Pro with "google_search": {} as tool)
  • Code Execution: Python code interpreter
  • LLM Backend: OpenAI API compatible
  • Process: Sequential with async_execution on tasks (configurable)
  • Adaptive Logic: Complexity assessment β†’ dynamic agent creation β†’ tool allocation

Benchmarks

We evaluate Deep Think against challenging mathematics problems from the AIME 2025 benchmark using evalscope.

AIME 2025 Results (no tools)

Dataset: American Invitational Mathematics Examination 2025

Task: Solve challenging high school mathematics competition problems with step-by-step solutions

Problems: 30 total (15 from AIME2025-I, 15 from AIME2025-II)

| Model | Overall Accuracy | AIME2025-I | AIME2025-II |
|---|---|---|---|
| Claude Sonnet 4.5 (Deep Think) | 36.66% | 40.0% | 33.33% |
| Claude Sonnet 4.5 (baseline) | 30.0% | 33.33% | 26.67% |

  • 🎯 +6.66pp improvement over baseline Claude Sonnet 4.5

Replicating the Results

To replicate the results:

Make sure the API server is running first:

python deep_think.py --server

Install evalscope:

pip install evalscope

Then run the evaluation:

evalscope eval \
 --model deep-think \
 --api-url http://127.0.0.1:8002/v1 \
 --api-key EMPTY \
 --eval-type openai_api \
 --datasets aime25 \
 --timeout 900000

Built with ❤️ by the Arkel team
