An adaptive multi-agent research system inspired by Gemini 2.5's Deep Think capability, built with CrewAI. An intelligent planner automatically determines the optimal research structure, specialized researchers explore multiple paths in parallel or sequentially, and a synthesis stage turns the findings into comprehensive, actionable solutions.
The Deep Think system uses a 3-phase adaptive multi-agent architecture:
PHASE 1: Intelligent Planning
- Planner analyzes problem complexity
- Determines optimal number of research paths (2-6)
- Defines approach and tools for each path
- Outputs structured research plan (JSON)
PHASE 2: Dynamic Agent Creation
- Creates specialized researchers based on plan
- Assigns appropriate tools to each
- Configures synthesizer and writer
PHASE 3: Execution & Synthesis
- Researchers explore their assigned paths
- Synthesizer performs meta-analysis
- Writer produces publication-quality response
```mermaid
graph TD
    Start([User Query]) --> Phase1["PHASE 1: PLANNING"]
    Phase1 --> Planner[Strategic Planner Agent]
    Planner -->|Analyzes Complexity| Decision{Problem Complexity?}
    Decision -->|Simple| Plan2["Research Plan: 2-3 Paths"]
    Decision -->|Moderate| Plan3["Research Plan: 3-4 Paths"]
    Decision -->|Complex| Plan4["Research Plan: 4-5 Paths"]
    Decision -->|Very Complex| Plan5["Research Plan: 5-6 Paths"]
    Plan2 --> Phase2
    Plan3 --> Phase2
    Plan4 --> Phase2
    Plan5 --> Phase2
    Phase2["PHASE 2: AGENT CREATION"] --> Create["Create N Researchers + Synthesizer + Writer"]
    Create -->|Assign Tools| Tools["Each researcher gets:<br/>- Web Search (if needed)<br/>- Code Execution (if needed)"]
    Tools --> Phase3["PHASE 3: EXECUTION"]
    Phase3 --> R1["Researcher #1<br/>Path-specific research"]
    Phase3 --> R2["Researcher #2<br/>Path-specific research"]
    Phase3 --> R3["Researcher #3<br/>Path-specific research"]
    Phase3 --> RN["Researcher #N<br/>Path-specific research"]
    R1 --> Synthesis["Synthesizer & Critic"]
    R2 --> Synthesis
    R3 --> Synthesis
    RN --> Synthesis
    Synthesis -->|Meta-Analysis| Writer[Final Response Writer]
    Writer --> Result(["Publication-Quality Response:<br/>1500-2500 words"])

    style Phase1 fill:#1e3a8a,stroke:#3b82f6,stroke-width:3px
    style Phase2 fill:#166534,stroke:#22c55e,stroke-width:3px
    style Phase3 fill:#7c2d12,stroke:#f97316,stroke-width:3px
    style Decision fill:#4c1d95,stroke:#a78bfa
```
The Strategic Planner agent analyzes your query and makes critical decisions:
1. Complexity Assessment
- Evaluates: scope, interdisciplinarity, uncertainty, trade-offs, validation needs
- Classifies: Simple | Moderate | Complex | Very Complex
2. Determines Number of Paths
- Simple: 2-3 paths
- Moderate: 3-4 paths
- Complex: 4-5 paths
- Very Complex: 5-6 paths
3. Designs Each Path
For each research path, the planner specifies:
- approach: Descriptive name (e.g., "Theoretical Architecture Analysis")
- key_questions: 2-4 specific research questions
- needs_web_search: true/false - does this path need web research?
- needs_code_execution: true/false - does this path need code validation?
- rationale: Why this approach is essential
4. Outputs JSON Plan
```json
{
  "num_paths": 4,
  "complexity_assessment": "Complex",
  "reasoning": "Multi-faceted problem requiring diverse approaches",
  "paths": [
    {
      "path_id": 1,
      "approach": "Theoretical Architecture Analysis",
      "key_questions": ["Q1", "Q2", "Q3"],
      "needs_web_search": true,
      "needs_code_execution": false,
      "rationale": "Establish theoretical foundations"
    },
    ...
  ]
}
```

Example for "build a recommendation system at scale":
- Path 1: Theoretical Architecture Analysis (web search)
- Path 2: Performance Benchmarking (web search + code execution)
- Path 3: Technology Stack Comparison (web search)
- Path 4: Cost & Operational Analysis (web search)
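As a minimal sketch (not the repo's actual code), a plan like the JSON above could be checked for internal consistency before agents are created; the field names follow the schema shown in this README, while the function and constant names are illustrative:

```python
import json

# Required fields of each research path, per the plan schema above.
REQUIRED_PATH_KEYS = {
    "path_id", "approach", "key_questions",
    "needs_web_search", "needs_code_execution", "rationale",
}

def validate_plan(raw: str) -> dict:
    """Parse a planner JSON plan and check it is internally consistent."""
    plan = json.loads(raw)
    if plan["num_paths"] != len(plan["paths"]):
        raise ValueError("num_paths does not match the paths list")
    for path in plan["paths"]:
        missing = REQUIRED_PATH_KEYS - path.keys()
        if missing:
            raise ValueError(f"path {path.get('path_id')} is missing {missing}")
    return plan
```

A check like this guards the hand-off between Phase 1 and Phase 2: a malformed plan fails fast instead of producing misconfigured agents.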
Based on the plan, the system creates:
- N Researcher Agents (2-6, based on complexity)
  - Each configured with path-specific instructions
  - Tools assigned per path requirements: web search if `needs_web_search: true`, code execution if `needs_code_execution: true`
  - Backstory includes the assigned questions and approach
- Synthesizer Agent (adapted to N paths)
- Final Writer Agent (adapted to N paths)
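The per-path tool allocation can be sketched as a small helper; this is illustrative only — `WEB_SEARCH` and `CODE_EXEC` stand in for the real CrewAI tool instances, and the actual wiring lives in the repo's agent-creation code:

```python
# Placeholder names for the real tool objects (assumptions for illustration).
WEB_SEARCH = "gemini_online_search"
CODE_EXEC = "python_code_interpreter"

def tools_for_path(path: dict) -> list[str]:
    """Return the tool set a researcher gets for one research path."""
    tools = []
    if path.get("needs_web_search"):
        tools.append(WEB_SEARCH)
    if path.get("needs_code_execution"):
        tools.append(CODE_EXEC)
    return tools
```

The point of this design is that each researcher gets only the tools its path needs, which keeps prompts focused and avoids unnecessary tool calls.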
Each Researcher Agent (#1 through #N) conducts deep investigation of their assigned path following a rigorous 3-phase protocol:
- Extract assigned path from strategic plan
- Internalize research questions and success criteria
- Develop a clear hypothesis
- Web Search (Gemini Online): Find authoritative sources, recent research, best practices, case studies
- Code Execution (Sequential mode only): Implement proofs-of-concept, run experiments, validate claims
- Critical Analysis: Examine theoretical foundations, practical implications, scalability, trade-offs
- Compile findings into coherent narrative
- Acknowledge gaps and uncertainties
- Quantify confidence scores
Each researcher produces an 800-1500 word report including:
- Refined hypothesis and methodology
- Evidence-based findings with citations
- Solution architecture (if applicable)
- Strengths, weaknesses, optimal use cases
- Confidence scores (overall, evidence quality, feasibility)
The Synthesizer & Critic agent receives all research reports and performs rigorous meta-analysis:
For each path, evaluate:
- Evidence quality (sources, rigor) → Score /10
- Technical merit (soundness, scalability) → Score /10
- Practical viability (complexity, resources) → Score /10
- Risk analysis → Low/Medium/High
- Applicability scope
- Performance comparison across metrics
- Cost-benefit analysis
- Complementarity analysis (hybrid solutions)
- Differential advantages
- Primary recommendation with confidence level
- Justification (minimum 5 objective criteria)
- Implementation priorities
- Alternative scenarios
- Validation strategy
- Unanswered questions
- Areas of disagreement
- Needed additional investigation
Output: 1000-1800 word meta-analysis
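The per-path rubric above (evidence quality, technical merit, practical viability, each out of 10) could be aggregated numerically as below. This is a hypothetical helper: the weights are purely illustrative, and the actual synthesizer agent reasons in prose rather than computing a score:

```python
def path_score(evidence: float, technical: float, viability: float,
               weights: tuple[float, float, float] = (0.4, 0.35, 0.25)) -> float:
    """Weighted aggregate of the three per-path rubric scores (0-10 scale).

    Weights are illustrative assumptions, not values from the repo.
    """
    scores = (evidence, technical, viability)
    return round(sum(w * s for w, s in zip(weights, scores)), 2)
```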
The Final Response Writer transforms all research into a publication-quality response (1500-2500 words) with 8 mandatory sections:
- Executive Summary: Recommended solution, key factors, confidence level
- Research Methodology: Paths explored, research methods used
- Detailed Solution: Architecture, implementation blueprint, code examples
- Justification & Rationale: Why this solution, comparison table, trade-offs
- Critical Considerations: Prerequisites, limitations, risk mitigation, success metrics
- Implementation Roadmap: Immediate/short-term/long-term steps
- Alternative Scenarios: If constraints change...
- Key Takeaways: Main insights
```python
from deep_think import deep_think

# The planner automatically determines the optimal number of researchers
query = "What is the best approach to build a scalable ML inference system?"
result = deep_think(query, parallel=True)

print(f"Complexity: {result['research_plan']['complexity_assessment']}")
print(f"Paths explored: {result['num_researchers']}")
print(f"\nResult:\n{result['result']}")
```

```python
from deep_think import DeepThinkCrew

# Create crew with custom LLM model
crew = DeepThinkCrew(
    llm_model="anthropic/claude-sonnet-4.5",
    parallel_research=True
)

# Execute deep thinking
result = crew.think("Your complex query here")

# Access planning metadata
plan = result['research_plan']
print(f"Complexity: {plan['complexity_assessment']}")
print(f"Reasoning: {plan['reasoning']}")
print(f"Paths: {plan['num_paths']}")

for i, path in enumerate(plan['paths'], 1):
    print(f"\nPath {i}: {path['approach']}")
    print(f"  Tools: web_search={path['needs_web_search']}, code={path['needs_code_execution']}")
    print(f"  Rationale: {path['rationale']}")

# Execution metadata
print(f"\nExecution: {result['execution_time_minutes']} minutes")
print(f"Agents: {result['agents_count']}, Tasks: {result['tasks_count']}")
print(f"Tokens: {result['token_usage']['total_tokens']:,}")
print(f"\nFinal response:\n{result['result']}")
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `llm_model` | `str` | `"openrouter/anthropic/claude-sonnet-4.5"` | LLM model to use for all agents |
| `parallel_research` | `bool` | `True` | Whether to run research tasks in parallel or sequentially |

Note: `num_researchers` is now determined automatically by the planner!
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | Required | The complex question or problem to solve |
| `model` | `str` | `"openrouter/anthropic/claude-sonnet-4.5"` | LLM model to use |
| `parallel` | `bool` | `True` | Enable parallel execution |

Note: The number of researchers is now adaptive - the planner decides based on complexity!
Parallel mode (`parallel=True`):
- ✅ Faster: All researchers work simultaneously
- ✅ Higher throughput: Completes in roughly 1/N of the sequential time
- ⚠️ Rate limits: May hit API rate limits with many researchers
- Best for: Quick iterations, read-only research, when rate limits aren't a concern

Sequential mode (`parallel=False`):
- ✅ Reliable: No rate limit issues
- ✅ Consistent: Predictable execution flow
- ⚠️ Slower: Researchers work one after another
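The "roughly 1/N" claim can be made concrete with a back-of-envelope timing model; the numbers here are assumptions for illustration, not measurements from the system:

```python
def estimated_minutes(n_paths: int, t_per_path: float,
                      overhead: float, parallel: bool) -> float:
    """Rough wall-clock estimate for a Deep Think run.

    Assumes each of the N research tasks takes about t_per_path minutes
    and that planning plus synthesis add a fixed overhead.
    """
    research = t_per_path if parallel else n_paths * t_per_path
    return research + overhead
```

For example, with 4 paths at ~3 minutes each and ~4 minutes of planning/synthesis overhead, parallel mode takes about 7 minutes versus about 16 minutes sequentially.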
- Purpose: Real-time web search using Gemini 2.5 Pro's online capability (Gemini 2.5 Pro with `"google_search": {}` as a tool)
- Advantages: Better results than traditional search APIs (SerpAPI, Serper)
- Capabilities:
- Find authoritative sources
- Access recent research and developments
- Gather industry best practices
- Identify real-world implementations
- Purpose: Execute Python code for validation and experimentation
- Capabilities:
- Implement proof-of-concept examples
- Run computational experiments
- Test algorithms and performance
- Generate concrete data
- Auto-scaling: 2 paths for simple problems, up to 6 for very complex ones
- Smart tool allocation: Web search and code execution assigned per-path
- Complexity assessment: Automatic evaluation of problem difficulty
- Transparent planning: Full visibility into the research strategy
- Structured thinking: Each agent follows rigorous protocols (3-phase research, 4-part synthesis, 8-section final response)
- Evidence-based: Every claim backed by research, citations, and data
- Quantified confidence: Multiple scoring dimensions for transparency
- Intelligent planning: Planner automatically determines optimal number of paths (2-6)
- Dynamic tool assignment: Each researcher gets only the tools they need
- Parallel or sequential: Choose execution mode based on your constraints
- Complexity-aware: System scales research depth to problem complexity
- Planning: Systematic problem decomposition
- Research: 800-1500 words per path
- Synthesis: 1000-1800 word meta-analysis
- Final output: 1500-2500 word publication-quality response
- Detailed implementation blueprints
- Production-quality code examples (not toy code)
- Success metrics and validation strategies
- Risk mitigation plans
- Acknowledges gaps and uncertainties
- Discusses limitations and trade-offs
- Provides alternative scenarios
- Presents counterarguments
| Query Type | Complexity | Paths | Reasoning |
|---|---|---|---|
| "Best Python web framework?" | Simple | 2-3 | Straightforward comparison, known options |
| "Design microservices architecture" | Moderate | 3-4 | Multiple dimensions, trade-offs to consider |
| "Build real-time ML system at scale" | Complex | 4-5 | Interdisciplinary, many constraints |
| "Research novel AI safety approach" | Very Complex | 5-6 | Cutting-edge, requires exhaustive exploration |
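The complexity-to-paths mapping in the table above can be expressed as a small lookup helper. This is illustrative only: the real planner reasons with an LLM rather than a static table:

```python
# Path-count ranges per complexity class, per the table above.
COMPLEXITY_TO_PATHS = {
    "Simple": (2, 3),
    "Moderate": (3, 4),
    "Complex": (4, 5),
    "Very Complex": (5, 6),
}

def path_range(complexity: str) -> tuple[int, int]:
    """Return the (min, max) number of research paths for a complexity class."""
    return COMPLEXITY_TO_PATHS[complexity]
```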
The planner decides which tools each path needs:
Simple query: "Compare Redis vs Memcached"
- → Path 1: Theoretical comparison (web search only)
- → Path 2: Practical benchmarks (web search only)
- → Both paths: No code execution needed (well-documented topic)

Complex query: "Design distributed caching with custom eviction"
- → Path 1: Architecture patterns (web search)
- → Path 2: Algorithm validation (web search + code execution)
- → Path 3: Performance modeling (web search + code execution)
- → Path 4: Production considerations (web search)
Every Deep Think response includes:
- ✅ 1500-2500 words of substantive content
- ✅ Confidence scores for recommendations
- ✅ Evidence citations from authoritative sources
- ✅ Code examples with detailed comments
- ✅ Comparison tables evaluating alternatives
- ✅ Implementation roadmap with timelines
- ✅ Success metrics for validation
- ✅ Risk analysis with mitigation strategies
- ✅ Alternative scenarios for different constraints
- ✅ Research plan transparency (NEW: see exactly how the planner decided)
- Complex architectural decisions: "What's the best database architecture for..."
- Multi-faceted problems: "How should we approach building..."
- Research synthesis: "What are the latest approaches to..."
- Technology evaluation: "Compare approaches for implementing..."
- Strategic planning: "What strategy should we use for..."
- Simple factual questions (use regular LLM)
- Quick lookups (use search directly)
- Real-time conversational chat (too slow)
- Questions with single obvious answers
```shell
pip install crewai crewai-tools langchain-openai python-dotenv openai fastapi uvicorn
```

Create a `.env` file:

```
OPENAI_API_KEY=your_openrouter_api_key
OPENAI_BASE_URL=https://openrouter.ai/api/v1
LLM_MODEL=anthropic/claude-sonnet-4.5
SEARCH_MODEL=google/gemini-2.5-pro:online
```

Deep Think includes an OpenAI-compatible API server for easy testing and integration.
```shell
python deep_think.py --server
```

The server will start on http://localhost:8002 with the following endpoints:

- POST `/v1/chat/completions` - Main chat completions endpoint (OpenAI-compatible)
- GET `/v1/models` - List available models
- GET `/health` - Health check
- GET `/docs` - Interactive API documentation (Swagger UI)
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8002/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="deep-think",
    messages=[
        {"role": "user", "content": "What's the best approach to build a scalable ML pipeline?"}
    ]
)

print(response.choices[0].message.content)

# Access planning metadata
metadata = response.usage.get('deep_think_metadata', {})
print(f"\nResearchers used: {metadata.get('num_researchers')}")
print(f"Complexity: {metadata.get('complexity_assessment')}")
```

```shell
curl -X POST http://localhost:8002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deep-think",
    "messages": [
      {"role": "user", "content": "Compare database technologies for high-traffic apps"}
    ]
  }'
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `"deep-think"` | Model name (always `deep-think`) |
| `messages` | `list` | Required | OpenAI-format messages |
| `stream` | `bool` | `False` | Not supported (must be `False`) |
| Complexity | Typical Paths | Parallel Time | Best For |
|---|---|---|---|
| Simple | 2-3 | 6-8 min | Quick analysis |
| Moderate | 3-4 | 8-12 min | Balanced coverage |
| Complex | 4-5 | 12-18 min | Deep analysis |
| Very Complex | 5-6 | 18-26 min | Exhaustive research |
Times vary based on query complexity and LLM response times
Planning Phase: +30-90 seconds for complexity analysis and research plan generation
Issue: Hitting API rate limits in parallel mode
Solution: Use sequential mode (`parallel=False`)

Issue: Agents not following detailed instructions
Solution: Ensure you are using Claude Sonnet 4.5 or an equivalent high-capability model
- Framework: CrewAI with custom 3-phase task orchestration
- Planning: JSON-based adaptive research plan generation
- Search Tool: Gemini 2.5 Pro Online (Gemini 2.5 Pro with `"google_search": {}` as a tool)
- Code Execution: Python code interpreter
- LLM Backend: OpenAI API compatible
- Process: Sequential with async_execution on tasks (configurable)
- Adaptive Logic: Complexity assessment → dynamic agent creation → tool allocation
We evaluate Deep Think against challenging mathematics problems from the AIME 2025 benchmark using evalscope.
- Dataset: American Invitational Mathematics Examination (AIME) 2025
- Task: Solve challenging high school mathematics competition problems with step-by-step solutions
- Problems: 30 total (15 from AIME2025-I, 15 from AIME2025-II)
| Model | Overall Accuracy | AIME2025-I | AIME2025-II |
|---|---|---|---|
| Claude Sonnet 4.5 (Deep Think) | 36.66% | 40.0% | 33.33% |
| Claude Sonnet 4.5 (baseline) | 30.0% | 33.33% | 26.67% |
- +6.66 percentage point improvement over baseline Claude Sonnet 4.5
To replicate the results, make sure the API server is running first:

```shell
python deep_think.py --server
```

Install evalscope:

```shell
pip install evalscope
```

Then run the following command:

```shell
evalscope eval \
  --model deep-think \
  --api-url http://127.0.0.1:8002/v1 \
  --api-key EMPTY \
  --eval-type openai_api \
  --datasets aime25 \
  --timeout 900000
```

Built with ❤️ by the Arkel team