
Crawler Agent

Python 3.11+ | OpenAI | License: MIT

An autonomous, AI-native web crawler agent that analyzes websites and generates production-ready crawling plans with validation datasets. It uses a multi-agent orchestration architecture to discover site structure, extract optimal CSS selectors, and prepare comprehensive test suites.

Why Crawler Agent?

| Feature | Crawler Agent | Manual Planning | Generic Scrapers |
| --- | --- | --- | --- |
| Autonomous site discovery | Yes | No | No |
| AI-powered selector extraction | Yes | No | No |
| Auto-generated test datasets | Yes | No | No |
| Intelligent pagination handling | Yes | Manual | Partial |
| Multi-model orchestration | Yes | N/A | No |
| Enterprise observability | Yes | No | No |

Key Benefits

  • Zero-Config AI Discovery — Leverages state-of-the-art LLMs to autonomously analyze page structure, identify content patterns, and extract optimal CSS selectors without manual configuration.
  • Agentic Architecture — Purpose-built specialized agents for discovery, selector extraction, accessibility validation, and test data preparation — each with isolated memory and explicit data contracts.
  • Contract-First Design — JSON Schema-validated input/output contracts ensure type-safe, reliable inter-agent communication and seamless integration (see the sketch after this list).
  • Multi-Model Orchestration — Dynamically assign different LLM models to different agents and tools for cost optimization, latency tuning, or capability matching.
  • Enterprise-Grade Observability — OpenTelemetry-native distributed tracing with structured logging, Elasticsearch-ready schemas, and full trace correlation.
  • Cloud-Native Storage — Pluggable repository pattern with in-memory for rapid development and SQLAlchemy for production persistence (MySQL, PostgreSQL, SQLite).
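
As a concrete illustration of the contract-first idea, the snippet below validates a hypothetical agent result against a JSON Schema using the jsonschema package. The field names are assumptions chosen for illustration, not the project's actual contracts.

from jsonschema import validate  # pip install jsonschema

# Hypothetical agent-result contract; field names are illustrative only.
AGENT_RESULT_SCHEMA = {
    "type": "object",
    "required": ["success", "data"],
    "properties": {
        "success": {"type": "boolean"},
        "data": {"type": "object"},
        "error": {"type": ["string", "null"]},
    },
}

result = {"success": True, "data": {"links": ["https://example.com/post-1"]}}
validate(instance=result, schema=AGENT_RESULT_SCHEMA)  # raises ValidationError on a contract breach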

Current Limitations

This project is in active development. Contributions welcome!

  • Listing-Items Focus — The agent is currently optimized for article-based sites with pagination
  • Requires Chrome DevTools — Needs Chrome running with remote debugging enabled
  • Single-site focus — Designed for analyzing one site at a time
  • No proxy support — Does not handle IP rotation or geo-targeting
  • No authentication — Cannot handle login-protected content
  • No JSON API discovery — Only discovers HTML-based content, not REST/GraphQL APIs

Quick Start

Prerequisites

  • Python 3.11+
  • Chrome browser with remote debugging enabled
  • OpenAI API key

Installation

# Clone and setup
git clone <repo-url> && cd agent2
python3 -m venv .venv
source .venv/bin/activate

# Install with locked versions (recommended)
pip install -r requirements.lock

# Or install with flexible versions
pip install -e ".[dev]"

# Configure
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

Running Chrome with DevTools

# macOS
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222

# Linux
google-chrome --remote-debugging-port=9222

# With dedicated profile (recommended)
google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug
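
To confirm DevTools is reachable before running the agent, you can query Chrome's standard /json/version endpoint. A minimal Python check, assuming the default port 9222:

import json
import urllib.request

# Query Chrome's /json/version DevTools endpoint.
with urllib.request.urlopen("http://localhost:9222/json/version", timeout=5) as resp:
    info = json.load(resp)

print(info["Browser"])               # e.g. "Chrome/131.0.0.0"
print(info["webSocketDebuggerUrl"])  # usable as CDP_URL / --devtools-url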

Usage

# Activate venv
source .venv/bin/activate

# Analyze a website and generate crawl plan
python main.py https://example.com/blog

# Enable debug logging
python main.py https://example.com/blog -l DEBUG

# Use multi-model mode (per-agent model assignments)
python main.py https://example.com/blog --multi-model

# List available models
python main.py --list-models

# Use different env file
python main.py https://example.com/blog --env-file .env-gpt5.1

# Override DevTools URL
python main.py https://example.com/blog --devtools-url ws://localhost:9222

Output

The agent generates output in PLANS_OUTPUT_DIR/<site_name>/:

plans_output/
└── example_com/
    ├── plan.md           # Crawl plan with selectors and strategy
    ├── test.md           # Test documentation
    └── data/
        └── test_set.jsonl  # Test dataset for validation
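
test_set.jsonl holds one JSON record per line. A quick way to inspect it is a few lines of Python; the exact record fields depend on the analyzed site and the agent's output schema, so none are assumed here.

import json
from pathlib import Path

# Load the generated JSONL test set; each non-empty line is one JSON record.
path = Path("plans_output/example_com/data/test_set.jsonl")
records = [json.loads(line) for line in path.open() if line.strip()]
print(f"{len(records)} test records")
print(records[0])  # fields depend on the agent's output schema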

Examples

See examples/ for sample outputs including:

examples/
└── example_one/
    ├── plan.md              # Generated crawl plan
    └── extras/              # Reference docs for implementation
        ├── headfull-chrome.md
        ├── crawler-blueprint.md
        ├── logging-standards.md
        └── python-tips.md

Using with AI Coding Assistants

The extras/ folder contains reference documentation designed to be passed as context to AI coding assistants (Claude Code, Cursor, GitHub Copilot) when implementing a production crawler based on the generated plan:

# Pass plan + extras to Claude Code for implementation
claude "Implement a crawler based on examples/example_one/plan.md
        using patterns from examples/example_one/extras/"

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        CLI (main.py)                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Config    │  │Observability│  │     Container       │  │
│  │  (AppConfig)│  │   (OTel)    │  │  (DI + Services)    │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                      Agent Layer                             │
│  ┌─────────────────────────────────────────────────────────┐│
│  │                    MainAgent                            ││
│  │              (Workflow Orchestrator)                    ││
│  └─────────────────────────────────────────────────────────┘│
│         │              │              │              │       │
│  ┌──────┴──────┐┌──────┴──────┐┌──────┴──────┐┌──────┴────┐│
│  │ Discovery   ││  Selector   ││Accessibility││ DataPrep  ││
│  │   Agent     ││   Agent     ││   Agent     ││  Agent    ││
│  │(Navigation) ││(CSS Finder) ││(HTTP Check) ││(Test Data)││
│  └─────────────┘└─────────────┘└─────────────┘└───────────┘│
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                       Tools Layer                            │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│  │  Browser   │ │  Memory    │ │  Selector  │ │   Plan   │ │
│  │   Tools    │ │   Tools    │ │   Tools    │ │  Tools   │ │
│  └────────────┘ └────────────┘ └────────────┘ └──────────┘ │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌──────────┐ │
│  │ Extraction │ │   File     │ │Orchestration│ │  Agent   │ │
│  │   Tools    │ │   Tools    │ │   Tools    │ │  Tools   │ │
│  └────────────┘ └────────────┘ └────────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────┘
                              │
┌─────────────────────────────────────────────────────────────┐
│                    Infrastructure                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
│  │LLMClient │  │ Browser  │  │ Memory   │  │  Session   │  │
│  │(OpenAI)  │  │ Session  │  │ Service  │  │  Service   │  │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘  │
│                      │                            │          │
│              ┌───────┴───────┐         ┌─────────┴────────┐│
│              │Chrome DevTools│         │    Repository    ││
│              │     (CDP)     │         │(InMem/SQLAlchemy)││
│              └───────────────┘         └──────────────────┘│
└─────────────────────────────────────────────────────────────┘

Component Descriptions

| Component | Purpose |
| --- | --- |
| MainAgent | Orchestrates the workflow: discovery, selector finding, accessibility check, data prep |
| DiscoveryAgent | Navigates pages, extracts links, identifies pagination patterns |
| SelectorAgent | Discovers and validates CSS selectors for listings and articles |
| AccessibilityAgent | Tests whether the site works without JavaScript via plain HTTP |
| DataPrepAgent | Samples pages and creates test datasets with extracted content |
| Container | Dependency injection for services and repositories |
| MemoryService | Isolated key-value storage per agent |
| SessionService | Tracks crawler runs with status and timing |
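
To make the MemoryService row above concrete, here is a minimal sketch of per-agent isolated key-value storage. It is illustrative only, not the project's actual MemoryService interface.

# Minimal sketch of per-agent isolated memory (not the real MemoryService API).
class IsolatedMemory:
    def __init__(self) -> None:
        self._stores: dict[str, dict[str, object]] = {}

    def set(self, agent: str, key: str, value: object) -> None:
        self._stores.setdefault(agent, {})[key] = value

    def get(self, agent: str, key: str, default: object | None = None) -> object | None:
        return self._stores.get(agent, {}).get(key, default)

memory = IsolatedMemory()
memory.set("discovery_agent", "pagination_type", "numbered")
assert memory.get("selector_agent", "pagination_type") is None  # isolation holds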

Configuration

Copy .env.example to .env and configure:

Required

| Variable | Description |
| --- | --- |
| OPENAI_API_KEY | Your OpenAI API key |

Optional - Model Configuration

| Variable | Default | Description |
| --- | --- | --- |
| OPENAI_MODEL | gpt-5.1 | Model for legacy mode |
| OPENAI_TEMPERATURE | 0.0 | Temperature setting |
| DEFAULT_MODEL | gpt-5.1 | Default model for multi-model mode |
| MAIN_AGENT_MODEL | - | Model for main orchestrator |
| SELECTOR_AGENT_MODEL | - | Model for selector discovery |

Optional - Chrome DevTools

| Variable | Default | Description |
| --- | --- | --- |
| CDP_URL | - | Full WebSocket URL (preferred) |
| CDP_HOST | localhost | DevTools host |
| CDP_PORT | 9222 | DevTools port |
| CDP_TIMEOUT | 30 | Connection timeout (seconds) |

Optional - Storage

| Variable | Default | Description |
| --- | --- | --- |
| STORAGE_BACKEND | memory | Backend type: memory or sqlalchemy |
| DATABASE_URL | - | SQLAlchemy connection string |
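
DATABASE_URL is a standard SQLAlchemy connection string. The examples below use placeholder credentials; the MySQL URL matches the Docker setup later in this README.

from sqlalchemy import create_engine

# Placeholder connection strings for the supported backends.
engine = create_engine("mysql+pymysql://crawler:crawler_password@localhost:3306/crawler")
# engine = create_engine("postgresql+psycopg2://user:pass@localhost:5432/crawler")
# engine = create_engine("sqlite:///crawler.db")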

Optional - Output

| Variable | Default | Description |
| --- | --- | --- |
| PLANS_OUTPUT_DIR | ./plans_output | Output directory for plans |
| PLANS_TEMPLATE_DIR | - | Template files to copy to output |
| AGENT_VERSION | 1.0.0 | Version identifier for tracking |

Optional - Observability

| Variable | Default | Description |
| --- | --- | --- |
| LOG_LEVEL | INFO | Logging level |
| SERVICE_NAME | crawler-agent | Service name for tracing |
| OTEL_ENDPOINT | localhost:4317 | OpenTelemetry collector endpoint |
| OTEL_INSECURE | true | Use insecure gRPC connection |
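
For reference, OTEL_ENDPOINT and OTEL_INSECURE map naturally onto the standard OpenTelemetry Python SDK, roughly as sketched below; the project's own wiring may differ in detail.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans over gRPC to the collector named by OTEL_ENDPOINT.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)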

Workflow

sequenceDiagram
    participant User
    participant Main as MainAgent
    participant Discovery as DiscoveryAgent
    participant Selector as SelectorAgent
    participant Access as AccessibilityAgent
    participant DataPrep as DataPrepAgent
    participant Files as Output Files

    User->>Main: create_crawl_plan(url)

    Main->>Discovery: Extract article links & pagination
    Discovery-->>Main: AgentResult(links, pagination_type, max_pages)

    Main->>Selector: Find CSS selectors
    Selector-->>Main: AgentResult(listing_selectors, article_selectors)

    Main->>Access: Check HTTP accessibility
    Access-->>Main: AgentResult(http_accessible, requires_js)

    Main->>DataPrep: Create test dataset
    DataPrep-->>Main: AgentResult(listing_count, article_count)

    Main->>Files: Generate plan.md, test.md, test_set.jsonl
    Main-->>User: AgentResult(success, summary)

Docker Setup (Optional)

For persistent storage with MySQL:

# Start MySQL container
docker-compose up -d mysql

# Run migrations
alembic upgrade head

# Set environment variables
export STORAGE_BACKEND=sqlalchemy
export DATABASE_URL=mysql+pymysql://crawler:crawler_password@localhost:3306/crawler

# Run the crawler
python main.py https://example.com/blog

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run linter
ruff check src/

# Run tests
pytest

# Run tests with coverage
pytest --cov=src

Roadmap

  • Proxy support for IP rotation
  • Authentication handling (login flows)
  • Parallel agent execution
  • WebSocket-based real-time updates
  • Additional LLM provider support (Anthropic, Google)

Related Projects

| Project | Description |
| --- | --- |
| Headfull-Chrome | Cloud-native browser automation API with a stealth-first design. Runs Chrome on a real virtual display (Xvfb) for anti-bot resistance. Recommended for production crawlers implementing plans generated by this agent. |

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT
