SPYDER

System for Probing and Yielding DNS-based Entity Relations -- a distributed network reconnaissance tool for mapping inter-domain relationships through DNS resolution, TLS certificate chain analysis, and HTTP link extraction.

Architecture

graph TD
    A[Seed Domains] --> B[SPYDER Probe]

    B --> C[DNS Resolver]
    B --> D[TLS Inspector]
    B --> E[HTTP Crawler]

    C --> F[Discovery Sink]
    D --> F
    E --> F

    F -->|continuous mode| B

    G[Redis / LRU Cache] <--> B
    H[robots.txt Cache] <--> E

    C --> I[Batch Emitter]
    D --> I
    E --> I

    I --> J[Ingest API / stdout]
    I --> K[Spool Directory]

    B --> L[Prometheus Metrics]
    B --> M[OpenTelemetry Traces]
Features

  • Multi-protocol discovery -- DNS (A/AAAA, CNAME, MX, NS), TLS certificate chains, HTTP link extraction
  • Recursive crawling -- in -continuous mode, discovered domains feed back into the work queue
  • Configurable concurrency -- worker pool with per-host token bucket rate limiting
  • Deduplication -- in-memory LRU or Redis-backed for distributed deployments
  • Policy compliance -- RFC-compliant robots.txt parsing, configurable TLD exclusions
  • Observability -- Prometheus metrics, OpenTelemetry traces, structured Zap logging
  • Fault tolerance -- circuit breaker pattern, exponential backoff, disk-based spool for failed deliveries
  • Secure transport -- full mTLS support with CA bundle validation for ingest endpoints
  • Control API -- REST API with auth, hot config reload, dynamic worker scaling, and query layer
  • Cloud-native -- distroless container images, Kubernetes health checks, Redis work queues
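
The per-host rate limiting mentioned above can be sketched as a token bucket keyed by hostname. This is an illustrative pattern only, not SPYDER's actual implementation; the type and function names here are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hostLimiter hands out request tokens per host. Each bucket refills at
// `rate` tokens per second up to `burst`.
type hostLimiter struct {
	mu      sync.Mutex
	rate    float64
	burst   float64
	buckets map[string]*bucket
}

type bucket struct {
	tokens float64
	last   time.Time
}

func newHostLimiter(rate, burst float64) *hostLimiter {
	return &hostLimiter{rate: rate, burst: burst, buckets: make(map[string]*bucket)}
}

// Allow reports whether a request to host may proceed right now.
func (l *hostLimiter) Allow(host string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	b, ok := l.buckets[host]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[host] = b
	}
	// Refill proportionally to elapsed time, capped at burst.
	b.tokens += now.Sub(b.last).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	lim := newHostLimiter(5.0, 2) // 5 req/s per host, burst of 2
	fmt.Println(lim.Allow("example.com")) // true
	fmt.Println(lim.Allow("example.com")) // true
	fmt.Println(lim.Allow("example.com")) // false: burst exhausted
	fmt.Println(lim.Allow("github.com"))  // true: separate bucket
}
```

Because each host gets its own bucket, a slow target cannot starve the rest of the worker pool.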

Quick Start

# Build
make build

# Basic scan (single pass)
echo -e "example.com\ngithub.com\ncloudflare.com" > domains.txt
./bin/spyder -domains=domains.txt -concurrency=64

# Recursive crawling (discovers and follows new domains)
./bin/spyder -domains=domains.txt -continuous -max_domains=10000

# With Redis deduplication
REDIS_ADDR=127.0.0.1:6379 ./bin/spyder -domains=domains.txt -concurrency=256

# Docker
docker compose up -d

Configuration

CLI Flags

| Flag | Default | Description |
| --- | --- | --- |
| -domains | (required) | Path to newline-delimited domain list |
| -config | | Path to YAML/JSON config file |
| -concurrency | 256 | Worker goroutines |
| -continuous | false | Enable recursive domain discovery |
| -max_domains | 0 | Max discovered domains in continuous mode (0 = unlimited) |
| -ingest | | HTTP(S) ingest endpoint (empty = stdout) |
| -probe | local-1 | Probe identifier |
| -run | run-{timestamp} | Run identifier |
| -ua | SPYDERProbe/1.0 | User-Agent string |
| -exclude_tlds | gov,mil,int | TLDs to skip |
| -metrics_addr | :9090 | Prometheus metrics address |
| -output_format | json | Output format: json, jsonl, csv |
| -mtls_cert | | Client certificate for mTLS |
| -mtls_key | | Client key for mTLS |
| -mtls_ca | | CA bundle for mTLS |
| -otel_endpoint | | OTLP HTTP endpoint |
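
Most flags can also live in the file passed to -config. The schema below is a hypothetical sketch: the key names simply mirror the flags above, plus the crawling.rate_per_host field that appears in the Control API example later in this README. Check the docs for the authoritative schema.

```yaml
# Illustrative config shape -- key names are assumptions, not documented schema.
concurrency: 256
continuous: true
max_domains: 10000
exclude_tlds: [gov, mil, int]
crawling:
  rate_per_host: 5.0
```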

Environment Variables

| Variable | Description |
| --- | --- |
| REDIS_ADDR | Redis for deduplication |
| REDIS_QUEUE_ADDR | Redis for distributed work queue |
| REDIS_QUEUE_KEY | Queue key name (default: spyder:queue) |
| LOG_LEVEL | Log level: debug, info, warn, error |

Output Format

{
  "probe_id": "prod-us-east-1",
  "run_id": "run-20240101",
  "nodes_domain": [
    {"host": "example.com", "apex": "example.com", "first_seen": "...", "last_seen": "..."}
  ],
  "nodes_ip": [
    {"ip": "93.184.216.34", "first_seen": "...", "last_seen": "..."}
  ],
  "nodes_cert": [
    {"spki_sha256": "a1b2c3...", "subject_cn": "*.example.com", "issuer_cn": "DigiCert", "not_before": "...", "not_after": "..."}
  ],
  "edges": [
    {"type": "RESOLVES_TO", "source": "example.com", "target": "93.184.216.34", "observed_at": "...", "probe_id": "...", "run_id": "..."}
  ]
}

Edge Types

| Type | Source | Target | Discovery Method |
| --- | --- | --- | --- |
| RESOLVES_TO | Domain | IP | DNS A/AAAA |
| USES_NS | Domain | Nameserver | DNS NS |
| ALIAS_OF | Domain | Domain | DNS CNAME |
| USES_MX | Domain | Mail server | DNS MX |
| LINKS_TO | Domain | Domain | HTTP link extraction |
| USES_CERT | Domain | SPKI hash | TLS handshake |

Deployment

Docker Compose

The included docker-compose.yml starts SPYDER with Redis, Prometheus, and Grafana:

docker compose up -d        # start all services
docker compose logs -f spyder  # follow probe logs

Distributed Mode

Use Redis work queues for multi-instance deployments:

# Seed the queue
./bin/seed -redis=redis:6379 -domains=domains.txt

# Run multiple probes
REDIS_QUEUE_ADDR=redis:6379 ./bin/spyder -probe=probe-1 -continuous
REDIS_QUEUE_ADDR=redis:6379 ./bin/spyder -probe=probe-2 -continuous

Control API

SPYDER exposes a REST API on the metrics port (:9090) for runtime control:

# Check status (requires API key with read scope)
curl -H "Authorization: Bearer $API_KEY" http://localhost:9090/api/v1/status

# Submit domains for crawling
curl -X POST -H "Authorization: Bearer $API_KEY" \
  -d '{"host":"example.com"}' http://localhost:9090/api/v1/domains

# Scale workers at runtime
curl -X POST -H "Authorization: Bearer $API_KEY" \
  -d '{"count":512}' http://localhost:9090/api/v1/workers/scale

# Hot-reload configuration
curl -X PATCH -H "Authorization: Bearer $API_KEY" \
  -d '{"crawling":{"rate_per_host":5.0}}' http://localhost:9090/api/v1/config

Configure API keys in your config file or via the SPYDER_API_KEYS environment variable. See the API docs for all ~60 endpoints.

Documentation

Full documentation at gustycube.github.io/spyder

# Run docs locally
cd docs && npm install && npm run docs:dev

Contributing

See CONTRIBUTING.md. Quick setup:

git clone https://github.com/gustycube/spyder.git
cd spyder
make lint test build

License

MIT
