A fast, efficient, and embeddable web crawler written in Rust.
rustcrawl is a command-line spider built on a small, clean core
library. It crawls in breadth-first order from one or more seed URLs, respects
robots.txt and per-host rate limits by default, and streams every page it
finds as JSON Lines, ready to pipe into an indexer,
a database, or jq.
The engine (rustcrawl-core) is deliberately decoupled from the CLI so it can
be reused as the crawling layer of a larger system (a search index, an
archiver, a site auditor).
- Async, concurrent engine built on tokio with a configurable worker pool.
- Politeness by default: obeys robots.txt (including Crawl-delay) and enforces a minimum delay between requests to the same host, while crawling different hosts in parallel.
- Smart scoping: stay on the same host, the same registrable domain (Public Suffix List aware), or roam the open web, plus include and exclude regex filters.
- Robust fetching: timeouts, bounded retries with backoff for transient network failures and temporary HTTP statuses, redirect following, and a hard cap on response body size.
- Deduplication via URL normalization, so the same page is never queued twice.
- Sitemap seeding from sitemap.xml (and sitemap indexes).
- Crawl Deck terminal control center: live request logs, request rate graph, success rate, stats, pause and resume, stop, and rerun. Machine-readable JSON Lines stay on stdout so scripts still work.
- Pluggable output: implement one trait (Sink) to send crawled pages anywhere.
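
The per-host politeness rule in the list above can be pictured as a small table of "next allowed" timestamps, one per host. The following is a minimal sketch under stated assumptions, not the scheduler that ships in rustcrawl-core: `HostGate` and `reserve` are hypothetical names introduced for illustration, and the sketch ignores robots.txt and Crawl-delay entirely.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical helper (not a rustcrawl-core type): remembers when each
/// host may next be contacted.
struct HostGate {
    min_delay: Duration,
    next_allowed: HashMap<String, Instant>,
}

impl HostGate {
    fn new(min_delay: Duration) -> Self {
        Self { min_delay, next_allowed: HashMap::new() }
    }

    /// Reserves the next request slot for `host` and returns how long the
    /// caller should wait before sending. Other hosts are unaffected.
    fn reserve(&mut self, host: &str, now: Instant) -> Duration {
        // The slot is either "right now" or the time the host frees up.
        let slot = match self.next_allowed.get(host) {
            Some(&t) if t > now => t,
            _ => now,
        };
        // The following request to this host must wait one more delay.
        self.next_allowed.insert(host.to_string(), slot + self.min_delay);
        slot.duration_since(now)
    }
}

fn main() {
    let mut gate = HostGate::new(Duration::from_millis(250));
    let now = Instant::now();
    // The first request to each host can go out immediately.
    println!("a.example: wait {:?}", gate.reserve("a.example", now));
    println!("b.example: wait {:?}", gate.reserve("b.example", now));
    // A second request to the same host is held back by the minimum delay.
    println!("a.example: wait {:?}", gate.reserve("a.example", now));
}
```

Because the wait is tracked per host, a slow or rate-limited site does not stall requests to other hosts, which is the behavior the feature list describes.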
Requires a recent stable Rust toolchain (rustup).
git clone https://github.com/rustcrawl/rustcrawl
cd rustcrawl

If you just cloned the repo, rustcrawl is not on your PATH yet.
During development, run it through Cargo:
cargo run -p rustcrawl-cli -- https://example.com -n 50 -o pages.jsonl

Everything after -- is passed to the crawler.
If you prefer running the built binary directly:
cargo build -p rustcrawl-cli
# Windows (PowerShell, debug build):
.\target\debug\rustcrawl.exe https://example.com -n 50
cargo build --release -p rustcrawl-cli
# Linux/macOS:
./target/release/rustcrawl https://example.com -n 50
# Windows (PowerShell, release build):
.\target\release\rustcrawl.exe https://example.com -n 50

To install rustcrawl on your PATH for daily use:
cargo install --path crates/rustcrawl-cli
# then, in any directory:
rustcrawl https://example.com -n 50

On Windows, ensure %USERPROFILE%\.cargo\bin is on your PATH (rustup usually
adds this). Open a new terminal after installing.
cargo or rustcrawl is not recognized
Rust installs to %USERPROFILE%\.cargo\bin. If a terminal was open before
rustup ran, it won't see that path until you restart it.
PowerShell (current session only):
$env:Path += ";$env:USERPROFILE\.cargo\bin"
cargo --version

Or close and reopen your terminal (or Cursor) and try again.
Windows build errors about dlltool.exe or link.exe
If the MSVC linker is missing, either install Visual Studio Build Tools with the "Desktop development with C++" workload, or use the GNU toolchain + MinGW:
rustup default stable-x86_64-pc-windows-gnu
# MinGW-w64 must provide dlltool/gcc on PATH (e.g. C:\Users\...\mingw64\bin)

cargo run -p rustcrawl-cli -- --help
cargo test

Crawl a site, staying within its domain, and write results to a file:
.\target\debug\rustcrawl.exe https://example.com --max-pages 500 -o pages.jsonl

When run in an interactive terminal, rustcrawl opens a full-screen dashboard.
Run it with no target to open the control center without starting a crawl:
.\target\debug\rustcrawl.exe

Dashboard controls:
- n: create a new crawl job from the dashboard
- c: clear saved recent jobs when the control center is open
- Up / Down: select a recent job
- e: edit the selected recent job and run it again
- p / space: pause or resume leasing new URLs
- s: gracefully stop after in-flight requests finish
- r: rerun the same crawl after a run finishes
- q / Esc: exit the dashboard
The new job form supports the high level per run settings you usually need while iterating: target URL, max pages, depth, concurrency, scope, per host delay, optional output file, and whether the job should be saved to local history.
For script friendly output with no dashboard, use --quiet:
.\target\debug\rustcrawl.exe https://example.com -n 50 --quiet -o pages.jsonl

Skip local job history for throwaway runs:
.\target\debug\rustcrawl.exe https://example.com -n 50 --quiet --no-save

Clear saved recent jobs from the command line:
.\target\debug\rustcrawl.exe --clear-jobs

Pipe results straight into jq (progress is printed to stderr, data to stdout):
rustcrawl https://example.com -n 50 | jq -r '.url'

Seed from a sitemap and follow links two levels deep:
rustcrawl --sitemap https://example.com/sitemap.xml --depth 2

Tune concurrency and politeness, and restrict to a section of a site:
rustcrawl https://docs.example.com \
--concurrency 32 \
--delay 100ms \
--include '/guide/' \
--exclude '\.(png|jpg|pdf)$'

| Flag | Description | Default |
|---|---|---|
| <URL>... | Seed URLs | none |
| --sitemap <URL> | Also seed from a sitemap (repeatable) | none |
| -d, --depth <N> | Maximum link depth | unlimited |
| -n, --max-pages <N> | Stop after N pages | unlimited |
| -c, --concurrency <N> | Concurrent in-flight requests across all hosts, 1 to 1024 | 16 |
| --delay <DUR> | Minimum delay per host (e.g. 250ms, 1s) | 250ms |
| --ignore-crawl-delay | Ignore Crawl-delay while still obeying robots allow and deny rules | off |
| --timeout <DUR> | Per-request timeout | 30s |
| --retries <N> | Retries for transient network failures and temporary HTTP statuses | 2 |
| --scope <host\|domain\|any> | How far to roam | domain |
| --include <REGEX> | Only crawl matching URLs (repeatable) | none |
| --exclude <REGEX> | Skip matching URLs (repeatable) | none |
| --user-agent <STRING> | Override the User-Agent | project UA |
| --ignore-robots | Do not obey robots.txt | off |
| -o, --output <FILE> | Write JSON Lines to a file | stdout |
| -q, --quiet | Disable the dashboard; final summary only | off |
| --no-save | Do not write this run to local job history | off |
| --clear-jobs | Clear local job history and exit | off |
| -v | Increase log verbosity (-v, -vv) | warn |

Run rustcrawl --help for the full list.
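
To make the --retries flag more concrete, here is a rough sketch of bounded retries with exponential backoff. The README does not spell out the actual schedule, so everything specific here is an assumption made for illustration: the list of "temporary" statuses, the 500 ms base delay, the 10 s cap, and the function names `is_temporary_status`, `backoff_delay`, and `fetch_with_retries`. Network errors are left out; only HTTP statuses are modeled.

```rust
use std::time::Duration;

/// Assumed set of "temporary" statuses; the crawler's real list may differ.
fn is_temporary_status(status: u16) -> bool {
    matches!(status, 408 | 429 | 500 | 502 | 503 | 504)
}

/// Delay before retry number `attempt` (1-based): base, 2x base, 4x base,
/// and so on, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt.saturating_sub(1));
    base.saturating_mul(factor).min(max)
}

/// Calls `fetch` once, then up to `retries` more times while it keeps
/// returning a temporary status. Returns the last status together with the
/// delays a real client would have slept between attempts.
fn fetch_with_retries<F>(retries: u32, mut fetch: F) -> (u16, Vec<Duration>)
where
    F: FnMut() -> u16,
{
    let base = Duration::from_millis(500); // assumed base delay
    let max = Duration::from_secs(10); // assumed cap
    let mut delays = Vec::new();
    let mut status = fetch();
    let mut attempt = 0;
    while is_temporary_status(status) && attempt < retries {
        attempt += 1;
        // A real client would sleep for this long before trying again.
        delays.push(backoff_delay(attempt, base, max));
        status = fetch();
    }
    (status, delays)
}

fn main() {
    // Simulated server: two 503 responses, then success.
    let mut responses = vec![503u16, 503, 200].into_iter();
    let (status, delays) = fetch_with_retries(2, || responses.next().unwrap_or(200));
    println!("final status {}, delays {:?}", status, delays);
}
```

With the default of 2 retries, a page is attempted at most three times in this model; a permanent status such as 404 is returned after the first attempt.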
Each crawled page is one JSON object:
{
"url": "https://example.com/",
"final_url": "https://example.com/",
"status": 200,
"depth": 0,
"referrer": null,
"content_type": "text/html; charset=utf-8",
"title": "Example Domain",
"content_length": 1256,
"links": ["https://example.com/about"],
"fetched_at": "2026-01-01T00:00:00Z",
"elapsed_ms": 42
}

┌────────────────────────── rustcrawl-cli ──────────────────────────┐
│  clap args ─► CrawlConfig          Crawl Deck (ratatui, stderr)   │
└───────────────┬───────────────────────────▲───────────────────────┘
                │                           │ events / stats
┌───────────────▼────────────── rustcrawl-core ─────────────────────┐
│                                                                   │
│   Engine ──spawns──► worker pool (tokio tasks)                    │
│     │                    │                                        │
│     ▼                    ▼                                        │
│  Frontier             Fetcher ─► RobotsCache                      │
│  (dedup,              (retries,  (robots.txt                      │
│   per host             timeouts,  cache + delay)                  │
│   scheduling,          size cap)       │                          │
│   budgets)               │             ▼                          │
│     ▲                    ▼       parser (links, title)            │
│     └──── enqueue in-scope links ◄── UrlFilter (scope + regex)    │
│                          │                                        │
│                          ▼                                        │
│                        Sink ─► JSON Lines / your storage          │
└───────────────────────────────────────────────────────────────────┘
The pieces are intentionally small and independently testable:
- Frontier: the politeness-aware URL queue (deduplication, per-host scheduling, depth tracking, and the page budget).
- Fetcher: HTTP with retries, timeouts, and a response size ceiling.
- RobotsCache: per-host robots.txt, including Crawl-delay.
- parser / sitemap: link and title extraction plus sitemap parsing.
- UrlFilter: scope and include or exclude rules.
- Sink: the output seam; implement it to plug into anything.
- Engine: wires it all together and runs the worker pool.
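
As a rough illustration of the Frontier's responsibilities listed above (deduplication, depth tracking, and the page budget), here is a toy breadth-first queue. It is a sketch under simplifying assumptions, not the crate's implementation: this `Frontier` struct and the `normalize` function are hypothetical stand-ins, per-host scheduling is omitted, and the normalization is deliberately naive (it only drops the fragment, lowercases the scheme and host, and treats an empty path as `/`).

```rust
use std::collections::{HashSet, VecDeque};

/// Naive normalization (an assumption, not the crate's rules): drop the
/// fragment, lowercase scheme and host, and use "/" for an empty path.
fn normalize(url: &str) -> String {
    let no_fragment = url.split('#').next().unwrap_or(url);
    let (scheme, rest) = match no_fragment.split_once("://") {
        Some(parts) => parts,
        None => return no_fragment.to_string(),
    };
    let (host, path) = match rest.find('/') {
        Some(i) => rest.split_at(i),
        None => (rest, "/"),
    };
    format!("{}://{}{}", scheme.to_lowercase(), host.to_lowercase(), path)
}

/// Toy frontier: a FIFO queue gives breadth-first order.
struct Frontier {
    queue: VecDeque<(String, u32)>, // (normalized URL, depth)
    seen: HashSet<String>,
    max_depth: Option<u32>,
    budget: Option<usize>, // pages still allowed, None = unlimited
}

impl Frontier {
    fn new(max_depth: Option<u32>, max_pages: Option<usize>) -> Self {
        Self { queue: VecDeque::new(), seen: HashSet::new(), max_depth, budget: max_pages }
    }

    /// Queues `url` unless it is too deep or its normalized form was seen.
    fn push(&mut self, url: &str, depth: u32) -> bool {
        if self.max_depth.map_or(false, |max| depth > max) {
            return false;
        }
        let key = normalize(url);
        if !self.seen.insert(key.clone()) {
            return false;
        }
        self.queue.push_back((key, depth));
        true
    }

    /// Hands out the next URL in breadth-first order while budget remains.
    fn pop(&mut self) -> Option<(String, u32)> {
        if self.budget == Some(0) {
            return None;
        }
        let next = self.queue.pop_front()?;
        if let Some(left) = self.budget.as_mut() {
            *left -= 1;
        }
        Some(next)
    }
}

fn main() {
    let mut frontier = Frontier::new(Some(1), Some(3));
    frontier.push("https://example.com/", 0);
    while let Some((url, depth)) = frontier.pop() {
        println!("depth {} {}", depth, url);
        // Pretend every page links to the same two URLs.
        for link in ["https://EXAMPLE.com/about#team", "https://example.com/blog"] {
            frontier.push(link, depth + 1);
        }
    }
}
```

Checking the seen set at enqueue time, rather than at fetch time, is what keeps a page from being queued twice even when many pages link to it.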
use rustcrawl_core::{sink::JsonlSink, CrawlConfig, Engine, Result};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<()> {
    // Crawl from one seed, stop after 100 pages, 8 requests in flight.
    let config = CrawlConfig::builder()
        .add_seed("https://example.com")?
        .max_pages(Some(100))
        .concurrency(8)
        .build()?;

    // Stream each crawled page to stdout as JSON Lines.
    let summary = Engine::new(config, Arc::new(JsonlSink::stdout()))?
        .run()
        .await?;

    // Report on stderr so stdout stays machine readable.
    eprintln!("crawled {} pages", summary.pages_fetched);
    Ok(())
}

Contributions are very welcome; see CONTRIBUTING.md.
If you want to help, this project is intentionally open to practical improvements from real crawling and search workflows. Good contributions include:
- Better crawl correctness and safety (URL normalization edge cases, robots behavior, dedup quality, stronger defaults).
- Better Crawl Deck UX (new control views, clearer metrics, better job form ergonomics, keyboard workflow improvements).
- Better reliability and performance (frontier efficiency, fetch throughput, retry behavior, memory pressure control, large crawl stability).
- Better developer experience (cleaner local setup, easier Windows support, packaging, release automation, onboarding docs).
- Better interoperability (new Sink implementations for common data stores, index pipelines, and analytics stacks).
- Better test depth (integration scenarios, regression suites, failure mode tests, reproducible fixtures).
What we want most: small, focused PRs with clear behavior changes, tests for new logic, and defaults that keep the crawler polite and safe. If you are not sure where to start, open an issue with your idea and we can shape it together.
rustcrawl is a general purpose crawler framework and terminal control center.
You are solely responsible for how you use it.
By using this software, you agree to all of the following:
- You will comply with all applicable laws, regulations, contracts, and site terms of service in your jurisdiction.
- You will respect robots.txt, rate limits, and operational safety practices for systems you crawl.
- You will only crawl content you are authorized to access and process.
- You are responsible for any legal, operational, or financial consequences of your usage, including traffic impact and data handling.
This project is provided for legitimate engineering and research workflows. It is provided as is, without warranty of any kind, express or implied. The maintainers and contributors assume no responsibility or liability for misuse, damages, claims, or losses resulting from use of this software.
Licensed under either of Apache License, Version 2.0 or MIT license at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in this project by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.