blue dragon

A Chrome extension for crawling and SEO analysis. Opens as a full tab, fetches pages in parallel, and extracts metadata, structured data, and article text into an interactive table.

Getting started guide → · Chrome Web Store

Installation

npm install && npm run build
Open chrome://extensions → enable Developer mode → Load unpacked → select the dist/ folder
Click the dragon icon in the toolbar

Crawl tab

Building a URL list

Spider URL — enter a page URL and click fetch. The extension fetches the page and extracts all linked URLs into the URL list below.

URL list — paste or type URLs directly, one per line.

CSV import — via the URLs → CSV menu. Drag & drop or pick a CSV file. Columns named URL, page, or Page are imported as URLs.

Sitemap import — via the URLs → sitemap menu. Enter a sitemap XML URL (including .gz sitemaps). All <loc> entries are appended to the URL list.

Filtering the URL list

A regex filter can be applied to the URL list before crawling. Write a pattern in the filter field and click include (keep only matching) or exclude (remove matching). Three quick presets are available: uuid, ends with /, ends with .html. Custom named presets can be saved and managed in the extension Options page and will appear in the same dropdown with their type applied automatically on click.

The URL counter turns orange when the number of entered URLs exceeds the configured max pages limit while in list mode.

Crawling

Click crawl to start. Pages are fetched concurrently up to the configured connection limit. A progress bar tracks completion. The results table renders automatically when the queue is empty.

The crawl runs in the background service worker and survives closing the extension tab — reopen the tab at any time to reconnect and see current progress.

Crawl modes

Mode	Behaviour
list	Crawls exactly the URLs in the list
autonomous	Recursive spider — discovers and follows internal links, stays on the start hostname. Automatically enables stay on hostname.

In autonomous mode the max pages limit caps how many URLs are crawled in total.

An optional crawl filter regex (include or exclude) can be set in Configuration to restrict which discovered URLs are added to the queue during a recursive crawl.

In list mode, outbound links (links to external hosts) are detected on each page and stored for the link explorer. If fetch outbound is enabled in Configuration, those external URLs are also fully crawled (status code, title, H1, etc.) and appear in the results table with the outbound column set to true. They are not added to the recursive queue.

Results table

Each row is one crawled URL. Rows are colour-coded by HTTP status: red for 4xx/5xx, amber for 3xx. Columns visible by default are marked with ✓; all others can be enabled via the columns toggle button.

Column	Default	Description
url	✓	Crawled URL
status	✓	HTTP status code
duration (ms)	✓	Response time
redirected	✓	Whether the request was redirected
encoded size	✓	Compressed response size
crawlable	✓	Whether the active bot UA is allowed by robots.txt
indexable	✓	True when status is 2xx, content-type is text/html, and no `noindex` directive is present
canonical		`<link rel="canonical">`
title		`<title>`
description		`<meta name="description">`
keyword		`<meta name="keywords">`
robots		`<meta name="robots">` content
h1		First `<h1>`
content-type		`Content-Type` response header
decoded size		Uncompressed response size
deliveryType		Cache delivery type from Performance API
content-encoding		Transfer encoding used (gzip, br, identity, etc.)
next-hop-protocol		HTTP protocol version (h2, http/1.1, h3)
timestamp		Time of the request
ok		Whether the response was successful (2xx)
og:image / og:title / og:site / og:description		Open Graph tags
publisher		`schema.org` publisher name
dateModified / datePublished		`schema.org` dates
authors		`schema.org` author names
headline / altHeadline		`schema.org` headlines
content		Article body text (requires extract text setting)
outbound links		Number of internal links found on the page (autonomous mode)
inbound links		Number of internal links pointing to this page (autonomous mode)
outbound	✓	Whether this URL was crawled as an outbound-only link (list mode + fetch outbound)
broken outbound		Number of outbound links resolving to a 4xx/5xx response
html		View/copy stored HTML (requires save HTML to OPFS setting)
clicks / impressions		Available if a Search Console CSV was imported

Search — filter visible rows by any text.
Export CSV — downloads all rows and all columns (including hidden ones, except the internal HTML file path). The filename follows the crawl-save naming convention: {hostname}_{datetime}.csv.

URL detail view

Click any row to open a detail panel for that URL. The panel shows:

All crawled fields grouped by category (fetch metadata, Open Graph, Schema.org, custom data, …)
Outbound links — every link found on the page, with its HTTP status, follow attribute, and directive source. Rows are colour-coded (red = broken, amber = redirect). Click a row to drill into that URL's detail view.
Inbound links — every crawled page that links to this URL, with the same columns and drill-down behaviour.

Internal link explorer

After an autonomous crawl, the links button opens a modal table of every internal link edge discovered: source URL → target URL, the follow status (follow, nofollow, ugc, sponsored), and the directive source (anchor rel attribute, meta robots tag, or header X-Robots-Tag). The table is searchable and paginated.

Saving and loading crawls

Use the Crawls menu to save the current crawl, load a previous one, or delete saved crawls. Crawls are stored in chrome.storage.local. Scheduled crawls are automatically saved here after each run. Saved crawls are named {hostname} {datetime} by default.

Issues tab

After a crawl, the Issues tab runs a set of SEO checks across all crawled HTML pages and lists findings grouped by severity.

Check	Severity	Description
Broken pages	Error	Pages returning 4xx or 5xx
Robots-blocked	Error	Pages blocked by robots.txt for the active UA
Redirects	Warning	Pages returning 3xx
Missing title	Warning	HTML pages with no `<title>`
Missing H1	Warning	HTML pages with no `<h1>`
Missing description	Warning	HTML pages with no meta description
Missing canonical	Info	HTML pages with no canonical tag
Duplicate titles	Warning	Title text shared by more than one HTML page
Broken outbound links	Warning	Pages that link to URLs returning 4xx/5xx

Checks for title, H1, description, canonical, and duplicate titles are only applied to pages with a text/html content type. Clicking an issue row opens the URL detail view for that page.

Configuration

Open via the Configuration button in the navbar.

Setting	Default	Description
crawl mode	list	`list` or `autonomous` (recursive spider)
max pages limit	500	Upper bound for autonomous mode
crawl filter regex	—	Regex applied to discovered URLs in autonomous mode
crawl filter type	off	`include` or `exclude` the regex matches
max retries	2	How many times to retry a failed URL
crawl delay	0 ms	Pause between requests
max connections	5	Concurrent fetch limit
stay on hostname	off	Skip URLs from other hostnames (auto-enabled in autonomous mode)
respect robots.txt	on	Honour robots.txt rules for the active bot User-Agent
fetch outbound	off	In list mode: fully crawl discovered outbound links (all stats) without adding them to the queue
credentials	omit	Cookie/auth handling (`omit`, `same-origin`, `include`)
cache	no-store	Browser cache behaviour
document charset	utf-8	Encoding used to decode response bodies
extract text	off	Extract article body with Mozilla Readability
save HTML to OPFS	off	Save raw HTML responses to the browser's Origin Private File System
OPFS root directory	crawl_archive	Folder name within OPFS

A manage overrides → link in the robots.txt row opens the robots.txt management page.

robots.txt management

A dedicated page (robots.txt management) is accessible from the Configuration modal. It lets you view and override the robots.txt content for any crawled hostname, so you can control crawl behaviour without modifying the live server.

Fetched column — the robots.txt retrieved from the server (read-only).
Override column — editable content that replaces the fetched robots.txt during crawls. Leave empty to allow all.
Save override — stores the override in chrome.storage.local; takes effect on the next crawl.
Clear override — removes the override and reverts to the fetched robots.txt.
Re-fetch — re-downloads the live robots.txt from the server.
Add override for new origin — enter any origin URL, optionally click Fetch to populate the textarea with the live robots.txt, edit as needed, and click Save override.

Options page

Right-click the dragon icon → Options (or navigate to the options URL from chrome://extensions).

User-Agent

Lets you override the browser's User-Agent for all extension requests. Presets include common bots:

Googlebot Desktop / Smartphone
Googlebot-Image, Googlebot-Video, Google-Other
Bingbot Desktop / Mobile
YandexBot, Baiduspider, DuckDuckBot
GPTBot, ClaudeBot
Custom (free-text input)
Default (browser UA)

The active bot name is extracted from the selected UA and used for robots.txt evaluation.

URL-Filter Presets

Named regex presets that appear in the regex dropdown on the crawl page. Each preset stores a name, a pattern, and a type (include or exclude). Clicking a saved preset applies it to the URL list immediately.

Field	Description
Name	Display label for the preset
Regex	Regular expression pattern
Type	`include` (keep matching) or `exclude` (remove matching)

Saved Crawls

Storage management for all IndexedDB (IDB) and OPFS crawl data.

Storage quota bar — shows total browser storage used vs. available.
Crawl list — unified view of all IDB entries and OPFS directories matched by crawl ID. Newest crawls appear first; OPFS directories without a matching IDB entry are flagged as orphans.
Rename — inline rename of the crawl name in IDB (press Enter or click away to save).
↓ ZIP — downloads all HTML files stored in the OPFS directory for that crawl as a ZIP archive.
Delete IDB — removes the JSON result blob from IndexedDB; OPFS files are kept.
Delete OPFS — removes the HTML archive directory from OPFS; IDB entry is kept.
Delete all — removes both IDB and OPFS data for the crawl.

Schedules tab

Scheduled crawls run in the background at a fixed interval — even when the extension page is closed.

Creating a schedule

Name — a label for this schedule. {hostname} and {datetime} are replaced at runtime. Default: {hostname} {datetime}.
URL sources — one or more of:
- Spider URL — fetches the page and crawls all linked URLs found on it
- URL list — a fixed list of URLs, one per line
- Sitemap URL — fetches the sitemap XML and crawls all <loc> entries
Regex filter — optionally filter the resolved URL list by include or exclude.
Frequency — how often to run: Hourly, Every 2h / 4h / 6h / 12h, Daily, or Weekly.
At time — anchor the run to a specific time of day (e.g. 03:00). Hourly and multi-hour schedules use this as the starting point on the period grid; daily schedules fire at exactly this time each day.
Day — (Weekly only) the day of the week to run.

Click add to save the schedule.

Schedule table

Each row shows: property (the crawl target), interval, last run time with ok/total stats, next scheduled run, and status.

Status	Meaning
`active` (green)	Schedule is enabled and waiting for its next run
`N/M pages` (blue, pulsing)	Crawl is running — live page count updated in real time
`paused` (grey)	Schedule is disabled

Button	Action
↓	Loads the last result into the table and switches to the Crawl tab
⚡	Triggers an immediate crawl (disabled while a run is in progress)
pause / resume	Enables or disables the scheduled interval
✕	Deletes the schedule

Completed crawls are automatically saved to Crawls → Load. If a scheduled crawl fires while a manual crawl is already running, it is queued and starts as soon as the manual crawl finishes.

Stats tab

Aggregated crawl statistics, visible during and after a crawl. Updates in real-time as pages are fetched; also populated when loading a saved crawl.

Section	Content
Summary	Total URLs encountered, crawled, blocked by robots.txt, internal, external, indexable, non-indexable
Status codes	Count per HTTP status code, sorted by frequency
Content types	Count per content-type, sorted by frequency

Clicking any Status codes or Content types row — or the Indexable / Non-indexable / Internal / External rows in Summary — filters the results table to matching rows and switches to the Crawl tab.

Tech stack


Bundler	Vite 5
CSS	Tailwind CSS v3
Table	Tabulator v6
CSV parsing	PapaParse
Text extraction	@mozilla/readability
Robots.txt parsing	google-robotstxt-parser
HTML parsing	node-html-parser
Background scheduling	`chrome.alarms` API (Manifest V3)
Storage	`chrome.storage.local` + Origin Private File System (OPFS)

License

Apache 2.0 — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.githooks		.githooks
.github/workflows		.github/workflows
docs		docs
icons		icons
public		public
src		src
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
postcss.config.js		postcss.config.js
tailwind.config.js		tailwind.config.js
vite.config.js		vite.config.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

blue dragon

Installation

Crawl tab

Building a URL list

Filtering the URL list

Crawling

Crawl modes

Results table

URL detail view

Internal link explorer

Saving and loading crawls

Issues tab

Configuration

robots.txt management

Options page

User-Agent

URL-Filter Presets

Saved Crawls

Schedules tab

Creating a schedule

Schedule table

Stats tab

Tech stack

License

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

blue dragon

Installation

Crawl tab

Building a URL list

Filtering the URL list

Crawling

Crawl modes

Results table

URL detail view

Internal link explorer

Saving and loading crawls

Issues tab

Configuration

robots.txt management

Options page

User-Agent

URL-Filter Presets

Saved Crawls

Schedules tab

Creating a schedule

Schedule table

Stats tab

Tech stack

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages