page2md

Render any URL with headless Chromium and extract clean Markdown to stdout.

Quickstart

docker run --rm ghcr.io/michalschroeder/page2md example.com

Tags: latest, semver (1.0.0), and major (1). Pin to a major in scripts:

docker run --rm ghcr.io/michalschroeder/page2md:1 example.com

Why

Many pages need JS to render. curl | defuddle misses SPA content. page2md runs the page in headless Chromium first, so what you get is what a browser sees — then defuddle extracts the article.

When to use it

Feeding pages to an LLM — Markdown is ~19× smaller than rendered HTML (see Token savings), and the model sees content, not chrome.
Archiving / note-taking — clean, diff-able Markdown for Obsidian, Logseq, git.
Scraping pipelines — deterministic output, no AI in the loop, no hallucinated content.
Bot-blocked pages — sites that 403 a plain fetch/curl (e.g. Stack Overflow) render here because page2md drives a real browser.

vs. alternatives

Approach	JS rendering	Cleanup	Deterministic	Token cost to convert
`curl` (raw HTML)	✗	✗	✓	none
`curl \| pandoc`	✗	✗	✓	none
`curl \| readability-cli \| pandoc`	✗	✓ (articles only)	✓	none
AI fetchers (Claude WebFetch, etc.)	✗	✓	✗	pays LLM tokens to extract
page2md	✓	✓ (Defuddle)	✓	none

Reach for curl | pandoc if the page is static and you want every byte. Reach for page2md when you want the actual rendered article as faithful Markdown — no LLM in the conversion path, so the only tokens you spend are on the output you actually feed downstream.

Usage

Output goes to stdout. Pipe or redirect:

docker run --rm ghcr.io/michalschroeder/page2md example.com > page.md
docker run --rm ghcr.io/michalschroeder/page2md https://news.ycombinator.com | less

Or write to a file with -o (mount a host dir so the file lands on your machine, not in the container):

docker run --rm -v "$PWD":/out ghcr.io/michalschroeder/page2md -o /out/page.md example.com

URL scheme is optional (https:// is assumed).

Tip: add --init (docker run --init --rm ...) so Chromium's child processes get reaped cleanly if you ctrl-C mid-render.

Slow or JS-heavy pages

By default page2md grabs the DOM the moment it's ready (domcontentloaded). Single-page apps that load their content after the initial render come back empty — give them an extra wait:

# wait 6s after load for late-rendering SPA content
docker run --rm ghcr.io/michalschroeder/page2md --wait-ms 6000 https://bsky.app/profile/bsky.app

--wait-ms <n> — extra delay (ms, 0–300000) after load. The reliable lever for client-rendered SPAs.
--wait-until <event> — domcontentloaded (default), load, or networkidle. networkidle waits for the network to go quiet; skip it on apps that hold connections open (it never settles).

Or don't pick flags by hand — --auto retries with progressively stronger render settings until something extracts:

docker run --rm ghcr.io/michalschroeder/page2md --auto https://bsky.app/profile/bsky.app

It escalates only on failure, so simple pages still take the fast default path; each retry is logged to stderr. Not every page is reachable — some stay blocked or empty.

Shell wrapper

Bash/zsh:

page2md() { docker run --rm ghcr.io/michalschroeder/page2md "$@"; }

Nushell:

def page2md [url: string] { docker run --rm ghcr.io/michalschroeder/page2md $url }

Token savings

Feeding raw or rendered HTML to an LLM burns tokens on tags, scripts, and chrome. page2md hands the model just the article.

The numbers below compare page2md's Markdown against the HTML itself — what you'd spend dumping a page straight into context. AI fetchers like WebFetch already extract before returning (and don't run JS), so page2md's edge over those is JS rendering and deterministic output, not raw token count.

Across 8 pages, page2md output is 18.9× smaller than rendered HTML in tokens (95% savings, cl100k_base).

Page	Rendered HTML	page2md MD	Savings
example.com	153	29	81%
Wikipedia article	114,220	20,322	82%
Hacker News front page	11,766	2,121	82%
MDN reference page	26,063	2,775	89%
React docs page	152,462	6,251	96%
Personal blog post	30,205	1,713	94%
Stack Overflow Q&A	269,408	4,727	98%
page2md GH repo	139,025	1,297	99%
Total	743,302	39,235	95%

Full numbers and method: docs/token-comparison/.

How it works

Open the page in headless Chromium. Skip images, fonts, video, CSS.
Don't wait for the network to go quiet — as soon as the DOM is ready, grab the HTML.
Hand it to defuddle, which figures out the main content and turns it into Markdown.
Print it.

Roughly 1.3s for a trivial page (example.com), 2s for a typical article (Wikipedia, HN front page), 3s+ for heavier SPAs (e.g. GitHub). Container startup is a fixed ~1s of that; the rest is page render.

Local dev

You'll need Bun.

bun install
bunx playwright install chromium-headless-shell
bun index.ts <url>

Or build the Docker image locally:

docker build -t page2md .
docker run --rm page2md example.com

Lint & test

All checks run via Docker — no host installs needed:

make lint    # tsc + biome + hadolint + actionlint
make fmt     # auto-format with Biome
make test    # snapshot tests

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github		.github
docs		docs
scripts		scripts
src		src
test		test
.dockerignore		.dockerignore
.gitignore		.gitignore
.release-please-manifest.json		.release-please-manifest.json
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
biome.json		biome.json
bun.lock		bun.lock
index.ts		index.ts
package.json		package.json
release-please-config.json		release-please-config.json
renovate.json		renovate.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

page2md

Quickstart

Why

When to use it

vs. alternatives

Usage

Slow or JS-heavy pages

Shell wrapper

Token savings

How it works

Local dev

Lint & test

License

About

Uh oh!

Releases 6

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

page2md

Quickstart

Why

When to use it

vs. alternatives

Usage

Slow or JS-heavy pages

Shell wrapper

Token savings

How it works

Local dev

Lint & test

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages