Renders web pages and PDFs into token-optimized JSON for LLM agents.
LLM agents are bad at getting fresh web content:
- Stale training data — without explicit tooling, models default to their training corpus, returning confident answers from months-old snapshots. Even with web search enabled, models can weight stale training knowledge over fresh results.
- Bot protection — curl, wget, and built-in web tools get blocked (403/999) by most sites. When they do get through, they return raw HTML — 10,000-50,000 tokens of navigation, ads, and markup noise.
- SPA blindness — modern sites return empty HTML shells. The actual content loads via JavaScript after the initial response.
Chrome renders the page, waits for all network requests to settle (SPAs included), then ref extracts structured content and outputs compact JSON.
100 KB of rendered HTML becomes 1-5 KB of structured JSON.
```
┌──────────┐   ┌─────────────┐   ┌──────────┐   ┌──────────┐
│ Headless │ → │ networkIdle │ → │ Strip    │ → │ Compact  │
│ Chrome   │   │ wait (SPA)  │   │ nav/ads  │   │ JSON out │
└──────────┘   └─────────────┘   └──────────┘   └──────────┘
```
```shell
ref fetch https://example.com 2>/dev/null | jq .
```

```json
{
  "url": "https://example.com",
  "status": "ok",
  "title": "Example Domain",
  "sections": [
    {
      "level": 1,
      "heading": "Example Domain",
      "content": "This domain is for use in illustrative examples..."
    }
  ],
  "links": [
    { "text": "More information...", "url": "https://www.iana.org/domains/example" }
  ],
  "chars": 1256
}
```

Extract a specific element with `--selector`:

```shell
ref fetch --selector "#pricing-table" https://example.com 2>/dev/null | jq .
```

Null fields and empty arrays are omitted. Sections are capped (200-char headings, 2,000-char content). Code blocks include the detected language. The status field detects paywalls, login walls, and dead links.
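The compact output is easy to post-process with jq. A sketch, with a sample document inlined (mirroring the shape shown above) so it runs without `ref` or network access; real output may carry additional fields:

```shell
# Sample ref-fetch-style output, inlined for illustration.
sample='{"url":"https://example.com","status":"ok","title":"Example Domain","sections":[{"level":1,"heading":"Example Domain","content":"This domain is for use in illustrative examples..."}],"chars":1256}'

# Keep only pages that rendered cleanly, then print heading/content pairs.
result=$(printf '%s\n' "$sample" | jq -r 'select(.status == "ok") | .sections[] | "\(.heading): \(.content)"')
echo "$result"
```

In a live pipeline, replace the inlined sample with `ref fetch <url> 2>/dev/null`.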
```shell
ref pdf document.pdf 2>/dev/null | jq .
```

Same JSON structure, plus table detection (whitespace column analysis, header inference, markdown output) and heading detection (numbered sections, Roman numerals, ALL CAPS, academic/legal formats).
Also accepts URLs — downloads and extracts in one step:
```shell
ref pdf https://example.com/report.pdf 2>/dev/null | jq .
```

| Command | Description |
|---|---|
| `ref fetch <url>` | Render page via Chrome, output structured JSON |
| `ref pdf <file\|url>` | Extract text and tables from PDFs |
| `ref scan <files>` | Find URLs in markdown, build references.yaml |
| `ref verify-refs <file>` | Check reference entries, update status |
| `ref check-links <file>` | Validate URL health (HTTP status codes) |
| `ref refresh-data --url <url>` | Extract live data (market sizes, stats) |
| `ref init` | Create references.yaml template |
| `ref update` | Self-update from GitHub releases |
| `ref mcp` | Start MCP server (JSON-RPC 2.0 over stdio) |
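`ref init` scaffolds `references.yaml` and `ref scan` fills it from markdown files. The actual schema isn't documented here; a hypothetical entry shape, with every field name an illustrative assumption:

```yaml
# Hypothetical shape: field names are assumptions, not the tool's actual schema.
references:
  - url: https://example.com/article
    title: Example Article
    status: ok            # would be updated by `ref verify-refs`
    last_checked: 2025-01-15
```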
`ref mcp` starts a persistent MCP server over stdio.
AI applications call tools directly — no shell spawning, and the browser pool stays warm between calls.
Six tools: `ref_fetch`, `ref_pdf`, `ref_check_links`, `ref_scan`, `ref_verify_refs`, `ref_refresh_data`.
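Over stdio, requests follow standard MCP JSON-RPC framing. A hedged illustration of a `tools/call` message a client might send after the initialize handshake (the `url` argument name is an assumption about `ref_fetch`'s input schema, not a captured transcript):

```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "method": "tools/call",
  "params": {
    "name": "ref_fetch",
    "arguments": { "url": "https://example.com" }
  }
}
```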
Claude Code (`.mcp.json` in project root):

```json
{
  "mcpServers": {
    "ref": {
      "command": "ref",
      "args": ["mcp"]
    }
  }
}
```

Claude Desktop (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "ref": {
      "command": "/usr/local/bin/ref",
      "args": ["mcp"]
    }
  }
}
```

See `docs/mcp-integration.md` for the full setup guide and tool reference.
ref is designed to work with Asimov, a vendor-neutral orchestrator for AI coding CLIs (Claude Code, Gemini CLI, Codex CLI).
Asimov's freshness protocol forces agents to use `ref fetch` for all web content instead of relying on built-in search tools or training data:

```json
{
  "rule": "MUST use ref fetch <url> for all web fetching. NEVER use WebSearch or WebFetch."
}
```

This ensures agents work with current, verified content rather than stale or hallucinated sources.
From releases:
```shell
# macOS (Apple Silicon)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-aarch64-apple-darwin.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-x86_64-apple-darwin.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# Linux (x64)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-x86_64-unknown-linux-musl.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# Linux (ARM64)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-aarch64-unknown-linux-musl.tar.gz | tar xz
sudo mv ref /usr/local/bin/
```

From crates.io:

```shell
cargo install mollendorff-ref
```

Requirements:

- Chrome or Chromium (for fetch, check-links, verify-refs)
- Rust toolchain (build from source only)