appautomaton/markmaton

markmaton

markmaton is a lightweight HTML-to-Markdown parser core built for agent workflows.

It solves the last-mile parsing problem in a web pipeline: you already have page HTML, but it is still too noisy and awkward for downstream agent use. Feed markmaton HTML from a fetcher or browser layer and get back cleaner Markdown, metadata, links, images, and quality signals.

Note

markmaton is a general parser, not a crawler. Feed it HTML from Playwright, fetch, Firecrawl, or another upstream page-visit tool.

Why it exists

  • Raw page HTML is usually not directly useful for downstream agent workflows.
  • Modern pages often mix the real content with navigation, overlays, cards, and app shell chrome.
  • markmaton keeps that cleanup and conversion step deterministic and separate from crawling.
  • The project stays narrow by design: no crawling, browser control, network access, or LLM features.
  • The user-facing entrypoint is a Python CLI and API wrapped around a fast Go engine.

Install

pip

pip install markmaton

uv tool

uv tool install markmaton

Tip

The published package installs with plain pip. Local development uses uv with Python 3.12.

Quickstart

CLI

markmaton convert \
  --html-file page.html \
  --url https://example.com/article \
  --output-format markdown

To get the full structured response:

markmaton convert \
  --html-file page.html \
  --url https://example.com/article \
  --output-format json
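
When the CLI is driven from another Python process, the two invocations above can be wrapped in a small helper. This is a minimal sketch, not part of markmaton: the flags are the ones shown above, the helper names build_convert_command and run_convert are hypothetical, and it assumes the markmaton executable is on PATH and writes its result to standard output.

```python
import subprocess


def build_convert_command(html_file: str, url: str, output_format: str = "markdown") -> list[str]:
    """Build the argv list for `markmaton convert` using the flags shown above."""
    return [
        "markmaton", "convert",
        "--html-file", html_file,
        "--url", url,
        "--output-format", output_format,
    ]


def run_convert(html_file: str, url: str, output_format: str = "markdown") -> str:
    """Run the CLI and return its standard output (assumed to hold the result)."""
    completed = subprocess.run(
        build_convert_command(html_file, url, output_format),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting issues when the URL contains characters such as & or ?.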

Python API

from markmaton import ConvertOptions, ConvertRequest, convert_html

html = "<article><h1>Hello</h1><p>World</p></article>"

response = convert_html(
    ConvertRequest(
        html=html,
        url="https://example.com/article",
        options=ConvertOptions(only_main_content=True),
    )
)

print(response.markdown)
print(response.metadata.title)

Tip

Pass url whenever you can. markmaton uses it as parsing context for canonical metadata and absolute link resolution.
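
To see why the page URL matters for link resolution, the sketch below resolves a few relative hrefs against a base URL with the standard library. It illustrates the general idea only; it is not markmaton's implementation, and the function name resolve_links and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve each href against the page URL; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


page_url = "https://example.com/blog/article"
hrefs = ["/about", "images/cover.png", "../tags/python", "https://other.example/x"]

for href, absolute in zip(hrefs, resolve_links(page_url, hrefs)):
    print(f"{href} -> {absolute}")
```

Without a base URL, the first three hrefs could not be turned into links that a downstream agent can follow.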

Output

JSON mode returns markdown, html_clean, metadata, links, images, and quality. See response shape for details.
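
A downstream consumer might check for these fields before using the result. The sketch below assumes the JSON output is a single top-level object keyed by the six field names listed above; the helper load_response is hypothetical, and the sample values (including the types of metadata, links, images, and quality) are placeholders rather than real markmaton output.

```python
import json

# Field names taken from the list above.
EXPECTED_KEYS = ("markdown", "html_clean", "metadata", "links", "images", "quality")


def load_response(raw: str) -> dict:
    """Parse JSON-mode output and fail early if an expected field is absent."""
    payload = json.loads(raw)
    missing = [key for key in EXPECTED_KEYS if key not in payload]
    if missing:
        raise KeyError(f"response is missing fields: {missing}")
    return payload


# Hypothetical sample; the value shapes here are assumptions for illustration.
sample = json.dumps({
    "markdown": "# Hello\n\nWorld",
    "html_clean": "<article><h1>Hello</h1><p>World</p></article>",
    "metadata": {},
    "links": [],
    "images": [],
    "quality": {},
})

response = load_response(sample)
print(response["markdown"])
```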

Project shape

  • Go engine: cmd/markmaton-engine
  • Python wrapper and CLI: markmaton/
  • Parser fixtures and golden files: testdata/
  • Research, benchmark, and release docs: docs/

Documentation

Development

Set up the local development environment:

uv sync --group dev

Run the core test suites:

uv run python -m unittest discover -s tests -p 'test_*.py'
go test ./...

For a manual end-to-end smoke test:

The repo is pinned to:

Important

Automated tests are unit-test-first. Live page visits and benchmarks are manual.

Release notes

