Releases: Michaelliv/markit

v0.5.3

11 Apr 17:54

Opportunistic markdown source discovery

markit now fetches raw markdown from docs sites instead of converting their HTML. Four detection methods, in priority order:

  • Accept: text/markdown header on every request (Cloudflare, Vercel)
  • <link rel="alternate" type="text/markdown"> tags in the HTML
  • VitePress marker detection → fetches .md source files
  • /llms.txt for root URLs (Stripe, Anthropic, etc.)

No extra requests for normal sites. The second fetch only fires when a markdown source is confirmed in the response.
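
As a rough illustration of the first two checks, here is a minimal TypeScript sketch. It is not markit's actual code: `discoverMarkdown` and `Discovery` are hypothetical names, and the regex-based tag scan is a simplification that a real HTML parser would replace. It looks only at the response already in hand, so a second fetch is needed only when a source is found.

```typescript
// Hypothetical sketch; names and structure are assumptions, not markit's API.
type Discovery =
  | { kind: "already-markdown" }      // server honored Accept: text/markdown
  | { kind: "fetch"; url: string }    // a markdown alternate was advertised
  | { kind: "none" };                 // fall back to HTML conversion

function discoverMarkdown(pageUrl: string, contentType: string, body: string): Discovery {
  // Method 1: the response itself is markdown.
  if (contentType.toLowerCase().startsWith("text/markdown")) {
    return { kind: "already-markdown" };
  }

  // Method 2: <link rel="alternate" type="text/markdown" href="...">
  for (const tag of body.match(/<link\b[^>]*>/gi) ?? []) {
    const attr = (name: string): string | undefined =>
      tag.match(new RegExp(`\\b${name}\\s*=\\s*["']([^"']*)["']`, "i"))?.[1];

    const isAlternate = attr("rel")?.toLowerCase() === "alternate";
    const isMarkdown = attr("type")?.toLowerCase() === "text/markdown";
    const href = attr("href");
    if (isAlternate && isMarkdown && href) {
      // Resolve relative hrefs against the page URL.
      return { kind: "fetch", url: new URL(href, pageUrl).toString() };
    }
  }

  return { kind: "none" };
}
```

The VitePress and /llms.txt checks are left out because their exact markers are not described here.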

Closes #13

Full Changelog: v0.5.2...v0.5.3

v0.5.2

11 Apr 16:58

Full Changelog: v0.5.1...v0.5.2

v0.5.1

09 Apr 18:17

Fix: PDF table column splitting

Text boxes from mupdf sometimes span multiple table columns when fragments on the same line are close together. The grid code used the center point to assign the whole box to one cell, merging content that belongs in separate columns.

What changed

  • Cross-column splitting: text boxes that span vertical column boundaries are now split at word boundaries and placed in their correct cells
  • Header detection guard: wide paragraph text just above a table is no longer absorbed as a header row
  • Column layout fix: pages with tables no longer trigger false multi-column layout detection
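
The cross-column split can be pictured with a small TypeScript sketch. It assumes per-word x-coordinates are available for each text box, and the `Word` type and `splitAcrossColumns` function are hypothetical names used only for illustration. Each word is assigned to a column by its own center point, rather than assigning the whole box by the box's center.

```typescript
// Hypothetical sketch; not markit's implementation.
interface Word {
  text: string;
  x0: number; // left edge
  x1: number; // right edge
}

// boundaries: sorted x positions of the interior vertical column separators.
function splitAcrossColumns(words: Word[], boundaries: number[]): Map<number, string> {
  const buckets = new Map<number, string[]>();
  for (const word of words) {
    const center = (word.x0 + word.x1) / 2;
    // Count how many separators lie at or left of the word's center.
    let col = 0;
    for (const b of boundaries) {
      if (center >= b) col++;
      else break;
    }
    const bucket = buckets.get(col) ?? [];
    bucket.push(word.text);
    buckets.set(col, bucket);
  }
  // Join the words of each cell back into a single string.
  const cells = new Map<number, string>();
  buckets.forEach((texts, col) => cells.set(col, texts.join(" ")));
  return cells;
}
```

With one separator at x = 100, a box holding "Net revenue 1,204" would be divided so that "Net revenue" lands in column 0 and "1,204" in column 1.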

Tested on Anthropic's 244-page Claude Mythos Preview System Card → result

Test Fixtures v1

06 Apr 06:44

Binary test fixtures (PDFs, benchmark corpus) for running the test suite. Downloaded on demand by test/download-fixtures.sh.

v0.5.0

30 Mar 19:04

Apple iWork support and a DOCX table fix.

Apple iWork support

Pages, Keynote, and Numbers files are now supported. Turns out they're all just XML in a zip.

markit document.pages
markit presentation.key
markit spreadsheet.numbers
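
Since these formats are described as zip containers, a converter could sniff the file header before trying to unpack it. The TypeScript sketch below is an illustration only, with a hypothetical `isZipContainer` helper; it checks for the standard ZIP local-file-header signature (`PK\x03\x04`), which is how a non-empty archive begins.

```typescript
// Hypothetical helper; checks only the leading ZIP signature bytes.
function isZipContainer(bytes: Uint8Array): boolean {
  const signature = [0x50, 0x4b, 0x03, 0x04]; // "PK" 0x03 0x04
  return bytes.length >= signature.length && signature.every((b, i) => bytes[i] === b);
}
```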

Fixes

  • Fixed DOCX tables breaking when cells contain multiple paragraphs (#10)

v0.4.0

29 Mar 22:42

GitHub URLs, image extraction, and a crash fix.

GitHub URL support

Convert GitHub repos, files, gists, issues, and PRs directly to clean markdown. No scraping, no third-party proxies.

markit https://github.com/owner/repo
markit https://github.com/owner/repo/blob/main/src/index.ts
markit https://github.com/owner/repo/issues/42
markit https://gist.github.com/user/id
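
To show how the URL shapes above differ, here is a minimal TypeScript sketch that classifies a GitHub URL by its path. `GitHubTarget` and `classifyGitHubUrl` are hypothetical names, not markit's API, and the sketch says nothing about how the content is then fetched. It assumes the ref in a `/blob/` URL is a single path segment, which does not hold for branch names containing slashes.

```typescript
// Hypothetical sketch; not markit's implementation.
type GitHubTarget =
  | { kind: "repo"; owner: string; repo: string }
  | { kind: "file"; owner: string; repo: string; ref: string; path: string }
  | { kind: "issue" | "pull"; owner: string; repo: string; number: number }
  | { kind: "gist"; user: string; id: string };

function classifyGitHubUrl(input: string): GitHubTarget | null {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return null; // not a parseable absolute URL
  }
  const parts = url.pathname.split("/").filter(Boolean);

  if (url.hostname === "gist.github.com") {
    const [user, id] = parts;
    return user && id ? { kind: "gist", user, id } : null;
  }
  if (url.hostname !== "github.com") return null;

  const [owner, repo, section, ...rest] = parts;
  if (!owner || !repo) return null;
  if (!section) return { kind: "repo", owner, repo };

  if (section === "blob") {
    // Assumption: the first segment after /blob/ is the whole ref.
    const [ref, ...pathParts] = rest;
    if (ref && pathParts.length > 0) {
      return { kind: "file", owner, repo, ref, path: pathParts.join("/") };
    }
  }
  if (section === "issues" || section === "pull") {
    const n = rest[0] ?? "";
    if (/^\d+$/.test(n)) {
      return { kind: section === "issues" ? "issue" : "pull", owner, repo, number: Number(n) };
    }
  }
  return null; // unrecognized shape
}
```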

Image extraction for PPTX and DOCX

Embedded images are now extracted from PowerPoint and Word files. Images go to a temp directory by default, or to a custom path with --image-dir.

markit slides.pptx
markit document.docx --image-dir ./images

Fixes

  • Fixed XML entity expansion crash on large XLSX/PPTX/EPUB files with >1000 entity references

v0.3.0: PDF converter rewrite

27 Mar 22:19

PDF converter rewrite

Rewrote the PDF converter from scratch with mupdf (native WASM).

What's new

  • Table detection: vector line extraction + raycasting places text into markdown tables
  • Diagram filtering: block diagrams (sparse grids, repeated labels) are excluded from table detection
  • Multi-column layout: two-column documents (legal docs, datasheets) read in correct order
  • Header/footer stripping: repeated running headers removed across pages
  • Image extraction: diagrams cropped and saved as PNGs when imageDir is provided
  • CTM tracking: content stream coordinate transforms applied correctly
  • Agent skill: npx skills add Michaelliv/markit
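
To make the header/footer stripping item concrete, the TypeScript sketch below shows one common heuristic: treat a page's first line as a running header if the same line, with digits normalized, opens most pages. This is an assumption about the general technique, not a description of markit's code; `stripRunningHeaders` and its `minShare` parameter are hypothetical, and footers would need the same pass over each page's last line.

```typescript
// Hypothetical sketch of one heuristic; not markit's implementation.
// pages: each page as an array of text lines, top to bottom.
function stripRunningHeaders(pages: string[][], minShare = 0.6): string[][] {
  // Normalize digit runs so "Datasheet 3" and "Datasheet 4" compare equal.
  const normalize = (line: string): string => line.trim().replace(/\d+/g, "#");

  // Count how often each normalized first line appears across pages.
  const counts = new Map<string, number>();
  for (const page of pages) {
    const first = page[0];
    if (first !== undefined) {
      const key = normalize(first);
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }

  // Require at least two occurrences so a single page is never stripped.
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((page) => {
    const first = page[0];
    const repeated = first !== undefined && (counts.get(normalize(first)) ?? 0) >= threshold;
    return repeated ? page.slice(1) : page;
  });
}
```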

Performance

| PDF | Pages | Time |
| --- | --- | --- |
| Bitcoin whitepaper | 9 | 26ms |
| US Constitution | 16 | 56ms |
| Intel PCH datasheet | 224 | 640ms |
| NXP S32K3xx datasheet | 164 | 1.9s |

Testing

58 tests across 4 test files covering grid detection, rendering, extraction, and column detection. Validated against Intel, NXP, Microchip, and Bitcoin whitepaper PDFs.

v0.2.0

26 Mar 17:28

Full Changelog: v0.1.3...v0.2.0

v0.1.3

25 Mar 15:12
