Releases: Michaelliv/markit
v0.5.3
Opportunistic markdown source discovery
markit now fetches raw markdown from docs sites instead of converting their HTML. Four detection methods, in priority order:
- `Accept: text/markdown` header on every request (Cloudflare, Vercel)
- `<link rel="alternate" type="text/markdown">` tags in the HTML
- VitePress marker detection: fetches `.md` source files
- `/llms.txt` for root URLs (Stripe, Anthropic, etc.)
No extra requests for normal sites. The second fetch only fires when a markdown source is confirmed in the response.
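A rough sketch of how that priority order could be expressed as one pure function over the first response. The type, the function name, the choice of the generator meta tag as the VitePress marker, and the `.html` to `.md` rewrite are all assumptions for illustration, not markit's actual code. These notes also don't say how `/llms.txt` is confirmed before fetching, so the sketch simply proposes it for any root URL.

```typescript
// Hypothetical types and function name, for illustration only.
type MarkdownSource =
  | { kind: "direct"; markdown: string } // response body is already markdown
  | { kind: "fetch"; url: string }       // a second request is worth making
  | { kind: "none" };                    // convert the HTML as usual

function findMarkdownSource(pageUrl: string, contentType: string, body: string): MarkdownSource {
  // 1. The server honored the Accept: text/markdown request header.
  if (/^text\/markdown\b/i.test(contentType.trim())) {
    return { kind: "direct", markdown: body };
  }

  // 2. <link rel="alternate" type="text/markdown" href="..."> in the HTML.
  const linkTags = body.match(/<link\b[^>]*>/gi) ?? [];
  for (const tag of linkTags) {
    const href = /\bhref\s*=\s*["']([^"']+)["']/i.exec(tag);
    if (
      href &&
      /\brel\s*=\s*["']alternate["']/i.test(tag) &&
      /\btype\s*=\s*["']text\/markdown["']/i.test(tag)
    ) {
      // The href may be relative; resolving it against pageUrl is left to the caller.
      return { kind: "fetch", url: href[1] ?? "" };
    }
  }

  const parts = /^(https?:\/\/[^\/?#]+)([^?#]*)/i.exec(pageUrl);
  if (!parts) return { kind: "none" };
  const origin = parts[1] ?? "";
  const rawPath = parts[2] ?? "";
  const path = rawPath === "" ? "/" : rawPath;

  // 3. VitePress marker (assumed here to be the generator meta tag).
  if (/<meta\b[^>]*\bname\s*=\s*["']generator["'][^>]*\bcontent\s*=\s*["']VitePress/i.test(body)) {
    const page = /\/$/.test(path) ? path + "index" : path.replace(/\.html$/i, "");
    return { kind: "fetch", url: origin + page + ".md" };
  }

  // 4. Root URLs fall back to /llms.txt.
  if (path === "/") return { kind: "fetch", url: origin + "/llms.txt" };

  return { kind: "none" };
}
```

Because every check runs on data already in hand, a page with no markdown marker returns `none` and costs no second request.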
Closes #13
Full Changelog: v0.5.2...v0.5.3
v0.5.2
Full Changelog: v0.5.1...v0.5.2
v0.5.1
Fix: PDF table column splitting
Text boxes from mupdf sometimes span multiple table columns when fragments on the same line are close together. The grid code used the center point to assign the whole box to one cell, merging content that belongs in separate columns.
What changed
- Cross-column splitting: text boxes that span vertical column boundaries are now split at word boundaries and placed in their correct cells
- Header detection guard: wide paragraph text just above a table is no longer absorbed as a header row
- Column layout fix: pages with tables no longer trigger false multi-column layout detection
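The cross-column split can be pictured with a small sketch. It assumes each word in a text box carries its own horizontal extent and that the column boundaries are known x positions; the `Word` shape and `splitAcrossColumns` name are hypothetical, not taken from markit's source.

```typescript
// Hypothetical shape: one word of a text box with its horizontal extent.
interface Word {
  text: string;
  x0: number; // left edge
  x1: number; // right edge
}

// Split one text box into per-column strings. `boundaries` holds the x
// positions of the vertical lines between columns.
function splitAcrossColumns(words: Word[], boundaries: number[]): string[] {
  const cells: string[][] = boundaries.map((): string[] => []);
  cells.push([]); // n boundaries separate n + 1 columns
  for (const word of words) {
    const center = (word.x0 + word.x1) / 2;
    // The column index is the number of boundaries left of the word's center.
    const column = boundaries.filter((b) => b <= center).length;
    const cell = cells[column];
    if (cell) cell.push(word.text);
  }
  return cells.map((cell) => cell.join(" "));
}
```

Assigning by each word's center rather than the whole box's center is what keeps "Net revenue" and "2023" in separate cells when mupdf reports them as one box.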
Tested on Anthropic's 244-page Claude Mythos Preview System Card: result
Test Fixtures v1
Binary test fixtures (PDFs, benchmark corpus) for running the test suite. Downloaded on demand by test/download-fixtures.sh.
v0.5.0
Apple iWork support and a DOCX table fix.
Apple iWork support
Pages, Keynote, and Numbers files are now supported. Turns out they're all just XML in a zip.
markit document.pages
markit presentation.key
markit spreadsheet.numbers
Fixes
- Fixed DOCX tables breaking when cells contain multiple paragraphs (#10)
v0.4.0
GitHub URLs, image extraction, and a crash fix.
GitHub URL support
Convert GitHub repos, files, gists, issues, and PRs directly to clean markdown. No scraping, no third-party proxies.
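As an illustration of telling those URL kinds apart, here is a minimal classifier sketch. The type and function names are hypothetical, and it assumes the branch or tag name in a `blob` URL is a single path segment; how markit routes these URLs internally is not described in these notes.

```typescript
// Hypothetical result type, for illustration only.
type GitHubTarget =
  | { kind: "repo"; owner: string; repo: string }
  | { kind: "file"; owner: string; repo: string; ref: string; path: string }
  | { kind: "issue" | "pull"; owner: string; repo: string; number: number }
  | { kind: "gist"; user: string; id: string }
  | { kind: "unknown" };

function classifyGitHubUrl(url: string): GitHubTarget {
  const gist = /^https:\/\/gist\.github\.com\/([^\/]+)\/([^\/?#]+)/.exec(url);
  if (gist) return { kind: "gist", user: gist[1] ?? "", id: gist[2] ?? "" };

  const m = /^https:\/\/github\.com\/([^\/]+)\/([^\/?#]+)(?:\/([^?#]*))?/.exec(url);
  if (!m) return { kind: "unknown" };
  const owner = m[1] ?? "";
  const repo = m[2] ?? "";
  const rest = (m[3] ?? "").replace(/\/+$/, ""); // path after owner/repo

  if (rest === "") return { kind: "repo", owner, repo };

  // Assumes the ref (branch or tag) contains no slash.
  const file = /^blob\/([^\/]+)\/(.+)$/.exec(rest);
  if (file) return { kind: "file", owner, repo, ref: file[1] ?? "", path: file[2] ?? "" };

  const item = /^(issues|pull)\/(\d+)$/.exec(rest);
  if (item) {
    const kind = item[1] === "pull" ? "pull" : "issue";
    return { kind, owner, repo, number: Number(item[2]) };
  }
  return { kind: "unknown" };
}
```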
markit https://github.com/owner/repo
markit https://github.com/owner/repo/blob/main/src/index.ts
markit https://github.com/owner/repo/issues/42
markit https://gist.github.com/user/id
Image extraction for PPTX and DOCX
Embedded images are now extracted from PowerPoint and Word files. Images go to a temp directory by default, or to a custom path with --image-dir.
markit slides.pptx
markit document.docx --image-dir ./images
Fixes
- Fixed XML entity expansion crash on large XLSX/PPTX/EPUB files with >1000 entity references
v0.3.0: PDF converter rewrite
PDF converter rewrite
Rewrote the PDF converter from scratch with mupdf (native WASM).
What's new
- Table detection: vector line extraction + raycasting places text into markdown tables
- Diagram filtering: block diagrams (sparse grids, repeated labels) are excluded from table detection
- Multi-column layout: two-column documents (legal docs, datasheets) read in correct order
- Header/footer stripping: repeated running headers removed across pages
- Image extraction: diagrams cropped and saved as PNGs when `imageDir` is provided
- CTM tracking: content stream coordinate transforms applied correctly
- Agent skill: `npx skills add Michaelliv/markit`
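To make the header/footer stripping item concrete, here is one simple way such a pass could work. It is a sketch under stated assumptions: pages arrive as arrays of text lines, only the first and last line of each page are candidates, and digits are masked so page numbers compare equal. The function names and the 0.6 threshold are hypothetical; the actual converter may rely on positions rather than text.

```typescript
// Replace digit runs so "Page 3" and "Page 4" count as the same line.
function normalizeLine(line: string): string {
  return line.trim().replace(/\d+/g, "#");
}

// `pages` is one array of text lines per page. A first or last line is
// dropped when its normalized form sits at a page edge on at least
// `minShare` of the pages (and on at least two pages).
function stripRunningLines(pages: string[][], minShare = 0.6): string[][] {
  const counts = new Map<string, number>();
  for (const lines of pages) {
    const edgeKeys = lines
      .filter((_, i) => i === 0 || i === lines.length - 1)
      .map((line) => normalizeLine(line))
      .filter((key, i, all) => all.indexOf(key) === i); // count each key once per page
    for (const key of edgeKeys) counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((lines) =>
    lines.filter((line, i) => {
      const atEdge = i === 0 || i === lines.length - 1;
      return !(atEdge && (counts.get(normalizeLine(line)) ?? 0) >= threshold);
    })
  );
}
```

Requiring at least two matching pages keeps a single-page document from losing its first and last lines.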
Performance
| Document | Pages | Time |
|---|---|---|
| Bitcoin whitepaper | 9 | 26ms |
| US Constitution | 16 | 56ms |
| Intel PCH datasheet | 224 | 640ms |
| NXP S32K3xx datasheet | 164 | 1.9s |
Testing
58 tests across 4 test files covering grid detection, rendering, extraction, and column detection. Validated against Intel, NXP, Microchip, and Bitcoin whitepaper PDFs.
v0.2.0
Full Changelog: v0.1.3...v0.2.0
v0.1.3
Full Changelog: https://github.com/Michaelliv/markit/commits/v0.1.3