Releases · Michaelliv/markit

Text boxes from mupdf sometimes span multiple table columns when fragments on the same line are close together. The grid code previously used each box's center point to assign the whole box to a single cell, which merged content that belongs in separate columns.

What changed

Cross-column splitting — text boxes that span vertical column boundaries are now split at word boundaries and placed in their correct cells
Header detection guard — wide paragraph text just above a table is no longer absorbed as a header row
Column layout fix — pages with tables no longer trigger false multi-column layout detection
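
To make the cross-column splitting idea concrete, here is a minimal sketch, not markit's actual code. It assumes a text box has already been broken into words with horizontal extents, and that column boundaries are a sorted list of x positions. The `Word` type and `splitAcrossColumns` function are hypothetical names used only for illustration.

```typescript
// Hypothetical shape; markit's real internal types are not shown in these notes.
interface Word {
  text: string;
  x0: number; // left edge
  x1: number; // right edge
}

// Place each word in the column whose [left, right) range contains the word's
// horizontal center, then join the words that landed in the same column.
// `boundaries` holds sorted x positions of vertical lines: N boundaries give N-1 columns.
function splitAcrossColumns(words: Word[], boundaries: number[]): string[] {
  const columnCount = Math.max(boundaries.length - 1, 0);
  const cells: string[][] = Array.from({ length: columnCount }, () => []);
  for (const word of words) {
    const center = (word.x0 + word.x1) / 2;
    for (let col = 0; col < columnCount; col++) {
      if (center >= boundaries[col] && center < boundaries[col + 1]) {
        cells[col].push(word.text);
        break;
      }
    }
    // Words whose center falls outside every column are dropped in this sketch.
  }
  return cells.map((parts) => parts.join(" "));
}

// One text box that straddles the boundary at x = 100.
const box: Word[] = [
  { text: "Core", x0: 10, x1: 40 },
  { text: "voltage", x0: 44, x1: 90 },
  { text: "1.8", x0: 110, x1: 130 },
  { text: "V", x0: 134, x1: 142 },
];
const cells = splitAcrossColumns(box, [0, 100, 200]);
console.log(cells); // [ 'Core voltage', '1.8 V' ]
```

Splitting on whole words rather than on raw x coordinates keeps a word intact even when a boundary line passes close to it.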

Tested on Anthropic's 244-page Claude Mythos Preview System Card → result

06 Apr 06:44

Michaelliv

test-fixtures-v1

e21c37b

Test Fixtures v1

Binary test fixtures (PDFs, benchmark corpus) for running the test suite. Downloaded on demand by test/download-fixtures.sh.

30 Mar 19:04

github-actions

v0.5.0

5efc62b

v0.5.0

Apple iWork support and a DOCX table fix.

Apple iWork support

Pages, Keynote, and Numbers files are now supported. Turns out they're all just XML in a zip.

markit document.pages
markit presentation.key
markit spreadsheet.numbers

Fixes

Fixed DOCX tables breaking when cells contain multiple paragraphs (#10)

29 Mar 22:42

github-actions

v0.4.0

56bc2dd

v0.4.0

GitHub URLs, image extraction, and a crash fix.

GitHub URL support

Convert GitHub repos, files, gists, issues, and PRs directly to clean markdown. No scraping, no third-party proxies.

markit https://github.com/owner/repo
markit https://github.com/owner/repo/blob/main/src/index.ts
markit https://github.com/owner/repo/issues/42
markit https://gist.github.com/user/id
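
The URL shapes above differ only in host and path segments, so telling them apart can be done with the standard `URL` class. The sketch below shows one possible way to classify them. The `GitHubTarget` type and `classifyGitHubUrl` function are hypothetical and are not taken from markit's source. How markit then fetches the content is not covered here.

```typescript
// Hypothetical result type for illustration only.
type GitHubTarget =
  | { kind: "repo"; owner: string; repo: string }
  | { kind: "file"; owner: string; repo: string; ref: string; path: string }
  | { kind: "issue" | "pull"; owner: string; repo: string; number: number }
  | { kind: "gist"; user: string; id: string };

function classifyGitHubUrl(input: string): GitHubTarget | null {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return null; // not a valid absolute URL
  }
  const parts = url.pathname.split("/").filter(Boolean);

  if (url.hostname === "gist.github.com" && parts.length >= 2) {
    return { kind: "gist", user: parts[0], id: parts[1] };
  }
  if (url.hostname !== "github.com" || parts.length < 2) return null;

  const [owner, repo, section, ...rest] = parts;
  if (section === undefined) return { kind: "repo", owner, repo };

  if (section === "blob" && rest.length >= 2) {
    // Assumes the ref is a single path segment; branch names that contain
    // "/" cannot be resolved from the URL alone.
    return { kind: "file", owner, repo, ref: rest[0], path: rest.slice(1).join("/") };
  }
  if ((section === "issues" || section === "pull") && /^\d+$/.test(rest[0] ?? "")) {
    return {
      kind: section === "issues" ? "issue" : "pull",
      owner,
      repo,
      number: Number(rest[0]),
    };
  }
  return null; // any other GitHub page is out of scope for this sketch
}

console.log(classifyGitHubUrl("https://github.com/owner/repo/issues/42"));
```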

Image extraction for PPTX and DOCX

Embedded images are now extracted from PowerPoint and Word files. Images go to a temp directory by default, or to a custom path with --image-dir.

markit slides.pptx
markit document.docx --image-dir ./images

Fixes

Fixed XML entity expansion crash on large XLSX/PPTX/EPUB files with >1000 entity references

27 Mar 22:19

Michaelliv

v0.3.0

7818b87

v0.3.0 — PDF converter rewrite

PDF converter rewrite

Rewrote the PDF converter from scratch with mupdf (native WASM).

What's new

Table detection — vector line extraction + raycasting places text into markdown tables
Diagram filtering — block diagrams (sparse grids, repeated labels) are excluded from table detection
Multi-column layout — two-column documents (legal docs, datasheets) read in correct order
Header/footer stripping — repeated running headers removed across pages
Image extraction — diagrams cropped and saved as PNGs when imageDir is provided
CTM tracking — content stream coordinate transforms applied correctly
Agent skill — npx skills add Michaelliv/markit
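
As an illustration of the header/footer stripping item in the list above, here is a rough sketch of one common approach: count how often the first and last line of each page repeats across the document, and drop those that appear on most pages. This is an assumption about the general technique, not a description of markit's implementation. The `stripRunningHeaders` function, its `minShare` parameter, and the digit normalization are all hypothetical.

```typescript
// pages: each page is an array of text lines in reading order.
// minShare: fraction of pages an edge line must appear on to count as running text.
function stripRunningHeaders(pages: string[][], minShare = 0.6): string[][] {
  // Replace digit runs so "Page 3" and "Page 4" are treated as the same line.
  const normalize = (line: string): string => line.trim().replace(/\d+/g, "#");

  const counts = new Map<string, number>();
  for (const page of pages) {
    if (page.length === 0) continue;
    const edges = [page[0], page[page.length - 1]].map(normalize);
    // Count each distinct edge line once per page.
    const unique = edges.filter((key, i) => edges.indexOf(key) === i);
    for (const key of unique) counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  // Require at least two occurrences so single-page documents are left alone.
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  const isRunning = (line: string): boolean =>
    (counts.get(normalize(line)) ?? 0) >= threshold;

  return pages.map((page) =>
    page.filter((line, i) => !((i === 0 || i === page.length - 1) && isRunning(line))),
  );
}

const pages = [
  ["ACME Datasheet", "Intro text", "Page 1"],
  ["ACME Datasheet", "More text", "Page 2"],
  ["ACME Datasheet", "Final text", "Page 3"],
];
console.log(stripRunningHeaders(pages)); // [ [ 'Intro text' ], [ 'More text' ], [ 'Final text' ] ]
```

Only the first and last line of each page are candidates here, so body text that happens to repeat in the middle of a page is never removed.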

Performance

PDF	Pages	Time
Bitcoin whitepaper	9	26ms
US Constitution	16	56ms
Intel PCH datasheet	224	640ms
NXP S32K3xx datasheet	164	1.9s

Testing

58 tests across 4 test files covering grid detection, rendering, extraction, and column detection. Validated against Intel, NXP, Microchip, and Bitcoin whitepaper PDFs.
