feat: extract and describe embedded images in DOCX files by olearydj · Pull Request #7 · Michaelliv/markit

olearydj · 2026-03-28T20:43:18Z

Summary

The DOCX converter currently uses mammoth's default HTML conversion, which drops embedded images. This PR adds image extraction via mammoth's convertImage hook and a custom Turndown rule, giving DOCX files the same AI-powered image description capability that standalone image files and PPTX files (PR #6) already have.

Approach

Uses mammoth's images.imgElement callback to capture image buffers during HTML conversion, then a custom Turndown rule to convert placeholder <img> nodes into final markdown. This avoids post-Turndown string replacement, which breaks in structured contexts (tables, lists) and with escaped alt text.
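One way post-conversion string replacement can go wrong is sketched below. This is a guess at what the "$ sequence safety" test guards against, not the PR's code. The token and description values are made up for illustration. `String.prototype.replace` treats `$&` in a replacement string as "the matched text", so a provider description containing that sequence gets rewritten unless a replacer function is used.

```typescript
// Hypothetical placeholder token; not taken from the PR.
const TOKEN = "IMGTOKEN0";

// A description a provider might return; "$&" is meant as literal text.
const description = "Price tag reads $& up";

// Stand-in for converter output that still contains the token.
const converted = `| cell | ${TOKEN} |`;

// With a string replacement, "$&" expands to the matched substring (the token).
const naive = converted.replace(TOKEN, description);

// With a replacer function, the returned string is inserted verbatim.
const safe = converted.replace(TOKEN, () => description);
```

Emitting the final markdown from inside a Turndown rule sidesteps this class of problem because no token ever has to be found and replaced afterwards.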

Changes

- Accept options parameter in the DOCX converter's convert method
- Capture image buffers via mammoth's convertImage hook
- Resolve descriptions via options.describe before Turndown runs
- Custom Turndown rule emits markdown directly, preserving:
  - Structured context (table cells get <br> inline format)
  - Description markdown from the provider (passed through verbatim)
  - Escaped alt text (no regex matching against Turndown output)
- Local TurndownNodeLike type avoids DOM lib dependency in tsconfig
- 14 tests covering placeholders, alt text, describe callback, markdown preservation, $ sequence safety, error fallback, table cell images, and no raw placeholder tokens
- DOCX test fixture with images in body text and table cells
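The table-cell handling and the local node type could look roughly like the sketch below. It assumes the rule only needs `nodeName` and `parentNode` from the node Turndown hands it. The function names and the exact output spacing are illustrative, not copied from the PR.

```typescript
// Minimal structural stand-in for the DOM node a Turndown rule receives.
// Declaring it locally avoids pulling the DOM lib into tsconfig.
interface TurndownNodeLike {
  nodeName: string;
  parentNode?: TurndownNodeLike | null;
}

// Walk up the ancestors to see whether the node sits inside a table cell.
function isInTableCell(node: TurndownNodeLike): boolean {
  let current = node.parentNode;
  while (current) {
    const tag = current.nodeName.toUpperCase();
    if (tag === "TD" || tag === "TH") return true;
    current = current.parentNode;
  }
  return false;
}

// Format one image: block form in body text, single-line <br> form in cells.
function renderImage(
  node: TurndownNodeLike,
  label: string,
  description?: string,
): string {
  const inCell = isInTableCell(node);
  if (!description) {
    const placeholder = `*[${label}]*`;
    return inCell ? placeholder : `\n\n${placeholder}\n\n`;
  }
  const heading = `**[${label}]**`;
  if (inCell) {
    // A raw newline would end the table row, so lines are joined with <br>.
    const lines = description.trim().split(/\r?\n+/);
    return `${heading}<br>${lines.join("<br>")}`;
  }
  // Outside tables the description markdown is passed through as-is.
  return `\n\n${heading}\n\n${description.trim()}\n\n`;
}
```

A function like `renderImage` would be called from the `replacement` callback of a rule registered with Turndown's `addRule`, with the description looked up from whatever was resolved before Turndown ran.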

Behavior

Without API key: images produce *[Image: alt text]* placeholders or *[Image N]* when no alt text exists.

With configured provider: each embedded image is described via the same options.describe pipeline used by standalone image files, producing **[Image: label]** followed by the description markdown.

In table cells: output uses inline <br> format to avoid breaking table structure.

Error handling: if describe throws, falls back to placeholder text.
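The placeholder and fallback behavior described above can be sketched as follows. The `Describe` signature and the `CapturedImage` shape are assumptions made for this example, and the real `options.describe` contract may differ. How the label is built when a described image has no alt text is also a guess.

```typescript
// Hypothetical shapes; the real options.describe signature may differ.
type Describe = (data: Uint8Array, mimeType: string) => Promise<string>;

interface CapturedImage {
  data: Uint8Array;
  mimeType: string;
  alt?: string;
}

// "Image: <alt>" when alt text exists, otherwise a 1-based "Image N".
function imageLabel(image: CapturedImage, index: number): string {
  const alt = image.alt?.trim();
  return alt ? `Image: ${alt}` : `Image ${index + 1}`;
}

// Resolve the markdown for one image, falling back to a placeholder when
// no provider is configured or the provider call fails.
async function describeOrPlaceholder(
  image: CapturedImage,
  index: number,
  describe?: Describe,
): Promise<string> {
  const label = imageLabel(image, index);
  const placeholder = `*[${label}]*`;
  if (!describe) return placeholder;
  try {
    const description = (await describe(image.data, image.mimeType)).trim();
    return description ? `**[${label}]**\n\n${description}` : placeholder;
  } catch {
    // A provider failure should not abort the whole conversion.
    return placeholder;
  }
}
```

Resolving every description up front keeps the Turndown rule itself synchronous, which matches the "before Turndown runs" ordering in the change list.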

Test plan

bun run build — clean tsc compilation
bun test — 72 tests pass (58 existing + 14 new), 0 failures
Manual test on DOCX with cat/dog images — descriptions generated correctly with Anthropic provider
bun run check — biome passes

🤖 Generated with Claude Code

Add image extraction to the DOCX converter using mammoth's convertImage hook and a custom Turndown rule. Images are captured during mammoth's HTML conversion, described via the existing options.describe callback when configured, and emitted as markdown by a Turndown rule that preserves structured context (tables, lists).

Key design decisions:
- Turndown rule approach: image nodes are converted to markdown during Turndown traversal, not via pre/post string replacement. This keeps table cells and list items intact and avoids regex fragility with escaped alt text.
- Description markdown is passed through verbatim by the Turndown rule, preserving formatting from the describe callback.
- Table cell detection adjusts output format (inline with <br>) to avoid breaking table structure.
- Local TurndownNodeLike type avoids DOM lib dependency.

14 tests covering: text extraction, image placeholders, alt text, describe callback, markdown preservation, dollar sequence safety, error fallback, table cell images, and no raw placeholder tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Michaelliv · 2026-03-29T22:38:37Z

thanks @olearydj! same as #6, added docx image extraction too. closing in favor of that, appreciate both PRs!

olearydj · 2026-03-30T18:19:35Z

Thanks for the clarification here. I understand and respect the choice to keep image handling extract-first by default.

One counter-proposal that might preserve that direction while covering a different workflow would be to make description-at-conversion-time an opt-in mode rather than a behavior change. It is something I commonly use in my workflows and had implemented in my own, less sophisticated tool, but I missed it here.

Something like:

default stays extract
add --image-mode extract|describe|both (or a narrower --describe-images flag)
extract: current behavior unchanged
describe: emit self-contained markdown descriptions/placeholders via the existing describe pipeline
both: extract files and also emit descriptions
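As a rough illustration of the flag semantics, here is a minimal parsing sketch. It does not reflect markit's actual CLI code. The function names are hypothetical, and only the `--image-mode` spelling and its three values come from the proposal above.

```typescript
// The three modes from the proposal; "extract" stays the default.
const IMAGE_MODES = ["extract", "describe", "both"] as const;
type ImageMode = (typeof IMAGE_MODES)[number];

// Hypothetical parser: reads "--image-mode <value>" from an argv-style array.
function parseImageMode(argv: readonly string[]): ImageMode {
  const flagIndex = argv.indexOf("--image-mode");
  if (flagIndex === -1) return "extract"; // no flag: current behavior unchanged
  const value = argv[flagIndex + 1];
  const mode = IMAGE_MODES.find((candidate) => candidate === value);
  if (!mode) {
    throw new Error(`--image-mode expects one of: ${IMAGE_MODES.join(", ")}`);
  }
  return mode;
}

// Whether image files should be written to disk in this mode.
function writesFiles(mode: ImageMode): boolean {
  return mode === "extract" || mode === "both";
}

// Whether descriptions or placeholders should be emitted into the markdown.
function emitsDescriptions(mode: ImageMode): boolean {
  return mode === "describe" || mode === "both";
}
```

Keeping the two predicates separate means the converter never has to branch on the mode string directly, so adding a further mode later would only touch this one place.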

Why I think this could be worth supporting:

no default behavior change
supports self-contained markdown for downstream agents/indexing/search
avoids repeated downstream vision passes for users who want one-shot conversion
can reuse the existing describe hook instead of introducing a separate provider path

Let me know if you are open to that approach.

Michaelliv closed this Mar 29, 2026

olearydj wants to merge 1 commit into Michaelliv:main from olearydj:feat/docx-image-extraction