feat: extract and describe embedded images in DOCX files#7

Closed
olearydj wants to merge 1 commit into Michaelliv:main from olearydj:feat/docx-image-extraction

@olearydj

Summary

The DOCX converter currently uses mammoth's default HTML conversion, which drops embedded images. This PR adds image extraction via mammoth's convertImage hook and a custom Turndown rule, giving DOCX files the same AI-powered image description capability that standalone image files and PPTX files (PR #6) already have.

Approach

Uses mammoth's images.imgElement callback to capture image buffers during HTML conversion, then a custom Turndown rule to convert placeholder <img> nodes into final markdown. This avoids post-Turndown string replacement, which breaks in structured contexts (tables, lists) and with escaped alt text.
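
A minimal sketch of the capture step, not the PR's actual code. It assumes the object mammoth hands to an images.imgElement callback exposes `read()` and `contentType` (as mammoth's README documents; check this against your installed version). The `docx-image:` token scheme and the names `DocxImageLike`, `CapturedImage`, and `createImageCapture` are hypothetical.

```typescript
// Assumed shape of the image object passed to an images.imgElement callback.
interface DocxImageLike {
  contentType: string;
  read(): Promise<Uint8Array>; // a Node Buffer is a Uint8Array
}

interface CapturedImage {
  bytes: Uint8Array;
  mimeType: string;
}

// Hypothetical token prefix; any src value that cannot collide with a real URL works.
const TOKEN_PREFIX = "docx-image:";

function createImageCapture() {
  const captured = new Map<string, CapturedImage>();
  let next = 0;

  // Meant to be wrapped as mammoth.images.imgElement(handler) and passed as the
  // convertImage option. The returned object becomes attributes on the <img>.
  const handler = async (image: DocxImageLike): Promise<{ src: string }> => {
    // Reserve the token before awaiting so concurrent calls never share one.
    const token = `${TOKEN_PREFIX}${++next}`;
    const bytes = await image.read();
    captured.set(token, { bytes, mimeType: image.contentType });
    return { src: token };
  };

  return { captured, handler };
}
```

A Turndown rule can then look up each placeholder `<img>` by its `src` token instead of searching the generated markdown for it.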

Changes

  • Accept options parameter in the DOCX converter's convert method
  • Capture image buffers via mammoth's convertImage hook
  • Resolve descriptions via options.describe before Turndown runs
  • Custom Turndown rule emits markdown directly, preserving:
    • Structured context (table cells get <br> inline format)
    • Description markdown from the provider (passed through verbatim)
    • Escaped alt text (no regex matching against Turndown output)
  • Local TurndownNodeLike type avoids DOM lib dependency in tsconfig
  • 14 tests covering placeholders, alt text, describe callback, markdown preservation, $ sequence safety, error fallback, table cell images, and no raw placeholder tokens
  • DOCX test fixture with images in body text and table cells
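
To illustrate the local node type and the table-cell check from the list above, here is a sketch under assumptions: the PR text names `TurndownNodeLike` but does not show its fields, so the fields below are a guess at the minimum such a rule would read, and `isInsideTableCell` is a hypothetical helper name.

```typescript
// Structural stand-in for a DOM node, so tsconfig does not need the "dom" lib.
// The exact fields in the PR are not shown; these are assumed.
interface TurndownNodeLike {
  nodeName: string;
  parentNode?: TurndownNodeLike | null;
  getAttribute?(name: string): string | null;
}

// Walk up the ancestors and report whether any of them is a table cell.
function isInsideTableCell(node: TurndownNodeLike): boolean {
  for (let cur = node.parentNode; cur; cur = cur.parentNode) {
    const name = cur.nodeName.toUpperCase();
    if (name === "TD" || name === "TH") return true;
  }
  return false;
}
```

Because the type is structural, real DOM nodes passed in by Turndown would satisfy it without any cast.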

Behavior

Without API key: images produce *[Image: alt text]* placeholders or *[Image N]* when no alt text exists.

With configured provider: each embedded image is described via the same options.describe pipeline used by standalone image files, producing **[Image: label]** followed by the description markdown.

In table cells: output uses inline <br> format to avoid breaking table structure.

Error handling: if describe throws, falls back to placeholder text.
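
The four behaviors above can be sketched as one function. This is an illustration, not the PR's implementation: the `Describe` signature, the `EmbeddedImage` shape, and the exact spacing between the label and the description are assumptions, as is using `Image N` for the label when no alt text exists.

```typescript
// Hypothetical signature; the real options.describe may take different arguments.
type Describe = (bytes: Uint8Array, mimeType: string) => Promise<string>;

interface EmbeddedImage {
  index: number; // 1-based position in the document
  alt: string; // empty string when the DOCX carries no alt text
  bytes: Uint8Array;
  mimeType: string;
  inTableCell: boolean;
}

function placeholder(img: EmbeddedImage): string {
  const alt = img.alt.trim();
  return alt ? `*[Image: ${alt}]*` : `*[Image ${img.index}]*`;
}

function described(img: EmbeddedImage, description: string): string {
  const alt = img.alt.trim();
  const label = alt ? `**[Image: ${alt}]**` : `**[Image ${img.index}]**`;
  const body = description.trim();
  if (img.inTableCell) {
    // A markdown table row must stay on one line, so line breaks become <br>.
    return `${label}<br>${body.replace(/\s*\n+\s*/g, "<br>")}`;
  }
  return `${label}\n\n${body}`;
}

async function renderImage(img: EmbeddedImage, describe?: Describe): Promise<string> {
  if (!describe) return placeholder(img); // no provider configured
  try {
    return described(img, await describe(img.bytes, img.mimeType));
  } catch {
    return placeholder(img); // describe threw: fall back to placeholder text
  }
}
```

Resolving every description before Turndown runs keeps the rule itself synchronous, which matches the ordering listed under Changes.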

Test plan

  • bun run build — clean tsc compilation
  • bun test — 72 tests pass (58 existing + 14 new), 0 failures
  • Manual test on DOCX with cat/dog images — descriptions generated correctly with Anthropic provider
  • bun run check — biome passes

🤖 Generated with Claude Code

Add image extraction to the DOCX converter using mammoth's convertImage
hook and a custom Turndown rule. Images are captured during mammoth's
HTML conversion, described via the existing options.describe callback
when configured, and emitted as markdown by a Turndown rule that
preserves structured context (tables, lists).

Key design decisions:
- Turndown rule approach: image nodes are converted to markdown during
  Turndown traversal, not via pre/post string replacement. This keeps
  table cells and list items intact and avoids regex fragility with
  escaped alt text.
- Description markdown is passed through verbatim by the Turndown rule,
  preserving formatting from the describe callback.
- Table cell detection adjusts output format (inline with <br>) to
  avoid breaking table structure.
- Local TurndownNodeLike type avoids DOM lib dependency.

14 tests covering: text extraction, image placeholders, alt text,
describe callback, markdown preservation, dollar sequence safety,
error fallback, table cell images, and no raw placeholder tokens.
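
The commit message does not say where dollar sequences could cause trouble. One common source in JavaScript is `String.prototype.replace`, which interprets patterns such as `$&` inside a replacement string. The sketch below shows that general hazard with hypothetical helper names and a made-up token; it is not code from the PR, which avoids string replacement altogether.

```typescript
// Naive fill: the replacement string is scanned for $-patterns ($&, $$, ...).
function fillNaive(template: string, token: string, text: string): string {
  return template.replace(token, text);
}

// Safe fill: a replacer function's return value is inserted literally.
function fillSafe(template: string, token: string, text: string): string {
  return template.replace(token, () => text);
}

const template = "before __IMG_1__ after";
const description = "The sign reads $& off";

// $& expands to the matched token, corrupting the description.
console.log(fillNaive(template, "__IMG_1__", description));
// The description survives unchanged.
console.log(fillSafe(template, "__IMG_1__", description));
```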

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Michaelliv
Owner

thanks @olearydj! same as #6, added docx image extraction too. closing in favor of that, appreciate both PRs!

@Michaelliv Michaelliv closed this Mar 29, 2026
@olearydj
Author

olearydj commented Mar 30, 2026

Thanks for the clarification here. I understand and respect the choice to keep image handling extract-first by default.

One counter-proposal that might preserve that direction while covering a different workflow would be to make description-at-conversion-time an opt-in mode rather than a behavior change. It is something I commonly use in my workflows and had implemented in my own, less sophisticated tool, and I missed it here.

Something like:

  • default stays extract
  • add --image-mode extract|describe|both (or a narrower --describe-images flag)
  • extract: current behavior unchanged
  • describe: emit self-contained markdown descriptions/placeholders via the existing describe pipeline
  • both: extract files and also emit descriptions
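
Purely as a sketch of how such a flag could be parsed (the function names and the error message are hypothetical, and nothing here reflects the project's actual CLI code):

```typescript
type ImageMode = "extract" | "describe" | "both";
const IMAGE_MODES: readonly ImageMode[] = ["extract", "describe", "both"];

// Read --image-mode from an argv-style array; the default stays "extract".
function parseImageMode(argv: readonly string[]): ImageMode {
  const i = argv.indexOf("--image-mode");
  if (i === -1) return "extract";
  const value = argv[i + 1];
  if (value === undefined || IMAGE_MODES.indexOf(value as ImageMode) === -1) {
    throw new Error(`--image-mode expects one of: ${IMAGE_MODES.join("|")}`);
  }
  return value as ImageMode;
}

// Translate the mode into the two independent actions it controls.
function planFor(mode: ImageMode): { extractFiles: boolean; emitDescriptions: boolean } {
  return {
    extractFiles: mode !== "describe",
    emitDescriptions: mode !== "extract",
  };
}
```

Splitting the mode into two booleans means the describe path only needs to check `emitDescriptions`, so existing extract behavior is untouched when the flag is absent.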

Why I think this could be worth supporting:

  • no default behavior change
  • supports self-contained markdown for downstream agents/indexing/search
  • avoids repeated downstream vision passes for users who want one-shot conversion
  • can reuse the existing describe hook instead of introducing a separate provider path

Let me know if you are open to that approach.
