feat: extract and describe embedded images in DOCX files #7
Closed
olearydj wants to merge 1 commit into Michaelliv:main from
Add image extraction to the DOCX converter using mammoth's convertImage hook and a custom Turndown rule. Images are captured during mammoth's HTML conversion, described via the existing options.describe callback when configured, and emitted as markdown by a Turndown rule that preserves structured context (tables, lists).

Key design decisions:
- Turndown rule approach: image nodes are converted to markdown during Turndown traversal, not via pre/post string replacement. This keeps table cells and list items intact and avoids regex fragility with escaped alt text.
- Description markdown is passed through verbatim by the Turndown rule, preserving formatting from the describe callback.
- Table cell detection adjusts output format (inline with <br>) to avoid breaking table structure.
- Local TurndownNodeLike type avoids DOM lib dependency.

14 tests covering: text extraction, image placeholders, alt text, describe callback, markdown preservation, dollar sequence safety, error fallback, table cell images, and no raw placeholder tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
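To make the Turndown rule approach concrete, here is a minimal sketch of what such a rule could look like. It is not the code from this PR: the `data-image-id` attribute, the `CapturedImage` shape, the `makeImageRule` helper, and the way newlines are collapsed inside table cells are all assumptions made for illustration. Only the rule shape (`filter` plus `replacement`) follows Turndown's documented rule format.

```typescript
// Minimal structural node type, so the sketch does not need the DOM lib.
interface TurndownNodeLike {
  nodeName: string;
  parentNode?: TurndownNodeLike | null;
  getAttribute?(name: string): string | null;
}

// Hypothetical record captured during HTML conversion, keyed by placeholder id.
interface CapturedImage {
  label: string;        // e.g. "Image: Logo" or "Image 2"
  description?: string; // markdown from the describe callback, if any
}

// Walk up the ancestors to see whether the node sits inside a table cell.
function isInsideTableCell(node: TurndownNodeLike): boolean {
  for (let p = node.parentNode; p; p = p.parentNode) {
    if (p.nodeName === "TD" || p.nodeName === "TH") return true;
  }
  return false;
}

// Build a rule object in Turndown's { filter, replacement } shape.
function makeImageRule(images: Map<string, CapturedImage>) {
  return {
    filter: (node: TurndownNodeLike): boolean => node.nodeName === "IMG",
    replacement: (_content: string, node: TurndownNodeLike): string => {
      // "data-image-id" is an assumed placeholder attribute, not a real API.
      const id = node.getAttribute?.("data-image-id") ?? "";
      const image = images.get(id);
      if (!image) return "";
      if (!image.description) return `*[${image.label}]*`;
      const heading = `**[${image.label}]**`;
      if (isInsideTableCell(node)) {
        // Keep everything on one line so the table row is not broken.
        return `${heading}<br>${image.description.replace(/\r?\n+/g, "<br>")}`;
      }
      // Outside tables the description markdown is passed through verbatim.
      return `\n\n${heading}\n\n${image.description}\n\n`;
    },
  };
}
```

A rule like this would typically be registered with `turndownService.addRule("docxImage", makeImageRule(images))`. Because the output is produced while Turndown walks the tree, no placeholder token ever needs to be found and replaced in the finished markdown string.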
Author
Thanks for the clarification here. I understand and respect the choice to keep image handling extract-first by default. One counter-proposal that might preserve that direction while covering a different workflow would be to make description-at-conversion-time an opt-in mode rather than a behavior change. It is something I commonly use in my workflows and had implemented in my own, less sophisticated tool, but I missed it here. Something like:
Why I think this could be worth supporting:
Let me know if you are open to that approach.
Summary
The DOCX converter currently uses mammoth's default HTML conversion which drops embedded images. This PR adds image extraction via mammoth's `convertImage` hook and a custom Turndown rule, giving DOCX files the same AI-powered image description capability that standalone image files and PPTX files (PR #6) already have.

Approach
Uses mammoth's `images.imgElement` callback to capture image buffers during HTML conversion, then a custom Turndown rule to convert placeholder `<img>` nodes into final markdown. This avoids post-Turndown string replacement, which breaks in structured contexts (tables, lists) and with escaped alt text.

Changes
- […] `options` parameter in the DOCX converter's `convert` method
- […] `convertImage` hook
- […] `options.describe` before Turndown runs
- […] `<br>` inline format)
- Local `TurndownNodeLike` type avoids DOM lib dependency in tsconfig
- 14 tests covering text extraction, image placeholders, alt text, describe callback, markdown preservation, `$` sequence safety, error fallback, table cell images, and no raw placeholder tokens

Behavior
- Without API key: images produce `*[Image: alt text]*` placeholders or `*[Image N]*` when no alt text exists.
- With configured provider: each embedded image is described via the same `options.describe` pipeline used by standalone image files, producing `**[Image: label]**` followed by the description markdown.
- In table cells: output uses inline `<br>` format to avoid breaking table structure.
- Error handling: if `describe` throws, falls back to placeholder text.

Test plan
- `bun run build`: clean tsc compilation
- `bun test`: 72 tests pass (58 existing + 14 new), 0 failures
- `bun run check`: biome passes

🤖 Generated with Claude Code
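
For readers skimming the behavior described above, the three cases (no provider, configured provider, and a `describe` call that throws) can be sketched as a single function. This is an illustration under stated assumptions, not the PR's implementation: the `Describe` signature, the `EmbeddedImage` shape, and the names `placeholderFor` and `renderImage` are hypothetical, and the label used when a described image has no alt text (`Image N`) is a guess.

```typescript
// Assumed callback shape; the real options.describe signature may differ.
type Describe = (data: Uint8Array, contentType: string) => Promise<string>;

// Hypothetical record for one image captured from the DOCX file.
interface EmbeddedImage {
  index: number; // 1-based position in the document
  alt?: string;
  contentType: string;
  data: Uint8Array;
}

// Placeholder text used when there is no description to show.
function placeholderFor(image: EmbeddedImage): string {
  return image.alt ? `*[Image: ${image.alt}]*` : `*[Image ${image.index}]*`;
}

async function renderImage(image: EmbeddedImage, describe?: Describe): Promise<string> {
  // No provider configured: emit the placeholder only.
  if (!describe) return placeholderFor(image);
  try {
    const description = await describe(image.data, image.contentType);
    const label = image.alt ? `Image: ${image.alt}` : `Image ${image.index}`;
    return `**[${label}]**\n\n${description}`;
  } catch {
    // A failing describe call must not fail the whole conversion.
    return placeholderFor(image);
  }
}
```

Catching the error per image means one failed description degrades to a placeholder while the rest of the document still converts.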