Add markdown parsing and AST diffing pipeline #63
Open
sripathikrishnan wants to merge 2 commits into
Adds the ability to parse edited GFM markdown back into the AST, compare it against the original DOCX-derived AST, and produce a list of typed edit operations that describe the user's intent.

New modules:
- md_parser.py: GFM markdown → AST parser (same node types as DOCX parser)
- diff_ops.py: Operation types (InsertBlock, DeleteBlock, ReplaceHeading, ReplaceParagraph, ReplaceCodeBlock, ReplaceTable, ReplaceList, etc.)
- md_diff.py: DP-based block alignment + per-block diffing, inspired by extradoc/diffmerge/content_align.py

56 new tests covering parser, no-change identity, text edits, structural changes, list/table/blockquote edits, formatting changes, complex scenarios, and full DOCX round-trip.

https://claude.ai/code/session_01UhhrCRypxppAPCw3exXtVj
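For reviewers unfamiliar with DP-based block alignment, here is a minimal sketch of the idea. It is not the code in md_diff.py: the Block class, the align_blocks name, and the scoring (2 for an identical block, 1 for a same-kind block with changed text) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for an AST block node."""
    kind: str  # e.g. "heading", "paragraph"
    text: str


def _gain(a, b):
    """Score for pairing two blocks; 0 means they cannot be paired."""
    if a.kind != b.kind:
        return 0
    return 2 if a.text == b.text else 1


def align_blocks(base, derived):
    """Align two block lists with an LCS-style dynamic program.

    Returns (base_index, derived_index) pairs. A None on the derived side
    marks a deleted block; a None on the base side marks an inserted one.
    """
    n, m = len(base), len(derived)
    # score[i][j] = best total score for aligning base[i:] with derived[j:]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = max(score[i + 1][j], score[i][j + 1])
            gain = _gain(base[i], derived[j])
            if gain:
                best = max(best, score[i + 1][j + 1] + gain)
            score[i][j] = best

    # Walk the table from the top-left, preferring to pair blocks on ties.
    pairs = []
    i = j = 0
    while i < n and j < m:
        gain = _gain(base[i], derived[j])
        if gain and score[i][j] == score[i + 1][j + 1] + gain:
            pairs.append((i, j))
            i += 1
            j += 1
        elif score[i][j] == score[i + 1][j]:
            pairs.append((i, None))
            i += 1
        else:
            pairs.append((None, j))
            j += 1
    pairs.extend((k, None) for k in range(i, n))
    pairs.extend((None, k) for k in range(j, m))
    return pairs
```

Paired blocks whose text differs would then go to the per-block diffing layer, while unpaired blocks would map to insert or delete operations.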
Add docx_apply.py that projects diff ops back onto the original DOCX via XML manipulation (zipfile + ElementTree).

Key design:
- Safe op ordering: replaces → reverse-indexed deletes → ascending-anchor inserts, ensuring XPath tag-count indices stay valid throughout
- Per-tag XPath navigation matching the DOCX parser's indexing scheme
- Inline content replacement clears existing w:r/w:hyperlink runs and rewrites them from the AST InlineNode tree
- Heading style mapping (Heading1–Heading6) for level changes
- Table cell rewriting preserving w:tc structure

Also adds 54 end-to-end tests (test_e2e.py) covering the full pipeline: DOCX → AST → markdown → edit → diff → apply → pandoc verification

Test scenarios cover every markdown feature: headings (text + level changes), paragraphs, bold/italic/strikethrough, inline code, links, bullet lists, ordered lists, tables, block deletion, block insertion, and complex multi-edits.

Commits BEFORE_test_report.docx + 26 after-state fixtures to testdata/e2e_fixtures/ for manual review of before/after states.

All 139 tests pass.

https://claude.ai/code/session_01UhhrCRypxppAPCw3exXtVj
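The safe op ordering described in the key design can be sketched roughly as follows. This is only an illustration of the ordering rule, not the logic in docx_apply.py: the Op class, its fields, and order_ops are hypothetical, and the real ops address XML elements rather than plain integer indices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical, simplified edit op keyed by a block index."""
    kind: str   # "replace", "delete", or "insert"
    index: int  # target index (replace/delete) or anchor index (insert)
    payload: str = ""


def order_ops(ops):
    """Return ops in the order: replaces, reverse-indexed deletes, ascending inserts."""
    # Replaces do not change the number of blocks, so they can run first.
    replaces = [op for op in ops if op.kind == "replace"]
    # Deleting from the highest index down keeps lower indices valid.
    deletes = sorted(
        (op for op in ops if op.kind == "delete"),
        key=lambda op: op.index,
        reverse=True,
    )
    # Inserts run last, in ascending anchor order.
    inserts = sorted(
        (op for op in ops if op.kind == "insert"),
        key=lambda op: op.index,
    )
    return replaces + deletes + inserts


ops = [
    Op("insert", 1, "x"),
    Op("delete", 0),
    Op("replace", 2, "y"),
    Op("delete", 3),
    Op("insert", 0, "z"),
]
print([(op.kind, op.index) for op in order_ops(ops)])
# [('replace', 2), ('delete', 3), ('delete', 0), ('insert', 0), ('insert', 1)]
```

Because sorted is stable, ops that share an index keep their original relative order within each group.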
Summary
This PR introduces a complete markdown round-trip pipeline for extradocx, enabling users to edit documents as markdown and have changes tracked back to the original DOCX structure.
Key Changes
- Markdown Parser (md_parser.py): New GFM markdown parser that converts markdown text into the same AST node types used by the DOCX parser. Supports headings, paragraphs, code blocks, lists, tables, block quotes, thematic breaks, and inline formatting (bold, italic, strikethrough, code, links, images).
- AST Diff Engine (md_diff.py): Implements a two-layer diffing algorithm: DP-based block alignment, then per-block diffing.
- Diff Operations (diff_ops.py): New operation types representing edits:
  - InsertBlock, DeleteBlock, ReplaceHeading, ReplaceParagraph, ReplaceCodeBlock, ReplaceTable, ReplaceList, ReplaceBlockQuote
  - InsertListItem, DeleteListItem, ReplaceListItem
- Comprehensive Test Suite (test_md_diff.py): 868 lines of tests covering the parser, no-change identity, text edits, structural changes, list/table/blockquote edits, formatting changes, complex scenarios, and full DOCX round-trip.
Implementation Details
parse_markdown(text) -> Document and diff(base, derived) -> list[DiffOp]
https://claude.ai/code/session_01UhhrCRypxppAPCw3exXtVj
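To give a feel for what a list of typed operations could look like to a consumer, here is a small sketch. Only the class names InsertBlock, DeleteBlock, ReplaceParagraph, and DiffOp come from this PR; the fields and the describe helper are assumptions and may not match diff_ops.py.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class InsertBlock:
    after_index: int  # hypothetical field: base block index to insert after
    markdown: str     # hypothetical field: source text of the new block


@dataclass(frozen=True)
class DeleteBlock:
    index: int        # hypothetical field: base block index to remove


@dataclass(frozen=True)
class ReplaceParagraph:
    index: int        # hypothetical field: base block index to rewrite
    new_text: str     # hypothetical field: replacement paragraph text


# Illustrative union covering only the three op types sketched here.
DiffOp = Union[InsertBlock, DeleteBlock, ReplaceParagraph]


def describe(op: DiffOp) -> str:
    """Render one op as a short human-readable line by dispatching on its type."""
    if isinstance(op, InsertBlock):
        return f"insert after block {op.after_index}: {op.markdown!r}"
    if isinstance(op, DeleteBlock):
        return f"delete block {op.index}"
    if isinstance(op, ReplaceParagraph):
        return f"replace paragraph {op.index} with {op.new_text!r}"
    raise TypeError(f"unsupported op: {type(op).__name__}")


ops = [ReplaceParagraph(1, "Edited text."), DeleteBlock(3), InsertBlock(0, "## New")]
for op in ops:
    print(describe(op))
```

Keeping each edit as its own small type lets a consumer such as an apply step dispatch on the operation type instead of inspecting free-form diff output.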