Add markdown parsing and AST diffing pipeline #63

Open
sripathikrishnan wants to merge 2 commits into claude/docx-to-markdown-ast-8Ddud from claude/markdown-diff-operations-jSaad

@sripathikrishnan

Contributor

Summary

This PR introduces a complete markdown round-trip pipeline for extradocx, enabling users to edit documents as markdown and have changes tracked back to the original DOCX structure.

Key Changes

  • Markdown Parser (md_parser.py): New GFM markdown parser that converts markdown text into the same AST node types used by the DOCX parser. Supports headings, paragraphs, code blocks, lists, tables, block quotes, thematic breaks, and inline formatting (bold, italic, strikethrough, code, links, images).

  • AST Diff Engine (md_diff.py): Implements a two-layer diffing algorithm:

    • Block-level alignment using dynamic programming to match base blocks with derived blocks, detecting insertions, deletions, and modifications
    • Per-block content comparison that emits specific operation types (ReplaceHeading, ReplaceParagraph, etc.) only when content actually changes
    • Block matching combines word-level Jaccard similarity with per-kind costs, so two blocks are paired only when they share a kind and enough of their words
  • Diff Operations (diff_ops.py): New operation types representing edits:

    • Block-level: InsertBlock, DeleteBlock, ReplaceHeading, ReplaceParagraph, ReplaceCodeBlock, ReplaceTable, ReplaceList, ReplaceBlockQuote
    • List-item level: InsertListItem, DeleteListItem, ReplaceListItem
    • Each operation references nodes via xpath/index for traceability back to the original DOCX
  • Comprehensive Test Suite (test_md_diff.py): 868 lines of tests covering:

    • Markdown parsing of all supported block and inline elements
    • Diff detection with no changes (identity tests)
    • Text edits within blocks
    • Structural changes (insertions, deletions, type changes)
    • List and table modifications
    • Edge cases and realistic mixed documents
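
To make the operation types listed above more concrete, here is a minimal sketch of how a few of them might be modelled. The class names come from this PR, but every field name (`index`, `xpath`, `after_index`, `new_text`), the example xpath string, and the `describe` helper are assumptions made for illustration, not the actual contents of diff_ops.py.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class InsertBlock:
    # Hypothetical fields: position of the base block to insert after,
    # and the markdown source of the new block.
    after_index: int
    markdown: str


@dataclass(frozen=True)
class DeleteBlock:
    # Hypothetical fields: position in the base document plus the xpath
    # carried over from the base DOCX AST.
    index: int
    xpath: Optional[str]


@dataclass(frozen=True)
class ReplaceParagraph:
    index: int
    xpath: Optional[str]
    new_text: str


# Only three of the operation types are sketched here.
DiffOp = Union[InsertBlock, DeleteBlock, ReplaceParagraph]


def describe(op: DiffOp) -> str:
    """Render an operation as a short human-readable line (illustrative helper)."""
    if isinstance(op, InsertBlock):
        return f"insert block after {op.after_index}"
    if isinstance(op, DeleteBlock):
        return f"delete block {op.index} at {op.xpath}"
    return f"replace paragraph {op.index} at {op.xpath}"
```

Keeping one small immutable type per operation lets a consumer dispatch on the type alone, which matches the PR's stated goal of emitting specific operations rather than generic text patches.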

Implementation Details

  • The diff algorithm uses a cost-based DP approach inspired by sequence alignment, with configurable costs for different block types (tables, code blocks, paragraphs)
  • Matchability is gated by block kind and word-level similarity (minimum 30% Jaccard) to avoid spurious matches
  • Nodes parsed from markdown carry no xpath; each operation takes its xpath from the matching node in the base DOCX AST, which keeps edits traceable to the original document
  • The public API is minimal: parse_markdown(text) -> Document and diff(base, derived) -> list[DiffOp]
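
The gating and alignment described above can be sketched roughly as follows. This is not the code in md_diff.py: blocks are simplified to `(kind, text)` tuples, a single flat `INDEL_COST` stands in for the configurable per-kind costs, and the names `jaccard`, `match_cost`, and `align` are hypothetical. Only the 30% threshold is taken from the PR description.

```python
from typing import List, Optional, Tuple

Block = Tuple[str, str]  # (kind, text); a simplification of the real AST nodes

MIN_SIMILARITY = 0.3  # minimum Jaccard similarity stated in the PR
INDEL_COST = 1.0      # assumed flat cost for an insertion or deletion
INF = float("inf")


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: shared words divided by all distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def match_cost(x: Block, y: Block) -> float:
    """Cost of pairing two blocks; infinite when the pairing is not allowed."""
    if x[0] != y[0]:
        return INF  # different block kinds never match
    sim = jaccard(x[1], y[1])
    return INF if sim < MIN_SIMILARITY else 1.0 - sim


def align(base: List[Block], derived: List[Block]) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return (base_index, derived_index) pairs; None marks an insert or delete."""
    n, m = len(base), len(derived)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * INDEL_COST
    for j in range(1, m + 1):
        cost[0][j] = j * INDEL_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + INDEL_COST,                                   # delete base[i-1]
                cost[i][j - 1] + INDEL_COST,                                   # insert derived[j-1]
                cost[i - 1][j - 1] + match_cost(base[i - 1], derived[j - 1]),  # pair them
            )
    # Walk back from the bottom-right corner to recover the chosen alignment.
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + match_cost(base[i - 1], derived[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + INDEL_COST:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs
```

A matched pair with similarity below 1.0 would then go to the per-block comparison, while pairs containing `None` would become deletions or insertions.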

https://claude.ai/code/session_01UhhrCRypxppAPCw3exXtVj

claude added 2 commits April 9, 2026 01:15
Adds the ability to parse edited GFM markdown back into the AST, compare
it against the original DOCX-derived AST, and produce a list of typed
edit operations that describe the user's intent.

New modules:
- md_parser.py: GFM markdown → AST parser (same node types as DOCX parser)
- diff_ops.py: Operation types (InsertBlock, DeleteBlock, ReplaceHeading,
  ReplaceParagraph, ReplaceCodeBlock, ReplaceTable, ReplaceList, etc.)
- md_diff.py: DP-based block alignment + per-block diffing, inspired by
  extradoc/diffmerge/content_align.py

56 new tests covering parser, no-change identity, text edits, structural
changes, list/table/blockquote edits, formatting changes, complex scenarios,
and full DOCX round-trip.

https://claude.ai/code/session_01UhhrCRypxppAPCw3exXtVj

Add docx_apply.py that projects diff ops back onto the original DOCX
via XML manipulation (zipfile + ElementTree). Key design:

- Safe op ordering: replaces → reverse-indexed deletes → ascending-anchor
  inserts, ensuring XPath tag-count indices stay valid throughout
- Per-tag XPath navigation matching the DOCX parser's indexing scheme
- Inline content replacement clears existing w:r/w:hyperlink runs and
  rewrites them from the AST InlineNode tree
- Heading style mapping (Heading1–Heading6) for level changes
- Table cell rewriting preserving w:tc structure
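
The safe ordering in the first bullet can be illustrated with a small
sketch. Ops are reduced here to hypothetical (kind, index) tuples and a
plain Python list stands in for the document body; the real
docx_apply.py works on XML elements, so this only shows why the order
matters, not how the module is written.

```python
from typing import List, Tuple

Op = Tuple[str, int]  # (kind, index); a simplified stand-in for the real ops


def order_ops(ops: List[Op]) -> List[Op]:
    """Sort ops as: replaces, then deletes by descending index, then inserts by ascending anchor."""
    replaces = [op for op in ops if op[0] == "replace"]
    deletes = sorted((op for op in ops if op[0] == "delete"), key=lambda op: op[1], reverse=True)
    inserts = sorted((op for op in ops if op[0] == "insert"), key=lambda op: op[1])
    return replaces + deletes + inserts


def apply_deletes(blocks: List[str], ops: List[Op]) -> List[str]:
    """Apply only the delete ops, in the order produced by order_ops."""
    result = list(blocks)
    for kind, index in order_ops(ops):
        if kind == "delete":
            # Deleting from the highest index down never shifts a lower,
            # still-pending index.
            del result[index]
    return result
```

Replaces run first because they do not change the number of blocks.
Deletes then run from the end toward the start, so each remaining index
still points at the block it was computed for.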

Also adds 54 end-to-end tests (test_e2e.py) covering the full pipeline:
  DOCX → AST → markdown → edit → diff → apply → pandoc verification

Test scenarios cover every markdown feature:
  headings (text + level changes), paragraphs, bold/italic/strikethrough,
  inline code, links, bullet lists, ordered lists, tables, block deletion,
  block insertion, and complex multi-edits.

Commits BEFORE_test_report.docx + 26 after-state fixtures to
testdata/e2e_fixtures/ for manual review of before/after states.

All 139 tests pass.

https://claude.ai/code/session_01UhhrCRypxppAPCw3exXtVj

2 participants