chore: release 0.4.3 #14

Merged: maish merged 2 commits into main from release-0.4.3 on Jun 11, 2026
@maish commented Jun 11, 2026

Collaborator

Patch release: fix reprinted continuation-page headers appended as data rows on multi-page merge.

Fix

When a table's column header is reprinted at the top of each page — especially a multi-row (hierarchical) header — the repeated header rows survived the merge as bogus data rows, misaligning the stitched table for downstream consumers.

Injection now drops a body row when it is both flagged column_header by Docling and a tokenized match (Jaccard ≥ 0.6) for the reconstructed header block. Both signals are required:

  • The flag alone is unreliable — Docling over-flags rowspan/continuation data rows as headers (e.g. repeated-header/rowspan-insurance-payout: 5 leading rows flagged, 3 are data). Trusting it alone deletes real data.
  • The tokenized comparison is punctuation-agnostic, so per-cell OCR/VLM drift like (S$) vs ($$) is tolerated with no threshold tuning.
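The two-signal rule above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the `Row` class, the function names, the alphanumeric tokenizer, and the choice to score a row against each header row and keep the best match are all assumptions. Only the 0.6 threshold, the requirement that both signals hold, and the debug log come from the description.

```python
import logging
import re
from dataclasses import dataclass

logger = logging.getLogger(__name__)

JACCARD_THRESHOLD = 0.6  # value quoted in the PR description


@dataclass
class Row:
    """Hypothetical stand-in for a table body row; not Docling's data model."""
    cells: list[str]
    column_header: bool  # the header flag, which may be wrong for data rows


def tokenize(cells):
    """Return the set of lowercase alphanumeric tokens; punctuation is dropped."""
    tokens = set()
    for cell in cells:
        tokens.update(re.findall(r"[a-z0-9]+", cell.lower()))
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets never match."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_reprinted_header(row, header_rows, threshold=JACCARD_THRESHOLD):
    """Require both signals: the header flag AND a close token match."""
    if not row.column_header:
        return False
    row_tokens = tokenize(row.cells)
    # Assumption: score against each header row and keep the best match.
    best = max(
        (jaccard(row_tokens, tokenize(h)) for h in header_rows),
        default=0.0,
    )
    return best >= threshold


def drop_reprinted_headers(body, header_rows, threshold=JACCARD_THRESHOLD):
    """Return the body rows with reprinted headers removed, logging each drop."""
    kept = []
    for index, row in enumerate(body):
        if is_reprinted_header(row, header_rows, threshold):
            logger.debug("dropping reprinted header row %d: %r", index, row.cells)
            continue
        kept.append(row)
    return kept


if __name__ == "__main__":
    header = [["Plan", "Annual limit (S$)", "Payout (S$)"]]
    body = [
        Row(["Plan", "Annual limit ($$)", "Payout ($$)"], True),  # OCR drift
        Row(["Plan A", "100,000", "5,000"], True),  # data row, mis-flagged
    ]
    for row in drop_reprinted_headers(body, header):
        print(row.cells)
```

In this sketch a flagged row with drifted currency symbols still clears the threshold because most of its tokens survive, while a mis-flagged data row shares too few tokens with the header to be dropped.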

The merged DataFrame (lt.df) is unchanged; only the injected DoclingDocument is de-duplicated. A debug log reports each dropped row.

Tests

  • 107 unit (1 new: drops reprinted header, keeps a column_header-mis-flagged data row).
  • 54 integration (full snapshot lane). Three fixtures that encoded the old leaky output were re-baselined with an explicit, stricter injected_rows assertion: 15-page-druglist, covid-misc-labs ([3,4] table), retirement-portfolio. Every dropped row was verified to be a reprinted header (e.g. Sl No/Drug name/Category), never data.
  • Verified against a real 3-page insurance-benefit table: 100 → 91 rows, body header pollution 9 → 0; header block, multi-level col_spans, and data integrity preserved.

🤖 Generated with Claude Code

maish and others added 2 commits June 11, 2026 12:02
…i-page merge

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@maish merged commit 14d60a2 into main on Jun 11, 2026
7 checks passed
@maish deleted the release-0.4.3 branch on June 11, 2026 04:22