Releases: PebbleRoad/table-stitcher
v0.4.3
Fixed
- Reprinted continuation-page headers appended as data rows on multi-page
merge (adapters/docling.py). When a table's column header is reprinted at
the top of each page — especially a multi-row (hierarchical) header — the
repeated header rows survived the merge as bogus data rows, misaligning the
stitched table. Injection now drops a body row when it is both flagged
column_header by Docling and a tokenized match (Jaccard ≥ 0.6) for the
reconstructed header block. Both signals are required: the flag alone is
unreliable (Docling over-flags rowspan/continuation data rows as headers),
and the tokenized comparison is punctuation-agnostic, so per-cell OCR drift
such as (S$) vs ($$) is tolerated without any threshold tuning. The merged
DataFrame (lt.df) is unchanged; only the injected document is de-duplicated.
A debug log reports each dropped row.
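The two-signal check described above can be sketched roughly as follows. This is an illustration only, not the code in adapters/docling.py: the helper names (tokens, jaccard, is_reprinted_header) and the tokenizer are assumptions, and only the 0.6 threshold and the "flag AND tokenized match" rule come from the note above.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; punctuation is discarded entirely."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_reprinted_header(
    row_cells: list[str],
    flagged_as_header: bool,
    header_cells: list[str],
    threshold: float = 0.6,
) -> bool:
    """Drop a body row only when BOTH signals agree (hypothetical helper)."""
    if not flagged_as_header:
        # The text match alone is not enough to drop a row.
        return False
    row_tokens = tokens(" ".join(row_cells))
    header_tokens = tokens(" ".join(header_cells))
    return jaccard(row_tokens, header_tokens) >= threshold
```

Because currency markers such as (S$) and ($$) contribute at most a stray token, a reprinted header still scores well above the threshold, while an over-flagged data row shares almost no tokens with the header and is kept.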
v0.4.2
Fixed
__version__ was hardcoded and stale (__init__.py). It read "0.2.0"
regardless of the installed release, since nothing tied it to the version in
pyproject.toml. It is now derived from the installed distribution metadata
via importlib.metadata.version("table-stitcher"), so it always reflects the
actual release (falling back to "0.0.0+unknown" when run from an
uninstalled source tree). The release gate now also asserts
__version__ matches the pyproject.toml version, so the two can't drift
again.
v0.4.1
Fixed
- Spanning body cells duplicated across columns on multi-page merge
(adapters/docling.py). Docling repeats a col_span=N cell's text across
every column it covers; the merge round-trip rebuilt those as N separate
col_span=1 cells, leaking a full-width description into every value column
and displacing the real values (a repeated col_span header behaved the same
way). Injection now matches each merged row back to its source grid row and
re-emits the original spans; rows the merger transformed (stitched
continuations, folded overflow) fall back to the flat 1x1 rebuild. The match
uses the original span metadata, never value equality, so coincidentally-equal
adjacent values (e.g. two plan columns sharing a cap) stay separate cells.
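The span round-trip can be illustrated with a simplified model. Everything below is an assumption made for the sake of the example: Cell, flatten, and rebuild are hypothetical names, and matching a merged row to its source row by comparing flattened text is a stand-in for whatever row matching the adapter really does. The point it shows is that cell boundaries are taken from the source span metadata, never inferred by grouping equal neighboring values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    text: str
    start_col: int
    col_span: int = 1


def flatten(cells: list[Cell], width: int) -> list[str]:
    """Repeat each cell's text across every column it covers."""
    row = [""] * width
    for cell in cells:
        for col in range(cell.start_col, cell.start_col + cell.col_span):
            row[col] = cell.text
    return row


def rebuild(merged_row: list[str], source_cells: Optional[list[Cell]]) -> list[Cell]:
    """Re-emit the source spans for an untransformed row, else flat 1x1 cells."""
    width = len(merged_row)
    if source_cells is not None:
        fits = all(c.start_col + c.col_span <= width for c in source_cells)
        if fits and flatten(source_cells, width) == merged_row:
            # Boundaries come from the source metadata, not from equal values.
            return [Cell(merged_row[c.start_col], c.start_col, c.col_span)
                    for c in source_cells]
    # Row was transformed by the merger (or has no source): flat rebuild.
    return [Cell(text, col, 1) for col, text in enumerate(merged_row)]
```

With this model, a full-width description comes back as one col_span=3 cell, while two plan columns that happen to share a cap remain two separate cells.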
v0.4.0
Added — intervening-content guard (#10)
Two tables that share a column schema but belong to different sections (a heading sits between them in reading order) are no longer stitched into one. A genuine page-split continuation has only page furniture between its fragments, so a section heading between two tables is a reliable "separate tables" signal.
- New MultiPageConfig.block_on_intervening_content (default True) and TableMeta.content_before; both merge paths (_classify_sequential_pair and should_force_orphan_merge) consult it.
- Running headers, including a banner that one page labels page_header and another mislabels section_header, are detected via near-identical (Jaccard >= 0.8) cross-page recurrence, so legitimate continuations still merge.
- Only section_header and title items block a merge; paragraphs, list items, captions, footnotes and figures are ignored.
Fixes over-eager merging of same-schema per-section tables (e.g. an insurance policy's eight Prestige | Elite | Classic benefit grids collapsing into one).
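A rough sketch of the guard, under stated assumptions: Item, is_running_header, and blocks_merge are hypothetical names, the tokenizer is illustrative, and restricting the recurrence check to header-like labels is a guess. Only the blocking labels and the 0.8 threshold are taken from the notes above.

```python
import re
from dataclasses import dataclass

BLOCKING_LABELS = {"section_header", "title"}
HEADER_LIKE_LABELS = {"page_header", "section_header", "title"}


@dataclass(frozen=True)
class Item:
    label: str
    text: str
    page: int


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_running_header(item: Item, all_items: list[Item], threshold: float = 0.8) -> bool:
    """True when near-identical header-like text recurs on another page."""
    toks = _tokens(item.text)
    return any(
        other.page != item.page
        and other.label in HEADER_LIKE_LABELS
        and _jaccard(toks, _tokens(other.text)) >= threshold
        for other in all_items
    )


def blocks_merge(between: list[Item], all_items: list[Item]) -> bool:
    """True when a genuine heading sits between two table fragments."""
    return any(
        item.label in BLOCKING_LABELS and not is_running_header(item, all_items)
        for item in between
    )
```

A banner mislabelled section_header on one page is excused because the same text appears as page_header elsewhere, whereas a one-off section heading blocks the merge.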
v0.3.0
Fixed
- Category rows incorrectly folded into preceding data rows (merger.py).
stitch_split_cells() previously folded any row with exactly one non-empty
cell into the row above it. Category/section-header rows (e.g. "Theme 2:
Trust and Credibility" with text only in col 0) matched this pattern and
were silently merged into the preceding data row, mangling participant IDs
and destroying table structure. The fix: a non-empty col 0 in the candidate
row signals a new record or section header, not an overflow, and folding
is skipped. Legitimate split-cell continuations always have col 0 empty.
Six existing fixture YAMLs updated to reflect the corrected (higher) row
counts; the old YAMLs encoded the buggy folded output.
- False merge of independent same-width headerless tables (merger.py).
When two adjacent tables both have is_headerless=True and the same column
count, the merger now requires a layout signal (the left table must end near
the bottom of its page, vert_bottom >= bottom_band_min) before merging.
Previously, column count alone was sufficient: three independent clinical
lab panels (each 4 columns, no header row) collapsed into one 22-row table.
Legitimate multi-page headerless tables are unaffected: they fill their pages
and always produce a strong layout signal.
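The corrected folding rule from the first fix can be sketched on plain lists of strings. fold_overflow_rows is a hypothetical stand-in, not the real stitch_split_cells(), which operates on the merger's own data structures; the sketch only shows the "col 0 must be empty" condition.

```python
def fold_overflow_rows(rows: list[list[str]]) -> list[list[str]]:
    """Fold single-cell overflow rows upward, unless col 0 is non-empty."""
    out: list[list[str]] = []
    for row in rows:
        filled = [i for i, cell in enumerate(row) if cell.strip()]
        # Overflow: exactly one non-empty cell, and it is NOT in col 0.
        is_overflow = bool(out) and len(filled) == 1 and filled[0] != 0
        if is_overflow:
            col = filled[0]
            prev = out[-1]
            prev[col] = f"{prev[col]} {row[col]}".strip()
        else:
            # Non-empty col 0 means a new record or a category row: keep it.
            out.append(list(row))
    return out
```

Under the old behavior the "Theme 2" row in an input like the one below would have been folded into the P01 row; with the col 0 check it survives as its own row.

```python
rows = [
    ["P01", "Trusts brand", "High"],
    ["", "and its support", ""],
    ["Theme 2: Trust and Credibility", "", ""],
    ["P02", "Neutral", "Low"],
]
folded = fold_overflow_rows(rows)
```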
Added
- Parser-neutral YAML fixture layer (tests/fixtures/tablemeta/) plus
tests/test_tablemeta_fixtures.py. New adapters can validate against the
merger's full test surface by feeding the same YAMLs through their own
extract(), with no PDF or OCR involvement.
- Public-API integration coverage: every fixture now runs through both
merge_multipage_tables() (parser-neutral) and stitch_tables()
(full pipeline including docling injection).
- scripts/release_gate.sh: offline-friendly release gate that runs unit
tests, rebuilds dist/, installs the wheel into a clean venv, and
smoke-tests the installed package. RELEASE_GATE_ONLINE=1 toggles
isolated build/install for CI.
Changed
- Core merger refactored for readability. merge_multipage_tables() is
now a four-phase orchestrator (setup → pass 1 sequential → pass 2 orphan repair → build) delegating to named helpers. _classify_sequential_pair()
isolates adjacent-pair merge logic for independent review. Behavior is
unchanged; 127 tests prove equivalence.
- align_dataframe_to_header() dispatches to per-policy handlers
(_overflow_preserve_extra, _overflow_warn_drop, _overflow_fail,
_overflow_merge_tail) instead of branching inline.
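The dispatch pattern in the second item might look like the sketch below. It works on a single row as a list of strings rather than a DataFrame, and the policy keys and what each handler does with overflow cells are guesses inferred from the handler names, not the library's documented semantics.

```python
import warnings
from typing import Callable


def _preserve_extra(row: list[str], width: int) -> list[str]:
    return list(row)  # assumed: keep the extra cells as-is


def _warn_drop(row: list[str], width: int) -> list[str]:
    warnings.warn(f"dropping {len(row) - width} overflow cell(s)")
    return list(row[:width])


def _fail(row: list[str], width: int) -> list[str]:
    raise ValueError(f"row has {len(row)} cells, header has {width}")


def _merge_tail(row: list[str], width: int) -> list[str]:
    # assumed: join all overflow cells into the last header column
    return [*row[: width - 1], " ".join(row[width - 1:])]


# Hypothetical policy names mapped to their handlers.
HANDLERS: dict[str, Callable[[list[str], int], list[str]]] = {
    "preserve_extra": _preserve_extra,
    "warn_drop": _warn_drop,
    "fail": _fail,
    "merge_tail": _merge_tail,
}


def align_row(row: list[str], width: int, policy: str) -> list[str]:
    """Pad short rows; hand overflowing rows to the handler for `policy`."""
    if len(row) <= width:
        return list(row) + [""] * (width - len(row))
    try:
        handler = HANDLERS[policy]
    except KeyError:
        raise ValueError(f"unknown overflow policy: {policy!r}") from None
    return handler(row, width)
```

A lookup table keeps each policy independently testable and makes an unknown policy fail loudly rather than fall through an else branch.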
Removed
- Dead pos_to_orig variable in the merger setup path.