Releases: PebbleRoad/table-stitcher

v0.4.3

11 Jun 04:27

Fixed

  • Reprinted continuation-page headers appended as data rows on multi-page
    merge (adapters/docling.py). When a table's column header is reprinted at
    the top of each page — especially a multi-row (hierarchical) header — the
    repeated header rows survived the merge as bogus data rows, misaligning the
    stitched table. Injection now drops a body row when it is both flagged
    column_header by Docling and a tokenized match (Jaccard ≥ 0.6) for the
    reconstructed header block. Both signals are required: the flag alone is
    unreliable (Docling over-flags rowspan/continuation data rows as headers),
    and the tokenized comparison is punctuation-agnostic, so per-cell OCR drift
    such as (S$) vs ($$) is tolerated without any threshold tuning. The merged
    DataFrame (lt.df) is unchanged; only the injected document is de-duplicated.
    A debug log reports each dropped row.
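The two-signal check described above can be sketched as follows. This is a minimal illustration, not the code in adapters/docling.py: the function names, the alphanumeric tokenizer, and the sample rows are all assumptions; only the 0.6 Jaccard threshold and the "flag AND token match" rule come from the note.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; punctuation is discarded (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_reprinted_header(row_cells, flagged_as_header, header_cells, threshold=0.6):
    """Drop a body row only when BOTH signals agree.

    flagged_as_header stands in for Docling's column_header flag (hypothetical
    parameter); header_cells stands in for the reconstructed header block.
    """
    if not flagged_as_header:
        return False
    row_tokens = tokens(" ".join(row_cells))
    header_tokens = tokens(" ".join(header_cells))
    return jaccard(row_tokens, header_tokens) >= threshold


# Illustrative rows (made up for this sketch).
header = ["Benefit", "Limit (S$)", "Co-pay (%)"]
reprint = ["Benefit", "Limit ($$)", "Co-pay (%)"]   # OCR drift in one cell
data = ["Hospital stay", "1,500", "10"]             # over-flagged data row

drop_reprint = is_reprinted_header(reprint, True, header)
drop_data = is_reprinted_header(data, True, header)
```

Here the drifted reprint still shares four of five tokens with the header, so it is dropped, while the over-flagged data row shares none and is kept.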

v0.4.2

08 Jun 14:09

Fixed

  • __version__ was hardcoded and stale (__init__.py). It read "0.2.0"
    regardless of the installed release, since nothing tied it to the version in
    pyproject.toml. It is now derived from the installed distribution metadata
    via importlib.metadata.version("table-stitcher"), so it always reflects the
    actual release (falling back to "0.0.0+unknown" when run from an
    uninstalled source tree). The release gate now also asserts
    __version__ matches the pyproject.toml version, so the two can't drift
    again.

v0.4.1

08 Jun 13:53

Fixed

  • Spanning body cells duplicated across columns on multi-page merge
    (adapters/docling.py). Docling repeats a col_span=N cell's text across
    every column it covers; the merge round-trip rebuilt those as N separate
    col_span=1 cells, leaking a full-width description into every value column
    and displacing the real values (a repeated col_span header behaved the same
    way). Injection now matches each merged row back to its source grid row and
    re-emits the original spans; rows the merger transformed (stitched
    continuations, folded overflow) fall back to the flat 1x1 rebuild. The match
    uses the original span metadata, never value equality, so coincidentally-equal
    adjacent values (e.g. two plan columns sharing a cap) stay separate cells.
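The span re-emission idea can be sketched roughly as below. Everything here is an assumption for illustration: the Cell type, the helper names, and in particular the way a merged row is located in the source grid (by comparing it with the span-expanded source row), which the note does not specify. What the sketch does take from the note is that spans come from the source cells' own col_span metadata, that adjacent equal values are never collapsed, and that unmatched rows fall back to a flat 1x1 rebuild.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    text: str
    col_span: int = 1


def expand(cells):
    """Flatten a source row: a col_span=N cell repeats its text across N columns."""
    flat = []
    for cell in cells:
        flat.extend([cell.text] * cell.col_span)
    return flat


def reemit_row(merged_values, source_rows):
    """Re-emit the original spans when the merged row still corresponds to a
    source grid row; otherwise rebuild it as flat 1x1 cells."""
    for src in source_rows:
        if expand(src) == list(merged_values):
            return list(src)          # spans taken from source metadata
    return [Cell(value) for value in merged_values]


# Illustrative source grid (made up for this sketch).
source = [
    [Cell("Outpatient benefits", col_span=3)],
    [Cell("Annual cap"), Cell("5,000"), Cell("5,000")],
]

spanning = reemit_row(["Outpatient benefits"] * 3, source)
equal_values = reemit_row(["Annual cap", "5,000", "5,000"], source)
transformed = reemit_row(["Annual cap (cont.)", "5,000", "5,000"], source)
```

The second row shows why value equality is not used to infer spans: the two "5,000" cells are equal by coincidence and remain separate cells, because the source metadata says each has col_span=1.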

v0.4.0

29 May 10:32
c8a5d82

Added — intervening-content guard (#10)

Two tables that share a column schema but belong to different sections (a heading sits between them in reading order) are no longer stitched into one. A genuine page-split continuation has only page furniture between its fragments, so a section heading between two tables is a reliable "separate tables" signal.

  • New MultiPageConfig.block_on_intervening_content (default True) and TableMeta.content_before; both merge paths (_classify_sequential_pair and should_force_orphan_merge) consult it.
  • Running headers, including a banner that one page labels page_header and another mislabels section_header, are detected by near-identical (Jaccard >= 0.8) recurrence across pages, so legitimate continuations still merge.
  • Only section_header and title items block a merge; paragraphs, list items, captions, footnotes and figures are ignored.

Fixes over-eager merging of same-schema per-section tables (e.g. an insurance policy's eight Prestige | Elite | Classic benefit grids collapsing into one).
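A rough sketch of the guard described above. The function names, the (label, text) tuple shape, and the choice to compare against header-like text collected from other pages are assumptions; the blocking labels and the 0.8 Jaccard threshold are taken from the notes.

```python
import re

BLOCKING_LABELS = {"section_header", "title"}


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_running_header(text, other_page_texts, threshold=0.8):
    """A heading that recurs near-identically on another page is page furniture."""
    t = _tokens(text)
    return any(_jaccard(t, _tokens(other)) >= threshold for other in other_page_texts)


def blocks_merge(content_between, other_page_texts):
    """content_between: (label, text) items found between two tables in reading order."""
    for label, text in content_between:
        if label not in BLOCKING_LABELS:
            continue                  # paragraphs, captions, footnotes, ... are ignored
        if is_running_header(text, other_page_texts):
            continue                  # mislabelled banner, not a real section break
        return True
    return False


# Illustrative inputs (made up for this sketch).
banners = ["Acme Insurance Policy Schedule"]
real_heading = [("paragraph", "See notes overleaf."),
                ("section_header", "Section 4: Elite Plan Benefits")]
mislabelled_banner = [("section_header", "Acme Insurance - Policy Schedule")]

separate_tables = blocks_merge(real_heading, banners)
continuation = blocks_merge(mislabelled_banner, banners)
```

With these inputs the genuine section heading blocks the merge, while the recurring banner does not, so the page-split continuation would still be stitched.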

v0.3.0

06 May 08:10

Fixed

  • Category rows incorrectly folded into preceding data rows (merger.py).
    stitch_split_cells() previously folded any row with exactly one non-empty
    cell into the row above it. Category/section-header rows (e.g. "Theme 2:
    Trust and Credibility" with text only in col 0) matched this pattern and
    were silently merged into the preceding data row, mangling participant IDs
    and destroying table structure. The fix: a non-empty col 0 in the candidate
    row signals a new record or section header — not an overflow — and folding
    is skipped. Legitimate split-cell continuations always have col 0 empty.
    Six existing fixture YAMLs updated to reflect the corrected (higher) row
    counts — the old YAMLs encoded the buggy folded output.

  • False merge of independent same-width headerless tables (merger.py).
    When two adjacent tables both have is_headerless=True and the same column
    count, the merger now requires a layout signal (the left table must end near
    the bottom of its page, vert_bottom >= bottom_band_min) before merging.
    Previously, column count alone was sufficient — three independent clinical
    lab panels (each 4 columns, no header row) collapsed into one 22-row table.
    Legitimate multi-page headerless tables are unaffected: they fill their pages
    and always produce a strong layout signal.
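The two rules above can be sketched as follows. This is an illustration on plain lists, not the code in merger.py: the function names are hypothetical, and the default of 0.85 for bottom_band_min (and the idea that vert_bottom is a 0 to 1 fraction of page height) is an assumption made only so the example runs.

```python
def is_overflow_row(row):
    """A split-cell continuation has exactly one non-empty cell, and it is not col 0."""
    non_empty = [i for i, cell in enumerate(row) if cell.strip()]
    return len(non_empty) == 1 and non_empty[0] != 0


def fold_overflow_rows(rows):
    """Fold continuation rows into the row above; leave category rows alone."""
    out = []
    for row in rows:
        if out and is_overflow_row(row):
            prev = out[-1]
            out[-1] = [f"{p} {c}".strip() if c.strip() else p
                       for p, c in zip(prev, row)]
        else:
            out.append(list(row))
    return out


def may_merge_headerless(left_cols, right_cols, left_vert_bottom,
                         bottom_band_min=0.85):  # assumed default and units
    """Equal column counts are not enough; the left table must end near the page bottom."""
    return left_cols == right_cols and left_vert_bottom >= bottom_band_min


# Illustrative rows (made up for this sketch).
rows = [
    ["P01", "Found pricing", "High"],
    ["", "page confusing", ""],                   # overflow of the row above
    ["Theme 2: Trust and Credibility", "", ""],   # category row, col 0 non-empty
    ["P02", "Trusted reviews", "Low"],
]
folded = fold_overflow_rows(rows)
```

In the sample, only the second row is folded; the category row keeps its own line because its single non-empty cell sits in col 0.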

Added

  • Parser-neutral YAML fixture layer (tests/fixtures/tablemeta/) plus
    tests/test_tablemeta_fixtures.py. New adapters can validate against the
    merger's full test surface by feeding the same YAMLs through their own
    extract() — no PDF or OCR involvement.
  • Public-API integration coverage: every fixture now runs through both
    merge_multipage_tables() (parser-neutral) and stitch_tables()
    (full pipeline including docling injection).
  • scripts/release_gate.sh — offline-friendly release gate that runs unit
    tests, rebuilds dist/, installs the wheel into a clean venv, and
    smoke-tests the installed package. RELEASE_GATE_ONLINE=1 toggles
    isolated build/install for CI.

Changed

  • Core merger refactored for readability. merge_multipage_tables() is
    now a four-phase orchestrator (setup → pass 1 sequential → pass 2
    orphan repair → build) delegating to named helpers.
    _classify_sequential_pair() isolates adjacent-pair merge logic for
    independent review. Behavior is unchanged; 127 tests prove equivalence.
  • align_dataframe_to_header() dispatches to per-policy handlers
    (_overflow_preserve_extra, _overflow_warn_drop, _overflow_fail,
    _overflow_merge_tail) instead of branching inline.
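The per-policy dispatch mentioned above might look roughly like this. The handler names mirror those in the note, but their signatures, the policy keys, the behavior of each handler, and the list-based align_row stand-in for align_dataframe_to_header() are all assumptions made for a self-contained example.

```python
import warnings


def _overflow_preserve_extra(row, width):
    return list(row)                              # keep the extra cells as they are


def _overflow_warn_drop(row, width):
    warnings.warn(f"dropping {len(row) - width} overflow cell(s)")
    return list(row[:width])


def _overflow_fail(row, width):
    raise ValueError(f"row has {len(row)} cells, header has {width}")


def _overflow_merge_tail(row, width):
    # Join everything past the last header column into that column.
    return list(row[:width - 1]) + [" ".join(row[width - 1:])]


# Hypothetical policy keys mapped to their handlers.
_OVERFLOW_HANDLERS = {
    "preserve_extra": _overflow_preserve_extra,
    "warn_drop": _overflow_warn_drop,
    "fail": _overflow_fail,
    "merge_tail": _overflow_merge_tail,
}


def align_row(row, width, policy):
    """Pad short rows; hand rows that are too wide to the handler for the policy."""
    if len(row) <= width:
        return list(row) + [""] * (width - len(row))
    try:
        handler = _OVERFLOW_HANDLERS[policy]
    except KeyError:
        raise ValueError(f"unknown overflow policy: {policy!r}") from None
    return handler(row, width)


merged_tail = align_row(["a", "b", "c", "d"], 3, "merge_tail")
```

A lookup table like this keeps each policy in its own small function, so adding or reviewing a policy does not require touching the alignment logic itself.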

Removed

  • Dead pos_to_orig variable in the merger setup path.