Releases: PebbleRoad/table2rules
v0.6.3
Fixed
-
Label-only row-group headers now thread through real docling schedule
shapes. Three additional shapes from realTableItem.export_to_htmloutput
were dropping the line-item title from grouped values'row_headers:- Narrow title → full-width description band → values. The title's row-group
extent now extends through an immediately-following full-width description
band (a nested member of the header block, not a boundary), so the title
threads as the outer ancestor instead of being dropped:
9. Trip Cancellation > If your trip is cancelled… > 1. Adult insured person | …. - Multi-cell title rows (a leading item number/key plus a textual title, e.g.
10 | Travel delay) are now recognized as group headers. A row is a title
when at most one of its label cells is numeric-only — this admits the
number+title shape while still rejecting a data row whose value columns merely
happen to be empty (Average: | 80.2 | 10.7 | 3.3, ≥2 numeric). A repeating
key column is excluded from the promoted title so it is not duplicated. - Two-column
Label | Valueschedules. The left column is now promoted to the
row-label/stub even under a single-row thead header (Benefit | Maximum limit)
— Signal D, scoped to exactly two columns so multi-column property tables are
untouched. This also produces proper one-record-per-line output for plain
two-column relational tables (North | Sales: 100) instead of splitting each
row into two disconnectedHeader: valuelines.
New fixtures
matrix/label-only-title-then-description-bandand
matrix/label-only-title-number-key-matrix.Known limitation: a sub-grouped header of the form
<n>. | Group: | (empty)
with acolspantitle over a promoted descriptor column still falls back to
flat (the spanned cell trips the gate's "rules originate from<td>"
invariant). This is a pre-existing, separate gate interaction — not the
label-only-threading path — and is tracked for a follow-up. - Narrow title → full-width description band → values. The title's row-group
v0.6.2
Fixed
-
Label-only row-group headers are now threaded across columns (multi-column
matrix variant). A label-only band groups a row range, but 0.6.1 only reached
it from the value cell's own column and the row-label columns. When a group
header sits in a leading stub / line-number column while the sub-rows leave
that column empty and carry their identity in a different column (a numbered
schedule withplan × covervalue columns), the band was unreachable — so the
header was dropped entirely rather than threaded. The maze now scans all
columns for label-only bands (full-width and sourcescope="rowgroup"bands
keep the column-restricted scan, so unrelated stub dividers don't cross-attach),
and each group's extent still closes at the next group:10. Travel Delay > Adult under 70 | PLANS > BASIC: 100This also recovers data in three real-world corpus tables (
pubtabnet-180372,
-357665,-374857) whose year / cohort group headers — in a stub column the
data rows leave empty — were silently dropped in 0.6.1. New fixture
matrix/label-only-rowgroup-stub-column.
v0.6.1
Fixed
-
Label-only rows are now threaded as row-group headers. 0.6.0's row-group
threading only handled the full-width band form (a cell spanning the value
region /scope="rowgroup"). The label-only-row form — a body row whose
value columns are empty while a single leading label cell carries text
(9. Trip Cancellation | (empty) | (empty)), pervasive inLabel | Value
financial/insurance schedules — was missed: it was emitted as an orphaned
is_labelrule and the values beneath it lost their group identity. Such a
row is now promoted to a row-group ancestor and threaded into every value line
under it, until the next label-only row at the same level:10. Travel Delay > If the departure is delayed > 1. Adult insured person | Maximum limit (S$) > Value Plan: 100Detection stays geometric and additive to the existing band path:
- A row with no value-bearing
<td>but exactly one non-empty label
source cell is a group header. Consecutive label-only rows nest in order
(a title followed by a description), and the two forms compose — a
full-width section band and a label-only group nest consistently. - The single-label-cell requirement separates a group header from a data
row whose designated value columns merely happen to be empty: a summary row
spreading several values across its label cells (Total | n=4, or a row
under a header that over-promoted numeric columns to row labels) keeps the
is_labelpreservation path, so its cells are never invented into a
breadcrumb and misattributed onto the rows below. - A trailing label-only row with no value rows beneath it stays a label
(parity with the full-width-note guard) — no empty group is created.
- A row with no value-bearing
v0.6.0
Added
-
Hierarchical row-groups are threaded into every value line, symmetric with
the multi-level column header path. Previously rules-mode fully qualified
the column side of each value (… | A > B > C: value) but dropped the row
hierarchy — section bands, group headers, and the data-row label were emitted
as separate, indistinguishable lines, and value lines carried none of that
row context. A consumer feeding the text to an LLM had to reconstruct "which
group / which sub-class does this value belong to" by stateful inference
across lines, which is non-deterministic. Now each value line carries its full
row path, so one line maps to exactly one record:1. PERSONAL ACCIDENT > Accidental death and permanent disability > 1 > Each adult insured person under 70 | MAXIMUM LIMIT OF BENEFIT (S$) > VALUE PLAN > INDIVIDUAL COVER: 150,000Detection is geometric and honors explicit markup:
- A value-region-wide body cell (
is_full_width_notegeometry — reaches the
last column, spans a majority) is a row-group band: a full-width section
band, or a group header / description spanning the value region. It is
promoted toscope="rowgroup"and threaded as a bounded ancestor of the
rows it groups. Nested bands are bounded by colspan — a band's extent
ends at the next band of equal-or-wider span — so an inner group header does
not close an outer section band. Source cells already marked
scope="rowgroup"are honored as-is. - The maze walks the data cell's own column (which a value-region-wide
band spans) in addition to the row-label columns, so a band is found even
when the row's label cell is empty — fixing the case where a group's
unlabeled continuation row previously lost its band entirely. - Columns under the leftmost top-level header group (the row-label / stub
dimension, e.g. aSECTIONheader spanning the rownum and person-class
columns beside a separate value-header group) are threaded as row labels
even though they carry header text, so the row identity — not just the
leading number — appears on each value line.
Guards keep it faithful: a band is promoted only when its extent contains a
real data row (so a standalone trailing note is left as a note, never
stranded), and a full-width body cell that merely repeats a column header (a
units caption like "(In thousands…)" reprinted between sections) is not
treated as a band. Existingscope="rowgroup"tables (e.g. FinTabNet year
bands) are unchanged. New fixturesmatrix/rowgroup-nested-bands,
matrix/full-width-note-colspan,matrix/section-divider-td-colspan. - A value-region-wide body cell (
v0.5.2
Fixed
-
Full-width descriptive
<td colspan>cells no longer replicate across every
value column. In a plan×cover benefit matrix, a row shaped
<td>marker</td><td colspan="7">description…</td>is a full-width note, not a
per-column value. The per-position expansion in_build_ruleswas emitting it
once per spanned column, each stamped with that column's header — so a single
benefit description ("If the departure of your public transport is delayed…")
became the value of six unrelated plan×cover cells. A data cell that reaches
the last column and spans a majority of the grid's columns
(spans.is_full_width_note) is now attributed to its origin column's header at
every spanned position, and the exporter's origin-aware dedup collapses it to a
single line. The cell is still emitted at every position, so an overlapping-span
corruption (a rowspan intruding into the note's row) is still detected by the
gate and fails open to flat. Legitimate narrow spans (a right-edgecolspan=2
amount covering two sub-columns of one group) keep their per-column fan-out.
New fixturematrix/full-width-note-colspan. -
Body section dividers no longer bleed onto every row as fabricated column
headers. A single full-width<td colspan="N">divider (e.g. a benefits
schedule1. PERSONAL ACCIDENTrow) expands to N non-empty logical positions,
so the colspan-expanded cell count read it as a full header row —
detect_header_blockswept the first divider and the body rows between it and
the first clean data row into the<thead>, and they then bled onto every line
as column headers (observed on a 100-row matrix: a benefit name attached to 474
of 478 output lines, including unrelated sections). The header now ends at the
first of a series (≥ 2) of full-width single-cell dividers — distinguishing
body section dividers from a one-off header subtitle like "(Dollars in
thousands)". Such dividers stay in the body and render once as full-width notes;
Fix 8 no longer promotes a single full-width cell to<th scope="row">(which
stranded it, since it has no value column to anchor a rule). New fixture
matrix/section-divider-td-colspan. -
All
mypyerrors insrc/table2rulesresolved (find_all("table")result
narrowed toTagbefore use).
v0.5.1
Fixed
-
De-spanned section headers are preserved instead of dropped. A body row
whose row-label is present but which carries no independent value — the shape
a full-width section title takes after an OCR/HTML pipeline drops its
colspan— used to vanish entirely, silently losing the label and leaving
repeated sub-item rows (Each child insured person) context-less.
_build_rulesnow emits a label-only rule for such rows, preserving the text
verbatim. The label is kept as-is rather than reconstructed into a section
breadcrumb, because such a row is structurally indistinguishable from a leaf
row with a genuinely missing value. Two encodings are covered:- Empty value column — e.g. a benefits-schedule
1. Accidental death …
row, or a balance-sheetCurrent assets:/ FinTabNetSegments:label. - Value echoes the column header — e.g. a
24. COVID-19 Coverage Extension | Sum Insuredrow. The echoedSum Insuredvalue is a
self-echo thatclean_rulesstrips; previously that took the section
label down with it. The row is now recognised as label-only first, so the
label survives while the redundant echo is still dropped.
A
LogicRule.is_labelflag marks these rows; the confidence gate treats them
as pass-through (excluded from header-attachment and coverage scoring) so a
table does not degrade to flat output merely for carrying them. New fixture
relational/despanned-section-headersexercises both encodings. Recovers
previously-dropped labels across the real-world corpus. - Empty value column — e.g. a benefits-schedule
Changed
- The real-world corpus is now byte-checked in CI.
test_regression_golds
previously skippedtests/realworld/, leaving 401 benchmark golds with
nothing asserting them — two had silently drifted out of sync with the code.
All real-world golds are now frozen and byte-compared alongside the
hand-authored fixtures. The correctness/robustness oracles still assert the
output is right; the golds now assert it does not change unless a human
regenerates them, which is what makes a silent-drop regression visible.
scripts/benchmark.pyno longer emits stray golds for top-level*.md
(README, scratch files), keeping--update-goldin lockstep with the test.
v0.5.0
Added
- Two universal structural witnesses extend
simple_repair.detect_header_block
so headless tables whose row 0 was previously misclassified as data now
promote correctly to a header block:- Fuller-than-body. Row 0 is disqualified as a clean data row when its
non-empty cell count is strictly greater than the minimum non-empty
count of the non-divider rows below it AND every column row 0 fills is
covered by at least one body row. Catches headerless tables with
implicit-rowspan group-label columns (a leading column repeated at
group starts and left empty in continuation rows), multi-stub
indentation pyramids (one of several leading columns filled per row by
hierarchy level), and alternating coefficient/std-error layouts. The
body-coverage clause excludes receipts whose row 0 is a wider line
item over narrower totals — there the body never uses row 0's extra
columns. - Cell-type inversion. Row 0 is disqualified when it contains at
least one corner-stub<td>(where the body majority is<th>) and a
strict majority of compared columns have row 0's cell tag inverted
versus the body majority. Catches header rows authored as
[td, th, th]above body rows[th, td, td]. The corner-stub
requirement is load-bearing: an all-<th>row above an all-<td>
body would also "invert" every column, but Fix 7 already wraps that
case in<thead>via the contiguous-<th>chain.
- Fuller-than-body. Row 0 is disqualified as a clean data row when its
- Both witnesses apply only at row 0 — beyond the first row, "fuller than
the body below" and "cell tags differ from body majority" describe normal
body geometry, not a header. - Fixtures
tests/fixtures/relational/headerless-{corner-stub-inversion,implicit-rowspan-group,stub-pyramid}.md
exercise the three new patterns.
v0.4.1
Fixed
extract_cell_textnow skips text nodes whose direct parent is a<style>
or<script>tag. Wikipedia multi-column list templates inject inline
<style>blocks directly inside<td>cells; previously the raw CSS leaked
into emitted rule values while the quality gate still reported
mode=rules, score=1.00— a silent failure invisible to callers.
v0.4.0
Header detection reframed as a set of universal structural rules, and
the Layer-2 / Layer-3 real-world corpus doubled to add IBM 10-K
financial tables (FinTabNet). Combined, the rules unlock rules-format
output for 199 of 200 FinTabNet fixtures (up from 0 with the
previous "all row-0 cells non-empty" heuristic) while leaving
PubTabNet's 170/200 pass count unchanged.
Structural rule — header-block detection (replaces Fix 4's
row-0-only heuristic)
A new deterministic rule in simple_repair.detect_header_block picks
the header block via two geometric definitions:
- Clean data row. A row where every logical position is an origin
cell at this row (no rowspan copy from above) with
rowspan == colspan == 1and non-empty text. The first such row
(with col 0 non-empty) marks the header/body boundary. - Row-stub column. A column empty in every non-divider header row
AND non-empty in every non-divider body row. The conjunction is
load-bearing: neither condition alone distinguishes a stub from a
missing group-header cell (where a top-level group doesn't span a
later column) or an always-populated data column.
A leading row is header-compatible iff every empty cell in it lies in
a row-stub column. This subsumes three previously disjoint cases:
- Dense first-row headers (receipts, simple relational): row 0 is
a clean data row with no empties, so the stub-col condition is
trivially satisfied. Old Fix 4's "row 0 all non-empty + body has
empties" special-case collapses into the same rule. - Multi-row headers with colspan group labels (FinTabNet
hierarchical, benefits-style matrix): header rows carry colspan > 1
cells and are not themselves clean data rows, so the detection
naturally extends through them to the first plain body row. - Financial 10-K tables with empty row-stub labels (FinTabNet
single-row headers): col 0 is structurally identified as a stub
column; the empty col-0 cell in row 0 is now permitted as the
row-stub-column signature rather than disqualifying.
Body cells in stub columns are promoted to <th scope="row">
directly, generalizing Fix 8's rowspan-based dimensional-column
promotion to also cover single-row-header tables.
Additional invariants
-
Row-group divider propagation (new). A body row whose single
non-empty logical cell sits in a stub column is structurally a
row-group header for the rows that follow. Fix 4 now promotes that
cell to<th scope="rowgroup">, andmaze_pathfinderwalks up the
stub column to include it inrow_path— but only within its
extent. Explicitrowspan > 1uses the rowspan range; divider
rows withrowspan == 1run until the next rowgroup origin in the
same column. This captures the FinTabNet year-label pattern
("2014" row between Q1–Q4 blocks) without mis-nesting explicit
rowgroup peers like "North" / "South" withrowspan="2"each. -
Stub-column rule uses strict-majority count (refined). The
earlier rule required a stub column to be non-empty in every
non-divider body row. That rejected tables with an unlabeled
trailing summary row (a FinTabNet pattern where the totals line
leaves the row-label cell blank:— | $1,573,043 | ...). Fix 4
now accepts a column as a stub when it is non-empty in strictly
more body rows than it is empty in — still a deterministic integer
comparison, still refuses to promote sparsely-filled data columns. -
first_data_idxanchor relaxed (refined). The earlier anchor
required the first data row to have every cell non-empty. That
rejected tables whose body rows are inherently gappy (a column
that only fills on certain events — e.g., an aggregate fair-value
column that has a value only when shares vest). The anchor now
requires col 0 non-empty, ≥ 2 non-empty logical cells, and no span
structure — the remaining conditions still separate body rows from
header rows, since header rows carry either an empty col 0 or a
span marker. -
Column-header detection ignores
<th scope="row">.grid_parser
Step 1 previously accepted any<th>-or-empty row as the primary
column-header row, including rows where Fix 5 had promoted a "Total"
label to<th scope="row">. That misidentification caused
mid-table summary rows to be treated as headers, producing
fabricated column paths. Now Step 1 requires<th>cells without
scope="row". -
Fix 8's
active_rowspancounter tracks pre-promoted<th>cells.
When Fix 4 has already promoted the first-column cell via the
stub-column path, Fix 8 still needs to track its rowspan so
subsequent rows aren't mistaken for first-column content. The
counter was previously only updated inside Fix 8's promotion branch,
leaving it stuck at 0 when the cell was already<th>— which in
turn over-promoted data cells at logical col > 0 to row-stub status.
Corpus
-
FinTabNet oracle corpus (200 fixtures): pulled from
apoidea/fintabnet-htmlvia
scripts/build_fintabnet_fixtures.py.
With the five structural invariants in place, 199/200 fixtures
produce rules-format output that oracle-matches. The single remaining
flat-fallback case uses footnote-marker separator columns — an
unusual layout pattern where narrow unlabeled columns sit between
dollar-amount columns to hold footnote references. -
Regression golds updated: two matrix fixtures
(conflicting-spans-scope,scope-rowgroup-colgroup) had captured
the pre-existing Fix 8 bug. Updated golds reflect the corrected
output where a numeric data cell at logical col > 0 in a row whose
col 0 is covered by an above rowspan is no longer misclassified as
a row-subheader.
Quality gate
- New gate reason
partial_column_coverage. Detects when column
headers cover some rules in a table but not others — the silent
label-shift failure mode where a multi-level header omits a slot
for the row-label column, shifting every column label one position
right and leaving the rightmost data column unlabeled. Tables that
hit this case now fall back to flat rendering with the reason
surfaced inTableReport, instead of emitting wrong facts at full
confidence. Contributed by @BeastxD7 in
#2.
Public API additions
TableReport.text— the rendered output for this table only,
the same lines that contribute to the concatenated string returned
alongside the report. Lets callers passing whole-document HTML keep
per-table provenance without having to split the flat blob
themselves.TableReport.caption— text of the table's<caption>element
when present, otherwiseNone. Only direct<caption>children
are read; surrounding headings andidattributes are intentionally
ignored —table_indexremains the only stable positional
identifier.