Releases · PebbleRoad/table2rules

14 Jun 14:14

maish

v0.6.3

6521b09

v0.6.3 Latest

Latest

Fixed

Label-only row-group headers now thread through real docling schedule
shapes. Three additional shapes from real TableItem.export_to_html output
were dropping the line-item title from grouped values' row_headers:
- Narrow title → full-width description band → values. The title's row-group
  extent now extends through an immediately-following full-width description
  band (a nested member of the header block, not a boundary), so the title
  threads as the outer ancestor instead of being dropped:
  9. Trip Cancellation > If your trip is cancelled… > 1. Adult insured person | ….
- Multi-cell title rows (a leading item number/key plus a textual title, e.g.
  10 | Travel delay) are now recognized as group headers. A row is a title
  when at most one of its label cells is numeric-only — this admits the
  number+title shape while still rejecting a data row whose value columns merely
  happen to be empty (Average: | 80.2 | 10.7 | 3.3, ≥2 numeric). A repeating
  key column is excluded from the promoted title so it is not duplicated.
- Two-column Label | Value schedules. The left column is now promoted to the
  row-label/stub even under a single-row thead header (Benefit | Maximum limit)
  — Signal D, scoped to exactly two columns so multi-column property tables are
  untouched. This also produces proper one-record-per-line output for plain
  two-column relational tables (North | Sales: 100) instead of splitting each
  row into two disconnected Header: value lines.
New fixtures matrix/label-only-title-then-description-band and
matrix/label-only-title-number-key-matrix.

Known limitation: a sub-grouped header of the form <n>. | Group: | (empty)
with a colspan title over a promoted descriptor column still falls back to
flat (the spanned cell trips the gate's "rules originate from <td>"
invariant). This is a pre-existing, separate gate interaction — not the
label-only-threading path — and is tracked for a follow-up.

Assets 2

14 Jun 12:32

maish

v0.6.2

4fc6dd9

v0.6.2

Fixed

Label-only row-group headers are now threaded across columns (multi-column
matrix variant). A label-only band groups a row range, but 0.6.1 only reached
it from the value cell's own column and the row-label columns. When a group
header sits in a leading stub / line-number column while the sub-rows leave
that column empty and carry their identity in a different column (a numbered
schedule with plan × cover value columns), the band was unreachable — so the
header was dropped entirely rather than threaded. The maze now scans all
columns for label-only bands (full-width and source scope="rowgroup" bands
keep the column-restricted scan, so unrelated stub dividers don't cross-attach),
and each group's extent still closes at the next group:
```
10. Travel Delay > Adult under 70 | PLANS > BASIC: 100
```
This also recovers data in three real-world corpus tables (pubtabnet-180372,
-357665, -374857) whose year / cohort group headers — in a stub column the
data rows leave empty — were silently dropped in 0.6.1. New fixture
matrix/label-only-rowgroup-stub-column.

Assets 2

14 Jun 11:40

maish

v0.6.1

d1af76e

v0.6.1

Fixed

Label-only rows are now threaded as row-group headers. 0.6.0's row-group
threading only handled the full-width band form (a cell spanning the value
region / scope="rowgroup"). The label-only-row form — a body row whose
value columns are empty while a single leading label cell carries text
(9. Trip Cancellation | (empty) | (empty)), pervasive in Label | Value
financial/insurance schedules — was missed: it was emitted as an orphaned
is_label rule and the values beneath it lost their group identity. Such a
row is now promoted to a row-group ancestor and threaded into every value line
under it, until the next label-only row at the same level:
```
10. Travel Delay > If the departure is delayed > 1. Adult insured person | Maximum limit (S$) > Value Plan: 100
```
Detection stays geometric and additive to the existing band path:
- A row with no value-bearing <td> but exactly one non-empty label
  source cell is a group header. Consecutive label-only rows nest in order
  (a title followed by a description), and the two forms compose — a
  full-width section band and a label-only group nest consistently.
- The single-label-cell requirement separates a group header from a data
  row whose designated value columns merely happen to be empty: a summary row
  spreading several values across its label cells (Total | n=4, or a row
  under a header that over-promoted numeric columns to row labels) keeps the
  is_label preservation path, so its cells are never invented into a
  breadcrumb and misattributed onto the rows below.
- A trailing label-only row with no value rows beneath it stays a label
  (parity with the full-width-note guard) — no empty group is created.

Assets 2

11 Jun 09:06

maish

v0.6.0

6315893

v0.6.0

Added

Hierarchical row-groups are threaded into every value line, symmetric with
the multi-level column header path. Previously rules-mode fully qualified
the column side of each value (… | A > B > C: value) but dropped the row
hierarchy — section bands, group headers, and the data-row label were emitted
as separate, indistinguishable lines, and value lines carried none of that
row context. A consumer feeding the text to an LLM had to reconstruct "which
group / which sub-class does this value belong to" by stateful inference
across lines, which is non-deterministic. Now each value line carries its full
row path, so one line maps to exactly one record:
```
1. PERSONAL ACCIDENT > Accidental death and permanent disability > 1 > Each adult insured person under 70 | MAXIMUM LIMIT OF BENEFIT (S$) > VALUE PLAN > INDIVIDUAL COVER: 150,000
```
Detection is geometric and honors explicit markup:
- A value-region-wide body cell (is_full_width_note geometry — reaches the
  last column, spans a majority) is a row-group band: a full-width section
  band, or a group header / description spanning the value region. It is
  promoted to scope="rowgroup" and threaded as a bounded ancestor of the
  rows it groups. Nested bands are bounded by colspan — a band's extent
  ends at the next band of equal-or-wider span — so an inner group header does
  not close an outer section band. Source cells already marked
  scope="rowgroup" are honored as-is.
- The maze walks the data cell's own column (which a value-region-wide
  band spans) in addition to the row-label columns, so a band is found even
  when the row's label cell is empty — fixing the case where a group's
  unlabeled continuation row previously lost its band entirely.
- Columns under the leftmost top-level header group (the row-label / stub
  dimension, e.g. a SECTION header spanning the rownum and person-class
  columns beside a separate value-header group) are threaded as row labels
  even though they carry header text, so the row identity — not just the
  leading number — appears on each value line.
Guards keep it faithful: a band is promoted only when its extent contains a
real data row (so a standalone trailing note is left as a note, never
stranded), and a full-width body cell that merely repeats a column header (a
units caption like "(In thousands…)" reprinted between sections) is not
treated as a band. Existing scope="rowgroup" tables (e.g. FinTabNet year
bands) are unchanged. New fixtures matrix/rowgroup-nested-bands,
matrix/full-width-note-colspan, matrix/section-divider-td-colspan.

Assets 2

08 Jun 16:29

maish

v0.5.2

97a4be0

v0.5.2

Fixed

Full-width descriptive <td colspan> cells no longer replicate across every
value column. In a plan×cover benefit matrix, a row shaped
<td>marker</td><td colspan="7">description…</td> is a full-width note, not a
per-column value. The per-position expansion in _build_rules was emitting it
once per spanned column, each stamped with that column's header — so a single
benefit description ("If the departure of your public transport is delayed…")
became the value of six unrelated plan×cover cells. A data cell that reaches
the last column and spans a majority of the grid's columns
(spans.is_full_width_note) is now attributed to its origin column's header at
every spanned position, and the exporter's origin-aware dedup collapses it to a
single line. The cell is still emitted at every position, so an overlapping-span
corruption (a rowspan intruding into the note's row) is still detected by the
gate and fails open to flat. Legitimate narrow spans (a right-edge colspan=2
amount covering two sub-columns of one group) keep their per-column fan-out.
New fixture matrix/full-width-note-colspan.
Body section dividers no longer bleed onto every row as fabricated column
headers. A single full-width <td colspan="N"> divider (e.g. a benefits
schedule 1. PERSONAL ACCIDENT row) expands to N non-empty logical positions,
so the colspan-expanded cell count read it as a full header row —
detect_header_block swept the first divider and the body rows between it and
the first clean data row into the <thead>, and they then bled onto every line
as column headers (observed on a 100-row matrix: a benefit name attached to 474
of 478 output lines, including unrelated sections). The header now ends at the
first of a series (≥ 2) of full-width single-cell dividers — distinguishing
body section dividers from a one-off header subtitle like "(Dollars in
thousands)". Such dividers stay in the body and render once as full-width notes;
Fix 8 no longer promotes a single full-width cell to <th scope="row"> (which
stranded it, since it has no value column to anchor a rule). New fixture
matrix/section-divider-td-colspan.
All mypy errors in src/table2rules resolved (find_all("table") result
narrowed to Tag before use).

Assets 2

06 Jun 13:04

maish

v0.5.1

c62f7d2

v0.5.1

Fixed

De-spanned section headers are preserved instead of dropped. A body row
whose row-label is present but which carries no independent value — the shape
a full-width section title takes after an OCR/HTML pipeline drops its
colspan — used to vanish entirely, silently losing the label and leaving
repeated sub-item rows (Each child insured person) context-less.
_build_rules now emits a label-only rule for such rows, preserving the text
verbatim. The label is kept as-is rather than reconstructed into a section
breadcrumb, because such a row is structurally indistinguishable from a leaf
row with a genuinely missing value. Two encodings are covered:
- Empty value column — e.g. a benefits-schedule 1. Accidental death …
  row, or a balance-sheet Current assets: / FinTabNet Segments: label.
- Value echoes the column header — e.g. a 24. COVID-19 Coverage Extension | Sum Insured row. The echoed Sum Insured value is a
  self-echo that clean_rules strips; previously that took the section
  label down with it. The row is now recognised as label-only first, so the
  label survives while the redundant echo is still dropped.
A LogicRule.is_label flag marks these rows; the confidence gate treats them
as pass-through (excluded from header-attachment and coverage scoring) so a
table does not degrade to flat output merely for carrying them. New fixture
relational/despanned-section-headers exercises both encodings. Recovers
previously-dropped labels across the real-world corpus.

Changed

The real-world corpus is now byte-checked in CI. test_regression_golds
previously skipped tests/realworld/, leaving 401 benchmark golds with
nothing asserting them — two had silently drifted out of sync with the code.
All real-world golds are now frozen and byte-compared alongside the
hand-authored fixtures. The correctness/robustness oracles still assert the
output is right; the golds now assert it does not change unless a human
regenerates them, which is what makes a silent-drop regression visible.
scripts/benchmark.py no longer emits stray golds for top-level *.md
(README, scratch files), keeping --update-gold in lockstep with the test.

Assets 2

06 May 03:51

maish

v0.5.0

8b5d526

v0.5.0

Added

Two universal structural witnesses extend simple_repair.detect_header_block
so headless tables whose row 0 was previously misclassified as data now
promote correctly to a header block:
- Fuller-than-body. Row 0 is disqualified as a clean data row when its
  non-empty cell count is strictly greater than the minimum non-empty
  count of the non-divider rows below it AND every column row 0 fills is
  covered by at least one body row. Catches headerless tables with
  implicit-rowspan group-label columns (a leading column repeated at
  group starts and left empty in continuation rows), multi-stub
  indentation pyramids (one of several leading columns filled per row by
  hierarchy level), and alternating coefficient/std-error layouts. The
  body-coverage clause excludes receipts whose row 0 is a wider line
  item over narrower totals — there the body never uses row 0's extra
  columns.
- Cell-type inversion. Row 0 is disqualified when it contains at
  least one corner-stub <td> (where the body majority is <th>) and a
  strict majority of compared columns have row 0's cell tag inverted
  versus the body majority. Catches header rows authored as
  [td, th, th] above body rows [th, td, td]. The corner-stub
  requirement is load-bearing: an all-<th> row above an all-<td>
  body would also "invert" every column, but Fix 7 already wraps that
  case in <thead> via the contiguous-<th> chain.
Both witnesses apply only at row 0 — beyond the first row, "fuller than
the body below" and "cell tags differ from body majority" describe normal
body geometry, not a header.
Fixtures
tests/fixtures/relational/headerless-{corner-stub-inversion,implicit-rowspan-group,stub-pyramid}.md
exercise the three new patterns.

Assets 2

30 Apr 08:02

maish

v0.4.1

8e6aad1

v0.4.1

Fixed

extract_cell_text now skips text nodes whose direct parent is a <style>
or <script> tag. Wikipedia multi-column list templates inject inline
<style> blocks directly inside <td> cells; previously the raw CSS leaked
into emitted rule values while the quality gate still reported
mode=rules, score=1.00 — a silent failure invisible to callers.

Assets 2

30 Apr 05:41

maish

v0.4.0

a59146c

v0.4.0

Header detection reframed as a set of universal structural rules, and
the Layer-2 / Layer-3 real-world corpus doubled to add IBM 10-K
financial tables (FinTabNet). Combined, the rules unlock rules-format
output for 199 of 200 FinTabNet fixtures (up from 0 with the
previous "all row-0 cells non-empty" heuristic) while leaving
PubTabNet's 170/200 pass count unchanged.

Structural rule — header-block detection (replaces Fix 4's

row-0-only heuristic)

A new deterministic rule in simple_repair.detect_header_block picks
the header block via two geometric definitions:

Clean data row. A row where every logical position is an origin
cell at this row (no rowspan copy from above) with
rowspan == colspan == 1 and non-empty text. The first such row
(with col 0 non-empty) marks the header/body boundary.
Row-stub column. A column empty in every non-divider header row
AND non-empty in every non-divider body row. The conjunction is
load-bearing: neither condition alone distinguishes a stub from a
missing group-header cell (where a top-level group doesn't span a
later column) or an always-populated data column.

A leading row is header-compatible iff every empty cell in it lies in
a row-stub column. This subsumes three previously disjoint cases:

Dense first-row headers (receipts, simple relational): row 0 is
a clean data row with no empties, so the stub-col condition is
trivially satisfied. Old Fix 4's "row 0 all non-empty + body has
empties" special-case collapses into the same rule.
Multi-row headers with colspan group labels (FinTabNet
hierarchical, benefits-style matrix): header rows carry colspan > 1
cells and are not themselves clean data rows, so the detection
naturally extends through them to the first plain body row.
Financial 10-K tables with empty row-stub labels (FinTabNet
single-row headers): col 0 is structurally identified as a stub
column; the empty col-0 cell in row 0 is now permitted as the
row-stub-column signature rather than disqualifying.

Body cells in stub columns are promoted to <th scope="row">
directly, generalizing Fix 8's rowspan-based dimensional-column
promotion to also cover single-row-header tables.

Additional invariants

Row-group divider propagation (new). A body row whose single
non-empty logical cell sits in a stub column is structurally a
row-group header for the rows that follow. Fix 4 now promotes that
cell to <th scope="rowgroup">, and maze_pathfinder walks up the
stub column to include it in row_path — but only within its
extent. Explicit rowspan > 1 uses the rowspan range; divider
rows with rowspan == 1 run until the next rowgroup origin in the
same column. This captures the FinTabNet year-label pattern
("2014" row between Q1–Q4 blocks) without mis-nesting explicit
rowgroup peers like "North" / "South" with rowspan="2" each.
Stub-column rule uses strict-majority count (refined). The
earlier rule required a stub column to be non-empty in every
non-divider body row. That rejected tables with an unlabeled
trailing summary row (a FinTabNet pattern where the totals line
leaves the row-label cell blank: — | $1,573,043 | ...). Fix 4
now accepts a column as a stub when it is non-empty in strictly
more body rows than it is empty in — still a deterministic integer
comparison, still refuses to promote sparsely-filled data columns.
first_data_idx anchor relaxed (refined). The earlier anchor
required the first data row to have every cell non-empty. That
rejected tables whose body rows are inherently gappy (a column
that only fills on certain events — e.g., an aggregate fair-value
column that has a value only when shares vest). The anchor now
requires col 0 non-empty, ≥ 2 non-empty logical cells, and no span
structure — the remaining conditions still separate body rows from
header rows, since header rows carry either an empty col 0 or a
span marker.
Column-header detection ignores <th scope="row">. grid_parser
Step 1 previously accepted any <th>-or-empty row as the primary
column-header row, including rows where Fix 5 had promoted a "Total"
label to <th scope="row">. That misidentification caused
mid-table summary rows to be treated as headers, producing
fabricated column paths. Now Step 1 requires <th> cells without
scope="row".
Fix 8's active_rowspan counter tracks pre-promoted <th> cells.
When Fix 4 has already promoted the first-column cell via the
stub-column path, Fix 8 still needs to track its rowspan so
subsequent rows aren't mistaken for first-column content. The
counter was previously only updated inside Fix 8's promotion branch,
leaving it stuck at 0 when the cell was already <th> — which in
turn over-promoted data cells at logical col > 0 to row-stub status.

Corpus

FinTabNet oracle corpus (200 fixtures): pulled from
apoidea/fintabnet-html via
scripts/build_fintabnet_fixtures.py.
With the five structural invariants in place, 199/200 fixtures
produce rules-format output that oracle-matches. The single remaining
flat-fallback case uses footnote-marker separator columns — an
unusual layout pattern where narrow unlabeled columns sit between
dollar-amount columns to hold footnote references.
Regression golds updated: two matrix fixtures
(conflicting-spans-scope, scope-rowgroup-colgroup) had captured
the pre-existing Fix 8 bug. Updated golds reflect the corrected
output where a numeric data cell at logical col > 0 in a row whose
col 0 is covered by an above rowspan is no longer misclassified as
a row-subheader.

Quality gate

New gate reason partial_column_coverage. Detects when column
headers cover some rules in a table but not others — the silent
label-shift failure mode where a multi-level header omits a slot
for the row-label column, shifting every column label one position
right and leaving the rightmost data column unlabeled. Tables that
hit this case now fall back to flat rendering with the reason
surfaced in TableReport, instead of emitting wrong facts at full
confidence. Contributed by @BeastxD7 in
#2.

Public API additions

TableReport.text — the rendered output for this table only,
the same lines that contribute to the concatenated string returned
alongside the report. Lets callers passing whole-document HTML keep
per-table provenance without having to split the flat blob
themselves.
TableReport.caption — text of the table's <caption> element
when present, otherwise None. Only direct <caption> children
are read; surrounding headings and id attributes are intentionally
ignored — table_index remains the only stable positional
identifier.

Contributors

BeastxD7

Assets 2

Releases: PebbleRoad/table2rules

v0.6.3

Fixed

Uh oh!

v0.6.2

Fixed

Uh oh!

v0.6.1

Fixed

Uh oh!

v0.6.0

Added

Uh oh!

v0.5.2

Fixed

Uh oh!

v0.5.1

Fixed

Changed

Uh oh!

v0.5.0

Added

Uh oh!

v0.4.1

Fixed

Uh oh!

v0.4.0

Structural rule — header-block detection (replaces Fix 4's

Additional invariants

Corpus

Quality gate

Public API additions

Contributors

Uh oh!