
feat: don't merge tables separated by a section heading#10

Merged
phyohan18 merged 3 commits into main from feat/intervening-content-guard
May 29, 2026

@phyohan18

Contributor

Problem

Tables that share a column schema but belong to different sections were being stitched into one logical table, because header similarity alone drove the merge. Real-world trigger: an insurance policy's eight per-benefit plan grids — each a Prestige | Elite | Classic table under its own 38a … 38h heading — collapsed into 2 merged blobs, attached to the wrong headings.

Insight

A genuine page-split continuation has nothing but page furniture (running headers/footers) between its fragments. So a section heading sitting between two tables in reading order is a reliable "these are separate tables" signal.

Change

  • New MultiPageConfig.block_on_intervening_content (default True) and TableMeta.content_before.
  • The docling adapter computes content_before per table in body reading order.
  • Both merge paths consult it (symmetric guard): _classify_sequential_pair (pass 1) and should_force_orphan_merge (pass 2).
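
A minimal sketch of how such a guard could look, not the code from this PR. Only `MultiPageConfig.block_on_intervening_content` and `TableMeta.content_before` are names from the description; `ContentItem`, `BLOCKING_LABELS`, `blocks_merge` and the element type of `content_before` are hypothetical, introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Labels that act as section boundaries, per the PR description.
BLOCKING_LABELS = frozenset({"section_header", "title"})


@dataclass
class MultiPageConfig:
    # Field name and default come from the PR; other fields are omitted.
    block_on_intervening_content: bool = True


@dataclass
class ContentItem:
    # Hypothetical record for one body node seen before a table.
    label: str
    text: str


@dataclass
class TableMeta:
    # `content_before` is named in the PR; its element type is assumed here.
    content_before: List[ContentItem] = field(default_factory=list)


def blocks_merge(next_table: TableMeta, config: MultiPageConfig) -> bool:
    """Return True when a heading precedes `next_table` and the guard is on."""
    if not config.block_on_intervening_content:
        return False
    return any(item.label in BLOCKING_LABELS for item in next_table.content_before)
```

In this sketch both merge passes would call the same helper before accepting a merge, which is one way to keep the guard symmetric.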

Avoiding regressions on legitimate continuations

  • Only section_header/title nodes block a merge. Paragraphs, list items, captions, footnotes and figures are ignored, because real PDFs scatter those between fragments of a single continued table (interleaved reading order, cell text extracted as body nodes).
  • Running headers are detected by near-identical recurrence (Jaccard ≥ 0.8) across pages, so a repeated banner — or a journal name docling labels page_header on one page and section_header on the next — does not block a continuation.
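
The recurrence check could be sketched as below. The 0.8 threshold is from the description. The word-level tokenisation, the helper names `jaccard` and `is_running_header`, and the example heading strings are assumptions; the PR does not say how the similarity is tokenised.

```python
import re
from typing import Iterable, Set

JACCARD_THRESHOLD = 0.8  # threshold stated in the PR description


def _tokens(text: str) -> Set[str]:
    # Assumption: compare lower-cased word tokens.
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two token sets: |A & B| / |A | B|."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    if not union:
        return 0.0  # two empty headings carry no evidence of recurrence
    return len(ta & tb) / len(union)


def is_running_header(heading: str, headings_on_other_pages: Iterable[str]) -> bool:
    """True if a near-identical heading recurs on another page."""
    return any(
        jaccard(heading, other) >= JACCARD_THRESHOLD
        for other in headings_on_other_pages
    )
```

Under these assumptions a banner repeated verbatim scores 1.0 and is treated as furniture, while two sibling headings that differ in one of three tokens score 0.5 and still count as boundaries.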

Tests

  • 5 merger-level guard tests (test_merger.py).
  • 3 adapter-level tests building real DoclingDocuments (test_intervening_content_guard.py).
  • Full suite: 158 passed, 2 skipped. The existing continuation fixtures (repeated-header, headerless-continuation, orphan-pair, inconsistent-header-detection) are unaffected.

Known limitations

  • Two genuinely distinct same-schema tables separated only by a paragraph with no heading still merge. The guard only ever reduces merges, so it cannot introduce new wrong merges — but it doesn't catch that case.
  • A real heading that legitimately repeats near-identically on ≥2 pages (e.g. two chapters both titled "Notes") is treated as furniture. Rare; block_on_intervening_content=False is the escape hatch.
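
The "only ever reduces merges" claim can be checked on a toy model. This assumes the final decision is a plain conjunction of the existing similarity verdict and the guard, which is a simplification and not the merger's actual control flow; `decide` and its parameters are hypothetical names.

```python
from itertools import product


def decide(similarity_says_merge: bool, heading_between: bool,
           guard_enabled: bool = True) -> bool:
    """Toy model: merge only if similarity agrees and the guard does not veto."""
    vetoed = guard_enabled and heading_between
    return similarity_says_merge and not vetoed


# For every input, the guarded decision merges no more often than the unguarded one.
for sim, heading in product([False, True], repeat=2):
    assert decide(sim, heading) <= decide(sim, heading, guard_enabled=False)
```

In this model the first limitation is the input `heading_between=False`: with no heading present the guard has nothing to veto, so the decision falls back to header similarity alone.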

🤖 Generated with Claude Code

phyohan18 and others added 3 commits May 29, 2026 17:04
Two tables that share a column schema but belong to different sections were
being stitched into one because header similarity alone drove the merge. A
genuine page-split continuation has nothing but page furniture between its
fragments, so a section heading between two tables is a reliable "separate
tables" signal.

The docling adapter now computes a per-table TableMeta.content_before in
reading order, and both merge paths consult it: _classify_sequential_pair
(pass 1) and should_force_orphan_merge (pass 2). Gated by the new
MultiPageConfig.block_on_intervening_content (default True).

Furniture handling so legitimate continuations still merge:
- Only section_header/title nodes block. Paragraphs, list items, captions,
  footnotes and figures are ignored — real PDFs scatter those between
  fragments of a single continued table.
- A heading that recurs near-identically (Jaccard >= 0.8) on another page is
  treated as a running header (a repeated banner, or a journal name docling
  labels page_header on one page and section_header on the next), not a
  boundary.

Fixes over-eager merging of same-schema per-section tables, e.g. an insurance
policy's eight Prestige|Elite|Classic benefit grids collapsing into one.

Tests: 5 merger-level guard tests + 3 adapter-level tests building real
DoclingDocuments. Full suite 158 passed; the existing continuation fixtures
(repeated-header, headerless-continuation, orphan-pair,
inconsistent-header-detection) are unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@phyohan18 phyohan18 merged commit c8a5d82 into main May 29, 2026
7 checks passed
@phyohan18 phyohan18 deleted the feat/intervening-content-guard branch May 29, 2026 10:24