feat: don't merge tables separated by a section heading #10
Merged
Two tables that share a column schema but belong to different sections were being stitched into one because header similarity alone drove the merge. A genuine page-split continuation has nothing but page furniture between its fragments, so a section heading between two tables is a reliable "separate tables" signal.
The docling adapter now computes a per-table TableMeta.content_before in reading order, and both merge paths consult it: _classify_sequential_pair (pass 1) and should_force_orphan_merge (pass 2). Gated by the new MultiPageConfig.block_on_intervening_content (default True).
Furniture handling so legitimate continuations still merge:
- Only section_header/title nodes block. Paragraphs, list items, captions, footnotes and figures are ignored: real PDFs scatter those between fragments of a single continued table.
- A heading that recurs near-identically (Jaccard >= 0.8) on another page is treated as a running header (a repeated banner, or a journal name docling labels page_header on one page and section_header on the next), not a boundary.
Fixes over-eager merging of same-schema per-section tables, e.g. an insurance policy's eight Prestige|Elite|Classic benefit grids collapsing into one.
Tests: 5 merger-level guard tests + 3 adapter-level tests building real DoclingDocuments. Full suite 158 passed; the existing continuation fixtures (repeated-header, headerless-continuation, orphan-pair, inconsistent-header-detection) are unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Tables that share a column schema but belong to different sections were being stitched into one logical table, because header similarity alone drove the merge. Real-world trigger: an insurance policy's eight per-benefit plan grids (each a Prestige | Elite | Classic table under its own 38a … 38h heading) collapsed into 2 merged blobs, attached to the wrong headings.
Insight
A genuine page-split continuation has nothing but page furniture (running headers/footers) between its fragments. So a section heading sitting between two tables in reading order is a reliable "these are separate tables" signal.
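The signal can be illustrated with a minimal sketch. It is not the adapter's actual code: `Node`, `collect_content_before` and `blocks_merge` are hypothetical names, and `TableMeta` here is a simplified stand-in that only borrows the `content_before` field name from this PR. The assumption is that reading order is available as a flat list of labelled items.

```python
from dataclasses import dataclass, field

# Per the PR description, only heading-like labels count as a boundary.
BLOCKING_LABELS = {"section_header", "title"}


@dataclass
class Node:
    """Hypothetical stand-in for one item in body reading order."""
    label: str
    text: str = ""


@dataclass
class TableMeta:
    """Simplified stand-in; only `content_before` mirrors the PR's field name."""
    name: str
    content_before: list = field(default_factory=list)


def collect_content_before(nodes):
    """Walk nodes in reading order and attach to each table the
    non-table nodes seen since the previous table."""
    tables, pending = [], []
    for node in nodes:
        if node.label == "table":
            tables.append(TableMeta(name=node.text, content_before=pending))
            pending = []
        else:
            pending.append(node)
    return tables


def blocks_merge(candidate, enabled=True):
    """True if a heading sits between `candidate` and the table before it.
    `enabled` plays the role of block_on_intervening_content."""
    if not enabled:
        return False
    return any(n.label in BLOCKING_LABELS for n in candidate.content_before)


nodes = [
    Node("table", "38a"),
    Node("page_footer", "Page 3"),        # page furniture: does not block
    Node("table", "38a (continued)"),
    Node("section_header", "38b Dental"),  # heading: blocks the merge
    Node("table", "38b"),
]
tables = collect_content_before(nodes)
decisions = [blocks_merge(t) for t in tables]
```

In this toy input the continuation of 38a stays mergeable because only a page footer precedes it, while 38b is held apart by its own heading. Passing `enabled=False` turns the guard off entirely.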
Change
- New MultiPageConfig.block_on_intervening_content (default True) and TableMeta.content_before.
- The docling adapter computes content_before per table in body reading order.
- Both merge paths consult it: _classify_sequential_pair (pass 1) and should_force_orphan_merge (pass 2).
Avoiding regressions on legitimate continuations
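- Only section_header/title nodes block. Paragraphs, list items, captions, footnotes and figures are ignored, because real PDFs scatter those between fragments of a single continued table (interleaved reading order, cell text extracted as body nodes).
- A heading that recurs near-identically (Jaccard >= 0.8) on another page is treated as a running header and does not block a continuation. Examples: a repeated banner, or a journal name docling labels page_header on one page and section_header on the next.
Tests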
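- 5 merger-level guard tests (test_merger.py).
- 3 adapter-level tests building real DoclingDocuments (test_intervening_content_guard.py).
- Full suite: 158 passed. The existing continuation fixtures (repeated-header, headerless-continuation, orphan-pair, inconsistent-header-detection) are unaffected.
Known limitations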
block_on_intervening_content=False is the escape hatch.
🤖 Generated with Claude Code
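For reference, the running-header rule described under "Avoiding regressions on legitimate continuations" could look roughly like the sketch below. The 0.8 threshold comes from this PR. Everything else is an assumption made for illustration: the PR does not say how headings are tokenised, so word-level lowercase tokens are used here, and `token_set`, `jaccard` and `is_running_header` are hypothetical names.

```python
import re


def token_set(text):
    """Lowercased alphanumeric word tokens (tokenisation is an assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the two headings' token sets."""
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    if not union:
        return 0.0  # two empty headings are not treated as a match
    return len(ta & tb) / len(union)


def is_running_header(heading, page, headings_by_page, threshold=0.8):
    """True if a near-identical heading appears on some *other* page,
    in which case it is page furniture rather than a section boundary."""
    return any(
        jaccard(heading, other) >= threshold
        for other_page, texts in headings_by_page.items()
        if other_page != page
        for other in texts
    )


headings_by_page = {
    1: ["Journal of Applied Widgets"],
    2: ["Journal of Applied Widgets", "38b Dental benefits"],
    3: ["38c Optical benefits"],
}
banner = is_running_header("Journal of Applied Widgets", 2, headings_by_page)
section = is_running_header("38b Dental benefits", 2, headings_by_page)
```

Here the journal name repeats verbatim on page 1, so it is classified as a running header and would not block a continuation. "38b Dental benefits" shares only one token with "38c Optical benefits" (similarity 0.2), so it remains a real boundary.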