add xlsx_analyze and pptx_analyze #4725

Merged
myhloli merged 11 commits into opendatalab:add_pptx_xlsx from Sidney233:dev
Apr 14, 2026

Conversation

@Sidney233
Contributor

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry: just open the pull request and seek help from the maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

Copilot AI review requested due to automatic review settings April 2, 2026 09:39
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 2, 2026
@dosubot dosubot bot added the enhancement New feature or request label Apr 2, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces initial XLSX parsing support (tables, images, and drawing-based math extraction) and adds office_*_analyze entrypoints for XLSX and PPTX to produce the project’s “middle JSON” output.

Changes:

  • Added XlsxConverter with table detection, HTML table rendering, image extraction, and OMML→LaTeX math extraction.
  • Added office_xlsx_analyze / office_pptx_analyze wrappers mirroring the existing DOCX analyze flow.
  • Added a large set of XLSX fixtures (both full .xlsx/.xlsm files and an exploded OOXML sample) for manual/testing use.

Reviewed changes

Copilot reviewed 37 out of 49 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
mineru/model/xlsx/xlsx_converter.py New XLSX converter implementation (tables/images/math → MinerU blocks).
mineru/model/xlsx/main.py XLSX converter entrypoint helpers (convert_path / convert_binary) and demo runner.
mineru/model/xlsx/init.py XLSX package marker.
mineru/model/pptx/pptx_converter.py Adds equation_bookends field (currently unused).
mineru/backend/office/xlsx_analyze.py New XLSX analyze wrapper producing middle JSON + raw results.
mineru/backend/office/pptx_analyze.py New PPTX analyze wrapper producing middle JSON + raw results.
mineru/model/xlsx/test_xlsx/xlsx_01.xlsx XLSX fixture for converter validation.
mineru/model/xlsx/test_xlsx/xlsx_02_sample_sales_data.xlsm XLSM fixture for converter validation.
mineru/model/xlsx/test_xlsx/xlsx_03_chartsheet.xlsx XLSX fixture containing charts/drawings.
mineru/model/xlsx/test_xlsx/xlsx_04_inflated.xlsx XLSX fixture for stress/edge testing.
mineru/model/xlsx/test_xlsx/xlsx_05_table_with_title.xlsx XLSX fixture for table-with-title scenarios.
mineru/model/xlsx/test_xlsx/xlsx_06_edge_cases_.xlsx XLSX fixture for edge cases.
mineru/model/xlsx/test_xlsx/xlsx_07_gap_tolerance_.xlsx XLSX fixture to exercise gap_tolerance table grouping.
mineru/model/xlsx/test_xlsx/xlsx_08_one_cell_anchor.xlsx XLSX fixture to exercise one-cell anchors.
mineru/model/xlsx/test_xlsx/xlsx_01/[Content_Types].xml Exploded OOXML sample (content types).
mineru/model/xlsx/test_xlsx/xlsx_01/_rels/.rels Exploded OOXML sample (package rels).
mineru/model/xlsx/test_xlsx/xlsx_01/docProps/app.xml Exploded OOXML sample (app props).
mineru/model/xlsx/test_xlsx/xlsx_01/docProps/core.xml Exploded OOXML sample (core props).
mineru/model/xlsx/test_xlsx/xlsx_01/docProps/custom.xml Exploded OOXML sample (custom props).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/workbook.xml Exploded OOXML sample (workbook).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/_rels/workbook.xml.rels Exploded OOXML sample (workbook rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/_rels/cellimages.xml.rels Exploded OOXML sample (cell image rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/cellimages.xml Exploded OOXML sample (WPS cell images mapping).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/sharedStrings.xml Exploded OOXML sample (shared strings).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/styles.xml Exploded OOXML sample (styles).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/theme/theme1.xml Exploded OOXML sample (theme).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet1.xml Exploded OOXML sample (sheet with DISPIMG + formula drawing rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet2.xml Exploded OOXML sample (worksheet).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet3.xml Exploded OOXML sample (worksheet).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet4.xml Exploded OOXML sample (hidden worksheet).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet1.xml.rels Exploded OOXML sample (sheet drawing rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet2.xml.rels Exploded OOXML sample (sheet drawing rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet3.xml.rels Exploded OOXML sample (sheet drawing rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing1.xml Exploded OOXML sample (one-cell anchor w/ OMML).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing2.xml Exploded OOXML sample (chart frames).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing3.xml Exploded OOXML sample (embedded image).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/_rels/drawing2.xml.rels Exploded OOXML sample (drawing→chart rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/_rels/drawing3.xml.rels Exploded OOXML sample (drawing→image rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/chart1.xml Exploded OOXML sample (chart xml).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/chart2.xml Exploded OOXML sample (chart xml).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/_rels/chart1.xml.rels Exploded OOXML sample (chart style rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/_rels/chart2.xml.rels Exploded OOXML sample (chart style rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/colors1.xml Exploded OOXML sample (chart colors).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/colors2.xml Exploded OOXML sample (chart colors).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/style1.xml Exploded OOXML sample (chart style).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/style2.xml Exploded OOXML sample (chart style).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image1.png Exploded OOXML sample media.
mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image2.png Exploded OOXML sample media.
mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image3.jpeg Exploded OOXML sample media.

Comment on lines +96 to +99
self.pages = []
self.cur_page = []
self.pages.append(self.cur_page)
self.image_map = {}

Copilot AI Apr 2, 2026

self.pages is initialized with an empty cur_page and then a new empty cur_page is appended after each sheet conversion. This leaves an extra trailing empty page in pages (and shifts indexing) for any workbook with >=1 sheet. Consider creating/appending the per-sheet page at the start of each iteration (or only appending after conversion when content exists) and avoid appending a final empty page.
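
A minimal sketch of the first option suggested above (create and append the page at the start of each iteration). The function name `convert_sheets` and the `convert_sheet(name, page)` callback are hypothetical stand-ins for the converter's own methods, not code from this PR.

```python
def convert_sheets(sheet_names, convert_sheet):
    """Build exactly one page per sheet, with no trailing empty page.

    `convert_sheet(name, page)` is a hypothetical callback that appends
    the blocks extracted from the named sheet to `page`.
    """
    pages = []
    for name in sheet_names:
        cur_page = []            # create the page at the start of the iteration
        pages.append(cur_page)   # so nothing is appended after the last sheet
        convert_sheet(name, cur_page)
    return pages


# Usage: three sheets yield three pages, and an empty workbook yields none.
pages = convert_sheets(
    ["Sheet1", "Sheet2", "Sheet3"],
    lambda name, page: page.append({"type": "text", "content": name}),
)
print(len(pages))
```

With this shape, the page index matches the sheet index directly, so downstream consumers do not need to trim the result.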

Copilot uses AI. Check for mistakes.
Comment thread mineru/model/xlsx/xlsx_converter.py Outdated
Comment on lines +326 to +330
if (r, c) in self.math_map:
    for formula in self.math_map[(r, c)]:
        text_content += self.equation_bookends.format(EQ=formula)

lines.append(f" <{tag}{attr_str}><p>{text_content}</p></{tag}>")

Copilot AI Apr 2, 2026

Math formula lookup appears inconsistent with table coordinates: self.math_map keys are sheet-level (row,col) from drawing XML, but here you check (r, c) where r/c are table-local indices. This will fail for tables not anchored at (0,0). Also, formula is injected into HTML unescaped, which can break the table HTML or allow tag injection if the source file is malicious. Use the table anchor offset when indexing math_map and HTML-escape the formula content before embedding inside <eq>...</eq>.
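
One way the two suggestions could look together, as a sketch only. It assumes `math_map` and the table anchor both use 0-based sheet coordinates; the helper name `cell_formulas_html` and the `anchor_row` / `anchor_col` parameters are hypothetical and do not appear in the PR.

```python
import html

EQUATION_BOOKENDS = "<eq>{EQ}</eq>"  # same format string as in the diff


def cell_formulas_html(math_map, anchor_row, anchor_col, r, c):
    """Return escaped <eq> markup for the table-local cell (r, c).

    `math_map` is keyed by sheet-level (row, col). `anchor_row` and
    `anchor_col` are the sheet coordinates of the table's top-left cell
    (hypothetical parameter names, assumed 0-based like the map keys).
    """
    sheet_key = (anchor_row + r, anchor_col + c)  # table-local -> sheet-level
    return "".join(
        EQUATION_BOOKENDS.format(EQ=html.escape(formula))
        for formula in math_map.get(sheet_key, [])
    )


# A formula anchored at sheet cell (5, 2), inside a table that starts at (4, 1).
math_map = {(5, 2): ["a<b"]}
print(cell_formulas_html(math_map, anchor_row=4, anchor_col=1, r=1, c=1))
```

Escaping at the point of embedding keeps a `<` or `&` in the LaTeX from being read as markup inside the table HTML.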

Comment on lines +541 to +549
# 1. Check the cell's direct value
cell = sheet.cell(row=r + 1, column=c + 1)
if cell.value is not None:
    return True

# 2. Check whether the cell belongs to a merged-cell range
for mr in sheet.merged_cells.ranges:
    if cell.coordinate in mr:
        return True

Copilot AI Apr 2, 2026

has_content() calls sheet.cell(...), which creates cells for empty coordinates in openpyxl. Because this is invoked heavily by BFS (and later when iterating the bbox), converting large/mostly-empty sheets can dramatically inflate sheet._cells and memory usage, and mutates the workbook. Prefer non-creating access (e.g., sheet._cells.get((row, col)) / Worksheet._get_cell if available) and precompute merged-cell membership to avoid per-call scans of sheet.merged_cells.ranges.
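
A library-independent sketch of the precomputation idea. It deliberately uses plain Python containers: `cells` is a dict standing in for whatever non-creating lookup the worksheet offers, and `merged_bounds` is a hypothetical list of inclusive `(min_row, min_col, max_row, max_col)` tuples extracted once per sheet. Neither name comes from the PR.

```python
def build_merged_lookup(merged_bounds):
    """Expand merged ranges into a set of covered (row, col) coordinates.

    `merged_bounds` is an iterable of inclusive
    (min_row, min_col, max_row, max_col) tuples (hypothetical input shape).
    """
    covered = set()
    for min_row, min_col, max_row, max_col in merged_bounds:
        for row in range(min_row, max_row + 1):
            for col in range(min_col, max_col + 1):
                covered.add((row, col))
    return covered


def has_content(cells, merged_lookup, row, col):
    """Check a coordinate without creating a cell as a side effect.

    `cells` maps (row, col) to a value; dict.get never inserts a key,
    so repeated probing of empty coordinates leaves `cells` unchanged.
    """
    return cells.get((row, col)) is not None or (row, col) in merged_lookup


# Usage: build the lookup once per sheet, then probe freely during BFS.
cells = {(1, 1): "header"}
merged_lookup = build_merged_lookup([(2, 2, 3, 3)])
print(has_content(cells, merged_lookup, 3, 2))
```

The set membership test replaces the per-call scan over every merged range, at the cost of holding one tuple per merged cell in memory.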

Comment on lines +116 to +120
if hasattr(file_stream, "seek"):
    file_stream.seek(0)

self.workbook = load_workbook(filename=file_stream, data_only=True)
if self.workbook is not None:

Copilot AI Apr 2, 2026

load_workbook(...) can raise (corrupt file, unsupported content, password protection, etc.) but is not wrapped in error handling. Currently, convert() will crash even though you already handle ZipFile failures. Consider catching exceptions around load_workbook, logging a clear message, and ensuring the zipfile handle is closed in a finally block.
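
A sketch of the try/except/finally shape described above. To stay self-contained it takes a `loader` callable as a stand-in for the real workbook loader; `load_safely` and `loader` are hypothetical names, and whether the converter should close the archive this early depends on how it uses the handle afterwards.

```python
import io
import logging
import zipfile

logger = logging.getLogger(__name__)


def load_safely(file_stream, loader):
    """Load a workbook, logging failures and always closing the zip handle.

    `loader` is a hypothetical callable standing in for the real workbook
    loader: it receives the rewound stream and returns a workbook object.
    """
    archive = None
    try:
        file_stream.seek(0)
        archive = zipfile.ZipFile(file_stream)  # raises BadZipFile on non-zip input
        file_stream.seek(0)
        return loader(file_stream)
    except Exception as exc:  # corrupt file, unsupported content, password protection
        logger.error("Failed to load workbook: %s", exc)
        return None
    finally:
        if archive is not None:
            archive.close()  # runs on success and on failure


# Usage: a stream that is not a zip archive yields None instead of a crash.
print(load_safely(io.BytesIO(b"not a zip"), lambda stream: "workbook"))
```

Returning `None` on failure matches the existing `if self.workbook is not None:` check in the quoted hunk, so the caller's control flow would not need to change.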

Comment on lines +235 to +238
used_cells = set()
if self.workbook is not None:
    content_layer = self._get_sheet_content_layer(sheet)  # detect the sheet's visibility
    tables = self._find_data_tables(sheet)  # detect all data tables in the sheet

Copilot AI Apr 2, 2026

content_layer is computed but never used, and _get_sheet_content_layer() returns a raw string while its docstring references ContentLayer.INVISIBLE. This looks like leftover/unfinished visibility-layer handling and adds confusion for future maintainers. Either remove the content-layer plumbing or implement it consistently (including importing/using the proper enum/type).

Comment on lines +39 to +41
script_root = Path(__file__).resolve().parent.parent.parent.parent
default_xlsx = script_root / "demo" / "xlsx" / "xlsx_01.xlsx"


Copilot AI Apr 2, 2026

The demo runner defaults to demo/xlsx/xlsx_01.xlsx, but there is no demo/xlsx/ directory in the repository. Running this module as __main__ will fail by default. Either add the referenced demo file/directory or point the default to an existing fixture path.

Comment on lines +45 to +50
parser.add_argument(
    "xlsx",
    nargs="?",
    default=str(default_xlsx),
    help="path to docx file (defaults to demo/docx/demo1.docx relative to project root)",
)

Copilot AI Apr 2, 2026

The CLI help text still refers to a DOCX path (demo/docx/demo1.docx) even though this script analyzes XLSX. Update the help string (and the preceding comment at the top of the __main__ block) to avoid misleading users.
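
A possible shape for the corrected parser, shown as a sketch. The description string mirrors the PPTX runner quoted elsewhere in this review and is an assumption for the XLSX script; `build_parser` and the `demo/xlsx/example.xlsx` path are hypothetical placeholders.

```python
import argparse
from pathlib import Path


def build_parser(default_xlsx: Path) -> argparse.ArgumentParser:
    """Build the demo CLI with help text derived from the actual default."""
    parser = argparse.ArgumentParser(
        description="Quick demo runner for office_xlsx_analyze"  # assumed wording
    )
    parser.add_argument(
        "xlsx",
        nargs="?",
        default=str(default_xlsx),
        # Interpolating the default keeps the help text and the default in sync.
        help=f"path to xlsx file (defaults to {default_xlsx})",
    )
    return parser


# Usage: with no arguments, the positional falls back to the default path.
args = build_parser(Path("demo/xlsx/example.xlsx")).parse_args([])
print(args.xlsx)
```

Deriving the help string from the default avoids the copy-paste drift that left a DOCX path in an XLSX script.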

Comment on lines +42 to +50
parser = argparse.ArgumentParser(
    description="Quick demo runner for office_pptx_analyze"
)
parser.add_argument(
    "pptx",
    nargs="?",
    default=str(default_pptx),
    help="path to docx file (defaults to demo/docx/demo1.docx relative to project root)",
)

Copilot AI Apr 2, 2026

The CLI help text still refers to a DOCX path (demo/docx/demo1.docx) even though this script analyzes PPTX. Update the help string (and the preceding comment at the top of the __main__ block) to avoid misleading users.

Comment thread mineru/model/xlsx/main.py
Comment on lines +15 to +16
if __name__ == "__main__":
    print(convert_path("test_xlsx/xlsx_01.xlsx"))

Copilot AI Apr 2, 2026

if __name__ == "__main__" uses test_xlsx/xlsx_01.xlsx, but the repo contains test_xlsx/xlsx_01/ (unzipped XML) and other .xlsx fixtures, not xlsx_01.xlsx. Consider switching to an existing file or adopting the more robust CLI pattern used in mineru/model/docx/main.py to avoid a broken example entrypoint.

self.pages = []
self.cur_page = []
self.list_block_stack: list = []  # list block stack
self.equation_bookends: str = "<eq>{EQ}</eq>"  # equation markup format

Copilot AI Apr 2, 2026

equation_bookends is introduced but not referenced anywhere in the PPTX converter. If inline-equation tagging is intended, wire it into the text extraction path; otherwise remove it to avoid dead state and confusion.

Suggested change
self.equation_bookends: str = "<eq>{EQ}</eq>"  # equation markup format

@myhloli myhloli changed the base branch from dev to add_pptx_xlsx April 2, 2026 14:08
@myhloli
Collaborator

myhloli commented Apr 2, 2026

@codex

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 719d1e00b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

sheet = self.workbook[name]
self._convert_sheet(sheet)
self.cur_page = []
self.pages.append(self.cur_page)

P1: Remove trailing empty page after worksheet iteration

convert() appends a new self.cur_page after processing every sheet but never trims the final empty page, so workbooks with N sheets return N+1 pages and result_to_middle_json() emits an extra blank pdf_info entry. This creates incorrect page counts/index alignment for downstream consumers and is reproducible with the new test_xlsx files.

rel_id = rel.attrib.get("Id")
target = rel.attrib.get("Target")
if rel_id and target:
    self.cell_image_map[cell_image_embed_to_name[rel_id]] = target

P2: Ignore unmapped relationship IDs in cell image rels

The mapping builder assumes every relationship ID in xl/_rels/cellimages.xml.rels exists in cell_image_embed_to_name, but self.cell_image_map[cell_image_embed_to_name[rel_id]] = target raises KeyError for any extra/unreferenced relationship (for example, non-image rels). Because this is inside a broad except, one such entry causes the function to return {} and all cell-image extraction for that workbook is lost.
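
A minimal sketch of a tolerant mapping builder, assuming the relationships have already been parsed into attribute dicts. `build_cell_image_map` and its parameter names are hypothetical; only the `Id` and `Target` attribute names come from the quoted hunk.

```python
def build_cell_image_map(relationships, embed_to_name):
    """Map cell image names to targets, skipping unreferenced relationships.

    `relationships` is a list of attribute dicts with "Id" and "Target"
    keys; `embed_to_name` maps relationship IDs to cell image names.
    Both are hypothetical stand-ins for the parsed XML.
    """
    cell_image_map = {}
    for attrib in relationships:
        rel_id = attrib.get("Id")
        target = attrib.get("Target")
        name = embed_to_name.get(rel_id)  # .get avoids KeyError on unknown IDs
        if name is None or not target:
            continue  # extra or unreferenced relationship: skip it, do not raise
        cell_image_map[name] = target
    return cell_image_map


# Usage: rId9 has no embed entry, so it is skipped and rId1 still resolves.
rels = [
    {"Id": "rId1", "Target": "media/image1.png"},
    {"Id": "rId9", "Target": "other.xml"},
]
print(build_cell_image_map(rels, {"rId1": "ID_ABC"}))
```

Skipping per entry means one stray relationship no longer discards the mappings that did resolve.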

@kevintyj

kevintyj commented Apr 3, 2026

Thanks for adding the functions! One issue I noticed: PptxConverter._handle_text_elements() stores list items under the key list_items, but parse_list_block() in office_magic_model.py reads from the key content (line 355). This means all PPTX lists will silently be dropped (parse_list_block returns None when content is empty).

The fix should be a one-line change in pptx_converter.py — rename list_items to content in the list block dict (around line 230):

new_list_block = {
    "type": BlockType.LIST,
    "attribute": list_attribute,
    "content": [],  # was "list_items" — needs to match what parse_list_block() expects
}

The item format {"type": BlockType.TEXT, "content": "..."} is already correct since BlockType.TEXT = "text" matches what parse_list_block checks for.

The same issue exists in the current codebase (pre-PR) — this PR is a good opportunity to fix it.

@kevintyj

kevintyj commented Apr 3, 2026

Also noticed that there is no rich-text markup support (bold, italics, etc.) and that all headings are level 1. Was this an intentional design decision?

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Apr 8, 2026
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 9, 2026
@myhloli myhloli changed the base branch from add_pptx_xlsx to dev April 9, 2026 08:57
@myhloli myhloli changed the base branch from dev to add_pptx_xlsx April 9, 2026 08:59
@myhloli
Collaborator

myhloli commented Apr 9, 2026

@codex

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cd60fc5cd

Comment thread mineru/model/xlsx/xlsx_converter.py Outdated
Comment on lines +328 to +329
if (r, c) in self.math_map:
    for formula in self.math_map[(r, c)]:

P1: Use sheet coordinates when injecting math formulas

_map_math_formulas_to_cells() stores formula anchors in worksheet coordinates, but excel_table_to_html() looks them up with the table-local loop indices (r, c). For any table whose anchor is not (0, 0), formulas are missed (or can be attached to the wrong cell if another table shares the same local coordinates), which corrupts extracted table content for common layouts where data starts below/after headers.

Comment on lines +133 to +134
self.cur_page = []
self.pages.append(self.cur_page)

P2: Avoid appending a trailing empty page per workbook

After processing each sheet, the converter always creates and appends a new cur_page, including after the final sheet. This makes every conversion return an extra blank page (e.g., 4 sheets become 5 pages), shifting page indices and producing empty pdf_info entries downstream.

@myhloli myhloli merged commit 0725a2b into opendatalab:add_pptx_xlsx Apr 14, 2026
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 14, 2026

Labels

enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files.

4 participants