add xlsx_analyze and pptx_analyze by Sidney233 · Pull Request #4725 · opendatalab/MinerU

Sidney233 · 2026-04-02T09:39:31Z

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

Copilot

Pull request overview

This PR introduces initial XLSX parsing support (tables, images, and drawing-based math extraction) and adds office_*_analyze entrypoints for XLSX and PPTX to produce the project’s “middle JSON” output.

Changes:

Added XlsxConverter with table detection, HTML table rendering, image extraction, and OMML→LaTeX math extraction.
Added office_xlsx_analyze / office_pptx_analyze wrappers mirroring the existing DOCX analyze flow.
Added a large set of XLSX fixtures (both full .xlsx/.xlsm files and an exploded OOXML sample) for manual/testing use.

Reviewed changes

Copilot reviewed 37 out of 49 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
mineru/model/xlsx/xlsx_converter.py	New XLSX converter implementation (tables/images/math → MinerU blocks).
mineru/model/xlsx/main.py	XLSX converter entrypoint helpers (`convert_path` / `convert_binary`) and demo runner.
mineru/model/xlsx/init.py	XLSX package marker.
mineru/model/pptx/pptx_converter.py	Adds `equation_bookends` field (currently unused).
mineru/backend/office/xlsx_analyze.py	New XLSX analyze wrapper producing middle JSON + raw results.
mineru/backend/office/pptx_analyze.py	New PPTX analyze wrapper producing middle JSON + raw results.
mineru/model/xlsx/test_xlsx/xlsx_01.xlsx	XLSX fixture for converter validation.
mineru/model/xlsx/test_xlsx/xlsx_02_sample_sales_data.xlsm	XLSM fixture for converter validation.
mineru/model/xlsx/test_xlsx/xlsx_03_chartsheet.xlsx	XLSX fixture containing charts/drawings.
mineru/model/xlsx/test_xlsx/xlsx_04_inflated.xlsx	XLSX fixture for stress/edge testing.
mineru/model/xlsx/test_xlsx/xlsx_05_table_with_title.xlsx	XLSX fixture for table-with-title scenarios.
mineru/model/xlsx/test_xlsx/xlsx_06_edge_cases_.xlsx	XLSX fixture for edge cases.
mineru/model/xlsx/test_xlsx/xlsx_07_gap_tolerance_.xlsx	XLSX fixture to exercise `gap_tolerance` table grouping.
mineru/model/xlsx/test_xlsx/xlsx_08_one_cell_anchor.xlsx	XLSX fixture to exercise one-cell anchors.
mineru/model/xlsx/test_xlsx/xlsx_01/[Content_Types].xml	Exploded OOXML sample (content types).
mineru/model/xlsx/test_xlsx/xlsx_01/_rels/.rels	Exploded OOXML sample (package rels).
mineru/model/xlsx/test_xlsx/xlsx_01/docProps/app.xml	Exploded OOXML sample (app props).
mineru/model/xlsx/test_xlsx/xlsx_01/docProps/core.xml	Exploded OOXML sample (core props).
mineru/model/xlsx/test_xlsx/xlsx_01/docProps/custom.xml	Exploded OOXML sample (custom props).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/workbook.xml	Exploded OOXML sample (workbook).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/_rels/workbook.xml.rels	Exploded OOXML sample (workbook rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/_rels/cellimages.xml.rels	Exploded OOXML sample (cell image rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/cellimages.xml	Exploded OOXML sample (WPS cell images mapping).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/sharedStrings.xml	Exploded OOXML sample (shared strings).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/styles.xml	Exploded OOXML sample (styles).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/theme/theme1.xml	Exploded OOXML sample (theme).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet1.xml	Exploded OOXML sample (sheet with DISPIMG + formula drawing rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet2.xml	Exploded OOXML sample (worksheet).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet3.xml	Exploded OOXML sample (worksheet).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet4.xml	Exploded OOXML sample (hidden worksheet).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet1.xml.rels	Exploded OOXML sample (sheet drawing rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet2.xml.rels	Exploded OOXML sample (sheet drawing rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet3.xml.rels	Exploded OOXML sample (sheet drawing rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing1.xml	Exploded OOXML sample (one-cell anchor w/ OMML).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing2.xml	Exploded OOXML sample (chart frames).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing3.xml	Exploded OOXML sample (embedded image).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/_rels/drawing2.xml.rels	Exploded OOXML sample (drawing→chart rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/_rels/drawing3.xml.rels	Exploded OOXML sample (drawing→image rel).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/chart1.xml	Exploded OOXML sample (chart xml).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/chart2.xml	Exploded OOXML sample (chart xml).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/_rels/chart1.xml.rels	Exploded OOXML sample (chart style rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/_rels/chart2.xml.rels	Exploded OOXML sample (chart style rels).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/colors1.xml	Exploded OOXML sample (chart colors).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/colors2.xml	Exploded OOXML sample (chart colors).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/style1.xml	Exploded OOXML sample (chart style).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/style2.xml	Exploded OOXML sample (chart style).
mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image1.png	Exploded OOXML sample media.
mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image2.png	Exploded OOXML sample media.
mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image3.jpeg	Exploded OOXML sample media.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-02T09:48:17Z

+        self.pages = []
+        self.cur_page = []
+        self.pages.append(self.cur_page)
+        self.image_map = {}


self.pages is initialized with an empty cur_page and then a new empty cur_page is appended after each sheet conversion. This leaves an extra trailing empty page in pages (and shifts indexing) for any workbook with >=1 sheet. Consider creating/appending the per-sheet page at the start of each iteration (or only appending after conversion when content exists) and avoid appending a final empty page.

Copilot · 2026-04-02T09:48:17Z

+                    if (r, c) in self.math_map:
+                        for formula in self.math_map[(r, c)]:
+                            text_content += self.equation_bookends.format(EQ=formula)
+
+                    lines.append(f"    <{tag}{attr_str}><p>{text_content}</p></{tag}>")


Math formula lookup appears inconsistent with table coordinates: self.math_map keys are sheet-level (row,col) from drawing XML, but here you check (r, c) where r/c are table-local indices. This will fail for tables not anchored at (0,0). Also, formula is injected into HTML unescaped, which can break the table HTML or allow tag injection if the source file is malicious. Use the table anchor offset when indexing math_map and HTML-escape the formula content before embedding inside <eq>...</eq>.

Copilot · 2026-04-02T09:48:17Z

+            # 1. 检查单元格直接值
+            cell = sheet.cell(row=r + 1, column=c + 1)
+            if cell.value is not None:
+                return True
+
+            # 2. 检查是否属于某个合并单元格区域
+            for mr in sheet.merged_cells.ranges:
+                if cell.coordinate in mr:
+                    return True


has_content() calls sheet.cell(...), which creates cells for empty coordinates in openpyxl. Because this is invoked heavily by BFS (and later when iterating the bbox), converting large/mostly-empty sheets can dramatically inflate sheet._cells and memory usage, and mutates the workbook. Prefer non-creating access (e.g., sheet._cells.get((row, col)) / Worksheet._get_cell if available) and precompute merged-cell membership to avoid per-call scans of sheet.merged_cells.ranges.

Copilot · 2026-04-02T09:48:18Z

+        if hasattr(file_stream, "seek"):
+            file_stream.seek(0)
+
+        self.workbook = load_workbook(filename=file_stream, data_only=True)
+        if self.workbook is not None:


load_workbook(...) can raise (corrupt file, unsupported content, password protection, etc.) but is not wrapped in error handling. Currently, convert() will crash even though you already handle ZipFile failures. Consider catching exceptions around load_workbook, logging a clear message, and ensuring the zipfile handle is closed in a finally block.

Copilot · 2026-04-02T09:48:18Z

+        used_cells = set()
+        if self.workbook is not None:
+            content_layer = self._get_sheet_content_layer(sheet)  # 检测工作表的可见性
+            tables = self._find_data_tables(sheet)  # 检测工作表中的所有数据表格


content_layer is computed but never used, and _get_sheet_content_layer() returns a raw string while its docstring references ContentLayer.INVISIBLE. This looks like leftover/unfinished visibility-layer handling and adds confusion for future maintainers. Either remove the content-layer plumbing or implement it consistently (including importing/using the proper enum/type).

Copilot · 2026-04-02T09:48:19Z

+    script_root = Path(__file__).resolve().parent.parent.parent.parent
+    default_xlsx = script_root / "demo" / "xlsx" / "xlsx_01.xlsx"
+


The demo runner defaults to demo/xlsx/xlsx_01.xlsx, but there is no demo/xlsx/ directory in the repository. Running this module as __main__ will fail by default. Either add the referenced demo file/directory or point the default to an existing fixture path.

Copilot · 2026-04-02T09:48:19Z

+    parser.add_argument(
+        "xlsx",
+        nargs="?",
+        default=str(default_xlsx),
+        help="path to docx file (defaults to demo/docx/demo1.docx relative to project root)",
+    )


The CLI help text still refers to a DOCX path (demo/docx/demo1.docx) even though this script analyzes XLSX. Update the help string (and the preceding comment at the top of the __main__ block) to avoid misleading users.

Copilot · 2026-04-02T09:48:19Z

+    parser = argparse.ArgumentParser(
+        description="Quick demo runner for office_pptx_analyze"
+    )
+    parser.add_argument(
+        "pptx",
+        nargs="?",
+        default=str(default_pptx),
+        help="path to docx file (defaults to demo/docx/demo1.docx relative to project root)",
+    )


The CLI help text still refers to a DOCX path (demo/docx/demo1.docx) even though this script analyzes PPTX. Update the help string (and the preceding comment at the top of the __main__ block) to avoid misleading users.

Copilot · 2026-04-02T09:48:19Z

+if __name__ == "__main__":
+    print(convert_path("test_xlsx/xlsx_01.xlsx"))


if __name__ == "__main__" uses test_xlsx/xlsx_01.xlsx, but the repo contains test_xlsx/xlsx_01/ (unzipped XML) and other .xlsx fixtures, not xlsx_01.xlsx. Consider switching to an existing file or adopting the more robust CLI pattern used in mineru/model/docx/main.py to avoid a broken example entrypoint.

Copilot · 2026-04-02T09:48:20Z

        self.pages = []
        self.cur_page = []
        self.list_block_stack: list = []  # 列表块堆栈
+        self.equation_bookends: str = "<eq>{EQ}</eq>"  # 公式标记格式


equation_bookends is introduced but not referenced anywhere in the PPTX converter. If inline-equation tagging is intended, wire it into the text extraction path; otherwise remove it to avoid dead state and confusion.

Suggested change

self.equation_bookends: str = "<eq>{EQ}</eq>" # 公式标记格式

myhloli · 2026-04-02T14:08:42Z

@codex

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 719d1e00b5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-02T14:12:08Z

+                sheet = self.workbook[name]
+                self._convert_sheet(sheet)
+                self.cur_page = []
+                self.pages.append(self.cur_page)


Remove trailing empty page after worksheet iteration

convert() appends a new self.cur_page after processing every sheet but never trims the final empty page, so workbooks with N sheets return N+1 pages and result_to_middle_json() emits an extra blank pdf_info entry. This creates incorrect page counts/index alignment for downstream consumers and is reproducible with the new test_xlsx files.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-02T14:12:08Z

+                rel_id = rel.attrib.get("Id")
+                target = rel.attrib.get("Target")
+                if rel_id and target:
+                    self.cell_image_map[cell_image_embed_to_name[rel_id]] = target


Ignore unmapped relationship IDs in cell image rels

The mapping builder assumes every relationship ID in xl/_rels/cellimages.xml.rels exists in cell_image_embed_to_name, but self.cell_image_map[cell_image_embed_to_name[rel_id]] = target raises KeyError for any extra/unreferenced relationship (for example, non-image rels). Because this is inside a broad except, one such entry causes the function to return {} and all cell-image extraction for that workbook is lost.

Useful? React with 👍 / 👎.

kevintyj · 2026-04-03T09:38:33Z

Thanks for adding the functions! One issue I noticed: PptxConverter._handle_text_elements() stores list items under the key list_items, but parse_list_block() in office_magic_model.py reads from the key content (line 355). This means all PPTX lists will silently be dropped (parse_list_block returns None when content is empty).

The fix should be a one-line change in pptx_converter.py — rename list_items to content in the list block dict (around line 230):

new_list_block = {
    "type": BlockType.LIST,
    "attribute": list_attribute,
    "content": [],  # was "list_items" — needs to match what parse_list_block() expects
}

The item format {"type": BlockType.TEXT, "content": "..."} is already correct since BlockType.TEXT = "text" matches what parse_list_block checks for.

The same issue exists in the current codebase (pre-PR) — this PR is a good opportunity to fix it.

kevintyj · 2026-04-03T09:55:57Z

Also noticed that there is no rich text markup support (bold, italics, etc.) and heading is all level 1. Was this an intentional design decision?

myhloli · 2026-04-09T09:02:47Z

@codex

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cd60fc5cd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-09T09:07:53Z

+                    if (r, c) in self.math_map:
+                        for formula in self.math_map[(r, c)]:


Use sheet coordinates when injecting math formulas

_map_math_formulas_to_cells() stores formula anchors in worksheet coordinates, but excel_table_to_html() looks them up with the table-local loop indices (r, c). For any table whose anchor is not (0, 0), formulas are missed (or can be attached to the wrong cell if another table shares the same local coordinates), which corrupts extracted table content for common layouts where data starts below/after headers.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-09T09:07:53Z

+                self.cur_page = []
+                self.pages.append(self.cur_page)


Avoid appending a trailing empty page per workbook

After processing each sheet, the converter always creates and appends a new cur_page, including after the final sheet. This makes every conversion return an extra blank page (e.g., 4 sheets become 5 pages), shifting page indices and producing empty pdf_info entries downstream.

Useful? React with 👍 / 👎.

Sidney233 and others added 5 commits March 17, 2026 20:47

feat: 添加解析xlsx

f1e848d

feat: 添加解析xlsx

40c5f10

feat: 添加解析xlsx

f24b3bf

feat: 添加xlsx_analyze和pptx_analyze

194ff84

Merge branch 'opendatalab:dev' into dev

719d1e0

Copilot AI review requested due to automatic review settings April 2, 2026 09:39

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Apr 2, 2026

Copilot started reviewing on behalf of Sidney233 April 2, 2026 09:40 View session

dosubot bot added the enhancement New feature or request label Apr 2, 2026

Copilot AI reviewed Apr 2, 2026

View reviewed changes

myhloli changed the base branch from dev to add_pptx_xlsx April 2, 2026 14:08

chatgpt-codex-connector bot reviewed Apr 2, 2026

View reviewed changes

Merge branch 'opendatalab:dev' into dev

cdb1485

dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Apr 8, 2026

Sidney233 added 4 commits April 9, 2026 16:29

feat: 添加富文本解析，pptx添加多级列表

20e798e

Merge remote-tracking branch 'origin/dev' into dev

21f7700

feat: 添加富文本解析，pptx添加多级列表

59da364

feat: 添加富文本解析，pptx添加多级列表

8cd60fc

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Apr 9, 2026

myhloli changed the base branch from add_pptx_xlsx to dev April 9, 2026 08:57

myhloli changed the base branch from dev to add_pptx_xlsx April 9, 2026 08:59

chatgpt-codex-connector bot reviewed Apr 9, 2026

View reviewed changes

fix: 修复pptx普通文本框缺失富文本的问题，xlsx公式锚点问题

710bdb3

myhloli merged commit 0725a2b into opendatalab:add_pptx_xlsx Apr 14, 2026
1 check passed

github-actions bot locked and limited conversation to collaborators Apr 14, 2026

		script_root = Path(__file__).resolve().parent.parent.parent.parent
		default_xlsx = script_root / "demo" / "xlsx" / "xlsx_01.xlsx"

		if __name__ == "__main__":
		print(convert_path("test_xlsx/xlsx_01.xlsx")) No newline at end of file

		if (r, c) in self.math_map:
		for formula in self.math_map[(r, c)]:

Conversation

Sidney233 commented Apr 2, 2026

Motivation

Modification

BC-breaking (Optional)

Use cases (Optional)

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

myhloli commented Apr 2, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

kevintyj commented Apr 3, 2026

Uh oh!

kevintyj commented Apr 3, 2026

Uh oh!

myhloli commented Apr 9, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants