add xlsx_analyze and pptx_analyze#4725
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces initial XLSX parsing support (tables, images, and drawing-based math extraction) and adds office_*_analyze entrypoints for XLSX and PPTX to produce the project’s “middle JSON” output.
Changes:
- Added
XlsxConverterwith table detection, HTML table rendering, image extraction, and OMML→LaTeX math extraction. - Added
office_xlsx_analyze/office_pptx_analyzewrappers mirroring the existing DOCX analyze flow. - Added a large set of XLSX fixtures (both full
.xlsx/.xlsmfiles and an exploded OOXML sample) for manual/testing use.
Reviewed changes
Copilot reviewed 37 out of 49 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| mineru/model/xlsx/xlsx_converter.py | New XLSX converter implementation (tables/images/math → MinerU blocks). |
| mineru/model/xlsx/main.py | XLSX converter entrypoint helpers (convert_path / convert_binary) and demo runner. |
| mineru/model/xlsx/init.py | XLSX package marker. |
| mineru/model/pptx/pptx_converter.py | Adds equation_bookends field (currently unused). |
| mineru/backend/office/xlsx_analyze.py | New XLSX analyze wrapper producing middle JSON + raw results. |
| mineru/backend/office/pptx_analyze.py | New PPTX analyze wrapper producing middle JSON + raw results. |
| mineru/model/xlsx/test_xlsx/xlsx_01.xlsx | XLSX fixture for converter validation. |
| mineru/model/xlsx/test_xlsx/xlsx_02_sample_sales_data.xlsm | XLSM fixture for converter validation. |
| mineru/model/xlsx/test_xlsx/xlsx_03_chartsheet.xlsx | XLSX fixture containing charts/drawings. |
| mineru/model/xlsx/test_xlsx/xlsx_04_inflated.xlsx | XLSX fixture for stress/edge testing. |
| mineru/model/xlsx/test_xlsx/xlsx_05_table_with_title.xlsx | XLSX fixture for table-with-title scenarios. |
| mineru/model/xlsx/test_xlsx/xlsx_06_edge_cases_.xlsx | XLSX fixture for edge cases. |
| mineru/model/xlsx/test_xlsx/xlsx_07_gap_tolerance_.xlsx | XLSX fixture to exercise gap_tolerance table grouping. |
| mineru/model/xlsx/test_xlsx/xlsx_08_one_cell_anchor.xlsx | XLSX fixture to exercise one-cell anchors. |
| mineru/model/xlsx/test_xlsx/xlsx_01/[Content_Types].xml | Exploded OOXML sample (content types). |
| mineru/model/xlsx/test_xlsx/xlsx_01/_rels/.rels | Exploded OOXML sample (package rels). |
| mineru/model/xlsx/test_xlsx/xlsx_01/docProps/app.xml | Exploded OOXML sample (app props). |
| mineru/model/xlsx/test_xlsx/xlsx_01/docProps/core.xml | Exploded OOXML sample (core props). |
| mineru/model/xlsx/test_xlsx/xlsx_01/docProps/custom.xml | Exploded OOXML sample (custom props). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/workbook.xml | Exploded OOXML sample (workbook). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/_rels/workbook.xml.rels | Exploded OOXML sample (workbook rels). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/_rels/cellimages.xml.rels | Exploded OOXML sample (cell image rels). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/cellimages.xml | Exploded OOXML sample (WPS cell images mapping). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/sharedStrings.xml | Exploded OOXML sample (shared strings). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/styles.xml | Exploded OOXML sample (styles). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/theme/theme1.xml | Exploded OOXML sample (theme). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet1.xml | Exploded OOXML sample (sheet with DISPIMG + formula drawing rel). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet2.xml | Exploded OOXML sample (worksheet). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet3.xml | Exploded OOXML sample (worksheet). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/sheet4.xml | Exploded OOXML sample (hidden worksheet). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet1.xml.rels | Exploded OOXML sample (sheet drawing rel). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet2.xml.rels | Exploded OOXML sample (sheet drawing rel). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/worksheets/_rels/sheet3.xml.rels | Exploded OOXML sample (sheet drawing rel). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing1.xml | Exploded OOXML sample (one-cell anchor w/ OMML). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing2.xml | Exploded OOXML sample (chart frames). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/drawing3.xml | Exploded OOXML sample (embedded image). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/_rels/drawing2.xml.rels | Exploded OOXML sample (drawing→chart rels). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/drawings/_rels/drawing3.xml.rels | Exploded OOXML sample (drawing→image rel). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/chart1.xml | Exploded OOXML sample (chart xml). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/chart2.xml | Exploded OOXML sample (chart xml). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/_rels/chart1.xml.rels | Exploded OOXML sample (chart style rels). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/_rels/chart2.xml.rels | Exploded OOXML sample (chart style rels). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/colors1.xml | Exploded OOXML sample (chart colors). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/colors2.xml | Exploded OOXML sample (chart colors). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/style1.xml | Exploded OOXML sample (chart style). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/charts/style2.xml | Exploded OOXML sample (chart style). |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image1.png | Exploded OOXML sample media. |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image2.png | Exploded OOXML sample media. |
| mineru/model/xlsx/test_xlsx/xlsx_01/xl/media/image3.jpeg | Exploded OOXML sample media. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self.pages = [] | ||
| self.cur_page = [] | ||
| self.pages.append(self.cur_page) | ||
| self.image_map = {} |
There was a problem hiding this comment.
self.pages is initialized with an empty cur_page and then a new empty cur_page is appended after each sheet conversion. This leaves an extra trailing empty page in pages (and shifts indexing) for any workbook with >=1 sheet. Consider creating/appending the per-sheet page at the start of each iteration (or only appending after conversion when content exists) and avoid appending a final empty page.
| if (r, c) in self.math_map: | ||
| for formula in self.math_map[(r, c)]: | ||
| text_content += self.equation_bookends.format(EQ=formula) | ||
|
|
||
| lines.append(f" <{tag}{attr_str}><p>{text_content}</p></{tag}>") |
There was a problem hiding this comment.
Math formula lookup appears inconsistent with table coordinates: self.math_map keys are sheet-level (row,col) from drawing XML, but here you check (r, c) where r/c are table-local indices. This will fail for tables not anchored at (0,0). Also, formula is injected into HTML unescaped, which can break the table HTML or allow tag injection if the source file is malicious. Use the table anchor offset when indexing math_map and HTML-escape the formula content before embedding inside <eq>...</eq>.
| # 1. 检查单元格直接值 | ||
| cell = sheet.cell(row=r + 1, column=c + 1) | ||
| if cell.value is not None: | ||
| return True | ||
|
|
||
| # 2. 检查是否属于某个合并单元格区域 | ||
| for mr in sheet.merged_cells.ranges: | ||
| if cell.coordinate in mr: | ||
| return True |
There was a problem hiding this comment.
has_content() calls sheet.cell(...), which creates cells for empty coordinates in openpyxl. Because this is invoked heavily by BFS (and later when iterating the bbox), converting large/mostly-empty sheets can dramatically inflate sheet._cells and memory usage, and mutates the workbook. Prefer non-creating access (e.g., sheet._cells.get((row, col)) / Worksheet._get_cell if available) and precompute merged-cell membership to avoid per-call scans of sheet.merged_cells.ranges.
| if hasattr(file_stream, "seek"): | ||
| file_stream.seek(0) | ||
|
|
||
| self.workbook = load_workbook(filename=file_stream, data_only=True) | ||
| if self.workbook is not None: |
There was a problem hiding this comment.
load_workbook(...) can raise (corrupt file, unsupported content, password protection, etc.) but is not wrapped in error handling. Currently, convert() will crash even though you already handle ZipFile failures. Consider catching exceptions around load_workbook, logging a clear message, and ensuring the zipfile handle is closed in a finally block.
| used_cells = set() | ||
| if self.workbook is not None: | ||
| content_layer = self._get_sheet_content_layer(sheet) # 检测工作表的可见性 | ||
| tables = self._find_data_tables(sheet) # 检测工作表中的所有数据表格 |
There was a problem hiding this comment.
content_layer is computed but never used, and _get_sheet_content_layer() returns a raw string while its docstring references ContentLayer.INVISIBLE. This looks like leftover/unfinished visibility-layer handling and adds confusion for future maintainers. Either remove the content-layer plumbing or implement it consistently (including importing/using the proper enum/type).
| script_root = Path(__file__).resolve().parent.parent.parent.parent | ||
| default_xlsx = script_root / "demo" / "xlsx" / "xlsx_01.xlsx" | ||
|
|
There was a problem hiding this comment.
The demo runner defaults to demo/xlsx/xlsx_01.xlsx, but there is no demo/xlsx/ directory in the repository. Running this module as __main__ will fail by default. Either add the referenced demo file/directory or point the default to an existing fixture path.
| parser.add_argument( | ||
| "xlsx", | ||
| nargs="?", | ||
| default=str(default_xlsx), | ||
| help="path to docx file (defaults to demo/docx/demo1.docx relative to project root)", | ||
| ) |
There was a problem hiding this comment.
The CLI help text still refers to a DOCX path (demo/docx/demo1.docx) even though this script analyzes XLSX. Update the help string (and the preceding comment at the top of the __main__ block) to avoid misleading users.
| parser = argparse.ArgumentParser( | ||
| description="Quick demo runner for office_pptx_analyze" | ||
| ) | ||
| parser.add_argument( | ||
| "pptx", | ||
| nargs="?", | ||
| default=str(default_pptx), | ||
| help="path to docx file (defaults to demo/docx/demo1.docx relative to project root)", | ||
| ) |
There was a problem hiding this comment.
The CLI help text still refers to a DOCX path (demo/docx/demo1.docx) even though this script analyzes PPTX. Update the help string (and the preceding comment at the top of the __main__ block) to avoid misleading users.
| if __name__ == "__main__": | ||
| print(convert_path("test_xlsx/xlsx_01.xlsx")) No newline at end of file |
There was a problem hiding this comment.
if __name__ == "__main__" uses test_xlsx/xlsx_01.xlsx, but the repo contains test_xlsx/xlsx_01/ (unzipped XML) and other .xlsx fixtures, not xlsx_01.xlsx. Consider switching to an existing file or adopting the more robust CLI pattern used in mineru/model/docx/main.py to avoid a broken example entrypoint.
| self.pages = [] | ||
| self.cur_page = [] | ||
| self.list_block_stack: list = [] # 列表块堆栈 | ||
| self.equation_bookends: str = "<eq>{EQ}</eq>" # 公式标记格式 |
There was a problem hiding this comment.
equation_bookends is introduced but not referenced anywhere in the PPTX converter. If inline-equation tagging is intended, wire it into the text extraction path; otherwise remove it to avoid dead state and confusion.
| self.equation_bookends: str = "<eq>{EQ}</eq>" # 公式标记格式 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 719d1e00b5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| sheet = self.workbook[name] | ||
| self._convert_sheet(sheet) | ||
| self.cur_page = [] | ||
| self.pages.append(self.cur_page) |
There was a problem hiding this comment.
Remove trailing empty page after worksheet iteration
convert() appends a new self.cur_page after processing every sheet but never trims the final empty page, so workbooks with N sheets return N+1 pages and result_to_middle_json() emits an extra blank pdf_info entry. This creates incorrect page counts/index alignment for downstream consumers and is reproducible with the new test_xlsx files.
Useful? React with 👍 / 👎.
| rel_id = rel.attrib.get("Id") | ||
| target = rel.attrib.get("Target") | ||
| if rel_id and target: | ||
| self.cell_image_map[cell_image_embed_to_name[rel_id]] = target |
There was a problem hiding this comment.
Ignore unmapped relationship IDs in cell image rels
The mapping builder assumes every relationship ID in xl/_rels/cellimages.xml.rels exists in cell_image_embed_to_name, but self.cell_image_map[cell_image_embed_to_name[rel_id]] = target raises KeyError for any extra/unreferenced relationship (for example, non-image rels). Because this is inside a broad except, one such entry causes the function to return {} and all cell-image extraction for that workbook is lost.
Useful? React with 👍 / 👎.
|
Thanks for adding the functions! One issue I noticed: The fix should be a one-line change in new_list_block = {
"type": BlockType.LIST,
"attribute": list_attribute,
"content": [], # was "list_items" — needs to match what parse_list_block() expects
}The item format The same issue exists in the current codebase (pre-PR) — this PR is a good opportunity to fix it. |
|
Also noticed that there is no rich text markup support (bold, italics, etc.) and heading is all level 1. Was this an intentional design decision? |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8cd60fc5cd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (r, c) in self.math_map: | ||
| for formula in self.math_map[(r, c)]: |
There was a problem hiding this comment.
Use sheet coordinates when injecting math formulas
_map_math_formulas_to_cells() stores formula anchors in worksheet coordinates, but excel_table_to_html() looks them up with the table-local loop indices (r, c). For any table whose anchor is not (0, 0), formulas are missed (or can be attached to the wrong cell if another table shares the same local coordinates), which corrupts extracted table content for common layouts where data starts below/after headers.
Useful? React with 👍 / 👎.
| self.cur_page = [] | ||
| self.pages.append(self.cur_page) |
There was a problem hiding this comment.
Avoid appending a trailing empty page per workbook
After processing each sheet, the converter always creates and appends a new cur_page, including after the final sheet. This makes every conversion return an extra blank page (e.g., 4 sheets become 5 pages), shifting page indices and producing empty pdf_info entries downstream.
Useful? React with 👍 / 👎.
Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily get feedback. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
Checklist
Before PR:
After PR: