Merged
43 changes: 40 additions & 3 deletions README.ja.md
Expand Up @@ -16,14 +16,29 @@

# ExStruct — Excel Structured Extraction Engine

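ExStruct reads Excel workbooks and outputs structured data (cells, table candidates, shapes, charts, SmartArt, print-area views) as JSON by default. It provides a CLI, a Python API, and an MCP server, with extraction options optimized for LLM/RAG preprocessing and document understanding.
ExStruct extracts Excel workbooks into structured data and also handles
patch-based editing flows through a shared core. It provides an extraction API,
a JSON-first editing CLI, and an MCP server for host-managed integration, and is
designed for LLM/RAG preprocessing, easy-to-review editing flows, and local automation.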

- In COM/Excel environments (Windows): rich extraction
- In non-COM environments (Linux/macOS):
  - with the LibreOffice runtime: cells, table candidates, shapes, and charts (best-effort)
  - otherwise: a safe fallback to cells + table candidates + print areas

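Detection heuristics and output modes can be tuned for LLM/RAG use.
Detection heuristics and output modes can be tuned for LLM/RAG use, and editing
workflows follow the same separation of responsibilities.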

## Choosing an Interface

| Use case | Recommended interface | Why |
| --- | --- | --- |
| Write Excel-editing code directly in Python | `openpyxl` / `xlwings` | These are usually the better fit for imperative Python editing. Use `exstruct.edit` only when you want to reuse ExStruct's patch contract from Python. |
| Run local operations or AI-agent edit flows | `exstruct patch` / `make` / `ops` / `validate` | The canonical operational interface. JSON-first and well suited to `dry_run`. |
| Run sandboxed / host-managed integrations | `exstruct-mcp` / MCP tools | An integration / compatibility layer that owns `PathPolicy`, transport, and artifact behavior. |

For extraction, keep using the top-level Python API (`extract`,
`process_excel`, `ExStructEngine`) and the `exstruct INPUT.xlsx ...` CLI, as before.

## Main Features

Expand All @@ -32,6 +47,7 @@ LLM/RAG 向けに検出ヒューリスティックや出力モードを調整可
- **Formula extraction**: gets `formulas_map` (formula string → cell coordinates) via openpyxl/COM. Enabled by default in `verbose`; controlled with `include_formulas_map`.
- **Formats**: JSON (compact by default, formatted with `--pretty`), YAML, and TOON (optional dependencies).
- **Backend metadata is opt-in**: shape/chart `provenance` / `approximation_level` / `confidence` are left out of serialized output by default to save tokens. Use `--include-backend-metadata` or `include_backend_metadata=True` only when you need them.
- **Workbook editing interfaces**: the primary ExStruct editing path is the editing CLI; use MCP tools when host-side control is needed; use `exstruct.edit` from Python only when you want to reuse the same patch contract.
- **Table detection tuning**: heuristics can be changed dynamically through the API.
- **Hyperlink extraction**: in `verbose` mode (or with `include_cell_links=True`), cell links are output in `links`.
- **CLI rendering** (requires Excel COM): in `standard` / `verbose`, PDF and sheet images can be generated.
Expand Down Expand Up @@ -92,12 +108,33 @@ exstruct validate --input book.xlsx --pretty
```

- `patch` / `make` print a JSON `PatchResult` to standard output.
- This editing CLI is the canonical operational / agent interface for workbook editing.
- `ops list` / `ops describe` let you check the public patch-op schema.
- `validate` returns workbook readability (`is_readable`, `warnings`, `errors`).
- Phase 2 keeps the existing extraction CLI as is, and does not yet add `exstruct extract` or interactive safety flags.

Recommended flow:

1. Build the patch ops.
2. Run `exstruct patch --dry-run` and check the `PatchResult`, warnings, and diff.
3. Pin `--backend openpyxl` if you want the dry run and the real apply to use the same engine.
4. If you use `--backend auto`, check `PatchResult.engine`. On Windows/Excel environments, the real apply may switch to COM.
5. If there are no problems, re-run without `--dry-run`.

## MCP Server (stdio)

MCP is an integration / compatibility layer that wraps the same editing core.
Use it when you need path restriction, transport mapping, artifact mirroring, or
approval-aware agent execution. For ordinary Python workbook editing,
`openpyxl` / `xlwings` are a better fit. For local shell / agent workflows,
prefer the editing CLI.

If you call `exstruct_patch` / `exstruct_make` directly only as a leftover from
when editing was MCP-first, move new local workflows to `exstruct patch` /
`exstruct make` unless you need MCP host control. Consider `exstruct.edit` only
when you want to use the same patch contract from Python.

### Quick Start with uvx (recommended)

You can run it directly without installing anything:
Expand Down Expand Up @@ -168,7 +205,7 @@ exstruct-mcp --root C:\data --log-file C:\logs\exstruct-mcp.log --on-conflict re

[MCP Server](https://harumiweb.github.io/exstruct/mcp/)

## Quick Start Python
## Quick Start Python Extraction

```python
from pathlib import Path
Expand Down
43 changes: 40 additions & 3 deletions README.md
Expand Up @@ -26,14 +26,30 @@

# ExStruct — Excel Structured Extraction Engine

ExStruct reads Excel workbooks and outputs structured data such as cells, table candidates, shapes, charts, SmartArt, and print-area views as JSON by default. It provides a CLI, a Python API, and an MCP server, with extraction options tuned for LLM/RAG preprocessing and document understanding.
ExStruct reads Excel workbooks into structured data and applies patch-based
editing workflows through a shared core. It provides extraction APIs, a
JSON-first editing CLI, and an MCP server for host-managed integrations, with
options tuned for LLM/RAG preprocessing, reviewable edit flows, and local
automation.

- In COM/Excel environments (Windows), it performs rich extraction.
- In non-COM environments (Linux/macOS):
  - if the LibreOffice runtime is available, it performs best-effort extraction for cells, table candidates, shapes, connectors, and charts
  - otherwise, it safely falls back to cells + table candidates + print areas

Detection heuristics and output modes are adjustable for LLM/RAG pipelines.
Detection heuristics, editing workflows, and output modes are adjustable for
LLM/RAG pipelines and local automation.

## Choose an Interface

| Use case | Recommended interface | Why |
| --- | --- | --- |
| Write direct Python Excel-editing code | `openpyxl` / `xlwings` | Usually the better fit for imperative Python editing. Reach for `exstruct.edit` only when you specifically want ExStruct's patch contract in Python. |
| Run local operator or AI-agent edit workflows | `exstruct patch`, `make`, `ops`, `validate` | Canonical operational interface; JSON-first and dry-run friendly. |
| Run sandboxed or host-managed integrations | `exstruct-mcp` / MCP tools | Integration / compatibility layer that owns `PathPolicy`, transport, and artifact behavior. |

Extraction keeps the existing top-level Python API (`extract`, `process_excel`,
`ExStructEngine`) and the legacy `exstruct INPUT.xlsx ...` CLI entrypoint.

## Main Features

Expand All @@ -42,6 +58,7 @@ Detection heuristics and output modes are adjustable for LLM/RAG pipelines.
- **Formula extraction**: emits `formulas_map` (formula string -> cell coordinates) via openpyxl/COM. It is enabled by default in `verbose` and can be controlled with `include_formulas_map`.
- **Formats**: JSON (compact by default, `--pretty` for formatting), YAML, and TOON (optional dependencies).
- **Backend metadata is opt-in**: shape/chart `provenance`, `approximation_level`, and `confidence` are omitted from serialized output by default. Enable them with `--include-backend-metadata` or `include_backend_metadata=True`.
- **Workbook editing interfaces**: use the editing CLI for primary ExStruct edit flows, keep MCP for host-owned safety controls, and use `exstruct.edit` only when you need the same patch contract from Python.
- **Table detection tuning**: heuristics can be adjusted dynamically through the API.
- **Hyperlink extraction**: in `verbose` mode, or with `include_cell_links=True`, cell links are emitted in `links`.
- **CLI rendering**: in `standard` / `verbose`, PDF and sheet images can be generated when Excel COM is available.
Expand Down Expand Up @@ -102,13 +119,33 @@ exstruct validate --input book.xlsx --pretty
```

- `patch` and `make` print JSON `PatchResult` to stdout.
- This is the canonical operational / agent interface for workbook editing.
- `ops list` / `ops describe` expose the public patch-op schema.
- `validate` reports workbook readability (`is_readable`, `warnings`, `errors`).
- Phase 2 keeps the legacy extraction CLI unchanged; it does not add
`exstruct extract` or interactive safety flags yet.

Recommended edit flow:

1. Build patch ops.
2. Run `exstruct patch --dry-run` and inspect `PatchResult`, warnings, and diff.
3. Pin `--backend openpyxl` when you want the dry run and the real apply to use the same engine.
4. If you keep `--backend auto`, inspect `PatchResult.engine`; on Windows/Excel hosts the real apply may switch to COM.
5. Re-run without `--dry-run` only after the result is acceptable.
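
The flow above can be sketched from Python as a thin wrapper around the CLI. This is a minimal sketch under assumptions, not the project's own tooling: `--dry-run` and `--backend` come from the steps above, `--input` is borrowed from the `validate` example, and `--ops` is an assumed flag name for the patch-ops file. The `warnings` key on the parsed `PatchResult` and a non-zero exit code on failure are also assumptions; check `exstruct patch --help` before relying on any of them.

```python
import json
import subprocess
from typing import Any


def build_patch_argv(
    input_path: str,
    ops_path: str,
    *,
    dry_run: bool,
    backend: str = "openpyxl",
) -> list[str]:
    """Build an `exstruct patch` command line.

    `--ops` is an ASSUMED flag name; `--input` is assumed to match `validate`.
    The backend is pinned to openpyxl by default so that the dry run and the
    real apply use the same engine (step 3).
    """
    argv = ["exstruct", "patch", "--input", input_path, "--ops", ops_path]
    argv += ["--backend", backend]
    if dry_run:
        argv.append("--dry-run")
    return argv


def run_patch(argv: list[str]) -> dict[str, Any]:
    """Run the CLI and parse the JSON PatchResult it prints to stdout."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


def dry_run_then_apply(input_path: str, ops_path: str) -> dict[str, Any]:
    """Steps 2 to 5: dry run, inspect the result, then re-run for real."""
    preview = run_patch(build_patch_argv(input_path, ops_path, dry_run=True))
    # ASSUMPTION: PatchResult exposes a `warnings` list; stop if it is non-empty.
    if preview.get("warnings"):
        raise RuntimeError(f"dry run reported warnings: {preview['warnings']}")
    return run_patch(build_patch_argv(input_path, ops_path, dry_run=False))
```

Treating any warning as a hard stop is a deliberately conservative choice for unattended agents; an interactive operator might print the warnings and diff and ask for confirmation instead.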

## MCP Server (stdio)

MCP is the integration / compatibility layer around the same editing core. Use
it when you need host-managed path restrictions, transport mapping, artifact
mirroring, or approval-aware agent execution. For ordinary Python workbook
editing, `openpyxl` / `xlwings` are usually a better fit. For local shell or
agent workflows, prefer the editing CLI.

If you previously used `exstruct_patch` / `exstruct_make` only because editing
was MCP-first, migrate new local workflows to `exstruct patch` or
`exstruct make` unless you specifically need MCP host controls. Consider
`exstruct.edit` only when you want the shared patch contract inside Python.

### Quick Start with `uvx` (recommended)

You can run it directly without installation:
Expand Down Expand Up @@ -179,7 +216,7 @@ MCP setup guide for each AI agent:

[MCP Server](https://harumiweb.github.io/exstruct/mcp/)

## Quick Start Python
## Quick Start Python Extraction

```python
from pathlib import Path
Expand Down
10 changes: 10 additions & 0 deletions dev-docs/specs/editing-api.md
Expand Up @@ -39,6 +39,16 @@ documented separately in `dev-docs/specs/editing-cli.md`.
- `exstruct.mcp.patch.specs`
- `exstruct.mcp.op_schema`

## Canonical usage documentation obligations

- Public docs may describe `exstruct.edit`, but should frame it as an advanced
or shared-contract surface rather than the default recommendation for Python
workbook editing.
- Public docs should state that ordinary imperative Python editing is usually
better served by `openpyxl` / `xlwings`.
- Public docs should direct shell / agent operational flows to the editing CLI
and reserve MCP for host-owned policy concerns.

## Host-only responsibilities

The following behaviors are not part of the Python editing API contract and
Expand Down
13 changes: 13 additions & 0 deletions dev-docs/specs/editing-cli.md
Expand Up @@ -14,6 +14,19 @@ This document defines the Phase 2 public editing CLI contract.
- The legacy extraction entrypoint `exstruct INPUT.xlsx ...` remains valid and
is not rewritten to `exstruct extract` in Phase 2.

## Canonical usage documentation obligations

- Public docs must describe the editing CLI as the canonical operational /
agent interface for workbook editing.
- Public docs should recommend the `dry_run -> inspect PatchResult -> apply`
workflow for edit operations, but must qualify that `backend="auto"` can
use openpyxl for the dry run and COM for the real apply on COM-capable
hosts; when same-engine comparison matters, docs should tell users to pin
`backend="openpyxl"`.
- Public docs must distinguish the local CLI from:
  - `exstruct.edit` for embedded Python usage
  - MCP for host-owned path policy, transport, and artifact behavior

## Dispatch and compatibility rules

- `exstruct.cli.main` dispatches to the editing parser only when the first
Expand Down
41 changes: 36 additions & 5 deletions docs/api.md
@@ -1,6 +1,11 @@
# API Reference

This page shows the primary APIs, minimal runnable examples, expected outputs, and the dependencies required for optional features. Hyperlinks are included when `include_cell_links=True` (or when using `mode="verbose"`). Auto page-break areas are COM-only and appear when auto page-break extraction/output is enabled (CLI exposes the option only when COM is available).
This page shows the primary APIs, minimal runnable examples, expected outputs,
and the dependencies required for optional features. Hyperlinks are included
when `include_cell_links=True` (or when using `mode="verbose"`). Auto
page-break areas are COM-only and appear when auto page-break
extraction/output is enabled (CLI exposes the option only when COM is
available).

## TOC

Expand All @@ -16,7 +21,7 @@ This page shows the primary APIs, minimal runnable examples, expected outputs, a
- [Editing functions](#editing-functions)
- [Engine and options](#engine-and-options)
- [Models](#models)
- [Model helpers SheetData / WorkbookData](#model-helpers-sheetdata--workbookdata)
- [Model helpers for SheetData and WorkbookData](#model-helpers-for-sheetdata-and-workbookdata)
- [Error Handling](#error-handling)
- [Tuning Examples](#tuning-examples)

Expand Down Expand Up @@ -78,8 +83,11 @@ process_excel(

## Editing API

Phase 1 exposes workbook editing as a first-class Python package under
`exstruct.edit`.
ExStruct also exposes workbook editing under `exstruct.edit`, but this is a
secondary surface. If you are writing Python code to edit Excel directly,
`openpyxl` / `xlwings` are usually simpler choices. Reach for `exstruct.edit`
when you specifically want the same patch contract used by ExStruct's CLI and
MCP integration layer.

```python
from pathlib import Path
Expand Down Expand Up @@ -108,6 +116,29 @@ Key points:
- The matching operational CLI is `exstruct patch`, `exstruct make`,
`exstruct ops`, and `exstruct validate`.

Backend capability guide:

| Backend | Use it for | Notes |
| --- | --- | --- |
| `openpyxl` | Basic cell/style/layout edits, plus `dry_run`, `return_inverse_ops`, and `preflight_formula_check` flows | Pure Python path. Not valid for `.xls`, and not for COM-only ops such as `create_chart`. |
| `com` | Highest-fidelity workbook editing, `.xls`, and COM-only ops such as `create_chart` | Requires Excel COM. Rejects `dry_run`, `return_inverse_ops`, and `preflight_formula_check`. |
| `auto` | Default mixed workflow | Resolves to the best supported backend for the request. `dry_run`, `return_inverse_ops`, and `preflight_formula_check` force the openpyxl path even on COM-capable hosts, so inspect `PatchResult.engine` before assuming the same engine will run the real apply. |

Known editing limits:

- `create_chart` requires the COM-backed path.
- `.xls` editing requires COM.
- `exstruct.edit` does not own `PathPolicy`, artifact mirroring, or host
approval flows.
- Existing MCP compatibility imports remain valid.
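
The backend table and the limits above can be read as a small decision procedure. The sketch below models those rules in plain Python so they can be checked without Excel installed. It is an illustration, not ExStruct's resolver: `resolve_backend` is a hypothetical helper, only `create_chart` is named in the text as COM-only, and two behaviors are assumptions (that `auto` prefers COM whenever it is available, and that a request needing both COM and an openpyxl-only flag is rejected).

```python
from pathlib import PurePath

# Flags that the table says COM rejects and that force openpyxl under "auto".
OPENPYXL_ONLY_FLAGS = frozenset(
    {"dry_run", "return_inverse_ops", "preflight_formula_check"}
)
# Only `create_chart` is named in the text as COM-only; other ops may exist.
COM_ONLY_OPS = frozenset({"create_chart"})


def resolve_backend(
    backend: str,
    workbook: str,
    ops: list[str],
    flags: set[str],
    com_available: bool,
) -> str:
    """Return "openpyxl" or "com" for a request, or raise ValueError."""
    is_xls = PurePath(workbook).suffix.lower() == ".xls"
    needs_com = is_xls or bool(COM_ONLY_OPS.intersection(ops))
    needs_openpyxl = bool(OPENPYXL_ONLY_FLAGS.intersection(flags))

    if needs_com and needs_openpyxl:
        # ASSUMPTION: no backend in the table can satisfy both constraints.
        raise ValueError("request needs COM but uses an openpyxl-only flag")
    if backend == "openpyxl":
        if needs_com:
            raise ValueError("openpyxl cannot handle .xls or COM-only ops")
        return "openpyxl"
    if backend == "com":
        if not com_available:
            raise ValueError("Excel COM is not available on this host")
        if needs_openpyxl:
            raise ValueError("COM rejects dry_run / inverse ops / preflight")
        return "com"
    if backend == "auto":
        if needs_openpyxl:
            return "openpyxl"
        if com_available:
            # ASSUMPTION: "auto" prefers COM whenever it is available.
            return "com"
        if needs_com:
            raise ValueError("request needs COM, which is not available")
        return "openpyxl"
    raise ValueError(f"unknown backend: {backend!r}")
```

For example, `resolve_backend("auto", "book.xlsx", [], {"dry_run"}, com_available=True)` yields `"openpyxl"`, while the same call without the `dry_run` flag yields `"com"`. That gap is why the text recommends inspecting `PatchResult.engine` rather than assuming both runs share an engine.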

For local shell or AI-agent edit workflows, prefer the CLI so you can do
`dry_run -> inspect PatchResult -> apply` with an explicit backend. Use
`backend="openpyxl"` when you want the dry run and the real apply to exercise
the same engine. With `backend="auto"`, dry runs resolve to openpyxl while the
real apply may switch to COM on Windows/Excel hosts. For restricted hosts, use
the MCP server, which wraps the same core and adds host policy.
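
To make the engine caveat concrete, here is a small sketch that compares the `engine` field of a dry-run result with that of the real apply. It works on plain dictionaries, such as a `PatchResult` parsed from the CLI's JSON output. `engine_mismatch_note` is a hypothetical helper, and the assumption that `engine` serializes to strings like `"openpyxl"` and `"com"` is not confirmed by this page.

```python
from typing import Any, Optional


def engine_mismatch_note(
    dry_run_result: dict[str, Any],
    apply_result: dict[str, Any],
) -> Optional[str]:
    """Return a note when the two runs used different engines, else None.

    ASSUMPTION: both dicts are parsed PatchResult payloads with an `engine` key.
    """
    dry_engine = dry_run_result.get("engine")
    apply_engine = apply_result.get("engine")
    if dry_engine == apply_engine:
        return None
    return (
        f"dry run used {dry_engine!r} but the real apply used {apply_engine!r}; "
        'pin backend="openpyxl" if the two runs must match'
    )
```

A caller might log the note next to the diff so a reviewer knows the preview and the applied change came from different engines.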

## Dependencies

- Core extraction: pandas, openpyxl (installed with the package).
Expand Down Expand Up @@ -240,7 +271,7 @@ Python APIの最新情報は以下の自動生成セクションを参照して

See generated/models.md for the detailed model fields (run `python scripts/gen_model_docs.py` to refresh).

### Model helpers (SheetData / WorkbookData)
### Model helpers for SheetData and WorkbookData

- `to_json(pretty=False, indent=None, include_backend_metadata=False)` → JSON string (pretty when requested)
- `to_yaml(include_backend_metadata=False)` → YAML string (requires `pyyaml`)
Expand Down