refactor(ir): extract OptimizeOrchTensors pass and simplify ConvertTensorToTileOps#969
Conversation
…OrchTensors - Extract optimize_orch_tensors_pass.cpp as standalone pass with 3 pattern classes: IterArgReuseOptimizer, AssembleParentStridesOptimizer, AssembleLoopRewriter (previously interleaved free functions) - Remove dead Pattern 3 (loop hoisting) code from OptimizeOrchTensors - Restore MatmulSlice handling in ConvertTensorToTileOps (was accidentally removed — produces tile.load(Mat, transpose=...) for slice→matmul patterns) - Remove alias analysis (~390 lines) from ConvertTensorToTileOps as it violates single-responsibility; direction inference handled by OptimizeOrchTensors Pattern 1 - Replace local VarUseVisitor with common var_collectors::VarDefUseCollector - Add documentation for both passes (en + zh-cn) and renumber pass docs to match pipeline execution order
- Add IncoreTileOps to required/produced in kOptimizeOrchTensorsProperties to enforce correct pass ordering (must run after ConvertTensorToTileOps) - Renumber Pattern 4 → Pattern 3 in docs (old Pattern 3 was removed) - Fix type stub docstring to mention both orchestration and InCore functions
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR extracts orchestration-level buffer/layout optimizations from ConvertTensorToTileOps into a new OptimizeOrchTensors pass, adds its implementation, docs, Python bindings, tests, and updates pass ordering and registration; ConvertTensorToTileOps is simplified accordingly. Changes
Sequence Diagram(s)sequenceDiagram
participant PM as Program
participant PassMgr as PassManager
participant OptPass as OptimizeOrchTensors
participant ConvPass as ConvertTensorToTileOps
participant IC as InCoreFunction
participant Orch as Orchestration
rect rgba(200,230,255,0.5)
PM->>PassMgr: request optimization (Default)
PassMgr->>ConvPass: run ConvertTensorToTileOps
ConvPass->>IC: convert tensor.* → tile.* (matmul-slice prescan)
end
rect rgba(200,255,200,0.5)
PassMgr->>OptPass: run OptimizeOrchTensors
OptPass->>Orch: analyze tensor.create / assemble patterns in orchestration
OptPass->>IC: rewrite InCore signatures (merge Out→InOut, update TensorView strides)
OptPass->>Orch: update callsites (remove tensor.create, adjust calls)
OptPass->>IC: rewrite assemble-loops → tile.store
end
PM->>PM: resulting optimized IR
Estimated code review effort🎯 4 (Complex) | ⏱️ ~85 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the OptimizeOrchTensors pass, which optimizes tensor buffer usage between orchestration and InCore functions through three patterns: iter-arg reuse, assemble parent stride propagation, and assemble-loop rewriting. It also refactors ConvertTensorToTileOps by moving cross-function analyses to this new pass. The feedback focuses on improving the robustness of IR traversals in the new pass, specifically recommending the use of shared utilities like VarDefUseCollector to ensure nested control flow structures are correctly handled during variable use analysis and loop optimization.
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
tests/ut/ir/transforms/test_optimize_orch_tensors.py (1)
21-236: Add aWhileStmtregression for Pattern 1.The new pass docs and analyzer cover both
ForStmtandWhileStmt, but this suite only exercisespl.rangeloops. A smallwhilefixture would guard theVisitStmt_(const WhileStmtPtr&)path from drifting unnoticed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/ut/ir/transforms/test_optimize_orch_tensors.py` around lines 21 - 236, Add a new unit test method in the TestIterArgReuse class (e.g., test_while_iter_arg_reuse) that mirrors one of the existing ForStmt cases but uses a WhileStmt-style loop in the main function to exercise the VisitStmt_(const WhileStmtPtr&) path; keep the same main_incore_0 signature and expected transformation (Out param merged into InOut), run After = passes.optimize_orch_tensors()(Before) and call ir.assert_structural_equal(After, Expected) so the optimize_orch_tensors pass and the main/main_incore_0 pair are validated for a while-loop iter-arg scenario.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/en/dev/passes/10-optimize_orch_tensors.md`:
- Around line 125-131: Update the pass contract table to include IncoreTileOps:
add IncoreTileOps to the Required column alongside SplitIncoreOrch and to the
Produced (or indicate as preserved) column so the doc reflects that the pass
requires and preserves IncoreTileOps to enforce ordering after
ConvertTensorToTileOps; reference the IncoreTileOps symbol and
ConvertTensorToTileOps in the brief note so readers understand why the contract
changed.
In `@docs/zh-cn/dev/passes/09-convert_tensor_to_tile_ops.md`:
- Around line 48-49: Update the docs to remove the claim that the assemble-loop
rewrite is performed in ConvertTensorToTileOps and instead point readers to the
later pass that performs it: change the sentence referencing "如果返回值来自
`tile.assemble` 循环,则将循环重写为直接使用 `tile.store`(assemble-loop 重写)" so it no longer
attributes the rewrite to `ConvertTensorToTileOps`, and add a brief note
directing readers to the `OptimizeOrchTensors` pass (and the term "assemble-loop
重写") as the component that actually performs this loop rewrite.
In `@docs/zh-cn/dev/passes/10-optimize_orch_tensors.md`:
- Around line 125-131: The Pass 属性表 is missing IncoreTileOps: update the table
so both the Required and Produced entries list the full set {SplitIncoreOrch,
IncoreTileOps} (i.e., replace the single "SplitIncoreOrch" value with the pair
that matches the implementation), ensuring the Pass attributes for the pass
named/represented by SplitIncoreOrch and IncoreTileOps in this document reflect
the actual implementation.
In `@src/ir/transforms/optimize_orch_tensors_pass.cpp`:
- Around line 1076-1124: The loop rewrite assumes op->iter_args_[0]->initValue_
is a disposable tensor created by tensor.create but doesn't verify it; update
VisitStmt_(const ForStmtPtr& op) to check that op->iter_args_[0]->initValue_ is
an Assign/Var whose defining statement is a tensor.create before inserting into
dead_create_vars_. Specifically, locate the use of iter_args_[0]->initValue_
(and the code that sets init_var and dead_create_vars_) and: 1) resolve
initValue_ to its defining AssignStmt/Call and confirm the call->op_->name_ ==
"tensor.create"; 2) optionally ensure the created Var has no other uses (reuse
StmtUsesVar or similar) before marking it dead; only then insert init_var.get()
into dead_create_vars_, otherwise skip marking it.
- Around line 326-329: The current code stores reuse_result into results_ keyed
only by fname causing the first matching caller to overwrite the callee
signature for all call sites; change this by either (A) verifying that every
call site of the callee yields the identical reuse_result.merges set before
writing to results_ (use the module's call graph / iterate all callers and
compare merge sets) and only then store results_[fname] = reuse_result and call
RewriteIncore()/CallSiteRewriter(), or (B) avoid global mutation and keep the
rewrite local: do not insert into results_ for fname, instead record
per-callsite rewrites (keyed by the caller or CallSite id) and only modify that
caller/callsite; update the logic around results_, reuse_result,
RewriteIncore(), and CallSiteRewriter() accordingly (same change also for the
analogous block at lines ~460-479).
---
Nitpick comments:
In `@tests/ut/ir/transforms/test_optimize_orch_tensors.py`:
- Around line 21-236: Add a new unit test method in the TestIterArgReuse class
(e.g., test_while_iter_arg_reuse) that mirrors one of the existing ForStmt cases
but uses a WhileStmt-style loop in the main function to exercise the
VisitStmt_(const WhileStmtPtr&) path; keep the same main_incore_0 signature and
expected transformation (Out param merged into InOut), run After =
passes.optimize_orch_tensors()(Before) and call
ir.assert_structural_equal(After, Expected) so the optimize_orch_tensors pass
and the main/main_incore_0 pair are validated for a while-loop iter-arg
scenario.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0a159f65-9ebf-4c15-b0a6-b8fce2ed2b04
📒 Files selected for processing (30)
.claude/rules/pass-doc-ordering.mdCMakeLists.txtdocs/en/dev/passes/09-convert_tensor_to_tile_ops.mddocs/en/dev/passes/10-optimize_orch_tensors.mddocs/en/dev/passes/11-flatten_tile_nd_to_2d.mddocs/en/dev/passes/14-expand_mixed_kernel.mddocs/en/dev/passes/15-init_memref.mddocs/en/dev/passes/16-memory_reuse.mddocs/en/dev/passes/17-allocate_memory_addr.mddocs/en/dev/passes/90-insert_sync.mddocs/en/dev/passes/91-utility_passes.mddocs/zh-cn/dev/passes/09-convert_tensor_to_tile_ops.mddocs/zh-cn/dev/passes/10-optimize_orch_tensors.mddocs/zh-cn/dev/passes/11-flatten_tile_nd_to_2d.mddocs/zh-cn/dev/passes/14-expand_mixed_kernel.mddocs/zh-cn/dev/passes/15-init_memref.mddocs/zh-cn/dev/passes/16-memory_reuse.mddocs/zh-cn/dev/passes/17-allocate_memory_addr.mddocs/zh-cn/dev/passes/90-insert_sync.mddocs/zh-cn/dev/passes/91-utility_passes.mdinclude/pypto/ir/transforms/pass_properties.hinclude/pypto/ir/transforms/passes.hpython/bindings/modules/passes.cpppython/pypto/ir/pass_manager.pypython/pypto/pypto_core/passes.pyisrc/ir/transforms/convert_tensor_to_tile_ops_pass.cppsrc/ir/transforms/optimize_orch_tensors_pass.cpptests/ut/ir/transforms/test_convert_tensor_to_tile_ops.pytests/ut/ir/transforms/test_optimize_orch_tensors.pytests/ut/ir/transforms/test_pass_manager.py
There was a problem hiding this comment.
Pull request overview
This PR refactors the IR transform pipeline by extracting cross-function buffer/tensor optimizations out of ConvertTensorToTileOps into a new standalone OptimizeOrchTensors pass, leaving ConvertTensorToTileOps as a largely mechanical tensor→tile lowering and call-site update pass. It updates C++/Python pass registration, pass-manager ordering, adds focused unit tests for the new pass, and renumbers/extends pass documentation to match the updated pipeline.
Changes:
- Introduce
OptimizeOrchTensors(new pass) implementing iter-arg reuse, assemble parent-stride propagation, and assemble-loop rewrite. - Simplify
ConvertTensorToTileOpsby removing iter-arg/assemble analyses and parameter-direction inference, while keeping conversion-time MatmulSlice handling. - Update bindings, pass manager, tests, and English/Chinese pass docs (including doc ordering renumbering).
Reviewed changes
Copilot reviewed 16 out of 30 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/ir/transforms/convert_tensor_to_tile_ops_pass.cpp |
Removes bundled buffer optimizations/analyses; keeps core tensor→tile conversion and assemble-loop conversion behavior. |
src/ir/transforms/optimize_orch_tensors_pass.cpp |
New pass implementing the extracted orchestration/InCore tensor buffer optimizations (Patterns 1–3). |
include/pypto/ir/transforms/passes.h |
Declares the new OptimizeOrchTensors pass and documents intended behavior/order. |
include/pypto/ir/transforms/pass_properties.h |
Adds pass properties for OptimizeOrchTensors to enforce correct pipeline ordering. |
python/bindings/modules/passes.cpp |
Exposes optimize_orch_tensors in Python bindings with a pass description. |
python/pypto/ir/pass_manager.py |
Registers OptimizeOrchTensors in the default pipeline ordering after ConvertTensorToTileOps. |
python/pypto/pypto_core/passes.pyi |
Adds stub for optimize_orch_tensors and exports it via __all__. |
CMakeLists.txt |
Adds the new C++ pass source to the build. |
tests/ut/ir/transforms/test_pass_manager.py |
Updates expected pass order to include OptimizeOrchTensors. |
tests/ut/ir/transforms/test_optimize_orch_tensors.py |
New unit tests covering the three OptimizeOrchTensors patterns and edge cases. |
tests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py |
Updates/realigns tests to validate “naive” ConvertTensorToTileOps behavior post-extraction. |
docs/en/dev/passes/09-convert_tensor_to_tile_ops.md |
New/updated documentation describing ConvertTensorToTileOps’ simplified responsibilities. |
docs/en/dev/passes/10-optimize_orch_tensors.md |
New documentation for OptimizeOrchTensors patterns and intended placement. |
docs/en/dev/passes/11-flatten_tile_nd_to_2d.md |
Renumbered/added doc to match pipeline ordering. |
docs/en/dev/passes/14-expand_mixed_kernel.md |
Renumbered doc to match pipeline ordering. |
docs/en/dev/passes/15-init_memref.md |
Renumbered doc to match pipeline ordering. |
docs/en/dev/passes/16-memory_reuse.md |
Renumbered doc to match pipeline ordering. |
docs/en/dev/passes/17-allocate_memory_addr.md |
Renumbered doc to match pipeline ordering. |
docs/en/dev/passes/90-insert_sync.md |
Added/renumbered documentation for InsertSync. |
docs/en/dev/passes/91-utility_passes.md |
Added/renumbered documentation for utility passes. |
docs/zh-cn/dev/passes/09-convert_tensor_to_tile_ops.md |
Chinese documentation for ConvertTensorToTileOps aligned with the refactor. |
docs/zh-cn/dev/passes/10-optimize_orch_tensors.md |
Chinese documentation for OptimizeOrchTensors aligned with the refactor. |
docs/zh-cn/dev/passes/11-flatten_tile_nd_to_2d.md |
Chinese doc added/renumbered to match pipeline ordering. |
docs/zh-cn/dev/passes/14-expand_mixed_kernel.md |
Chinese doc added/renumbered to match pipeline ordering. |
docs/zh-cn/dev/passes/15-init_memref.md |
Chinese doc added/renumbered to match pipeline ordering. |
docs/zh-cn/dev/passes/16-memory_reuse.md |
Chinese doc added/renumbered to match pipeline ordering. |
docs/zh-cn/dev/passes/17-allocate_memory_addr.md |
Chinese doc added/renumbered to match pipeline ordering. |
docs/zh-cn/dev/passes/90-insert_sync.md |
Chinese documentation for InsertSync added. |
docs/zh-cn/dev/passes/91-utility_passes.md |
Chinese documentation for utility passes added. |
.claude/rules/pass-doc-ordering.md |
Updates the doc numbering map to reflect the new pipeline order. |
…ents Restore UpgradeWrittenTensorParamDirections in ConvertTensorToTileOps that was accidentally removed during the OptimizeOrchTensors extraction. This fixes CI failures where InCore params written via tile.store were not upgraded from In to Out/InOut, causing codegen to emit add_input instead of add_output/add_inout. Also addresses PR review feedback: - Replace manual StmtUsesVar with VarDefUseCollector (handles all stmts) - Fix pass properties tables in docs (add IncoreTileOps) - Clarify assemble-loop rewrite attribution in docs - Remove "future" qualifier from Pattern 2 test docstring - Update test expectation for tensor.write param direction
There was a problem hiding this comment.
♻️ Duplicate comments (2)
src/ir/transforms/optimize_orch_tensors_pass.cpp (2)
327-330:⚠️ Potential issue | 🔴 CriticalDon't globalize a reuse decision from the first matching caller.
results_is keyed only by callee name, so the first loop that matches rewrites thatInCoresignature for every call site. Any later caller that does not satisfy the samemergesset will still lose thoseOutargs and start reusing/mutating its input buffer in place. Please either prove that all callers agree on the exact merge set before storing here, or keep this optimization scoped to the matching call sites only.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ir/transforms/optimize_orch_tensors_pass.cpp` around lines 327 - 330, The current code stores reuse_result into results_ keyed only by fname when reuse_result.merges is non-empty, which globalizes a caller-specific merge decision; change this so we either (A) validate that every call site for fname has an identical reuse_result.merges before writing into results_ (iterate call sites and compare merges sets) or (B) avoid modifying results_ and instead apply reuse_result only to the specific matching call site(s) (i.e., keep the optimization scoped locally rather than storing results_[fname] = std::move(reuse_result)); use the existing symbols results_, reuse_result, fname and the merges set to implement the chosen fix.
1077-1080:⚠️ Potential issue | 🔴 CriticalOnly rewrite/delete a loop seed when it is a dead
tensor.create.This matcher currently switches the iter-arg init to the
Outparam for any var-like seed and immediately records that seed indead_create_vars_. If the seed comes from a real tensor value, or is referenced later, the rewrite changes semantics andLoopRewriteMutatorwill also erase a non-tensor.createassignment. Please require atensor.createdefinition and prove the seed has no remaining uses before inserting it into the dead set.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ir/transforms/optimize_orch_tensors_pass.cpp` around lines 1077 - 1080, The code currently flips the loop iter-arg init to the Out param and unconditionally inserts the seed Var (init_var from As<Var>(op->iter_args_[0]->initValue_)) into dead_create_vars_, which can rewrite non-tensor.create seeds; change this to only perform the rewrite and insert into dead_create_vars_ when the seed is proven to be a tensor.create with no remaining uses. Concretely: after obtaining init_var, resolve its defining expression (the Var's definition) and verify it's a tensor.create call (e.g., the defining node is a Call/Invoke of "tensor.create" or equivalent), and then prove the Var has zero remaining uses before mutating iter_args_[0]->initValue_ and inserting init_var.get() into dead_create_vars_; otherwise leave the iter-arg and do not record the var so LoopRewriteMutator won't erase non-tensor.create assignments.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/ir/transforms/optimize_orch_tensors_pass.cpp`:
- Around line 327-330: The current code stores reuse_result into results_ keyed
only by fname when reuse_result.merges is non-empty, which globalizes a
caller-specific merge decision; change this so we either (A) validate that every
call site for fname has an identical reuse_result.merges before writing into
results_ (iterate call sites and compare merges sets) or (B) avoid modifying
results_ and instead apply reuse_result only to the specific matching call
site(s) (i.e., keep the optimization scoped locally rather than storing
results_[fname] = std::move(reuse_result)); use the existing symbols results_,
reuse_result, fname and the merges set to implement the chosen fix.
- Around line 1077-1080: The code currently flips the loop iter-arg init to the
Out param and unconditionally inserts the seed Var (init_var from
As<Var>(op->iter_args_[0]->initValue_)) into dead_create_vars_, which can
rewrite non-tensor.create seeds; change this to only perform the rewrite and
insert into dead_create_vars_ when the seed is proven to be a tensor.create with
no remaining uses. Concretely: after obtaining init_var, resolve its defining
expression (the Var's definition) and verify it's a tensor.create call (e.g.,
the defining node is a Call/Invoke of "tensor.create" or equivalent), and then
prove the Var has zero remaining uses before mutating iter_args_[0]->initValue_
and inserting init_var.get() into dead_create_vars_; otherwise leave the
iter-arg and do not record the var so LoopRewriteMutator won't erase
non-tensor.create assignments.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f2e76331-633b-4523-afc4-a41199c0512b
📒 Files selected for processing (7)
docs/en/dev/passes/09-convert_tensor_to_tile_ops.mddocs/en/dev/passes/10-optimize_orch_tensors.mddocs/zh-cn/dev/passes/09-convert_tensor_to_tile_ops.mddocs/zh-cn/dev/passes/10-optimize_orch_tensors.mdsrc/ir/transforms/convert_tensor_to_tile_ops_pass.cppsrc/ir/transforms/optimize_orch_tensors_pass.cpptests/ut/ir/transforms/test_convert_tensor_to_tile_ops.py
✅ Files skipped from review due to trivial changes (4)
- docs/en/dev/passes/10-optimize_orch_tensors.md
- docs/en/dev/passes/09-convert_tensor_to_tile_ops.md
- docs/zh-cn/dev/passes/10-optimize_orch_tensors.md
- docs/zh-cn/dev/passes/09-convert_tensor_to_tile_ops.md
Summary
Addresses #962 (Approach A — post-pass) by extracting buffer optimization patterns from
ConvertTensorToTileOpsinto a standaloneOptimizeOrchTensorspass.OptimizeOrchTensorsas a new pass with 3 pattern classes:IterArgReuseOptimizer(Pattern 1): merges Out→InOut for loop-carried buffersAssembleParentStridesOptimizer(Pattern 2): attaches parent-tensor strides via TensorViewAssembleLoopRewriter(Pattern 3): rewrites tile.assemble loops to tile.store loopsConvertTensorToTileOps: remove alias analysis (~390 lines), remove iter-arg mapping, remove IfStmt store sinking — now purely mechanical tensor→tile conversiontile.load(Mat, transpose=...)for slice→matmul patternsVarUseVisitorwith sharedvar_collectors::VarDefUseCollectorIncoreTileOpstoOptimizeOrchTensorspass properties (required + produced) to enforce correct orderingCross-layer changes
convert_tensor_to_tile_ops_pass.cpp,optimize_orch_tensors_pass.cpp(new)passes.h,pass_properties.hCMakeLists.txtpasses.cpppasses.pyipass_manager.pytest_convert_tensor_to_tile_ops.py,test_optimize_orch_tensors.py(new),test_pass_manager.py09-convert_tensor_to_tile_ops.md(new),10-optimize_orch_tensors.md(new), doc renumberingTesting
Related Issues
Addresses #962