feat(ir): add guarded chunk policy for dynamic-bound pl.parallel/pl.range#978
Conversation
Fixes hw-native-sys#930 Introduces ChunkPolicy::Guarded as the new default for pl.parallel / pl.range / pl.unroll with chunk=C. Instead of splitting into a main kernel (N//C full chunks) plus a remainder kernel (N%C tail), Guarded emits a single outer loop over ceil(N/C) chunks with an internal `if (idx < stop)` guard, mirroring the manual workaround that users previously wrote by hand. This unblocks two problems with the split approach: 1. Cross-iteration inout accumulation (hw-native-sys#928): a single kernel keeps loop-carried state on one continuous chain. 2. Kernel count inflation in dynamic-bound parallel loops: stages no longer double into main+remainder, shortening orchestration dependency chains. Users opt back into the old split via `chunk_policy="leading_full"`. SplitChunkedLoops now dispatches on policy: - LeadingFull -> SplitLeadingFull / SplitLeadingFullWithIterArgs (renamed from SplitSimple / SplitWithIterArgs, logic unchanged). - Guarded -> SplitGuarded / SplitGuardedWithIterArgs. The iter_args variant threads loop-carried state through an IfStmt phi with an else branch that yields the inner iter_args unchanged, so guarded iterations stay on the SSA chain. Existing tests that verify LeadingFull split shape are pinned with an explicit chunk_policy="leading_full"; new tests cover Guarded across static/dynamic bounds with and without iter_args.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds a new ChunkPolicy::Guarded mode (single loop with per-iteration if-guards), switches the default chunk policy from LeadingFull to Guarded across C++/Python APIs, implements guarded chunked-loop splitting and printing, updates serialization/deserialization, and adjusts many tests to opt into the old LeadingFull behavior. Changes
Sequence Diagram(s)sequenceDiagram
participant DSL as DSL / User code
participant Builder as IRBuilder
participant For as ForStmt{chunk_config=Guarded}
participant Splitter as ChunkedLoopSplitter
participant IR as LoweredIR
participant Printer as IRPythonPrinter
DSL->>Builder: begin_for_loop(chunk_size, policy="guarded")
Builder->>For: construct ForStmt (chunk_config policy=Guarded)
For->>Splitter: VisitStmt_(ForStmt)
Splitter->>Splitter: compute n_total = ceil(trip_count / C)
Splitter->>IR: emit outer_for(sb0 += C)
IR->>IR: emit inner_for(si in [0..C))
IR->>IR: emit IfStmt( idx < stop ) around body
Splitter->>IR: thread iter_args via IfStmt returns (guarded)
IR->>Printer: IRPythonPrinter::VisitStmt_(ForStmt)
Printer->>Printer: omit chunk_policy (default Guarded) or print "leading_full" if explicit
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces the Guarded chunk policy as the new default for loop chunking, which enables a single-kernel outline by using an inner loop with an if guard. The changes span the C++ IR definitions, Python bindings, the DSL API, and the SplitChunkedLoops transformation pass. The implementation correctly handles both simple loops and those with iteration arguments (using SSA-style phi nodes via IfStmt). Review feedback highlights the need to support negative loop steps in the guard condition and to consistently pass source location information (spans) when constructing IR expression nodes.
There was a problem hiding this comment.
Pull request overview
Adds a new chunk-splitting strategy (ChunkPolicy::Guarded) and makes it the default for chunked pl.range/pl.parallel, emitting a single outlined kernel with an internal bounds guard rather than a main+remainder split. This targets correctness for cross-iteration state (iter_args / inout accumulation) and reduces kernel count under pl.auto_incore() for dynamic bounds.
Changes:
- Introduces
ChunkPolicy::Guarded(default) across C++ IR, Python bindings/stubs, DSL/parser defaults, and serialization. - Updates
SplitChunkedLoopsto dispatch on chunk policy and implements guarded splitting (with/without iter_args via IfStmt phi threading). - Retrofitts existing tests to pin
chunk_policy="leading_full"where they assert the legacy split shape, and adds new unit tests covering guarded behavior and printer defaults.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/ut/language/parser/test_error_cases.py | Pins legacy chunk policy in parser error-case fixture to keep expected behavior stable. |
| tests/ut/ir/transforms/test_split_chunked_loops.py | Adds guarded-policy unit tests; updates existing cases to explicitly request leading_full where needed. |
| tests/ut/ir/transforms/test_outline_incore_scopes.py | Makes chunk policy explicit to preserve prior outline behavior in tests. |
| tests/ut/ir/transforms/test_outline_incore_interleaved_ops.py | Makes chunk policy explicit in interleaving/interchange test programs. |
| tests/ut/ir/transforms/test_interchange_chunk_loops.py | Makes chunk policy explicit to preserve interchange expectations under new default. |
| tests/ut/ir/parser/test_parse_pl_at.py | Updates parser tests to pass explicit leading_full where the test expects legacy behavior. |
| tests/ut/codegen/test_orchestration_codegen.py | Pins leading_full in orchestration codegen test input to avoid default-policy drift. |
| tests/st/runtime/test_qwen3_decode_scope3_mixed.py | Pins leading_full in runtime test programs. |
| tests/st/runtime/test_cross_core.py | Pins leading_full in runtime test programs. |
| src/ir/transforms/split_chunked_loops_pass.cpp | Implements guarded splitting and policy dispatch; renames legacy split helpers for parity. |
| src/ir/transforms/python_printer.cpp | Omits printing chunk_policy="guarded" (new default) and prints leading_full explicitly when requested. |
| src/ir/serialization/type_deserializers.cpp | Updates chunk policy default during deserialization to Guarded. |
| python/pypto/pypto_core/ir.pyi | Adds Guarded enum value and flips default chunk policy in stubs. |
| python/pypto/language/parser/ast_parser.py | Adds guarded to accepted policies and flips parser default to guarded. |
| python/pypto/language/dsl_api.py | Flips DSL default chunk policy string to guarded and updates docstrings/overloads. |
| python/pypto/ir/builder.py | Accepts "guarded" in IRBuilder loop construction and flips default accordingly. |
| python/bindings/modules/ir.cpp | Exposes ChunkPolicy.Guarded in bindings and flips default args to Guarded. |
| python/bindings/modules/ir_builder.cpp | Flips default chunk_policy argument to Guarded in builder bindings/docs. |
| include/pypto/ir/transforms/utils/auto_name_utils.h | Adds a new auto-name qualifier for guarded/phi vars (cg). |
| include/pypto/ir/stmt.h | Adds ChunkPolicy::Guarded, updates string conversions, and flips ChunkConfig default policy. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ir/transforms/split_chunked_loops_pass.cpp`:
- Around line 678-681: The guard currently always builds MakeLt(idx_expr,
stop_expr, sp) and constructs the IfStmt (visited_body) which only works for
positive step values; update split_chunked_loops_pass so the predicate is chosen
from the compile-time step sign: when step is positive use MakeLt(idx_expr,
stop_expr, sp), when step is negative use MakeGt(idx_expr, stop_expr, sp) (and
use the resulting predicate when constructing the IfStmt), and ensure the same
change is applied at the other occurrence around lines 764–766; add a regression
test that constructs a descending chunked range (e.g., range(10, 0, -1,
chunk=...)) to verify the negative-step guarded path executes iterations
correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 5ec5ec35-5cdc-439e-b9e9-01ae73b98b83
📒 Files selected for processing (20)
include/pypto/ir/stmt.hinclude/pypto/ir/transforms/utils/auto_name_utils.hpython/bindings/modules/ir.cpppython/bindings/modules/ir_builder.cpppython/pypto/ir/builder.pypython/pypto/language/dsl_api.pypython/pypto/language/parser/ast_parser.pypython/pypto/pypto_core/ir.pyisrc/ir/serialization/type_deserializers.cppsrc/ir/transforms/python_printer.cppsrc/ir/transforms/split_chunked_loops_pass.cpptests/st/runtime/test_cross_core.pytests/st/runtime/test_qwen3_decode_scope3_mixed.pytests/ut/codegen/test_orchestration_codegen.pytests/ut/ir/parser/test_parse_pl_at.pytests/ut/ir/transforms/test_interchange_chunk_loops.pytests/ut/ir/transforms/test_outline_incore_interleaved_ops.pytests/ut/ir/transforms/test_outline_incore_scopes.pytests/ut/ir/transforms/test_split_chunked_loops.pytests/ut/language/parser/test_error_cases.py
Resolves PR review feedback for hw-native-sys#978: - SplitGuarded / SplitGuardedWithIterArgs: select guard predicate from step sign. Positive step uses `idx < stop`; negative step uses `idx > stop`. Without this, descending chunked loops (e.g. `pl.range(10, 0, -1, chunk=4)`) would have every iteration become a no-op since `idx < stop` is always false. - Pass span `sp` to all MakeAdd/MakeMul calls that construct `idx_expr` so the generated IR carries source-location info, matching the rest of the pass. - Add two regression tests: `test_guarded_negative_step` (with iter_args) and `test_guarded_negative_step_no_iter_args`.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/ir/transforms/split_chunked_loops_pass.cpp (1)
300-358: Extract the shared trip-count setup.Both policy branches recompute the same static/dynamic trip-count logic before deriving either
(n_full, n_rem)orn_total. Pulling that into a small helper would reduce drift in a correctness-sensitive code path.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ir/transforms/split_chunked_loops_pass.cpp` around lines 300 - 358, Extract the shared trip-count logic into a helper (e.g. BuildOrGetTripCount) that encapsulates the TryGetConstInt(start_expr)/TryGetConstInt(stop_expr) static path using ComputeStaticTripCount and MakeConstIndex and the dynamic path using BuildTripCountExpr (keeping span sp); have it return either an optional<int64_t> static_trip_count and/or an ExprPtr trip_count_expr so callers can derive their values. Replace the duplicated blocks that compute tc/trip_count in both the LeadingFull and Guarded branches with calls to this helper, then compute (n_full, n_rem) using MakeFloorDiv/MakeFloorMod when dynamic or using static_trip_count/ chunk_size, and compute n_total using the same helper followed by the ceil formula (MakeAdd + MakeFloorDiv) or static computation; ensure you preserve use of chunk_expr/chunk_size and the original spans and variable names (n_full, n_rem, n_total, start_c/stop_c patterns) so behavior is unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/ir/transforms/split_chunked_loops_pass.cpp`:
- Around line 300-358: Extract the shared trip-count logic into a helper (e.g.
BuildOrGetTripCount) that encapsulates the
TryGetConstInt(start_expr)/TryGetConstInt(stop_expr) static path using
ComputeStaticTripCount and MakeConstIndex and the dynamic path using
BuildTripCountExpr (keeping span sp); have it return either an optional<int64_t>
static_trip_count and/or an ExprPtr trip_count_expr so callers can derive their
values. Replace the duplicated blocks that compute tc/trip_count in both the
LeadingFull and Guarded branches with calls to this helper, then compute
(n_full, n_rem) using MakeFloorDiv/MakeFloorMod when dynamic or using
static_trip_count/ chunk_size, and compute n_total using the same helper
followed by the ceil formula (MakeAdd + MakeFloorDiv) or static computation;
ensure you preserve use of chunk_expr/chunk_size and the original spans and
variable names (n_full, n_rem, n_total, start_c/stop_c patterns) so behavior is
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d59a3e4a-819b-42bd-9f50-6b6b040bb5d5
📒 Files selected for processing (2)
src/ir/transforms/split_chunked_loops_pass.cpptests/ut/ir/transforms/test_split_chunked_loops.py
Refresh docs/en and docs/zh-cn for SplitChunkedLoops to reflect: - New `chunk_policy` parameter; `guarded` is now the default. - Both policies documented side-by-side with algorithm descriptions, before/after IR examples, and a policy-choice table (dynamic bound, kernel count, hot-loop masking). - Step-sign-aware guard: `idx < stop` for positive step, `idx > stop` for negative step. - New `cg` (chunk_guard) auto-name qualifier for IfStmt phi vars. - Removed obsolete "chunk + init_values forbidden" claim — both policies now thread iter_args through the generated loops.
Summary
Adds
ChunkPolicy::Guardedas the new default forpl.range/pl.parallelwithchunk=C. Instead of splitting a dynamic-bound chunked loop into a main + remainder kernel (leading_full), guarded mode emits a single outer loop overceil(T/C)chunks with an innerif (idx < stop)guard on the body. With iter_args, loop-carried state threads through anIfStmtphi whose else branch yields the iter_args unchanged.Fixes #930. Partially addresses #928 (cross-iteration inout accumulation under
pl.auto_incore()).Why
leading_fullunder dynamic bounds breaks cross-iteration inout accumulation ([Bug] pl.parallel remainder kernel breaks inout accumulation across chunk boundary #928): the remainder kernel receives loop-carried state as input-only copies and writes to freshly allocated output tensors.pl.auto_incore()(e.g. Qwen3-32B decode scope2 went from 9 → 13 kernels after migrating four sequential stages topl.parallel).Guarded mode keeps a single outlined kernel.
Key changes
ChunkPolicy::Guardedenum value; flippedChunkConfig::policydefault fromLeadingFulltoGuardedacross C++ / bindings / stubs / DSL.SplitChunkedLoopspass dispatches on policy. Renamed the existing code pathsSplitSimple→SplitLeadingFull,SplitWithIterArgs→SplitLeadingFullWithIterArgsfor parity with the newSplitGuarded/SplitGuardedWithIterArgsimplementations.n_total = ceil(T/C)computation; guardstart + (i_out * C + i_in) * step < stop.thenyields updated values,elseyields inner iter_args unchanged) — SSA-correct.chunk_policy="leading_full"when explicit;guardedis omitted as the default.Test plan
tests/ut/ir/transforms/test_split_chunked_loops.pypass, including 13 newTestGuardedPolicytests using Before/Expected +ir.assert_structural_equal:pl.parallelkindloop_originattrs (noChunkRemainderemitted)chunk_policy="leading_full"where they verify split shape.test_interchange_chunk_loops.pypasses (no regression from policy change).Out of scope / follow-ups
add_inout()across chunk boundary (closes [Bug] pl.parallel remainder kernel breaks inout accumulation across chunk boundary #928).LeadingFullwhen the compiler can proveT % C == 0(avoids dead guarded iterations).Fixes #930