feat(ir): add guarded chunk policy for dynamic-bound pl.parallel/pl.range by Hzfengsy · Pull Request #978 · hw-native-sys/pypto

Hzfengsy · 2026-04-12T07:03:23Z

Summary

Adds ChunkPolicy::Guarded as the new default for pl.range/pl.parallel with chunk=C. Instead of splitting a dynamic-bound chunked loop into a main + remainder kernel (leading_full), guarded mode emits a single outer loop over ceil(T/C) chunks with an inner if (idx < stop) guard on the body. With iter_args, loop-carried state threads through an IfStmt phi whose else branch yields the iter_args unchanged.
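
The guarded lowering described in this summary can be modelled in plain Python. This is an illustrative sketch, not the IR the pass emits: the function names `reference_indices` and `guarded_indices` are hypothetical, and a positive step is assumed.

```python
def reference_indices(start, stop, step):
    """Indices visited by the original, unchunked loop."""
    return list(range(start, stop, step))


def guarded_indices(start, stop, step, chunk):
    """Indices visited by a single guarded loop nest (positive step assumed)."""
    # ceil((stop - start) / step), clamped at zero for empty ranges
    trip_count = max(0, -(-(stop - start) // step))
    # ceil(T / C) outer chunks
    n_total = -(-trip_count // chunk)
    visited = []
    for i_out in range(n_total):
        for i_in in range(chunk):
            idx = start + (i_out * chunk + i_in) * step
            if idx < stop:  # the guard masks out-of-range tail iterations
                visited.append(idx)
    return visited
```

Under these assumptions the guarded nest visits exactly the indices of the original loop, including when the trip count is not a multiple of the chunk size.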

Fixes #930. Partially addresses #928 (cross-iteration inout accumulation under pl.auto_incore()).

Why

leading_full under dynamic bounds breaks cross-iteration inout accumulation (#928, "[Bug] pl.parallel remainder kernel breaks inout accumulation across chunk boundary"): the remainder kernel receives loop-carried state as input-only copies and writes to freshly allocated output tensors.
It also doubles kernel count under pl.auto_incore() (e.g. Qwen3-32B decode scope2 went from 9 → 13 kernels after migrating four sequential stages to pl.parallel).

Guarded mode keeps a single outlined kernel.
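
For contrast, the leading_full split that guarded mode replaces can be pictured as two separate loops. The sketch below is a plain Python model of iteration order only; it does not model kernels, outlining, or tensor allocation, and `leading_full_split` is a hypothetical name.

```python
def leading_full_split(trip_count, chunk):
    """Return (main, remainder) iteration indices for a leading_full split."""
    n_full, n_rem = divmod(trip_count, chunk)
    # Main loop: n_full complete chunks of size `chunk`.
    main = [i_out * chunk + i_in
            for i_out in range(n_full)
            for i_in in range(chunk)]
    # Remainder loop: the n_rem leftover iterations, emitted as a second loop.
    remainder = [n_full * chunk + i_in for i_in in range(n_rem)]
    return main, remainder
```

Whenever the trip count is not a multiple of the chunk size, the second list is non-empty, which corresponds to the extra remainder kernel discussed above.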

Key changes

- New ChunkPolicy::Guarded enum value; flipped ChunkConfig::policy default from LeadingFull to Guarded across C++ / bindings / stubs / DSL.
- SplitChunkedLoops pass dispatches on policy. Renamed the existing code paths SplitSimple → SplitLeadingFull and SplitWithIterArgs → SplitLeadingFullWithIterArgs for parity with the new SplitGuarded / SplitGuardedWithIterArgs implementations.
- Static and dynamic computation of n_total = ceil(T/C); the guard condition is start + (i_out * C + i_in) * step < stop.
- IfStmt phi threads iter_args through both branches (then yields updated values, else yields inner iter_args unchanged), which keeps the IR SSA-correct.
- Printer emits chunk_policy="leading_full" when explicit; guarded is omitted as the default.
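
The iter_args threading can be sketched in plain Python with a single loop-carried accumulator. This is a hedged illustration of the then/else phi idea, assuming start = 0 and step = 1; `guarded_sum` and the variable names are hypothetical and do not come from the pass.

```python
def guarded_sum(values, stop, chunk):
    """Sum values[0:stop] with a guarded chunked loop and one loop-carried value."""
    n_total = (stop + chunk - 1) // chunk  # ceil(T / C) for start=0, step=1
    acc = 0                                # initial value of the iter_arg
    for i_out in range(n_total):
        inner_acc = acc                    # inner loop's iter_arg
        for i_in in range(chunk):
            idx = i_out * chunk + i_in
            if idx < stop:
                phi = inner_acc + values[idx]  # then: yield the updated value
            else:
                phi = inner_acc                # else: yield the iter_arg unchanged
            inner_acc = phi                    # phi result feeds the next iteration
        acc = inner_acc                    # outer loop yields the inner result
    return acc
```

Because the else branch forwards the incoming value, masked tail iterations leave the accumulator untouched and the carried state stays on one chain.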

Test plan

36 unit tests in tests/ut/ir/transforms/test_split_chunked_loops.py pass, including 13 new TestGuardedPolicy tests using Before/Expected + ir.assert_structural_equal:
- default policy selection, static divisible/non-divisible/trip<chunk with iter_args
- no iter_args (static + dynamic)
- non-unit step, pl.parallel kind
- dynamic stop, dynamic start+stop
- nested guarded loops (inner guard inside outer's then-branch)
- loop_origin attrs (no ChunkRemainder emitted)
- printer default omission
Existing tests retrofitted with explicit chunk_policy="leading_full" where they verify split shape.
test_interchange_chunk_loops.py passes (no regression from policy change).
Pre-commit hooks: clang-format, cpplint, ruff, pyright, headers, english-only all pass.
System tests skipped per user direction.

Out of scope / follow-ups

Orchestration-level confirmation that guarded output preserves add_inout() across chunk boundary (closes #928, "[Bug] pl.parallel remainder kernel breaks inout accumulation across chunk boundary").
Pass-level analysis to auto-select LeadingFull when the compiler can prove T % C == 0 (avoids dead guarded iterations).

Fixes hw-native-sys#930

Introduces ChunkPolicy::Guarded as the new default for pl.parallel / pl.range / pl.unroll with chunk=C. Instead of splitting into a main kernel (N//C full chunks) plus a remainder kernel (N%C tail), Guarded emits a single outer loop over ceil(N/C) chunks with an internal `if (idx < stop)` guard, mirroring the manual workaround that users previously wrote by hand.

This unblocks two problems with the split approach:
1. Cross-iteration inout accumulation (hw-native-sys#928): a single kernel keeps loop-carried state on one continuous chain.
2. Kernel count inflation in dynamic-bound parallel loops: stages no longer double into main+remainder, shortening orchestration dependency chains.

Users opt back into the old split via `chunk_policy="leading_full"`.

SplitChunkedLoops now dispatches on policy:
- LeadingFull -> SplitLeadingFull / SplitLeadingFullWithIterArgs (renamed from SplitSimple / SplitWithIterArgs, logic unchanged).
- Guarded -> SplitGuarded / SplitGuardedWithIterArgs. The iter_args variant threads loop-carried state through an IfStmt phi with an else branch that yields the inner iter_args unchanged, so guarded iterations stay on the SSA chain.

Existing tests that verify LeadingFull split shape are pinned with an explicit chunk_policy="leading_full"; new tests cover Guarded across static/dynamic bounds with and without iter_args.

coderabbitai · 2026-04-12T07:03:41Z

Walkthrough

This PR adds a new ChunkPolicy::Guarded mode (single loop with per-iteration if-guards), switches the default chunk policy from LeadingFull to Guarded across C++/Python APIs, implements guarded chunked-loop splitting and printing, updates serialization/deserialization, and adjusts many tests to opt into the old LeadingFull behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core IR Definitions: `include/pypto/ir/stmt.h`, `include/pypto/ir/transforms/utils/auto_name_utils.h` | Added `ChunkPolicy::Guarded`; updated (de)serialization parsing/strings; added `ChunkGuardQualifier()`; changed `ChunkConfig` default policy to `Guarded`. |
| C++ Python Bindings: `python/bindings/modules/ir.cpp`, `python/bindings/modules/ir_builder.cpp` | Exposed `Guarded` enum to Python; changed default `ChunkConfig`/`ForStmt`/`IRBuilder.begin_for_loop` chunk_policy defaults to `Guarded`; adjusted docstrings and property default behavior. |
| Python Builder & DSL: `python/pypto/ir/builder.py`, `python/pypto/language/dsl_api.py`, `python/pypto/language/parser/ast_parser.py` | Default chunk_policy strings changed from `"leading_full"` → `"guarded"`; builder/DSL accept and map `"guarded"` and preserve `"leading_full"` support; parser validation updated to accept both. |
| Type Stubs: `python/pypto/pypto_core/ir.pyi` | Added `ChunkPolicy.Guarded`; updated default parameter annotations/docstrings to `Guarded` for `ChunkConfig`, `ForStmt`, and `IRBuilder.begin_for_loop`. |
| Serialization & Printing: `src/ir/serialization/type_deserializers.cpp`, `src/ir/transforms/python_printer.cpp` | Deserialization default policy when missing → `Guarded`; printer now omits `chunk_policy` when value is `Guarded` and emits DSL-style lowercase strings (`"leading_full"`, `"guarded"`). |
| Chunked Loop Splitter: `src/ir/transforms/split_chunked_loops_pass.cpp` | Added policy dispatch: `LeadingFull` preserves main+remainder split (renamed helpers to `SplitLeadingFull`); `Guarded` implements single-loop guarded emission with `SplitGuarded`, handles iter_args via IfStmt phi-like returns, preserves zero-trip forwarding; refactored helpers and control flow. |
| Tests (explicit LeadingFull opt-ins): `tests/st/runtime/...`, `tests/ut/.../test_parse_pl_at.py`, `tests/ut/codegen/test_orchestration_codegen.py`, `tests/ut/ir/transforms/...` | Numerous tests updated to add `chunk_policy="leading_full"` where prior default behavior was required; many split-loop tests updated/extended; new guarded-policy test coverage added. |

Sequence Diagram(s)

sequenceDiagram
  participant DSL as DSL / User code
  participant Builder as IRBuilder
  participant For as ForStmt{chunk_config=Guarded}
  participant Splitter as ChunkedLoopSplitter
  participant IR as LoweredIR
  participant Printer as IRPythonPrinter

  DSL->>Builder: begin_for_loop(chunk_size, policy="guarded")
  Builder->>For: construct ForStmt (chunk_config policy=Guarded)
  For->>Splitter: VisitStmt_(ForStmt)
  Splitter->>Splitter: compute n_total = ceil(trip_count / C)
  Splitter->>IR: emit outer_for(sb0 += C)
  IR->>IR: emit inner_for(si in [0..C))
  IR->>IR: emit IfStmt( idx < stop ) around body
  Splitter->>IR: thread iter_args via IfStmt returns (guarded)
  IR->>Printer: IRPythonPrinter::VisitStmt_(ForStmt)
  Printer->>Printer: omit chunk_policy (default Guarded) or print "leading_full" if explicit

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat(pass): support dynamic loop bounds in SplitChunkedLoops #906: Overlaps refactors in split_chunked_loops_pass.cpp and helper renames/behavior.
fix(transforms): Apply VisitExpr to iter_arg initValue in SplitChunkedLoops #378: Touches split_chunked_loops_pass.cpp iter_arg init/value handling that may conflict or interact.
feat(ir/passes): Add auto_incore scope to gate SplitChunkedLoops and InterchangeChunkLoops #361: Related changes to chunking/ForStmt handling and printing that intersect with this PR.

Suggested reviewers

lyfne123

Poem

"🐰 A guarded hop, one loop, one cheer,
No split to break the chain so dear.
Ifs keep order, iter_args hold tight,
Defaults now guarded, gentle and light,
A rabbit's nibble of tidy code delight!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.81% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a guarded chunk policy feature for dynamic-bound parallel/range loops. |
| Linked Issues check | ✅ Passed | The PR fully addresses the core objectives from `#930` (if-guard codegen mode for pl.parallel with dynamic bounds) and `#928` (inout accumulation preservation). ChunkPolicy::Guarded implementation, SplitChunkedLoops dispatch on policy, static/dynamic ceil(T/C) computation, IfStmt phi threading for iter_args, and printer default omission all align with stated requirements. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly related to implementing ChunkPolicy::Guarded, supporting it across C++/bindings/DSL/tests, and updating existing tests to explicitly specify leading_full where needed for backward compatibility. |
| Description check | ✅ Passed | The pull request description clearly explains the new ChunkPolicy::Guarded feature, its rationale, key changes, and test coverage. It is directly related to the changeset. |


gemini-code-assist

Code Review

This pull request introduces the Guarded chunk policy as the new default for loop chunking, which enables a single-kernel outline by using an inner loop with an if guard. The changes span the C++ IR definitions, Python bindings, the DSL API, and the SplitChunkedLoops transformation pass. The implementation correctly handles both simple loops and those with iteration arguments (using SSA-style phi nodes via IfStmt). Review feedback highlights the need to support negative loop steps in the guard condition and to consistently pass source location information (spans) when constructing IR expression nodes.

src/ir/transforms/split_chunked_loops_pass.cpp

Copilot

Pull request overview

Adds a new chunk-splitting strategy (ChunkPolicy::Guarded) and makes it the default for chunked pl.range/pl.parallel, emitting a single outlined kernel with an internal bounds guard rather than a main+remainder split. This targets correctness for cross-iteration state (iter_args / inout accumulation) and reduces kernel count under pl.auto_incore() for dynamic bounds.

Changes:

Introduces ChunkPolicy::Guarded (default) across C++ IR, Python bindings/stubs, DSL/parser defaults, and serialization.
Updates SplitChunkedLoops to dispatch on chunk policy and implements guarded splitting (with/without iter_args via IfStmt phi threading).
Retrofits existing tests to pin chunk_policy="leading_full" where they assert the legacy split shape, and adds new unit tests covering guarded behavior and printer defaults.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.


| File | Description |
|---|---|
| tests/ut/language/parser/test_error_cases.py | Pins legacy chunk policy in parser error-case fixture to keep expected behavior stable. |
| tests/ut/ir/transforms/test_split_chunked_loops.py | Adds guarded-policy unit tests; updates existing cases to explicitly request `leading_full` where needed. |
| tests/ut/ir/transforms/test_outline_incore_scopes.py | Makes chunk policy explicit to preserve prior outline behavior in tests. |
| tests/ut/ir/transforms/test_outline_incore_interleaved_ops.py | Makes chunk policy explicit in interleaving/interchange test programs. |
| tests/ut/ir/transforms/test_interchange_chunk_loops.py | Makes chunk policy explicit to preserve interchange expectations under new default. |
| tests/ut/ir/parser/test_parse_pl_at.py | Updates parser tests to pass explicit `leading_full` where the test expects legacy behavior. |
| tests/ut/codegen/test_orchestration_codegen.py | Pins `leading_full` in orchestration codegen test input to avoid default-policy drift. |
| tests/st/runtime/test_qwen3_decode_scope3_mixed.py | Pins `leading_full` in runtime test programs. |
| tests/st/runtime/test_cross_core.py | Pins `leading_full` in runtime test programs. |
| src/ir/transforms/split_chunked_loops_pass.cpp | Implements guarded splitting and policy dispatch; renames legacy split helpers for parity. |
| src/ir/transforms/python_printer.cpp | Omits printing `chunk_policy="guarded"` (new default) and prints `leading_full` explicitly when requested. |
| src/ir/serialization/type_deserializers.cpp | Updates chunk policy default during deserialization to `Guarded`. |
| python/pypto/pypto_core/ir.pyi | Adds `Guarded` enum value and flips default chunk policy in stubs. |
| python/pypto/language/parser/ast_parser.py | Adds `guarded` to accepted policies and flips parser default to `guarded`. |
| python/pypto/language/dsl_api.py | Flips DSL default chunk policy string to `guarded` and updates docstrings/overloads. |
| python/pypto/ir/builder.py | Accepts `"guarded"` in IRBuilder loop construction and flips default accordingly. |
| python/bindings/modules/ir.cpp | Exposes `ChunkPolicy.Guarded` in bindings and flips default args to `Guarded`. |
| python/bindings/modules/ir_builder.cpp | Flips default `chunk_policy` argument to `Guarded` in builder bindings/docs. |
| include/pypto/ir/transforms/utils/auto_name_utils.h | Adds a new auto-name qualifier for guarded/phi vars (`cg`). |
| include/pypto/ir/stmt.h | Adds `ChunkPolicy::Guarded`, updates string conversions, and flips `ChunkConfig` default policy. |

src/ir/transforms/split_chunked_loops_pass.cpp

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ir/transforms/split_chunked_loops_pass.cpp`:
- Around line 678-681: The guard currently always builds MakeLt(idx_expr,
stop_expr, sp) and constructs the IfStmt (visited_body) which only works for
positive step values; update split_chunked_loops_pass so the predicate is chosen
from the compile-time step sign: when step is positive use MakeLt(idx_expr,
stop_expr, sp), when step is negative use MakeGt(idx_expr, stop_expr, sp) (and
use the resulting predicate when constructing the IfStmt), and ensure the same
change is applied at the other occurrence around lines 764–766; add a regression
test that constructs a descending chunked range (e.g., range(10, 0, -1,
chunk=...)) to verify the negative-step guarded path executes iterations
correctly.


📥 Commits

Reviewing files that changed from the base of the PR and between 53d9f93 and ad892eb.

📒 Files selected for processing (20)

include/pypto/ir/stmt.h
include/pypto/ir/transforms/utils/auto_name_utils.h
python/bindings/modules/ir.cpp
python/bindings/modules/ir_builder.cpp
python/pypto/ir/builder.py
python/pypto/language/dsl_api.py
python/pypto/language/parser/ast_parser.py
python/pypto/pypto_core/ir.pyi
src/ir/serialization/type_deserializers.cpp
src/ir/transforms/python_printer.cpp
src/ir/transforms/split_chunked_loops_pass.cpp
tests/st/runtime/test_cross_core.py
tests/st/runtime/test_qwen3_decode_scope3_mixed.py
tests/ut/codegen/test_orchestration_codegen.py
tests/ut/ir/parser/test_parse_pl_at.py
tests/ut/ir/transforms/test_interchange_chunk_loops.py
tests/ut/ir/transforms/test_outline_incore_interleaved_ops.py
tests/ut/ir/transforms/test_outline_incore_scopes.py
tests/ut/ir/transforms/test_split_chunked_loops.py
tests/ut/language/parser/test_error_cases.py

src/ir/transforms/split_chunked_loops_pass.cpp

Resolves PR review feedback for hw-native-sys#978:
- SplitGuarded / SplitGuardedWithIterArgs: select guard predicate from step sign. Positive step uses `idx < stop`; negative step uses `idx > stop`. Without this, descending chunked loops (e.g. `pl.range(10, 0, -1, chunk=4)`) would have every iteration become a no-op since `idx < stop` is always false.
- Pass span `sp` to all MakeAdd/MakeMul calls that construct `idx_expr` so the generated IR carries source-location info, matching the rest of the pass.
- Add two regression tests: `test_guarded_negative_step` (with iter_args) and `test_guarded_negative_step_no_iter_args`.

coderabbitai

🧹 Nitpick comments (1)

src/ir/transforms/split_chunked_loops_pass.cpp (1)

300-358: Extract the shared trip-count setup.

Both policy branches recompute the same static/dynamic trip-count logic before deriving either (n_full, n_rem) or n_total. Pulling that into a small helper would reduce drift in a correctness-sensitive code path.
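
The shape of the suggested refactor can be sketched in Python. This is a model of the idea only, not the C++ change: the names `trip_count`, `leading_full_counts`, and `guarded_count` are hypothetical, and only the static (constant-bound) case is shown.

```python
def trip_count(start, stop, step):
    """Shared helper: number of iterations of range(start, stop, step)."""
    return len(range(start, stop, step))


def leading_full_counts(start, stop, step, chunk):
    """(n_full, n_rem) for the main + remainder split."""
    return divmod(trip_count(start, stop, step), chunk)


def guarded_count(start, stop, step, chunk):
    """n_total = ceil(T / C) for the single guarded loop."""
    return (trip_count(start, stop, step) + chunk - 1) // chunk
```

Both policies then derive their counts from one trip-count computation, so a fix to that computation cannot be applied to one branch and missed in the other.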

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/ir/transforms/split_chunked_loops_pass.cpp` around lines 300 - 358,
Extract the shared trip-count logic into a helper (e.g. BuildOrGetTripCount)
that encapsulates the TryGetConstInt(start_expr)/TryGetConstInt(stop_expr)
static path using ComputeStaticTripCount and MakeConstIndex and the dynamic path
using BuildTripCountExpr (keeping span sp); have it return either an
optional<int64_t> static_trip_count and/or an ExprPtr trip_count_expr so callers
can derive their values. Replace the duplicated blocks that compute
tc/trip_count in both the LeadingFull and Guarded branches with calls to this
helper, then compute (n_full, n_rem) using MakeFloorDiv/MakeFloorMod when
dynamic or using static_trip_count/ chunk_size, and compute n_total using the
same helper followed by the ceil formula (MakeAdd + MakeFloorDiv) or static
computation; ensure you preserve use of chunk_expr/chunk_size and the original
spans and variable names (n_full, n_rem, n_total, start_c/stop_c patterns) so
behavior is unchanged.


📥 Commits

Reviewing files that changed from the base of the PR and between ad892eb and ab7f75c.

📒 Files selected for processing (2)

src/ir/transforms/split_chunked_loops_pass.cpp
tests/ut/ir/transforms/test_split_chunked_loops.py

Refresh docs/en and docs/zh-cn for SplitChunkedLoops to reflect:
- New `chunk_policy` parameter; `guarded` is now the default.
- Both policies documented side-by-side with algorithm descriptions, before/after IR examples, and a policy-choice table (dynamic bound, kernel count, hot-loop masking).
- Step-sign-aware guard: `idx < stop` for positive step, `idx > stop` for negative step.
- New `cg` (chunk_guard) auto-name qualifier for IfStmt phi vars.
- Removed obsolete "chunk + init_values forbidden" claim: both policies now thread iter_args through the generated loops.

Hzfengsy added 2 commits April 12, 2026 14:30

test(ir): expand guarded chunk policy tests with structural equality

ad892eb

Copilot AI review requested due to automatic review settings April 12, 2026 07:03

github-project-automation bot added this to pto project Apr 12, 2026

Copilot started reviewing on behalf of Hzfengsy April 12, 2026 07:04

gemini-code-assist bot reviewed Apr 12, 2026

Copilot AI reviewed Apr 12, 2026

coderabbitai bot reviewed Apr 12, 2026

coderabbitai bot reviewed Apr 12, 2026

Hzfengsy requested a review from zhangqi-chen April 12, 2026 10:14

zhangqi-chen approved these changes Apr 13, 2026

Hzfengsy merged commit 31a1e09 into hw-native-sys:main Apr 13, 2026
8 checks passed

Hzfengsy deleted the issue-930-guarded-chunk-policy branch April 13, 2026 01:56
