
feat(ir): add guarded chunk policy for dynamic-bound pl.parallel/pl.range#978

Merged
Hzfengsy merged 4 commits into hw-native-sys:main from Hzfengsy:issue-930-guarded-chunk-policy
Apr 13, 2026

Conversation

@Hzfengsy
Member

@Hzfengsy Hzfengsy commented Apr 12, 2026

Summary

Adds ChunkPolicy::Guarded as the new default for pl.range/pl.parallel with chunk=C. Instead of splitting a dynamic-bound chunked loop into a main + remainder kernel (leading_full), guarded mode emits a single outer loop over ceil(T/C) chunks with an inner if (idx < stop) guard on the body. With iter_args, loop-carried state threads through an IfStmt phi whose else branch yields the iter_args unchanged.
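The guarded shape with loop-carried state can be modeled in plain Python. This is only a sketch of the semantics described above, not the pass output: `guarded_sum`, `acc`, and `acc_next` are hypothetical names, and a running sum stands in for an arbitrary iter_arg.

```python
import math


def guarded_sum(values, chunk):
    """Model of the guarded shape: ceil(T/C) outer chunks with a guarded inner body."""
    stop = len(values)
    n_total = math.ceil(stop / chunk)  # number of outer chunks
    acc = 0  # loop-carried state (stands in for an iter_arg)
    for i_out in range(n_total):
        for i_in in range(chunk):
            idx = i_out * chunk + i_in
            if idx < stop:
                acc_next = acc + values[idx]  # then-branch yields the updated value
            else:
                acc_next = acc  # else-branch yields the iter_arg unchanged
            acc = acc_next  # the phi result feeds the next iteration
    return acc


values = [1, 2, 3, 4, 5, 6, 7]
assert guarded_sum(values, chunk=4) == sum(values)
```

Because padded iterations forward the carried value untouched, the result matches the unchunked loop even when the trip count is not a multiple of the chunk size.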

Fixes #930. Partially addresses #928 (cross-iteration inout accumulation under pl.auto_incore()).

Why

  • leading_full under dynamic bounds breaks cross-iteration inout accumulation (#928, "[Bug] pl.parallel remainder kernel breaks inout accumulation across chunk boundary"): the remainder kernel receives loop-carried state as input-only copies and writes to freshly allocated output tensors.
  • It also doubles kernel count under pl.auto_incore() (e.g. Qwen3-32B decode scope2 went from 9 → 13 kernels after migrating four sequential stages to pl.parallel).

Guarded mode keeps a single outlined kernel.

Key changes

  • New ChunkPolicy::Guarded enum value; flipped ChunkConfig::policy default from LeadingFull to Guarded across C++ / bindings / stubs / DSL.
  • SplitChunkedLoops pass dispatches on policy. Renamed the existing code paths SplitSimple → SplitLeadingFull and SplitWithIterArgs → SplitLeadingFullWithIterArgs for parity with the new SplitGuarded / SplitGuardedWithIterArgs implementations.
  • Static and dynamic n_total = ceil(T/C) computation; guard start + (i_out * C + i_in) * step < stop.
  • IfStmt phi threads iter_args through both branches (then yields updated values, else yields inner iter_args unchanged) — SSA-correct.
  • Printer emits chunk_policy="leading_full" when explicit; guarded is omitted as the default.
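
The n_total computation and the guard expression from the list above can be checked against Python's own `range`. A minimal sketch for the positive-step case only; `guarded_indices` is a hypothetical helper, not a function in the pass.

```python
def guarded_indices(start, stop, step, chunk):
    """Indices that pass the guard in the guarded shape (positive step only)."""
    assert step > 0 and chunk > 0
    # Trip count T = ceil((stop - start) / step), clamped at zero for empty loops.
    trip = max(0, -(-(stop - start) // step))
    # Number of outer chunks: n_total = ceil(T / C).
    n_total = -(-trip // chunk)
    visited = []
    for i_out in range(n_total):
        for i_in in range(chunk):
            idx = start + (i_out * chunk + i_in) * step
            if idx < stop:  # guard: start + (i_out * C + i_in) * step < stop
                visited.append(idx)
    return visited


assert guarded_indices(0, 10, 3, chunk=2) == list(range(0, 10, 3))
assert guarded_indices(2, 11, 1, chunk=4) == list(range(2, 11))
```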

Test plan

  • 36 unit tests in tests/ut/ir/transforms/test_split_chunked_loops.py pass, including 13 new TestGuardedPolicy tests using Before/Expected + ir.assert_structural_equal:
    • default policy selection, static divisible/non-divisible/trip<chunk with iter_args
    • no iter_args (static + dynamic)
    • non-unit step, pl.parallel kind
    • dynamic stop, dynamic start+stop
    • nested guarded loops (inner guard inside outer's then-branch)
    • loop_origin attrs (no ChunkRemainder emitted)
    • printer default omission
  • Existing tests retrofitted with explicit chunk_policy="leading_full" where they verify split shape.
  • test_interchange_chunk_loops.py passes (no regression from policy change).
  • Pre-commit hooks: clang-format, cpplint, ruff, pyright, headers, english-only all pass.
  • System tests skipped per user direction.

Out of scope / follow-ups

Fixes #930

Fixes hw-native-sys#930

Introduces ChunkPolicy::Guarded as the new default for pl.parallel /
pl.range / pl.unroll with chunk=C. Instead of splitting into a main
kernel (N//C full chunks) plus a remainder kernel (N%C tail), Guarded
emits a single outer loop over ceil(N/C) chunks with an internal
`if (idx < stop)` guard, mirroring the manual workaround that users
previously wrote by hand.
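
To make the two shapes concrete, here is a small plain-Python model of how each policy partitions N iterations for a zero-based, unit-step loop. The function names are hypothetical and the lists only illustrate which indices land in which loop; the real pass emits IR, not lists.

```python
def leading_full_chunks(n, chunk):
    """Main loop over n // chunk full chunks, then a separate n % chunk tail."""
    n_full, n_rem = divmod(n, chunk)
    main = [[c * chunk + i for i in range(chunk)] for c in range(n_full)]
    tail = [n_full * chunk + i for i in range(n_rem)]
    return main, tail


def guarded_chunks(n, chunk):
    """Single loop over ceil(n / chunk) chunks; the guard drops padded indices."""
    n_total = (n + chunk - 1) // chunk
    return [
        [idx for idx in (c * chunk + i for i in range(chunk)) if idx < n]
        for c in range(n_total)
    ]


main, tail = leading_full_chunks(10, 4)
print(main, tail)             # [[0, 1, 2, 3], [4, 5, 6, 7]] [8, 9]
print(guarded_chunks(10, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Both policies visit the same indices; the difference is that the tail lives in its own loop under leading_full and inside the last chunk of the single loop under guarded.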

This unblocks two problems with the split approach:
1. Cross-iteration inout accumulation (hw-native-sys#928): a single kernel keeps
   loop-carried state on one continuous chain.
2. Kernel count inflation in dynamic-bound parallel loops: stages no
   longer double into main+remainder, shortening orchestration
   dependency chains.

Users opt back into the old split via `chunk_policy="leading_full"`.

SplitChunkedLoops now dispatches on policy:
- LeadingFull -> SplitLeadingFull / SplitLeadingFullWithIterArgs
  (renamed from SplitSimple / SplitWithIterArgs, logic unchanged).
- Guarded -> SplitGuarded / SplitGuardedWithIterArgs. The iter_args
  variant threads loop-carried state through an IfStmt phi with an
  else branch that yields the inner iter_args unchanged, so guarded
  iterations stay on the SSA chain.

Existing tests that verify LeadingFull split shape are pinned with an
explicit chunk_policy="leading_full"; new tests cover Guarded across
static/dynamic bounds with and without iter_args.
Copilot AI review requested due to automatic review settings April 12, 2026 07:03
@coderabbitai

coderabbitai bot commented Apr 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR adds a new ChunkPolicy::Guarded mode (single loop with per-iteration if-guards), switches the default chunk policy from LeadingFull to Guarded across C++/Python APIs, implements guarded chunked-loop splitting and printing, updates serialization/deserialization, and adjusts many tests to opt into the old LeadingFull behavior.
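
The printing and deserialization changes have to agree on the default for text to round-trip. The sketch below models that contract with hypothetical helpers (`print_loop_kwargs`, `read_policy`) and a plain dict standing in for serialized fields; it is not the actual printer or deserializer.

```python
DEFAULT_POLICY = "guarded"  # assumed to mirror the new ChunkConfig default


def print_loop_kwargs(chunk, policy=DEFAULT_POLICY):
    """Emit loop keyword arguments, omitting chunk_policy when it is the default."""
    parts = [f"chunk={chunk}"]
    if policy != DEFAULT_POLICY:
        parts.append(f'chunk_policy="{policy}"')
    return ", ".join(parts)


def read_policy(fields):
    """Recover the policy from serialized fields, defaulting when it is absent."""
    return fields.get("chunk_policy", DEFAULT_POLICY)


print(print_loop_kwargs(4))                  # chunk=4
print(print_loop_kwargs(4, "leading_full"))  # chunk=4, chunk_policy="leading_full"
assert read_policy({"chunk": 4}) == "guarded"
```

Omitting the default on output is only safe because the reader fills in the same default, which is why the printer and the deserializer change together.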

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Core IR Definitions | include/pypto/ir/stmt.h, include/pypto/ir/transforms/utils/auto_name_utils.h | Added ChunkPolicy::Guarded; updated (de)serialization parsing/strings; added ChunkGuardQualifier(); changed ChunkConfig default policy to Guarded. |
| C++ Python Bindings | python/bindings/modules/ir.cpp, python/bindings/modules/ir_builder.cpp | Exposed Guarded enum to Python; changed default ChunkConfig/ForStmt/IRBuilder.begin_for_loop chunk_policy defaults to Guarded; adjusted docstrings and property default behavior. |
| Python Builder & DSL | python/pypto/ir/builder.py, python/pypto/language/dsl_api.py, python/pypto/language/parser/ast_parser.py | Default chunk_policy strings changed from "leading_full" → "guarded"; builder/DSL accept and map "guarded" and preserve "leading_full" support; parser validation updated to accept both. |
| Type Stubs | python/pypto/pypto_core/ir.pyi | Added ChunkPolicy.Guarded; updated default parameter annotations/docstrings to Guarded for ChunkConfig, ForStmt, and IRBuilder.begin_for_loop. |
| Serialization & Printing | src/ir/serialization/type_deserializers.cpp, src/ir/transforms/python_printer.cpp | Deserialization default policy when missing → Guarded; printer now omits chunk_policy when value is Guarded and emits DSL-style lowercase strings ("leading_full", "guarded"). |
| Chunked Loop Splitter | src/ir/transforms/split_chunked_loops_pass.cpp | Added policy dispatch: LeadingFull preserves main+remainder split (renamed helpers to `SplitLeadingFull*`); Guarded implements single-loop guarded emission with `SplitGuarded*`, handles iter_args via IfStmt phi-like returns, preserves zero-trip forwarding; refactored helpers and control flow. |
| Tests (explicit LeadingFull opt-ins) | tests/st/runtime/..., tests/ut/.../test_parse_pl_at.py, tests/ut/codegen/test_orchestration_codegen.py, tests/ut/ir/transforms/... | Numerous tests updated to add chunk_policy="leading_full" where prior default behavior was required; many split-loop tests updated/extended; new guarded-policy test coverage added. |

Sequence Diagram(s)

sequenceDiagram
  participant DSL as DSL / User code
  participant Builder as IRBuilder
  participant For as ForStmt{chunk_config=Guarded}
  participant Splitter as ChunkedLoopSplitter
  participant IR as LoweredIR
  participant Printer as IRPythonPrinter

  DSL->>Builder: begin_for_loop(chunk_size, policy="guarded")
  Builder->>For: construct ForStmt (chunk_config policy=Guarded)
  For->>Splitter: VisitStmt_(ForStmt)
  Splitter->>Splitter: compute n_total = ceil(trip_count / C)
  Splitter->>IR: emit outer_for(sb0 += C)
  IR->>IR: emit inner_for(si in [0..C))
  IR->>IR: emit IfStmt( idx < stop ) around body
  Splitter->>IR: thread iter_args via IfStmt returns (guarded)
  IR->>Printer: IRPythonPrinter::VisitStmt_(ForStmt)
  Printer->>Printer: omit chunk_policy (default Guarded) or print "leading_full" if explicit

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • lyfne123

Poem

"🐰 A guarded hop, one loop, one cheer,
No split to break the chain so dear.
Ifs keep order, iter_args hold tight,
Defaults now guarded, gentle and light,
A rabbit's nibble of tidy code delight!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.81% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a guarded chunk policy feature for dynamic-bound parallel/range loops. |
| Linked Issues check | ✅ Passed | The PR fully addresses the core objectives from #930 (if-guard codegen mode for pl.parallel with dynamic bounds) and #928 (inout accumulation preservation). ChunkPolicy::Guarded implementation, SplitChunkedLoops dispatch on policy, static/dynamic ceil(T/C) computation, IfStmt phi threading for iter_args, and printer default omission all align with stated requirements. |
| Out of Scope Changes check | ✅ Passed | All code changes are directly related to implementing ChunkPolicy::Guarded, supporting it across C++/bindings/DSL/tests, and updating existing tests to explicitly specify leading_full where needed for backward compatibility. |
| Description check | ✅ Passed | The pull request description clearly explains the new ChunkPolicy::Guarded feature, its rationale, key changes, and test coverage. It is directly related to the changeset. |


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the Guarded chunk policy as the new default for loop chunking, which enables a single-kernel outline by using an inner loop with an if guard. The changes span the C++ IR definitions, Python bindings, the DSL API, and the SplitChunkedLoops transformation pass. The implementation correctly handles both simple loops and those with iteration arguments (using SSA-style phi nodes via IfStmt). Review feedback highlights the need to support negative loop steps in the guard condition and to consistently pass source location information (spans) when constructing IR expression nodes.

Contributor

Copilot AI left a comment


Pull request overview

Adds a new chunk-splitting strategy (ChunkPolicy::Guarded) and makes it the default for chunked pl.range/pl.parallel, emitting a single outlined kernel with an internal bounds guard rather than a main+remainder split. This targets correctness for cross-iteration state (iter_args / inout accumulation) and reduces kernel count under pl.auto_incore() for dynamic bounds.

Changes:

  • Introduces ChunkPolicy::Guarded (default) across C++ IR, Python bindings/stubs, DSL/parser defaults, and serialization.
  • Updates SplitChunkedLoops to dispatch on chunk policy and implements guarded splitting (with/without iter_args via IfStmt phi threading).
  • Retrofits existing tests to pin chunk_policy="leading_full" where they assert the legacy split shape, and adds new unit tests covering guarded behavior and printer defaults.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| tests/ut/language/parser/test_error_cases.py | Pins legacy chunk policy in parser error-case fixture to keep expected behavior stable. |
| tests/ut/ir/transforms/test_split_chunked_loops.py | Adds guarded-policy unit tests; updates existing cases to explicitly request leading_full where needed. |
| tests/ut/ir/transforms/test_outline_incore_scopes.py | Makes chunk policy explicit to preserve prior outline behavior in tests. |
| tests/ut/ir/transforms/test_outline_incore_interleaved_ops.py | Makes chunk policy explicit in interleaving/interchange test programs. |
| tests/ut/ir/transforms/test_interchange_chunk_loops.py | Makes chunk policy explicit to preserve interchange expectations under new default. |
| tests/ut/ir/parser/test_parse_pl_at.py | Updates parser tests to pass explicit leading_full where the test expects legacy behavior. |
| tests/ut/codegen/test_orchestration_codegen.py | Pins leading_full in orchestration codegen test input to avoid default-policy drift. |
| tests/st/runtime/test_qwen3_decode_scope3_mixed.py | Pins leading_full in runtime test programs. |
| tests/st/runtime/test_cross_core.py | Pins leading_full in runtime test programs. |
| src/ir/transforms/split_chunked_loops_pass.cpp | Implements guarded splitting and policy dispatch; renames legacy split helpers for parity. |
| src/ir/transforms/python_printer.cpp | Omits printing chunk_policy="guarded" (new default) and prints leading_full explicitly when requested. |
| src/ir/serialization/type_deserializers.cpp | Updates chunk policy default during deserialization to Guarded. |
| python/pypto/pypto_core/ir.pyi | Adds Guarded enum value and flips default chunk policy in stubs. |
| python/pypto/language/parser/ast_parser.py | Adds guarded to accepted policies and flips parser default to guarded. |
| python/pypto/language/dsl_api.py | Flips DSL default chunk policy string to guarded and updates docstrings/overloads. |
| python/pypto/ir/builder.py | Accepts "guarded" in IRBuilder loop construction and flips default accordingly. |
| python/bindings/modules/ir.cpp | Exposes ChunkPolicy.Guarded in bindings and flips default args to Guarded. |
| python/bindings/modules/ir_builder.cpp | Flips default chunk_policy argument to Guarded in builder bindings/docs. |
| include/pypto/ir/transforms/utils/auto_name_utils.h | Adds a new auto-name qualifier for guarded/phi vars (cg). |
| include/pypto/ir/stmt.h | Adds ChunkPolicy::Guarded, updates string conversions, and flips ChunkConfig default policy. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ir/transforms/split_chunked_loops_pass.cpp`:
- Around line 678-681: The guard currently always builds MakeLt(idx_expr,
stop_expr, sp) and constructs the IfStmt (visited_body) which only works for
positive step values; update split_chunked_loops_pass so the predicate is chosen
from the compile-time step sign: when step is positive use MakeLt(idx_expr,
stop_expr, sp), when step is negative use MakeGt(idx_expr, stop_expr, sp) (and
use the resulting predicate when constructing the IfStmt), and ensure the same
change is applied at the other occurrence around lines 764–766; add a regression
test that constructs a descending chunked range (e.g., range(10, 0, -1,
chunk=...)) to verify the negative-step guarded path executes iterations
correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 5ec5ec35-5cdc-439e-b9e9-01ae73b98b83

📥 Commits

Reviewing files that changed from the base of the PR and between 53d9f93 and ad892eb.

📒 Files selected for processing (20)
  • include/pypto/ir/stmt.h
  • include/pypto/ir/transforms/utils/auto_name_utils.h
  • python/bindings/modules/ir.cpp
  • python/bindings/modules/ir_builder.cpp
  • python/pypto/ir/builder.py
  • python/pypto/language/dsl_api.py
  • python/pypto/language/parser/ast_parser.py
  • python/pypto/pypto_core/ir.pyi
  • src/ir/serialization/type_deserializers.cpp
  • src/ir/transforms/python_printer.cpp
  • src/ir/transforms/split_chunked_loops_pass.cpp
  • tests/st/runtime/test_cross_core.py
  • tests/st/runtime/test_qwen3_decode_scope3_mixed.py
  • tests/ut/codegen/test_orchestration_codegen.py
  • tests/ut/ir/parser/test_parse_pl_at.py
  • tests/ut/ir/transforms/test_interchange_chunk_loops.py
  • tests/ut/ir/transforms/test_outline_incore_interleaved_ops.py
  • tests/ut/ir/transforms/test_outline_incore_scopes.py
  • tests/ut/ir/transforms/test_split_chunked_loops.py
  • tests/ut/language/parser/test_error_cases.py

Resolves PR review feedback for hw-native-sys#978:

- SplitGuarded / SplitGuardedWithIterArgs: select guard predicate from
  step sign. Positive step uses `idx < stop`; negative step uses
  `idx > stop`. Without this, descending chunked loops
  (e.g. `pl.range(10, 0, -1, chunk=4)`) would have every iteration
  become a no-op since `idx < stop` is always false.
- Pass span `sp` to all MakeAdd/MakeMul calls that construct `idx_expr`
  so the generated IR carries source-location info, matching the rest
  of the pass.
- Add two regression tests: `test_guarded_negative_step` (with
  iter_args) and `test_guarded_negative_step_no_iter_args`.
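
The effect of the step-sign fix can be reproduced in a plain-Python model of the guarded shape. This is a sketch under the assumption that the index is `start + (i_out * C + i_in) * step`, as in the PR description; `guarded_range` and its `sign_aware` flag are hypothetical and exist only to compare the two predicates.

```python
def guarded_range(start, stop, step, chunk, sign_aware=True):
    """Indices that pass the guard; sign_aware=False models the original `idx < stop`."""
    assert step != 0 and chunk > 0
    trip = len(range(start, stop, step))   # trip count, taken from Python's range
    n_total = (trip + chunk - 1) // chunk  # ceil(trip / chunk)
    visited = []
    for i_out in range(n_total):
        for i_in in range(chunk):
            idx = start + (i_out * chunk + i_in) * step
            if sign_aware and step < 0:
                in_bounds = idx > stop  # descending loop: stop is a lower bound
            else:
                in_bounds = idx < stop  # ascending loop: stop is an upper bound
            if in_bounds:
                visited.append(idx)
    return visited


expected = list(range(10, 0, -1))
assert guarded_range(10, 0, -1, chunk=4) == expected
# With the sign-unaware predicate, no valid index of the descending loop passes.
naive = guarded_range(10, 0, -1, chunk=4, sign_aware=False)
assert not set(naive) & set(expected)
```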

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
src/ir/transforms/split_chunked_loops_pass.cpp (1)

300-358: Extract the shared trip-count setup.

Both policy branches recompute the same static/dynamic trip-count logic before deriving either (n_full, n_rem) or n_total. Pulling that into a small helper would reduce drift in a correctness-sensitive code path.
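
The shape of such a helper can be sketched in plain Python. The names below are hypothetical and operate on integers rather than IR expressions; the point is only that both policies derive their counts from one shared trip count.

```python
def trip_count(start, stop, step):
    """Number of iterations of range(start, stop, step), for either step sign."""
    assert step != 0
    if step > 0:
        return max(0, (stop - start + step - 1) // step)
    return max(0, (start - stop - step - 1) // (-step))


def leading_full_counts(trip, chunk):
    """(n_full, n_rem): full chunks plus remainder iterations."""
    return trip // chunk, trip % chunk


def guarded_count(trip, chunk):
    """n_total = ceil(trip / chunk), written as add + floor-div."""
    return (trip + chunk - 1) // chunk


trip = trip_count(0, 10, 1)
n_full, n_rem = leading_full_counts(trip, 4)
n_total = guarded_count(trip, 4)
assert (n_full, n_rem, n_total) == (2, 2, 3)
# The guarded chunk count is the full chunks plus one extra chunk when a tail exists.
assert n_total == n_full + (1 if n_rem else 0)
```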

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ir/transforms/split_chunked_loops_pass.cpp` around lines 300 - 358,
Extract the shared trip-count logic into a helper (e.g. BuildOrGetTripCount)
that encapsulates the TryGetConstInt(start_expr)/TryGetConstInt(stop_expr)
static path using ComputeStaticTripCount and MakeConstIndex and the dynamic path
using BuildTripCountExpr (keeping span sp); have it return either an
optional<int64_t> static_trip_count and/or an ExprPtr trip_count_expr so callers
can derive their values. Replace the duplicated blocks that compute
tc/trip_count in both the LeadingFull and Guarded branches with calls to this
helper, then compute (n_full, n_rem) using MakeFloorDiv/MakeFloorMod when
dynamic or using static_trip_count/ chunk_size, and compute n_total using the
same helper followed by the ceil formula (MakeAdd + MakeFloorDiv) or static
computation; ensure you preserve use of chunk_expr/chunk_size and the original
spans and variable names (n_full, n_rem, n_total, start_c/stop_c patterns) so
behavior is unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d59a3e4a-819b-42bd-9f50-6b6b040bb5d5

📥 Commits

Reviewing files that changed from the base of the PR and between ad892eb and ab7f75c.

📒 Files selected for processing (2)
  • src/ir/transforms/split_chunked_loops_pass.cpp
  • tests/ut/ir/transforms/test_split_chunked_loops.py

Refresh docs/en and docs/zh-cn for SplitChunkedLoops to reflect:

- New `chunk_policy` parameter; `guarded` is now the default.
- Both policies documented side-by-side with algorithm descriptions,
  before/after IR examples, and a policy-choice table (dynamic bound,
  kernel count, hot-loop masking).
- Step-sign-aware guard: `idx < stop` for positive step, `idx > stop`
  for negative step.
- New `cg` (chunk_guard) auto-name qualifier for IfStmt phi vars.
- Removed obsolete "chunk + init_values forbidden" claim — both
  policies now thread iter_args through the generated loops.
@Hzfengsy Hzfengsy requested a review from zhangqi-chen April 12, 2026 10:14
@Hzfengsy Hzfengsy merged commit 31a1e09 into hw-native-sys:main Apr 13, 2026
8 checks passed
@Hzfengsy Hzfengsy deleted the issue-930-guarded-chunk-policy branch April 13, 2026 01:56

3 participants