Update: fuse Stage 0-1 and Stage 6-7 with chunked_loop_optimizer in Scope 3#105
Update: fuse Stage 0-1 and Stage 6-7 with chunked_loop_optimizer in Scope 3#105bumble0918 wants to merge 1 commit intohw-native-sys:mainfrom
Conversation
bumble0918
commented
Apr 14, 2026
- Add cross_core.py example for Stage 0&1 fusion debugging
- Fuse output projection + residual add in Qwen3 decode (Stage 0&1)
- Fuse down projection + final residual writeback (Stage 6&7)
- Increase MLP_OUT_CHUNK from 64 to 256 for better tiling
- Use pl.parallel with chunk=4 for cross-core task distribution
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a new cross-core example demonstrating fused vs. unfused matmul-plus-residual variants and updates Qwen3 scope3 scheduling to use chunked loop optimizations that fuse output projection and residual writeback into chunked parallel core-group regions. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
dd9b022 to
1129e77
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a new example script, cross_core.py, demonstrating the fusion of output projection and residual addition using the chunked_loop_optimizer. It also optimizes the qwen3_32b_decode_scope3.py model by fusing stages 0 and 1, and stages 6 and 7, while increasing the MLP_OUT_CHUNK size. Feedback suggests replacing hardcoded chunk sizes in parallel loops with named constants for better maintainability and correcting the slicing logic in the cross_core example to handle batch sizes larger than the tile size.
| with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer(split=pl.SplitMode.UP_DOWN)): | ||
| for ob in pl.parallel(0, Q_OUT_BLOCKS, chunk=4): |
There was a problem hiding this comment.
The fusion of Stage 0 and Stage 1 using chunked_loop_optimizer with pl.parallel is a significant optimization for cross-core execution. However, the chunk=4 parameter is hardcoded. While this might be tuned for the current configuration, it could be beneficial to define this as a constant or make it configurable to allow for easier tuning across different hardware platforms or hidden sizes.
| with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer(split=pl.SplitMode.UP_DOWN)): | ||
| for dob in pl.parallel(0, HIDDEN_BLOCKS, chunk=4): |
| ): | ||
| for ob in pl.parallel(0, q_out_blocks, chunk=chunk): | ||
| o0 = ob * q_out_chunk | ||
| a_chunk_0 = pl.slice(attn_out, [batch_tile, k_chunk], [0, 0]) |
There was a problem hiding this comment.
In the fused program, a_chunk_0 is sliced with a fixed row offset of 0. This assumes that the batch size is equal to batch_tile. If batch > batch_tile, this code will only process the first tile of the batch, which might lead to incorrect results or incomplete output in resid. Since this is a debugging script, it's safer to ensure the slicing logic accounts for the batch dimension if it's intended to be generic.
References
- Ensure code functionality handles edge cases and aligns with intent, especially regarding tensor slicing and batch processing.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py (1)
32-47:⚠️ Potential issue | 🟡 MinorFail fast on non-256-aligned
intermediate_size.This change makes the tiling contract stricter, but
MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNKstill floors silently. Any caller that passes a non-multiple of 256 now gets a partial MLP path with no signal.Suggested guard
def build_qwen3_scope3_program( batch: int = BATCH, hidden_size: int = HIDDEN, intermediate_size: int = INTERMEDIATE, ): BATCH_CFG = batch HIDDEN_CFG = hidden_size INTER_CFG = intermediate_size + + if INTER_CFG % MLP_OUT_CHUNK != 0: + raise ValueError("intermediate_size must be divisible by MLP_OUT_CHUNK") HIDDEN_BLOCKS = HIDDEN_CFG // K_CHUNK Q_OUT_BLOCKS = HIDDEN_CFG // Q_OUT_CHUNK MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNK🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` around lines 32 - 47, The code computes MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNK but does not verify INTER_CFG is a multiple of MLP_OUT_CHUNK, causing silent truncation; update build_qwen3_scope3_program to validate INTER_CFG % MLP_OUT_CHUNK == 0 (use an assert or raise ValueError) before computing MLP_OUT_BLOCKS and include a clear message mentioning INTER_CFG and MLP_OUT_CHUNK so callers fail fast when intermediate_size is not 256-aligned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/intermediate/cross_core.py`:
- Around line 37-47: Before building programs, validate the single-tile shape
contract: in build_cross_core_fusion_program (and the other builder functions
that hard-code row offset 0 and slice [batch_tile, ...]) check that batch ==
batch_tile and that hidden is divisible by k_chunk and q_out_chunk (i.e., hidden
% k_chunk == 0 and hidden % q_out_chunk == 0); if any check fails, raise a clear
ValueError with a message explaining the mismatch so the caller cannot silently
drop rows or tail blocks. Ensure these validations run before computing
hidden_blocks/q_out_blocks or proceeding with program construction.
---
Outside diff comments:
In `@examples/models/qwen3/qwen3_32b_decode_scope3.py`:
- Around line 32-47: The code computes MLP_OUT_BLOCKS = INTER_CFG //
MLP_OUT_CHUNK but does not verify INTER_CFG is a multiple of MLP_OUT_CHUNK,
causing silent truncation; update build_qwen3_scope3_program to validate
INTER_CFG % MLP_OUT_CHUNK == 0 (use an assert or raise ValueError) before
computing MLP_OUT_BLOCKS and include a clear message mentioning INTER_CFG and
MLP_OUT_CHUNK so callers fail fast when intermediate_size is not 256-aligned.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b3540889-3903-4a4d-8024-82bffac49230
📒 Files selected for processing (2)
examples/intermediate/cross_core.pyexamples/models/qwen3/qwen3_32b_decode_scope3.py
| def build_cross_core_fusion_program( | ||
| batch: int = BATCH, | ||
| hidden: int = HIDDEN, | ||
| k_chunk: int = K_CHUNK, | ||
| q_out_chunk: int = Q_OUT_CHUNK, | ||
| batch_tile: int = BATCH_TILE, | ||
| chunk: int = 4, | ||
| ): | ||
| """Build fused Stage 0 & 1 program with chunked_loop_optimizer.""" | ||
| hidden_blocks = hidden // k_chunk | ||
| q_out_blocks = hidden // q_out_chunk |
There was a problem hiding this comment.
Validate the single-tile shape contract before building either program.
Both builders hard-code row offset 0 and slice [batch_tile, ...], while the block counts use floor division. batch != batch_tile leaves rows uncomputed, and non-divisible hidden values silently drop the tail.
Suggested guard
def build_cross_core_fusion_program(
batch: int = BATCH,
hidden: int = HIDDEN,
k_chunk: int = K_CHUNK,
q_out_chunk: int = Q_OUT_CHUNK,
batch_tile: int = BATCH_TILE,
chunk: int = 4,
):
"""Build fused Stage 0 & 1 program with chunked_loop_optimizer."""
+ if batch != batch_tile:
+ raise ValueError("This example currently requires batch == batch_tile")
+ if hidden % k_chunk != 0 or hidden % q_out_chunk != 0:
+ raise ValueError("hidden must be divisible by k_chunk and q_out_chunk")
hidden_blocks = hidden // k_chunk
q_out_blocks = hidden // q_out_chunk
...
def build_cross_core_split_program(
batch: int = BATCH,
hidden: int = HIDDEN,
k_chunk: int = K_CHUNK,
q_out_chunk: int = Q_OUT_CHUNK,
batch_tile: int = BATCH_TILE,
):
"""Build unfused Stage 0 & 1 program with separate pl.at blocks."""
+ if batch != batch_tile:
+ raise ValueError("This example currently requires batch == batch_tile")
+ if hidden % k_chunk != 0 or hidden % q_out_chunk != 0:
+ raise ValueError("hidden must be divisible by k_chunk and q_out_chunk")
hidden_blocks = hidden // k_chunk
q_out_blocks = hidden // q_out_chunkAlso applies to: 54-79, 86-95, 102-126
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/intermediate/cross_core.py` around lines 37 - 47, Before building
programs, validate the single-tile shape contract: in
build_cross_core_fusion_program (and the other builder functions that hard-code
row offset 0 and slice [batch_tile, ...]) check that batch == batch_tile and
that hidden is divisible by k_chunk and q_out_chunk (i.e., hidden % k_chunk == 0
and hidden % q_out_chunk == 0); if any check fails, raise a clear ValueError
with a message explaining the mismatch so the caller cannot silently drop rows
or tail blocks. Ensure these validations run before computing
hidden_blocks/q_out_blocks or proceeding with program construction.
- Add cross_core.py example for Stage 0&1 fusion debugging - Fuse output projection + residual add in Qwen3 decode (Stage 0&1) - Fuse down projection + final residual writeback (Stage 6&7) - Increase MLP_OUT_CHUNK from 64 to 256 for better tiling - Use pl.parallel with chunk=4 for cross-core task distribution
1129e77 to
6eafa55
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
examples/intermediate/cross_core.py (1)
37-47:⚠️ Potential issue | 🟠 MajorValidate the single-tile shape contract in both builders.
Both builders still hard-code row offset
0and slice[batch_tile, ...], whilehidden_blocksandq_out_blocksare floor-divided.batch != batch_tileleaves rows uncovered, and non-divisiblehiddenvalues silently drop the tail.Suggested fix
+def _validate_single_tile_config( + batch: int, + hidden: int, + k_chunk: int, + q_out_chunk: int, + batch_tile: int, +) -> None: + if batch != batch_tile: + raise ValueError("This example currently requires batch == batch_tile") + if hidden % k_chunk != 0 or hidden % q_out_chunk != 0: + raise ValueError("hidden must be divisible by k_chunk and q_out_chunk") + + def build_cross_core_fusion_program( batch: int = BATCH, hidden: int = HIDDEN, k_chunk: int = K_CHUNK, q_out_chunk: int = Q_OUT_CHUNK, batch_tile: int = BATCH_TILE, chunk: int = 4, ): """Build fused Stage 0 & 1 program with chunked_loop_optimizer.""" + _validate_single_tile_config(batch, hidden, k_chunk, q_out_chunk, batch_tile) hidden_blocks = hidden // k_chunk q_out_blocks = hidden // q_out_chunk ... def build_cross_core_split_program( batch: int = BATCH, hidden: int = HIDDEN, k_chunk: int = K_CHUNK, q_out_chunk: int = Q_OUT_CHUNK, batch_tile: int = BATCH_TILE, ): """Build unfused Stage 0 & 1 program with separate pl.at blocks.""" + _validate_single_tile_config(batch, hidden, k_chunk, q_out_chunk, batch_tile) hidden_blocks = hidden // k_chunk q_out_blocks = hidden // q_out_chunkAlso applies to: 86-95
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/intermediate/cross_core.py` around lines 37 - 47, In build_cross_core_fusion_program the two builders assume a single-tile shape by hard-coding row offset 0 and slice [batch_tile, ...], which leaves rows uncovered when batch != batch_tile and drops tail elements when hidden or q_out are not divisible by k_chunk/q_out_chunk; update both builders (the ones constructing the Stage 0 & 1 fused program) to validate the single-tile contract: check that batch == batch_tile and that hidden % k_chunk == 0 and hidden % q_out_chunk == 0 (or explicitly handle the remainder via ceil/block-padding), and if the checks fail either adjust the slice calculations to cover the tail rows/columns or raise a clear error; reference the builders inside build_cross_core_fusion_program and ensure row offsets and slice ranges are computed from batch and hidden (not hard-coded 0 and batch_tile) so all rows and hidden blocks are covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/intermediate/cross_core.py`:
- Line 237: Remove the unused f-string prefixes on the print calls that don't
interpolate values: replace the f-prefixed print with a normal string for the
message "Cross Core — Stage 0 & 1 Fusion Test" and do the same for the other
print with identical issue (the one around the second occurrence of that
message). Update the print statements (the print calls that currently use
f"...") to use plain string literals so Ruff F541 is resolved.
- Around line 1-4: The license header at the top of the file has inverted
wording on line 4: change "You may use this file except in compliance with the
License." to "You may not use this file except in compliance with the License."
— update the file header comment block in examples/intermediate/cross_core.py so
the standard phrase matches the other files and preserves the correct meaning.
In `@examples/models/qwen3/qwen3_32b_decode_scope3.py`:
- Line 32: Add explicit tiling-contract checks before program construction to
prevent silent truncation: validate that batch % BATCH_TILE == 0, hidden_size %
<hidden_chunk> == 0, intermediate_size % MLP_OUT_CHUNK == 0 (and any other chunk
constants used at lines 66-68 and 136-138) and raise a clear error (e.g.,
ValueError) if they fail. Locate the constants MLP_OUT_CHUNK and BATCH_TILE and
the places where fused loops/tiles are computed (the blocks referenced at lines
66-68 and 136-138) and insert assertions/guard code that includes the offending
values in the message so callers can correct batch/hidden/intermediate
dimensions before building the program.
---
Duplicate comments:
In `@examples/intermediate/cross_core.py`:
- Around line 37-47: In build_cross_core_fusion_program the two builders assume
a single-tile shape by hard-coding row offset 0 and slice [batch_tile, ...],
which leaves rows uncovered when batch != batch_tile and drops tail elements
when hidden or q_out are not divisible by k_chunk/q_out_chunk; update both
builders (the ones constructing the Stage 0 & 1 fused program) to validate the
single-tile contract: check that batch == batch_tile and that hidden % k_chunk
== 0 and hidden % q_out_chunk == 0 (or explicitly handle the remainder via
ceil/block-padding), and if the checks fail either adjust the slice calculations
to cover the tail rows/columns or raise a clear error; reference the builders
inside build_cross_core_fusion_program and ensure row offsets and slice ranges
are computed from batch and hidden (not hard-coded 0 and batch_tile) so all rows
and hidden blocks are covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b6667168-094d-44ac-97b8-434bddb74083
📒 Files selected for processing (2)
examples/intermediate/cross_core.pyexamples/models/qwen3/qwen3_32b_decode_scope3.py
| # Copyright (c) PyPTO Contributors. | ||
| # This program is free software, you can redistribute it and/or modify it under the terms and conditions of | ||
| # CANN Open Software License Agreement Version 2.0 (the "License"). | ||
| # Please refer to the License for details. You may use this file except in compliance with the License. |
There was a problem hiding this comment.
Fix the inverted license notice.
Line 4 says You may use this file except in compliance with the License., which flips the meaning of the standard header. This should be may not use, matching the other files.
Suggested fix
-# Please refer to the License for details. You may use this file except in compliance with the License.
+# Please refer to the License for details. You may not use this file except in compliance with the License.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Copyright (c) PyPTO Contributors. | |
| # This program is free software, you can redistribute it and/or modify it under the terms and conditions of | |
| # CANN Open Software License Agreement Version 2.0 (the "License"). | |
| # Please refer to the License for details. You may use this file except in compliance with the License. | |
| # Copyright (c) PyPTO Contributors. | |
| # This program is free software, you can redistribute it and/or modify it under the terms and conditions of | |
| # CANN Open Software License Agreement Version 2.0 (the "License"). | |
| # Please refer to the License for details. You may not use this file except in compliance with the License. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/intermediate/cross_core.py` around lines 1 - 4, The license header
at the top of the file has inverted wording on line 4: change "You may use this
file except in compliance with the License." to "You may not use this file
except in compliance with the License." — update the file header comment block
in examples/intermediate/cross_core.py so the standard phrase matches the other
files and preserves the correct meaning.
| args = parser.parse_args() | ||
|
|
||
| print(f"\n{'='*60}") | ||
| print(f"Cross Core — Stage 0 & 1 Fusion Test") |
There was a problem hiding this comment.
Remove the unused f prefixes.
Ruff F541 flags both of these strings because they do not interpolate anything.
Suggested fix
- print(f"Cross Core — Stage 0 & 1 Fusion Test")
+ print("Cross Core — Stage 0 & 1 Fusion Test")
...
- print(f"PASSED")
+ print("PASSED")Also applies to: 267-267
🧰 Tools
🪛 Ruff (0.15.9)
[error] 237-237: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/intermediate/cross_core.py` at line 237, Remove the unused f-string
prefixes on the print calls that don't interpolate values: replace the
f-prefixed print with a normal string for the message "Cross Core — Stage 0 & 1
Fusion Test" and do the same for the other print with identical issue (the one
around the second occurrence of that message). Update the print statements (the
print calls that currently use f"...") to use plain string literals so Ruff F541
is resolved.
| K_CHUNK = 128 | ||
| Q_OUT_CHUNK = 64 | ||
| MLP_OUT_CHUNK = 64 | ||
| MLP_OUT_CHUNK = 256 |
There was a problem hiding this comment.
Validate the tiling contract before building this program.
Line 32 makes MLP_OUT_CHUNK another hard divisibility requirement, but these fused loops still execute only full tiles while the block counts are floor-divided. A batch that's not a multiple of BATCH_TILE, or a hidden_size / intermediate_size that's not divisible by the corresponding chunk size, will silently truncate work or hit invalid partial slices.
Suggested guard
def build_qwen3_scope3_program(
batch: int = BATCH,
hidden_size: int = HIDDEN,
intermediate_size: int = INTERMEDIATE,
):
+ if batch % BATCH_TILE != 0:
+ raise ValueError("batch must be divisible by BATCH_TILE")
+ if hidden_size % K_CHUNK != 0 or hidden_size % Q_OUT_CHUNK != 0:
+ raise ValueError("hidden_size must be divisible by K_CHUNK and Q_OUT_CHUNK")
+ if intermediate_size % MLP_OUT_CHUNK != 0:
+ raise ValueError("intermediate_size must be divisible by MLP_OUT_CHUNK")
+
BATCH_CFG = batch
HIDDEN_CFG = hidden_size
INTER_CFG = intermediate_sizeAlso applies to: 66-68, 136-138
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 32, Add explicit
tiling-contract checks before program construction to prevent silent truncation:
validate that batch % BATCH_TILE == 0, hidden_size % <hidden_chunk> == 0,
intermediate_size % MLP_OUT_CHUNK == 0 (and any other chunk constants used at
lines 66-68 and 136-138) and raise a clear error (e.g., ValueError) if they
fail. Locate the constants MLP_OUT_CHUNK and BATCH_TILE and the places where
fused loops/tiles are computed (the blocks referenced at lines 66-68 and
136-138) and insert assertions/guard code that includes the offending values in
the message so callers can correct batch/hidden/intermediate dimensions before
building the program.