
Update: fuse Stage 0-1 and Stage 6-7 with chunked_loop_optimizer in Scope 3#105

Open
bumble0918 wants to merge 1 commit into hw-native-sys:main from bumble0918:feature/2026-04-13

Conversation

@bumble0918
Contributor

  • Add cross_core.py example for Stage 0&1 fusion debugging
  • Fuse output projection + residual add in Qwen3 decode (Stage 0&1)
  • Fuse down projection + final residual writeback (Stage 6&7)
  • Increase MLP_OUT_CHUNK from 64 to 256 for better tiling
  • Use pl.parallel with chunk=4 for cross-core task distribution

@coderabbitai

coderabbitai bot commented Apr 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new cross-core example demonstrating fused vs. unfused matmul-plus-residual variants and updates Qwen3 scope3 scheduling to use chunked loop optimizations that fuse output projection and residual writeback into chunked parallel core-group regions.

Changes

Cohort / File(s) — Summary

New Cross‑Core Example — examples/intermediate/cross_core.py
  Added an executable example with two IR builders (fused and unfused) for resid = matmul(attn_out, wo) + hidden_states (BF16 inputs, FP32 accumulation/output), tensor-spec generation, an FP32 Torch golden reference, and compile_and_run() with platform/device/fusion CLI and runtime options.

Qwen3 Model Update — examples/models/qwen3/qwen3_32b_decode_scope3.py
  Increased MLP_OUT_CHUNK from 64 to 256 and refactored scope3 stages to merge output projection and residual add into pl.at(..., optimization=pl.chunked_loop_optimizer(split=pl.SplitMode.UP_DOWN)) regions using pl.parallel(..., chunk=4); applied a similar chunked rework to later stages and removed a trailing comma in RunConfig.
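The walkthrough mentions that cross_core.py checks results against an FP32 Torch golden reference for resid = matmul(attn_out, wo) + hidden_states. The exact reference is in the example itself; as a rough sketch of the idea (function name hypothetical, written in NumPy rather than Torch, assuming BF16-like inputs are upcast to FP32 before accumulation):

```python
import numpy as np

def golden_resid(attn_out, wo, hidden_states):
    # Upcast inputs to FP32 before the matmul so accumulation happens
    # in FP32, matching the kernel's accumulation/output precision.
    a = attn_out.astype(np.float32)
    w = wo.astype(np.float32)
    h = hidden_states.astype(np.float32)
    return a @ w + h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, hidden = 4, 8
    attn_out = rng.standard_normal((batch, hidden)).astype(np.float32)
    wo = rng.standard_normal((hidden, hidden)).astype(np.float32)
    hidden_states = rng.standard_normal((batch, hidden)).astype(np.float32)
    resid = golden_resid(attn_out, wo, hidden_states)
    print(resid.shape)  # (4, 8)
```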

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰
I hop through loops both fused and split,
Chunked paths align where matmuls sit,
BF16 dreams and FP32 light,
Cores join hands and run the night,
A rabbit cheers—optimization's delight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 15.38%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title accurately reflects the main changes: fusing Stage 0-1 and Stage 6-7 with chunked_loop_optimizer in the Qwen3 decode scope.
  • Description check — ✅ Passed: the description directly relates to the changeset, detailing all major modifications including the new example, fusion changes, MLP_OUT_CHUNK increase, and pl.parallel usage.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new example script, cross_core.py, demonstrating the fusion of output projection and residual addition using the chunked_loop_optimizer. It also optimizes the qwen3_32b_decode_scope3.py model by fusing stages 0 and 1, and stages 6 and 7, while increasing the MLP_OUT_CHUNK size. Feedback suggests replacing hardcoded chunk sizes in parallel loops with named constants for better maintainability and correcting the slicing logic in the cross_core example to handle batch sizes larger than the tile size.

Comment on lines +67 to +68
with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer(split=pl.SplitMode.UP_DOWN)):
for ob in pl.parallel(0, Q_OUT_BLOCKS, chunk=4):

medium

The fusion of Stage 0 and Stage 1 using chunked_loop_optimizer with pl.parallel is a significant optimization for cross-core execution. However, the chunk=4 parameter is hardcoded. While this might be tuned for the current configuration, it could be beneficial to define this as a constant or make it configurable to allow for easier tuning across different hardware platforms or hidden sizes.
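For intuition about what a named constant would replace: pl.parallel(0, N, chunk=4) presumably hands out contiguous chunks of four loop indices across the cores in the group. A plain-Python sketch of that chunked round-robin distribution (CHUNK and NUM_CORES are illustrative constants, and chunked_schedule is a hypothetical helper, not part of the pl API):

```python
CHUNK = 4       # named constant replacing the hardcoded chunk=4
NUM_CORES = 8   # hypothetical core-group size

def chunked_schedule(num_blocks, chunk=CHUNK, num_cores=NUM_CORES):
    """Assign contiguous chunks of block indices to cores, round-robin."""
    assignment = {core: [] for core in range(num_cores)}
    for i, start in enumerate(range(0, num_blocks, chunk)):
        core = i % num_cores
        assignment[core].extend(range(start, min(start + chunk, num_blocks)))
    return assignment

# Example: 64 blocks in chunks of 4 over 8 cores -> 2 chunks (8 blocks) per core.
sched = chunked_schedule(64)
print(len(sched[0]))  # 8
```

Tuning then becomes a one-line change to CHUNK rather than edits scattered across every fused loop.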

Comment on lines +136 to +137
with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer(split=pl.SplitMode.UP_DOWN)):
for dob in pl.parallel(0, HIDDEN_BLOCKS, chunk=4):

medium

Similar to the Stage 0-1 fusion, the chunk=4 parameter here is hardcoded. For consistency and maintainability, consider using a named constant for the parallel chunk size.

):
for ob in pl.parallel(0, q_out_blocks, chunk=chunk):
o0 = ob * q_out_chunk
a_chunk_0 = pl.slice(attn_out, [batch_tile, k_chunk], [0, 0])

medium

In the fused program, a_chunk_0 is sliced with a fixed row offset of 0. This assumes that the batch size is equal to batch_tile. If batch > batch_tile, this code will only process the first tile of the batch, which might lead to incorrect results or incomplete output in resid. Since this is a debugging script, it's safer to ensure the slicing logic accounts for the batch dimension if it's intended to be generic.

References
  1. Ensure code functionality handles edge cases and aligns with intent, especially regarding tensor slicing and batch processing.
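Making the slicing generic would mean iterating the row offset over batch tiles instead of fixing it at 0. A NumPy sketch of the intended tiled traversal (tiled_rows is an illustrative helper, not the pl API, and assumes batch divides evenly into tiles):

```python
import numpy as np

def tiled_rows(batch, batch_tile):
    """Yield (row_offset, rows) pairs covering the whole batch, not just tile 0."""
    if batch % batch_tile != 0:
        raise ValueError(f"batch={batch} must be divisible by batch_tile={batch_tile}")
    for row in range(0, batch, batch_tile):
        yield row, batch_tile

batch, hidden, batch_tile = 8, 4, 2
attn_out = np.arange(batch * hidden, dtype=np.float32).reshape(batch, hidden)
# Slice each tile at its own row offset instead of a fixed offset of 0.
tiles = [attn_out[row:row + rows, :] for row, rows in tiled_rows(batch, batch_tile)]
print(len(tiles))  # 4
```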


@coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/models/qwen3/qwen3_32b_decode_scope3.py (1)

32-47: ⚠️ Potential issue | 🟡 Minor

Fail fast on non-256-aligned intermediate_size.

This change makes the tiling contract stricter, but MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNK still floors silently. Any caller that passes a non-multiple of 256 now gets a partial MLP path with no signal.

Suggested guard
 def build_qwen3_scope3_program(
     batch: int = BATCH,
     hidden_size: int = HIDDEN,
     intermediate_size: int = INTERMEDIATE,
 ):
     BATCH_CFG = batch
     HIDDEN_CFG = hidden_size
     INTER_CFG = intermediate_size
+
+    if INTER_CFG % MLP_OUT_CHUNK != 0:
+        raise ValueError("intermediate_size must be divisible by MLP_OUT_CHUNK")
 
     HIDDEN_BLOCKS = HIDDEN_CFG // K_CHUNK
     Q_OUT_BLOCKS = HIDDEN_CFG // Q_OUT_CHUNK
     MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNK
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` around lines 32 - 47, The
code computes MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNK but does not verify
INTER_CFG is a multiple of MLP_OUT_CHUNK, causing silent truncation; update
build_qwen3_scope3_program to validate INTER_CFG % MLP_OUT_CHUNK == 0 (use an
assert or raise ValueError) before computing MLP_OUT_BLOCKS and include a clear
message mentioning INTER_CFG and MLP_OUT_CHUNK so callers fail fast when
intermediate_size is not 256-aligned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/intermediate/cross_core.py`:
- Around line 37-47: Before building programs, validate the single-tile shape
contract: in build_cross_core_fusion_program (and the other builder functions
that hard-code row offset 0 and slice [batch_tile, ...]) check that batch ==
batch_tile and that hidden is divisible by k_chunk and q_out_chunk (i.e., hidden
% k_chunk == 0 and hidden % q_out_chunk == 0); if any check fails, raise a clear
ValueError with a message explaining the mismatch so the caller cannot silently
drop rows or tail blocks. Ensure these validations run before computing
hidden_blocks/q_out_blocks or proceeding with program construction.

---

Outside diff comments:
In `@examples/models/qwen3/qwen3_32b_decode_scope3.py`:
- Around line 32-47: The code computes MLP_OUT_BLOCKS = INTER_CFG //
MLP_OUT_CHUNK but does not verify INTER_CFG is a multiple of MLP_OUT_CHUNK,
causing silent truncation; update build_qwen3_scope3_program to validate
INTER_CFG % MLP_OUT_CHUNK == 0 (use an assert or raise ValueError) before
computing MLP_OUT_BLOCKS and include a clear message mentioning INTER_CFG and
MLP_OUT_CHUNK so callers fail fast when intermediate_size is not 256-aligned.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b3540889-3903-4a4d-8024-82bffac49230

📥 Commits

Reviewing files that changed from the base of the PR and between efce0d4 and dd9b022.

📒 Files selected for processing (2)
  • examples/intermediate/cross_core.py
  • examples/models/qwen3/qwen3_32b_decode_scope3.py

Comment on lines +37 to +47
def build_cross_core_fusion_program(
batch: int = BATCH,
hidden: int = HIDDEN,
k_chunk: int = K_CHUNK,
q_out_chunk: int = Q_OUT_CHUNK,
batch_tile: int = BATCH_TILE,
chunk: int = 4,
):
"""Build fused Stage 0 & 1 program with chunked_loop_optimizer."""
hidden_blocks = hidden // k_chunk
q_out_blocks = hidden // q_out_chunk

⚠️ Potential issue | 🟠 Major

Validate the single-tile shape contract before building either program.

Both builders hard-code row offset 0 and slice [batch_tile, ...], while the block counts use floor division. batch != batch_tile leaves rows uncomputed, and non-divisible hidden values silently drop the tail.

Suggested guard
 def build_cross_core_fusion_program(
     batch: int = BATCH,
     hidden: int = HIDDEN,
     k_chunk: int = K_CHUNK,
     q_out_chunk: int = Q_OUT_CHUNK,
     batch_tile: int = BATCH_TILE,
     chunk: int = 4,
 ):
     """Build fused Stage 0 & 1 program with chunked_loop_optimizer."""
+    if batch != batch_tile:
+        raise ValueError("This example currently requires batch == batch_tile")
+    if hidden % k_chunk != 0 or hidden % q_out_chunk != 0:
+        raise ValueError("hidden must be divisible by k_chunk and q_out_chunk")
     hidden_blocks = hidden // k_chunk
     q_out_blocks = hidden // q_out_chunk
 ...
 def build_cross_core_split_program(
     batch: int = BATCH,
     hidden: int = HIDDEN,
     k_chunk: int = K_CHUNK,
     q_out_chunk: int = Q_OUT_CHUNK,
     batch_tile: int = BATCH_TILE,
 ):
     """Build unfused Stage 0 & 1 program with separate pl.at blocks."""
+    if batch != batch_tile:
+        raise ValueError("This example currently requires batch == batch_tile")
+    if hidden % k_chunk != 0 or hidden % q_out_chunk != 0:
+        raise ValueError("hidden must be divisible by k_chunk and q_out_chunk")
     hidden_blocks = hidden // k_chunk
     q_out_blocks = hidden // q_out_chunk

Also applies to: 54-79, 86-95, 102-126

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/intermediate/cross_core.py` around lines 37 - 47, Before building
programs, validate the single-tile shape contract: in
build_cross_core_fusion_program (and the other builder functions that hard-code
row offset 0 and slice [batch_tile, ...]) check that batch == batch_tile and
that hidden is divisible by k_chunk and q_out_chunk (i.e., hidden % k_chunk == 0
and hidden % q_out_chunk == 0); if any check fails, raise a clear ValueError
with a message explaining the mismatch so the caller cannot silently drop rows
or tail blocks. Ensure these validations run before computing
hidden_blocks/q_out_blocks or proceeding with program construction.

@bumble0918 changed the title from "Update: fuse Stage 0-1 and Stage 6-7 with chunked_loop_optimizer" to "Update: fuse Stage 0-1 and Stage 6-7 with chunked_loop_optimizer in Scope 3" on Apr 14, 2026
- Add cross_core.py example for Stage 0&1 fusion debugging
- Fuse output projection + residual add in Qwen3 decode (Stage 0&1)
- Fuse down projection + final residual writeback (Stage 6&7)
- Increase MLP_OUT_CHUNK from 64 to 256 for better tiling
- Use pl.parallel with chunk=4 for cross-core task distribution

@coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
examples/intermediate/cross_core.py (1)

37-47: ⚠️ Potential issue | 🟠 Major

Validate the single-tile shape contract in both builders.

Both builders still hard-code row offset 0 and slice [batch_tile, ...], while hidden_blocks and q_out_blocks are floor-divided. batch != batch_tile leaves rows uncovered, and non-divisible hidden values silently drop the tail.

Suggested fix
+def _validate_single_tile_config(
+    batch: int,
+    hidden: int,
+    k_chunk: int,
+    q_out_chunk: int,
+    batch_tile: int,
+) -> None:
+    if batch != batch_tile:
+        raise ValueError("This example currently requires batch == batch_tile")
+    if hidden % k_chunk != 0 or hidden % q_out_chunk != 0:
+        raise ValueError("hidden must be divisible by k_chunk and q_out_chunk")
+
+
 def build_cross_core_fusion_program(
     batch: int = BATCH,
     hidden: int = HIDDEN,
     k_chunk: int = K_CHUNK,
     q_out_chunk: int = Q_OUT_CHUNK,
     batch_tile: int = BATCH_TILE,
     chunk: int = 4,
 ):
     """Build fused Stage 0 & 1 program with chunked_loop_optimizer."""
+    _validate_single_tile_config(batch, hidden, k_chunk, q_out_chunk, batch_tile)
     hidden_blocks = hidden // k_chunk
     q_out_blocks = hidden // q_out_chunk
 ...
 def build_cross_core_split_program(
     batch: int = BATCH,
     hidden: int = HIDDEN,
     k_chunk: int = K_CHUNK,
     q_out_chunk: int = Q_OUT_CHUNK,
     batch_tile: int = BATCH_TILE,
 ):
     """Build unfused Stage 0 & 1 program with separate pl.at blocks."""
+    _validate_single_tile_config(batch, hidden, k_chunk, q_out_chunk, batch_tile)
     hidden_blocks = hidden // k_chunk
     q_out_blocks = hidden // q_out_chunk

Also applies to: 86-95

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/intermediate/cross_core.py` around lines 37 - 47, In
build_cross_core_fusion_program the two builders assume a single-tile shape by
hard-coding row offset 0 and slice [batch_tile, ...], which leaves rows
uncovered when batch != batch_tile and drops tail elements when hidden or q_out
are not divisible by k_chunk/q_out_chunk; update both builders (the ones
constructing the Stage 0 & 1 fused program) to validate the single-tile
contract: check that batch == batch_tile and that hidden % k_chunk == 0 and
hidden % q_out_chunk == 0 (or explicitly handle the remainder via
ceil/block-padding), and if the checks fail either adjust the slice calculations
to cover the tail rows/columns or raise a clear error; reference the builders
inside build_cross_core_fusion_program and ensure row offsets and slice ranges
are computed from batch and hidden (not hard-coded 0 and batch_tile) so all rows
and hidden blocks are covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/intermediate/cross_core.py`:
- Line 237: Remove the unused f-string prefixes on the print calls that don't
interpolate values: replace the f-prefixed print with a normal string for the
message "Cross Core — Stage 0 & 1 Fusion Test" and do the same for the other
print with identical issue (the one around the second occurrence of that
message). Update the print statements (the print calls that currently use
f"...") to use plain string literals so Ruff F541 is resolved.
- Around line 1-4: The license header at the top of the file has inverted
wording on line 4: change "You may use this file except in compliance with the
License." to "You may not use this file except in compliance with the License."
— update the file header comment block in examples/intermediate/cross_core.py so
the standard phrase matches the other files and preserves the correct meaning.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py`:
- Line 32: Add explicit tiling-contract checks before program construction to
prevent silent truncation: validate that batch % BATCH_TILE == 0, hidden_size %
<hidden_chunk> == 0, intermediate_size % MLP_OUT_CHUNK == 0 (and any other chunk
constants used at lines 66-68 and 136-138) and raise a clear error (e.g.,
ValueError) if they fail. Locate the constants MLP_OUT_CHUNK and BATCH_TILE and
the places where fused loops/tiles are computed (the blocks referenced at lines
66-68 and 136-138) and insert assertions/guard code that includes the offending
values in the message so callers can correct batch/hidden/intermediate
dimensions before building the program.

---

Duplicate comments:
In `@examples/intermediate/cross_core.py`:
- Around line 37-47: In build_cross_core_fusion_program the two builders assume
a single-tile shape by hard-coding row offset 0 and slice [batch_tile, ...],
which leaves rows uncovered when batch != batch_tile and drops tail elements
when hidden or q_out are not divisible by k_chunk/q_out_chunk; update both
builders (the ones constructing the Stage 0 & 1 fused program) to validate the
single-tile contract: check that batch == batch_tile and that hidden % k_chunk
== 0 and hidden % q_out_chunk == 0 (or explicitly handle the remainder via
ceil/block-padding), and if the checks fail either adjust the slice calculations
to cover the tail rows/columns or raise a clear error; reference the builders
inside build_cross_core_fusion_program and ensure row offsets and slice ranges
are computed from batch and hidden (not hard-coded 0 and batch_tile) so all rows
and hidden blocks are covered.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b6667168-094d-44ac-97b8-434bddb74083

📥 Commits

Reviewing files that changed from the base of the PR and between dd9b022 and 6eafa55.

📒 Files selected for processing (2)
  • examples/intermediate/cross_core.py
  • examples/models/qwen3/qwen3_32b_decode_scope3.py

Comment on lines +1 to +4
# Copyright (c) PyPTO Contributors.
# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
# CANN Open Software License Agreement Version 2.0 (the "License").
# Please refer to the License for details. You may use this file except in compliance with the License.

⚠️ Potential issue | 🟡 Minor

Fix the inverted license notice.

Line 4 says You may use this file except in compliance with the License., which flips the meaning of the standard header. This should be may not use, matching the other files.

Suggested fix
-# Please refer to the License for details. You may use this file except in compliance with the License.
+# Please refer to the License for details. You may not use this file except in compliance with the License.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 # Copyright (c) PyPTO Contributors.
 # This program is free software, you can redistribute it and/or modify it under the terms and conditions of
 # CANN Open Software License Agreement Version 2.0 (the "License").
-# Please refer to the License for details. You may use this file except in compliance with the License.
+# Please refer to the License for details. You may not use this file except in compliance with the License.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/intermediate/cross_core.py` around lines 1 - 4, The license header
at the top of the file has inverted wording on line 4: change "You may use this
file except in compliance with the License." to "You may not use this file
except in compliance with the License." — update the file header comment block
in examples/intermediate/cross_core.py so the standard phrase matches the other
files and preserves the correct meaning.

args = parser.parse_args()

print(f"\n{'='*60}")
print(f"Cross Core — Stage 0 & 1 Fusion Test")

⚠️ Potential issue | 🟡 Minor

Remove the unused f prefixes.

Ruff F541 flags both of these strings because they do not interpolate anything.

Suggested fix
-    print(f"Cross Core — Stage 0 & 1 Fusion Test")
+    print("Cross Core — Stage 0 & 1 Fusion Test")
...
-    print(f"PASSED")
+    print("PASSED")

Also applies to: 267-267

🧰 Tools
🪛 Ruff (0.15.9)

[error] 237-237: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/intermediate/cross_core.py` at line 237, Remove the unused f-string
prefixes on the print calls that don't interpolate values: replace the
f-prefixed print with a normal string for the message "Cross Core — Stage 0 & 1
Fusion Test" and do the same for the other print with identical issue (the one
around the second occurrence of that message). Update the print statements (the
print calls that currently use f"...") to use plain string literals so Ruff F541
is resolved.

 K_CHUNK = 128
 Q_OUT_CHUNK = 64
-MLP_OUT_CHUNK = 64
+MLP_OUT_CHUNK = 256

⚠️ Potential issue | 🟠 Major

Validate the tiling contract before building this program.

Line 32 makes MLP_OUT_CHUNK another hard divisibility requirement, but these fused loops still execute only full tiles while the block counts are floor-divided. A batch that's not a multiple of BATCH_TILE, or a hidden_size / intermediate_size that's not divisible by the corresponding chunk size, will silently truncate work or hit invalid partial slices.

Suggested guard
 def build_qwen3_scope3_program(
     batch: int = BATCH,
     hidden_size: int = HIDDEN,
     intermediate_size: int = INTERMEDIATE,
 ):
+    if batch % BATCH_TILE != 0:
+        raise ValueError("batch must be divisible by BATCH_TILE")
+    if hidden_size % K_CHUNK != 0 or hidden_size % Q_OUT_CHUNK != 0:
+        raise ValueError("hidden_size must be divisible by K_CHUNK and Q_OUT_CHUNK")
+    if intermediate_size % MLP_OUT_CHUNK != 0:
+        raise ValueError("intermediate_size must be divisible by MLP_OUT_CHUNK")
+
     BATCH_CFG = batch
     HIDDEN_CFG = hidden_size
     INTER_CFG = intermediate_size

Also applies to: 66-68, 136-138

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 32, Add explicit
tiling-contract checks before program construction to prevent silent truncation:
validate that batch % BATCH_TILE == 0, hidden_size % <hidden_chunk> == 0,
intermediate_size % MLP_OUT_CHUNK == 0 (and any other chunk constants used at
lines 66-68 and 136-138) and raise a clear error (e.g., ValueError) if they
fail. Locate the constants MLP_OUT_CHUNK and BATCH_TILE and the places where
fused loops/tiles are computed (the blocks referenced at lines 66-68 and
136-138) and insert assertions/guard code that includes the offending values in
the message so callers can correct batch/hidden/intermediate dimensions before
building the program.
