Update: fuse Stage 0-1 and Stage 6-7 with chunked_loop_optimizer in Scope 3 by bumble0918 · Pull Request #105 · hw-native-sys/pypto-lib

bumble0918 · 2026-04-14T02:45:30Z

Add cross_core.py example for Stage 0&1 fusion debugging
Fuse output projection + residual add in Qwen3 decode (Stage 0&1)
Fuse down projection + final residual writeback (Stage 6&7)
Increase MLP_OUT_CHUNK from 64 to 256 for better tiling
Use pl.parallel with chunk=4 for cross-core task distribution

coderabbitai · 2026-04-14T02:45:45Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a new cross-core example demonstrating fused vs. unfused matmul-plus-residual variants and updates Qwen3 scope3 scheduling to use chunked loop optimizations that fuse output projection and residual writeback into chunked parallel core-group regions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| New Cross‑Core Example `examples/intermediate/cross_core.py` | Added an executable example with two IR builders (fused and unfused) for `resid = matmul(attn_out, wo) + hidden_states` (BF16 inputs, FP32 accumulation/output), tensor spec generation, an FP32 Torch golden reference, and `compile_and_run()` with platform/device/fusion CLI and runtime options. |
| Qwen3 Model Update `examples/models/qwen3/qwen3_32b_decode_scope3.py` | Increased `MLP_OUT_CHUNK` from 64 to 256 and refactored scope3 stages to merge output projection and residual add into `pl.at(..., optimization=pl.chunked_loop_optimizer(split=pl.SplitMode.UP_DOWN))` regions using `pl.parallel(..., chunk=4)`; applied a similar chunked rework to later stages and removed a trailing comma in `RunConfig`. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Add: Qwen3 scope3 standalone example and update tilelet decode #65: Makes the same scope3 change—merging output-projection and residual into a chunked pl.at region with pl.chunked_loop_optimizer and pl.parallel(..., chunk=4).
Refactor: replace deprecated pl.incore/pl.auto_incore with pl.at syntax #101: Converts in-core/auto_incore regions to explicit pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer) with matching scope3 refactorings.
Owen Scope3: Initial Implementation and Numerical Correctness on CV-seperated mode #79: Overlaps on scope3 scheduling and residual/writeback region rework using chunked parallel regions.

Poem

🐰
I hop through loops both fused and split,
Chunked paths align where matmuls sit,
BF16 dreams and FP32 light,
Cores join hands and run the night,
A rabbit cheers—optimization's delight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main changes: fusing Stage 0-1 and Stage 6-7 with chunked_loop_optimizer in the Qwen3 decode scope. |
| Description check | ✅ Passed | The description directly relates to the changeset, detailing all major modifications including the new example, fusion changes, MLP_OUT_CHUNK increase, and pl.parallel usage. |

gemini-code-assist

Code Review

This pull request introduces a new example script, cross_core.py, demonstrating the fusion of output projection and residual addition using the chunked_loop_optimizer. It also optimizes the qwen3_32b_decode_scope3.py model by fusing stages 0 and 1, and stages 6 and 7, while increasing the MLP_OUT_CHUNK size. Feedback suggests replacing hardcoded chunk sizes in parallel loops with named constants for better maintainability and correcting the slicing logic in the cross_core example to handle batch sizes larger than the tile size.

gemini-code-assist · 2026-04-14T02:50:55Z

examples/models/qwen3/qwen3_32b_decode_scope3.py

+                with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer(split=pl.SplitMode.UP_DOWN)):
+                    for ob in pl.parallel(0, Q_OUT_BLOCKS, chunk=4):


The fusion of Stage 0 and Stage 1 using chunked_loop_optimizer with pl.parallel is a significant optimization for cross-core execution. However, the chunk=4 parameter is hardcoded. While this might be tuned for the current configuration, it could be beneficial to define this as a constant or make it configurable to allow for easier tuning across different hardware platforms or hidden sizes.
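One way to act on this suggestion is to lift the literal into a module-level constant and reuse it in both fused regions. The sketch below is plain Python, not pypto code: `PARALLEL_CHUNK` and `chunked_ranges` are hypothetical names, and the grouping shown is only an assumption about how `chunk=4` batches consecutive loop iterations. The actual scheduling is decided by `pl.parallel` and `pl.chunked_loop_optimizer`.

```python
# Hypothetical module-level tuning knob replacing the literal 4.
PARALLEL_CHUNK = 4


def chunked_ranges(start: int, stop: int, chunk: int) -> list[range]:
    """Split [start, stop) into consecutive groups of at most `chunk` iterations."""
    if chunk <= 0:
        raise ValueError(f"chunk must be positive, got {chunk}")
    return [range(lo, min(lo + chunk, stop)) for lo in range(start, stop, chunk)]


# Example: 10 blocks grouped 4 at a time; the last group is a 2-iteration tail.
groups = chunked_ranges(0, 10, PARALLEL_CHUNK)
print([list(g) for g in groups])
```

With a single named constant, retuning for another platform or hidden size means editing one line rather than every `pl.parallel(...)` call site.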

gemini-code-assist · 2026-04-14T02:50:55Z

examples/models/qwen3/qwen3_32b_decode_scope3.py

+                with pl.at(level=pl.Level.CORE_GROUP, optimization=pl.chunked_loop_optimizer(split=pl.SplitMode.UP_DOWN)):
+                    for dob in pl.parallel(0, HIDDEN_BLOCKS, chunk=4):


Similar to the Stage 0-1 fusion, the chunk=4 parameter here is hardcoded. For consistency and maintainability, consider using a named constant for the parallel chunk size.

gemini-code-assist · 2026-04-14T02:50:55Z

examples/intermediate/cross_core.py

+            ):
+                for ob in pl.parallel(0, q_out_blocks, chunk=chunk):
+                    o0 = ob * q_out_chunk
+                    a_chunk_0 = pl.slice(attn_out, [batch_tile, k_chunk], [0, 0])


In the fused program, a_chunk_0 is sliced with a fixed row offset of 0. This assumes that the batch size is equal to batch_tile. If batch > batch_tile, this code will only process the first tile of the batch, which might lead to incorrect results or incomplete output in resid. Since this is a debugging script, it's safer to ensure the slicing logic accounts for the batch dimension if it's intended to be generic.

References

Ensure code functionality handles edge cases and aligns with intent, especially regarding tensor slicing and batch processing.
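The row-coverage concern is easier to see with plain Python lists standing in for tensors. In the sketch below, `add_residual_first_tile_only` mirrors the fixed row offset of 0, while `add_residual_all_tiles` advances the offset in steps of `batch_tile`. Both helper names and the toy shapes are assumptions for illustration, and the matmul is omitted so that only row coverage is shown.

```python
def add_residual_first_tile_only(x, resid, batch_tile):
    """Write x + resid for rows [0, batch_tile) only (fixed row offset 0)."""
    out = [[0.0] * len(row) for row in x]
    for r in range(min(batch_tile, len(x))):
        out[r] = [a + b for a, b in zip(x[r], resid[r])]
    return out


def add_residual_all_tiles(x, resid, batch_tile):
    """Same computation, but the row offset advances one tile at a time."""
    out = [[0.0] * len(row) for row in x]
    for b0 in range(0, len(x), batch_tile):
        for r in range(b0, min(b0 + batch_tile, len(x))):
            out[r] = [a + b for a, b in zip(x[r], resid[r])]
    return out


batch, batch_tile = 4, 2
x = [[float(r)] * 3 for r in range(batch)]
resid = [[1.0] * 3 for _ in range(batch)]

partial = add_residual_first_tile_only(x, resid, batch_tile)
full = add_residual_all_tiles(x, resid, batch_tile)
print(partial[2], full[2])  # row 2 is left at zero in the first variant
```

When `batch == batch_tile` the two variants agree, which is why the problem stays hidden under the default configuration.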

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

examples/models/qwen3/qwen3_32b_decode_scope3.py (1)

32-47: ⚠️ Potential issue | 🟡 Minor

Fail fast on non-256-aligned intermediate_size.

This change makes the tiling contract stricter, but MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNK still floors silently. Any caller that passes a non-multiple of 256 now gets a partial MLP path with no signal.
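A quick arithmetic check makes the silent truncation concrete. The sizes below are made-up values chosen for illustration, not the model's configuration, and `covered_columns` is a hypothetical helper.

```python
MLP_OUT_CHUNK = 256  # value introduced by this PR


def covered_columns(intermediate_size: int, chunk: int = MLP_OUT_CHUNK) -> tuple[int, int]:
    """Return (columns processed by full blocks, columns silently skipped)."""
    blocks = intermediate_size // chunk  # floor division, as in MLP_OUT_BLOCKS
    covered = blocks * chunk
    return covered, intermediate_size - covered


print(covered_columns(1024))      # aligned to 256: nothing skipped
print(covered_columns(1088))      # 64 trailing columns are never visited
print(covered_columns(1088, 64))  # the same size was fully covered with the old chunk of 64
```

The last call shows why the stricter contract matters: a size that tiled cleanly at 64 can lose its tail at 256 without any error being raised.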

Suggested guard

 def build_qwen3_scope3_program(
     batch: int = BATCH,
     hidden_size: int = HIDDEN,
     intermediate_size: int = INTERMEDIATE,
 ):
     BATCH_CFG = batch
     HIDDEN_CFG = hidden_size
     INTER_CFG = intermediate_size
+
+    if INTER_CFG % MLP_OUT_CHUNK != 0:
+        raise ValueError("intermediate_size must be divisible by MLP_OUT_CHUNK")
 
     HIDDEN_BLOCKS = HIDDEN_CFG // K_CHUNK
     Q_OUT_BLOCKS = HIDDEN_CFG // Q_OUT_CHUNK
     MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNK

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` around lines 32 - 47, The
code computes MLP_OUT_BLOCKS = INTER_CFG // MLP_OUT_CHUNK but does not verify
INTER_CFG is a multiple of MLP_OUT_CHUNK, causing silent truncation; update
build_qwen3_scope3_program to validate INTER_CFG % MLP_OUT_CHUNK == 0 (use an
assert or raise ValueError) before computing MLP_OUT_BLOCKS and include a clear
message mentioning INTER_CFG and MLP_OUT_CHUNK so callers fail fast when
intermediate_size is not 256-aligned.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/intermediate/cross_core.py`:
- Around line 37-47: Before building programs, validate the single-tile shape
contract: in build_cross_core_fusion_program (and the other builder functions
that hard-code row offset 0 and slice [batch_tile, ...]) check that batch ==
batch_tile and that hidden is divisible by k_chunk and q_out_chunk (i.e., hidden
% k_chunk == 0 and hidden % q_out_chunk == 0); if any check fails, raise a clear
ValueError with a message explaining the mismatch so the caller cannot silently
drop rows or tail blocks. Ensure these validations run before computing
hidden_blocks/q_out_blocks or proceeding with program construction.

📥 Commits

Reviewing files that changed from the base of the PR and between efce0d4 and dd9b022.

📒 Files selected for processing (2)

examples/intermediate/cross_core.py
examples/models/qwen3/qwen3_32b_decode_scope3.py

coderabbitai · 2026-04-14T02:53:15Z

examples/intermediate/cross_core.py

+def build_cross_core_fusion_program(
+    batch: int = BATCH,
+    hidden: int = HIDDEN,
+    k_chunk: int = K_CHUNK,
+    q_out_chunk: int = Q_OUT_CHUNK,
+    batch_tile: int = BATCH_TILE,
+    chunk: int = 4,
+):
+    """Build fused Stage 0 & 1 program with chunked_loop_optimizer."""
+    hidden_blocks = hidden // k_chunk
+    q_out_blocks = hidden // q_out_chunk


⚠️ Potential issue | 🟠 Major

Validate the single-tile shape contract before building either program.

Both builders hard-code row offset 0 and slice [batch_tile, ...], while the block counts use floor division. batch != batch_tile leaves rows uncomputed, and non-divisible hidden values silently drop the tail.

Suggested guard

 def build_cross_core_fusion_program(
     batch: int = BATCH,
     hidden: int = HIDDEN,
     k_chunk: int = K_CHUNK,
     q_out_chunk: int = Q_OUT_CHUNK,
     batch_tile: int = BATCH_TILE,
     chunk: int = 4,
 ):
     """Build fused Stage 0 & 1 program with chunked_loop_optimizer."""
+    if batch != batch_tile:
+        raise ValueError("This example currently requires batch == batch_tile")
+    if hidden % k_chunk != 0 or hidden % q_out_chunk != 0:
+        raise ValueError("hidden must be divisible by k_chunk and q_out_chunk")
     hidden_blocks = hidden // k_chunk
     q_out_blocks = hidden // q_out_chunk
 ...
 def build_cross_core_split_program(
     batch: int = BATCH,
     hidden: int = HIDDEN,
     k_chunk: int = K_CHUNK,
     q_out_chunk: int = Q_OUT_CHUNK,
     batch_tile: int = BATCH_TILE,
 ):
     """Build unfused Stage 0 & 1 program with separate pl.at blocks."""
+    if batch != batch_tile:
+        raise ValueError("This example currently requires batch == batch_tile")
+    if hidden % k_chunk != 0 or hidden % q_out_chunk != 0:
+        raise ValueError("hidden must be divisible by k_chunk and q_out_chunk")
     hidden_blocks = hidden // k_chunk
     q_out_blocks = hidden // q_out_chunk

Also applies to: 54-79, 86-95, 102-126

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/intermediate/cross_core.py` around lines 37 - 47, Before building programs, validate the single-tile shape contract: in build_cross_core_fusion_program (and the other builder functions that hard-code row offset 0 and slice [batch_tile, ...]) check that batch == batch_tile and that hidden is divisible by k_chunk and q_out_chunk (i.e., hidden % k_chunk == 0 and hidden % q_out_chunk == 0); if any check fails, raise a clear ValueError with a message explaining the mismatch so the caller cannot silently drop rows or tail blocks. Ensure these validations run before computing hidden_blocks/q_out_blocks or proceeding with program construction.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

examples/intermediate/cross_core.py (1)

37-47: ⚠️ Potential issue | 🟠 Major

Validate the single-tile shape contract in both builders.

Both builders still hard-code row offset 0 and slice [batch_tile, ...], while hidden_blocks and q_out_blocks are floor-divided. batch != batch_tile leaves rows uncovered, and non-divisible hidden values silently drop the tail.

Suggested fix

+def _validate_single_tile_config(
+    batch: int,
+    hidden: int,
+    k_chunk: int,
+    q_out_chunk: int,
+    batch_tile: int,
+) -> None:
+    if batch != batch_tile:
+        raise ValueError("This example currently requires batch == batch_tile")
+    if hidden % k_chunk != 0 or hidden % q_out_chunk != 0:
+        raise ValueError("hidden must be divisible by k_chunk and q_out_chunk")
+
+
 def build_cross_core_fusion_program(
     batch: int = BATCH,
     hidden: int = HIDDEN,
     k_chunk: int = K_CHUNK,
     q_out_chunk: int = Q_OUT_CHUNK,
     batch_tile: int = BATCH_TILE,
     chunk: int = 4,
 ):
     """Build fused Stage 0 & 1 program with chunked_loop_optimizer."""
+    _validate_single_tile_config(batch, hidden, k_chunk, q_out_chunk, batch_tile)
     hidden_blocks = hidden // k_chunk
     q_out_blocks = hidden // q_out_chunk
 ...
 def build_cross_core_split_program(
     batch: int = BATCH,
     hidden: int = HIDDEN,
     k_chunk: int = K_CHUNK,
     q_out_chunk: int = Q_OUT_CHUNK,
     batch_tile: int = BATCH_TILE,
 ):
     """Build unfused Stage 0 & 1 program with separate pl.at blocks."""
+    _validate_single_tile_config(batch, hidden, k_chunk, q_out_chunk, batch_tile)
     hidden_blocks = hidden // k_chunk
     q_out_blocks = hidden // q_out_chunk

Also applies to: 86-95

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@examples/intermediate/cross_core.py` around lines 37 - 47, In
build_cross_core_fusion_program the two builders assume a single-tile shape by
hard-coding row offset 0 and slice [batch_tile, ...], which leaves rows
uncovered when batch != batch_tile and drops tail elements when hidden or q_out
are not divisible by k_chunk/q_out_chunk; update both builders (the ones
constructing the Stage 0 & 1 fused program) to validate the single-tile
contract: check that batch == batch_tile and that hidden % k_chunk == 0 and
hidden % q_out_chunk == 0 (or explicitly handle the remainder via
ceil/block-padding), and if the checks fail either adjust the slice calculations
to cover the tail rows/columns or raise a clear error; reference the builders
inside build_cross_core_fusion_program and ensure row offsets and slice ranges
are computed from batch and hidden (not hard-coded 0 and batch_tile) so all rows
and hidden blocks are covered.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/intermediate/cross_core.py`:
- Line 237: Remove the unused f-string prefixes on the print calls that don't
interpolate values: replace the f-prefixed print with a normal string for the
message "Cross Core — Stage 0 & 1 Fusion Test" and do the same for the other
print with identical issue (the one around the second occurrence of that
message). Update the print statements (the print calls that currently use
f"...") to use plain string literals so Ruff F541 is resolved.
- Around line 1-4: The license header at the top of the file has inverted
wording on line 4: change "You may use this file except in compliance with the
License." to "You may not use this file except in compliance with the License."
— update the file header comment block in examples/intermediate/cross_core.py so
the standard phrase matches the other files and preserves the correct meaning.

In `@examples/models/qwen3/qwen3_32b_decode_scope3.py`:
- Line 32: Add explicit tiling-contract checks before program construction to
prevent silent truncation: validate that batch % BATCH_TILE == 0, hidden_size %
<hidden_chunk> == 0, intermediate_size % MLP_OUT_CHUNK == 0 (and any other chunk
constants used at lines 66-68 and 136-138) and raise a clear error (e.g.,
ValueError) if they fail. Locate the constants MLP_OUT_CHUNK and BATCH_TILE and
the places where fused loops/tiles are computed (the blocks referenced at lines
66-68 and 136-138) and insert assertions/guard code that includes the offending
values in the message so callers can correct batch/hidden/intermediate
dimensions before building the program.

📥 Commits

Reviewing files that changed from the base of the PR and between dd9b022 and 6eafa55.

📒 Files selected for processing (2)

examples/intermediate/cross_core.py
examples/models/qwen3/qwen3_32b_decode_scope3.py

coderabbitai · 2026-04-14T06:43:22Z

examples/intermediate/cross_core.py

+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may use this file except in compliance with the License.


⚠️ Potential issue | 🟡 Minor

Fix the inverted license notice.

Line 4 says You may use this file except in compliance with the License., which flips the meaning of the standard header. This should be may not use, matching the other files.

Suggested fix

-# Please refer to the License for details. You may use this file except in compliance with the License.
+# Please refer to the License for details. You may not use this file except in compliance with the License.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/intermediate/cross_core.py` around lines 1 - 4, The license header at the top of the file has inverted wording on line 4: change "You may use this file except in compliance with the License." to "You may not use this file except in compliance with the License." — update the file header comment block in examples/intermediate/cross_core.py so the standard phrase matches the other files and preserves the correct meaning.

coderabbitai · 2026-04-14T06:43:22Z

examples/intermediate/cross_core.py

+    args = parser.parse_args()
+
+    print(f"\n{'='*60}")
+    print(f"Cross Core — Stage 0 & 1 Fusion Test")


⚠️ Potential issue | 🟡 Minor

Remove the unused f prefixes.

Ruff F541 flags both of these strings because they do not interpolate anything.
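For reference, the `f` prefix only matters when the literal contains a replacement field. A small illustration, reusing the banner line and the `PASSED` message quoted in this comment:

```python
# Needs the prefix: the braces hold an expression that is evaluated at runtime.
banner = f"\n{'=' * 60}"

# No replacement fields, so a plain literal produces exactly the same string.
status = "PASSED"

print(banner)
print(status)
```

Dropping the prefix from placeholder-free strings changes nothing at runtime; it only removes the lint finding.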

Suggested fix

-    print(f"Cross Core — Stage 0 & 1 Fusion Test")
+    print("Cross Core — Stage 0 & 1 Fusion Test")
 ...
-    print(f"PASSED")
+    print("PASSED")

Also applies to: 267-267

🧰 Tools

🪛 Ruff (0.15.9)

[error] 237-237: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/intermediate/cross_core.py` at line 237, Remove the unused f-string prefixes on the print calls that don't interpolate values: replace the f-prefixed print with a normal string for the message "Cross Core — Stage 0 & 1 Fusion Test" and do the same for the other print with identical issue (the one around the second occurrence of that message). Update the print statements (the print calls that currently use f"...") to use plain string literals so Ruff F541 is resolved.

coderabbitai · 2026-04-14T06:43:22Z

examples/models/qwen3/qwen3_32b_decode_scope3.py

 K_CHUNK = 128
 Q_OUT_CHUNK = 64
-MLP_OUT_CHUNK = 64
+MLP_OUT_CHUNK = 256


⚠️ Potential issue | 🟠 Major

Validate the tiling contract before building this program.

Line 32 makes MLP_OUT_CHUNK another hard divisibility requirement, but these fused loops still execute only full tiles while the block counts are floor-divided. A batch that's not a multiple of BATCH_TILE, or a hidden_size / intermediate_size that's not divisible by the corresponding chunk size, will silently truncate work or hit invalid partial slices.

Suggested guard

 def build_qwen3_scope3_program(
     batch: int = BATCH,
     hidden_size: int = HIDDEN,
     intermediate_size: int = INTERMEDIATE,
 ):
+    if batch % BATCH_TILE != 0:
+        raise ValueError("batch must be divisible by BATCH_TILE")
+    if hidden_size % K_CHUNK != 0 or hidden_size % Q_OUT_CHUNK != 0:
+        raise ValueError("hidden_size must be divisible by K_CHUNK and Q_OUT_CHUNK")
+    if intermediate_size % MLP_OUT_CHUNK != 0:
+        raise ValueError("intermediate_size must be divisible by MLP_OUT_CHUNK")
+
     BATCH_CFG = batch
     HIDDEN_CFG = hidden_size
     INTER_CFG = intermediate_size

Also applies to: 66-68, 136-138

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/models/qwen3/qwen3_32b_decode_scope3.py` at line 32, Add explicit tiling-contract checks before program construction to prevent silent truncation: validate that batch % BATCH_TILE == 0, hidden_size % <hidden_chunk> == 0, intermediate_size % MLP_OUT_CHUNK == 0 (and any other chunk constants used at lines 66-68 and 136-138) and raise a clear error (e.g., ValueError) if they fail. Locate the constants MLP_OUT_CHUNK and BATCH_TILE and the places where fused loops/tiles are computed (the blocks referenced at lines 66-68 and 136-138) and insert assertions/guard code that includes the offending values in the message so callers can correct batch/hidden/intermediate dimensions before building the program.
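The prompt asks for error messages that include the offending values. A minimal sketch of such a check follows. `check_divisible` and `validate_tiling` are hypothetical helpers, the chunk sizes mirror the constants quoted in this review, and `BATCH_TILE = 16` is an assumed placeholder because its value is not shown here.

```python
K_CHUNK = 128
Q_OUT_CHUNK = 64
MLP_OUT_CHUNK = 256
BATCH_TILE = 16  # assumed placeholder; the real value is not shown in this review


def check_divisible(name: str, value: int, tile_name: str, tile: int) -> None:
    """Raise ValueError naming both operands when value is not a multiple of tile."""
    if value % tile != 0:
        raise ValueError(
            f"{name}={value} must be divisible by {tile_name}={tile} "
            f"(remainder {value % tile})"
        )


def validate_tiling(batch: int, hidden_size: int, intermediate_size: int) -> None:
    """Run every divisibility check before any block count is computed."""
    check_divisible("batch", batch, "BATCH_TILE", BATCH_TILE)
    check_divisible("hidden_size", hidden_size, "K_CHUNK", K_CHUNK)
    check_divisible("hidden_size", hidden_size, "Q_OUT_CHUNK", Q_OUT_CHUNK)
    check_divisible("intermediate_size", intermediate_size, "MLP_OUT_CHUNK", MLP_OUT_CHUNK)


validate_tiling(batch=16, hidden_size=1024, intermediate_size=2048)  # passes silently
try:
    validate_tiling(batch=16, hidden_size=1024, intermediate_size=2112)
except ValueError as err:
    print(err)
```

Reporting the value, the tile size, and the remainder in one message lets a caller see at a glance which dimension to pad or change.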

bumble0918 force-pushed the feature/2026-04-13 branch from dd9b022 to 1129e77 Compare April 14, 2026 02:46

gemini-code-assist bot reviewed Apr 14, 2026

coderabbitai bot reviewed Apr 14, 2026

bumble0918 changed the title ~~Update: fuse Stage 0-1 and Stage 6-7 with chunked_loop_optimizer~~ Update: fuse Stage 0-1 and Stage 6-7 with chunked_loop_optimizer in Scope 3 Apr 14, 2026

bumble0918 force-pushed the feature/2026-04-13 branch from 1129e77 to 6eafa55 Compare April 14, 2026 06:35

coderabbitai bot reviewed Apr 14, 2026
