
pto codegen modified to support qwen3 program #752

Open
mknkfan wants to merge 1 commit into hw-native-sys:main from mknkfan:pypto_changed

Conversation


@mknkfan mknkfan commented Mar 27, 2026

Modify the pypto IR framework to compile qwen3_32b_prefill; please review the changes for any issues. A summary of the changes is in CHANGES_SUMMARY.txt; the modifications were made with Cursor (opus 4.6).


coderabbitai bot commented Mar 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


This PR updates IR repair, tile-layout resolution, and PTO codegen: fixes dangling scf.if yields via SSA-suffix stripping and iter-arg base-name mapping; forces row-major layouts for several tile ops and broadcast row-expands; adjusts IfStmt phi-buffer allocation and branch move emission; generalizes backend layout repair; and adds many example programs and a compile-only runtime path.

Changes

Cohort / File(s) Summary
Loop / Yield Repair
src/ir/transforms/loop_state_repair.cpp
Replace positional iter_arg mapping with base-name matching for IfStmt return vars; introduce pluggable YieldFixFn; include iter_args/loop_var in defined set during dangling-yield fixups.
Backend Layout Repair Pass
src/ir/transforms/resolve_backend_op_layouts_pass.cpp
Generalize repair to fix all non-row-major tiles (not just column vectors); reshape logic now uses column-vector→row-vector shaping only when appropriate; adjust post-repair target reshape conditions.
PTO Codegen / IfStmt Phi Handling
src/codegen/pto/pto_control_flow_codegen.cpp, src/codegen/pto/pto_codegen.cpp
Emit layout_DN for 2D [N,1] (N>1) views; force phi buffers with loc=acc to use vec/row-major TileView and skip emitting pto.tmov when both src/dst are acc.
Tile Ops & Broadcast
src/ir/op/tile_ops/broadcast.cpp, src/backend/common/pto_ops_common.cpp, src/ir/op/tile_ops/*
Force row-major for row-expand outputs; add tile.extract to row-major requirements; introduce kRowExpandOps that constrain only arg0 to row-major; register tile.slice with explicit row-major input/output.
Cross-core / Verifier
src/ir/transforms/utils/cross_core_pipe.cpp, src/ir/verifier/mixed_kernel_expanded_verifier.cpp
Aggregate observed slot sizes across directions and return max (or nullopt when missing); downgrade inconsistent slot-size diagnostics from error to warning and report chosen max.
Python API / IR Small Fixes
python/pypto/runtime/runner.py, python/pypto/ir/op/tensor_ops.py, python/pypto/ir/op/tile_ops.py
Make golden optional (`Callable
Examples Added
examples-lib/... (many files: hello_world.py, matmul.py, deepseek_v3_2/*.py, qwen3/*.py, ...)
Add ~10 example programs for HelloWorld, MatMul, DeepSeek V3.2 (front/prefill/decode variants), and multiple Qwen3 single-layer kernels (decode/prefill/training), each with builders, tensor-spec helpers, and compile/run harnesses.
Misc Internal Utilities
src/ir/transforms/utils/..., src/ir/verifier/..., src/.../mixed_kernel_expanded_verifier.cpp
Various internal behavior adjustments (slot-size aggregation, warning severity, small verifier messaging changes).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

Suggested reviewers

  • Hzfengsy
  • lyfne123

🐰 Through layouts and yields, this rabbit did hop,
I stripped SSA tails and closed each dangling stop,
Row-major rows I gently coaxed to stay,
Phi-buffers now migrate the correct way,
Qwen and DeepSeek prance on—hooray! 🥕✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 7.04%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Title check ⚠️ Warning — The title references 'qwen3 program', but the changeset encompasses far broader modifications across multiple compilation, layout-resolution, and codegen paths beyond qwen3 support, including loop state repair, broadcast layout handling, control flow codegen, and new example programs across multiple models. Resolution: consider a more precise title reflecting the main technical changes, such as 'Fix tile layout handling and control flow codegen for prefill compilation', or specify the scope as 'Support qwen3 prefill: fix layouts, control flow, and yield handling across multiple backend passes'.

✅ Passed checks (1 passed)

  • Description check ✅ Passed — The description explains the intent (modify pypto IR to support qwen3_32b_prefill compilation) and references the detailed change summary, relating directly to the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




Comment @coderabbitai help to get the list of available commands and usage tips.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive set of compilation fixes and feature enhancements for the PyPTO compiler, primarily targeting Qwen3 and DeepSeek V3.2 models. Key improvements include resolving yield type mismatches in control flow, enforcing hardware-required row-major layouts for broadcast operations, and generalizing layout repair logic for non-row-major tiles. Additionally, the PR adds several new library examples for DeepSeek and Qwen3 models and updates the runtime to support compile-only runs. I have no feedback to provide.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/codegen/pto/pto_codegen.cpp (1)

470-489: ⚠️ Potential issue | 🟠 Major

Honor explicit TensorLayout::ND before applying the [N,1] DN heuristic.

Once layout_str is initialized from layout_DN, the ND branch leaves it untouched. That means a tensor view explicitly marked ND on a column-vector shape is still emitted as #pto.layout<dn>, which silently changes its declared layout.

Suggested fix
       if (tensor_type->tensor_view_.has_value()) {
         switch (tensor_type->tensor_view_.value().layout) {
           case ir::TensorLayout::DN:
             layout_str = "dn";
             break;
           case ir::TensorLayout::NZ:
             layout_str = "nz";
             break;
           case ir::TensorLayout::ND:
-            break;
+            layout_str = "nd";
+            break;
         }
       }

Also applies to: 552-563

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/codegen/pto/pto_codegen.cpp` around lines 470 - 489, The DN-heuristic
overrides an explicit TensorLayout::ND because the heuristic runs regardless of
a tensor_view_ explicitly set to ND; update the logic around
tensor_type->tensor_view_ and layout_DN so that if
tensor_type->tensor_view_.has_value() and
tensor_type->tensor_view_.value().layout == ir::TensorLayout::ND you skip the
[N,1] DN heuristic (i.e., do not set layout_DN from the shape check). Change the
guard before the shape-based DN assignment to only run when there is no explicit
tensor_view_ or when the explicit layout is not ND, keeping references to
layout_DN, tensor_type->tensor_view_, ir::TensorLayout::ND, GetConstIntValue and
As<ir::Var> to locate and modify the code.
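The intended precedence — an explicit layout declaration beats the shape-based guess — can be sketched in plain Python. This is an illustrative stand-in, not the actual codegen API; the function name and the "nd" default are assumptions:

```python
def resolve_layout(explicit_layout, shape):
    """Pick a layout string; an explicit declaration beats the shape heuristic."""
    if explicit_layout is not None:
        # An explicitly declared layout (e.g. "nd") must never be
        # silently overridden by the shape-based guess.
        return explicit_layout
    if len(shape) == 2 and shape[1] == 1 and shape[0] > 1:
        # [N, 1] column vectors default to "dn" only when nothing was declared.
        return "dn"
    return "nd"
```

With this ordering, a column-vector view explicitly marked "nd" keeps its declared layout instead of being rewritten to "dn".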
🧹 Nitpick comments (2)
python/pypto/ir/op/tile_ops.py (1)

1727-1728: Keep transpose axes normalized to INDEX.

Like the tensor wrapper, this now trusts any incoming ConstInt dtype. Rebuilding from the current value keeps tile.transpose axis operands consistently typed as INDEX.

💡 Suggested fix
-    axis1_expr = axis1 if isinstance(axis1, ConstInt) else ConstInt(axis1, DataType.INDEX, actual_span)
-    axis2_expr = axis2 if isinstance(axis2, ConstInt) else ConstInt(axis2, DataType.INDEX, actual_span)
+    axis1_expr = ConstInt(axis1.value if isinstance(axis1, ConstInt) else axis1, DataType.INDEX, actual_span)
+    axis2_expr = ConstInt(axis2.value if isinstance(axis2, ConstInt) else axis2, DataType.INDEX, actual_span)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/pypto/ir/op/tile_ops.py` around lines 1727 - 1728, axis1_expr and
axis2_expr currently reuse incoming ConstInt instances which may have non-INDEX
dtypes; always rebuild them as new ConstInt with DataType.INDEX so
tile.transpose axis operands are consistently typed. Replace the conditional
reuse with construction using ConstInt(axis1.value or axis1, DataType.INDEX,
actual_span) (and same for axis2) so that even when axis1/axis2 is already a
ConstInt you recreate it normalized to INDEX; refer to axis1_expr, axis2_expr,
ConstInt and DataType.INDEX to locate and modify the code handling
tile.transpose axes.
python/pypto/ir/op/tensor_ops.py (1)

846-847: Keep transpose axes normalized to INDEX.

Forwarding an existing ConstInt unchanged makes the IR wrapper depend on whatever dtype the caller used. Rebuilding from the current value keeps tensor.transpose axis operands consistently typed as INDEX.

💡 Suggested fix
-    axis1_expr = axis1 if isinstance(axis1, ConstInt) else ConstInt(axis1, DataType.INDEX, actual_span)
-    axis2_expr = axis2 if isinstance(axis2, ConstInt) else ConstInt(axis2, DataType.INDEX, actual_span)
+    axis1_expr = ConstInt(axis1.value if isinstance(axis1, ConstInt) else axis1, DataType.INDEX, actual_span)
+    axis2_expr = ConstInt(axis2.value if isinstance(axis2, ConstInt) else axis2, DataType.INDEX, actual_span)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/pypto/ir/op/tensor_ops.py` around lines 846 - 847, The current code
leaves existing ConstInt operands unchanged, which can preserve a non-INDEX
dtype; always normalize transpose axes to DataType.INDEX by rebuilding
axis1_expr and axis2_expr from the underlying integer value rather than
forwarding the ConstInt instance. Concretely, when creating axis1_expr and
axis2_expr (used by tensor.transpose), extract the integer value from
axis1/axis2 if they are ConstInt (e.g., axis1.value/axis2.value) and pass that
into a new ConstInt(..., DataType.INDEX, actual_span) so both axis operands are
freshly constructed with DataType.INDEX and the correct span.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py`:
- Around line 223-236: The RunConfig instantiation is invalid: replace the
unsupported keyword work_dir with save_kernels=True and
save_kernels_dir=work_dir, and change the invalid enum BackendType.CCE to
BackendType.Ascend910B_CCE; update the call that constructs RunConfig (the
arguments passed to run(program=program, tensor_specs=tensor_specs, ...)) so it
uses save_kernels=True, save_kernels_dir=work_dir and
BackendType.Ascend910B_CCE.

In `@examples-lib/deepseek_v3_2/deepseek_v3_2_decode_front.py`:
- Around line 390-452: The attn_front write inside the pl.parallel head loop
causes concurrent writes of the same [b,0] row; move the pl.assemble(attn_front,
attn_row, [b, 0]) out of the parallel loop so attn_row is fully populated across
all h before a single write. Keep all per-head work (building q_nope/q_rot,
oi/li/mi accumulation, ctx_latent, ctx_v, and pl.assemble(attn_row, ctx_v, [0,
v_col])) inside the pl.parallel over h, but perform the final
pl.assemble(attn_front, attn_row, [b, 0]) only once after the pl.parallel on
NUM_HEADS_CFG completes.

In `@examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_back.py`:
- Around line 228-241: The RunConfig being constructed incorrectly includes a
nonexistent parameter work_dir; remove work_dir from the RunConfig(...) call and
instead enable kernel artifact saving by adding save_kernels=True and
save_kernels_dir=work_dir to the RunConfig passed to run(), ensuring run(...)
will save generated artefacts to the intended work_dir; update the RunConfig
construction referenced in the run(...) invocation accordingly.

In `@examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_front.py`:
- Around line 548-561: The RunConfig call currently passes an unsupported field
work_dir which causes an immediate failure; replace that parameter by setting
save_kernels=True and save_kernels_dir=work_dir within the RunConfig constructor
(keeping platform, device_id, rtol, atol, strategy, dump_passes, backend_type
unchanged) so the RunConfig creation succeeds; update the RunConfig invocation
where RunConfig(...) is used to include save_kernels and save_kernels_dir
instead of work_dir.

In `@examples-lib/qwen3/qwen3_32b_decode.py`:
- Around line 411-454: The RunConfig defaults are inconsistent: the function
parameter platform (default "a2a3") does not match the hard-coded
BackendType.Ascend950 used in RunConfig, causing compilation for a different
target; update the call that constructs RunConfig inside this function so
backend_type is chosen to match platform (e.g., select
BackendType.Ascend910B_PTO when platform == "a2a3", or derive backend_type from
the platform variable), or change the platform default to match
BackendType.Ascend950 — adjust the code around the RunConfig(...) call and
ensure BackendType.Ascend950 and BackendType.Ascend910B_PTO are the referenced
symbols used for the mapping.

In `@examples-lib/qwen3/qwen3_32b_prefill.py`:
- Around line 259-269: The slice of scores should use a statically-shaped extent
and an explicit valid_shape to keep compiler lowering consistent: replace the
dynamic extent pl.slice(scores, [1, valid_len], [0, 0]) with a static extent [1,
SEQ_TILE] and pass valid_shape=[1, valid_len] (i.e., call pl.slice(scores, [1,
SEQ_TILE], [0, 0], valid_shape=[1, valid_len])) so scores_valid remains [1,
SEQ_TILE] while retaining the actual valid length in valid_shape; update the
pl.slice invocation that creates scores_valid accordingly (refer to
scores_valid, pl.slice, SEQ_TILE, valid_len).
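The static-shape-plus-valid-extent idea can be sketched in plain Python. `TileView` and `slice_scores` here are illustrative stand-ins, not the compiler's actual types:

```python
from dataclasses import dataclass

@dataclass
class TileView:
    shape: tuple        # static physical extent, fixed at compile time
    valid_shape: tuple  # dynamic logical extent, elementwise <= shape

def slice_scores(seq_tile, valid_len):
    # The physical tile stays [1, seq_tile]; only valid_shape is dynamic,
    # so the lowering always sees a statically-shaped buffer.
    return TileView(shape=(1, seq_tile), valid_shape=(1, valid_len))
```

Downstream ops operate on the fixed tile and consult `valid_shape` for correctness.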

In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py`:
- Around line 964-973: The RunConfig call is passing a non-existent argument
work_dir which raises a TypeError; update the RunConfig instantiation in this
file (the RunConfig(...) block) to remove work_dir or replace it with the
runner-supported parameter save_kernels_dir (e.g., save_kernels_dir=work_dir) so
the output location is controlled correctly; ensure only valid RunConfig
keywords remain (rtol, atol, strategy, dump_passes, backend_type, device_id,
platform, etc.) and run a quick compile to verify no TypeError occurs.
- Around line 59-72: In build_qwen3_32b_training_forward_backward_program
validate or handle non-multiple dimensions: check that hidden_size and
intermediate_size are multiples of K_CHUNK and MLP_OUT_CHUNK (and that
hidden_size aligns with Q_OUT_CHUNK where relevant) or compute and handle tail
sizes instead of using ceil-div-only block counts (HIDDEN_BLOCKS, Q_OUT_BLOCKS,
MLP_OUT_BLOCKS); also compute tok_blocks = (MAX_SEQ_CFG + TOK_TILE - 1) //
TOK_TILE or explicitly check MAX_SEQ_CFG % TOK_TILE == 0 and reject/raise if not
supported. Update kernel invocation/slicing logic to guard last-block ranges
using computed remainders or bail early with a clear error referencing
HIDDEN_CFG, INTER_CFG, K_CHUNK, Q_OUT_CHUNK, MLP_OUT_CHUNK, and TOK_TILE.

In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward.py`:
- Around line 129-132: The loop computing tok_blocks uses floor division
(tok_blocks = MAX_SEQ_CFG // TOK_TILE) which drops tail tokens when MAX_SEQ_CFG
% TOK_TILE != 0; fix by computing tok_blocks = (MAX_SEQ_CFG + TOK_TILE - 1) //
TOK_TILE (round up) and modify the nested loop inside pl.parallel (the for
p0_idx in pl.range(tok_blocks) body) to handle the final partial tile using the
existing valid_shape/path (i.e., compute p0 = p0_idx * TOK_TILE and for the last
block clamp the tile length with min(TOK_TILE, MAX_SEQ_CFG - p0) and pass that
as valid_shape to the downstream logic), or alternatively add an upfront
validation to reject non-divisible MAX_SEQ_CFG by raising an error if
MAX_SEQ_CFG % TOK_TILE != 0.
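The round-up arithmetic and tail-tile clamping described above can be sketched in plain Python; the helper names are illustrative, not part of the pypto API:

```python
def tile_blocks(total, tile):
    # Round up so a trailing partial tile gets its own block.
    return (total + tile - 1) // tile

def tile_extents(total, tile):
    # Yield (offset, valid_len) per block; only the last block may be partial,
    # and its valid_len is what would be passed as valid_shape downstream.
    for i in range(tile_blocks(total, tile)):
        off = i * tile
        yield off, min(tile, total - off)
```

With `total = 10` and `tile = 4` this produces three blocks, the last covering only 2 tokens, instead of silently dropping the tail.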
- Around line 1002-1023: The computed work_dir variable is never passed into the
runner, so update the RunConfig passed to run(...) to set
save_kernels_dir=work_dir (or the equivalent RunConfig field used to control
kernel output). Locate the RunConfig construction in the call to
run(program=program, ... ) and add save_kernels_dir=work_dir to that RunConfig
so generated kernels are written to the printed work_dir instead of a
timestamped temp directory.

In `@examples-lib/qwen3/qwen3-32b.py`:
- Around line 241-247: The pl.slice call makes the tile width dynamic; instead,
keep the physical tile size [1, SEQ_TILE] and pass valid_len via valid_shape:
replace the pl.slice(scores, [1, valid_len], ...) that produces scores_valid
with a slice that always uses [1, SEQ_TILE] (so scores_valid is physically
shaped [1, SEQ_TILE]) and set its valid_shape to [1, valid_len] (mirroring
examples-lib/qwen3/qwen3_32b_decode.py); ensure downstream uses (cur_mi,
exp_scores, cur_li, exp_pad, etc.) operate on that fixed tile and rely on
valid_shape for correctness.

In `@python/pypto/runtime/runner.py`:
- Around line 233-235: The guard currently only checks `golden is not None`, so
`codegen_only` in `RunConfig` is ignored; modify the conditional before calling
`_execute_on_device` to also require `not config.codegen_only`. In other words,
locate the block where `_execute_on_device(work_dir, golden_path,
config.platform, config.device_id)` is invoked and change the condition to
ensure both `golden is not None` and `config.codegen_only` is false (e.g., `if
golden is not None and not config.codegen_only:`) so callers with
`codegen_only=True` will skip the CodeRunner execution.
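The corrected guard condition can be isolated as a one-line predicate (a hedged sketch; the real runner checks these fields inline rather than through a helper like this):

```python
def should_execute_on_device(golden, codegen_only):
    # Device execution requires a golden reference AND must not have
    # been suppressed by a codegen-only run configuration.
    return golden is not None and not codegen_only
```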

In `@src/ir/transforms/resolve_backend_op_layouts_pass.cpp`:
- Around line 94-96: NeedsLayoutRepair currently rewrites every non-row-major
input (it returns true when GetTileLayout(tile_type) != TileLayout::row_major)
but the assign/rebind path only restores layout for the special [N,1] col-major
case, causing ordinary 2D col_major results to be materialized as row-major and
rebound to op->var_ without restoring their declared layout; update the
assign/restore logic that follows NeedsLayoutRepair so that after generalized
layout repair you detect the original tile layout (via GetTileLayout on the
operand/TileType) and, when it was col_major (or any non-row-major), reshape or
rematerialize the result back to that declared layout before assigning to
op->var_ (i.e., generalize the existing [N,1] reshape code to handle 2D
col_major and other non-row-major layouts and ensure any path that sets op->var_
preserves the original TileLayout).

In `@src/ir/transforms/utils/cross_core_pipe.cpp`:
- Around line 130-143: The code returns a max slot size even when a mix of
static and dynamic tile sizes exists because observed_slot_sizes.empty() only
detects when all ops are dynamic; to fix this add an explicit boolean flag (e.g.
unknown_slot_size_seen) to PipeDirectionMetadata, set that flag in
RecordTileSlotSize() whenever a dynamic/unknown tile size is observed, and
update this function's loop over metadata.c2v / metadata.v2c to bail out and
return std::nullopt if any direction has unknown_slot_size_seen (in addition to
the existing empty check), otherwise compute max_slot_size as before.
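The suggested aggregation logic can be sketched in plain Python. The function and flag names are illustrative stand-ins for the C++ metadata described above:

```python
def resolve_slot_size(observed_slot_sizes, unknown_slot_size_seen):
    # Bail out when any dynamic/unknown tile size was observed, or when
    # nothing was recorded at all; otherwise take the max across directions.
    if unknown_slot_size_seen or not observed_slot_sizes:
        return None
    return max(observed_slot_sizes)
```

This distinguishes "all sizes dynamic" (empty list) from "mixed static and dynamic" (flag set), which the empty-check alone cannot.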

In `@src/ir/transforms/utils/loop_state_repair.cpp`:
- Around line 114-117: The current ExtractBaseName(const std::string& name_hint)
uses string-prefix matching and can map distinct loop-carried values
incorrectly; replace the name-hint based collapse with an identity-based mapping
derived from the loop node itself (use the loop's return_vars_/iter_args_
correspondence or the loop AST/IR node that pairs yields to iter args) so keys
are based on the actual iter_arg identity rather than the stem of name_hint;
update any code that calls ExtractBaseName (including the other occurrences
around the 151-164 region) to look up the correct iter_arg via the loop's
return_vars_/iter_args_ mapping and only fall back to name_hint as a last
resort.
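The collision this comment warns about is easy to reproduce with a suffix-stripping name mapper (a hedged sketch; the real ExtractBaseName may differ in detail):

```python
import re

def extract_base_name(name_hint):
    # Strip a trailing SSA-style numeric suffix: "acc_3" -> "acc".
    return re.sub(r"_\d+$", "", name_hint)
```

Two genuinely distinct loop-carried values named `acc` and `acc_2` collapse to the same key under this scheme, which is why keying on the loop's actual iter_arg identity is safer than keying on the name stem.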

---

Outside diff comments:
In `@src/codegen/pto/pto_codegen.cpp`:
- Around line 470-489: The DN-heuristic overrides an explicit TensorLayout::ND
because the heuristic runs regardless of a tensor_view_ explicitly set to ND;
update the logic around tensor_type->tensor_view_ and layout_DN so that if
tensor_type->tensor_view_.has_value() and
tensor_type->tensor_view_.value().layout == ir::TensorLayout::ND you skip the
[N,1] DN heuristic (i.e., do not set layout_DN from the shape check). Change the
guard before the shape-based DN assignment to only run when there is no explicit
tensor_view_ or when the explicit layout is not ND, keeping references to
layout_DN, tensor_type->tensor_view_, ir::TensorLayout::ND, GetConstIntValue and
As<ir::Var> to locate and modify the code.

---

Nitpick comments:
In `@python/pypto/ir/op/tensor_ops.py`:
- Around line 846-847: The current code leaves existing ConstInt operands
unchanged, which can preserve a non-INDEX dtype; always normalize transpose axes
to DataType.INDEX by rebuilding axis1_expr and axis2_expr from the underlying
integer value rather than forwarding the ConstInt instance. Concretely, when
creating axis1_expr and axis2_expr (used by tensor.transpose), extract the
integer value from axis1/axis2 if they are ConstInt (e.g.,
axis1.value/axis2.value) and pass that into a new ConstInt(..., DataType.INDEX,
actual_span) so both axis operands are freshly constructed with DataType.INDEX
and the correct span.

In `@python/pypto/ir/op/tile_ops.py`:
- Around line 1727-1728: axis1_expr and axis2_expr currently reuse incoming
ConstInt instances which may have non-INDEX dtypes; always rebuild them as new
ConstInt with DataType.INDEX so tile.transpose axis operands are consistently
typed. Replace the conditional reuse with construction using
ConstInt(axis1.value or axis1, DataType.INDEX, actual_span) (and same for axis2)
so that even when axis1/axis2 is already a ConstInt you recreate it normalized
to INDEX; refer to axis1_expr, axis2_expr, ConstInt and DataType.INDEX to locate
and modify the code handling tile.transpose axes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 96070b8a-541a-4bcb-b0bc-ab923c8066f9

📥 Commits

Reviewing files that changed from the base of the PR and between d19e7f4 and b7b2cc1.

📒 Files selected for processing (24)
  • CHANGES_SUMMARY.txt
  • examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py
  • examples-lib/deepseek_v3_2/deepseek_v3_2_decode_front.py
  • examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_back.py
  • examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_front.py
  • examples-lib/hello_world.py
  • examples-lib/matmul.py
  • examples-lib/qwen3/qwen3-32b.py
  • examples-lib/qwen3/qwen3_32b_decode.py
  • examples-lib/qwen3/qwen3_32b_prefill.py
  • examples-lib/qwen3/qwen3_32b_prefill_src.py
  • examples-lib/qwen3/qwen3_32b_training_forward_and_backward.py
  • examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py
  • python/pypto/ir/op/tensor_ops.py
  • python/pypto/ir/op/tile_ops.py
  • python/pypto/runtime/runner.py
  • src/backend/common/pto_ops_common.cpp
  • src/codegen/pto/pto_codegen.cpp
  • src/codegen/pto/pto_control_flow_codegen.cpp
  • src/ir/op/tile_ops/broadcast.cpp
  • src/ir/transforms/resolve_backend_op_layouts_pass.cpp
  • src/ir/transforms/utils/cross_core_pipe.cpp
  • src/ir/transforms/utils/loop_state_repair.cpp
  • src/ir/verifier/mixed_kernel_expanded_verifier.cpp

Comment on lines +223 to +236
result = run(
    program=program,
    tensor_specs=tensor_specs,
    golden=None,
    config=RunConfig(
        platform=platform,
        device_id=device_id,
        rtol=2e-2,
        atol=2e-2,
        strategy=OptimizationStrategy.Default,
        dump_passes=dump_passes,
        backend_type=BackendType.CCE,
        work_dir=work_dir,
    ),


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "*.py" | xargs grep -l "class RunConfig" | head -5

Repository: hw-native-sys/pypto

Length of output: 95


🏁 Script executed:

cat -n ./python/pypto/runtime/runner.py | head -100

Repository: hw-native-sys/pypto

Length of output: 5014


🏁 Script executed:

cat -n examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py | sed -n '223,236p'

Repository: hw-native-sys/pypto

Length of output: 567


🏁 Script executed:

cat -n ./python/pypto/runtime/runner.py | sed -n '56,90p'

Repository: hw-native-sys/pypto

Length of output: 1994


🏁 Script executed:

# Verify BackendType.CCE exists (from the code snippet)
grep -r "BackendType\|class BackendType" --include="*.py" | grep -E "(CCE|BackendType)" | head -10

Repository: hw-native-sys/pypto

Length of output: 1016


🏁 Script executed:

find . -type f -name "*.py" | xargs grep -l "class BackendType\|enum BackendType" | head -3

Repository: hw-native-sys/pypto

Length of output: 45


🏁 Script executed:

grep -r "^class BackendType\|^BackendType.*=" --include="*.py" | head -10

Repository: hw-native-sys/pypto

Length of output: 45


🏁 Script executed:

find . -type f -name "*.py" -path "*/backend*" | head -20

Repository: hw-native-sys/pypto

Length of output: 280


🏁 Script executed:

cat -n ./python/pypto/backend/__init__.py

Repository: hw-native-sys/pypto

Length of output: 1825


🏁 Script executed:

grep -r "BackendType\|CCE" ./python/pypto/pypto_core/backend.py | head -20

Repository: hw-native-sys/pypto

Length of output: 132


🏁 Script executed:

find . -type f -name "backend.py" | head -10

Repository: hw-native-sys/pypto

Length of output: 45


🏁 Script executed:

find . -type f -path "*pypto_core*" -name "*.py" | head -20

Repository: hw-native-sys/pypto

Length of output: 45


🏁 Script executed:

grep -r "class BackendType" . --include="*.py" --include="*.pyi" 2>/dev/null | head -10

Repository: hw-native-sys/pypto

Length of output: 119


🏁 Script executed:

cat -n ./python/pypto/pypto_core/backend.pyi | grep -A 20 "class BackendType"

Repository: hw-native-sys/pypto

Length of output: 714


🏁 Script executed:

grep -r "BackendType\.CCE" . --include="*.py" | head -10

Repository: hw-native-sys/pypto

Length of output: 475


🏁 Script executed:

# Check if BackendType.CCE is actually valid or if it's a typo
grep -B 5 -A 5 "BackendType\.CCE\|BackendType\.Ascend910B_CCE" ./examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py | head -20

Repository: hw-native-sys/pypto

Length of output: 478


RunConfig doesn't accept work_dir, and BackendType.CCE is not a valid enum value.

This config construction will raise at instantiation. Replace work_dir=work_dir with save_kernels=True and save_kernels_dir=work_dir. Additionally, BackendType.CCE is not a valid backend type; use BackendType.Ascend910B_CCE instead.

Suggested fix
     result = run(
         program=program,
         tensor_specs=tensor_specs,
         golden=None,
         config=RunConfig(
             platform=platform,
             device_id=device_id,
             rtol=2e-2,
             atol=2e-2,
             strategy=OptimizationStrategy.Default,
             dump_passes=dump_passes,
-            backend_type=BackendType.CCE,
-            work_dir=work_dir,
+            backend_type=BackendType.Ascend910B_CCE,
+            save_kernels=True,
+            save_kernels_dir=work_dir,
         ),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py` around lines 223 -
236, The RunConfig instantiation is invalid: replace the unsupported keyword
work_dir with save_kernels=True and save_kernels_dir=work_dir, and change the
invalid enum BackendType.CCE to BackendType.Ascend910B_CCE; update the call that
constructs RunConfig (the arguments passed to run(program=program,
tensor_specs=tensor_specs, ...)) so it uses save_kernels=True,
save_kernels_dir=work_dir and BackendType.Ascend910B_CCE.

Comment on lines +390 to +452
for h in pl.parallel(0, NUM_HEADS_CFG, 1, chunk=8):
    q_col = h * QK_HEAD_DIM_CFG
    q_nope = pl.cast(
        pl.slice(q_proj, [1, QK_NOPE_HEAD_DIM_CFG], [b, q_col]),
        target_type=pl.FP32,
    )
    q_pe = pl.cast(
        pl.slice(q_proj, [1, QK_ROPE_HEAD_DIM_CFG], [b, q_col + QK_NOPE_HEAD_DIM_CFG]),
        target_type=pl.FP32,
    )
    q_lo = pl.slice(q_pe, [1, QK_ROPE_HEAD_DIM_CFG // 2], [0, 0])
    q_hi = pl.slice(q_pe, [1, QK_ROPE_HEAD_DIM_CFG // 2], [0, QK_ROPE_HEAD_DIM_CFG // 2])
    q_rot = pl.create_tensor([1, QK_ROPE_HEAD_DIM_CFG], dtype=pl.FP32)
    q_rot = pl.assemble(q_rot, pl.sub(pl.col_expand_mul(q_lo, cos_lo), pl.col_expand_mul(q_hi, sin_lo)), [0, 0])
    q_rot = pl.assemble(q_rot, pl.add(pl.col_expand_mul(q_hi, cos_hi), pl.col_expand_mul(q_lo, sin_hi)), [0, QK_ROPE_HEAD_DIM_CFG // 2])
    q_nope_latent = pl.matmul(
        pl.cast(q_nope, target_type=pl.BF16),
        pl.reshape(pl.slice(w_q_nope_to_latent, [1, QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [h, 0, 0]), [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG]),
    )

    oi = pl.create_tensor([1, KV_LORA_RANK_CFG], dtype=pl.FP32)
    li = pl.create_tensor([1, 1], dtype=pl.FP32)
    mi = pl.create_tensor([1, 1], dtype=pl.FP32)
    oi = pl.mul(oi, 0.0)
    li = pl.mul(li, 0.0)
    mi = pl.mul(mi, 0.0)

    sparse_k = pl.min(INDEX_TOPK_CFG, ctx_len)
    for kk in pl.range(sparse_k):
        s = pl.tensor.read(topk_idx, [0, kk])
        if s >= 0:
            cache_s = b * MAX_SEQ_CFG + s
            kv_s = pl.cast(pl.slice(kv_cache, [1, KV_LORA_RANK_CFG], [cache_s, 0]), target_type=pl.FP32)
            pe_s = pl.cast(pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM_CFG], [cache_s, 0]), target_type=pl.FP32)
            score_nope = pl.row_sum(pl.mul(q_nope_latent, kv_s))
            score_pe = pl.row_sum(pl.mul(q_rot, pe_s))
            score = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
            cur_mi = score
            cur_li = pl.exp(pl.sub(score, cur_mi))
            oi_tmp = pl.row_expand_mul(kv_s, cur_li)
            if kk == 0:
                oi = oi_tmp
                li = cur_li
                mi = cur_mi
            else:
                mi_new = pl.maximum(mi, cur_mi)
                alpha = pl.exp(pl.sub(mi, mi_new))
                beta = pl.exp(pl.sub(cur_mi, mi_new))
                li = pl.add(pl.mul(alpha, li), pl.mul(beta, cur_li))
                oi = pl.add(pl.row_expand_mul(oi, alpha), pl.row_expand_mul(oi_tmp, beta))
                mi = mi_new
    ctx_latent = pl.row_expand_div(oi, li)
    v_col = h * V_HEAD_DIM_CFG
    ctx_v = pl.create_tensor([1, V_HEAD_DIM_CFG], dtype=pl.FP32)
    ctx_v = pl.mul(ctx_v, 0.0)
    for vb in pl.range(V_OUT_BLOCKS):
        v0 = vb * V_OUT_CHUNK
        wv_tile = pl.slice(w_latent_to_v, [1, KV_LORA_RANK_CFG, V_OUT_CHUNK], [h, 0, v0])
        wv_tile = pl.reshape(wv_tile, [KV_LORA_RANK_CFG, V_OUT_CHUNK])
        v_part = pl.matmul(pl.cast(ctx_latent, target_type=pl.BF16), wv_tile, out_dtype=pl.FP32)
        ctx_v = pl.assemble(ctx_v, v_part, [0, v0])
    attn_row = pl.assemble(attn_row, ctx_v, [0, v_col])
    attn_front = pl.assemble(attn_front, attn_row, [b, 0])

⚠️ Potential issue | 🔴 Critical

Move the attn_front write out of the head-parallel loop.

Every h iteration writes the same [b, 0] row while pl.parallel is active. That makes attn_front a shared write target fed by partially populated attn_row; the full row should only be assembled once the head loop is complete.

Suggested change
                     for h in pl.parallel(0, NUM_HEADS_CFG, 1, chunk=8):
                         q_col = h * QK_HEAD_DIM_CFG
                         q_nope = pl.cast(
                             pl.slice(q_proj, [1, QK_NOPE_HEAD_DIM_CFG], [b, q_col]),
                             target_type=pl.FP32,
@@
                             v_part = pl.matmul(pl.cast(ctx_latent, target_type=pl.BF16), wv_tile, out_dtype=pl.FP32)
                             ctx_v = pl.assemble(ctx_v, v_part, [0, v0])
                         attn_row = pl.assemble(attn_row, ctx_v, [0, v_col])
-                        attn_front = pl.assemble(attn_front, attn_row, [b, 0])
+                    attn_front = pl.assemble(attn_front, attn_row, [b, 0])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples-lib/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 390 -
452, The attn_front write inside the pl.parallel head loop causes concurrent
writes of the same [b,0] row; move the pl.assemble(attn_front, attn_row, [b, 0])
out of the parallel loop so attn_row is fully populated across all h before a
single write. Keep all per-head work (building q_nope/q_rot, oi/li/mi
accumulation, ctx_latent, ctx_v, and pl.assemble(attn_row, ctx_v, [0, v_col]))
inside the pl.parallel over h, but perform the final pl.assemble(attn_front,
attn_row, [b, 0]) only once after the pl.parallel on NUM_HEADS_CFG completes.
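
The hazard is easy to reproduce outside the DSL. The plain-Python model below is hypothetical (none of it is pypto API; names merely mirror the snippet above): each parallel worker starts from its own `attn_row` copy, so a whole-row write inside the head loop publishes a partially filled row, while a single write after the loop sees the fully assembled row.

```python
NUM_HEADS, V_DIM = 4, 2
ROW = NUM_HEADS * V_DIM

def head_chunk(h):
    # Stand-in for the per-head ctx_v result.
    return [float(h + 1)] * V_DIM

def buggy():
    # Whole-row write inside the loop: each iteration's fresh attn_row
    # overwrites attn_front with a row holding only its own chunk.
    attn_front = [0.0] * ROW
    for h in range(NUM_HEADS):
        attn_row = [0.0] * ROW
        attn_row[h * V_DIM:(h + 1) * V_DIM] = head_chunk(h)
        attn_front = attn_row[:]   # shared write inside the h loop
    return attn_front

def fixed():
    # Accumulate all head chunks first, write the row once afterwards.
    attn_row = [0.0] * ROW
    for h in range(NUM_HEADS):
        attn_row[h * V_DIM:(h + 1) * V_DIM] = head_chunk(h)
    return attn_row[:]             # single write after the loop

print(buggy())  # only the last head's chunk survives
print(fixed())  # all four chunks present
```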

Comment on lines +228 to +241
    result = run(
        program=program,
        tensor_specs=tensor_specs,
        golden=None,
        config=RunConfig(
            platform=platform,
            device_id=device_id,
            rtol=2e-2,
            atol=2e-2,
            strategy=OptimizationStrategy.Default,
            dump_passes=dump_passes,
            backend_type=BackendType.CCE,
            work_dir=work_dir,
        ),

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "*.py" | xargs grep -l "class RunConfig" | head -5

Repository: hw-native-sys/pypto



🏁 Script executed:

find . -type f -path "*/runner.py" | head -10



🏁 Script executed:

find . -type f -name "*runner*" -type f | grep -E "\.(py|pyx)$" | head -10



🏁 Script executed:

cat -n ./python/pypto/runtime/runner.py | head -200



🏁 Script executed:

cat -n ./examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_back.py | sed -n '220,250p'



🏁 Script executed:

grep -n "work_dir" ./python/pypto/runtime/runner.py | head -20



🏁 Script executed:

cat -n ./python/pypto/runtime/runner.py | sed -n '200,230p'



work_dir is not a valid RunConfig parameter.

RunConfig does not expose work_dir. Forward the path through save_kernels=True and save_kernels_dir=work_dir instead, which the run() function uses internally to determine where to save generated artefacts.

Suggested fix
     result = run(
         program=program,
         tensor_specs=tensor_specs,
         golden=None,
         config=RunConfig(
             platform=platform,
             device_id=device_id,
             rtol=2e-2,
             atol=2e-2,
             strategy=OptimizationStrategy.Default,
             dump_passes=dump_passes,
             backend_type=BackendType.CCE,
-            work_dir=work_dir,
+            save_kernels=True,
+            save_kernels_dir=work_dir,
         ),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_back.py` around lines 228 -
241, The RunConfig being constructed incorrectly includes a nonexistent
parameter work_dir; remove work_dir from the RunConfig(...) call and instead
enable kernel artifact saving by adding save_kernels=True and
save_kernels_dir=work_dir to the RunConfig passed to run(), ensuring run(...)
will save generated artefacts to the intended work_dir; update the RunConfig
construction referenced in the run(...) invocation accordingly.

Comment on lines +548 to +561
    result = run(
        program=program,
        tensor_specs=tensor_specs,
        golden=None,
        config=RunConfig(
            platform=platform,
            device_id=device_id,
            rtol=2e-2,
            atol=2e-2,
            strategy=OptimizationStrategy.Default,
            dump_passes=dump_passes,
            backend_type=BackendType.CCE,
            work_dir=work_dir,
        ),

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "runner.py" -type f | head -20



🏁 Script executed:

rg "class RunConfig" -A 50



This RunConfig(...) call will fail immediately.

work_dir is not a supported RunConfig field. Use save_kernels=True plus save_kernels_dir=work_dir instead.

Suggested fix
     result = run(
         program=program,
         tensor_specs=tensor_specs,
         golden=None,
         config=RunConfig(
             platform=platform,
             device_id=device_id,
             rtol=2e-2,
             atol=2e-2,
             strategy=OptimizationStrategy.Default,
             dump_passes=dump_passes,
             backend_type=BackendType.CCE,
-            work_dir=work_dir,
+            save_kernels=True,
+            save_kernels_dir=work_dir,
         ),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_front.py` around lines 548 -
561, The RunConfig call currently passes an unsupported field work_dir which
causes an immediate failure; replace that parameter by setting save_kernels=True
and save_kernels_dir=work_dir within the RunConfig constructor (keeping
platform, device_id, rtol, atol, strategy, dump_passes, backend_type unchanged)
so the RunConfig creation succeeds; update the RunConfig invocation where
RunConfig(...) is used to include save_kernels and save_kernels_dir instead of
work_dir.

Comment on lines +411 to +454
    platform: str = "a2a3",
    device_id: int = 11,
    work_dir: str | None = None,
    dump_passes: bool = True,
):
    from pypto.backend import BackendType
    from pypto.ir.pass_manager import OptimizationStrategy
    from pypto.runtime import RunConfig, run

    program = build_qwen3_single_layer_decode_program(
        batch=batch,
        max_seq_len=max_seq_len,
        hidden_size=hidden_size,
        num_heads=num_heads,
        num_kv_heads=num_kv_heads,
        head_dim=head_dim,
        intermediate_size=intermediate_size,
    )

    print(program)

    tensor_specs = build_tensor_specs(
        batch=batch,
        max_seq_len=max_seq_len,
        hidden_size=hidden_size,
        num_heads=num_heads,
        num_kv_heads=num_kv_heads,
        head_dim=head_dim,
        intermediate_size=intermediate_size,
    )

    result = run(
        program=program,
        tensor_specs=tensor_specs,
        golden=None,
        config=RunConfig(
            platform=platform,
            device_id=device_id,
            rtol=2e-2,
            atol=2e-2,
            strategy=OptimizationStrategy.Default,
            dump_passes=dump_passes,
            backend_type=BackendType.Ascend950,
        ),

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "qwen3_32b_decode.py compile_and_run defaults:"
sed -n '403,454p' examples-lib/qwen3/qwen3_32b_decode.py

echo
echo "PTO backend -> target_arch mapping:"
sed -n '165,177p' src/codegen/pto/pto_codegen.cpp



The default RunConfig targets two different architectures.

This helper defaults platform="a2a3" but hard-codes BackendType.Ascend950. Since src/codegen/pto/pto_codegen.cpp maps Ascend950 to target_arch = "a5" and Ascend910B_PTO to "a2a3", the program is compiled for a different target than what the runner will attempt to execute on.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples-lib/qwen3/qwen3_32b_decode.py` around lines 411 - 454, The RunConfig
defaults are inconsistent: the function parameter platform (default "a2a3") does
not match the hard-coded BackendType.Ascend950 used in RunConfig, causing
compilation for a different target; update the call that constructs RunConfig
inside this function so backend_type is chosen to match platform (e.g., select
BackendType.Ascend910B_PTO when platform == "a2a3", or derive backend_type from
the platform variable), or change the platform default to match
BackendType.Ascend950 — adjust the code around the RunConfig(...) call and
ensure BackendType.Ascend950 and BackendType.Ascend910B_PTO are the referenced
symbols used for the mapping.
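
One low-risk way to keep the two targets in sync is to derive the backend from the platform string. The sketch below is hypothetical: the helper and dictionary are not pypto API, and the enum names are taken only from the pto_codegen.cpp mapping quoted in this finding (Ascend950 → "a5", Ascend910B_PTO → "a2a3").

```python
# Hypothetical helper; mapping mirrors src/codegen/pto/pto_codegen.cpp
# as quoted above, represented here as plain strings rather than the
# real BackendType enum.
PLATFORM_TO_BACKEND = {
    "a2a3": "Ascend910B_PTO",
    "a5": "Ascend950",
}

def backend_for_platform(platform: str) -> str:
    # Fail loudly instead of silently compiling for the wrong target.
    try:
        return PLATFORM_TO_BACKEND[platform]
    except KeyError:
        raise ValueError(f"no PTO backend registered for platform {platform!r}")
```

With this, `compile_and_run(platform="a2a3")` would pick `Ascend910B_PTO` automatically instead of the hard-coded `Ascend950`.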

Comment on lines +241 to +247
scores_valid = pl.slice(scores, [1, valid_len], [0, 0])
cur_mi = pl.cast(pl.row_max(scores_valid), target_type=pl.FP32)
exp_scores = pl.exp(pl.row_expand_sub(scores_valid, cur_mi))
cur_li = pl.cast(pl.row_sum(exp_scores), target_type=pl.FP32)
exp_pad = pl.create_tensor([1, SEQ_TILE], dtype=pl.FP32)
exp_pad = pl.mul(exp_pad, 0.0)
exp_pad = pl.assemble(exp_pad, exp_scores, [0, 0])

⚠️ Potential issue | 🟠 Major

Keep the score tail slice physically [1, SEQ_TILE].

pl.slice(scores, [1, valid_len], ...) makes the tile width itself dynamic. PTO lowering has separate valid_shape plumbing for tail tiles; it is much safer to keep the physical score tile [1, SEQ_TILE] and carry valid_len via valid_shape, as examples-lib/qwen3/qwen3_32b_decode.py already does.

Suggested fix
-                            scores_valid = pl.slice(scores, [1, valid_len], [0, 0])
-                            cur_mi = pl.cast(pl.row_max(scores_valid), target_type=pl.FP32)
-                            exp_scores = pl.exp(pl.row_expand_sub(scores_valid, cur_mi))
+                            scores_valid = pl.slice(
+                                scores,
+                                [1, SEQ_TILE],
+                                [0, 0],
+                                valid_shape=[1, valid_len],
+                            )
+                            scores_padded = pl.fillpad(scores_valid, pad_value=pl.PadValue.min)
+                            cur_mi = pl.cast(pl.row_max(scores_padded), target_type=pl.FP32)
+                            exp_scores = pl.exp(pl.row_expand_sub(scores_padded, cur_mi))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples-lib/qwen3/qwen3-32b.py` around lines 241 - 247, The pl.slice call
makes the tile width dynamic; instead, keep the physical tile size [1, SEQ_TILE]
and pass valid_len via valid_shape: replace the pl.slice(scores, [1, valid_len],
...) that produces scores_valid with a slice that always uses [1, SEQ_TILE] (so
scores_valid is physically shaped [1, SEQ_TILE]) and set its valid_shape to [1,
valid_len] (mirroring examples-lib/qwen3/qwen3_32b_decode.py); ensure downstream
uses (cur_mi, exp_scores, cur_li, exp_pad, etc.) operate on that fixed tile and
rely on valid_shape for correctness.

Comment on lines +233 to +235
# 4. Execute via Simpler's CodeRunner
if golden is not None:
    _execute_on_device(work_dir, golden_path, config.platform, config.device_id)

⚠️ Potential issue | 🟠 Major

codegen_only still doesn't stop execution.

RunConfig documents codegen_only=True as compile-only, but this guard only checks golden is not None. A caller that supplies a golden and sets codegen_only=True will still invoke CodeRunner.

💡 Suggested fix
-        if golden is not None:
+        if golden is not None and not config.codegen_only:
             _execute_on_device(work_dir, golden_path, config.platform, config.device_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/pypto/runtime/runner.py` around lines 233 - 235, The guard currently
only checks `golden is not None`, so `codegen_only` in `RunConfig` is ignored;
modify the conditional before calling `_execute_on_device` to also require `not
config.codegen_only`. In other words, locate the block where
`_execute_on_device(work_dir, golden_path, config.platform, config.device_id)`
is invoked and change the condition to ensure both `golden is not None` and
`config.codegen_only` is false (e.g., `if golden is not None and not
config.codegen_only:`) so callers with `codegen_only=True` will skip the
CodeRunner execution.

Comment on lines +94 to +96
bool NeedsLayoutRepair(const TileTypePtr& tile_type) {
  return tile_type && GetTileLayout(tile_type) != TileLayout::row_major;
}

⚠️ Potential issue | 🟠 Major

Restore non-row-major outputs after generalized layout repair.

NeedsLayoutRepair() now rewrites every non-row-major input, but the assign path only reshapes back when the result is [N,1] col-major. Ordinary 2D col_major results will now be materialized as row-major and rebound to op->var_ without restoring their declared layout.

Suggested fix
-    if (IsColumnVectorColMajor(result_tile_type)) {
+    if (NeedsLayoutRepair(result_tile_type)) {
       auto row_major_var = std::make_shared<Var>(NextTempName(op->var_->name_hint_, {"row_major"}),
                                                  repaired_call->GetType(), call->span_);
       rewritten.push_back(std::make_shared<AssignStmt>(row_major_var, repaired_call, op->span_));

       auto reshape_back = CreateReshapeCall(row_major_var, result_tile_type->shape_, call->span_);

Also applies to: 176-186

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ir/transforms/resolve_backend_op_layouts_pass.cpp` around lines 94 - 96,
NeedsLayoutRepair currently rewrites every non-row-major input (it returns true
when GetTileLayout(tile_type) != TileLayout::row_major) but the assign/rebind
path only restores layout for the special [N,1] col-major case, causing ordinary
2D col_major results to be materialized as row-major and rebound to op->var_
without restoring their declared layout; update the assign/restore logic that
follows NeedsLayoutRepair so that after generalized layout repair you detect the
original tile layout (via GetTileLayout on the operand/TileType) and, when it
was col_major (or any non-row-major), reshape or rematerialize the result back
to that declared layout before assigning to op->var_ (i.e., generalize the
existing [N,1] reshape code to handle 2D col_major and other non-row-major
layouts and ensure any path that sets op->var_ preserves the original
TileLayout).

Comment on lines +130 to +143
int64_t max_slot_size = 0;
bool found_any = false;
for (const auto* direction : {&metadata.c2v, &metadata.v2c}) {
  if (!direction->has_ops) continue;
  // When any observed size is non-static, we cannot infer a slot size.
  if (direction->observed_slot_sizes.empty()) {
    return std::nullopt;
  }
  for (int64_t sz : direction->observed_slot_sizes) {
    max_slot_size = std::max(max_slot_size, sz);
    found_any = true;
  }
}
return found_any ? std::optional<int64_t>(max_slot_size) : std::nullopt;

⚠️ Potential issue | 🟠 Major

Mixed static/dynamic pipe ops can still produce a bogus slot size.

observed_slot_sizes.empty() only catches the case where all ops in a direction are non-static. If one tpush/tpop has a dynamic tile size and another has a static size, RecordTileSlotSize() marks the direction as having ops but only records the static size, so this function now returns a max slot size instead of nullopt. That can drive automatic pipe setup with an invalid slot size. Please track an explicit “unknown slot size seen” flag in PipeDirectionMetadata and bail out when any contributing direction has it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ir/transforms/utils/cross_core_pipe.cpp` around lines 130 - 143, The code
returns a max slot size even when a mix of static and dynamic tile sizes exists
because observed_slot_sizes.empty() only detects when all ops are dynamic; to
fix this add an explicit boolean flag (e.g. unknown_slot_size_seen) to
PipeDirectionMetadata, set that flag in RecordTileSlotSize() whenever a
dynamic/unknown tile size is observed, and update this function's loop over
metadata.c2v / metadata.v2c to bail out and return std::nullopt if any direction
has unknown_slot_size_seen (in addition to the existing empty check), otherwise
compute max_slot_size as before.
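
A minimal Python model of the proposed bookkeeping — the class, field, and function names below are hypothetical stand-ins for the C++ `PipeDirectionMetadata` / `RecordTileSlotSize` code, shown only to illustrate where the flag is set and checked:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PipeDirection:
    has_ops: bool = False
    unknown_slot_size_seen: bool = False           # proposed flag
    observed_slot_sizes: List[int] = field(default_factory=list)

def record_tile_slot_size(d: PipeDirection, size: Optional[int]) -> None:
    d.has_ops = True
    if size is None:                               # dynamic / non-static tile
        d.unknown_slot_size_seen = True
    else:
        d.observed_slot_sizes.append(size)

def infer_slot_size(c2v: PipeDirection, v2c: PipeDirection) -> Optional[int]:
    best: Optional[int] = None
    for d in (c2v, v2c):
        if not d.has_ops:
            continue
        if d.unknown_slot_size_seen:               # mixed static/dynamic: bail out
            return None
        for sz in d.observed_slot_sizes:
            best = sz if best is None else max(best, sz)
    return best
```

The key difference from the current code: a direction with one static and one dynamic op now reports `None` instead of the max of the static sizes it happened to record.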

Comment on lines +114 to +117
std::string ExtractBaseName(const std::string& name_hint) {
  auto pos = name_hint.find("__");
  return pos != std::string::npos ? name_hint.substr(0, pos) : name_hint;
}

⚠️ Potential issue | 🟠 Major

Base-name matching can repair the wrong loop-carried value.

name_hint_ is only a cosmetic label, and ExtractBaseName() collapses every foo__* value to the same key. If a loop carries two distinct vars with the same stem—or a user-defined name already contains __—the first match wins and the dangling yield gets rewired to the wrong iter arg. Please derive this mapping from the enclosing loop’s return_vars_/iter_args_ correspondence or another identity-based relation instead of string matching.

Also applies to: 151-164

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ir/transforms/utils/loop_state_repair.cpp` around lines 114 - 117, The
current ExtractBaseName(const std::string& name_hint) uses string-prefix
matching and can map distinct loop-carried values incorrectly; replace the
name-hint based collapse with an identity-based mapping derived from the loop
node itself (use the loop's return_vars_/iter_args_ correspondence or the loop
AST/IR node that pairs yields to iter args) so keys are based on the actual
iter_arg identity rather than the stem of name_hint; update any code that calls
ExtractBaseName (including the other occurrences around the 151-164 region) to
look up the correct iter_arg via the loop's return_vars_/iter_args_ mapping and
only fall back to name_hint as a last resort.
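
The ambiguity is easy to demonstrate with a direct Python transliteration of the helper (the `"__"` SSA-suffix convention is taken from the snippet above; the example names are invented):

```python
def extract_base_name(name_hint: str) -> str:
    # Transliteration of the C++ ExtractBaseName above: strip everything
    # from the first "__" onward.
    pos = name_hint.find("__")
    return name_hint[:pos] if pos != -1 else name_hint

# Two distinct loop-carried values with the same stem collapse to one key,
# so whichever iter arg is matched first wins when rewiring a dangling yield:
assert extract_base_name("acc__ssa_1") == extract_base_name("acc__ssa_2") == "acc"

# A user-chosen name that already contains "__" is silently truncated:
assert extract_base_name("attn__row") == "attn"
```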

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py (2)

618-620: grad_sink is computed but never used.

The variable grad_sink is assigned the result of pl.mul(bwd_energy, 0.0) but is never referenced afterward. If this is intended as a dependency anchor to prevent bwd_energy from being optimized away, consider adding a brief comment explaining its purpose. Otherwise, this is dead code that can be removed.

💡 Suggestion: Add explanatory comment or remove

If intentional:

                         bwd_energy = pl.add(bwd_energy, pl.row_sum(attn_w))
+                        # Anchor to ensure bwd_energy computation is retained
                         grad_sink = pl.mul(bwd_energy, 0.0)

If unintentional dead code:

                         bwd_energy = pl.add(bwd_energy, pl.row_sum(attn_w))
-                        grad_sink = pl.mul(bwd_energy, 0.0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py` around
lines 618 - 620, The variable grad_sink is created as pl.mul(bwd_energy, 0.0)
but never used; either remove this dead code or make its intent explicit as a
dependency anchor. Locate the block where bwd_energy is accumulated (references:
bwd_energy, contrib, attn_w, pl.row_sum) and either delete the grad_sink
assignment or replace it with a clear comment stating it exists to create a
side-effect/dependency to prevent optimization (e.g., “// dependency anchor to
keep bwd_energy live”), ensuring no other code expects grad_sink to be consumed.

1261-1264: Fragile error detection heuristic.

Checking "code_runner" in result.error to detect missing CodeRunner is brittle—error message wording may change. Consider using a more robust detection mechanism, such as checking for a specific exception type or a dedicated flag in RunResult.

💡 Suggestion: Use a more specific check

If a specific exception type is available:

-    if not result.passed and result.error and "code_runner" in result.error:
+    if not result.passed and result.error and "CodeRunner" in result.error:
         print("Result: COMPILE OK — device run skipped (code_runner not found).")

Or better, if the runner can be updated to set a flag:

if not result.passed and getattr(result, 'compile_only', False):
    print("Result: COMPILE OK — device run skipped.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py` around
lines 1261 - 1264, The current check uses a brittle substring match on
result.error; change it to a robust signal on the RunResult object instead: have
the runner populate a boolean flag (e.g., RunResult.compile_only) or set a
specific exception type on result.error, then update the condition here to use
getattr(result, 'compile_only', False) (or isinstance(result.error,
SpecificCompileOnlyException)) and print the "COMPILE OK — device run skipped"
message based on that flag/type; update any code that constructs RunResult in
the runner to set compile_only when compilation succeeds but device run is
intentionally skipped.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py`:
- Around line 618-620: The variable grad_sink is created as pl.mul(bwd_energy,
0.0) but never used; either remove this dead code or make its intent explicit as
a dependency anchor. Locate the block where bwd_energy is accumulated
(references: bwd_energy, contrib, attn_w, pl.row_sum) and either delete the
grad_sink assignment or replace it with a clear comment stating it exists to
create a side-effect/dependency to prevent optimization (e.g., “// dependency
anchor to keep bwd_energy live”), ensuring no other code expects grad_sink to be
consumed.
- Around line 1261-1264: The current check uses a brittle substring match on
result.error; change it to a robust signal on the RunResult object instead: have
the runner populate a boolean flag (e.g., RunResult.compile_only) or set a
specific exception type on result.error, then update the condition here to use
getattr(result, 'compile_only', False) (or isinstance(result.error,
SpecificCompileOnlyException)) and print the "COMPILE OK — device run skipped"
message based on that flag/type; update any code that constructs RunResult in
the runner to set compile_only when compilation succeeds but device run is
intentionally skipped.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 039bcb0d-d6c9-4d68-9d3d-dc2c42cbd301

📥 Commits

Reviewing files that changed from the base of the PR and between b7b2cc1 and 3973ebf.

📒 Files selected for processing (1)
  • examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py

@mknkfan mknkfan changed the title pypto ir src modified to support qwen3 prefill compile pypto ir modified to support qwen3 codes Apr 2, 2026
@mknkfan mknkfan changed the title pypto ir modified to support qwen3 codes pypto ir modified to support qwen3 code Apr 2, 2026
@mknkfan
Author

mknkfan commented Apr 2, 2026

Added more qwen3 models from hw-native-sys/pypto-lib/examples/models/qwen3, modified so that they compile with pypto. The modified files are suffixed with *_new.py.

@mknkfan mknkfan force-pushed the pypto_changed branch 2 times, most recently from 4aba28e to 3ae9bf7 Compare April 2, 2026 09:02
@mknkfan mknkfan changed the title pypto ir modified to support qwen3 code pto codegen modified to support qwen3 program Apr 7, 2026
@mknkfan
Author

mknkfan commented Apr 7, 2026

Branch rebased onto origin; conflicts resolved.

pto_codegen.cpp: branch on the parameter direction and the function-name suffix for output column vectors.
pto_control_flow_codegen.cpp: yield the tensor-view SSA value for tensor vars in YieldStmt.
