pto codegen modified to support qwen3 program #752
mknkfan wants to merge 1 commit into hw-native-sys:main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews:
📝 Walkthrough
This PR updates IR repair, tile-layout resolution, and PTO codegen: fixes dangling scf.if yields via SSA-suffix stripping and iter-arg base-name mapping; forces row-major layouts for several tile ops and broadcast row-expands; adjusts IfStmt phi-buffer allocation and branch move emission; generalizes backend layout repair; and adds many example programs and a compile-only runtime path.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related issues
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Code Review
This pull request introduces a comprehensive set of compilation fixes and feature enhancements for the PyPTO compiler, primarily targeting Qwen3 and DeepSeek V3.2 models. Key improvements include resolving yield type mismatches in control flow, enforcing hardware-required row-major layouts for broadcast operations, and generalizing layout repair logic for non-row-major tiles. Additionally, the PR adds several new library examples for DeepSeek and Qwen3 models and updates the runtime to support compile-only runs. I have no feedback to provide.
Actionable comments posted: 15
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/codegen/pto/pto_codegen.cpp (1)
470-489: ⚠️ Potential issue | 🟠 Major
Honor explicit TensorLayout::ND before applying the [N,1] DN heuristic.
Once layout_str is initialized from layout_DN, the ND branch leaves it untouched. That means a tensor view explicitly marked ND on a column-vector shape is still emitted as #pto.layout<dn>, which silently changes its declared layout.
Suggested fix

```diff
 if (tensor_type->tensor_view_.has_value()) {
   switch (tensor_type->tensor_view_.value().layout) {
     case ir::TensorLayout::DN: layout_str = "dn"; break;
     case ir::TensorLayout::NZ: layout_str = "nz"; break;
     case ir::TensorLayout::ND:
-      break;
+      layout_str = "nd";
+      break;
   }
 }
```

Also applies to: 552-563
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/codegen/pto/pto_codegen.cpp` around lines 470 - 489, The DN-heuristic overrides an explicit TensorLayout::ND because the heuristic runs regardless of a tensor_view_ explicitly set to ND; update the logic around tensor_type->tensor_view_ and layout_DN so that if tensor_type->tensor_view_.has_value() and tensor_type->tensor_view_.value().layout == ir::TensorLayout::ND you skip the [N,1] DN heuristic (i.e., do not set layout_DN from the shape check). Change the guard before the shape-based DN assignment to only run when there is no explicit tensor_view_ or when the explicit layout is not ND, keeping references to layout_DN, tensor_type->tensor_view_, ir::TensorLayout::ND, GetConstIntValue and As<ir::Var> to locate and modify the code.
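To see why the ordering matters, here is a hedged Python model of the selection logic; `TensorLayout` mirrors the C++ enum names, but `pick_layout_str` and its arguments are hypothetical stand-ins, not the codegen API:

```python
from enum import Enum
from typing import Optional

class TensorLayout(Enum):
    ND = 0
    DN = 1
    NZ = 2

def pick_layout_str(shape, explicit: Optional[TensorLayout]) -> str:
    # An explicitly declared tensor view always wins.
    if explicit is not None:
        return {TensorLayout.ND: "nd", TensorLayout.DN: "dn", TensorLayout.NZ: "nz"}[explicit]
    # Heuristic: only when no view is declared does a [N, 1] column vector default to DN.
    if len(shape) == 2 and shape[1] == 1:
        return "dn"
    return "nd"
```

The bug being flagged corresponds to running the `[N, 1]` check unconditionally, which would return `"dn"` even when the caller explicitly asked for ND.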
🧹 Nitpick comments (2)
python/pypto/ir/op/tile_ops.py (1)
1727-1728: Keep transpose axes normalized to INDEX.
Like the tensor wrapper, this now trusts any incoming ConstInt dtype. Rebuilding from the current value keeps tile.transpose axis operands consistently typed as INDEX.
💡 Suggested fix

```diff
-        axis1_expr = axis1 if isinstance(axis1, ConstInt) else ConstInt(axis1, DataType.INDEX, actual_span)
-        axis2_expr = axis2 if isinstance(axis2, ConstInt) else ConstInt(axis2, DataType.INDEX, actual_span)
+        axis1_expr = ConstInt(axis1.value if isinstance(axis1, ConstInt) else axis1, DataType.INDEX, actual_span)
+        axis2_expr = ConstInt(axis2.value if isinstance(axis2, ConstInt) else axis2, DataType.INDEX, actual_span)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/pypto/ir/op/tile_ops.py` around lines 1727 - 1728, axis1_expr and axis2_expr currently reuse incoming ConstInt instances which may have non-INDEX dtypes; always rebuild them as new ConstInt with DataType.INDEX so tile.transpose axis operands are consistently typed. Replace the conditional reuse with construction using ConstInt(axis1.value or axis1, DataType.INDEX, actual_span) (and same for axis2) so that even when axis1/axis2 is already a ConstInt you recreate it normalized to INDEX; refer to axis1_expr, axis2_expr, ConstInt and DataType.INDEX to locate and modify the code handling tile.transpose axes.

python/pypto/ir/op/tensor_ops.py (1)
846-847: Keep transpose axes normalized to INDEX.
Forwarding an existing ConstInt unchanged makes the IR wrapper depend on whatever dtype the caller used. Rebuilding from the current value keeps tensor.transpose axis operands consistently typed as INDEX.
💡 Suggested fix

```diff
-        axis1_expr = axis1 if isinstance(axis1, ConstInt) else ConstInt(axis1, DataType.INDEX, actual_span)
-        axis2_expr = axis2 if isinstance(axis2, ConstInt) else ConstInt(axis2, DataType.INDEX, actual_span)
+        axis1_expr = ConstInt(axis1.value if isinstance(axis1, ConstInt) else axis1, DataType.INDEX, actual_span)
+        axis2_expr = ConstInt(axis2.value if isinstance(axis2, ConstInt) else axis2, DataType.INDEX, actual_span)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/pypto/ir/op/tensor_ops.py` around lines 846 - 847, The current code leaves existing ConstInt operands unchanged, which can preserve a non-INDEX dtype; always normalize transpose axes to DataType.INDEX by rebuilding axis1_expr and axis2_expr from the underlying integer value rather than forwarding the ConstInt instance. Concretely, when creating axis1_expr and axis2_expr (used by tensor.transpose), extract the integer value from axis1/axis2 if they are ConstInt (e.g., axis1.value/axis2.value) and pass that into a new ConstInt(..., DataType.INDEX, actual_span) so both axis operands are freshly constructed with DataType.INDEX and the correct span.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py`:
- Around line 223-236: The RunConfig instantiation is invalid: replace the
unsupported keyword work_dir with save_kernels=True and
save_kernels_dir=work_dir, and change the invalid enum BackendType.CCE to
BackendType.Ascend910B_CCE; update the call that constructs RunConfig (the
arguments passed to run(program=program, tensor_specs=tensor_specs, ...)) so it
uses save_kernels=True, save_kernels_dir=work_dir and
BackendType.Ascend910B_CCE.
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_decode_front.py`:
- Around line 390-452: The attn_front write inside the pl.parallel head loop
causes concurrent writes of the same [b,0] row; move the pl.assemble(attn_front,
attn_row, [b, 0]) out of the parallel loop so attn_row is fully populated across
all h before a single write. Keep all per-head work (building q_nope/q_rot,
oi/li/mi accumulation, ctx_latent, ctx_v, and pl.assemble(attn_row, ctx_v, [0,
v_col])) inside the pl.parallel over h, but perform the final
pl.assemble(attn_front, attn_row, [b, 0]) only once after the pl.parallel on
NUM_HEADS_CFG completes.
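The restructuring the comment asks for can be sketched with plain Python lists as stand-ins for the pl tensors (`NUM_HEADS`, `V_DIM`, and `assemble_row_then_write` are hypothetical, not the pl API): each head fills a disjoint slice of its private row, and the shared row of `attn_front` is written once afterwards.

```python
NUM_HEADS = 4   # stand-in for NUM_HEADS_CFG
V_DIM = 2       # stand-in for V_HEAD_DIM_CFG

def assemble_row_then_write(attn_front, b):
    attn_row = [0.0] * (NUM_HEADS * V_DIM)
    # Per-head work may run in parallel: each head writes a disjoint slice of attn_row.
    for h in range(NUM_HEADS):
        ctx_v = [float(h)] * V_DIM              # stand-in for the per-head context vector
        attn_row[h * V_DIM:(h + 1) * V_DIM] = ctx_v
    # The shared [b, 0] row is written exactly once, after all heads are done.
    attn_front[b] = attn_row
    return attn_front
```

The buggy form corresponds to moving the last assignment inside the `for h` loop: every head would then republish a partially filled row to the shared buffer.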
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_back.py`:
- Around line 228-241: The RunConfig being constructed incorrectly includes a
nonexistent parameter work_dir; remove work_dir from the RunConfig(...) call and
instead enable kernel artifact saving by adding save_kernels=True and
save_kernels_dir=work_dir to the RunConfig passed to run(), ensuring run(...)
will save generated artefacts to the intended work_dir; update the RunConfig
construction referenced in the run(...) invocation accordingly.
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_front.py`:
- Around line 548-561: The RunConfig call currently passes an unsupported field
work_dir which causes an immediate failure; replace that parameter by setting
save_kernels=True and save_kernels_dir=work_dir within the RunConfig constructor
(keeping platform, device_id, rtol, atol, strategy, dump_passes, backend_type
unchanged) so the RunConfig creation succeeds; update the RunConfig invocation
where RunConfig(...) is used to include save_kernels and save_kernels_dir
instead of work_dir.
In `@examples-lib/qwen3/qwen3_32b_decode.py`:
- Around line 411-454: The RunConfig defaults are inconsistent: the function
parameter platform (default "a2a3") does not match the hard-coded
BackendType.Ascend950 used in RunConfig, causing compilation for a different
target; update the call that constructs RunConfig inside this function so
backend_type is chosen to match platform (e.g., select
BackendType.Ascend910B_PTO when platform == "a2a3", or derive backend_type from
the platform variable), or change the platform default to match
BackendType.Ascend950 — adjust the code around the RunConfig(...) call and
ensure BackendType.Ascend950 and BackendType.Ascend910B_PTO are the referenced
symbols used for the mapping.
In `@examples-lib/qwen3/qwen3_32b_prefill.py`:
- Around line 259-269: The slice of scores should use a statically-shaped extent
and an explicit valid_shape to keep compiler lowering consistent: replace the
dynamic extent pl.slice(scores, [1, valid_len], [0, 0]) with a static extent [1,
SEQ_TILE] and pass valid_shape=[1, valid_len] (i.e., call pl.slice(scores, [1,
SEQ_TILE], [0, 0], valid_shape=[1, valid_len])) so scores_valid remains [1,
SEQ_TILE] while retaining the actual valid length in valid_shape; update the
pl.slice invocation that creates scores_valid accordingly (refer to
scores_valid, pl.slice, SEQ_TILE, valid_len).
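The static-extent pattern can be modeled in plain Python (`SEQ_TILE` is a stand-in constant and `slice_static` is hypothetical, not `pl.slice`): the physical buffer keeps a fixed size while only the logical valid shape varies.

```python
SEQ_TILE = 8

def slice_static(scores, valid_len):
    # Physical tile is always [1, SEQ_TILE]; valid_len only travels in valid_shape.
    assert valid_len <= SEQ_TILE
    tile = (scores + [0.0] * SEQ_TILE)[:SEQ_TILE]   # pad/truncate to the static extent
    return tile, (1, valid_len)
```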
In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py`:
- Around line 964-973: The RunConfig call is passing a non-existent argument
work_dir which raises a TypeError; update the RunConfig instantiation in this
file (the RunConfig(...) block) to remove work_dir or replace it with the
runner-supported parameter save_kernels_dir (e.g., save_kernels_dir=work_dir) so
the output location is controlled correctly; ensure only valid RunConfig
keywords remain (rtol, atol, strategy, dump_passes, backend_type, device_id,
platform, etc.) and run a quick compile to verify no TypeError occurs.
- Around line 59-72: In build_qwen3_32b_training_forward_backward_program
validate or handle non-multiple dimensions: check that hidden_size and
intermediate_size are multiples of K_CHUNK and MLP_OUT_CHUNK (and that
hidden_size aligns with Q_OUT_CHUNK where relevant) or compute and handle tail
sizes instead of using ceil-div-only block counts (HIDDEN_BLOCKS, Q_OUT_BLOCKS,
MLP_OUT_BLOCKS); also compute tok_blocks = (MAX_SEQ_CFG + TOK_TILE - 1) //
TOK_TILE or explicitly check MAX_SEQ_CFG % TOK_TILE == 0 and reject/raise if not
supported. Update kernel invocation/slicing logic to guard last-block ranges
using computed remainders or bail early with a clear error referencing
HIDDEN_CFG, INTER_CFG, K_CHUNK, Q_OUT_CHUNK, MLP_OUT_CHUNK, and TOK_TILE.
In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward.py`:
- Around line 129-132: The loop computing tok_blocks uses floor division
(tok_blocks = MAX_SEQ_CFG // TOK_TILE) which drops tail tokens when MAX_SEQ_CFG
% TOK_TILE != 0; fix by computing tok_blocks = (MAX_SEQ_CFG + TOK_TILE - 1) //
TOK_TILE (round up) and modify the nested loop inside pl.parallel (the for
p0_idx in pl.range(tok_blocks) body) to handle the final partial tile using the
existing valid_shape/path (i.e., compute p0 = p0_idx * TOK_TILE and for the last
block clamp the tile length with min(TOK_TILE, MAX_SEQ_CFG - p0) and pass that
as valid_shape to the downstream logic), or alternatively add an upfront
validation to reject non-divisible MAX_SEQ_CFG by raising an error if
MAX_SEQ_CFG % TOK_TILE != 0.
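The ceil-div plus tail-clamp scheme described above can be sketched in a few lines (`tile_blocks` is a hypothetical helper, not part of the pl API); each block carries its start offset and its valid length, so the final partial tile is still covered:

```python
def tile_blocks(total, tile):
    blocks = (total + tile - 1) // tile          # ceil division: include the tail block
    return [(i * tile, min(tile, total - i * tile)) for i in range(blocks)]
```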
- Around line 1002-1023: The computed work_dir variable is never passed into the
runner, so update the RunConfig passed to run(...) to set
save_kernels_dir=work_dir (or the equivalent RunConfig field used to control
kernel output). Locate the RunConfig construction in the call to
run(program=program, ... ) and add save_kernels_dir=work_dir to that RunConfig
so generated kernels are written to the printed work_dir instead of a
timestamped temp directory.
In `@examples-lib/qwen3/qwen3-32b.py`:
- Around line 241-247: The pl.slice call makes the tile width dynamic; instead,
keep the physical tile size [1, SEQ_TILE] and pass valid_len via valid_shape:
replace the pl.slice(scores, [1, valid_len], ...) that produces scores_valid
with a slice that always uses [1, SEQ_TILE] (so scores_valid is physically
shaped [1, SEQ_TILE]) and set its valid_shape to [1, valid_len] (mirroring
examples-lib/qwen3/qwen3_32b_decode.py); ensure downstream uses (cur_mi,
exp_scores, cur_li, exp_pad, etc.) operate on that fixed tile and rely on
valid_shape for correctness.
In `@python/pypto/runtime/runner.py`:
- Around line 233-235: The guard currently only checks `golden is not None`, so
`codegen_only` in `RunConfig` is ignored; modify the conditional before calling
`_execute_on_device` to also require `not config.codegen_only`. In other words,
locate the block where `_execute_on_device(work_dir, golden_path,
config.platform, config.device_id)` is invoked and change the condition to
ensure both `golden is not None` and `config.codegen_only` is false (e.g., `if
golden is not None and not config.codegen_only:`) so callers with
`codegen_only=True` will skip the CodeRunner execution.
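A minimal sketch of the guarded call, assuming a reduced RunConfig with only the `codegen_only` field (the real class and `_execute_on_device` signature differ):

```python
from dataclasses import dataclass

@dataclass
class RunConfig:                 # reduced stand-in; the real class has many more fields
    codegen_only: bool = False

executed = []

def _execute_on_device(work_dir):    # stand-in for the real device call
    executed.append(work_dir)

def run(golden, config, work_dir="out"):
    # The fix: require a golden reference AND codegen_only being off.
    if golden is not None and not config.codegen_only:
        _execute_on_device(work_dir)
```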
In `@src/ir/transforms/resolve_backend_op_layouts_pass.cpp`:
- Around line 94-96: NeedsLayoutRepair currently rewrites every non-row-major
input (it returns true when GetTileLayout(tile_type) != TileLayout::row_major)
but the assign/rebind path only restores layout for the special [N,1] col-major
case, causing ordinary 2D col_major results to be materialized as row-major and
rebound to op->var_ without restoring their declared layout; update the
assign/restore logic that follows NeedsLayoutRepair so that after generalized
layout repair you detect the original tile layout (via GetTileLayout on the
operand/TileType) and, when it was col_major (or any non-row-major), reshape or
rematerialize the result back to that declared layout before assigning to
op->var_ (i.e., generalize the existing [N,1] reshape code to handle 2D
col_major and other non-row-major layouts and ensure any path that sets op->var_
preserves the original TileLayout).
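The generalized restore path can be modeled abstractly (`repair_and_restore` and the layout strings are hypothetical stand-ins for the C++ pass, not its API):

```python
def repair_and_restore(declared_layout, compute):
    # The backend op always sees a row-major operand...
    data = compute("row_major")
    # ...and the result is converted back to the tile's declared layout
    # before it is rebound to op->var_, for ANY non-row-major layout.
    if declared_layout != "row_major":
        data = ("convert_to", declared_layout, data)
    return data
```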
In `@src/ir/transforms/utils/cross_core_pipe.cpp`:
- Around line 130-143: The code returns a max slot size even when a mix of
static and dynamic tile sizes exists because observed_slot_sizes.empty() only
detects when all ops are dynamic; to fix this add an explicit boolean flag (e.g.
unknown_slot_size_seen) to PipeDirectionMetadata, set that flag in
RecordTileSlotSize() whenever a dynamic/unknown tile size is observed, and
update this function's loop over metadata.c2v / metadata.v2c to bail out and
return std::nullopt if any direction has unknown_slot_size_seen (in addition to
the existing empty check), otherwise compute max_slot_size as before.
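The flag-based bail-out can be sketched in Python (`PipeDirection` and `record` are stand-ins for PipeDirectionMetadata and RecordTileSlotSize):

```python
from typing import Optional

class PipeDirection:
    def __init__(self):
        self.observed_slot_sizes = []
        self.unknown_slot_size_seen = False   # the flag the comment asks to add

    def record(self, size: Optional[int]):
        if size is None:                      # dynamic tile: size unknown at compile time
            self.unknown_slot_size_seen = True
        else:
            self.observed_slot_sizes.append(size)

def max_slot_size(directions) -> Optional[int]:
    sizes = []
    for d in directions:
        # Bail out if any direction saw an unknown size OR recorded nothing.
        if d.unknown_slot_size_seen or not d.observed_slot_sizes:
            return None
        sizes.extend(d.observed_slot_sizes)
    return max(sizes)
```

Without the flag, a direction mixing static and dynamic sizes has a non-empty list and would wrongly report a max.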
In `@src/ir/transforms/utils/loop_state_repair.cpp`:
- Around line 114-117: The current ExtractBaseName(const std::string& name_hint)
uses string-prefix matching and can map distinct loop-carried values
incorrectly; replace the name-hint based collapse with an identity-based mapping
derived from the loop node itself (use the loop's return_vars_/iter_args_
correspondence or the loop AST/IR node that pairs yields to iter args) so keys
are based on the actual iter_arg identity rather than the stem of name_hint;
update any code that calls ExtractBaseName (including the other occurrences
around the 151-164 region) to look up the correct iter_arg via the loop's
return_vars_/iter_args_ mapping and only fall back to name_hint as a last
resort.
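Why prefix matching is fragile, and what identity-based pairing looks like, in a hedged Python sketch (`extract_base_name` mimics the suffix-stripping; `map_by_identity` is a hypothetical positional pairing, not the pass's API):

```python
def extract_base_name(name_hint: str) -> str:
    # Prefix-style collapse: strip a trailing SSA suffix like "_1", "_2".
    head, _, tail = name_hint.rpartition("_")
    return head if tail.isdigit() else name_hint

def map_by_identity(iter_args, yields):
    # Identity-based pairing: yield i feeds iter_args[i] by position,
    # regardless of what the name hints look like.
    return dict(zip(iter_args, yields))
```

Two distinct loop-carried values with hints `acc_1` and `acc_2` collapse to the same key `acc` under name matching, while positional pairing keeps them apart.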
---
Outside diff comments:
In `@src/codegen/pto/pto_codegen.cpp`:
- Around line 470-489: The DN-heuristic overrides an explicit TensorLayout::ND
because the heuristic runs regardless of a tensor_view_ explicitly set to ND;
update the logic around tensor_type->tensor_view_ and layout_DN so that if
tensor_type->tensor_view_.has_value() and
tensor_type->tensor_view_.value().layout == ir::TensorLayout::ND you skip the
[N,1] DN heuristic (i.e., do not set layout_DN from the shape check). Change the
guard before the shape-based DN assignment to only run when there is no explicit
tensor_view_ or when the explicit layout is not ND, keeping references to
layout_DN, tensor_type->tensor_view_, ir::TensorLayout::ND, GetConstIntValue and
As<ir::Var> to locate and modify the code.
---
Nitpick comments:
In `@python/pypto/ir/op/tensor_ops.py`:
- Around line 846-847: The current code leaves existing ConstInt operands
unchanged, which can preserve a non-INDEX dtype; always normalize transpose axes
to DataType.INDEX by rebuilding axis1_expr and axis2_expr from the underlying
integer value rather than forwarding the ConstInt instance. Concretely, when
creating axis1_expr and axis2_expr (used by tensor.transpose), extract the
integer value from axis1/axis2 if they are ConstInt (e.g.,
axis1.value/axis2.value) and pass that into a new ConstInt(..., DataType.INDEX,
actual_span) so both axis operands are freshly constructed with DataType.INDEX
and the correct span.
In `@python/pypto/ir/op/tile_ops.py`:
- Around line 1727-1728: axis1_expr and axis2_expr currently reuse incoming
ConstInt instances which may have non-INDEX dtypes; always rebuild them as new
ConstInt with DataType.INDEX so tile.transpose axis operands are consistently
typed. Replace the conditional reuse with construction using
ConstInt(axis1.value or axis1, DataType.INDEX, actual_span) (and same for axis2)
so that even when axis1/axis2 is already a ConstInt you recreate it normalized
to INDEX; refer to axis1_expr, axis2_expr, ConstInt and DataType.INDEX to locate
and modify the code handling tile.transpose axes.
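The normalization both nitpicks ask for can be sketched with a reduced stand-in for the IR node (`ConstInt` here is hypothetical, not the real pypto class):

```python
from dataclasses import dataclass

@dataclass
class ConstInt:                    # reduced stand-in for the IR node
    value: int
    dtype: str

INDEX = "index"

def normalize_axis(axis):
    # Always rebuild the operand so its dtype is INDEX, even when the
    # caller already passed a ConstInt carrying some other dtype.
    raw = axis.value if isinstance(axis, ConstInt) else axis
    return ConstInt(raw, INDEX)
```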
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 96070b8a-541a-4bcb-b0bc-ab923c8066f9
📒 Files selected for processing (24)
CHANGES_SUMMARY.txt
examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py
examples-lib/deepseek_v3_2/deepseek_v3_2_decode_front.py
examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_back.py
examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_front.py
examples-lib/hello_world.py
examples-lib/matmul.py
examples-lib/qwen3/qwen3-32b.py
examples-lib/qwen3/qwen3_32b_decode.py
examples-lib/qwen3/qwen3_32b_prefill.py
examples-lib/qwen3/qwen3_32b_prefill_src.py
examples-lib/qwen3/qwen3_32b_training_forward_and_backward.py
examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py
python/pypto/ir/op/tensor_ops.py
python/pypto/ir/op/tile_ops.py
python/pypto/runtime/runner.py
src/backend/common/pto_ops_common.cpp
src/codegen/pto/pto_codegen.cpp
src/codegen/pto/pto_control_flow_codegen.cpp
src/ir/op/tile_ops/broadcast.cpp
src/ir/transforms/resolve_backend_op_layouts_pass.cpp
src/ir/transforms/utils/cross_core_pipe.cpp
src/ir/transforms/utils/loop_state_repair.cpp
src/ir/verifier/mixed_kernel_expanded_verifier.cpp
```python
    result = run(
        program=program,
        tensor_specs=tensor_specs,
        golden=None,
        config=RunConfig(
            platform=platform,
            device_id=device_id,
            rtol=2e-2,
            atol=2e-2,
            strategy=OptimizationStrategy.Default,
            dump_passes=dump_passes,
            backend_type=BackendType.CCE,
            work_dir=work_dir,
        ),
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -type f -name "*.py" | xargs grep -l "class RunConfig" | head -5
```
Repository: hw-native-sys/pypto
Length of output: 95

🏁 Script executed:
```shell
cat -n ./python/pypto/runtime/runner.py | head -100
```
Repository: hw-native-sys/pypto
Length of output: 5014

🏁 Script executed:
```shell
cat -n examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py | sed -n '223,236p'
```
Repository: hw-native-sys/pypto
Length of output: 567

🏁 Script executed:
```shell
cat -n ./python/pypto/runtime/runner.py | sed -n '56,90p'
```
Repository: hw-native-sys/pypto
Length of output: 1994

🏁 Script executed:
```shell
# Verify BackendType.CCE exists (from the code snippet)
grep -r "BackendType\|class BackendType" --include="*.py" | grep -E "(CCE|BackendType)" | head -10
```
Repository: hw-native-sys/pypto
Length of output: 1016

🏁 Script executed:
```shell
find . -type f -name "*.py" | xargs grep -l "class BackendType\|enum BackendType" | head -3
```
Repository: hw-native-sys/pypto
Length of output: 45

🏁 Script executed:
```shell
grep -r "^class BackendType\|^BackendType.*=" --include="*.py" | head -10
```
Repository: hw-native-sys/pypto
Length of output: 45

🏁 Script executed:
```shell
find . -type f -name "*.py" -path "*/backend*" | head -20
```
Repository: hw-native-sys/pypto
Length of output: 280

🏁 Script executed:
```shell
cat -n ./python/pypto/backend/__init__.py
```
Repository: hw-native-sys/pypto
Length of output: 1825

🏁 Script executed:
```shell
grep -r "BackendType\|CCE" ./python/pypto/pypto_core/backend.py | head -20
```
Repository: hw-native-sys/pypto
Length of output: 132

🏁 Script executed:
```shell
find . -type f -name "backend.py" | head -10
```
Repository: hw-native-sys/pypto
Length of output: 45

🏁 Script executed:
```shell
find . -type f -path "*pypto_core*" -name "*.py" | head -20
```
Repository: hw-native-sys/pypto
Length of output: 45

🏁 Script executed:
```shell
grep -r "class BackendType" . --include="*.py" --include="*.pyi" 2>/dev/null | head -10
```
Repository: hw-native-sys/pypto
Length of output: 119

🏁 Script executed:
```shell
cat -n ./python/pypto/pypto_core/backend.pyi | grep -A 20 "class BackendType"
```
Repository: hw-native-sys/pypto
Length of output: 714

🏁 Script executed:
```shell
grep -r "BackendType\.CCE" . --include="*.py" | head -10
```
Repository: hw-native-sys/pypto
Length of output: 475

🏁 Script executed:
```shell
# Check if BackendType.CCE is actually valid or if it's a typo
grep -B 5 -A 5 "BackendType\.CCE\|BackendType\.Ascend910B_CCE" ./examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py | head -20
```
Repository: hw-native-sys/pypto
Length of output: 478
RunConfig doesn't accept work_dir, and BackendType.CCE is not a valid enum value.
This config construction will raise at instantiation. Replace work_dir=work_dir with save_kernels=True and save_kernels_dir=work_dir. Additionally, BackendType.CCE is not a valid backend type; use BackendType.Ascend910B_CCE instead.
Suggested fix

```diff
     result = run(
         program=program,
         tensor_specs=tensor_specs,
         golden=None,
         config=RunConfig(
             platform=platform,
             device_id=device_id,
             rtol=2e-2,
             atol=2e-2,
             strategy=OptimizationStrategy.Default,
             dump_passes=dump_passes,
-            backend_type=BackendType.CCE,
-            work_dir=work_dir,
+            backend_type=BackendType.Ascend910B_CCE,
+            save_kernels=True,
+            save_kernels_dir=work_dir,
         ),
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_decode_back.py` around lines 223 -
236, The RunConfig instantiation is invalid: replace the unsupported keyword
work_dir with save_kernels=True and save_kernels_dir=work_dir, and change the
invalid enum BackendType.CCE to BackendType.Ascend910B_CCE; update the call that
constructs RunConfig (the arguments passed to run(program=program,
tensor_specs=tensor_specs, ...)) so it uses save_kernels=True,
save_kernels_dir=work_dir and BackendType.Ascend910B_CCE.
```python
for h in pl.parallel(0, NUM_HEADS_CFG, 1, chunk=8):
    q_col = h * QK_HEAD_DIM_CFG
    q_nope = pl.cast(
        pl.slice(q_proj, [1, QK_NOPE_HEAD_DIM_CFG], [b, q_col]),
        target_type=pl.FP32,
    )
    q_pe = pl.cast(
        pl.slice(q_proj, [1, QK_ROPE_HEAD_DIM_CFG], [b, q_col + QK_NOPE_HEAD_DIM_CFG]),
        target_type=pl.FP32,
    )
    q_lo = pl.slice(q_pe, [1, QK_ROPE_HEAD_DIM_CFG // 2], [0, 0])
    q_hi = pl.slice(q_pe, [1, QK_ROPE_HEAD_DIM_CFG // 2], [0, QK_ROPE_HEAD_DIM_CFG // 2])
    q_rot = pl.create_tensor([1, QK_ROPE_HEAD_DIM_CFG], dtype=pl.FP32)
    q_rot = pl.assemble(q_rot, pl.sub(pl.col_expand_mul(q_lo, cos_lo), pl.col_expand_mul(q_hi, sin_lo)), [0, 0])
    q_rot = pl.assemble(q_rot, pl.add(pl.col_expand_mul(q_hi, cos_hi), pl.col_expand_mul(q_lo, sin_hi)), [0, QK_ROPE_HEAD_DIM_CFG // 2])
    q_nope_latent = pl.matmul(
        pl.cast(q_nope, target_type=pl.BF16),
        pl.reshape(pl.slice(w_q_nope_to_latent, [1, QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG], [h, 0, 0]), [QK_NOPE_HEAD_DIM_CFG, KV_LORA_RANK_CFG]),
    )

    oi = pl.create_tensor([1, KV_LORA_RANK_CFG], dtype=pl.FP32)
    li = pl.create_tensor([1, 1], dtype=pl.FP32)
    mi = pl.create_tensor([1, 1], dtype=pl.FP32)
    oi = pl.mul(oi, 0.0)
    li = pl.mul(li, 0.0)
    mi = pl.mul(mi, 0.0)

    sparse_k = pl.min(INDEX_TOPK_CFG, ctx_len)
    for kk in pl.range(sparse_k):
        s = pl.tensor.read(topk_idx, [0, kk])
        if s >= 0:
            cache_s = b * MAX_SEQ_CFG + s
            kv_s = pl.cast(pl.slice(kv_cache, [1, KV_LORA_RANK_CFG], [cache_s, 0]), target_type=pl.FP32)
            pe_s = pl.cast(pl.slice(pe_cache, [1, QK_ROPE_HEAD_DIM_CFG], [cache_s, 0]), target_type=pl.FP32)
            score_nope = pl.row_sum(pl.mul(q_nope_latent, kv_s))
            score_pe = pl.row_sum(pl.mul(q_rot, pe_s))
            score = pl.mul(pl.add(score_nope, score_pe), ATTN_SCALE)
            cur_mi = score
            cur_li = pl.exp(pl.sub(score, cur_mi))
            oi_tmp = pl.row_expand_mul(kv_s, cur_li)
            if kk == 0:
                oi = oi_tmp
                li = cur_li
                mi = cur_mi
            else:
                mi_new = pl.maximum(mi, cur_mi)
                alpha = pl.exp(pl.sub(mi, mi_new))
                beta = pl.exp(pl.sub(cur_mi, mi_new))
                li = pl.add(pl.mul(alpha, li), pl.mul(beta, cur_li))
                oi = pl.add(pl.row_expand_mul(oi, alpha), pl.row_expand_mul(oi_tmp, beta))
                mi = mi_new
    ctx_latent = pl.row_expand_div(oi, li)
    v_col = h * V_HEAD_DIM_CFG
    ctx_v = pl.create_tensor([1, V_HEAD_DIM_CFG], dtype=pl.FP32)
    ctx_v = pl.mul(ctx_v, 0.0)
    for vb in pl.range(V_OUT_BLOCKS):
        v0 = vb * V_OUT_CHUNK
        wv_tile = pl.slice(w_latent_to_v, [1, KV_LORA_RANK_CFG, V_OUT_CHUNK], [h, 0, v0])
        wv_tile = pl.reshape(wv_tile, [KV_LORA_RANK_CFG, V_OUT_CHUNK])
        v_part = pl.matmul(pl.cast(ctx_latent, target_type=pl.BF16), wv_tile, out_dtype=pl.FP32)
        ctx_v = pl.assemble(ctx_v, v_part, [0, v0])
    attn_row = pl.assemble(attn_row, ctx_v, [0, v_col])
    attn_front = pl.assemble(attn_front, attn_row, [b, 0])
```
Move the attn_front write out of the head-parallel loop.
Every h iteration writes the same [b, 0] row while pl.parallel is active. That makes attn_front a shared write target fed by partially populated attn_row; the full row should only be assembled once the head loop is complete.
Suggested change

```diff
 for h in pl.parallel(0, NUM_HEADS_CFG, 1, chunk=8):
     q_col = h * QK_HEAD_DIM_CFG
     q_nope = pl.cast(
         pl.slice(q_proj, [1, QK_NOPE_HEAD_DIM_CFG], [b, q_col]),
         target_type=pl.FP32,
@@
         v_part = pl.matmul(pl.cast(ctx_latent, target_type=pl.BF16), wv_tile, out_dtype=pl.FP32)
         ctx_v = pl.assemble(ctx_v, v_part, [0, v0])
     attn_row = pl.assemble(attn_row, ctx_v, [0, v_col])
-    attn_front = pl.assemble(attn_front, attn_row, [b, 0])
+attn_front = pl.assemble(attn_front, attn_row, [b, 0])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_decode_front.py` around lines 390 -
452, The attn_front write inside the pl.parallel head loop causes concurrent
writes of the same [b,0] row; move the pl.assemble(attn_front, attn_row, [b, 0])
out of the parallel loop so attn_row is fully populated across all h before a
single write. Keep all per-head work (building q_nope/q_rot, oi/li/mi
accumulation, ctx_latent, ctx_v, and pl.assemble(attn_row, ctx_v, [0, v_col]))
inside the pl.parallel over h, but perform the final pl.assemble(attn_front,
attn_row, [b, 0]) only once after the pl.parallel on NUM_HEADS_CFG completes.
```python
    result = run(
        program=program,
        tensor_specs=tensor_specs,
        golden=None,
        config=RunConfig(
            platform=platform,
            device_id=device_id,
            rtol=2e-2,
            atol=2e-2,
            strategy=OptimizationStrategy.Default,
            dump_passes=dump_passes,
            backend_type=BackendType.CCE,
            work_dir=work_dir,
        ),
```
🧩 Analysis chain
🏁 Script executed:
```shell
find . -type f -name "*.py" | xargs grep -l "class RunConfig" | head -5
```
Repository: hw-native-sys/pypto
Length of output: 95

🏁 Script executed:
```shell
find . -type f -path "*/runner.py" | head -10
```
Repository: hw-native-sys/pypto
Length of output: 95

🏁 Script executed:
```shell
find . -type f -name "*runner*" -type f | grep -E "\.(py|pyx)$" | head -10
```
Repository: hw-native-sys/pypto
Length of output: 134

🏁 Script executed:
```shell
cat -n ./python/pypto/runtime/runner.py | head -200
```
Repository: hw-native-sys/pypto
Length of output: 9397

🏁 Script executed:
```shell
cat -n ./examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_back.py | sed -n '220,250p'
```
Repository: hw-native-sys/pypto
Length of output: 1354

🏁 Script executed:
```shell
grep -n "work_dir" ./python/pypto/runtime/runner.py | head -20
```
Repository: hw-native-sys/pypto
Length of output: 1076

🏁 Script executed:
```shell
cat -n ./python/pypto/runtime/runner.py | sed -n '200,230p'
```
Repository: hw-native-sys/pypto
Length of output: 1361
work_dir is not a valid RunConfig parameter.
RunConfig does not expose work_dir. Forward the path through save_kernels=True and save_kernels_dir=work_dir instead, which the run() function uses internally to determine where to save generated artefacts.
Suggested fix

```diff
     result = run(
         program=program,
         tensor_specs=tensor_specs,
         golden=None,
         config=RunConfig(
             platform=platform,
             device_id=device_id,
             rtol=2e-2,
             atol=2e-2,
             strategy=OptimizationStrategy.Default,
             dump_passes=dump_passes,
             backend_type=BackendType.CCE,
-            work_dir=work_dir,
+            save_kernels=True,
+            save_kernels_dir=work_dir,
         ),
     )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 result = run(
     program=program,
     tensor_specs=tensor_specs,
     golden=None,
     config=RunConfig(
         platform=platform,
         device_id=device_id,
         rtol=2e-2,
         atol=2e-2,
         strategy=OptimizationStrategy.Default,
         dump_passes=dump_passes,
         backend_type=BackendType.CCE,
-        work_dir=work_dir,
+        save_kernels=True,
+        save_kernels_dir=work_dir,
     ),
 )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_back.py` around lines 228 -
241, The RunConfig being constructed incorrectly includes a nonexistent
parameter work_dir; remove work_dir from the RunConfig(...) call and instead
enable kernel artifact saving by adding save_kernels=True and
save_kernels_dir=work_dir to the RunConfig passed to run(), ensuring run(...)
will save generated artefacts to the intended work_dir; update the RunConfig
construction referenced in the run(...) invocation accordingly.
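The failure mode is ordinary dataclass behavior: any keyword not declared as a field raises TypeError at construction time. A minimal sketch follows; the trimmed-down RunConfig here is a hypothetical stand-in (only save_kernels and save_kernels_dir are taken from the review), and it assumes the real RunConfig behaves like a plain dataclass.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, trimmed-down stand-in for pypto's RunConfig (illustration only).
@dataclass
class RunConfig:
    platform: str = "a2a3"
    save_kernels: bool = False
    save_kernels_dir: Optional[str] = None

# An undeclared field such as work_dir fails immediately at construction.
err = None
try:
    RunConfig(platform="a2a3", work_dir="/tmp/kernels")
except TypeError as exc:
    err = exc
print(type(err).__name__)  # TypeError

# The supported route: opt in to saving and pass the directory explicitly.
cfg = RunConfig(platform="a2a3", save_kernels=True, save_kernels_dir="/tmp/kernels")
```

This is why the example scripts crash before any compilation happens: the bad keyword is rejected while building the config, not during the run.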
result = run(
    program=program,
    tensor_specs=tensor_specs,
    golden=None,
    config=RunConfig(
        platform=platform,
        device_id=device_id,
        rtol=2e-2,
        atol=2e-2,
        strategy=OptimizationStrategy.Default,
        dump_passes=dump_passes,
        backend_type=BackendType.CCE,
        work_dir=work_dir,
    ),
🧩 Analysis chain
🏁 Script executed:

find . -name "runner.py" -type f | head -20

Repository: hw-native-sys/pypto
Length of output: 95

🏁 Script executed:

rg "class RunConfig" -A 50

Repository: hw-native-sys/pypto
Length of output: 4062
This RunConfig(...) call will fail immediately.
work_dir is not a supported RunConfig field. Use save_kernels=True plus save_kernels_dir=work_dir instead.
Suggested fix
result = run(
program=program,
tensor_specs=tensor_specs,
golden=None,
config=RunConfig(
platform=platform,
device_id=device_id,
rtol=2e-2,
atol=2e-2,
strategy=OptimizationStrategy.Default,
dump_passes=dump_passes,
backend_type=BackendType.CCE,
- work_dir=work_dir,
+ save_kernels=True,
+ save_kernels_dir=work_dir,
),
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples-lib/deepseek_v3_2/deepseek_v3_2_prefill_front.py` around lines 548 -
561, The RunConfig call currently passes an unsupported field work_dir which
causes an immediate failure; replace that parameter by setting save_kernels=True
and save_kernels_dir=work_dir within the RunConfig constructor (keeping
platform, device_id, rtol, atol, strategy, dump_passes, backend_type unchanged)
so the RunConfig creation succeeds; update the RunConfig invocation where
RunConfig(...) is used to include save_kernels and save_kernels_dir instead of
work_dir.
    platform: str = "a2a3",
    device_id: int = 11,
    work_dir: str | None = None,
    dump_passes: bool = True,
):
    from pypto.backend import BackendType
    from pypto.ir.pass_manager import OptimizationStrategy
    from pypto.runtime import RunConfig, run

    program = build_qwen3_single_layer_decode_program(
        batch=batch,
        max_seq_len=max_seq_len,
        hidden_size=hidden_size,
        num_heads=num_heads,
        num_kv_heads=num_kv_heads,
        head_dim=head_dim,
        intermediate_size=intermediate_size,
    )

    print(program)

    tensor_specs = build_tensor_specs(
        batch=batch,
        max_seq_len=max_seq_len,
        hidden_size=hidden_size,
        num_heads=num_heads,
        num_kv_heads=num_kv_heads,
        head_dim=head_dim,
        intermediate_size=intermediate_size,
    )

    result = run(
        program=program,
        tensor_specs=tensor_specs,
        golden=None,
        config=RunConfig(
            platform=platform,
            device_id=device_id,
            rtol=2e-2,
            atol=2e-2,
            strategy=OptimizationStrategy.Default,
            dump_passes=dump_passes,
            backend_type=BackendType.Ascend950,
        ),
🧩 Analysis chain
🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "qwen3_32b_decode.py compile_and_run defaults:"
sed -n '403,454p' examples-lib/qwen3/qwen3_32b_decode.py
echo
echo "PTO backend -> target_arch mapping:"
sed -n '165,177p' src/codegen/pto/pto_codegen.cpp

Repository: hw-native-sys/pypto
Length of output: 2072
The default RunConfig targets two different architectures.
This helper defaults platform="a2a3" but hard-codes BackendType.Ascend950. Since src/codegen/pto/pto_codegen.cpp maps Ascend950 to target_arch = "a5" and Ascend910B_PTO to "a2a3", the program is compiled for a different target than what the runner will attempt to execute on.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples-lib/qwen3/qwen3_32b_decode.py` around lines 411 - 454, The RunConfig
defaults are inconsistent: the function parameter platform (default "a2a3") does
not match the hard-coded BackendType.Ascend950 used in RunConfig, causing
compilation for a different target; update the call that constructs RunConfig
inside this function so backend_type is chosen to match platform (e.g., select
BackendType.Ascend910B_PTO when platform == "a2a3", or derive backend_type from
the platform variable), or change the platform default to match
BackendType.Ascend950 — adjust the code around the RunConfig(...) call and
ensure BackendType.Ascend950 and BackendType.Ascend910B_PTO are the referenced
symbols used for the mapping.
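Since the review traces Ascend950 to target_arch "a5" and Ascend910B_PTO to "a2a3", one way to keep the two defaults consistent is to derive the backend from the platform string. Below is a hedged sketch: the enum is a stand-in for the real pypto.backend.BackendType, and the mapping values follow the review's reading of src/codegen/pto/pto_codegen.cpp, not a verified API.

```python
from enum import Enum, auto

# Stand-in for pypto.backend.BackendType (illustration only); the arch
# strings mirror the mapping the review reads out of pto_codegen.cpp.
class BackendType(Enum):
    Ascend910B_PTO = auto()  # target_arch "a2a3"
    Ascend950 = auto()       # target_arch "a5"

_PLATFORM_TO_BACKEND = {
    "a2a3": BackendType.Ascend910B_PTO,
    "a5": BackendType.Ascend950,
}

def backend_for_platform(platform: str) -> BackendType:
    """Pick the backend matching the runner platform, or fail loudly."""
    try:
        return _PLATFORM_TO_BACKEND[platform]
    except KeyError:
        raise ValueError(f"no PTO backend known for platform {platform!r}") from None

print(backend_for_platform("a2a3").name)  # Ascend910B_PTO
```

Deriving backend_type this way removes the possibility of compiling for one target and running on another; an unexpected platform fails at config-build time instead of at device launch.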
examples-lib/qwen-src/qwen3-32b.py
Outdated
scores_valid = pl.slice(scores, [1, valid_len], [0, 0])
cur_mi = pl.cast(pl.row_max(scores_valid), target_type=pl.FP32)
exp_scores = pl.exp(pl.row_expand_sub(scores_valid, cur_mi))
cur_li = pl.cast(pl.row_sum(exp_scores), target_type=pl.FP32)
exp_pad = pl.create_tensor([1, SEQ_TILE], dtype=pl.FP32)
exp_pad = pl.mul(exp_pad, 0.0)
exp_pad = pl.assemble(exp_pad, exp_scores, [0, 0])
Keep the score tail slice physically [1, SEQ_TILE].
pl.slice(scores, [1, valid_len], ...) makes the tile width itself dynamic. PTO lowering has separate valid_shape plumbing for tail tiles; it is much safer to keep the physical score tile [1, SEQ_TILE] and carry valid_len via valid_shape, as examples-lib/qwen3/qwen3_32b_decode.py already does.
Suggested fix
- scores_valid = pl.slice(scores, [1, valid_len], [0, 0])
- cur_mi = pl.cast(pl.row_max(scores_valid), target_type=pl.FP32)
- exp_scores = pl.exp(pl.row_expand_sub(scores_valid, cur_mi))
+ scores_valid = pl.slice(
+ scores,
+ [1, SEQ_TILE],
+ [0, 0],
+ valid_shape=[1, valid_len],
+ )
+ scores_padded = pl.fillpad(scores_valid, pad_value=pl.PadValue.min)
+ cur_mi = pl.cast(pl.row_max(scores_padded), target_type=pl.FP32)
+ exp_scores = pl.exp(pl.row_expand_sub(scores_padded, cur_mi))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples-lib/qwen3/qwen3-32b.py` around lines 241 - 247, The pl.slice call
makes the tile width dynamic; instead, keep the physical tile size [1, SEQ_TILE]
and pass valid_len via valid_shape: replace the pl.slice(scores, [1, valid_len],
...) that produces scores_valid with a slice that always uses [1, SEQ_TILE] (so
scores_valid is physically shaped [1, SEQ_TILE]) and set its valid_shape to [1,
valid_len] (mirroring examples-lib/qwen3/qwen3_32b_decode.py); ensure downstream
uses (cur_mi, exp_scores, cur_li, exp_pad, etc.) operate on that fixed tile and
rely on valid_shape for correctness.
python/pypto/runtime/runner.py
Outdated
 # 4. Execute via Simpler's CodeRunner
-_execute_on_device(work_dir, golden_path, config.platform, config.device_id)
+if golden is not None:
+    _execute_on_device(work_dir, golden_path, config.platform, config.device_id)
codegen_only still doesn't stop execution.
RunConfig documents codegen_only=True as compile-only, but this guard only checks golden is not None. A caller that supplies a golden and sets codegen_only=True will still invoke CodeRunner.
💡 Suggested fix
- if golden is not None:
+ if golden is not None and not config.codegen_only:
      _execute_on_device(work_dir, golden_path, config.platform, config.device_id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@python/pypto/runtime/runner.py` around lines 233 - 235, The guard currently
only checks `golden is not None`, so `codegen_only` in `RunConfig` is ignored;
modify the conditional before calling `_execute_on_device` to also require `not
config.codegen_only`. In other words, locate the block where
`_execute_on_device(work_dir, golden_path, config.platform, config.device_id)`
is invoked and change the condition to ensure both `golden is not None` and
`config.codegen_only` is false (e.g., `if golden is not None and not
config.codegen_only:`) so callers with `codegen_only=True` will skip the
CodeRunner execution.
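The guard's truth table is small enough to model directly. A minimal sketch of the suggested condition follows; the RunConfig here is a stand-in with only the flag under discussion, not the real pypto class.

```python
from dataclasses import dataclass

# Minimal stand-in for RunConfig; only the codegen_only flag is modeled.
@dataclass
class RunConfig:
    codegen_only: bool = False

def should_execute(golden, config: RunConfig) -> bool:
    """True only when a golden exists AND the config allows device runs."""
    return golden is not None and not config.codegen_only

print(should_execute({"out": 0}, RunConfig()))                   # True
print(should_execute({"out": 0}, RunConfig(codegen_only=True)))  # False
print(should_execute(None, RunConfig()))                         # False
```

The middle case is exactly the one the current guard gets wrong: a caller that supplies a golden but asks for compile-only still reaches the device.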
bool NeedsLayoutRepair(const TileTypePtr& tile_type) {
  return tile_type && GetTileLayout(tile_type) != TileLayout::row_major;
}
Restore non-row-major outputs after generalized layout repair.
NeedsLayoutRepair() now rewrites every non-row-major input, but the assign path only reshapes back when the result is [N,1] col-major. Ordinary 2D col_major results will now be materialized as row-major and rebound to op->var_ without restoring their declared layout.
Suggested fix
- if (IsColumnVectorColMajor(result_tile_type)) {
+ if (NeedsLayoutRepair(result_tile_type)) {
auto row_major_var = std::make_shared<Var>(NextTempName(op->var_->name_hint_, {"row_major"}),
repaired_call->GetType(), call->span_);
rewritten.push_back(std::make_shared<AssignStmt>(row_major_var, repaired_call, op->span_));
auto reshape_back = CreateReshapeCall(row_major_var, result_tile_type->shape_, call->span_);

Also applies to: 176-186
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ir/transforms/resolve_backend_op_layouts_pass.cpp` around lines 94 - 96,
NeedsLayoutRepair currently rewrites every non-row-major input (it returns true
when GetTileLayout(tile_type) != TileLayout::row_major) but the assign/rebind
path only restores layout for the special [N,1] col-major case, causing ordinary
2D col_major results to be materialized as row-major and rebound to op->var_
without restoring their declared layout; update the assign/restore logic that
follows NeedsLayoutRepair so that after generalized layout repair you detect the
original tile layout (via GetTileLayout on the operand/TileType) and, when it
was col_major (or any non-row-major), reshape or rematerialize the result back
to that declared layout before assigning to op->var_ (i.e., generalize the
existing [N,1] reshape code to handle 2D col_major and other non-row-major
layouts and ensure any path that sets op->var_ preserves the original
TileLayout).
+int64_t max_slot_size = 0;
+bool found_any = false;
 for (const auto* direction : {&metadata.c2v, &metadata.v2c}) {
   if (!direction->has_ops) continue;
-  if (direction->has_inconsistent_slot_size || !direction->slot_size_bytes.has_value()) {
-    return std::nullopt;
-  }
-  if (!common_slot_size.has_value()) {
-    common_slot_size = direction->slot_size_bytes;
-    continue;
-  }
-  if (common_slot_size.value() != direction->slot_size_bytes.value()) {
-    return std::nullopt;
+  // When any observed size is non-static, we cannot infer a slot size.
+  if (direction->observed_slot_sizes.empty()) {
+    return std::nullopt;
+  }
+  for (int64_t sz : direction->observed_slot_sizes) {
+    max_slot_size = std::max(max_slot_size, sz);
+    found_any = true;
   }
 }
-return common_slot_size;
+return found_any ? std::optional<int64_t>(max_slot_size) : std::nullopt;
Mixed static/dynamic pipe ops can still produce a bogus slot size.
observed_slot_sizes.empty() only catches the case where all ops in a direction are non-static. If one tpush/tpop has a dynamic tile size and another has a static size, RecordTileSlotSize() marks the direction as having ops but only records the static size, so this function now returns a max slot size instead of nullopt. That can drive automatic pipe setup with an invalid slot size. Please track an explicit “unknown slot size seen” flag in PipeDirectionMetadata and bail out when any contributing direction has it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ir/transforms/utils/cross_core_pipe.cpp` around lines 130 - 143, The code
returns a max slot size even when a mix of static and dynamic tile sizes exists
because observed_slot_sizes.empty() only detects when all ops are dynamic; to
fix this add an explicit boolean flag (e.g. unknown_slot_size_seen) to
PipeDirectionMetadata, set that flag in RecordTileSlotSize() whenever a
dynamic/unknown tile size is observed, and update this function's loop over
metadata.c2v / metadata.v2c to bail out and return std::nullopt if any direction
has unknown_slot_size_seen (in addition to the existing empty check), otherwise
compute max_slot_size as before.
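The suggested flag can be prototyped quickly. Below is a Python model of the proposed bail-out; the field names mirror the hypothetical unknown_slot_size_seen from the review, not the actual C++ PipeDirectionMetadata struct.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustration-only mirror of a pipe direction's slot-size bookkeeping.
@dataclass
class PipeDirection:
    has_ops: bool = False
    observed_slot_sizes: List[int] = field(default_factory=list)  # static sizes only
    unknown_slot_size_seen: bool = False  # set when a dynamic tile size was observed

def infer_slot_size(directions) -> Optional[int]:
    """Max static slot size, or None if any active direction saw a dynamic size."""
    max_size, found_any = 0, False
    for d in directions:
        if not d.has_ops:
            continue
        # Bail out on mixed static/dynamic instead of silently taking the max.
        if d.unknown_slot_size_seen or not d.observed_slot_sizes:
            return None
        for sz in d.observed_slot_sizes:
            max_size = max(max_size, sz)
            found_any = True
    return max_size if found_any else None

mixed = PipeDirection(has_ops=True, observed_slot_sizes=[512], unknown_slot_size_seen=True)
print(infer_slot_size([mixed]))   # None: one static and one dynamic tpush/tpop

static = PipeDirection(has_ops=True, observed_slot_sizes=[256, 512])
print(infer_slot_size([static]))  # 512
```

The first case is the one the current code gets wrong: with one static and one dynamic op in the same direction, it would return 512 and drive automatic pipe setup with a slot size that may be too small.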
std::string ExtractBaseName(const std::string& name_hint) {
  auto pos = name_hint.find("__");
  return pos != std::string::npos ? name_hint.substr(0, pos) : name_hint;
}
Base-name matching can repair the wrong loop-carried value.
name_hint_ is only a cosmetic label, and ExtractBaseName() collapses every foo__* value to the same key. If a loop carries two distinct vars with the same stem—or a user-defined name already contains __—the first match wins and the dangling yield gets rewired to the wrong iter arg. Please derive this mapping from the enclosing loop’s return_vars_/iter_args_ correspondence or another identity-based relation instead of string matching.
Also applies to: 151-164
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ir/transforms/utils/loop_state_repair.cpp` around lines 114 - 117, The
current ExtractBaseName(const std::string& name_hint) uses string-prefix
matching and can map distinct loop-carried values incorrectly; replace the
name-hint based collapse with an identity-based mapping derived from the loop
node itself (use the loop's return_vars_/iter_args_ correspondence or the loop
AST/IR node that pairs yields to iter args) so keys are based on the actual
iter_arg identity rather than the stem of name_hint; update any code that calls
ExtractBaseName (including the other occurrences around the 151-164 region) to
look up the correct iter_arg via the loop's return_vars_/iter_args_ mapping and
only fall back to name_hint as a last resort.
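The collision is easy to reproduce with a line-for-line Python transcription of the quoted helper (the example names are hypothetical):

```python
def extract_base_name(name_hint: str) -> str:
    # Python transcription of ExtractBaseName: keep everything before the first "__".
    pos = name_hint.find("__")
    return name_hint[:pos] if pos != -1 else name_hint

# Two distinct loop-carried values with the same stem collapse to one key...
print(extract_base_name("acc__ssa_1"))  # acc
print(extract_base_name("acc__ssa_2"))  # acc
# ...and a user-chosen name that already contains "__" is silently truncated.
print(extract_base_name("my__state"))   # my
```

Because both SSA-suffixed values map to the same "acc" key, whichever iter arg registers first wins the lookup, which is exactly the mis-wiring the comment describes.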
🧹 Nitpick comments (2)
examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py (2)
618-620: grad_sink is computed but never used. The variable grad_sink is assigned the result of pl.mul(bwd_energy, 0.0) but is never referenced afterward. If this is intended as a dependency anchor to prevent bwd_energy from being optimized away, consider adding a brief comment explaining its purpose. Otherwise, this is dead code that can be removed.

💡 Suggestion: Add explanatory comment or remove

If intentional:

  bwd_energy = pl.add(bwd_energy, pl.row_sum(attn_w))
+ # Anchor to ensure bwd_energy computation is retained
  grad_sink = pl.mul(bwd_energy, 0.0)

If unintentional dead code:

  bwd_energy = pl.add(bwd_energy, pl.row_sum(attn_w))
- grad_sink = pl.mul(bwd_energy, 0.0)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py` around lines 618 - 620, The variable grad_sink is created as pl.mul(bwd_energy, 0.0) but never used; either remove this dead code or make its intent explicit as a dependency anchor. Locate the block where bwd_energy is accumulated (references: bwd_energy, contrib, attn_w, pl.row_sum) and either delete the grad_sink assignment or replace it with a clear comment stating it exists to create a side-effect/dependency to prevent optimization (e.g., “// dependency anchor to keep bwd_energy live”), ensuring no other code expects grad_sink to be consumed.
1261-1264: Fragile error detection heuristic. Checking "code_runner" in result.error to detect a missing CodeRunner is brittle; error message wording may change. Consider using a more robust detection mechanism, such as checking for a specific exception type or a dedicated flag in RunResult.

💡 Suggestion: Use a more specific check

If a specific exception type is available:

- if not result.passed and result.error and "code_runner" in result.error:
+ if not result.passed and result.error and "CodeRunner" in result.error:
      print("Result: COMPILE OK — device run skipped (code_runner not found).")

Or better, if the runner can be updated to set a flag:

  if not result.passed and getattr(result, 'compile_only', False):
      print("Result: COMPILE OK — device run skipped.")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py` around lines 1261 - 1264, The current check uses a brittle substring match on result.error; change it to a robust signal on the RunResult object instead: have the runner populate a boolean flag (e.g., RunResult.compile_only) or set a specific exception type on result.error, then update the condition here to use getattr(result, 'compile_only', False) (or isinstance(result.error, SpecificCompileOnlyException)) and print the "COMPILE OK — device run skipped" message based on that flag/type; update any code that constructs RunResult in the runner to set compile_only when compilation succeeds but device run is intentionally skipped.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py`:
- Around line 618-620: The variable grad_sink is created as pl.mul(bwd_energy,
0.0) but never used; either remove this dead code or make its intent explicit as
a dependency anchor. Locate the block where bwd_energy is accumulated
(references: bwd_energy, contrib, attn_w, pl.row_sum) and either delete the
grad_sink assignment or replace it with a clear comment stating it exists to
create a side-effect/dependency to prevent optimization (e.g., “// dependency
anchor to keep bwd_energy live”), ensuring no other code expects grad_sink to be
consumed.
- Around line 1261-1264: The current check uses a brittle substring match on
result.error; change it to a robust signal on the RunResult object instead: have
the runner populate a boolean flag (e.g., RunResult.compile_only) or set a
specific exception type on result.error, then update the condition here to use
getattr(result, 'compile_only', False) (or isinstance(result.error,
SpecificCompileOnlyException)) and print the "COMPILE OK — device run skipped"
message based on that flag/type; update any code that constructs RunResult in
the runner to set compile_only when compilation succeeds but device run is
intentionally skipped.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 039bcb0d-d6c9-4d68-9d3d-dc2c42cbd301
📒 Files selected for processing (1)
examples-lib/qwen3/qwen3_32b_training_forward_and_backward_src.py
Added more qwen3 models from hw-native-sys/pypto-lib/examples/models/qwen3, modified to support the pypto compiler. The modified files are suffixed with *_new.py.
4aba28e to 3ae9bf7
Branch rebased to origin, conflicts fixed. pto_codegen.cpp modification: branch on param direction and function name suffix for out column vectors.
Modified the pypto IR framework to compile qwen3_32b_prefill; please check the changes for any issues. A summary of the changes is in CHANGES_SUMMARY.txt; the modifications were made with Cursor (Opus 4.6).