Pass Name
ConvertTensorToTileOps
Description
When pl.assemble is placed outside a with pl.incore() block in an Opaque function, OutlineIncoreScopes extracts the incore body into a separate function and creates a temporary ret0__out parameter for the result. Later, FuseCreateAssembleToSlice correctly fuses the orchestration-level create + assemble into a pl.tensor.slice of the target GM tensor.
However, ConvertTensorToTileOps fails to propagate the slice's stride information back into the incore function's ret0__out parameter type when the target tensor is 3D or higher.
- 2D target tensor (decode, q_proj[16, 8192]): ret0__out correctly gets TensorView(stride=[8192, 1]) → kernel TSTORE writes to the correct GM address.
- 3D target tensor (prefill, q_proj[16, 128, 5120]): ret0__out gets no TensorView → kernel TSTORE uses the compact stride [64, 1] instead of the real stride [5120, 1], writing data to wrong addresses → ~47% element mismatch.
This does not occur when pl.assemble is inside with pl.incore(), because OutlineIncoreScopes directly includes the target tensor and indices in the incore function parameters, and the store goes directly to the correct GM location.
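As a rough illustration of why the tile parameter needs the parent tensor's stride (plain Python arithmetic, not the pl API — `row_major_strides` is a hypothetical helper), row-major strides are the running products of the trailing dimensions, which is where the [8192, 1], [5120, 1], and compact [64, 1] values above come from:

```python
def row_major_strides(shape):
    """Row-major (ND) strides: each stride is the product of all trailing dims."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

# 2D decode target q_proj[16, 8192]: a tile row is 8192 elements apart in GM.
assert row_major_strides([16, 8192]) == [8192, 1]

# 3D prefill target q_proj[16, 128, 5120]: a [16, 64] slice over the last two
# dims must step 5120 elements per row (and 655360 per outermost index).
assert row_major_strides([16, 128, 5120]) == [655360, 5120, 1]

# The compact layout the buggy path falls back to (strides of the tile itself):
assert row_major_strides([16, 64]) == [64, 1]
```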
Git Commit ID
066b194
Before IR (Input)
# Pass 09_after_OutlineIncoreScopes — incore_1 returns a small tile,
# assemble is in orchestration scope (3D target tensor case)
import pypto.language as pl

@pl.program
class PrefillProjectionProgram:
    @pl.function(type=pl.FunctionType.InCore)
    def prefill_projection_incore_1(
        self,
        normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
        q0__ssa_v0: pl.Scalar[pl.INDEX],
        wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ) -> pl.Tensor[[16, 64], pl.FP32]:
        # ... matmul logic ...
        return q_acc__rv_v2

    @pl.function(type=pl.FunctionType.Orchestration)
    def prefill_projection(self, ...):
        for ob__idx_v0 in pl.range(80):
            q0__ssa_v0 = ob__idx_v0 * 64
            q_acc__rv_v2 = self.prefill_projection_incore_1(
                normed_tile__rv_v2, q0__ssa_v0, wq__ssa_v0)
            # assemble is OUTSIDE incore — in orchestration
            q_proj__ssa_v7 = pl.tensor.assemble(
                q_proj__iter_v5, q_acc__rv_v2,
                [b__idx_v0, p0__ssa_v0, q0__ssa_v0])
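For reference, the tiling arithmetic of the orchestration loop above checks out (plain Python, just reproducing the loop bounds from the IR): 80 iterations, each producing a [16, 64] tile at column offset ob__idx_v0 * 64, together exactly covering the 5120-wide last dimension of q_proj.

```python
# Column offsets produced by the loop: q0__ssa_v0 = ob__idx_v0 * 64
offsets = [ob * 64 for ob in range(80)]

assert offsets[0] == 0
assert offsets[-1] == 5056
assert offsets[-1] + 64 == 5120  # the 80 tiles exactly tile the last dim
```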
Expected IR (After Transformation)
# After ConvertTensorToTileOps — ret0__out SHOULD have TensorView with
# the real stride from the 3D q_proj tensor slice
@pl.function(type=pl.FunctionType.InCore)
def prefill_projection_incore_1(
    self,
    normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
    q0__ssa_v0: pl.Scalar[pl.INDEX],
    wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ret0__out: pl.Out[pl.Tensor[[16, 64], pl.FP32,
        pl.TensorView(stride=[5120, 1], layout=pl.TensorLayout.ND)]],
) -> pl.Tensor[[16, 64], pl.FP32,
        pl.TensorView(stride=[5120, 1], layout=pl.TensorLayout.ND)]:
    # ...
    ret0__store = pl.tile.store(q_acc__rv_v2, [0, 0], ret0__out)
    return ret0__store
Actual IR or Error
# After ConvertTensorToTileOps — ret0__out has NO TensorView stride!
@pl.function(type=pl.FunctionType.InCore)
def prefill_projection_incore_1(
    self,
    normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
    q0__ssa_v0: pl.Scalar[pl.INDEX],
    wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ret0__out: pl.Out[pl.Tensor[[16, 64], pl.FP32]],  # <-- missing TensorView!
) -> pl.Tensor[[16, 64], pl.FP32]:
    # ...
    ret0__store = pl.tile.store(q_acc__rv_v2, [0, 0], ret0__out)
    return ret0__store
This causes the generated C++ kernel to use compact stride [64, 1]:
// WRONG: Stride<1024, 1024, 1024, 64, 1> — compact [16,64] layout
GlobalTensor<float, Shape<1,1,1,16,64>, Stride<1024,1024,1024,64,1>> v40 = ...;
TSTORE(v40, v24);
Instead of the correct stride from q_proj [16, 128, 5120]:
// CORRECT: Stride<655360, 655360, 655360, 5120, 1> — real q_proj stride
GlobalTensor<float, Shape<1,1,1,16,64>, Stride<655360,655360,655360,5120,1>> v42 = ...;
TSTORE(v42, v26);
Result: AssertionError: Output 'q_proj' does not match golden. Mismatched elements: 4948885/10485760
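A quick sketch of how the wrong stride corrupts GM addresses (plain Python; `gm_offset` is a hypothetical helper, assuming the row-major addressing shown in the generated kernel above): with the compact stride, only row 0 of each [16, 64] tile lands where it should; every later row is displaced by row * (5120 - 64) elements.

```python
def gm_offset(row, col, row_stride):
    """Flat GM offset of a tile element under a given row stride."""
    return row * row_stride + col

# Row 1, column 0 of the tile:
assert gm_offset(1, 0, 64) == 64      # buggy compact stride -> wrong address
assert gm_offset(1, 0, 5120) == 5120  # real q_proj stride -> correct address

# Count tile elements that land at a wrong address: all rows except row 0.
mismatched = sum(
    gm_offset(r, c, 64) != gm_offset(r, c, 5120)
    for r in range(16) for c in range(64)
)
assert mismatched == 15 * 64
```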
NPU Kind
Ascend 910C
Host Platform
Linux (aarch64)
Additional Context
Reproduction: examples/models/qwen3/qwen3_32b_prefill_scope1.py — move pl.assemble for Q/K/V projections outside the with pl.incore() blocks. The 2D decode equivalent (qwen3_32b_decode_scope1.py) works correctly with assemble outside incore.
Working case for comparison: The 2D decode scope1 (target tensor q_proj[16, 8192]) correctly produces TensorView(stride=[8192, 1]) on ret0__out. The bug is specific to 3D+ target tensors.
Workaround: Keep pl.assemble inside with pl.incore() for 3D+ tensor targets.
Related pass files:
- src/ir/transforms/convert_tensor_to_tile_ops_pass.cpp — stride propagation logic
- src/ir/transforms/fuse_create_assemble_to_slice_pass.cpp — slice creation
- src/ir/transforms/outline_incore_scopes_pass.cpp — incore function extraction
Possibly related closed issue: #899 (stride computation for assemble outputs, but that was for 2D tensors)