
[Pass Bug] ConvertTensorToTileOps fails to propagate TensorView stride for 3D+ tensor slice outputs #950

@wangqin1723-max

Pass Name

ConvertTensorToTileOps

Description

When pl.assemble is placed outside a with pl.incore() block in an Opaque function, OutlineIncoreScopes extracts the incore body into a separate function and creates a temporary ret0__out parameter for the result. Later, FuseCreateAssembleToSlice correctly fuses the orchestration-level create + assemble into a pl.tensor.slice of the target GM tensor.

However, ConvertTensorToTileOps fails to propagate the slice's stride information back into the incore function's ret0__out parameter type when the target tensor is 3D or higher.

  • 2D target tensor (decode, q_proj[16, 8192]): ret0__out correctly gets TensorView(stride=[8192, 1]) → kernel TSTORE writes to the correct GM address.
  • 3D target tensor (prefill, q_proj[16, 128, 5120]): ret0__out gets no TensorView → kernel TSTORE uses compact stride [64, 1] instead of the real stride [5120, 1], writing data to wrong addresses → ~47% element mismatch.
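
The practical effect of the missing stride can be illustrated with plain address arithmetic (a minimal sketch in ordinary Python; `offset` is a hypothetical helper for illustration, not part of the pass):

```python
# Sketch: linear GM element offsets for element (i, j) of the [16, 64] output
# tile, under the stride the kernel actually uses vs. the stride it needs.
# The numbers come from the 3D prefill case above (q_proj[16, 128, 5120]).

def offset(i, j, row_stride):
    # Element offset: row index times row stride, plus column index.
    return i * row_stride + j

compact = 64   # stride the generated kernel uses (tile treated as dense [16, 64])
real = 5120    # stride of a [16, 64] slice taken from the [..., 5120] tensor

# Row 0 lands at the same address either way; every later row diverges.
assert offset(0, 63, compact) == offset(0, 63, real) == 63
assert offset(1, 0, compact) == 64     # wrong: lands in columns 64..127 of GM row 0
assert offset(1, 0, real) == 5120      # correct: start of the next GM row
```

Only the first tile row is written to the right place, which is consistent with a large but partial element mismatch rather than a total one.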

This does not occur when pl.assemble is inside with pl.incore(), because OutlineIncoreScopes then passes the target tensor and the assemble indices as incore function parameters, so the store writes straight to the correct GM location.

Git Commit ID

066b194

Before IR (Input)

```python
# Pass 09_after_OutlineIncoreScopes — incore_1 returns a small tile,
# assemble is in orchestration scope (3D target tensor case)
import pypto.language as pl

@pl.program
class PrefillProjectionProgram:
    @pl.function(type=pl.FunctionType.InCore)
    def prefill_projection_incore_1(
        self,
        normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
        q0__ssa_v0: pl.Scalar[pl.INDEX],
        wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ) -> pl.Tensor[[16, 64], pl.FP32]:
        # ... matmul logic ...
        return q_acc__rv_v2

    @pl.function(type=pl.FunctionType.Orchestration)
    def prefill_projection(self, ...):
        for ob__idx_v0 in pl.range(80):
            q0__ssa_v0 = ob__idx_v0 * 64
            q_acc__rv_v2 = self.prefill_projection_incore_1(
                normed_tile__rv_v2, q0__ssa_v0, wq__ssa_v0)
            # assemble is OUTSIDE incore — in orchestration
            q_proj__ssa_v7 = pl.tensor.assemble(
                q_proj__iter_v5, q_acc__rv_v2,
                [b__idx_v0, p0__ssa_v0, q0__ssa_v0])
```

Expected IR (After Transformation)

```python
# After ConvertTensorToTileOps — ret0__out SHOULD have TensorView with
# the real stride from the 3D q_proj tensor slice
@pl.function(type=pl.FunctionType.InCore)
def prefill_projection_incore_1(
    self,
    normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
    q0__ssa_v0: pl.Scalar[pl.INDEX],
    wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ret0__out: pl.Out[pl.Tensor[[16, 64], pl.FP32,
        pl.TensorView(stride=[5120, 1], layout=pl.TensorLayout.ND)]],
) -> pl.Tensor[[16, 64], pl.FP32,
        pl.TensorView(stride=[5120, 1], layout=pl.TensorLayout.ND)]:
    # ...
    ret0__store = pl.tile.store(q_acc__rv_v2, [0, 0], ret0__out)
    return ret0__store
```

Actual IR or Error

```python
# After ConvertTensorToTileOps — ret0__out has NO TensorView stride!
@pl.function(type=pl.FunctionType.InCore)
def prefill_projection_incore_1(
    self,
    normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
    q0__ssa_v0: pl.Scalar[pl.INDEX],
    wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ret0__out: pl.Out[pl.Tensor[[16, 64], pl.FP32]],  # <-- missing TensorView!
) -> pl.Tensor[[16, 64], pl.FP32]:
    # ...
    ret0__store = pl.tile.store(q_acc__rv_v2, [0, 0], ret0__out)
    return ret0__store
```

This causes the generated C++ kernel to use compact stride [64, 1]:

```cpp
// WRONG: Stride<1024, 1024, 1024, 64, 1> — compact [16,64] layout
GlobalTensor<float, Shape<1,1,1,16,64>, Stride<1024,1024,1024,64,1>> v40 = ...;
TSTORE(v40, v24);
```

Instead of the correct stride from q_proj [16, 128, 5120]:

```cpp
// CORRECT: Stride<655360, 655360, 655360, 5120, 1> — real q_proj stride
GlobalTensor<float, Shape<1,1,1,16,64>, Stride<655360,655360,655360,5120,1>> v42 = ...;
TSTORE(v42, v26);
```

Result: AssertionError: Output 'q_proj' does not match golden. Mismatched elements: 4948885/10485760
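
The stride values in the two kernels can be cross-checked with a small row-major stride calculator (a sketch; `compact_strides` is a hypothetical helper, not PyPTO API):

```python
def compact_strides(shape):
    # Row-major (compact) strides: stride[i] is the product of all dims to the right.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

# Wrong kernel: tile laid out densely as [16, 64] -> trailing strides [64, 1];
# the leading padded strides are 16*64 = 1024, matching Stride<1024,1024,1024,64,1>.
assert compact_strides([16, 64]) == [64, 1]

# Correct kernel: strides inherited from q_proj [16, 128, 5120],
# where 128*5120 = 655360 matches Stride<655360,655360,655360,5120,1>.
assert compact_strides([16, 128, 5120]) == [655360, 5120, 1]

# Working 2D decode case (see Additional Context): q_proj[16, 8192] -> [8192, 1].
assert compact_strides([16, 8192]) == [8192, 1]
```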

NPU Kind

Ascend 910C

Host Platform

Linux (aarch64)

Additional Context

Reproduction: examples/models/qwen3/qwen3_32b_prefill_scope1.py — move pl.assemble for Q/K/V projections outside the with pl.incore() blocks. The 2D decode equivalent (qwen3_32b_decode_scope1.py) works correctly with assemble outside incore.

Working case for comparison: The 2D decode scope1 (target tensor q_proj[16, 8192]) correctly produces TensorView(stride=[8192, 1]) on ret0__out. The bug is specific to 3D+ target tensors.

Workaround: Keep pl.assemble inside with pl.incore() for 3D+ tensor targets.

Related pass files:

  • src/ir/transforms/convert_tensor_to_tile_ops_pass.cpp — stride propagation logic
  • src/ir/transforms/fuse_create_assemble_to_slice_pass.cpp — slice creation
  • src/ir/transforms/outline_incore_scopes_pass.cpp — incore function extraction

Possibly related closed issue: #899 (stride computation for assemble outputs, but that was for 2D tensors)

Metadata

Labels: bug