
[Pass Bug] ConvertTensorToTileOps fails to propagate TensorView stride for 3D+ tensor slice outputs #950

@wangqin1723-max

Pass Name

ConvertTensorToTileOps

Description

When pl.assemble is placed outside a with pl.incore() block in an Opaque function, OutlineIncoreScopes extracts the incore body into a separate function and creates a temporary ret0__out parameter for the result. Later, FuseCreateAssembleToSlice correctly fuses the orchestration-level create + assemble into a pl.tensor.slice of the target GM tensor.

However, ConvertTensorToTileOps fails to propagate the slice's stride information back into the incore function's ret0__out parameter type when the target tensor is 3D or higher.

  • 2D target tensor (decode, q_proj[16, 8192]): ret0__out correctly gets TensorView(stride=[8192, 1]) → kernel TSTORE writes to the correct GM address.
  • 3D target tensor (prefill, q_proj[16, 128, 5120]): ret0__out gets no TensorView → kernel TSTORE uses compact stride [64, 1] instead of the real stride [5120, 1], writing data to wrong addresses → ~47% element mismatch.
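
The practical effect of the missing stride can be illustrated with plain address arithmetic (a minimal sketch in ordinary Python; `offset` is a hypothetical helper for illustration, not part of the pass):

```python
# Sketch: linear GM element offsets for element (i, j) of the [16, 64] output
# tile, under the stride the kernel actually uses vs. the stride it needs.
# The numbers come from the 3D prefill case above (q_proj[16, 128, 5120]).

def offset(i, j, row_stride):
    # Element offset: row index times row stride, plus column index.
    return i * row_stride + j

compact = 64   # stride the generated kernel uses (tile treated as dense [16, 64])
real = 5120    # stride of a [16, 64] slice taken from the [..., 5120] tensor

# Row 0 lands at the same address either way; every later row diverges.
assert offset(0, 63, compact) == offset(0, 63, real) == 63
assert offset(1, 0, compact) == 64     # wrong: lands in columns 64..127 of GM row 0
assert offset(1, 0, real) == 5120      # correct: start of the next GM row
```

Only the first tile row is written to the right place, which is consistent with a large but partial element mismatch rather than a total one.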

This does not occur when pl.assemble is inside with pl.incore(), because OutlineIncoreScopes then passes the target tensor and the assemble indices as incore function parameters, so the store writes straight to the correct GM location.

Git Commit ID

066b194

Before IR (Input)

```python
# Pass 09_after_OutlineIncoreScopes — incore_1 returns a small tile,
# assemble is in orchestration scope (3D target tensor case)
import pypto.language as pl

@pl.program
class PrefillProjectionProgram:
    @pl.function(type=pl.FunctionType.InCore)
    def prefill_projection_incore_1(
        self,
        normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
        q0__ssa_v0: pl.Scalar[pl.INDEX],
        wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ) -> pl.Tensor[[16, 64], pl.FP32]:
        # ... matmul logic ...
        return q_acc__rv_v2

    @pl.function(type=pl.FunctionType.Orchestration)
    def prefill_projection(self, ...):
        for ob__idx_v0 in pl.range(80):
            q0__ssa_v0 = ob__idx_v0 * 64
            q_acc__rv_v2 = self.prefill_projection_incore_1(
                normed_tile__rv_v2, q0__ssa_v0, wq__ssa_v0)
            # assemble is OUTSIDE incore — in orchestration
            q_proj__ssa_v7 = pl.tensor.assemble(
                q_proj__iter_v5, q_acc__rv_v2,
                [b__idx_v0, p0__ssa_v0, q0__ssa_v0])
```

Expected IR (After Transformation)

```python
# After ConvertTensorToTileOps — ret0__out SHOULD have TensorView with
# the real stride from the 3D q_proj tensor slice
@pl.function(type=pl.FunctionType.InCore)
def prefill_projection_incore_1(
    self,
    normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
    q0__ssa_v0: pl.Scalar[pl.INDEX],
    wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ret0__out: pl.Out[pl.Tensor[[16, 64], pl.FP32,
        pl.TensorView(stride=[5120, 1], layout=pl.TensorLayout.ND)]],
) -> pl.Tensor[[16, 64], pl.FP32,
        pl.TensorView(stride=[5120, 1], layout=pl.TensorLayout.ND)]:
    # ...
    ret0__store = pl.tile.store(q_acc__rv_v2, [0, 0], ret0__out)
    return ret0__store
```

Actual IR or Error

```python
# After ConvertTensorToTileOps — ret0__out has NO TensorView stride!
@pl.function(type=pl.FunctionType.InCore)
def prefill_projection_incore_1(
    self,
    normed_tile__rv_v2: pl.Tensor[[16, 5120], pl.BF16],
    q0__ssa_v0: pl.Scalar[pl.INDEX],
    wq__ssa_v0: pl.Tensor[[5120, 5120], pl.BF16],
    ret0__out: pl.Out[pl.Tensor[[16, 64], pl.FP32]],  # <-- missing TensorView!
) -> pl.Tensor[[16, 64], pl.FP32]:
    # ...
    ret0__store = pl.tile.store(q_acc__rv_v2, [0, 0], ret0__out)
    return ret0__store
```

This causes the generated C++ kernel to use compact stride [64, 1]:

```cpp
// WRONG: Stride<1024, 1024, 1024, 64, 1> — compact [16,64] layout
GlobalTensor<float, Shape<1,1,1,16,64>, Stride<1024,1024,1024,64,1>> v40 = ...;
TSTORE(v40, v24);
```

Instead of the correct stride from q_proj [16, 128, 5120]:

```cpp
// CORRECT: Stride<655360, 655360, 655360, 5120, 1> — real q_proj stride
GlobalTensor<float, Shape<1,1,1,16,64>, Stride<655360,655360,655360,5120,1>> v42 = ...;
TSTORE(v42, v26);
```

Result: AssertionError: Output 'q_proj' does not match golden. Mismatched elements: 4948885/10485760
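
The stride values in the two kernels can be cross-checked with a small row-major stride calculator (a sketch; `compact_strides` is a hypothetical helper, not PyPTO API):

```python
def compact_strides(shape):
    # Row-major (compact) strides: stride[i] is the product of all dims to the right.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

# Wrong kernel: tile laid out densely as [16, 64] -> trailing strides [64, 1];
# the leading padded strides are 16*64 = 1024, matching Stride<1024,1024,1024,64,1>.
assert compact_strides([16, 64]) == [64, 1]

# Correct kernel: strides inherited from q_proj [16, 128, 5120],
# where 128*5120 = 655360 matches Stride<655360,655360,655360,5120,1>.
assert compact_strides([16, 128, 5120]) == [655360, 5120, 1]

# Working 2D decode case (see Additional Context): q_proj[16, 8192] -> [8192, 1].
assert compact_strides([16, 8192]) == [8192, 1]
```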

NPU Kind

Ascend 910C

Host Platform

Linux (aarch64)

Additional Context

Reproduction: examples/models/qwen3/qwen3_32b_prefill_scope1.py — move pl.assemble for Q/K/V projections outside the with pl.incore() blocks. The 2D decode equivalent (qwen3_32b_decode_scope1.py) works correctly with assemble outside incore.

Working case for comparison: The 2D decode scope1 (target tensor q_proj[16, 8192]) correctly produces TensorView(stride=[8192, 1]) on ret0__out. The bug is specific to 3D+ target tensors.

Workaround: Keep pl.assemble inside with pl.incore() for 3D+ tensor targets.

Related pass files:

  • src/ir/transforms/convert_tensor_to_tile_ops_pass.cpp — stride propagation logic
  • src/ir/transforms/fuse_create_assemble_to_slice_pass.cpp — slice creation
  • src/ir/transforms/outline_incore_scopes_pass.cpp — incore function extraction

Possibly related closed issue: #899 (stride computation for assemble outputs, but that was for 2D tensors)

Metadata

Labels: bug