Commit 0ccdea1

Merge main into qwen tilelet branch and resolve opcode conflicts
2 parents 1764530 + 33f371d commit 0ccdea1

18 files changed

Lines changed: 1037 additions & 88 deletions

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -293,7 +293,7 @@ jobs:
 # suite (RUN_ONLY_CASES is empty), skip the non-matching variant based
 # on SOC_VERSION to keep the remote validation portable.
 A3_ONLY_CASES="partition5d,partition5d_dynamic,mrgsort,tmatmulk_autosync"
-A5_ONLY_CASES="partition5d_a5,partition5d_dynamic_a5,mrgsort_a5,tmatmulk_autosync_a5"
+A5_ONLY_CASES="partition5d_a5,partition5d_dynamic_a5,mrgsort_a5,tmatmulk_autosync_a5,tpack"

 sv_lc="$(printf '%s' "${SOC_VERSION}" | tr '[:upper:]' '[:lower:]')"
 is_a5=0

docs/PTO_IR_manual.md

Lines changed: 198 additions & 21 deletions
@@ -707,6 +707,103 @@ pto.tload ins(%pv : !pto.partition_tensor_view<16x16xf16>)

 ---

+##### `pto.tprefetch` - Prefetch Partition View into Tile
+
+**Summary:** Prefetches a GM-backed partition view into a temporary local tile buffer. This maps to PTO-ISA `TPREFETCH(dst, src)` and, unlike most PTO intrinsics, does not add implicit wait-event synchronization in the C++ wrapper.
+
+**Semantics:**
+
+```
+TPREFETCH(dst, src)
+```
+
+The detailed caching / hint behavior is target-defined by PTO-ISA. In PTOAS the
+op is modeled as writing the prefetched data into `dst`.
+
+**Arguments:**
+
+| Name | Type | Description |
+|------|------|-------------|
+| `src` | `pto.partition_tensor_view` or lowered GM memref | Source global view |
+| `dst` | `pto.tile_buf` or lowered local memref | Destination local tile |
+
+**Results:** None. Writes into `dst` via DPS pattern.
+
+**Constraints & Verification:**
+
+- `src` must be a partition view before lowering, or the corresponding lowered ranked memref form after `PTOViewToMemref`.
+- `dst` must be a tile buffer before lowering, or the corresponding lowered ranked memref form after `PTOViewToMemref`.
+- `dst` must use `loc=vec` or `loc=mat`.
+- Static source extents and static destination valid extents must be positive when known.
+- `src` and `dst` element types must have the same element size in bytes.
+
+**Hardware Mapping:**
+
+- Executes on the **DMA pipeline** (`PIPE_MTE2`, GM -> local tile)
+
+**Basic Example:**
+
+```mlir
+pto.tprefetch ins(%pv : !pto.partition_tensor_view<16x16xf16>)
+outs(%tb : !pto.tile_buf<loc=vec, dtype=f16, rows=16, cols=16,
+v_row=16, v_col=16, blayout=row_major, slayout=none_box,
+fractal=512, pad=0>)
+```
+
+---
+
+##### `pto.tpack` - Pack a Wider Vec Tile into a Narrower Vec Tile
+
+**Summary:** A5-only vector packing operation that narrows a source VEC tile into
+a destination VEC tile of the same valid shape.
+
+**Semantics:**
+
+```
+TPACK(dst, src)
+```
+
+Supported packing directions follow PTO-ISA:
+- `b32 -> b16`
+- `b32 -> b8`
+- `b16 -> b8`
+
+**Arguments:**
+
+| Name | Type | Description |
+|------|------|-------------|
+| `src` | `pto.tile_buf` | Source VEC tile with wider element type |
+| `dst` | `pto.tile_buf` | Destination VEC tile with narrower element type |
+
+**Results:** None. Writes into `dst` via DPS pattern.
+
+**Constraints & Verification:**
+
+- `pto.tpack` is only supported on **A5** targets.
+- `src` and `dst` must both be VEC tiles (`loc=vec`) with row-major layout.
+- `src` and `dst` must have the same `valid_shape`.
+- Supported element-size pairs are exactly:
+  - `4 -> 2` bytes
+  - `4 -> 1` bytes
+  - `2 -> 1` bytes
+
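The `pto.tpack` constraints listed above can be restated as a small checker. This is only a sketch in Python, not the PTOAS verifier: `TileDesc`, `verify_tpack`, the `ELEM_BYTES` table, and the target spelling `"a5"` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical byte-width table for illustration only; a real verifier would
# derive the width from the MLIR element type.
ELEM_BYTES = {"i8": 1, "i16": 2, "f16": 2, "bf16": 2, "i32": 4, "f32": 4}

# The three (src_bytes, dst_bytes) pairs listed in the manual.
SUPPORTED_PAIRS = {(4, 2), (4, 1), (2, 1)}


@dataclass(frozen=True)
class TileDesc:
    loc: str
    dtype: str
    blayout: str
    valid_shape: Tuple[int, int]


def verify_tpack(src: TileDesc, dst: TileDesc, target: str) -> List[str]:
    """Return the violated constraints; an empty list means the op is accepted."""
    errors = []
    if target.lower() != "a5":
        errors.append("pto.tpack is only supported on A5 targets")
    for name, tile in (("src", src), ("dst", dst)):
        if tile.loc != "vec":
            errors.append(f"{name} must use loc=vec")
        if tile.blayout != "row_major":
            errors.append(f"{name} must use blayout=row_major")
    if src.valid_shape != dst.valid_shape:
        errors.append("src and dst must have the same valid_shape")
    pair = (ELEM_BYTES[src.dtype], ELEM_BYTES[dst.dtype])
    if pair not in SUPPORTED_PAIRS:
        errors.append(f"unsupported element-size pair {pair[0]} -> {pair[1]} bytes")
    return errors


# Mirrors the manual's basic example: a 128x128 i32 tile packed into i16.
src = TileDesc("vec", "i32", "row_major", (128, 128))
dst = TileDesc("vec", "i16", "row_major", (128, 128))
print(verify_tpack(src, dst, "a5"))
```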
+**Hardware Mapping:**
+
+- Executes on the **Vector pipeline** (`PIPE_V`)
+
+**Basic Example:**
+
+```mlir
+pto.tpack ins(%src : !pto.tile_buf<loc=vec, dtype=i32, rows=128, cols=128,
+v_row=128, v_col=128, blayout=row_major, slayout=none_box,
+fractal=512, pad=0>)
+outs(%dst : !pto.tile_buf<loc=vec, dtype=i16, rows=128, cols=128,
+v_row=128, v_col=128, blayout=row_major, slayout=none_box,
+fractal=512, pad=0>)
+```
+
+---
+
 ##### `pto.tstore` - Store Tile to Partition View

 **Summary:** Stores a 2-D tile buffer back to a 2-D partition view. Supports phase/atomic/relu/pre-quant controls that lower to the corresponding `TSTORE` template overload family.
@@ -1801,6 +1898,7 @@ Division-by-zero behavior is target-defined.
 |------|------|-------------|
 | `src0` | `pto.tile_buf` | Dividend tile buffer |
 | `src1` | `pto.tile_buf` | Divisor tile buffer |
+| `tmp` | `pto.tile_buf` | Temporary tile buffer required by the ISA API |
 | `dst` | `pto.tile_buf` | Destination tile buffer |

 **Results:** None. Writes into `dst` via DPS pattern.
@@ -1996,23 +2094,25 @@ For each element (i, j):
 **Assembly Format:**

 ```
-pto.trem ins(<src0>, <src1> : <src0_type>, <src1_type>)
+pto.trem ins(<src0>, <src1>, <tmp> : <src0_type>, <src1_type>, <tmp_type>)
 outs(<dst> : <dst_type>)
 ```

 **Constraints & Verification:**

 - The implementation uses `dst valid row` / `dst valid column` as the iteration domain.
 - **Implementation checks (A2A3)**
-  - Tile element type must be one of: `i32`, `i16`, `f16`, `f32`.
-  - Tile must use row-major layout (`blayout=row_major`).
-  - Valid bounds: `valid row <= rows` and `valid column <= cols`.
-  - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+  - `src0/src1/dst` element type must match, and must be `i32` or `f32`.
+  - `tmp` element type must match `dst`.
+  - `src0/src1/tmp/dst` must use row-major layout (`blayout=row_major`).
+  - `src0/src1/dst` must have the same `validRow/validCol`.
+  - `tmp` must provide at least `1` valid row and `tmp.validCol >= dst.validCol`.
 - **Implementation checks (A5)**
-  - Tile element type must be one of: `i32`, `i16`, `f32`, `f16`.
-  - Tile must use row-major layout (`blayout=row_major`).
-  - Valid bounds: `valid row <= rows` and `valid column <= cols`.
-  - Runtime: `src0`, `src1` and `dst` tiles should have the same `validRow/validCol`.
+  - `src0/src1/dst` element type must match, and must be one of: `i32`, `i16`, `f16`, `f32`.
+  - `tmp` element type must match `dst`.
+  - `src0/src1/tmp/dst` must use row-major layout (`blayout=row_major`).
+  - `src0/src1/dst` must have the same `validRow/validCol`.
+  - `tmp` must provide at least `1` valid row and `tmp.validCol >= dst.validCol`.

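One way to read the new `tmp` requirement for `pto.trem` is that the smallest acceptable scratch tile has a single valid row and as many valid columns as `dst`. The Python sketch below restates that reading together with the per-target element-type lists; the function names and the target keys `"a2a3"` and `"a5"` are hypothetical, and the code is not taken from PTOAS.

```python
from typing import Tuple

# Element types accepted for src0/src1/dst, copied from the constraint lists.
TREM_DTYPES = {
    "a2a3": {"i32", "f32"},
    "a5": {"i32", "i16", "f16", "f32"},
}


def trem_dtype_ok(target: str, dtype: str) -> bool:
    """True when `dtype` is allowed for src0/src1/dst on `target`."""
    return dtype in TREM_DTYPES[target]


def tmp_ok(tmp_valid: Tuple[int, int], dst_valid: Tuple[int, int]) -> bool:
    """tmp needs at least one valid row and tmp.validCol >= dst.validCol."""
    tmp_rows, tmp_cols = tmp_valid
    _, dst_cols = dst_valid
    return tmp_rows >= 1 and tmp_cols >= dst_cols


def min_tmp_valid_shape(dst_valid: Tuple[int, int]) -> Tuple[int, int]:
    """Smallest (validRow, validCol) for tmp that satisfies tmp_ok."""
    return (1, dst_valid[1])


# A 16x16 dst needs a tmp tile with at least a 1x16 valid region.
print(min_tmp_valid_shape((16, 16)))
```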

 **Hardware Mapping:**

@@ -2022,11 +2122,14 @@ pto.trem ins(<src0>, <src1> : <src0_type>, <src1_type>)
 **Basic Example:**

 ```mlir
-pto.trem ins(%a, %b : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16,
+pto.trem ins(%a, %b, %tmp : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16,
 v_row=16, v_col=16, blayout=row_major, slayout=none_box,
 fractal=512, pad=0>,
 !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16,
 v_row=16, v_col=16, blayout=row_major, slayout=none_box,
+fractal=512, pad=0>,
+!pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=16,
+v_row=1, v_col=16, blayout=row_major, slayout=none_box,
 fractal=512, pad=0>)
 outs(%c : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16,
 v_row=16, v_col=16, blayout=row_major, slayout=none_box,
@@ -2809,30 +2912,35 @@ For each element (i, j):
 |------|------|-------------|
 | `src` | `pto.tile_buf` | Source tile buffer |
 | `scalar` | `ScalarType` (signless integer / float) | Scalar divisor |
+| `tmp` | `pto.tile_buf` | Temporary tile buffer required by the ISA API |
 | `dst` | `pto.tile_buf` | Destination tile buffer |

 **Results:** None. Writes into `dst` via DPS pattern.

 **Assembly Format:**

 ```
-pto.trems ins(<src>, <scalar> : <src_type>, <scalar_type>)
+pto.trems ins(<src>, <scalar>, <tmp> : <src_type>, <scalar_type>, <tmp_type>)
 outs(<dst> : <dst_type>)
 ```

 **Constraints & Verification:**

 - Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
 - **Implementation checks (A2A3)**
-  - Tile element type must be one of: `i32`, `int`, `i16`, `f16`, `f32`.
-  - Tile must use `loc=vec`.
-  - Valid bounds: `valid row <= rows` and `valid column <= cols`.
-  - Runtime: `src0 valid row == dst valid row` and `src0 valid column == dst valid column`.
+  - `src/dst` element type must match, and must be `i32` or `f32`.
+  - `scalar` type must match the tile element type.
+  - `tmp` element type must match `dst`.
+  - `src/tmp/dst` must use row-major layout (`blayout=row_major`).
+  - `src` and `dst` must have the same `validRow/validCol`.
+  - `tmp` must provide at least `1` valid row and `tmp.validCol >= dst.validCol`.
 - **Implementation checks (A5)**
-  - Tile element type must be one of: `i8`, `i16`, `i32`, `f16`, `f32`, `bf16`.
-  - Tile must use `loc=vec`.
-  - Valid bounds: `valid row <= rows` and `valid column <= cols`.
-  - Runtime: `src0 valid row == dst valid row` and `src0 valid column == dst valid column`.
+  - `src/dst` element type must match, and must be one of: `i32`, `i16`, `f16`, `f32`.
+  - `scalar` type must match the tile element type.
+  - `tmp` element type must match `dst`.
+  - `src/tmp/dst` must use row-major layout (`blayout=row_major`).
+  - `src` and `dst` must have the same `validRow/validCol`.
+  - `tmp` must provide at least `1` valid row and `tmp.validCol >= dst.validCol`.

 **Hardware Mapping:**

@@ -2842,9 +2950,12 @@ pto.trems ins(<src>, <scalar> : <src_type>, <scalar_type>)
 **Basic Example:**

 ```mlir
-pto.trems ins(%a, %s : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32,
+pto.trems ins(%a, %s, %tmp : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32,
 v_row=32, v_col=32, blayout=row_major, slayout=none_box,
-fractal=512, pad=0>, f32)
+fractal=512, pad=0>, f32,
+!pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
+v_row=1, v_col=32, blayout=row_major, slayout=none_box,
+fractal=512, pad=0>)
 outs(%c : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32,
 v_row=32, v_col=32, blayout=row_major, slayout=none_box,
 fractal=512, pad=0>)
@@ -3377,6 +3488,72 @@ pto.tsqrt ins(%a : !pto.tile_buf<loc=vec, dtype=f16, rows=16, cols=16,

 ---

+##### `pto.ttri` - Fill Triangular Tile Region
+
+**Summary:** Fills a VEC tile using the PTO-ISA `TTRI` triangular pattern.
+
+**Semantics:**
+
+```
+TTRI(dst, diagonal)
+```
+
+`upperOrLower=0` selects the lower-triangular form and `upperOrLower=1`
+selects the upper-triangular form. The exact per-element fill pattern follows
+the target PTO-ISA implementation.
+
+**Arguments:**
+
+| Name | Type | Description |
+|------|------|-------------|
+| `diagonal` | integer SSA value | Runtime diagonal selector |
+| `dst` | `pto.tile_buf` | Destination vector tile |
+
+**Attributes:**
+
+| Name | Type | Description |
+|------|------|-------------|
+| `upperOrLower` | `I32Attr` (default: `0`) | `0` for lower triangular, `1` for upper triangular |
+
+**Results:** None. Writes into `dst` via DPS pattern.
+
+**Assembly Format:**
+
+```mlir
+pto.ttri ins(%diag {upperOrLower = 1 : i32} : i32)
+outs(%dst : !pto.tile_buf<...>)
+```
+
+**Constraints & Verification:**
+
+- `dst` must be a VEC tile (`loc=vec`) whose valid shape stays within the static tile shape.
+- `dst` must use `blayout=row_major`.
+- `diagonal` must have an integer type.
+- `upperOrLower` must be either `0` or `1`.
+- Supported element types:
+  - A2/A3: `f16`, `f32`, `i16`, `i32`, `u16`, `u32`
+  - A5: `f16`, `f32`, `bf16`, `i8`, `i16`, `i32`, `u8`, `u16`, `u32`
+
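The target-dependent part of the `pto.ttri` constraints above (element type and the `upperOrLower` attribute) can be summarized as a lookup. The sketch below is an illustration in Python under assumed names (`TTRI_DTYPES`, `verify_ttri`, `ttri_form`, and the target keys `"a2a3"` and `"a5"` are all hypothetical). It deliberately does not model the per-element fill pattern, which the manual leaves to the PTO-ISA implementation.

```python
from typing import List

# Supported dst element types per target, copied from the constraint list.
TTRI_DTYPES = {
    "a2a3": {"f16", "f32", "i16", "i32", "u16", "u32"},
    "a5": {"f16", "f32", "bf16", "i8", "i16", "i32", "u8", "u16", "u32"},
}


def ttri_form(upper_or_lower: int = 0) -> str:
    """Map the attribute value to the form named in the manual."""
    forms = {0: "lower", 1: "upper"}
    if upper_or_lower not in forms:
        raise ValueError("upperOrLower must be either 0 or 1")
    return forms[upper_or_lower]


def verify_ttri(target: str, dtype: str, upper_or_lower: int = 0) -> List[str]:
    """Return the violated constraints; an empty list means accepted."""
    errors = []
    if dtype not in TTRI_DTYPES[target]:
        errors.append(f"element type {dtype} is not supported on {target}")
    if upper_or_lower not in (0, 1):
        errors.append("upperOrLower must be either 0 or 1")
    return errors


# bf16 is accepted on A5 but rejected on A2/A3.
print(verify_ttri("a5", "bf16", 1), verify_ttri("a2a3", "bf16"))
```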
+**Hardware Mapping:**
+
+- Executes on the **Vector pipeline** (`PIPE_V`)
+
+**Basic Example:**
+
+```mlir
+pto.ttri ins(%diag : i32)
+outs(%lower : !pto.tile_buf<loc=vec, dtype=i32, rows=32, cols=32,
+v_row=32, v_col=32, blayout=row_major, slayout=none_box,
+fractal=512, pad=0>)
+
+pto.ttri ins(%diag {upperOrLower = 1 : i32} : i32)
+outs(%upper : !pto.tile_buf<loc=vec, dtype=i32, rows=32, cols=32,
+v_row=32, v_col=32, blayout=row_major, slayout=none_box,
+fractal=512, pad=0>)
+```
+
+---
+
 ##### `pto.trsqrt` - Elementwise Reciprocal Square Root

 **Summary:** Computes the reciprocal square root for every element.
