@@ -707,6 +707,103 @@ pto.tload ins(%pv : !pto.partition_tensor_view<16x16xf16>)
707707
708708---
709709
710+ ##### ` pto.tprefetch ` - Prefetch Partition View into Tile
711+
712+ ** Summary:** Prefetches a GM-backed partition view into a temporary local tile buffer. This maps to PTO-ISA ` TPREFETCH(dst, src) ` and, unlike most PTO intrinsics, does not add implicit wait-event synchronization in the C++ wrapper.
713+
714+ ** Semantics:**
715+
716+ ```
717+ TPREFETCH(dst, src)
718+ ```
719+
720+ The detailed caching / hint behavior is target-defined by PTO-ISA. In PTOAS the
721+ op is modeled as writing the prefetched data into ` dst ` .
722+
723+ ** Arguments:**
724+
725+ | Name | Type | Description |
726+ | ------| ------| -------------|
727+ | ` src ` | ` pto.partition_tensor_view ` or lowered GM memref | Source global view |
728+ | ` dst ` | ` pto.tile_buf ` or lowered local memref | Destination local tile |
729+
730+ ** Results:** None. Writes into ` dst ` via DPS pattern.
731+
732+ ** Constraints & Verification:**
733+
734+ - ` src ` must be a partition view before lowering, or the corresponding lowered ranked memref form after ` PTOViewToMemref ` .
735+ - ` dst ` must be a tile buffer before lowering, or the corresponding lowered ranked memref form after ` PTOViewToMemref ` .
736+ - ` dst ` must use ` loc=vec ` or ` loc=mat ` .
737+ - Static source extents and static destination valid extents must be positive when known.
738+ - ` src ` and ` dst ` element types must have the same element size in bytes.
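The location, extent, and element-size constraints above can be pictured as a small pre-check. This is a minimal Python sketch under stated assumptions, not the PTOAS verifier: `ELEM_BYTES`, `check_tprefetch`, and its parameters are hypothetical names introduced for illustration, and `None` stands for an extent that is not statically known.

```python
from typing import List, Optional, Sequence

# Byte widths for an illustrative subset of element types.
ELEM_BYTES = {"i8": 1, "i16": 2, "f16": 2, "bf16": 2, "i32": 4, "f32": 4}


def check_tprefetch(src_dtype: str, src_extents: Sequence[Optional[int]],
                    dst_dtype: str, dst_loc: str,
                    dst_valid: Sequence[Optional[int]]) -> List[str]:
    """Return the violated constraints; an empty list means the sketch accepts."""
    errors = []
    if dst_loc not in ("vec", "mat"):
        errors.append("dst must use loc=vec or loc=mat")
    # Only statically known extents (not None) are checked.
    if any(e is not None and e <= 0 for e in src_extents):
        errors.append("static source extents must be positive")
    if any(e is not None and e <= 0 for e in dst_valid):
        errors.append("static destination valid extents must be positive")
    # The element types may differ as long as their byte widths agree.
    if ELEM_BYTES[src_dtype] != ELEM_BYTES[dst_dtype]:
        errors.append("src and dst element sizes must match")
    return errors
```

Under this sketch an `f16` source with an `i16` destination passes the size check, because both element types occupy 2 bytes.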
739+
740+ ** Hardware Mapping:**
741+
742+ - Executes on the ** DMA pipeline** (` PIPE_MTE2 ` , GM -> local tile)
743+
744+ ** Basic Example:**
745+
746+ ``` mlir
747+ pto.tprefetch ins(%pv : !pto.partition_tensor_view<16x16xf16>)
748+ outs(%tb : !pto.tile_buf<loc=vec, dtype=f16, rows=16, cols=16,
749+ v_row=16, v_col=16, blayout=row_major, slayout=none_box,
750+ fractal=512, pad=0>)
751+ ```
752+
753+ ---
754+
755+ ##### ` pto.tpack ` - Pack a Wider Vec Tile into a Narrower Vec Tile
756+
757+ ** Summary:** A5-only vector packing operation that narrows a source VEC tile into
758+ a destination VEC tile of the same valid shape.
759+
760+ ** Semantics:**
761+
762+ ```
763+ TPACK(dst, src)
764+ ```
765+
766+ Supported packing directions follow PTO-ISA:
767+ - ` b32 -> b16 `
768+ - ` b32 -> b8 `
769+ - ` b16 -> b8 `
770+
771+ ** Arguments:**
772+
773+ | Name | Type | Description |
774+ | ------| ------| -------------|
775+ | ` src ` | ` pto.tile_buf ` | Source VEC tile with wider element type |
776+ | ` dst ` | ` pto.tile_buf ` | Destination VEC tile with narrower element type |
777+
778+ ** Results:** None. Writes into ` dst ` via DPS pattern.
779+
780+ ** Constraints & Verification:**
781+
782+ - ` pto.tpack ` is only supported on ** A5** targets.
783+ - ` src ` and ` dst ` must both be VEC tiles (` loc=vec ` ) with row-major layout.
784+ - ` src ` and ` dst ` must have the same ` valid_shape ` .
785+ - Supported element-size pairs are exactly:
786+   - `4 -> 2` bytes
787+   - `4 -> 1` bytes
788+   - `2 -> 1` bytes
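The size-pair rule is a closed set, so it reduces to a membership test. A minimal Python sketch follows; `SUPPORTED_TPACK_PAIRS` and `tpack_pair_supported` are hypothetical names used only for illustration, not PTOAS APIs.

```python
# (source element bytes, destination element bytes), as listed above.
SUPPORTED_TPACK_PAIRS = {(4, 2), (4, 1), (2, 1)}


def tpack_pair_supported(src_bytes: int, dst_bytes: int) -> bool:
    """True only for the three documented narrowing pairs."""
    return (src_bytes, dst_bytes) in SUPPORTED_TPACK_PAIRS
```

An `i32` to `i16` pack corresponds to the pair `(4, 2)` and is accepted; equal-width or widening pairs such as `(2, 2)` or `(2, 4)` are rejected.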
789+
790+ ** Hardware Mapping:**
791+
792+ - Executes on the ** Vector pipeline** (` PIPE_V ` )
793+
794+ ** Basic Example:**
795+
796+ ``` mlir
797+ pto.tpack ins(%src : !pto.tile_buf<loc=vec, dtype=i32, rows=128, cols=128,
798+ v_row=128, v_col=128, blayout=row_major, slayout=none_box,
799+ fractal=512, pad=0>)
800+ outs(%dst : !pto.tile_buf<loc=vec, dtype=i16, rows=128, cols=128,
801+ v_row=128, v_col=128, blayout=row_major, slayout=none_box,
802+ fractal=512, pad=0>)
803+ ```
804+
805+ ---
806+
710807##### ` pto.tstore ` - Store Tile to Partition View
711808
712809** Summary:** Stores a 2-D tile buffer back to a 2-D partition view. Supports phase/atomic/relu/pre-quant controls that lower to the corresponding ` TSTORE ` template overload family.
@@ -1801,6 +1898,7 @@ Division-by-zero behavior is target-defined.
18011898| ------| ------| -------------|
18021899| ` src0 ` | ` pto.tile_buf ` | Dividend tile buffer |
18031900| ` src1 ` | ` pto.tile_buf ` | Divisor tile buffer |
1901+ | ` tmp ` | ` pto.tile_buf ` | Temporary tile buffer required by the ISA API |
18041902| ` dst ` | ` pto.tile_buf ` | Destination tile buffer |
18051903
18061904** Results:** None. Writes into ` dst ` via DPS pattern.
@@ -1996,23 +2094,25 @@ For each element (i, j):
19962094** Assembly Format:**
19972095
19982096```
1999- pto.trem ins(<src0>, <src1> : <src0_type>, <src1_type>)
2097+ pto.trem ins(<src0>, <src1>, <tmp> : <src0_type>, <src1_type>, <tmp_type>)
20002098 outs(<dst> : <dst_type>)
20012099```
20022100
20032101** Constraints & Verification:**
20042102
20052103- The implementation uses ` dst valid row ` / ` dst valid column ` as the iteration domain.
20062104- ** Implementation checks (A2A3)**
2007- - Tile element type must be one of: ` i32 ` , ` i16 ` , ` f16 ` , ` f32 ` .
2008- - Tile must use row-major layout (` blayout=row_major ` ).
2009- - Valid bounds: ` valid row <= rows ` and ` valid column <= cols ` .
2010- - Runtime: ` src0 ` , ` src1 ` and ` dst ` tiles should have the same ` validRow/validCol ` .
2105+ - ` src0/src1/dst ` element type must match, and must be ` i32 ` or ` f32 ` .
2106+ - ` tmp ` element type must match ` dst ` .
2107+ - ` src0/src1/tmp/dst ` must use row-major layout (` blayout=row_major ` ).
2108+ - ` src0/src1/dst ` must have the same ` validRow/validCol ` .
2109+ - ` tmp ` must provide at least ` 1 ` valid row and ` tmp.validCol >= dst.validCol ` .
20112110- ** Implementation checks (A5)**
2012- - Tile element type must be one of: ` i32 ` , ` i16 ` , ` f32 ` , ` f16 ` .
2013- - Tile must use row-major layout (` blayout=row_major ` ).
2014- - Valid bounds: ` valid row <= rows ` and ` valid column <= cols ` .
2015- - Runtime: ` src0 ` , ` src1 ` and ` dst ` tiles should have the same ` validRow/validCol ` .
2111+ - ` src0/src1/dst ` element type must match, and must be one of: ` i32 ` , ` i16 ` , ` f16 ` , ` f32 ` .
2112+ - ` tmp ` element type must match ` dst ` .
2113+ - ` src0/src1/tmp/dst ` must use row-major layout (` blayout=row_major ` ).
2114+ - ` src0/src1/dst ` must have the same ` validRow/validCol ` .
2115+ - ` tmp ` must provide at least ` 1 ` valid row and ` tmp.validCol >= dst.validCol ` .
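To make the per-target checks concrete, here is a minimal Python sketch of the element-type and valid-shape rules listed above. It is not the PTOAS verifier: `TileDesc`, `TREM_DTYPES`, and `check_trem` are hypothetical names, and the row-major layout check is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class TileDesc:
    dtype: str
    v_row: int
    v_col: int


# Element types accepted per target, as listed in the constraints above.
TREM_DTYPES = {"A2A3": {"i32", "f32"}, "A5": {"i32", "i16", "f16", "f32"}}


def check_trem(target, src0, src1, tmp, dst):
    """Return the violated constraints; an empty list means the sketch accepts."""
    errors = []
    if not (src0.dtype == src1.dtype == dst.dtype):
        errors.append("src0/src1/dst element types must match")
    elif dst.dtype not in TREM_DTYPES[target]:
        errors.append(f"{dst.dtype} is not supported on {target}")
    if tmp.dtype != dst.dtype:
        errors.append("tmp element type must match dst")
    # src0, src1 and dst must share one valid shape.
    if len({(t.v_row, t.v_col) for t in (src0, src1, dst)}) != 1:
        errors.append("src0/src1/dst must have the same validRow/validCol")
    # tmp only needs one valid row, but must be at least as wide as dst.
    if tmp.v_row < 1 or tmp.v_col < dst.v_col:
        errors.append("tmp needs >= 1 valid row and validCol >= dst.validCol")
    return errors
```

With 16x16 `f32` operands, a `tmp` tile with one valid row and 16 valid columns satisfies every rule in this sketch on both targets, while `f16` operands pass only on A5.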
20162116
20172117** Hardware Mapping:**
20182118
@@ -2022,11 +2122,14 @@ pto.trem ins(<src0>, <src1> : <src0_type>, <src1_type>)
20222122** Basic Example:**
20232123
20242124``` mlir
2025- pto.trem ins(%a, %b : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16,
2125+ pto.trem ins(%a, %b, %tmp : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16,
20262126 v_row=16, v_col=16, blayout=row_major, slayout=none_box,
20272127 fractal=512, pad=0>,
20282128 !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16,
20292129 v_row=16, v_col=16, blayout=row_major, slayout=none_box,
2130+ fractal=512, pad=0>,
2131+ !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=16,
2132+ v_row=1, v_col=16, blayout=row_major, slayout=none_box,
20302133 fractal=512, pad=0>)
20312134 outs(%c : !pto.tile_buf<loc=vec, dtype=f32, rows=16, cols=16,
20322135 v_row=16, v_col=16, blayout=row_major, slayout=none_box,
@@ -2809,30 +2912,35 @@ For each element (i, j):
28092912| ------| ------| -------------|
28102913| ` src ` | ` pto.tile_buf ` | Source tile buffer |
28112914| ` scalar ` | ` ScalarType ` (signless integer / float) | Scalar divisor |
2915+ | ` tmp ` | ` pto.tile_buf ` | Temporary tile buffer required by the ISA API |
28122916| ` dst ` | ` pto.tile_buf ` | Destination tile buffer |
28132917
28142918** Results:** None. Writes into ` dst ` via DPS pattern.
28152919
28162920** Assembly Format:**
28172921
28182922```
2819- pto.trems ins(<src>, <scalar> : <src_type>, <scalar_type>)
2923+ pto.trems ins(<src>, <scalar>, <tmp> : <src_type>, <scalar_type>, <tmp_type>)
28202924 outs(<dst> : <dst_type>)
28212925```
28222926
28232927** Constraints & Verification:**
28242928
28252929- Division-by-zero behavior is target-defined; the CPU simulator asserts in debug builds.
28262930- ** Implementation checks (A2A3)**
2827- - Tile element type must be one of: ` i32 ` , ` int ` , ` i16 ` , ` f16 ` , ` f32 ` .
2828- - Tile must use ` loc=vec ` .
2829- - Valid bounds: ` valid row <= rows ` and ` valid column <= cols ` .
2830- - Runtime: ` src0 valid row == dst valid row ` and ` src0 valid column == dst valid column ` .
2931+ - ` src/dst ` element type must match, and must be ` i32 ` or ` f32 ` .
2932+ - ` scalar ` type must match the tile element type.
2933+ - ` tmp ` element type must match ` dst ` .
2934+ - ` src/tmp/dst ` must use row-major layout (` blayout=row_major ` ).
2935+ - ` src ` and ` dst ` must have the same ` validRow/validCol ` .
2936+ - ` tmp ` must provide at least ` 1 ` valid row and ` tmp.validCol >= dst.validCol ` .
28312937- ** Implementation checks (A5)**
2832- - Tile element type must be one of: ` i8 ` , ` i16 ` , ` i32 ` , ` f16 ` , ` f32 ` , ` bf16 ` .
2833- - Tile must use ` loc=vec ` .
2834- - Valid bounds: ` valid row <= rows ` and ` valid column <= cols ` .
2835- - Runtime: ` src0 valid row == dst valid row ` and ` src0 valid column == dst valid column ` .
2938+ - ` src/dst ` element type must match, and must be one of: ` i32 ` , ` i16 ` , ` f16 ` , ` f32 ` .
2939+ - ` scalar ` type must match the tile element type.
2940+ - ` tmp ` element type must match ` dst ` .
2941+ - ` src/tmp/dst ` must use row-major layout (` blayout=row_major ` ).
2942+ - ` src ` and ` dst ` must have the same ` validRow/validCol ` .
2943+ - ` tmp ` must provide at least ` 1 ` valid row and ` tmp.validCol >= dst.validCol ` .
28362944
28372945** Hardware Mapping:**
28382946
@@ -2842,9 +2950,12 @@ pto.trems ins(<src>, <scalar> : <src_type>, <scalar_type>)
28422950** Basic Example:**
28432951
28442952``` mlir
2845- pto.trems ins(%a, %s : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32,
2953+ pto.trems ins(%a, %s, %tmp : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32,
28462954 v_row=32, v_col=32, blayout=row_major, slayout=none_box,
2847- fractal=512, pad=0>, f32)
2955+ fractal=512, pad=0>, f32,
2956+ !pto.tile_buf<loc=vec, dtype=f32, rows=1, cols=32,
2957+ v_row=1, v_col=32, blayout=row_major, slayout=none_box,
2958+ fractal=512, pad=0>)
28482959 outs(%c : !pto.tile_buf<loc=vec, dtype=f32, rows=32, cols=32,
28492960 v_row=32, v_col=32, blayout=row_major, slayout=none_box,
28502961 fractal=512, pad=0>)
@@ -3377,6 +3488,72 @@ pto.tsqrt ins(%a : !pto.tile_buf<loc=vec, dtype=f16, rows=16, cols=16,
33773488
33783489---
33793490
3491+ ##### ` pto.ttri ` - Fill Triangular Tile Region
3492+
3493+ ** Summary:** Fills a VEC tile using the PTO-ISA ` TTRI ` triangular pattern.
3494+
3495+ ** Semantics:**
3496+
3497+ ```
3498+ TTRI(dst, diagonal)
3499+ ```
3500+
3501+ ` upperOrLower=0 ` selects the lower-triangular form and ` upperOrLower=1 `
3502+ selects the upper-triangular form. The exact per-element fill pattern follows
3503+ the target PTO-ISA implementation.
3504+
3505+ ** Arguments:**
3506+
3507+ | Name | Type | Description |
3508+ | ------| ------| -------------|
3509+ | ` diagonal ` | integer SSA value | Runtime diagonal selector |
3510+ | ` dst ` | ` pto.tile_buf ` | Destination vector tile |
3511+
3512+ ** Attributes:**
3513+
3514+ | Name | Type | Description |
3515+ | ------| ------| -------------|
3516+ | ` upperOrLower ` | ` I32Attr ` (default: ` 0 ` ) | ` 0 ` for lower triangular, ` 1 ` for upper triangular |
3517+
3518+ ** Results:** None. Writes into ` dst ` via DPS pattern.
3519+
3520+ ** Assembly Format:**
3521+
3522+ ``` mlir
3523+ pto.ttri ins(%diag {upperOrLower = 1 : i32} : i32)
3524+ outs(%dst : !pto.tile_buf<...>)
3525+ ```
3526+
3527+ ** Constraints & Verification:**
3528+
3529+ - ` dst ` must be a VEC tile (` loc=vec ` ) whose valid shape stays within the static tile shape.
3530+ - ` dst ` must use ` blayout=row_major ` .
3531+ - ` diagonal ` must have an integer type.
3532+ - ` upperOrLower ` must be either ` 0 ` or ` 1 ` .
3533+ - Supported element types:
3534+   - A2/A3: `f16`, `f32`, `i16`, `i32`, `u16`, `u32`
3535+   - A5: `f16`, `f32`, `bf16`, `i8`, `i16`, `i32`, `u8`, `u16`, `u32`
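The target-dependent type list and the `upperOrLower` range can be summarized as a lookup. A minimal Python sketch follows; `TTRI_DTYPES` and `ttri_supported` are hypothetical names for illustration only, and the sketch says nothing about the fill pattern itself, which remains defined by the target PTO-ISA implementation.

```python
# Element types accepted per target, as listed in the constraints above.
TTRI_DTYPES = {
    "A2A3": {"f16", "f32", "i16", "i32", "u16", "u32"},
    "A5": {"f16", "f32", "bf16", "i8", "i16", "i32", "u8", "u16", "u32"},
}


def ttri_supported(target: str, dtype: str, upper_or_lower: int = 0) -> bool:
    """Check the attribute range and the per-target element type."""
    return upper_or_lower in (0, 1) and dtype in TTRI_DTYPES[target]
```

For example, a `bf16` destination is accepted on A5 but not on A2/A3, and any `upperOrLower` value other than `0` or `1` is rejected regardless of target.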
3536+
3537+ ** Hardware Mapping:**
3538+
3539+ - Executes on the ** Vector pipeline** (` PIPE_V ` )
3540+
3541+ ** Basic Example:**
3542+
3543+ ``` mlir
3544+ pto.ttri ins(%diag : i32)
3545+ outs(%lower : !pto.tile_buf<loc=vec, dtype=i32, rows=32, cols=32,
3546+ v_row=32, v_col=32, blayout=row_major, slayout=none_box,
3547+ fractal=512, pad=0>)
3548+
3549+ pto.ttri ins(%diag {upperOrLower = 1 : i32} : i32)
3550+ outs(%upper : !pto.tile_buf<loc=vec, dtype=i32, rows=32, cols=32,
3551+ v_row=32, v_col=32, blayout=row_major, slayout=none_box,
3552+ fractal=512, pad=0>)
3553+ ```
3554+
3555+ ---
3556+
33803557##### ` pto.trsqrt ` - Elementwise Reciprocal Square Root
33813558
33823559** Summary:** Computes the reciprocal square root for every element.