Problem
A PTO-IR vector kernel (decode_projection_incore_0.pto) compiles and runs correctly when targeting A2A3, but hangs at runtime when compiled for A5.
Environment
Reproduction
PTO files attached below.
decode_projection_incore_0.pto.txt
rmsnorm_incore_0.pto.txt
Steps:
- Change pto.target_arch from "a2a3" to "a5" in the module attributes.
- Compile with ptoas.
- Run on the A5 platform: the program hangs indefinitely (no crash, no error).
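When repeating these steps, it helps to tell a genuine hang apart from a slow run or a silent crash. The small Python harness below patches the attribute and runs a caller-supplied command under a timeout. It is a sketch under stated assumptions. It assumes the attribute is written as pto.target_arch = "a2a3" in the .pto text, and retarget_to_a5 and run_with_timeout are hypothetical helpers. The compile and run commands are left to the caller because this issue does not give any ptoas options.

```python
import re
import subprocess
from typing import Sequence

# Assumption: the module attribute is spelled  pto.target_arch = "a2a3"  in the .pto text.
_ARCH_RE = re.compile(r'(pto\.target_arch\s*=\s*)"a2a3"')


def retarget_to_a5(pto_text: str) -> str:
    """Return pto_text with target_arch switched from "a2a3" to "a5"."""
    new_text, count = _ARCH_RE.subn(r'\1"a5"', pto_text)
    if count == 0:
        raise ValueError('no pto.target_arch = "a2a3" attribute found')
    return new_text


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> str:
    """Run cmd and classify the outcome as "ok", "error (rc=N)", or "hang"."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising TimeoutExpired.
        return "hang"
    return "ok" if proc.returncode == 0 else f"error (rc={proc.returncode})"
```

A run that returns "hang" on A5 but "ok" for the same command on A2A3 matches the behavior reported here. A nonzero exit code would instead point to a crash.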
Behavior
| Target | Compile | Run |
|---|---|---|
| A2A3 | OK | OK |
| A5 | OK | Hangs |
Kernel Summary
This is an RMSNorm vector kernel (decode_projection_incore_0) from the Qwen3-32B decode layer projection. It operates on a [16, 5120] BF16 input with K_CHUNK=128 (5120 / 128 = 40 iterations):
- Loop 1 (accumulate squared partial sums): tload → tcvt(bf16→f32) → tmul(x²) → trowsum → tadd (accumulate) → tmov
- Post-loop (compute inv_rms): tmuls(÷5120) → tadds(+ε) → trsqrt
- Loop 2 (apply normalization): tload → tcvt → trowexpandmul(×inv_rms) → tcolexpandmul(×γ) → tcvt(f32→bf16) → tstore
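The three phases above can be written out as the standard RMSNorm formula. Here x is the [16, 5120] input, γ is the per-column weight, and K = K_CHUNK. This is read off the op chain rather than taken from the kernel source, and the value of ε is not stated in this issue. In this reading, s_i is built by trowsum per chunk plus tadd across chunks, r_i by tmuls/tadds/trsqrt, and y by trowexpandmul (r_i broadcast along row i) and tcolexpandmul (γ_j broadcast down column j).

```latex
\begin{aligned}
N &= 5120, \qquad K = 128, \qquad C = N / K = 40,\\
s_i &= \sum_{c=0}^{C-1} \sum_{k=0}^{K-1} x_{i,\,cK+k}^{2}, \qquad i = 0, \dots, 15,\\
r_i &= \frac{1}{\sqrt{s_i / N + \varepsilon}},\\
y_{i,j} &= x_{i,j}\, r_i\, \gamma_j, \qquad j = 0, \dots, N-1.
\end{aligned}
```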
Operations Used
tload, tstore, tcvt, tmul, trowsum, tadd, tmov, tmuls, tadds, trsqrt, trowexpandmul, tcolexpandmul, texpands
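Once A5 runs to completion, its output can be checked against A2A3 or against a host reference. The pure-Python sketch below (no NumPy) mirrors the loop structure of the kernel summary, under several assumptions. ε = 1e-6 is assumed because the issue does not state it. The input values are assumed to be BF16-representable already, so the bf16→f32 tcvt is exact. FP32 accumulation is approximated by rounding the running sum to float32 after each tadd. The names to_f32, to_bf16, and rmsnorm_reference are hypothetical and not part of pypto-lib or ptoas.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite float to bfloat16 (via float32, round-half-to-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep the top 16 bits; the bias rounds the dropped low half to nearest even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rmsnorm_reference(x, gamma, k_chunk=128, eps=1e-6):
    """Two-pass RMSNorm over the rows of x, chunked along columns like loop 1."""
    rows, n = len(x), len(x[0])
    if n % k_chunk:
        raise ValueError("row length must be a multiple of k_chunk")
    num_chunks = n // k_chunk  # 5120 // 128 == 40 for this kernel

    # Loop 1: per-chunk sum of squares (tmul + trowsum) is computed in double;
    # only the running per-row accumulator (tadd) is rounded to float32.
    acc = [0.0] * rows
    for c in range(num_chunks):
        lo, hi = c * k_chunk, (c + 1) * k_chunk
        for i in range(rows):
            partial = sum(v * v for v in x[i][lo:hi])
            acc[i] = to_f32(acc[i] + partial)

    # Post-loop: inv_rms = 1 / sqrt(acc / n + eps) (tmuls, tadds, trsqrt), in double.
    inv_rms = [1.0 / math.sqrt(s / n + eps) for s in acc]

    # Loop 2: scale row i by inv_rms[i] and column j by gamma[j], then round to
    # BF16 (tcvt f32->bf16). Not chunked here, since each element is independent.
    return [
        [to_bf16(x[i][j] * inv_rms[i] * gamma[j]) for j in range(n)]
        for i in range(rows)
    ]
```

For example, with x = [[float(i + 1)] * 5120 for i in range(16)] and gamma = [1.0] * 5120, every output element rounds to 1.0 in BF16. This gives a quick sanity value for a device run.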
PTO File
Context
Discovered during E2E validation of pypto-lib Qwen3-32B decode tilelet (hw-native-sys/pypto-lib#58, Scope 1).