A5 target hangs on RMSNorm vector kernel that passes on A2A3 #441

@zhangqi-chen

Problem

A PTO-IR vector kernel (decode_projection_incore_0.pto) compiles and runs correctly when targeting A2A3, but hangs at runtime when compiled for A5.

Environment

  • PTOAS version: 0.22

Reproduction

PTO files attached below:

decode_projection_incore_0.pto.txt
rmsnorm_incore_0.pto.txt

Steps:

  1. Change pto.target_arch from "a2a3" to "a5" in the module attributes
  2. Compile with ptoas
  3. Run on A5 platform — program hangs indefinitely (no crash, no error)
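Because the failure is a hang rather than a crash, a reproduction script benefits from a time limit so that it terminates and reports the hang. The following is a minimal sketch using only the Python standard library. The helper name `run_with_timeout` and the demo command are hypothetical: the issue does not give the actual A5 launch command, so a sleeping Python process stands in for it. Killing the host process may not reset the device, so this only classifies the outcome.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hang'.

    Hypothetical helper: subprocess.run kills the child process when the
    timeout expires and raises TimeoutExpired, which is reported as 'hang'.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in for the real A5 launch command, which the issue does not state.
    demo_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(demo_cmd, timeout_s=0.5))
```

Replacing `demo_cmd` with the real launch command and a generous timeout would turn the "hangs indefinitely" observation into a pass/fail result that can be compared between the A2A3 and A5 builds.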

Behavior

Target | Compile | Run
------ | ------- | -----
A2A3   | OK      | OK
A5     | OK      | Hangs

Kernel Summary

This is an RMSNorm vector kernel (decode_projection_incore_0) from the Qwen3-32B decode layer projection. It operates on a [16, 5120] BF16 input with K_CHUNK=128 (5120 / 128 = 40 iterations):

  1. Loop 1 (accumulate squared partial sums): tload → tcvt(bf16→f32) → tmul(x²) → trowsum → tadd (accumulate) → tmov
  2. Post-loop (compute inv_rms): tmuls(÷5120) → tadds(+ε) → trsqrt
  3. Loop 2 (apply normalization): tload → tcvt → trowexpandmul(×inv_rms) → tcolexpandmul(×γ) → tcvt(f32→bf16) → tstore
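For reference, the three stages above can be sketched on the host in plain Python. This is not the kernel itself, only an illustration of the math it is expected to compute. The function name `rmsnorm_chunked` is hypothetical, the value of ε is an assumption (the issue does not state it), and the BF16 conversions are omitted, so all arithmetic here is in Python floats.

```python
import math

HIDDEN = 5120   # row length, from the issue
K_CHUNK = 128   # chunk width, from the issue
EPS = 1e-6      # assumption: the issue does not give the epsilon value


def rmsnorm_chunked(x, gamma, hidden=HIDDEN, k_chunk=K_CHUNK, eps=EPS):
    """Chunked RMSNorm over rows of x (list of lists), scaled by gamma."""
    rows = len(x)
    n_chunks = hidden // k_chunk

    # Loop 1: accumulate per-row sums of squares, one chunk at a time
    # (corresponds to tmul, trowsum, tadd in the summary).
    sq_sum = [0.0] * rows
    for c in range(n_chunks):
        lo, hi = c * k_chunk, (c + 1) * k_chunk
        for r in range(rows):
            sq_sum[r] += sum(v * v for v in x[r][lo:hi])

    # Post-loop: inv_rms = 1 / sqrt(mean(x^2) + eps)
    # (corresponds to tmuls, tadds, trsqrt).
    inv_rms = [1.0 / math.sqrt(s / hidden + eps) for s in sq_sum]

    # Loop 2: scale each element by its row's inv_rms and its column's gamma
    # (corresponds to trowexpandmul, tcolexpandmul).
    out = [[0.0] * hidden for _ in range(rows)]
    for c in range(n_chunks):
        lo, hi = c * k_chunk, (c + 1) * k_chunk
        for r in range(rows):
            for j in range(lo, hi):
                out[r][j] = x[r][j] * inv_rms[r] * gamma[j]
    return out


if __name__ == "__main__":
    # Tiny shape for a quick check: one row of ones, gamma of twos.
    y = rmsnorm_chunked([[1.0] * 8], [2.0] * 8, hidden=8, k_chunk=4, eps=0.0)
    print(y[0][:4])
```

A reference like this could be compared against the A2A3 output to confirm the expected values before investigating the A5 hang.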

Operations Used

tload, tstore, tcvt, tmul, trowsum, tadd, tmov, tmuls, tadds, trsqrt, trowexpandmul, tcolexpandmul, texpands

Context

Discovered during E2E validation of pypto-lib Qwen3-32B decode tilelet (hw-native-sys/pypto-lib#58, Scope 1).

Status: In Progress
