[AMDGPU] Detect BUFFER_STORE source-vgpr WAR hazard on gfx940+ regard… by panditsa · Pull Request #2446 · ROCm/llvm-project

panditsa · 2026-05-07T21:30:43Z

…less of soffset

createsVALUHazard previously gated the MUBUF/MTBUF source-vgpr WAR hazard to fire only when SOFFSET was a literal or absent. On gfx940-family subtargets (gfx942, gfx950) this is too narrow: the hazard fires equally when SOFFSET is sourced from an SGPR.

Concretely, on gfx950 a sequence of the form

buffer_store_dwordx4 v[X:X+3], voff, descr, sN offen
v_pk_mul_f32 v[X:X+1], , # next VALU cycle

deterministically commits the post-pk_mul value of v[X+1] to memory for the second dword of the dwordx4 store; the other three dwords store correctly. checkVALUHazardsHelper already returns the right 2-wait-state cure for gfx940 family, so widening createsVALUHazard's trigger is sufficient.

Empirically reproduced on AMD Instinct MI350X (gfx950) by Triton's fused-attention backward kernel. Adding a single S_NOP 1 (or any 2-cycle bubble) between the store and the v_pk_mul makes the corruption go away; literal-soffset stores were already covered by the pre-existing rule which is why this hazard had not shown up before. The new MIR test covers both literal and SGPR soffset on gfx900 (older), gfx942 and gfx950, plus negative cases (non-overlapping write, dwordx2 store).

Assisted-by: Cursor cursoragent@cursor.com

…less of soffset createsVALUHazard previously gated the MUBUF/MTBUF source-vgpr WAR hazard to fire only when SOFFSET was a literal or absent. On gfx940-family subtargets (gfx942, gfx950) this is too narrow: the hazard fires equally when SOFFSET is sourced from an SGPR. Concretely, on gfx950 a sequence of the form buffer_store_dwordx4 v[X:X+3], voff, descr, sN offen v_pk_mul_f32 v[X:X+1], <src>, <src> # next VALU cycle deterministically commits the post-pk_mul value of v[X+1] to memory for the second dword of the dwordx4 store; the other three dwords store correctly. checkVALUHazardsHelper already returns the right 2-wait-state cure for gfx940 family, so widening createsVALUHazard's trigger is sufficient. Empirically reproduced on AMD Instinct MI350X (gfx950) by Triton's fused-attention backward kernel. Adding a single S_NOP 1 (or any 2-cycle bubble) between the store and the v_pk_mul makes the corruption go away; literal-soffset stores were already covered by the pre-existing rule which is why this hazard had not shown up before. The new MIR test covers both literal and SGPR soffset on gfx900 (older), gfx942 and gfx950, plus negative cases (non-overlapping write, dwordx2 store). Assisted-by: Cursor <cursoragent@cursor.com> Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>

z1-cciauto · 2026-05-07T21:36:35Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/5782

jayfoad · 2026-05-08T09:04:56Z

The patch seems reasonable and should be submitted upstream at https://github.com/llvm/llvm-project/pulls

However...

On gfx940-family subtargets (gfx942, gfx950) this is too narrow: the hazard fires equally when SOFFSET is sourced from an SGPR.

Is this confirmed by the hardware team? Is it documented anywhere? I'm looking at the GFX9 SPG section 3.1.2 "Manually Inserted Wait States (NOPs)", row 8A in the table, and it clearly says:

BUFFER_STORE_* operations
which use an SGPR for ‘offset’ do
not require any wait states.

with no exceptions.

panditsa · 2026-05-10T23:53:48Z

The patch seems reasonable and should be submitted upstream at https://github.com/llvm/llvm-project/pulls

However...

On gfx940-family subtargets (gfx942, gfx950) this is too narrow: the hazard fires equally when SOFFSET is sourced from an SGPR.

Is this confirmed by the hardware team? Is it documented anywhere? I'm looking at the GFX9 SPG section 3.1.2 "Manually Inserted Wait States (NOPs)", row 8A in the table, and it clearly says:

BUFFER_STORE_* operations
which use an SGPR for ‘offset’ do
not require any wait states.

with no exceptions.

Hi @jayfoad I didn't see anything in Shader's Programming Guide but I have a sample asm test that recreates this bug.

        // fused: zero cycles between the buffer_store_dwordx4 and the
        // v_pk_mul_f32 that overwrites v[0:1]. Reproduces the hazard.
        // Uncomment the "s_nop 0" below to reproduce the 1nop fix.
        asm volatile(
            "s_mov_b64 s[16:17], %[base]\n"
            "s_and_b32 s17, s17, 0xFFFF\n"
            "s_mov_b32 s18, 0x7FFFFFFE\n"
            "s_mov_b32 s19, 0x00027000\n"
            "s_mov_b32 s20, 0\n"
            "v_mov_b32_e32 v100, %[voff]\n"
            "v_cvt_pk_f16_f32 v0, %[a], %[b]\n"
            "v_cvt_pk_f16_f32 v1, %[c], %[d]\n"
            "v_cvt_pk_f16_f32 v2, %[e], %[f]\n"
            "v_cvt_pk_f16_f32 v3, %[g], %[h]\n"
            "buffer_store_dwordx4 v[0:3], v100, s[16:19], s20 offen\n"
            // "s_nop 0\n"
            "v_pk_mul_f32 v[0:1], %[scale_pair], v[6:7]\n"
            "v_pk_mul_f32 v[2:3], %[scale_pair], v[8:9]\n"
            : : [base] "s"(reinterpret_cast<uint64_t>(out)),
                [voff] "s"(voff),
                [a] "v"(a), [b] "v"(b),
                [c] "v"(c), [d] "v"(d),
                [e] "v"(e), [f] "v"(f),
                [g] "v"(g), [h] "v"(h),
                [scale_pair] "s"(scale_pair)
            : "v0","v1","v2","v3","v100",
              "s16","s17","s18","s19","s20","memory"
        );

Do you know whom I can connect to validate this hazard?

…ul-war-hazard

Merge branch 'ROCm:amd-staging' into sanketp/gfx940-buffer-store-vpkm…

882a2c0

…ul-war-hazard

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMDGPU] Detect BUFFER_STORE source-vgpr WAR hazard on gfx940+ regard…#2446

[AMDGPU] Detect BUFFER_STORE source-vgpr WAR hazard on gfx940+ regard…#2446
panditsa wants to merge 2 commits into
ROCm:amd-stagingfrom
panditsa:sanketp/gfx940-buffer-store-vpkmul-war-hazard

panditsa commented May 7, 2026

Uh oh!

z1-cciauto commented May 7, 2026

Uh oh!

jayfoad commented May 8, 2026

Uh oh!

panditsa commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

panditsa commented May 7, 2026

Uh oh!

z1-cciauto commented May 7, 2026

Uh oh!

jayfoad commented May 8, 2026

Uh oh!

panditsa commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants