Skip to content

[AMDGPU] Detect BUFFER_STORE source-vgpr WAR hazard on gfx940+ regard…#2446

Draft
panditsa wants to merge 2 commits into
ROCm:amd-stagingfrom
panditsa:sanketp/gfx940-buffer-store-vpkmul-war-hazard
Draft

[AMDGPU] Detect BUFFER_STORE source-vgpr WAR hazard on gfx940+ regard…#2446
panditsa wants to merge 2 commits into
ROCm:amd-stagingfrom
panditsa:sanketp/gfx940-buffer-store-vpkmul-war-hazard

Conversation

@panditsa
Copy link
Copy Markdown

@panditsa panditsa commented May 7, 2026

…less of soffset

createsVALUHazard previously gated the MUBUF/MTBUF source-vgpr WAR hazard to fire only when SOFFSET was a literal or absent. On gfx940-family subtargets (gfx942, gfx950) this is too narrow: the hazard fires equally when SOFFSET is sourced from an SGPR.

Concretely, on gfx950 a sequence of the form

buffer_store_dwordx4 v[X:X+3], voff, descr, sN offen
v_pk_mul_f32 v[X:X+1], , # next VALU cycle

deterministically commits the post-pk_mul value of v[X+1] to memory for the second dword of the dwordx4 store; the other three dwords store correctly. checkVALUHazardsHelper already returns the right 2-wait-state cure for gfx940 family, so widening createsVALUHazard's trigger is sufficient.

Empirically reproduced on AMD Instinct MI350X (gfx950) by Triton's fused-attention backward kernel. Adding a single S_NOP 1 (or any 2-cycle bubble) between the store and the v_pk_mul makes the corruption go away; literal-soffset stores were already covered by the pre-existing rule which is why this hazard had not shown up before. The new MIR test covers both literal and SGPR soffset on gfx900 (older), gfx942 and gfx950, plus negative cases (non-overlapping write, dwordx2 store).

Assisted-by: Cursor cursoragent@cursor.com

…less of soffset

createsVALUHazard previously gated the MUBUF/MTBUF source-vgpr WAR hazard
to fire only when SOFFSET was a literal or absent. On gfx940-family
subtargets (gfx942, gfx950) this is too narrow: the hazard fires equally
when SOFFSET is sourced from an SGPR.

Concretely, on gfx950 a sequence of the form

  buffer_store_dwordx4 v[X:X+3], voff, descr, sN offen
  v_pk_mul_f32 v[X:X+1], <src>, <src>           # next VALU cycle

deterministically commits the post-pk_mul value of v[X+1] to memory for
the second dword of the dwordx4 store; the other three dwords store
correctly. checkVALUHazardsHelper already returns the right
2-wait-state cure for gfx940 family, so widening createsVALUHazard's
trigger is sufficient.

Empirically reproduced on AMD Instinct MI350X (gfx950) by Triton's
fused-attention backward kernel. Adding a single S_NOP 1 (or any 2-cycle
bubble) between the store and the v_pk_mul makes the corruption go away;
literal-soffset stores were already covered by the pre-existing rule
which is why this hazard had not shown up before. The new MIR test
covers both literal and SGPR soffset on gfx900 (older), gfx942 and
gfx950, plus negative cases (non-overlapping write, dwordx2 store).

Assisted-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
@z1-cciauto
Copy link
Copy Markdown
Collaborator

@jayfoad
Copy link
Copy Markdown

jayfoad commented May 8, 2026

The patch seems reasonable and should be submitted upstream at https://github.com/llvm/llvm-project/pulls

However...

On gfx940-family subtargets (gfx942, gfx950) this is too narrow: the hazard fires equally when SOFFSET is sourced from an SGPR.

Is this confirmed by the hardware team? Is it documented anywhere? I'm looking at the GFX9 SPG section 3.1.2 "Manually Inserted Wait States (NOPs)", row 8A in the table, and it clearly says:

BUFFER_STORE_* operations
which use an SGPR for ‘offset’ do
not require any wait states.

with no exceptions.

@panditsa
Copy link
Copy Markdown
Author

The patch seems reasonable and should be submitted upstream at https://github.com/llvm/llvm-project/pulls

However...

On gfx940-family subtargets (gfx942, gfx950) this is too narrow: the hazard fires equally when SOFFSET is sourced from an SGPR.

Is this confirmed by the hardware team? Is it documented anywhere? I'm looking at the GFX9 SPG section 3.1.2 "Manually Inserted Wait States (NOPs)", row 8A in the table, and it clearly says:

BUFFER_STORE_* operations
which use an SGPR for ‘offset’ do
not require any wait states.

with no exceptions.

Hi @jayfoad I didn't see anything in Shader's Programming Guide but I have a sample asm test that recreates this bug.

        // fused: zero cycles between the buffer_store_dwordx4 and the
        // v_pk_mul_f32 that overwrites v[0:1]. Reproduces the hazard.
        // Uncomment the "s_nop 0" below to reproduce the 1nop fix.
        asm volatile(
            "s_mov_b64 s[16:17], %[base]\n"
            "s_and_b32 s17, s17, 0xFFFF\n"
            "s_mov_b32 s18, 0x7FFFFFFE\n"
            "s_mov_b32 s19, 0x00027000\n"
            "s_mov_b32 s20, 0\n"
            "v_mov_b32_e32 v100, %[voff]\n"
            "v_cvt_pk_f16_f32 v0, %[a], %[b]\n"
            "v_cvt_pk_f16_f32 v1, %[c], %[d]\n"
            "v_cvt_pk_f16_f32 v2, %[e], %[f]\n"
            "v_cvt_pk_f16_f32 v3, %[g], %[h]\n"
            "buffer_store_dwordx4 v[0:3], v100, s[16:19], s20 offen\n"
            // "s_nop 0\n"
            "v_pk_mul_f32 v[0:1], %[scale_pair], v[6:7]\n"
            "v_pk_mul_f32 v[2:3], %[scale_pair], v[8:9]\n"
            : : [base] "s"(reinterpret_cast<uint64_t>(out)),
                [voff] "s"(voff),
                [a] "v"(a), [b] "v"(b),
                [c] "v"(c), [d] "v"(d),
                [e] "v"(e), [f] "v"(f),
                [g] "v"(g), [h] "v"(h),
                [scale_pair] "s"(scale_pair)
            : "v0","v1","v2","v3","v100",
              "s16","s17","s18","s19","s20","memory"
        );

Do you know whom I can connect to validate this hazard?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants