[opus] Add gfx1201 (Navi 48 / RDNA4) support: header parse + buffer rsrc + WMMA#3236

Open
carlushuang wants to merge 7 commits into main from carhuang/opus_gfx1201_parse_fix

Collaborator

@carlushuang carlushuang commented May 16, 2026

First-class gfx1201 (Navi 48 / RX 9070 XT, RDNA4) support in opus, in 5 reviewable commits:

| Commit | Layer | Effect |
|---|---|---|
| 70f6e4a | parse | fwd-decl mma adaptors → opus.hpp parses on gfx1201 |
| 9878ec9 | rsrc | gfx1200/gfx1201 branch in buffer_default_config → make_gmem<>.load/.store works |
| b8aa331 | wmma | extend opus::wmma<> with 8 wave32 16x16x16 _w32_gfx12 variants + unit tests |
| fdf9d87 | style | black + ruff on the new Python harness |
| cde14bb | style | condense the opus.hpp comments to file-local 1-line density |

All gates are per-arch (no __GFX12__). gfx1250 / gfx9x paths are bytewise unchanged.

What unblocks for gfx1201

  • Header parses → JIT builds work for every kernel that pulls in aiter_opus_plus.h (sample_kernels, moe_fused_gate, topk_gating, gated_rmsnorm_quant, topk_softmax_kernels_group, mhc, quant, quant_mxfp4, fused_qk_rmsnorm_group_quant, rope_common.h).
  • make_gmem<>.load/.store returns correct data instead of silently dropping stores / returning zeros (root cause: __gfx11__/__gfx12__ lowercase are typos — clang only predefines uppercase).
  • opus::wmma<>::operator() dispatches the gfx12 _w32_gfx12 builtin family.

WMMA coverage (all wave32 16x16x16, matches what gfx12 hw exposes for fp acc)

f32←f16, f32←bf16, f16←f16, bf16←bf16, f32←{fp8,bf8}×{fp8,bf8} (8 variants).
iu8 / iu4 deferred — opus has no iu8_t/iu4_t aliases yet.

Out of scope

The high-level make_tiled_mma / partition_layout_* path stays gfx1250-only. gfx12's fragment layout is asymmetric (A row-distributed, B/C column-distributed per AMD RDNA4 ISA §7.12.2 / CK wmma_gemm.hpp) and needs a dedicated wmma_adaptor_gfx12 — TODO inline. Callers can use opus::wmma<> directly today (the unit test does).

Tests (op_tests/opus/device/)

  • test_opus_gmem_gfx1201.cu — exercises opus::make_gmem<>.load<>/.store<> (was test_opus_parse_gfx1201.cu, renamed to reflect scope).
  • test_wmma_gfx1201.cu — 8 launchers, one per dtype combo. Each lane builds its A/B fragment per the gfx12 layout, calls opus::wmma<>::operator(), stores C back.
  • Both wired into setup.py (_CU_SOURCES) + test_opus_device.py (ctypes wrappers + test fns + main() dispatch + skip-on-non-gfx1201).

Verified on RX 9070 XT (gfx1201)

opus_gmem_gfx1201            PASS   max_diff=0.00e+00  (n=1310720)
wmma_gfx1201_f32_f16         PASS   max_diff=0.0000    (bit-exact)
wmma_gfx1201_f32_bf16        PASS   max_diff=0.0000
wmma_gfx1201_f16_f16         PASS   max_diff=0.0312    (1 ULP fp16)
wmma_gfx1201_bf16_bf16       PASS   max_diff=0.5000    (1 ULP bf16)
wmma_gfx1201_f32_fp8_fp8     PASS   max_diff=0.0000
wmma_gfx1201_f32_fp8_bf8     PASS   max_diff=0.0000
wmma_gfx1201_f32_bf8_fp8     PASS   max_diff=0.0000
wmma_gfx1201_f32_bf8_bf8     PASS   max_diff=0.0000

Integration: with this PR's opus.hpp in place, module_sample JIT-builds end-to-end for gfx1201 and aiter.mixed_sample_outer_exponential is bit-identical to torch Gumbel-max (15.4 µs/call warm avg).

Sources

On gfx1200 / gfx1201 (Navi 44 / Navi 48, RDNA4) device code, neither
__GFX9__ nor __gfx1250__ is active, so the inner mfma_adaptor and
wmma_adaptor definitions never get pulled in. That alone would be fine,
except make_tiled_mma() at csrc/include/opus/opus.hpp:3057-3063 names
both as default template arguments:

    template<..., typename WA =
    #if defined(__gfx1250__)
             wmma_adaptor,
    #else
             mfma_adaptor,
    #endif
             ...>

Name lookup runs at parse time even when no caller instantiates the
template, so on gfx1201 the header fails to compile with:

    csrc/include/opus/opus.hpp:3057:24: error: unknown type name 'mfma_adaptor'

This blocks JIT builds for every kernel that includes aiter_opus_plus.h,
e.g. sample_kernels.cu, moe_fused_gate.cu, topk_gating_kernels.cu,
gated_rmsnorm_quant_kernels.cu, topk_softmax_kernels_group.cu,
mhc_kernels.cu, quant_kernels.cu, quant_mxfp4.cu,
fused_qk_rmsnorm_group_quant.cu, and rope_common.h.

Fix: forward-declare mfma_adaptor / mfma_adaptor_swap_ab / wmma_adaptor /
wmma_adaptor_swap_ab as incomplete types at the top of the opus
namespace. Default-arg name lookup is satisfied for all archs;
instantiation still requires the full definition, so callers that
actually invoke make_tiled_mma() / make_mfma() / make_wmma() are
unaffected. Behavior on gfx1250 and gfx9x is unchanged.

Unit test: op_tests/opus/device/test_opus_parse_gfx1201.cu exercises the
opus utilities sample_kernels.cu actually uses (opus::vector_t and
opus::cast<float>) — pure template machinery, no HIP intrinsics — under
a gfx1201-gated kernel body. Wired into the existing test_opus_device.py
harness; skips on non-gfx1201 archs.

Verified on RX 9070 XT (gfx1201): test compiles + runs, max_diff=0.0
against torch reference. End-to-end check separately confirmed: with this
fix in place, module_sample JIT-builds for gfx1201 and
aiter.mixed_sample_outer_exponential produces bit-identical output to
torch Gumbel-max (15.4us avg over 1000 calls).
@carlushuang carlushuang requested a review from a team May 16, 2026 13:15
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
|---|---|
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3236 --add-label <label>

…t zero loads/stores)

opus::make_gmem<>.load<N>() / .store<N>() silently returned 0 / dropped
writes on gfx1201, because buffer_default_config() landed in the
0xffffffff fallback for that arch and the resulting __amdgpu_buffer_rsrc_t
was invalid. The HIP buffer_load_b32 / buffer_store_b32 intrinsics
themselves work on RDNA4 — the bug was purely in which config word the
header picked.

Root cause: the existing branch

    #elif defined(__gfx11__) || defined(__gfx12__) || defined(__gfx1250__)
        return 0x31004000;

uses __gfx11__ / __gfx12__ (lowercase) which clang does NOT predefine
(only the uppercase __GFX11__ / __GFX12__ exist). gfx1250 was already
covered by its explicit per-arch token; everything else in the gfx11x /
gfx12x families silently fell into the 0xffffffff sentinel branch.

Minimal fix: add explicit __gfx1201__ / __gfx1200__ checks alongside the
existing __gfx1250__ check, so Navi 44 / Navi 48 also get the correct
RDNA buffer rsrc config (0x31004000). The lowercase __gfx11__/__gfx12__
typo is left alone here — fixing it would also flip behavior for
gfx1100-1103 / gfx1150-1153, which is out of scope for the gfx1201
enablement work this branch covers.
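
The branch selection described above can be pictured with a toy Python model of the `#elif` condition. This is only an illustration: the function names are hypothetical, only the quoted branch and the 0xffffffff fallback are modelled (earlier branches of the real chain are omitted), and the macro sets are the ones this PR says clang predefines.

```python
# Toy model of the preprocessor branch in buffer_default_config().
# Hypothetical helper names; only the branch quoted in the commit message
# and the sentinel fallback are represented here.
RDNA_RSRC_CONFIG = 0x31004000
FALLBACK_CONFIG = 0xFFFFFFFF

# Macros named in the original condition. The lowercase family tokens are
# never predefined by clang, so only __gfx1250__ can ever match.
ORIGINAL_TOKENS = {"__gfx11__", "__gfx12__", "__gfx1250__"}
# The fix adds explicit per-arch tokens for Navi 44 / Navi 48.
FIXED_TOKENS = ORIGINAL_TOKENS | {"__gfx1200__", "__gfx1201__"}


def config_before_fix(defined):
    """Config word picked when `defined` is the set of predefined macros."""
    return RDNA_RSRC_CONFIG if defined & ORIGINAL_TOKENS else FALLBACK_CONFIG


def config_after_fix(defined):
    """Same selection with the explicit gfx1200 / gfx1201 checks added."""
    return RDNA_RSRC_CONFIG if defined & FIXED_TOKENS else FALLBACK_CONFIG


# Per-arch token plus uppercase family token, as stated in this PR.
GFX1201_MACROS = {"__gfx1201__", "__GFX12__"}
GFX1250_MACROS = {"__gfx1250__"}

print(hex(config_before_fix(GFX1201_MACROS)))  # falls through to the sentinel
print(hex(config_after_fix(GFX1201_MACROS)))   # picks the RDNA config word
```

In this model gfx1250 selects the same word before and after the change, which matches the "no change for gfx1250" claim.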

No change for gfx1250 (already used 0x31004000 via the explicit per-arch
check); no change for gfx9x (different branch entirely).

Unit test: test_opus_parse_gfx1201.cu now uses opus::make_gmem<>.load<4>
and .store<4> (the API users actually want working) instead of plain
pointer arithmetic. max_diff=0.0 against torch reference on
n=1310720 fp32 elements, gfx1201.

Integration check: with the buffer config fix in place, sample_kernels.cu
JIT-rebuilds cleanly and aiter.mixed_sample_outer_exponential still
produces bit-identical output to torch Gumbel-max — confirms the change
does not regress kernels that already worked (sample_kernels passes an
explicit size to make_gmem, so it tolerated the bad default config; now
both the default-config path and the explicit-size path work).
…variants

Extends opus::wmma<> to dispatch through the gfx12-specific
__builtin_amdgcn_wmma_*_w32_gfx12 family on gfx1201, in parallel with
the existing gfx1250 path. gfx12 builtins have a leaner argument
signature than gfx1250 (no matrix_fmts / neg_c slot, no opsel) so they
need their own dispatch macros.

Variants covered (all wave32 16x16x16; matches the breadth gfx12
exposes for fp / fp8 acc):

  - f32  <- f16  / f16
  - f32  <- bf16 / bf16
  - f16  <- f16  / f16
  - bf16 <- bf16 / bf16
  - f32  <- fp8  / fp8
  - f32  <- fp8  / bf8
  - f32  <- bf8  / fp8
  - f32  <- bf8  / bf8

iu8 / iu4 variants are deliberately deferred — opus has no iu8_t /
iu4_t dtype aliases yet, so wiring them needs a small separate change.

What is NOT touched on the wmma_adaptor / make_tiled_mma path: the
existing wmma_adaptor encoding (rows cross-thread along M) was
designed for gfx1250's WMMA fragment layout, which is row-distributed
for A and column-distributed for B / C. gfx12 has a different
asymmetry — A is row-distributed (lane[i] holds A[i%16, (i/16)*8 + j]
for j in [0,7]) while B and C are column-distributed (lane[i] holds
B[(i/16)*8 + j, i%16] / C[(i/16)*8 + j, i%16]). That asymmetry is
documented inline in test_wmma_gfx1201.cu but a dedicated
opus::wmma_adaptor_gfx12 specialization is needed before the
high-level tiled API can route gfx1201 — TODO comment added near
the wmma_adaptor block. The opus::wmma<> struct itself is fully
usable by callers who construct their own fragments (which is what
the unit test does).
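
The lane-to-element formulas above can be checked on the host with a small pure-Python emulation. This is a sketch that only mirrors the index math quoted in this commit message; all function names are hypothetical, and it says nothing about how the hardware packs elements inside registers.

```python
# Host-side emulation of the gfx12 wave32 16x16x16 fragment layout.
# No GPU involved: each "lane" is just an index 0..31.
M = N = K = 16
WAVE = 32
ELEMS = 8  # elements held per lane in each fragment


def a_coords(lane):
    # A is row-distributed: lane i holds A[i % 16, (i // 16) * 8 + j], j in [0, 7]
    return [(lane % 16, (lane // 16) * ELEMS + j) for j in range(ELEMS)]


def bc_coords(lane):
    # B and C are column-distributed: lane i holds X[(i // 16) * 8 + j, i % 16]
    return [((lane // 16) * ELEMS + j, lane % 16) for j in range(ELEMS)]


def covers_tile(coord_fn):
    """True if the 32 lanes together touch every element of a 16x16 tile exactly once."""
    seen = [c for lane in range(WAVE) for c in coord_fn(lane)]
    full = {(r, c) for r in range(16) for c in range(16)}
    return len(seen) == len(full) and set(seen) == full


def scatter(mat, coord_fn):
    """Split a 16x16 matrix into 32 per-lane fragments."""
    return [[mat[r][c] for (r, c) in coord_fn(lane)] for lane in range(WAVE)]


def gather(frags, coord_fn):
    """Reassemble a 16x16 matrix from 32 per-lane fragments."""
    out = [[None] * 16 for _ in range(16)]
    for lane in range(WAVE):
        for value, (r, c) in zip(frags[lane], coord_fn(lane)):
            out[r][c] = value
    return out


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


def emulated_wmma(a_frags, b_frags):
    """Reference for what one wave-level multiply should return per lane."""
    a = gather(a_frags, a_coords)
    b = gather(b_frags, bc_coords)
    return scatter(matmul(a, b), bc_coords)
```

A round trip through `scatter`, `emulated_wmma`, and `gather` should reproduce a plain matrix product, which is the same shape of check the unit test performs against the torch reference.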

Test (op_tests/opus/device/test_wmma_gfx1201.cu): one kernel per
dtype combo loads its lane-local A / B fragment per the gfx12
layout, calls opus::wmma<>::operator(), and stores the C fragment
back. Verified on RX 9070 XT (gfx1201) against torch matmul ref:

    f32  <- f16  / f16  : max_diff = 0.0000  (bit-exact)
    f32  <- bf16 / bf16 : max_diff = 0.0000
    f16  <- f16  / f16  : max_diff = 0.0312  (1 ULP fp16)
    bf16 <- bf16 / bf16 : max_diff = 0.5000  (1 ULP bf16)
    f32  <- fp8  / fp8  : max_diff = 0.0000
    f32  <- fp8  / bf8  : max_diff = 0.0000
    f32  <- bf8  / fp8  : max_diff = 0.0000
    f32  <- bf8  / bf8  : max_diff = 0.0000
    8/8 variants passed
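
The two non-zero differences are labelled as 1 ULP. A rough check of that arithmetic follows, assuming standard binary formats with 10 explicit mantissa bits for fp16 and 7 for bf16. The magnitude ranges in the comments are inferred from the ULP sizes and are not stated in this PR; the helper name `ulp` is illustrative.

```python
import math

FP16_MANTISSA_BITS = 10  # explicit fraction bits in IEEE binary16
BF16_MANTISSA_BITS = 7   # explicit fraction bits in bfloat16


def ulp(x, mantissa_bits):
    """Spacing between adjacent representable values near |x| (normal range only)."""
    _, exp = math.frexp(abs(x))  # abs(x) == m * 2**exp with 0.5 <= m < 1
    return 2.0 ** ((exp - 1) - mantissa_bits)


# A 1-ULP gap of 0.03125 in fp16 corresponds to |x| in [32, 64).
fp16_ulp = ulp(40.0, FP16_MANTISSA_BITS)
# A 1-ULP gap of 0.5 in bf16 corresponds to |x| in [64, 128).
bf16_ulp = ulp(100.0, BF16_MANTISSA_BITS)

print(fp16_ulp, bf16_ulp)
```

Under these assumptions the reported 0.0312 and 0.5000 are consistent with a single rounding step of the accumulator into the narrower output type.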

Also renames the earlier test_opus_parse_gfx1201.cu to
test_opus_gmem_gfx1201.cu — the test exercises opus::make_gmem
load / store on gfx1201 (enabled by the prior buffer_default_config
fix in this PR), not anything specifically about parsing. The new
name reflects the actual scope.
@carlushuang carlushuang changed the title [opus] Forward-declare mma adaptors so opus.hpp parses on gfx1201 [opus] Add gfx1201 (Navi 48 / RDNA4) support: header parse + buffer rsrc + WMMA May 16, 2026
CI black + ruff complained about the compact single-line wrapper and
helper defs added in the previous commit. This commit just runs black on
the file, which also resolves the ruff E701
(multiple-statements-on-one-line) warnings. No behavior change.
Same code paths, same dispatch — just tightens the multi-paragraph comment blocks I added in 70f6e4a / 9878ec9 / b8aa331 down to the 1-line-per-anchor density used everywhere else in opus.hpp. Diff vs origin/main shrinks ~166 → ~109 lines; 8/8 wmma variants and the gmem test still pass on RX 9070 XT.
Two existing test .cu files use opus _async_load, which calls
__builtin_amdgcn_raw_ptr_buffer_load_lds — that builtin needs the
vmem-to-lds-load-insts target feature (only present on gfx9x / gfx950
/ gfx1250). On gfx1201 the per-file hipcc compile errors out and that
fails the whole opus_device_test.so build, so before this fix
**0 of 72 opus device tests actually ran on gfx1201**.

setup.py: add _ARCH_SKIP_SOURCES that drops the two incompatible
files (test_async_load.cu, test_load_store_if.cu) from _CU_SOURCES
when arch is gfx1200 / gfx1201. The remaining 19 .cu files compile
fine for gfx1201 and the .so links cleanly.

test_opus_device.py: add a small _skip_if_missing_symbol() helper +
early-skip guard at the top of the 5 test functions whose extern "C"
launcher symbols come from those skipped files
(test_async_load, test_predicated_copy, test_predicated_copy_2d,
test_free_func_vector_add, test_predicated_async_load). They now
print SKIP cleanly instead of AttributeError-ing at runtime.

Result on gfx1201 (after this commit):
- 41 PASS  (includes all 9 new gfx1201 tests: 8 wmma + 1 gmem)
- 53 SKIP  (arch-gated mfma/wmma_1250/wmma_scale/mxfp/etc tests)
-  4 FAIL  (pre-existing fp8/bf8/bf16 ABI mismatches in
            dtype_convert_fp32_bf16, dtype_convert_fp32_bf16_vec4,
            numeric_limits, finfo — unrelated to this PR;
            tests assume fnuz fp8 semantics while gfx12 uses OCP)

Behavior on every other arch (gfx9x, gfx1250) is unchanged — those
archs are not in _ARCH_SKIP_SOURCES so all sources still compile and
all launchers exist, so the symbol-existence guard is a no-op.
… ISA)

gfx1200 (Navi 44) and gfx1201 (Navi 48) are siblings in the same RDNA4
family — they share the wmma-128b ISA and the same buffer rsrc format.
The buffer_default_config branch added in 9878ec9 already listed both;
this commit brings the wmma struct dispatch, the wmma class outer guard,
and the two unit-test guards in line.

Verification (no gfx1200 hardware available, so compile-only):

  1. clang predefines __GFX12__ for both archs; only the per-arch macro
     differs (__gfx1200__ vs __gfx1201__).
  2. LLVM gates all 8 __builtin_amdgcn_wmma_*_w32_gfx12 builtins on
     "wmma-128b-insts,wavefrontsize32" — a feature both gfx1200 and
     gfx1201 enable per AMDGPUSubtarget (only gfx1250 adds the bigger
     wmma-256b-insts shapes).
  3. Direct probe: a .cu calling all 8 builtins compiles for both
     --offload-arch=gfx1200 and --offload-arch=gfx1201 with no errors.
  4. test_wmma_gfx1201.cu now builds for both archs producing identical
     49432-byte .so files with all 8 run_wmma_gfx1201_* launcher symbols.

Risks of broadening: low. If real gfx1200 hardware turns out to differ
semantically we would see it as wrong WMMA outputs (not a build break)
and can narrow back to per-arch in one line. The alternative — leaving
Navi 44 unsupported in opus while it shares the exact same gfx12 wmma
ISA — would be worse for downstream consumers.

Verified on gfx1201 (RX 9070 XT): all 9 gfx1201 tests still pass.

  PASS: opus_gmem_gfx1201           max_diff=0.00e+00
  PASS: wmma_gfx1201_f32_f16        max_diff=3.81e-06
  PASS: wmma_gfx1201_f32_bf16       max_diff=1.91e-06
  PASS: wmma_gfx1201_f16_f16        max_diff=3.13e-02  (1 ULP fp16)
  PASS: wmma_gfx1201_bf16_bf16      max_diff=5.00e-01  (1 ULP bf16)
  PASS: wmma_gfx1201_f32_fp8_fp8    max_diff=0.00e+00
  PASS: wmma_gfx1201_f32_fp8_bf8    max_diff=0.00e+00
  PASS: wmma_gfx1201_f32_bf8_fp8    max_diff=0.00e+00
  PASS: wmma_gfx1201_f32_bf8_bf8    max_diff=0.00e+00