[opus] Add gfx1201 (Navi 48 / RDNA4) support: header parse + buffer rsrc + WMMA by carlushuang · Pull Request #3236 · ROCm/aiter

carlushuang · 2026-05-16T13:15:08Z

First-class gfx1201 (Navi 48 / RX 9070 XT, RDNA4) support in opus, in 5 reviewable commits:

| Commit | Layer | Effect |
| --- | --- | --- |
| `70f6e4a` | parse | fwd-decl mma adaptors → `opus.hpp` parses on gfx1201 |
| `9878ec9` | rsrc | gfx1200/gfx1201 branch in `buffer_default_config` → `make_gmem<>.load/.store` works |
| `b8aa331` | wmma | extend `opus::wmma<>` with 8 wave32 16x16x16 `_w32_gfx12` variants + unit tests |
| `fdf9d87` | style | black + ruff on the new Python harness |
| `cde14bb` | style | condense the opus.hpp comments to file-local 1-line density |

All gates are per-arch (no __GFX12__). gfx1250 / gfx9x paths are bytewise unchanged.

What unblocks for gfx1201

Header parses → JIT builds work for every kernel that pulls in aiter_opus_plus.h (sample_kernels, moe_fused_gate, topk_gating, gated_rmsnorm_quant, topk_softmax_kernels_group, mhc, quant, quant_mxfp4, fused_qk_rmsnorm_group_quant, rope_common.h).
make_gmem<>.load/.store returns correct data instead of silently dropping stores / returning zeros (root cause: __gfx11__/__gfx12__ lowercase are typos — clang only predefines uppercase).
opus::wmma<>::operator() dispatches the gfx12 _w32_gfx12 builtin family.

WMMA coverage (all wave32 16x16x16, matches what gfx12 hw exposes for fp acc)

f32←f16, f32←bf16, f16←f16, bf16←bf16, f32←{fp8,bf8}×{fp8,bf8} (8 variants).
iu8 / iu4 deferred — opus has no iu8_t/iu4_t aliases yet.

Out of scope

The high-level make_tiled_mma / partition_layout_* path stays gfx1250-only. gfx12's fragment layout is asymmetric (A row-distributed, B/C column-distributed per AMD RDNA4 ISA §7.12.2 / CK wmma_gemm.hpp) and needs a dedicated wmma_adaptor_gfx12 — TODO inline. Callers can use opus::wmma<> directly today (the unit test does).

Tests (`op_tests/opus/device/`)

test_opus_gmem_gfx1201.cu — exercises opus::make_gmem<>.load<>/.store<> (was test_opus_parse_gfx1201.cu, renamed to reflect scope).
test_wmma_gfx1201.cu — 8 launchers, one per dtype combo. Each lane builds its A/B fragment per the gfx12 layout, calls opus::wmma<>::operator(), stores C back.
Both wired into setup.py (_CU_SOURCES) + test_opus_device.py (ctypes wrappers + test fns + main() dispatch + skip-on-non-gfx1201).

Verified on RX 9070 XT (gfx1201)

opus_gmem_gfx1201            PASS   max_diff=0.00e+00  (n=1310720)
wmma_gfx1201_f32_f16         PASS   max_diff=0.0000    (bit-exact)
wmma_gfx1201_f32_bf16        PASS   max_diff=0.0000
wmma_gfx1201_f16_f16         PASS   max_diff=0.0312    (1 ULP fp16)
wmma_gfx1201_bf16_bf16       PASS   max_diff=0.5000    (1 ULP bf16)
wmma_gfx1201_f32_fp8_fp8     PASS   max_diff=0.0000
wmma_gfx1201_f32_fp8_bf8     PASS   max_diff=0.0000
wmma_gfx1201_f32_bf8_fp8     PASS   max_diff=0.0000
wmma_gfx1201_f32_bf8_bf8     PASS   max_diff=0.0000
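
The "1 ULP" annotations above can be sanity-checked with a little arithmetic. The sketch below is not part of the PR; it assumes IEEE binary16 (10 fraction bits) and bfloat16 (7 fraction bits) and computes the spacing between adjacent representable values near a given magnitude. The sample magnitudes 40 and 100 are hypothetical: the PR does not state how large the test outputs are, so the ranges in the comments are an inference from the reported `max_diff` values. All names in `demo` are illustrative.

```cpp
namespace demo {

// 2^e for a small integer exponent, exact in double.
constexpr double pow2(int e) {
    return e == 0 ? 1.0 : (e > 0 ? 2.0 * pow2(e - 1) : 0.5 * pow2(e + 1));
}

// floor(log2(x)) for finite x > 0.
constexpr int floor_log2(double x) {
    return x >= 2.0 ? 1 + floor_log2(x / 2.0)
         : x < 1.0  ? floor_log2(x * 2.0) - 1
                    : 0;
}

// Spacing between adjacent normal values near x for a format with
// `fraction_bits` explicit mantissa bits: 2^(floor(log2 x) - fraction_bits).
constexpr double ulp(double x, int fraction_bits) {
    return pow2(floor_log2(x) - fraction_bits);
}

constexpr int kFp16FractionBits = 10;  // IEEE binary16
constexpr int kBf16FractionBits = 7;   // bfloat16

// 0.03125 is one fp16 step for any magnitude in [32, 64).
static_assert(ulp(40.0, kFp16FractionBits) == 0.03125, "fp16 ULP in [32, 64)");
// 0.5 is one bf16 step for any magnitude in [64, 128).
static_assert(ulp(100.0, kBf16FractionBits) == 0.5, "bf16 ULP in [64, 128)");

}  // namespace demo
```

Under these assumptions, a `max_diff` of 0.0312 for the f16 accumulator and 0.5000 for the bf16 accumulator is what a single rounding step would produce when the output type is narrower than the f32 reference.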

Integration: with this PR's opus.hpp in place, module_sample JIT-builds end-to-end for gfx1201 and aiter.mixed_sample_outer_exponential is bit-identical to torch Gumbel-max (15.4 µs/call warm avg).

Sources

AMD RDNA4 ISA Reference (Apr 2025), §7.12.2 Matrix Element Storage in VGPRs
LLVM BuiltinsAMDGPU.td — gfx12 WMMA builtin signatures
Community rdna4-wmma-guide — C/D layout
AMD GPUOpen RDNA4 Matrix Cores

On gfx1200 / gfx1201 (Navi 44 / Navi 48, RDNA4) device code, neither __GFX9__ nor __gfx1250__ is active, so the inner mfma_adaptor and wmma_adaptor definitions never get pulled in. That alone would be fine, except make_tiled_mma() at csrc/include/opus/opus.hpp:3057-3063 names both as default template arguments:

    template<..., typename WA =
    #if defined(__gfx1250__)
        wmma_adaptor,
    #else
        mfma_adaptor,
    #endif
        ...>

Name lookup runs at parse time even when no caller instantiates the template, so on gfx1201 the header fails to compile with:

    csrc/include/opus/opus.hpp:3057:24: error: unknown type name 'mfma_adaptor'

This blocks JIT builds for every kernel that includes aiter_opus_plus.h, e.g. sample_kernels.cu, moe_fused_gate.cu, topk_gating_kernels.cu, gated_rmsnorm_quant_kernels.cu, topk_softmax_kernels_group.cu, mhc_kernels.cu, quant_kernels.cu, quant_mxfp4.cu, fused_qk_rmsnorm_group_quant.cu, and rope_common.h.

Fix: forward-declare mfma_adaptor / mfma_adaptor_swap_ab / wmma_adaptor / wmma_adaptor_swap_ab as incomplete types at the top of the opus namespace. Default-arg name lookup is satisfied for all archs; instantiation still requires the full definition, so callers that actually invoke make_tiled_mma() / make_mfma() / make_wmma() are unaffected. Behavior on gfx1250 and gfx9x is unchanged.

Unit test: op_tests/opus/device/test_opus_parse_gfx1201.cu exercises the opus utilities sample_kernels.cu actually uses (opus::vector_t and opus::cast<float>), which are pure template machinery with no HIP intrinsics, under a gfx1201-gated kernel body. Wired into the existing test_opus_device.py harness; skips on non-gfx1201 archs.

Verified on RX 9070 XT (gfx1201): test compiles + runs, max_diff=0.0 against torch reference. End-to-end check separately confirmed: with this fix in place, module_sample JIT-builds for gfx1201 and aiter.mixed_sample_outer_exponential produces bit-identical output to torch Gumbel-max (15.4us avg over 1000 calls).
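
The parse-time failure and the forward-declaration fix described in this commit message can be reproduced on the host with a stripped-down sketch. Everything in the `demo` namespace is hypothetical: the adaptors are assumed to be plain structs, and `make_tiled_mma_stub`, `adaptor_of`, and `toy_adaptor` are illustrative stand-ins rather than the real `opus.hpp` declarations. The point is only that a non-dependent name used as a default template argument must be declared when the template is parsed, while a complete type is needed only at instantiation.

```cpp
#include <type_traits>

namespace demo {

// Forward declarations: incomplete types, visible on every arch.
// Removing these two lines reproduces "unknown type name 'mfma_adaptor'"
// whenever the real definitions are compiled out.
struct mfma_adaptor;
struct wmma_adaptor;

// Pick the default adaptor per arch, mirroring the #if quoted above.
#if defined(__gfx1250__)
using default_adaptor = wmma_adaptor;
#else
using default_adaptor = mfma_adaptor;
#endif

// The default argument is a non-dependent name, so it is looked up when
// this template is parsed. Adaptor::tile_m is a dependent name and is only
// checked when the template is instantiated.
template <typename T, typename Adaptor = default_adaptor>
constexpr int make_tiled_mma_stub() {
    return Adaptor::tile_m;
}

// Naming the default (still incomplete) adaptor is fine as long as no
// member of it is used.
template <typename Adaptor = default_adaptor>
struct adaptor_of {
    using type = Adaptor;
};

// A complete adaptor, standing in for an arch where the definition exists.
struct toy_adaptor {
    static constexpr int tile_m = 16;
};

static_assert(std::is_same<adaptor_of<>::type, default_adaptor>::value,
              "default argument resolves to the forward-declared type");
static_assert(make_tiled_mma_stub<float, toy_adaptor>() == 16,
              "instantiation with a complete adaptor works");

}  // namespace demo
```

Calling `make_tiled_mma_stub<float>()` with the default (incomplete) adaptor would still fail, but only at the call site, which matches the commit's claim that instantiation still requires the full definition.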

github-actions · 2026-05-16T13:15:30Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3236 --add-label <label>

…t zero loads/stores)

opus::make_gmem<>.load<N>() / .store<N>() silently returned 0 / dropped writes on gfx1201, because buffer_default_config() landed in the 0xffffffff fallback for that arch and the resulting __amdgpu_buffer_rsrc_t was invalid. The HIP buffer_load_b32 / buffer_store_b32 intrinsics themselves work on RDNA4; the bug was purely in which config word the header picked.

Root cause: the existing branch

    #elif defined(__gfx11__) || defined(__gfx12__) || defined(__gfx1250__)
        return 0x31004000;

uses __gfx11__ / __gfx12__ (lowercase), which clang does NOT predefine (only the uppercase __GFX11__ / __GFX12__ exist). gfx1250 was already covered by its explicit per-arch token; everything else in the gfx11x / gfx12x families silently fell into the 0xffffffff sentinel branch.

Minimal fix: add explicit __gfx1201__ / __gfx1200__ checks alongside the existing __gfx1250__ check, so Navi 44 / Navi 48 also get the correct RDNA buffer rsrc config (0x31004000). The lowercase __gfx11__/__gfx12__ typo is left alone here: fixing it would also flip behavior for gfx1100-1103 / gfx1150-1153, which is out of scope for the gfx1201 enablement work this branch covers.

No change for gfx1250 (already used 0x31004000 via the explicit per-arch check); no change for gfx9x (different branch entirely).

Unit test: test_opus_parse_gfx1201.cu now uses opus::make_gmem<>.load<4> and .store<4> (the API users actually want working) instead of plain pointer arithmetic. max_diff=0.0 against torch reference on n=1310720 fp32 elements, gfx1201.

Integration check: with the buffer config fix in place, sample_kernels.cu JIT-rebuilds cleanly and aiter.mixed_sample_outer_exponential still produces bit-identical output to torch Gumbel-max. This confirms the change does not regress kernels that already worked (sample_kernels passes an explicit size to make_gmem, so it tolerated the bad default config; now both the default-config path and the explicit-size path work).
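
The reason this bug produced no diagnostic is that `defined()` on a macro nobody defines is simply false. A host-side sketch of the before/after gate, under loud assumptions: the `DEMO_*` macros are hypothetical stand-ins for what the compiler would predefine when targeting gfx1201 (the real tokens are compiler-provided and cannot be set portably on the host), and `config_before` / `config_after` are illustrative names, not the actual `buffer_default_config()` body. The two constants are the values quoted in the commit message.

```cpp
#include <cstdint>

namespace demo {

// Stand-in for the per-arch macro the compiler predefines for gfx1201.
// Nothing stands in for the lowercase family macro, because per the
// commit message clang never defines it.
#define DEMO_gfx1201 1

constexpr std::uint32_t kRdnaConfig = 0x31004000u;  // value quoted in the PR
constexpr std::uint32_t kSentinel   = 0xffffffffu;  // fallback quoted in the PR

// Before: the family check names a macro that is never defined, so the
// condition is false and the sentinel branch is taken without any warning.
constexpr std::uint32_t config_before() {
#if defined(DEMO_gfx12) || defined(DEMO_gfx1250)
    return kRdnaConfig;
#else
    return kSentinel;
#endif
}

// After: explicit per-arch checks sit next to the existing ones.
constexpr std::uint32_t config_after() {
#if defined(DEMO_gfx1200) || defined(DEMO_gfx1201) || \
    defined(DEMO_gfx12) || defined(DEMO_gfx1250)
    return kRdnaConfig;
#else
    return kSentinel;
#endif
}

static_assert(config_before() == kSentinel, "mistyped gate falls through");
static_assert(config_after() == kRdnaConfig, "per-arch gate matches");

}  // namespace demo
```

On a real toolchain, the set of predefined macros can usually be listed by preprocessing an empty file with `-dM -E` for the target in question (for example `clang --target=amdgcn-amd-amdhsa -mcpu=gfx1201 -dM -E -x c /dev/null`, assuming a clang recent enough to know gfx1201), which is a quick way to confirm which spelling actually exists before gating on it.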

…variants

Extends opus::wmma<> to dispatch through the gfx12-specific __builtin_amdgcn_wmma_*_w32_gfx12 family on gfx1201, in parallel with the existing gfx1250 path. gfx12 builtins have a leaner argument signature than gfx1250 (no matrix_fmts / neg_c slot, no opsel), so they need their own dispatch macros.

Variants covered (all wave32 16x16x16; matches the breadth gfx12 exposes for fp / fp8 acc):

- f32 <- f16 / f16
- f32 <- bf16 / bf16
- f16 <- f16 / f16
- bf16 <- bf16 / bf16
- f32 <- fp8 / fp8
- f32 <- fp8 / bf8
- f32 <- bf8 / fp8
- f32 <- bf8 / bf8

iu8 / iu4 variants are deliberately deferred: opus has no iu8_t / iu4_t dtype aliases yet, so wiring them needs a small separate change.

What is NOT touched on the wmma_adaptor / make_tiled_mma path: the existing wmma_adaptor encoding (rows cross-thread along M) was designed for gfx1250's WMMA fragment layout, which is row-distributed for A and column-distributed for B / C. gfx12 has a different asymmetry: A is row-distributed (lane[i] holds A[i%16, (i/16)*8 + j] for j in [0,7]) while B and C are column-distributed (lane[i] holds B[(i/16)*8 + j, i%16] / C[(i/16)*8 + j, i%16]). That asymmetry is documented inline in test_wmma_gfx1201.cu, but a dedicated opus::wmma_adaptor_gfx12 specialization is needed before the high-level tiled API can route gfx1201; a TODO comment was added near the wmma_adaptor block. The opus::wmma<> struct itself is fully usable by callers who construct their own fragments (which is what the unit test does).

Test (op_tests/opus/device/test_wmma_gfx1201.cu): one kernel per dtype combo loads its lane-local A / B fragment per the gfx12 layout, calls opus::wmma<>::operator(), and stores the C fragment back.

Verified on RX 9070 XT (gfx1201) against torch matmul ref:

    f32  <- f16  / f16  : max_diff = 0.0000 (bit-exact)
    f32  <- bf16 / bf16 : max_diff = 0.0000
    f16  <- f16  / f16  : max_diff = 0.0312 (1 ULP fp16)
    bf16 <- bf16 / bf16 : max_diff = 0.5000 (1 ULP bf16)
    f32  <- fp8  / fp8  : max_diff = 0.0000
    f32  <- fp8  / bf8  : max_diff = 0.0000
    f32  <- bf8  / fp8  : max_diff = 0.0000
    f32  <- bf8  / bf8  : max_diff = 0.0000
    8/8 variants passed

Also renames the earlier test_opus_parse_gfx1201.cu to test_opus_gmem_gfx1201.cu. The test exercises opus::make_gmem load / store on gfx1201 (enabled by the prior buffer_default_config fix in this PR), not anything specifically about parsing, and the new name reflects the actual scope.
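
The lane-to-element formulas quoted in this commit message can be written down as index functions and checked on the host. The sketch below is not taken from `test_wmma_gfx1201.cu`; the names (`coord`, `a_coord`, `bc_coord`, `covers_tile_once`) are hypothetical, and it only encodes the mapping as the commit message states it, then verifies that 32 lanes times 8 slots touch each of the 256 tile elements exactly once. It needs C++14 or later for the `constexpr` loops.

```cpp
namespace demo {

struct coord {
    int row;
    int col;
};

constexpr int kLanes = 32;    // wave32
constexpr int kPerLane = 8;   // 16 * 16 elements spread over 32 lanes
constexpr int kTile = 16;     // 16x16 fragment

// A fragment: lane i, slot j holds A[i % 16, (i / 16) * 8 + j].
constexpr coord a_coord(int lane, int j) {
    return coord{lane % 16, (lane / 16) * 8 + j};
}

// B and C fragments: lane i, slot j holds X[(i / 16) * 8 + j, i % 16].
constexpr coord bc_coord(int lane, int j) {
    return coord{(lane / 16) * 8 + j, lane % 16};
}

// True if the 32 x 8 (lane, slot) pairs hit every tile element exactly once.
constexpr bool covers_tile_once(bool a_layout) {
    int hits[kTile][kTile] = {};
    for (int lane = 0; lane < kLanes; ++lane) {
        for (int j = 0; j < kPerLane; ++j) {
            const coord c = a_layout ? a_coord(lane, j) : bc_coord(lane, j);
            ++hits[c.row][c.col];
        }
    }
    for (int r = 0; r < kTile; ++r) {
        for (int c = 0; c < kTile; ++c) {
            if (hits[r][c] != 1) return false;
        }
    }
    return true;
}

static_assert(covers_tile_once(true), "A layout covers the tile once");
static_assert(covers_tile_once(false), "B/C layout covers the tile once");

// Lane 17 sits in the upper half-wave: it owns columns 8..15 of row 1 in A,
// and rows 8..15 of column 1 in B / C.
static_assert(a_coord(17, 3).row == 1 && a_coord(17, 3).col == 11, "A, lane 17");
static_assert(bc_coord(17, 3).row == 11 && bc_coord(17, 3).col == 1, "B/C, lane 17");

}  // namespace demo
```

A caller building fragments by hand would use functions like these to decide which global element each lane loads into slot `j`, and the same `bc_coord` mapping to scatter the C fragment back.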

CI black + ruff complained about the compact single-line wrapper and helper defs added in the previous commit. This commit just runs black on the file, which also resolves the ruff E701 (multiple statements on one line) warnings. No behavior change.

Same code paths, same dispatch — just tightens the multi-paragraph comment blocks I added in 70f6e4a / 9878ec9 / b8aa331 down to the 1-line-per-anchor density used everywhere else in opus.hpp. Diff vs origin/main shrinks ~166 → ~109 lines; 8/8 wmma variants and the gmem test still pass on RX 9070 XT.

Two existing test .cu files use opus _async_load, which calls __builtin_amdgcn_raw_ptr_buffer_load_lds; that builtin needs the vmem-to-lds-load-insts target feature (only present on gfx9x / gfx950 / gfx1250). On gfx1201 the per-file hipcc compile errors out and that fails the whole opus_device_test.so build, so before this fix **0 of 72 opus device tests actually ran on gfx1201**.

setup.py: add _ARCH_SKIP_SOURCES that drops the two incompatible files (test_async_load.cu, test_load_store_if.cu) from _CU_SOURCES when arch is gfx1200 / gfx1201. The remaining 19 .cu files compile fine for gfx1201 and the .so links cleanly.

test_opus_device.py: add a small _skip_if_missing_symbol() helper + early-skip guard at the top of the 5 test functions whose extern "C" launcher symbols come from those skipped files (test_async_load, test_predicated_copy, test_predicated_copy_2d, test_free_func_vector_add, test_predicated_async_load). They now print SKIP cleanly instead of AttributeError-ing at runtime.

Result on gfx1201 (after this commit):

- 41 PASS (includes all 9 new gfx1201 tests: 8 wmma + 1 gmem)
- 53 SKIP (arch-gated mfma/wmma_1250/wmma_scale/mxfp/etc tests)
- 4 FAIL (pre-existing fp8/bf8/bf16 ABI mismatches in dtype_convert_fp32_bf16, dtype_convert_fp32_bf16_vec4, numeric_limits, finfo; unrelated to this PR, since the tests assume fnuz fp8 semantics while gfx12 uses OCP)

Behavior on every other arch (gfx9x, gfx1250) is unchanged: those archs are not in _ARCH_SKIP_SOURCES, so all sources still compile and all launchers exist, and the symbol-existence guard is a no-op.

… ISA)

gfx1200 (Navi 44) and gfx1201 (Navi 48) are siblings in the same RDNA4 family: they share the wmma-128b ISA and the same buffer rsrc format. The buffer_default_config branch added in 9878ec9 already listed both; this commit brings the wmma struct dispatch, the wmma class outer guard, and the two unit-test guards in line.

Verification (no gfx1200 hardware available, so compile-only):

1. clang predefines __GFX12__ for both archs; only the per-arch macro differs (__gfx1200__ vs __gfx1201__).
2. LLVM gates all 8 __builtin_amdgcn_wmma_*_w32_gfx12 builtins on "wmma-128b-insts,wavefrontsize32", a feature both gfx1200 and gfx1201 enable per AMDGPUSubtarget (only gfx1250 adds the bigger wmma-256b-insts shapes).
3. Direct probe: a .cu calling all 8 builtins compiles for both --offload-arch=gfx1200 and --offload-arch=gfx1201 with no errors.
4. test_wmma_gfx1201.cu now builds for both archs producing identical 49432-byte .so files with all 8 run_wmma_gfx1201_* launcher symbols.

Risks of broadening: low. If real gfx1200 hardware turns out to differ semantically we would see it as wrong WMMA outputs (not a build break) and can narrow back to per-arch in one line. The alternative, leaving Navi 44 unsupported in opus while it shares the exact same gfx12 wmma ISA, would be worse for downstream consumers.

Verified on gfx1201 (RX 9070 XT): all 9 gfx1201 tests still pass.

    PASS: opus_gmem_gfx1201        max_diff=0.00e+00
    PASS: wmma_gfx1201_f32_f16     max_diff=3.81e-06
    PASS: wmma_gfx1201_f32_bf16    max_diff=1.91e-06
    PASS: wmma_gfx1201_f16_f16     max_diff=3.13e-02 (1 ULP fp16)
    PASS: wmma_gfx1201_bf16_bf16   max_diff=5.00e-01 (1 ULP bf16)
    PASS: wmma_gfx1201_f32_fp8_fp8 max_diff=0.00e+00
    PASS: wmma_gfx1201_f32_fp8_bf8 max_diff=0.00e+00
    PASS: wmma_gfx1201_f32_bf8_fp8 max_diff=0.00e+00
    PASS: wmma_gfx1201_f32_bf8_bf8 max_diff=0.00e+00

carlushuang requested a review from a team May 16, 2026 13:15

carlushuang added 2 commits May 16, 2026 21:43

carlushuang changed the title ~~[opus] Forward-declare mma adaptors so opus.hpp parses on gfx1201~~ [opus] Add gfx1201 (Navi 48 / RDNA4) support: header parse + buffer rsrc + WMMA May 16, 2026

carlushuang added 4 commits May 16, 2026 23:13

carlushuang mentioned this pull request May 16, 2026

opus_attn_gfx1201: flash attention forward for Navi 48 (RDNA4 WMMA) carlushuang/gcnasm#24

Open

carlushuang wants to merge 7 commits into main from carhuang/opus_gfx1201_parse_fix
