[opus] Add gfx1201 (Navi 48 / RDNA4) support: header parse + buffer rsrc + WMMA #3236
Open
carlushuang wants to merge 7 commits into
On gfx1200 / gfx1201 (Navi 44 / Navi 48, RDNA4) device code, neither
__GFX9__ nor __gfx1250__ is active, so the inner mfma_adaptor and
wmma_adaptor definitions never get pulled in. That alone would be fine,
except make_tiled_mma() at csrc/include/opus/opus.hpp:3057-3063 names
both as default template arguments:
template<..., typename WA =
#if defined(__gfx1250__)
wmma_adaptor,
#else
mfma_adaptor,
#endif
...>
Name lookup runs at parse time even when no caller instantiates the
template, so on gfx1201 the header fails to compile with:
csrc/include/opus/opus.hpp:3057:24: error: unknown type name 'mfma_adaptor'
This blocks JIT builds for every kernel that includes aiter_opus_plus.h,
e.g. sample_kernels.cu, moe_fused_gate.cu, topk_gating_kernels.cu,
gated_rmsnorm_quant_kernels.cu, topk_softmax_kernels_group.cu,
mhc_kernels.cu, quant_kernels.cu, quant_mxfp4.cu,
fused_qk_rmsnorm_group_quant.cu, and rope_common.h.
Fix: forward-declare mfma_adaptor / mfma_adaptor_swap_ab / wmma_adaptor /
wmma_adaptor_swap_ab as incomplete types at the top of the opus
namespace. Default-arg name lookup is satisfied for all archs;
instantiation still requires the full definition, so callers that
actually invoke make_tiled_mma() / make_mfma() / make_wmma() are
unaffected. Behavior on gfx1250 and gfx9x is unchanged.
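The mechanism can be shown in isolation with standard C++. This is a minimal sketch, not the opus code: mfma_like_adaptor, explicit_adaptor, make_tiled_sketch and the SKETCH_HAS_MFMA macro are hypothetical names invented for illustration. It shows that a forward declaration is enough for a default template argument to parse, while a complete type is only needed by callers that actually instantiate the template with it.

```cpp
#include <cassert>

namespace sketch {

// Forward declaration: an incomplete type that is visible on every arch.
// Without this line, the default template argument below fails name lookup.
struct mfma_like_adaptor;

// The full definition exists only where the (hypothetical) feature macro is set.
#if defined(SKETCH_HAS_MFMA)
struct mfma_like_adaptor {
    static constexpr int tile = 32;
};
#endif

// A complete adaptor that a caller can always pass explicitly.
struct explicit_adaptor {
    static constexpr int tile = 16;
};

// The default argument only has to *name* a declared type at parse time.
// Adaptor::tile is a dependent name, so it is checked at instantiation.
template <typename T, typename Adaptor = mfma_like_adaptor>
constexpr int make_tiled_sketch(T) {
    return Adaptor::tile;
}

// Passing a complete adaptor works even though the default one is incomplete.
static_assert(make_tiled_sketch<int, explicit_adaptor>(0) == 16,
              "explicit adaptor must be usable");

}  // namespace sketch
```

Calling sketch::make_tiled_sketch(0) with the default adaptor would still fail to compile unless SKETCH_HAS_MFMA is defined, which mirrors the statement above that instantiation requires the full definition.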
Unit test: op_tests/opus/device/test_opus_parse_gfx1201.cu exercises the
opus utilities sample_kernels.cu actually uses (opus::vector_t and
opus::cast<float>) — pure template machinery, no HIP intrinsics — under
a gfx1201-gated kernel body. Wired into the existing test_opus_device.py
harness; skips on non-gfx1201 archs.
Verified on RX 9070 XT (gfx1201): test compiles + runs, max_diff=0.0
against torch reference. End-to-end check separately confirmed: with this
fix in place, module_sample JIT-builds for gfx1201 and
aiter.mixed_sample_outer_exponential produces bit-identical output to
torch Gumbel-max (15.4us avg over 1000 calls).
…t zero loads/stores)
opus::make_gmem<>.load<N>() / .store<N>() silently returned 0 / dropped
writes on gfx1201, because buffer_default_config() landed in the
0xffffffff fallback for that arch and the resulting __amdgpu_buffer_rsrc_t
was invalid. The HIP buffer_load_b32 / buffer_store_b32 intrinsics
themselves work on RDNA4 — the bug was purely in which config word the
header picked.
Root cause: the existing branch
#elif defined(__gfx11__) || defined(__gfx12__) || defined(__gfx1250__)
return 0x31004000;
uses __gfx11__ / __gfx12__ (lowercase) which clang does NOT predefine
(only the uppercase __GFX11__ / __GFX12__ exist). gfx1250 was already
covered by its explicit per-arch token; everything else in the gfx11x /
gfx12x families silently fell into the 0xffffffff sentinel branch.
Minimal fix: add explicit __gfx1201__ / __gfx1200__ checks alongside the
existing __gfx1250__ check, so Navi 44 / Navi 48 also get the correct
RDNA buffer rsrc config (0x31004000). The lowercase __gfx11__/__gfx12__
typo is left alone here — fixing it would also flip behavior for
gfx1100-1103 / gfx1150-1153, which is out of scope for the gfx1201
enablement work this branch covers.
No change for gfx1250 (already used 0x31004000 via the explicit per-arch
check); no change for gfx9x (different branch entirely).
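A reduced sketch of the selection logic, assuming only what the message states: the per-arch macros __gfx1200__ / __gfx1201__ / __gfx1250__ select 0x31004000, and anything unmatched gets the 0xffffffff sentinel. The names buffer_config_sketch and is_usable are hypothetical, and the gfx9x branch is omitted because its value is not quoted here.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Values quoted in the commit message above.
constexpr std::uint32_t kRdnaBufferConfig = 0x31004000u;
constexpr std::uint32_t kFallbackConfig   = 0xffffffffu;  // sentinel, not a valid config

// Per-arch macros are defined by the compiler only for the matching
// --offload-arch device pass; on the host pass none of them are set.
constexpr std::uint32_t buffer_config_sketch() {
#if defined(__gfx1200__) || defined(__gfx1201__) || defined(__gfx1250__)
    return kRdnaBufferConfig;
#else
    return kFallbackConfig;
#endif
}

// Hypothetical guard: lets a caller detect the sentinel instead of
// silently building an invalid buffer resource from it.
constexpr bool is_usable(std::uint32_t config) {
    return config != kFallbackConfig;
}

}  // namespace sketch
```

A static_assert(sketch::is_usable(sketch::buffer_config_sketch()), "...") placed inside device-only code would turn the silent fallback into a build error; whether opus should do that is a design question this PR does not address.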
Unit test: test_opus_parse_gfx1201.cu now uses opus::make_gmem<>.load<4>
and .store<4> (the API users actually want working) instead of plain
pointer arithmetic. max_diff=0.0 against torch reference on
n=1310720 fp32 elements, gfx1201.
Integration check: with the buffer config fix in place, sample_kernels.cu
JIT-rebuilds cleanly and aiter.mixed_sample_outer_exponential still
produces bit-identical output to torch Gumbel-max — confirms the change
does not regress kernels that already worked (sample_kernels passes an
explicit size to make_gmem, so it tolerated the bad default config; now
both the default-config path and the explicit-size path work).
…variants
Extends opus::wmma<> to dispatch through the gfx12-specific
__builtin_amdgcn_wmma_*_w32_gfx12 family on gfx1201, in parallel with
the existing gfx1250 path. gfx12 builtins have a leaner argument
signature than gfx1250 (no matrix_fmts / neg_c slot, no opsel) so they
need their own dispatch macros.
Variants covered (all wave32 16x16x16; matches the breadth gfx12
exposes for fp / fp8 acc):
- f32 <- f16 / f16
- f32 <- bf16 / bf16
- f16 <- f16 / f16
- bf16 <- bf16 / bf16
- f32 <- fp8 / fp8
- f32 <- fp8 / bf8
- f32 <- bf8 / fp8
- f32 <- bf8 / bf8
iu8 / iu4 variants are deliberately deferred — opus has no iu8_t /
iu4_t dtype aliases yet, so wiring them needs a small separate change.
What is NOT touched on the wmma_adaptor / make_tiled_mma path: the
existing wmma_adaptor encoding (rows cross-thread along M) was
designed for gfx1250's WMMA fragment layout, which is row-distributed
for A and column-distributed for B / C. gfx12 has a different
asymmetry — A is row-distributed (lane[i] holds A[i%16, (i/16)*8 + j]
for j in [0,7]) while B and C are column-distributed (lane[i] holds
B[(i/16)*8 + j, i%16] / C[(i/16)*8 + j, i%16]). That asymmetry is
documented inline in test_wmma_gfx1201.cu but a dedicated
opus::wmma_adaptor_gfx12 specialization is needed before the
high-level tiled API can route gfx1201 — TODO comment added near
the wmma_adaptor block. The opus::wmma<> struct itself is fully
usable by callers who construct their own fragments (which is what
the unit test does).
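The lane-to-element formulas above can be checked on the host. The following sketch encodes them directly (lane i, slot j in [0,7], wave32, 16x16 tile) and verifies that each mapping touches every tile element exactly once. The names coord, a_coord, bc_coord and covers_tile_once are hypothetical helpers, not part of opus.

```cpp
#include <array>
#include <cassert>

namespace sketch {

constexpr int kTile    = 16;  // 16x16 fragment
constexpr int kWave    = 32;  // wave32
constexpr int kPerLane = 8;   // elements held by each lane

struct coord {
    int row;
    int col;
};

// A: lane i holds A[i % 16, (i / 16) * 8 + j]
constexpr coord a_coord(int lane, int j) {
    return coord{lane % kTile, (lane / kTile) * kPerLane + j};
}

// B and C: lane i holds [(i / 16) * 8 + j, i % 16]
constexpr coord bc_coord(int lane, int j) {
    return coord{(lane / kTile) * kPerLane + j, lane % kTile};
}

// Returns true if the mapping hits every element of the tile exactly once.
inline bool covers_tile_once(coord (*map)(int, int)) {
    std::array<int, kTile * kTile> hits{};  // zero-initialized counters
    for (int lane = 0; lane < kWave; ++lane) {
        for (int j = 0; j < kPerLane; ++j) {
            const coord c = map(lane, j);
            if (c.row < 0 || c.row >= kTile || c.col < 0 || c.col >= kTile) {
                return false;  // out-of-range index
            }
            ++hits[c.row * kTile + c.col];
        }
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}

}  // namespace sketch
```

For example, lane 17 with j = 3 maps to A[1, 11] and to B[11, 1]: the two layouts are transposes of each other, which is the asymmetry a future adaptor would need to encode.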
Test (op_tests/opus/device/test_wmma_gfx1201.cu): one kernel per
dtype combo loads its lane-local A / B fragment per the gfx12
layout, calls opus::wmma<>::operator(), and stores the C fragment
back. Verified on RX 9070 XT (gfx1201) against torch matmul ref:
f32 <- f16 / f16 : max_diff = 0.0000 (bit-exact)
f32 <- bf16 / bf16 : max_diff = 0.0000
f16 <- f16 / f16 : max_diff = 0.0312 (1 ULP fp16)
bf16 <- bf16 / bf16 : max_diff = 0.5000 (1 ULP bf16)
f32 <- fp8 / fp8 : max_diff = 0.0000
f32 <- fp8 / bf8 : max_diff = 0.0000
f32 <- bf8 / fp8 : max_diff = 0.0000
f32 <- bf8 / bf8 : max_diff = 0.0000
8/8 variants passed
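The two "1 ULP" annotations can be sanity-checked with a few lines of arithmetic. The sketch below computes the spacing between adjacent representable values for a binary format with a given number of explicit mantissa bits (10 for fp16, 7 for bf16). The sample magnitudes 40 and 100 are assumptions chosen for illustration; the commit does not state the magnitude of the outputs. ulp_at is a hypothetical helper and ignores zero, subnormals, and overflow.

```cpp
#include <cassert>
#include <cmath>

namespace sketch {

// Spacing between neighbouring representable values near |x| for a binary
// floating-point format with `mantissa_bits` explicit fraction bits.
// Assumes x is nonzero and in the format's normal range.
inline double ulp_at(double x, int mantissa_bits) {
    int exp = 0;
    std::frexp(std::fabs(x), &exp);  // |x| = m * 2^exp with m in [0.5, 1)
    // |x| lies in [2^(exp-1), 2^exp), so one ULP is 2^(exp - 1 - mantissa_bits).
    return std::ldexp(1.0, exp - 1 - mantissa_bits);
}

}  // namespace sketch
```

Under these assumptions, sketch::ulp_at(40.0, 10) gives 0.03125 for fp16 values in [32, 64), and sketch::ulp_at(100.0, 7) gives 0.5 for bf16 values in [64, 128), matching the reported 0.0312 and 0.5000 after rounding.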
Also renames the earlier test_opus_parse_gfx1201.cu to
test_opus_gmem_gfx1201.cu — the test exercises opus::make_gmem
load / store on gfx1201 (enabled by the prior buffer_default_config
fix in this PR), not anything specifically about parsing. The new
name reflects the actual scope.
CI black + ruff complained about the compact single-line wrapper and helper defs added in the previous commit. This commit just runs black on the file, which also resolves the ruff E701 multiple-statements-on-one-line warnings. No behavior change.
Same code paths, same dispatch — just tightens the multi-paragraph comment blocks I added in 70f6e4a / 9878ec9 / b8aa331 down to the 1-line-per-anchor density used everywhere else in opus.hpp. Diff vs origin/main shrinks ~166 → ~109 lines; 8/8 wmma variants and the gmem test still pass on RX 9070 XT.
Two existing test .cu files use opus _async_load, which calls
__builtin_amdgcn_raw_ptr_buffer_load_lds — that builtin needs the
vmem-to-lds-load-insts target feature (only present on gfx9x / gfx950
/ gfx1250). On gfx1201 the per-file hipcc compile errors out and that
fails the whole opus_device_test.so build, so before this fix
**0 of 72 opus device tests actually ran on gfx1201**.
setup.py: add _ARCH_SKIP_SOURCES that drops the two incompatible
files (test_async_load.cu, test_load_store_if.cu) from _CU_SOURCES
when arch is gfx1200 / gfx1201. The remaining 19 .cu files compile
fine for gfx1201 and the .so links cleanly.
test_opus_device.py: add a small _skip_if_missing_symbol() helper +
early-skip guard at the top of the 5 test functions whose extern "C"
launcher symbols come from those skipped files
(test_async_load, test_predicated_copy, test_predicated_copy_2d,
test_free_func_vector_add, test_predicated_async_load). They now
print SKIP cleanly instead of AttributeError-ing at runtime.
Result on gfx1201 (after this commit):
- 41 PASS (includes all 9 new gfx1201 tests: 8 wmma + 1 gmem)
- 53 SKIP (arch-gated mfma/wmma_1250/wmma_scale/mxfp/etc tests)
- 4 FAIL (pre-existing fp8/bf8/bf16 ABI mismatches in
dtype_convert_fp32_bf16, dtype_convert_fp32_bf16_vec4,
numeric_limits, finfo — unrelated to this PR;
tests assume fnuz fp8 semantics while gfx12 uses OCP)
Behavior on every other arch (gfx9x, gfx1250) is unchanged — those
archs are not in _ARCH_SKIP_SOURCES so all sources still compile and
all launchers exist, so the symbol-existence guard is a no-op.
… ISA)
gfx1200 (Navi 44) and gfx1201 (Navi 48) are siblings in the same RDNA4
family: they share the wmma-128b ISA and the same buffer rsrc format.
The buffer_default_config branch added in 9878ec9 already listed both;
this commit brings the wmma struct dispatch, the wmma class outer guard,
and the two unit-test guards in line.
Verification (no gfx1200 hardware available, so compile-only):
1. clang predefines __GFX12__ for both archs; only the per-arch macro
differs (__gfx1200__ vs __gfx1201__).
2. LLVM gates all 8 __builtin_amdgcn_wmma_*_w32_gfx12 builtins on
"wmma-128b-insts,wavefrontsize32", a feature both gfx1200 and gfx1201
enable per AMDGPUSubtarget (only gfx1250 adds the bigger
wmma-256b-insts shapes).
3. Direct probe: a .cu calling all 8 builtins compiles for both
--offload-arch=gfx1200 and --offload-arch=gfx1201 with no errors.
4. test_wmma_gfx1201.cu now builds for both archs, producing identical
49432-byte .so files with all 8 run_wmma_gfx1201_* launcher symbols.
Risks of broadening: low. If real gfx1200 hardware turns out to differ
semantically we would see it as wrong WMMA outputs (not a build break)
and can narrow back to per-arch in one line. The alternative, leaving
Navi 44 unsupported in opus while it shares the exact same gfx12 wmma
ISA, would be worse for downstream consumers.
Verified on gfx1201 (RX 9070 XT): all 9 gfx1201 tests still pass.
PASS: opus_gmem_gfx1201 max_diff=0.00e+00
PASS: wmma_gfx1201_f32_f16 max_diff=3.81e-06
PASS: wmma_gfx1201_f32_bf16 max_diff=1.91e-06
PASS: wmma_gfx1201_f16_f16 max_diff=3.13e-02 (1 ULP fp16)
PASS: wmma_gfx1201_bf16_bf16 max_diff=5.00e-01 (1 ULP bf16)
PASS: wmma_gfx1201_f32_fp8_fp8 max_diff=0.00e+00
PASS: wmma_gfx1201_f32_fp8_bf8 max_diff=0.00e+00
PASS: wmma_gfx1201_f32_bf8_fp8 max_diff=0.00e+00
PASS: wmma_gfx1201_f32_bf8_bf8 max_diff=0.00e+00
First-class gfx1201 (Navi 48 / RX 9070 XT, RDNA4) support in opus, in 5 reviewable commits:
- 70f6e4a: opus.hpp parses on gfx1201
- 9878ec9: buffer_default_config → make_gmem<>.load/.store works
- b8aa331: opus::wmma<> with 8 wave32 16x16x16 _w32_gfx12 variants + unit tests
- fdf9d87
- cde14bb
All gates are per-arch (no __GFX12__). gfx1250 / gfx9x paths are bytewise unchanged.
What unblocks for gfx1201
- aiter_opus_plus.h (sample_kernels, moe_fused_gate, topk_gating, gated_rmsnorm_quant, topk_softmax_kernels_group, mhc, quant, quant_mxfp4, fused_qk_rmsnorm_group_quant, rope_common.h).
- make_gmem<>.load/.store returns correct data instead of silently dropping stores / returning zeros (root cause: __gfx11__ / __gfx12__ lowercase are typos; clang only predefines uppercase).
- opus::wmma<>::operator() dispatches the gfx12 _w32_gfx12 builtin family.
WMMA coverage (all wave32 16x16x16, matches what gfx12 hw exposes for fp acc)
f32←f16, f32←bf16, f16←f16, bf16←bf16, f32←{fp8,bf8}×{fp8,bf8} (8 variants).
iu8 / iu4 deferred: opus has no iu8_t / iu4_t aliases yet.
Out of scope
The high-level make_tiled_mma / partition_layout_* path stays gfx1250-only. gfx12's fragment layout is asymmetric (A row-distributed, B/C column-distributed per AMD RDNA4 ISA §7.12.2 / CK wmma_gemm.hpp) and needs a dedicated wmma_adaptor_gfx12 (TODO inline). Callers can use opus::wmma<> directly today (the unit test does).
Tests (op_tests/opus/device/)
- test_opus_gmem_gfx1201.cu: exercises opus::make_gmem<>.load<> / .store<> (was test_opus_parse_gfx1201.cu, renamed to reflect scope).
- test_wmma_gfx1201.cu: 8 launchers, one per dtype combo. Each lane builds its A/B fragment per the gfx12 layout, calls opus::wmma<>::operator(), stores C back.
- setup.py (_CU_SOURCES) + test_opus_device.py (ctypes wrappers + test fns + main() dispatch + skip-on-non-gfx1201).
Verified on RX 9070 XT (gfx1201)
Integration: with this PR's opus.hpp in place, module_sample JIT-builds end-to-end for gfx1201 and aiter.mixed_sample_outer_exponential is bit-identical to torch Gumbel-max (15.4 µs/call warm avg).
Sources