CK Tile MXFP8 Group GEMM gfx1250 by aris134 · Pull Request #578 · ROCm/TransformerEngine

aris134 · 2026-05-06T20:37:04Z

Description

This PR integrates the CK Tile MXFP8 grouped GEMM backend with TDM into TE. It also replaces 3rdparty/aiter with 3rdparty/rocm-libraries to pick up the gfx1250 changes from CK.

Fixes # (16490)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ipanfilo · 2026-05-07T16:34:20Z

            test_dequantize_mxfp8.cu
            test_dequantize_nvfp4.cu
            test_cast_nvfp4_transpose.cu
+            test_ck_grouped_mxfp8.cu


This should be added for non-CUDA builds only.

Done in 3db2e5a

ipanfilo · 2026-05-07T17:14:55Z

  // Currently only support cutlass group gemm on Hopper Arch
-  if (!(is_hopper && use_cutlass)) {
+  // if (!(is_hopper && use_cutlass)) {
+  if (!use_cutlass) {


This is the CUDA path.

Reverted in 3db2e5a

sudhu2k · 2026-05-19T18:19:58Z

    delay_wgrad_compute,
 ):
    os.environ["NVTE_USE_CUTLASS_GROUPED_GEMM"] = "1"
+    os.environ["NVTE_ROCM_ENABLE_MXFP8"] = "1"


I think this should only be set when the recipe we are testing is mxfp8.

Good point. Looking at the parametrization, MXFP8BlockScaling is only added to fp8_recipes when NVTE_ROCM_ENABLE_MXFP8=1 is already set before test collection, so setting it inside this test is redundant and also broader than intended. Removed in 746afea.

sudhu2k · 2026-05-19T18:22:44Z

+
+// Treat TE tensors as generalized 2D matrices by flattening:
+// (D1, D2, ..., Dn) -> (D1*...*D(n-1), Dn), consistent with TE Tensor::flat_*_dim.
+static inline bool get_flat_2d_dims(const transformer_engine::Tensor& t,


Re-use get_flat_2d_dims from ck_grouped_gemm_common.h

I think some of this code is already present in ck_grouped_gemm_common.h inside the ck_grouped_gemm folder. What was the reasoning behind having a separate directory for ck_mx_grouped_gemm?

No, there really was not a good reason for this. I agree that it makes more sense to keep everything under the same directory and re-use the common functions already defined in the shared header. I have made these changes in 175855d.
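For readers unfamiliar with the flattening rule quoted in the diff comment, here is a minimal sketch of (D1, D2, ..., Dn) -> (D1*...*D(n-1), Dn). It is not the helper from ck_grouped_gemm_common.h: the name flat_2d_dims_sketch, the std::vector shape argument, the out-parameters, and the choice to reject an empty shape are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the helper discussed above: collapse an
// n-dimensional shape (D1, ..., Dn) into rows = D1*...*D(n-1), cols = Dn.
// Assumption: an empty shape is rejected by returning false.
inline bool flat_2d_dims_sketch(const std::vector<std::size_t>& shape,
                                std::size_t& rows, std::size_t& cols) {
  if (shape.empty()) return false;
  cols = shape.back();
  rows = 1;
  // Multiply every dimension except the last one.
  for (std::size_t i = 0; i + 1 < shape.size(); ++i) rows *= shape[i];
  return true;
}

// Small usage check: a (2, 3, 4) tensor is viewed as a 6 x 4 matrix.
inline void check_flat_2d_dims_sketch() {
  const std::vector<std::size_t> shape = {2, 3, 4};
  std::size_t rows = 0, cols = 0;
  const bool ok = flat_2d_dims_sketch(shape, rows, cols);
  assert(ok && rows == 6 && cols == 4);
  (void)ok;  // silence unused-variable warnings when NDEBUG is defined
}
```

Under this sketch a 1D shape (N) is viewed as 1 x N, because the product over zero leading dimensions is 1.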

wangye805 · 2026-05-21T05:17:51Z

+#ifndef CK_TILE_USE_OCP_FP8
+#define CK_TILE_USE_OCP_FP8 1
+#endif


Just curious, where is this macro used?

wangye805 · 2026-05-21T05:19:58Z

+static float to_float(const bf16_t& x) { return static_cast<float>(x); }
+static float to_float(const ck_tile::bfloat16_t& x) { return static_cast<float>(x); }


Is ck_tile::bfloat16_t the same as our bf16_t?
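One way to answer this kind of question at compile time is sketched below. The types te_bf16_sketch and ck_sketch::bfloat16_t are hypothetical stand-ins, not the real TE or CK Tile definitions, and the bit-level conversion is only there to make the example self-contained. The point it illustrates: two overloads with the same parameter type would be a redefinition, so if both to_float overloads compile in one translation unit, the two types must be distinct.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-ins; NOT the real bf16_t or ck_tile::bfloat16_t.
struct te_bf16_sketch { std::uint16_t bits; };
namespace ck_sketch { struct bfloat16_t { std::uint16_t bits; }; }

// bfloat16 holds the upper 16 bits of an IEEE-754 binary32 value.
inline float bf16_bits_to_float(std::uint16_t bits) {
  const std::uint32_t wide = static_cast<std::uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &wide, sizeof(out));
  return out;
}

// These two overloads can coexist only because the parameter types differ.
inline float to_float(const te_bf16_sketch& x) { return bf16_bits_to_float(x.bits); }
inline float to_float(const ck_sketch::bfloat16_t& x) { return bf16_bits_to_float(x.bits); }

// Compile-time answer to "are these the same type?"
constexpr bool kSameBf16Type =
    std::is_same<te_bf16_sketch, ck_sketch::bfloat16_t>::value;
static_assert(!kSameBf16Type, "an alias would make the second overload a redefinition");

// Usage check: 0x3F80 and 0x4000 are the bf16 encodings of 1.0 and 2.0.
inline void check_bf16_overloads_sketch() {
  assert(to_float(te_bf16_sketch{0x3F80}) == 1.0f);
  assert(to_float(ck_sketch::bfloat16_t{0x4000}) == 2.0f);
}
```

The same std::is_same check could be placed next to the real overloads in the test, assuming both headers are visible there.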

wangye805 · 2026-05-21T05:20:41Z

+  setenv("NVTE_ROCM_ENABLE_MXFP8", "1", 0);
+}
+
+static float to_float(float x) { return x; }


Why do we need a float-to-float overload?
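One plausible reason, offered as an assumption and not as the author's stated intent: an identity overload lets templated comparison code call to_float(x) uniformly, whatever the output element type is. The sketch below uses a hypothetical wrapper type (wrapped_value_sketch) and a hypothetical helper (all_close_sketch); neither comes from the PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical wrapper standing in for a reduced-precision output type.
struct wrapped_value_sketch { float stored; };

// Identity overload: lets generic code treat float like any other type.
inline float to_float(float x) { return x; }
inline float to_float(const wrapped_value_sketch& x) { return x.stored; }

// Generic comparison that works for any T with a to_float overload.
// Without to_float(float), the T = float instantiation would not compile here.
template <typename T>
bool all_close_sketch(const std::vector<T>& got, const std::vector<float>& ref,
                      float atol) {
  if (got.size() != ref.size()) return false;
  for (std::size_t i = 0; i < got.size(); ++i) {
    if (std::fabs(to_float(got[i]) - ref[i]) > atol) return false;
  }
  return true;
}

// Usage check: the same template handles float and wrapped outputs.
inline void check_all_close_sketch() {
  const std::vector<float> ref = {1.0f, 2.0f};
  const std::vector<float> got_f32 = {1.0f, 2.0f};
  const std::vector<wrapped_value_sketch> got_wrapped = {{1.0f}, {2.0f}};
  assert(all_close_sketch(got_f32, ref, 0.0f));
  assert(all_close_sketch(got_wrapped, ref, 0.0f));
}
```

If the test only ever compares one concrete output type, the identity overload would indeed be unnecessary.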

wangye805 · 2026-05-21T05:21:23Z

+static float to_float(const bf16_t& x) { return static_cast<float>(x); }
+static float to_float(const ck_tile::bfloat16_t& x) { return static_cast<float>(x); }
+
+__device__ __host__ __forceinline__ float ref_gelu_unused(float x) {


Is this meant to be "unused" or "unfused"?

wangye805 · 2026-05-21T05:23:11Z

+      size_t a_idx = 0;
+      size_t b_idx = 0;
+
+      if (use_mxfp8) {


Based on the test name, I presume you wanted to test MXFP8, but here it looks like you also wanted to cover the non-MXFP8 case?

wangye805 · 2026-05-21T05:25:13Z

+
+  cudaDeviceProp prop;
+  NVTE_CHECK_CUDA(cudaGetDeviceProperties(&prop, 0));
+#ifdef __HIP_PLATFORM_AMD__


This guard is probably not needed, since NV upstream does not have this file.

aris134 added 11 commits May 6, 2026 14:18

initial commit for CK Tile MXFP8 integration for gfx1250

1f707d7

ck mxfp8 gfx1250 integration builds successfully

e102f00

add entrypoint to ck mx group gemm in caller

52a2887

temporary hacky change to test_numerics for bringup testing

8022777

add warning print to confirm we are in fallback

bc6253d

MXFP8 grouped fwd/bwd now reaches CK path and runs without fallback/crash; remaining issue is numerical validation vs BF16 sequential reference.

d26f52e

add cpp test for ck tile group mxfp8 gemm forward

e295e74

Fix MXFP8 grouped GEMM scale handling for NN/TN/NT

1784045

update ck mxfp8 group gemm gtest to exercise mixed dtypes

fe99bf3

include renamed test file

e7159c4

clean up code

972cea3

aris134 requested review from ipanfilo, wangye805 and wenchenvincent as code owners May 6, 2026 20:37

aris134 self-assigned this May 6, 2026

Update cublaslt_gemm.cu

c0fabff

aris134 requested a review from matthiasdiener May 6, 2026 20:49

matthiasdiener added the "ci-level 1" (CI test level 1) label May 7, 2026

ipanfilo requested changes May 7, 2026

address pr comments

3db2e5a

aris134 requested a review from ipanfilo May 11, 2026 13:11

aris134 added 3 commits May 17, 2026 17:27

fix ck group mxfp8 dispatch

910d30f

update CMakeLists.txt

1b66d29

Add direct ROCm libraries dependency for CK grouped GEMM

23b505f

sudhu2k requested changes May 19, 2026

aris134 added 2 commits May 19, 2026 22:03

Remove redundant MXFP8 env override from grouped linear test

746afea

factor out common definitions from mxfp8 ck ggemm

175855d

aris134 requested a review from sudhu2k May 20, 2026 00:49

wangye805 reviewed May 21, 2026

wangye805 approved these changes May 21, 2026

5 participants
