Skip to content

[gfx1250] Fix cluster launch detection and silent fallback#532

Draft
aoli26 wants to merge 3 commits into
mainfrom
gfx1250/fix_cluster
Draft

[gfx1250] Fix cluster launch detection and silent fallback#532
aoli26 wants to merge 3 commits into
mainfrom
gfx1250/fix_cluster

Conversation

@aoli26
Copy link
Copy Markdown
Contributor

@aoli26 aoli26 commented May 15, 2026

Motivation

Cluster-launch on gfx1250 was effectively a no-op: mgpuLaunchClusterKernel gated cluster attributes on #ifdef hipLaunchAttributeClusterDimension, but that name is an enum, not a macro — the cluster path was always dead. Worse, when hipDrvLaunchKernelEx failed we silently fell back to hipModuleLaunchKernel even for cluster=(>1,>1,>1), so kernels ran without cluster semantics and tests still "passed". Separately, compile_mxscale_gemm quietly forced waves_per_eu=2 under cluster, overriding the caller / autotuner.

Technical Details

  • lib/Runtime/ROCm/CMakeLists.txt: detect cluster-attr support via a CMake check_cxx_source_compiles probe and expose it as FLY_HIP_HAS_CLUSTER_ATTR.
  • lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp: gate the cluster path on FLY_HIP_HAS_CLUSTER_ATTR, cache hipDeviceProp_t::clusterLaunch per device, and abort with hipErrorNotSupported when a real cluster is requested but unsupported (no silent fallback). cluster=(1,1,1) still falls back gracefully.
  • kernels/gemm_fp8fp4_gfx1250.py: drop the implicit waves_per_eu=2 override under cluster.
  • Drop the FFM-simulator COMGR preload shim (python/flydsl/_compat.py + the test-side import flydsl workaround); current FFM builds no longer collide with the system libamd_comgr LLVM CommandLine options.

Test Plan

Run test_gemm_fp8fp4_gfx1250 cluster paths (test_mxfp4_gemm_mcast, mxscale cluster) on gfx1250 with cluster_m/n > 1.

Test Result

All cluster unit tests pass; cluster code path is actually exercised; the original non-cluster code still works well.

Submission Checklist

Copilot AI review requested due to automatic review settings May 15, 2026 13:00
@aoli26 aoli26 added the bug Something isn't working label May 15, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes ROCm cluster-launch behavior on gfx1250 by correctly detecting HIP cluster-attribute support at build time, ensuring the cluster launch path is actually taken when supported, and preventing silent non-cluster fallbacks when a real cluster launch was requested. It also removes an outdated FFM-simulator COMGR preload shim and stops implicitly overriding waves_per_eu for clustered GEMM compilation.

Changes:

  • Add a CMake compile-probe to detect hipLaunchAttributeClusterDimension support and expose it as FLY_HIP_HAS_CLUSTER_ATTR.
  • Update mgpuLaunchClusterKernel to gate cluster launches on FLY_HIP_HAS_CLUSTER_ATTR, check device capability, and error out (no silent fallback) when a real cluster is requested but unsupported.
  • Remove the COMGR preload shim and drop the implicit waves_per_eu=2 override under cluster in compile_mxscale_gemm.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/kernels/test_gemm_fp8fp4_gfx1250.py Removes simulator-specific flydsl import preload workaround.
python/flydsl/_compat.py Deletes COMGR preload compatibility shim.
python/flydsl/__init__.py Stops importing/running the removed compatibility shim at import time.
lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp Implements cluster-attr gating + capability check and removes silent fallback for real cluster requests.
lib/Runtime/ROCm/CMakeLists.txt Adds a compile-time probe and defines FLY_HIP_HAS_CLUSTER_ATTR for the runtime wrapper build.
kernels/gemm_fp8fp4_gfx1250.py Removes implicit waves_per_eu=2 override under cluster.
Comments suppressed due to low confidence (1)

lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp:157

  • Same as above: HIP_REPORT_IF_ERROR(hipErrorNotSupported) will produce a confusing log line because it’s not a HIP call. Prefer an explicit error-report path (or no additional log since the detailed message is already printed) to keep stderr output actionable.
            "[mgpuLaunchClusterKernel] cluster=(%ld,%ld,%ld) requested but "
            "FlyDSL was built against a HIP without "
            "hipLaunchAttributeClusterDimension; aborting "
            "(no silent fallback).\n",
            static_cast<long>(clusterX), static_cast<long>(clusterY),
            static_cast<long>(clusterZ));
    HIP_REPORT_IF_ERROR(hipErrorNotSupported);
    return;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp Outdated
Comment thread lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp Outdated
@aoli26 aoli26 force-pushed the gfx1250/fix_cluster branch from 19beee1 to 03a6325 Compare May 15, 2026 13:59
set(CMAKE_REQUIRED_INCLUDES "${hip_INCLUDE_DIRS};${HIP_INCLUDE_DIRS}")
set(CMAKE_REQUIRED_DEFINITIONS "-D__HIP_PLATFORM_AMD__")
set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
check_cxx_source_compiles("
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This inline C++ probe feels a bit brittle in CMakeLists, and it only checks one symbol while the runtime uses the full cluster-launch surface (hipLaunchAttribute, HIP_LAUNCH_CONFIG, hipDrvLaunchKernelEx, hipDeviceProp_t::clusterLaunch). Could we either gate on a known HIP/ROCm version or move this into a small standalone probe that validates the full API set?

"device reports clusterLaunch=0; aborting (no silent fallback).\n",
static_cast<long>(clusterX), static_cast<long>(clusterY),
static_cast<long>(clusterZ));
return;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still returns normally after deciding a real cluster launch cannot be honored, so callers/tests can continue as if the kernel launched. Since the PR goal is no silent fallback, can we surface this as a real runtime failure (for example hipErrorNotSupported through a shared error path) instead of only fprintf + return?

hipDeviceProp_t prop{};
if (hipGetDeviceProperties(&prop, d) != hipSuccess)
return 0;
return prop.clusterLaunch ? 1 : 0;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper is compiled unconditionally but uses hipDeviceProp_t::clusterLaunch, while the CMake probe only checks hipLaunchAttributeClusterDimension. On HIP headers where the attr check is false, this field may also be missing and break the build before the #if FLY_HIP_HAS_CLUSTER_ATTR path matters. Could we guard this helper or include clusterLaunch in the same feature probe?

if (deviceClusterCap) {
hipLaunchAttribute attrs[1];
attrs[0].id = hipLaunchAttributeClusterDimension;
attrs[0].value.clusterDim.x = static_cast<unsigned>(clusterX);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a C ABI boundary, it would be safer to validate cluster dims before casting from intptr_t to unsigned. A zero/negative or overflowing value would become a confusing launch config or HIP error, and it also affects the requestedRealCluster check above.

@coderfeli
Copy link
Copy Markdown
Collaborator

@aoli26 what are these change for ? Is it a must to run? Looks a little bit hacky.

@aoli26
Copy link
Copy Markdown
Contributor Author

aoli26 commented May 15, 2026

what are these change for ? Is it a must to run? Looks a little bit hacky.

The original #ifdef hipLaunchAttributeClusterDimension was always false (it's an enum, not a macro), so the cluster path was effectively dead and silently fell back. The CMake probe was the quickest way to make the gate real, but I agree it's a bit hacky — a HIP_VERSION-based #if in the C++ (covering the full cluster surface) is probably cleaner. Let me investigate and follow up.

@aoli26 aoli26 marked this pull request as draft May 15, 2026 15:12
@coderfeli
Copy link
Copy Markdown
Collaborator

what are these change for ? Is it a must to run? Looks a little bit hacky.

The original #ifdef hipLaunchAttributeClusterDimension was always false (it's an enum, not a macro), so the cluster path was effectively dead and silently fell back. The CMake probe was the quickest way to make the gate real, but I agree it's a bit hacky — a HIP_VERSION-based #if in the C++ (covering the full cluster surface) is probably cleaner. Let me investigate and follow up.

OK. So it's for perf tuning. Current functional enablement is not blocked by this, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants