[gfx1250] Fix cluster launch detection and silent fallback #532
Pull request overview
This PR fixes ROCm cluster-launch behavior on gfx1250 by correctly detecting HIP cluster-attribute support at build time, ensuring the cluster launch path is actually taken when supported, and preventing silent non-cluster fallbacks when a real cluster launch was requested. It also removes an outdated FFM-simulator COMGR preload shim and stops implicitly overriding waves_per_eu for clustered GEMM compilation.
Changes:
- Add a CMake compile-probe to detect hipLaunchAttributeClusterDimension support and expose it as FLY_HIP_HAS_CLUSTER_ATTR.
- Update mgpuLaunchClusterKernel to gate cluster launches on FLY_HIP_HAS_CLUSTER_ATTR, check device capability, and error out (no silent fallback) when a real cluster is requested but unsupported.
- Remove the COMGR preload shim and drop the implicit waves_per_eu=2 override under cluster in compile_mxscale_gemm.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/kernels/test_gemm_fp8fp4_gfx1250.py | Removes simulator-specific flydsl import preload workaround. |
| python/flydsl/_compat.py | Deletes COMGR preload compatibility shim. |
| python/flydsl/__init__.py | Stops importing/running the removed compatibility shim at import time. |
| lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp | Implements cluster-attr gating + capability check and removes silent fallback for real cluster requests. |
| lib/Runtime/ROCm/CMakeLists.txt | Adds a compile-time probe and defines FLY_HIP_HAS_CLUSTER_ATTR for the runtime wrapper build. |
| kernels/gemm_fp8fp4_gfx1250.py | Removes implicit waves_per_eu=2 override under cluster. |
Comments suppressed due to low confidence (1)
lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp:157
- Same as above: HIP_REPORT_IF_ERROR(hipErrorNotSupported) will produce a confusing log line because it’s not a HIP call. Prefer an explicit error-report path (or no additional log since the detailed message is already printed) to keep stderr output actionable.
"[mgpuLaunchClusterKernel] cluster=(%ld,%ld,%ld) requested but "
"FlyDSL was built against a HIP without "
"hipLaunchAttributeClusterDimension; aborting "
"(no silent fallback).\n",
static_cast<long>(clusterX), static_cast<long>(clusterY),
static_cast<long>(clusterZ));
HIP_REPORT_IF_ERROR(hipErrorNotSupported);
return;
19beee1 to 03a6325
set(CMAKE_REQUIRED_INCLUDES "${hip_INCLUDE_DIRS};${HIP_INCLUDE_DIRS}")
set(CMAKE_REQUIRED_DEFINITIONS "-D__HIP_PLATFORM_AMD__")
set(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
check_cxx_source_compiles("
This inline C++ probe feels a bit brittle in CMakeLists, and it only checks one symbol while the runtime uses the full cluster-launch surface (hipLaunchAttribute, HIP_LAUNCH_CONFIG, hipDrvLaunchKernelEx, hipDeviceProp_t::clusterLaunch). Could we either gate on a known HIP/ROCm version or move this into a small standalone probe that validates the full API set?
"device reports clusterLaunch=0; aborting (no silent fallback).\n",
static_cast<long>(clusterX), static_cast<long>(clusterY),
static_cast<long>(clusterZ));
return;
This still returns normally after deciding a real cluster launch cannot be honored, so callers/tests can continue as if the kernel launched. Since the PR goal is no silent fallback, can we surface this as a real runtime failure (for example hipErrorNotSupported through a shared error path) instead of only fprintf + return?
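One possible shape for the shared error path suggested here, as a minimal sketch only. All names (DemoStatus, reportLaunchFailure, demoLaunchCluster, demoLastLaunchStatus) are hypothetical stand-ins, not the FlyDSL or HIP API, and the real launch calls are replaced by a comment. The point is that an unsupported real-cluster request both logs once and leaves a status the caller can check.

```cpp
#include <cstdio>

// Hypothetical status codes standing in for hipSuccess / hipErrorNotSupported.
enum DemoStatus { kDemoSuccess = 0, kDemoNotSupported = 1 };

// Per-thread record of the most recent launch outcome (assumed design).
static thread_local DemoStatus gLastLaunchStatus = kDemoSuccess;

// Shared error path: one log line, one recorded status.
inline void reportLaunchFailure(DemoStatus status, const char *reason) {
  std::fprintf(stderr, "[demoLaunchCluster] %s\n", reason);
  gLastLaunchStatus = status;
}

// Returns a status instead of returning normally after a refused launch.
inline int demoLaunchCluster(bool deviceSupportsCluster,
                             bool requestedRealCluster) {
  gLastLaunchStatus = kDemoSuccess;
  if (requestedRealCluster && !deviceSupportsCluster) {
    reportLaunchFailure(kDemoNotSupported,
                        "real cluster requested but unsupported; aborting "
                        "(no silent fallback)");
    return gLastLaunchStatus;
  }
  // The actual cluster or plain kernel launch would happen here.
  return kDemoSuccess;
}

// Lets callers and tests observe the failure after the fact.
inline int demoLastLaunchStatus() { return gLastLaunchStatus; }
```

With something like this, a test can assert on the returned status rather than inferring success from the absence of a crash.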
hipDeviceProp_t prop{};
if (hipGetDeviceProperties(&prop, d) != hipSuccess)
return 0;
return prop.clusterLaunch ? 1 : 0;
This helper is compiled unconditionally but uses hipDeviceProp_t::clusterLaunch, while the CMake probe only checks hipLaunchAttributeClusterDimension. On HIP headers where the attr check is false, this field may also be missing and break the build before the #if FLY_HIP_HAS_CLUSTER_ATTR path matters. Could we guard this helper or include clusterLaunch in the same feature probe?
if (deviceClusterCap) {
hipLaunchAttribute attrs[1];
attrs[0].id = hipLaunchAttributeClusterDimension;
attrs[0].value.clusterDim.x = static_cast<unsigned>(clusterX);
Since this is a C ABI boundary, it would be safer to validate cluster dims before casting from intptr_t to unsigned. A zero/negative or overflowing value would become a confusing launch config or HIP error, and it also affects the requestedRealCluster check above.
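A minimal sketch of the validation described above, assuming the dims arrive as intptr_t and that a "real" cluster means any dimension greater than 1. ClusterDims, validateClusterDims, and isRealCluster are hypothetical names introduced for illustration; they are not taken from the PR.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical holder for dims that have already been range-checked.
struct ClusterDims {
  unsigned x;
  unsigned y;
  unsigned z;
};

// A dim is acceptable only if it is at least 1 and fits in unsigned.
inline bool fitsClusterDim(std::intptr_t v) {
  return v >= 1 &&
         static_cast<std::uintmax_t>(v) <=
             static_cast<std::uintmax_t>(std::numeric_limits<unsigned>::max());
}

// Validates all three dims before any narrowing cast takes place.
// On failure, `out` is left untouched and false is returned.
inline bool validateClusterDims(std::intptr_t x, std::intptr_t y,
                                std::intptr_t z, ClusterDims &out) {
  if (!fitsClusterDim(x) || !fitsClusterDim(y) || !fitsClusterDim(z))
    return false;
  out = ClusterDims{static_cast<unsigned>(x), static_cast<unsigned>(y),
                    static_cast<unsigned>(z)};
  return true;
}

// Assumption: (1,1,1) is the trivial cluster; anything larger is "real".
inline bool isRealCluster(const ClusterDims &d) {
  return d.x > 1 || d.y > 1 || d.z > 1;
}
```

Doing the check first means the real-cluster decision is made on validated values, so a negative input cannot wrap into a large unsigned dimension.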
@aoli26 what are these changes for? Are they required to run? They look a little hacky.
The original
OK, so it's for perf tuning. The current functional enablement is not blocked by this, right?
Motivation
Cluster launch on gfx1250 was effectively a no-op: mgpuLaunchClusterKernel gated cluster attributes on #ifdef hipLaunchAttributeClusterDimension, but that name is an enumerator, not a macro, so the cluster path was always dead. Worse, when hipDrvLaunchKernelEx failed we silently fell back to hipModuleLaunchKernel even for cluster=(>1,>1,>1), so kernels ran without cluster semantics and tests still "passed". Separately, compile_mxscale_gemm quietly forced waves_per_eu=2 under cluster, overriding the caller / autotuner.
Technical Details
- lib/Runtime/ROCm/CMakeLists.txt: detect cluster-attr support via a CMake check_cxx_source_compiles probe and expose it as FLY_HIP_HAS_CLUSTER_ATTR.
- lib/Runtime/ROCm/FlyRocmRuntimeWrappers.cpp: gate the cluster path on FLY_HIP_HAS_CLUSTER_ATTR, cache hipDeviceProp_t::clusterLaunch per device, and abort with hipErrorNotSupported when a real cluster is requested but unsupported (no silent fallback). cluster=(1,1,1) still falls back gracefully.
- kernels/gemm_fp8fp4_gfx1250.py: drop the implicit waves_per_eu=2 override under cluster.
- Remove the FFM-simulator COMGR preload shim (python/flydsl/_compat.py + the test-side import flydsl workaround); current FFM builds no longer collide with the system libamd_comgr LLVM CommandLine options.
Test Plan
Run the test_gemm_fp8fp4_gfx1250 cluster paths (test_mxfp4_gemm_mcast, mxscale cluster) on gfx1250 with cluster_m/n > 1.
Test Result
All cluster unit tests pass; the cluster code path is actually exercised; the original non-cluster code still works well.
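The enum-versus-macro pitfall from the Motivation section can be reproduced without HIP. The sketch below uses a hypothetical enumerator (demoLaunchAttributeClusterDimension, with an arbitrary value) and a hypothetical build-defined macro (DEMO_HAS_CLUSTER_ATTR) in place of the real names; it only illustrates why #ifdef on an enumerator never fires and why a build-time probe macro does.

```cpp
// Hypothetical stand-in for a vendor header: the attribute ID is an
// enumerator, which the preprocessor knows nothing about.
enum DemoLaunchAttributeID { demoLaunchAttributeClusterDimension = 4 };

// #ifdef only sees macros, so this branch is never taken for an enumerator.
constexpr bool visibleToIfdef() {
#ifdef demoLaunchAttributeClusterDimension
  return true;
#else
  return false;
#endif
}

// Stand-in for a definition the build system would pass after a successful
// compile probe, e.g. -DDEMO_HAS_CLUSTER_ATTR=1 (assumed name).
#ifndef DEMO_HAS_CLUSTER_ATTR
#define DEMO_HAS_CLUSTER_ATTR 1
#endif

// Gating on a real macro works as intended.
constexpr bool visibleViaProbeMacro() {
#if DEMO_HAS_CLUSTER_ATTR
  return true;
#else
  return false;
#endif
}

// The enumerator exists, yet the #ifdef guard above still evaluates to false.
static_assert(!visibleToIfdef(), "enumerators are invisible to #ifdef");
static_assert(visibleViaProbeMacro(), "probe macro gates the path");
```

The static_assert lines make the failure mode a compile-time fact rather than something discovered when a test quietly takes the wrong path.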
Submission Checklist