[gfx1250] Optimize FP8/FP4 GEMM kernel and tests #533
Merged
Bring kernels/gemm_fp8fp4_gfx1250.py and its test up to the latest optimized version:
- LRU-cache compile_mxscale_gemm via flyc.compile fast path
- B-streaming compute path option (b_streaming=True)
- FP8 scale async load split path (scale_load_path='buffer_lds_stage'|'buffer_lds_stage_ab_split')
- Quadrant callback / scheduling reorganization
- FP8 active TDM load handling refactor
- Test parametrization extended to cover the new paths
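The caching item in the list above can be pictured with a minimal sketch. It assumes a stand-in function, `compile_gemm`, whose name, parameters, and return value are hypothetical; the real `compile_mxscale_gemm` signature is not shown in this description. The only point illustrated is that `functools.lru_cache` builds each distinct configuration once and hands back the same object afterwards.

```python
from functools import lru_cache

# Counts how often the (pretend) expensive compilation actually runs.
COMPILE_CALLS = 0


@lru_cache(maxsize=None)
def compile_gemm(m: int, n: int, k: int, b_streaming: bool = False,
                 scale_load_path: str = "buffer_lds_stage"):
    """Hypothetical stand-in for a kernel compile step (not the real API).

    All arguments are hashable, which lru_cache requires to build its key.
    """
    global COMPILE_CALLS
    COMPILE_CALLS += 1
    config = (m, n, k, b_streaming, scale_load_path)

    def launch():
        # Placeholder for a compiled kernel: just reports its configuration.
        return config

    return launch


# Two identical requests: the second one is served from the cache.
first = compile_gemm(1024, 1024, 2048, b_streaming=True)
second = compile_gemm(1024, 1024, 2048, b_streaming=True)
```

Because the cache key is derived from the call arguments, callers should pass configurations in a consistent form (for example, always by keyword) so that equivalent requests hit the same entry.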
Pull request overview
Optimizes the gfx1250 FP8/FP4 GEMM kernel by adding a new FP8 quadrant compute schedule, an opt-in B-streaming schedule, an alternative scale-load path through buffer_load+LDS (with an A/B-split variant), and a lru_cache/flyc.compile fast-launch path. Also removes the FFM COMGR preload shim that fails on current FFM builds and adds new test coverage (b-streaming, scale-load-path matrix, hipGraph capture/replay).
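As a rough mental model of the opt-in B-streaming schedule, the sketch below lists an issue order in which the B-fragment load for the next quadrant is issued before the current quadrant's WMMA, so the two can overlap. This is a conceptual illustration only: `b_streaming_order`, the event names, and the one-quadrant lookahead are assumptions and do not reproduce the kernel's scheduler.

```python
def b_streaming_order(num_quadrants: int):
    """Return a hypothetical issue order of (event, quadrant) pairs.

    Assumption: one quadrant of lookahead, i.e. the load for quadrant q + 1
    is issued just before the WMMA work of quadrant q.
    """
    if num_quadrants <= 0:
        return []
    events = [("load_b", 0)]  # prologue: first B fragment must be loaded up front
    for q in range(num_quadrants):
        if q + 1 < num_quadrants:
            events.append(("load_b", q + 1))  # in flight while quadrant q computes
        events.append(("wmma", q))
    return events


order = b_streaming_order(4)
```

Only two B fragments are live at any point in this ordering, which is one way to picture why streaming can reduce register or LDS pressure compared with loading every fragment before the loop.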
Changes:
- New compute schedules (`fp8_quadrant`, `b_streaming`) and matching `hot_loop_scheduler_*`, plus `buffer_lds_stage[_ab_split]` scale-load paths with new TDM-descriptor halves and refactored `_select_active_tdm` selection.
- Test additions: b-streaming correctness, scale-load-path matrix, AB-split, hipGraph cudagraph test, and a hipGraph-based bench helper; benchmark CLI gains `--scale-load-path`, `--b-streaming`, `--use-graph`; default bench shape shrunk to 1024³ × 2048.
- `lru_cache` on `compile_mxscale_gemm` + pre-bind via `flyc.compile` for the ctypes fast launch path; removal of `python/flydsl/_compat.py` and its `_maybe_preload_system_comgr` shim invocation.
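The three benchmark flags named above could be wired up roughly as follows. This is a sketch under assumptions: the flag names and the two `buffer_lds_stage*` values come from this PR, while the `"tdm"` default, the help strings, and the `build_parser` helper are hypothetical and may not match the test file.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser for the bench CLI flags (not the actual test code)."""
    parser = argparse.ArgumentParser(description="FP8/FP4 GEMM bench (sketch)")
    parser.add_argument(
        "--scale-load-path",
        choices=["tdm", "buffer_lds_stage", "buffer_lds_stage_ab_split"],
        default="tdm",  # assumed name for the default TDM-based path
        help="how scale tensors are loaded",
    )
    parser.add_argument("--b-streaming", action="store_true",
                        help="enable the B-streaming compute schedule")
    parser.add_argument("--use-graph", action="store_true",
                        help="time the kernel through hipGraph capture/replay")
    return parser


# Example invocation with an explicit argument list instead of sys.argv.
args = build_parser().parse_args(
    ["--scale-load-path", "buffer_lds_stage_ab_split", "--b-streaming"]
)
```

Restricting `--scale-load-path` with `choices` makes a mistyped path fail at argument parsing rather than at kernel compile time.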
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| kernels/gemm_fp8fp4_gfx1250.py | Adds new compute schedules, B-streaming, AB-split scale path, refactors cluster helpers to gpu.*, caches compile result. |
| tests/kernels/test_gemm_fp8fp4_gfx1250.py | Adds b-streaming, scale-path, AB-split, cudagraph tests; switches launch to flyc.compile fast path; extends bench CLI. |
| `python/flydsl/__init__.py` | Drops the COMGR preload call to allow FFM import to succeed. |
| python/flydsl/_compat.py | File deleted along with the _maybe_preload_system_comgr shim. |
Comments suppressed due to low confidence (1)
kernels/gemm_fp8fp4_gfx1250.py:2015
- Same issue as the `cluster_position`/`mcast_masks` calls above: `gpu.cluster_barrier()` does not exist on `flydsl.expr.gpu`; `cluster_barrier` is defined in `flydsl.expr.rocdl.cluster`. This branch fires when `loop_iters == 0` and `use_cluster=True`, so it will raise at compile/codegen time for small-K cluster configurations.
gpu.cluster_barrier()
coderfeli
approved these changes
May 16, 2026
Motivation
Optimize gfx1250 fp8 gemm:
- `buffer_load` + LDS.
- `lru_cache` + `flyc.compile` ctypes fast path (~17us/call).

Also drop the FFM COMGR preload shim: on current FFM it pulls in a second LLVM and crashes import with `Option 'spirv-expand-step' registered more than once`.
Technical Details
- `b_streaming=True`: issue B fragment loads inside the WMMA loop per quadrant, overlapping with prior-quadrant WMMA. Lower VGPR / LDS footprint.
- `scale_load_path="buffer_lds_stage" | "buffer_lds_stage_ab_split"`: scales go via `buffer_load` → LDS instead of TDM. Frees TDM for A/B; the `_ab_split` variant pipelines A-scale and B-scale separately.
- `@lru_cache` on `compile_mxscale_gemm` + `flyc.compile` pre-binding, so calls bypass `JitFunction.bind`'s `inspect.Signature` + cache-key hashing.
- COMGR preload shim removal (`python/flydsl/__init__.py`, `python/flydsl/_compat.py`): required for FFM import.
Test Plan
Run `tests/kernels/test_gemm_fp8fp4_gfx1250.py` (FP4 / FP8 / A8W4, mcast, irregular-tile) on `gfx1250`.
Test Result
`pytest ::test_mxfp8_gemm`: 60 passed, 36 skipped, 0 failed. Other parametrized cases pass on `gfx1250`.
Submission Checklist