Suggestion Description
We are trying to optimize a FlashAttention (FA) forward kernel on AMD MI350 / gfx950 using FlyDSL. FlyDSL is already expressive enough for many important primitives: MFMA, raw_ptr_buffer_load_lds, ds_read_tr16_b64, s_waitcnt, s_barrier, sched_group_barrier, s_setprio, and llvm.inline_asm.
However, for peak performance, the current high-level FlyDSL path still trails hand-tuned kernels from HipKittens/AITER/native ISA codegen. For example, a native ISA implementation of a similar 2-tile pipeline reaches 1000-1160+ TFLOPS on the same MI350/gfx950 hardware.
This issue is a feature request for more explicit low-level controls in FlyDSL so that advanced users can close this gap without leaving the FlyDSL programming model.
Observed gap
FlyDSL can emit the right broad instruction families, but the generated final ISA differs from high-performance handwritten kernels in several important ways:
- less predictable register allocation
- less precise MFMA co-execute packing
- extra VALU instructions around address calculation
- extra VMEM/store instructions in some output paths
- conservative or suboptimal wait/scheduling behavior in some hot-loop regions
- difficulty reproducing fixed VGPR/SGPR layouts used by native ISA kernels
The native ISA path uses a fixed register layout, explicit instruction order, precomputed DMA m0 SGPRs, DMA-in-MFMA shadows, deferred softmax, ping-pong S buffers, and permlane dwordx4 stores. That level of control is difficult to reproduce reliably in FlyDSL today.
Requested features
1. Explicit register allocation / register pinning
It would be very useful to have an advanced API to pin values or vector ranges to specific VGPR/SGPR ranges, or at least to reserve named register regions.
Example use case:
v32-v95 S buffers
v96-v159 O accumulators
v160-v191 persistent Q
v192-v255 K/V preread
s63-s67 precomputed DMA m0 values
Even a limited "best effort" or "expert mode" register allocation annotation would help.
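To make the request concrete, the layout above can be written down as data and checked for overlaps before any kernel code is generated. This is only a sketch in plain Python: `RegRegion` and `check_layout` are hypothetical names invented here, not an existing FlyDSL API, and the check says nothing about how the compiler would honor the pinning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegRegion:
    """Hypothetical description of one pinned register range (inclusive bounds)."""
    name: str
    bank: str   # "v" for VGPR, "s" for SGPR
    first: int
    last: int

    @property
    def size(self) -> int:
        return self.last - self.first + 1


def check_layout(regions):
    """Raise ValueError on overlapping regions; return registers used per bank."""
    by_bank = {}
    for r in regions:
        if r.first > r.last:
            raise ValueError(f"{r.name}: empty range {r.first}-{r.last}")
        by_bank.setdefault(r.bank, []).append(r)

    totals = {}
    for bank, rs in by_bank.items():
        rs.sort(key=lambda r: r.first)
        # After sorting, only neighbours can overlap.
        for a, b in zip(rs, rs[1:]):
            if b.first <= a.last:
                raise ValueError(f"{a.name} overlaps {b.name} in bank {bank}")
        totals[bank] = sum(r.size for r in rs)
    return totals


# The layout from the use case above.
LAYOUT = [
    RegRegion("S buffers", "v", 32, 95),
    RegRegion("O accumulators", "v", 96, 159),
    RegRegion("persistent Q", "v", 160, 191),
    RegRegion("K/V preread", "v", 192, 255),
    RegRegion("precomputed DMA m0 values", "s", 63, 67),
]

totals = check_layout(LAYOUT)
print(totals)
```

A pinning annotation in FlyDSL could accept a description of this shape, so that a layout error is reported at Python level rather than discovered in the final ISA.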
2. Better inline assembly integration
llvm.inline_asm works, but for full-kernel optimization it would help to have first-class helpers for common AMD instructions and constraints:
- buffer_load_dwordx4 ... offen lds
- ds_read_b128
- ds_read_b64_tr_b16
- v_permlane32_swap_b32
- v_permlane16_swap_b32
- precise s_waitcnt vmcnt(N) lgkmcnt(M)
- s_mov_b32 m0, ...
Ideally these helpers should make it clear which operands are SGPR/VGPR/immediate and should avoid accidental lowering changes.
3. More control over final instruction scheduling
For peak CDNA kernels, instruction placement inside MFMA co-execute windows is critical. It would be helpful to have stronger scheduling primitives than the current sched_group_barrier/inline asm combination.
Desired capability:
- group exactly N MFMAs with M VALU/EXP/DS/VMEM instructions
- prevent LLVM from moving selected instructions across a region
- preserve ordering of hand-scheduled blocks
- inspect or assert final generated instruction counts/order
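As an illustration of the last point, a check like the following could run on the list of final mnemonics. It is a minimal sketch in plain Python: `runs` and `check_packing` are hypothetical helpers, the instruction stream is a made-up sample, and the check only looks at textual order, not at what the hardware actually co-executes.

```python
from itertools import groupby


def runs(mnemonics):
    """Collapse the stream into (is_mfma, run_length) pairs for consecutive runs."""
    return [
        (is_mfma, len(list(group)))
        for is_mfma, group in groupby(mnemonics, key=lambda m: m.startswith("v_mfma"))
    ]


def check_packing(mnemonics, n_mfma, max_other):
    """True if every MFMA run has exactly n_mfma instructions and every
    non-MFMA run has at most max_other instructions."""
    for is_mfma, length in runs(mnemonics):
        if is_mfma and length != n_mfma:
            return False
        if not is_mfma and length > max_other:
            return False
    return True


# Made-up sample stream, for illustration only.
STREAM = [
    "v_mfma_f32_32x32x8_f16", "v_mfma_f32_32x32x8_f16",
    "v_exp_f32", "ds_read_b128",
    "v_mfma_f32_32x32x8_f16", "v_mfma_f32_32x32x8_f16",
    "buffer_load_dwordx4",
]

print(runs(STREAM))
print(check_packing(STREAM, n_mfma=2, max_other=2))
```

An assertion of this kind, run after compilation, would catch cases where LLVM has reordered a hand-scheduled block.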
4. First-class byte-offset buffer store/load APIs
buffer_ops.buffer_store(..., offset_is_bytes=True) helps, but output store optimization is still tricky. A lower-level API for exact AMD buffer store forms would help implement dwordx4 output paths without extra address arithmetic or unwanted cache modifiers.
Useful forms:
- buffer_store_dword
- buffer_store_dwordx2
- buffer_store_dwordx4
- explicit voffset, soffset, offset, aux/cache fields
5. Easier compiler option tuning
FlyDSL already supports llvm_options and maxnreg, but it would be helpful to document and expose recommended AMD options for performance tuning, for example:
- enable-post-misched
- greedy-reverse-local-assignment
- amdgpu-early-inline-all
- amdgpu-function-calls
- unroll-count
- maxnreg / --amdgpu-num-vgpr
A small official example showing how to sweep these options for ROCm kernels would be useful.
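The kind of sweep we have in mind could look roughly like this. The sketch is plain Python and deliberately does not call FlyDSL: `SEARCH_SPACE` holds assumed candidate values, `sweep` is a hypothetical helper, and `fake_run` is a placeholder for "compile the kernel with this configuration and return measured TFLOPS".

```python
from itertools import product

# Candidate values are assumptions chosen for illustration, not recommendations.
SEARCH_SPACE = {
    "enable-post-misched": [True, False],
    "unroll-count": [1, 2, 4],
    "maxnreg": [128, 256],
}


def sweep(space, run_once):
    """Try every combination in `space`; return (best_score, best_config).

    `run_once(config)` is supplied by the user and returns a score where
    higher is better (for example measured TFLOPS).
    """
    names = list(space)
    results = []
    for values in product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        results.append((run_once(config), config))
    return max(results, key=lambda r: r[0])


def fake_run(config):
    """Placeholder scoring function so the sketch runs without a GPU."""
    score = config["unroll-count"] * 10.0
    score += 5.0 if config["enable-post-misched"] else 0.0
    score -= config["maxnreg"] / 128
    return score


best_score, best_config = sweep(SEARCH_SPACE, fake_run)
print(best_score, best_config)
```

In a real sweep, `run_once` would translate the configuration into whatever form llvm_options and maxnreg expect, build the kernel, and time it.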
6. ISA dump / analysis workflow as a first-class feature
FLYDSL_DUMP_IR=1 is very useful. It would be even better if FlyDSL had a documented way to dump:
- final ISA
- VGPR/SGPR/LDS usage
- waitcnt counts
- MFMA counts
- VMEM/LDS/store counts
- kernel metadata
This would make it easier to compare FlyDSL kernels against CK/AITER/HipKittens/native ISA kernels.
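Today we approximate this with ad hoc scripts over the assembly text. The sketch below shows the idea in plain Python, assuming the final ISA is already available as text; `isa_stats` is a hypothetical helper, the sample text is made up, and the metadata keys are the `.vgpr_count`, `.sgpr_count` and `.group_segment_fixed_size` fields of the AMDGPU code object metadata.

```python
import re
from collections import Counter

# First matching category wins; patterns are applied to the mnemonic only.
CATEGORIES = [
    ("mfma", re.compile(r"^v_mfma")),
    ("waitcnt", re.compile(r"^s_waitcnt")),
    ("vmem_load", re.compile(r"^(buffer|global)_load")),
    ("vmem_store", re.compile(r"^(buffer|global)_store")),
    ("lds", re.compile(r"^ds_")),
]
META = re.compile(r"^\.(vgpr_count|sgpr_count|group_segment_fixed_size):\s*(\d+)")


def isa_stats(text):
    """Return (instruction counts per category, kernel metadata) from ISA text."""
    counts = Counter()
    meta = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        m = META.match(line)
        if m:
            meta[m.group(1)] = int(m.group(2))
            continue
        mnemonic = line.split(None, 1)[0]
        for name, pattern in CATEGORIES:
            if pattern.match(mnemonic):
                counts[name] += 1
                break
    return dict(counts), meta


# Made-up sample, for illustration only.
SAMPLE = """
    s_waitcnt vmcnt(0) lgkmcnt(0)
    buffer_load_dwordx4 v[0:3], v4, s[8:11], 0 offen
    ds_read_b128 v[8:11], v5
    v_mfma_f32_32x32x8_f16 v[32:47], v[8:9], v[10:11], v[32:47]
    v_mfma_f32_32x32x8_f16 v[32:47], v[12:13], v[14:15], v[32:47]
    buffer_store_dwordx4 v[32:35], v6, s[12:15], 0 offen
    .vgpr_count: 256
    .sgpr_count: 96
    .group_segment_fixed_size: 65536
"""

counts, meta = isa_stats(SAMPLE)
print(counts)
print(meta)
```

Running the same function over a FlyDSL dump and over a reference kernel gives a quick side-by-side of where the two diverge.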
Why this matters
FlyDSL is close enough to express many advanced CDNA concepts, but peak attention kernels need more final-ISA control than typical compiler-generated GPU code. HipKittens/AITER/native ISA kernels achieve substantially higher performance on the same hardware mainly through exact register layout and instruction scheduling.
If FlyDSL can expose a small set of expert-level controls, it could become a much stronger path for writing maintainable kernels that still approach handwritten ISA performance.
Current workaround
The workaround is to leave FlyDSL and use native ISA codegen directly. That works for performance, but it loses many benefits of FlyDSL: Python-level composition, easier integration, MLIR-level structure, and better maintainability.
We would prefer to stay in FlyDSL if the framework can expose enough low-level control for expert AMD kernel tuning.
Operating System
No response
GPU
No response
ROCm Component
No response