
[Feature]: More low-level AMD/CDNA controls on SoTA Kernels #515

@gangliao

Suggestion Description

We are trying to optimize a FlashAttention (FA) forward kernel on AMD MI350 / gfx950 using FlyDSL. FlyDSL is already expressive enough for many important primitives: MFMA, raw_ptr_buffer_load_lds, ds_read_tr16_b64, s_waitcnt, s_barrier, sched_group_barrier, s_setprio, and llvm.inline_asm.

However, for peak performance, the current high-level FlyDSL path still trails hand-tuned kernels from HipKittens/AITER/native ISA codegen. For example, a native ISA implementation of a similar 2-tile pipeline reaches 1000-1160+ TFLOPS on the same MI350/gfx950 hardware.

This issue is a feature request for more explicit low-level controls in FlyDSL so that advanced users can close this gap without leaving the FlyDSL programming model.

Observed gap

FlyDSL can emit the right broad instruction families, but the generated final ISA differs from high-performance handwritten kernels in several important ways:

  • less predictable register allocation
  • less precise MFMA co-execute packing
  • extra VALU instructions around address calculation
  • extra VMEM/store instructions in some output paths
  • conservative or suboptimal wait/scheduling behavior in some hot-loop regions
  • difficulty reproducing fixed VGPR/SGPR layouts used by native ISA kernels

The native ISA path uses a fixed register layout, explicit instruction order, precomputed DMA m0 SGPRs, DMA-in-MFMA shadows, deferred softmax, ping-pong S buffers, and permlane dwordx4 stores. That level of control is difficult to reproduce reliably in FlyDSL today.

Requested features

1. Explicit register allocation / register pinning

It would be very useful to have an advanced API to pin values or vector ranges to specific VGPR/SGPR ranges, or at least to reserve named register regions.

Example use case:

v32-v95    S buffers
v96-v159   O accumulators
v160-v191  persistent Q
v192-v255  K/V preread
s63-s67    precomputed DMA m0 values

Even a limited "best effort" or "expert mode" register allocation annotation would help.
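
To make the request concrete, here is a minimal sketch of what a named register plan could look like as plain data, with a check that regions stay in range and do not overlap. Nothing here is an existing FlyDSL API: RegRegion, check_plan, and the per-bank limits (256 VGPRs, 104 SGPRs) are hypothetical names and assumed values used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegRegion:
    """A named, inclusive register range. Hypothetical type, not a FlyDSL API."""
    name: str
    bank: str   # "v" for VGPR, "s" for SGPR
    first: int
    last: int   # inclusive

    @property
    def size(self) -> int:
        return self.last - self.first + 1


def check_plan(regions, limits):
    """Validate ranges and overlaps; return registers reserved per bank."""
    used = {}
    for bank in {r.bank for r in regions}:
        in_bank = sorted((r for r in regions if r.bank == bank),
                         key=lambda r: r.first)
        prev = None
        for r in in_bank:
            if r.first < 0 or r.last < r.first or r.last >= limits[bank]:
                raise ValueError(f"{r.name}: range outside bank '{bank}'")
            if prev is not None and r.first <= prev.last:
                raise ValueError(f"{r.name} overlaps {prev.name}")
            prev = r
        used[bank] = sum(r.size for r in in_bank)
    return used


# Layout taken from the use case above.
PLAN = [
    RegRegion("S buffers", "v", 32, 95),
    RegRegion("O accumulators", "v", 96, 159),
    RegRegion("persistent Q", "v", 160, 191),
    RegRegion("K/V preread", "v", 192, 255),
    RegRegion("DMA m0 values", "s", 63, 67),
]
LIMITS = {"v": 256, "s": 104}  # assumed per-wave limits, adjust per target

USAGE = check_plan(PLAN, LIMITS)
print(USAGE)
```

A declarative plan like this could be validated up front and then handed to the register allocator as hints or hard constraints, depending on how strict an "expert mode" turns out to be.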

2. Better inline assembly integration

llvm.inline_asm works, but for full-kernel optimization it would help to have first-class helpers for common AMD instructions and constraints:

  • buffer_load_dwordx4 ... offen lds
  • ds_read_b128
  • ds_read_b64_tr_b16
  • v_permlane32_swap_b32
  • v_permlane16_swap_b32
  • precise s_waitcnt vmcnt(N) lgkmcnt(M)
  • s_mov_b32 m0, ...

Ideally these helpers should make it clear which operands are SGPR/VGPR/immediate and should avoid accidental lowering changes.
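
As an illustration of "clear operand classes", the sketch below models SGPR, VGPR, and immediate operands as distinct Python types and renders two of the instructions listed above as assembly text. The helper names (SGPR, VGPR, Imm, s_waitcnt, s_mov_b32_m0) are hypothetical, and the counter limits (vmcnt up to 63, lgkmcnt up to 15) are assumed GFX9-family values that should be checked against the target ISA guide.

```python
from dataclasses import dataclass

VMCNT_MAX = 63    # assumed 6-bit counter
LGKMCNT_MAX = 15  # assumed 4-bit counter


@dataclass(frozen=True)
class SGPR:
    index: int

    def __str__(self) -> str:
        return f"s{self.index}"


@dataclass(frozen=True)
class VGPR:
    index: int

    def __str__(self) -> str:
        return f"v{self.index}"


@dataclass(frozen=True)
class Imm:
    value: int

    def __str__(self) -> str:
        return str(self.value)


def s_waitcnt(vmcnt=None, lgkmcnt=None) -> str:
    """Render an s_waitcnt with only the counters the caller asked for."""
    parts = []
    if vmcnt is not None:
        if not 0 <= vmcnt <= VMCNT_MAX:
            raise ValueError("vmcnt out of range")
        parts.append(f"vmcnt({vmcnt})")
    if lgkmcnt is not None:
        if not 0 <= lgkmcnt <= LGKMCNT_MAX:
            raise ValueError("lgkmcnt out of range")
        parts.append(f"lgkmcnt({lgkmcnt})")
    if not parts:
        raise ValueError("at least one counter is required")
    return "s_waitcnt " + " ".join(parts)


def s_mov_b32_m0(src) -> str:
    """m0 is a scalar register, so a VGPR source is rejected early."""
    if not isinstance(src, (SGPR, Imm)):
        raise TypeError("m0 source must be an SGPR or an immediate")
    return f"s_mov_b32 m0, {src}"


print(s_waitcnt(vmcnt=2, lgkmcnt=0))
print(s_mov_b32_m0(SGPR(63)))
```

The point of the typed wrappers is that an operand-class mistake fails in Python, before any inline asm string reaches the backend.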

3. More control over final instruction scheduling

For peak CDNA kernels, instruction placement inside MFMA co-execute windows is critical. It would be helpful to have stronger scheduling primitives than the current sched_group_barrier/inline asm combination.

Desired capability:

  • group exactly N MFMAs with M VALU/EXP/DS/VMEM instructions
  • prevent LLVM from moving selected instructions across a region
  • preserve ordering of hand-scheduled blocks
  • inspect or assert final generated instruction counts/order
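
The last bullet, asserting on final instruction order, can be approximated today with a small post-processing script over the dumped assembly. The sketch below is one possible shape, not an existing FlyDSL tool: it classifies mnemonics by prefix (a rough heuristic) and run-length encodes the stream so a test can assert on the interleaving. The sample listing is hypothetical.

```python
import re
from itertools import groupby


def classify(mnemonic: str) -> str:
    """Bucket an instruction by mnemonic prefix (heuristic, not exhaustive)."""
    if mnemonic.startswith("v_mfma"):
        return "MFMA"
    if mnemonic.startswith("v_exp"):
        return "EXP"
    if mnemonic.startswith("ds_"):
        return "DS"
    if mnemonic.startswith(("buffer_", "global_")):
        return "VMEM"
    if mnemonic.startswith("v_"):
        return "VALU"
    if mnemonic.startswith("s_"):
        return "SALU"
    return "OTHER"


def schedule_pattern(asm_text: str):
    """Return [(kind, run_length), ...] in program order."""
    kinds = []
    for line in asm_text.splitlines():
        # Drop trailing comments, then skip blanks, labels, and directives.
        code = re.split(r";|//", line, maxsplit=1)[0].strip()
        if not code or code.endswith(":") or code.startswith("."):
            continue
        kinds.append(classify(code.split()[0]))
    return [(kind, len(list(run))) for kind, run in groupby(kinds)]


def mfma_runs(pattern):
    """Lengths of consecutive MFMA groups, in order."""
    return [n for kind, n in pattern if kind == "MFMA"]


# Hypothetical hot-loop fragment, used only to exercise the parser.
SAMPLE = """
.LBB0_1:
  v_mfma_f32_32x32x16_bf16 v[96:111], v[160:163], v[192:195], v[96:111]
  v_exp_f32_e32 v32, v32 ; softmax
  ds_read_b128 v[192:195], v0
  v_mfma_f32_32x32x16_bf16 v[112:127], v[164:167], v[196:199], v[112:127]
  v_mfma_f32_32x32x16_bf16 v[128:143], v[168:171], v[200:203], v[128:143]
  s_waitcnt lgkmcnt(0)
"""

PATTERN = schedule_pattern(SAMPLE)
print(PATTERN)
```

A first-class version inside FlyDSL could run the same kind of check on the real final ISA and fail compilation when a hand-scheduled block is reordered.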

4. First-class byte-offset buffer store/load APIs

buffer_ops.buffer_store(..., offset_is_bytes=True) helps, but output store optimization is still tricky. A lower-level API for exact AMD buffer store forms would help implement dwordx4 output paths without extra address arithmetic or unwanted cache modifiers.

Useful forms:

  • buffer_store_dword
  • buffer_store_dwordx2
  • buffer_store_dwordx4
  • explicit voffset, soffset, offset, aux/cache fields
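
To show what "explicit voffset, soffset, offset" could mean in practice, the sketch below splits a constant byte offset into a scalar part and an immediate part, then renders the store as assembly text so no VALU add is needed for the constant. All names are hypothetical, the 12-bit immediate limit (0 to 4095) is an assumption about GFX9-family MUBUF encoding, and cache/aux modifiers are deliberately left out.

```python
MUBUF_IMM_MAX = 4095  # assumed 12-bit unsigned immediate offset field

MNEMONIC = {
    1: "buffer_store_dword",
    2: "buffer_store_dwordx2",
    4: "buffer_store_dwordx4",
}


def split_byte_offset(byte_offset: int):
    """Split a constant byte offset into (scalar_part, immediate_part)."""
    if byte_offset < 0:
        raise ValueError("offset must be non-negative")
    imm = byte_offset % (MUBUF_IMM_MAX + 1)
    return byte_offset - imm, imm


def render_store(dwords: int, vdata: int, voffset: int,
                 srsrc: int, soffset: str, imm: int) -> str:
    """Render one store; soffset is an SGPR name such as "s12" or the literal "0"."""
    if dwords not in MNEMONIC:
        raise ValueError("dwords must be 1, 2, or 4")
    if not 0 <= imm <= MUBUF_IMM_MAX:
        raise ValueError("immediate offset out of range")
    data = f"v{vdata}" if dwords == 1 else f"v[{vdata}:{vdata + dwords - 1}]"
    text = (f"{MNEMONIC[dwords]} {data}, v{voffset}, "
            f"s[{srsrc}:{srsrc + 3}], {soffset} offen")
    return text + (f" offset:{imm}" if imm else "")


scalar_part, imm_part = split_byte_offset(4112)
print(scalar_part, imm_part)
print(render_store(4, 96, 0, 8, "s12", imm_part))
```

In this sketch the scalar part would be materialized once into an SGPR (s12 here) outside the hot loop, while the immediate part rides along in the instruction encoding.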

5. Easier compiler option tuning

FlyDSL already supports llvm_options and maxnreg, but it would be helpful to document and expose recommended AMD options for performance tuning, for example:

  • enable-post-misched
  • greedy-reverse-local-assignment
  • amdgpu-early-inline-all
  • amdgpu-function-calls
  • unroll-count
  • maxnreg / --amdgpu-num-vgpr

A small official example showing how to sweep these options for ROCm kernels would be useful.
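
A sweep example could be as small as the sketch below. It only enumerates option combinations and picks the fastest according to a caller-supplied measure function; how a configuration is passed to FlyDSL (for example through llvm_options) is intentionally left to that callback. The option values and the placeholder_measure cost function are made up for illustration and are not measurements.

```python
from itertools import product

# Option names come from the list above; the candidate values are assumptions.
SWEEP = {
    "enable-post-misched": [True, False],
    "greedy-reverse-local-assignment": [True, False],
    "unroll-count": [1, 2, 4],
}


def configurations(space):
    """Yield one dict per point in the Cartesian product of the option space."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def sweep(space, measure):
    """Return (best_config, best_time) for measure(config) -> seconds."""
    # The index breaks ties so that dicts are never compared.
    results = [(measure(cfg), i, cfg)
               for i, cfg in enumerate(configurations(space))]
    best_time, _, best_cfg = min(results)
    return best_cfg, best_time


def placeholder_measure(cfg):
    """Stand-in cost model; replace with a real compile-and-time call."""
    penalty = 0.0 if cfg["enable-post-misched"] else 0.1
    return 1.0 / cfg["unroll-count"] + penalty


BEST_CFG, BEST_TIME = sweep(SWEEP, placeholder_measure)
print(BEST_CFG, BEST_TIME)
```

In a real sweep, measure would compile the kernel with the given options, run it several times, and return a median runtime.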

6. ISA dump / analysis workflow as a first-class feature

FLYDSL_DUMP_IR=1 is very useful. It would be even better if FlyDSL had a documented way to dump:

  • final ISA
  • VGPR/SGPR/LDS usage
  • waitcnt counts
  • MFMA counts
  • VMEM/LDS/store counts
  • kernel metadata

This would make it easier to compare FlyDSL kernels against CK/AITER/HipKittens/native ISA kernels.
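
Until such a workflow exists, a comparison can be sketched with a short script over assembly text. The example below assumes the dump contains AMDGPU .amdhsa_ kernel descriptor directives (.amdhsa_next_free_vgpr, .amdhsa_next_free_sgpr, .amdhsa_group_segment_fixed_size); whether FlyDSL can emit that form today is not something this sketch claims. The two dumps are hypothetical and differ only in the output store path.

```python
import re
from collections import Counter

DIRECTIVES = {
    "vgprs": r"\.amdhsa_next_free_vgpr\s+(\d+)",
    "sgprs": r"\.amdhsa_next_free_sgpr\s+(\d+)",
    "lds_bytes": r"\.amdhsa_group_segment_fixed_size\s+(\d+)",
}


def summarize(asm_text: str) -> dict:
    """Count a few instruction families and read resource directives."""
    counts = Counter()
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0].strip()
        if not code or code.endswith(":") or code.startswith("."):
            continue
        mnemonic = code.split()[0]
        if mnemonic.startswith("v_mfma"):
            counts["mfma"] += 1
        elif mnemonic == "s_waitcnt":
            counts["waitcnt"] += 1
        elif mnemonic.startswith("ds_"):
            counts["lds"] += 1
        elif mnemonic.startswith("buffer_store"):
            counts["vmem_store"] += 1
        elif mnemonic.startswith("buffer_load"):
            counts["vmem_load"] += 1
    summary = dict(counts)
    for key, pattern in DIRECTIVES.items():
        match = re.search(pattern, asm_text)
        if match:
            summary[key] = int(match.group(1))
    return summary


def diff(a: dict, b: dict) -> dict:
    """Per-key change from a to b, omitting keys that are equal."""
    keys = sorted(set(a) | set(b))
    return {k: b.get(k, 0) - a.get(k, 0)
            for k in keys if a.get(k, 0) != b.get(k, 0)}


HEADER = """.amdhsa_kernel fa_fwd
  .amdhsa_group_segment_fixed_size 65536
  .amdhsa_next_free_vgpr 256
  .amdhsa_next_free_sgpr 96
.end_amdhsa_kernel
fa_fwd:
  s_waitcnt vmcnt(0)
  v_mfma_f32_32x32x16_bf16 v[96:111], v[160:163], v[192:195], v[96:111]
  ds_read_b128 v[192:195], v0
"""


def dump(stores):
    """Build a hypothetical dump with the given store instructions."""
    return HEADER + "\n".join(stores) + "\n  s_endpgm\n"


DUMP_A = dump(["  buffer_store_dword v96, v1, s[8:11], 0 offen",
               "  buffer_store_dword v97, v1, s[8:11], 0 offen offset:4"])
DUMP_B = dump(["  buffer_store_dwordx4 v[96:99], v1, s[8:11], 0 offen"])

SUMMARY_A = summarize(DUMP_A)
DELTA = diff(SUMMARY_A, summarize(DUMP_B))
print(SUMMARY_A)
print(DELTA)
```

The same summary could be produced for a CK, AITER, HipKittens, or native ISA kernel, so the diff highlights exactly where the generated code departs from the reference.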

Why this matters

FlyDSL is close enough to express many advanced CDNA concepts, but peak attention kernels need more final-ISA control than typical compiler-generated GPU code. HipKittens/AITER/native ISA kernels achieve substantially higher performance on the same hardware mainly through exact register layout and instruction scheduling.

If FlyDSL can expose a small set of expert-level controls, it could become a much stronger path for writing maintainable kernels that still approach handwritten ISA performance.

Current workaround

The workaround is to leave FlyDSL and use native ISA codegen directly. That works for performance, but it loses many benefits of FlyDSL: Python-level composition, easier integration, MLIR-level structure, and better maintainability.

We would prefer to stay in FlyDSL if the framework can expose enough low-level control for expert AMD kernel tuning.

Operating System

No response

GPU

No response

ROCm Component

No response
