[Feature]: More low-level AMD/CDNA controls for SoTA kernels

### Suggestion Description

We are trying to optimize a FlashAttention (FA) forward kernel on AMD MI350 / gfx950 using FlyDSL. FlyDSL is already expressive enough for many important primitives: MFMA, `raw_ptr_buffer_load_lds`, `ds_read_tr16_b64`, `s_waitcnt`, `s_barrier`, `sched_group_barrier`, `s_setprio`, and `llvm.inline_asm`.

However, for peak performance, the current high-level FlyDSL path still trails hand-tuned kernels from HipKittens/AITER/native ISA codegen. For example, a native ISA implementation of a similar 2-tile pipeline reaches 1000-1160+ TFLOPS on the same MI350/gfx950 hardware.

This issue is a feature request for more explicit low-level controls in FlyDSL so that advanced users can close this gap without leaving the FlyDSL programming model.

## Observed gap

FlyDSL can emit the right broad instruction families, but the generated final ISA differs from high-performance handwritten kernels in several important ways:

- less predictable register allocation
- less precise MFMA co-execute packing
- extra VALU instructions around address calculation
- extra VMEM/store instructions in some output paths
- conservative or suboptimal wait/scheduling behavior in some hot-loop regions
- difficulty reproducing fixed VGPR/SGPR layouts used by native ISA kernels

The native ISA path uses a fixed register layout, explicit instruction order, precomputed DMA `m0` SGPRs, DMA-in-MFMA shadows, deferred softmax, ping-pong S buffers, and permlane dwordx4 stores. That level of control is difficult to reproduce reliably in FlyDSL today.

## Requested features

### 1. Explicit register allocation / register pinning

It would be very useful to have an advanced API to pin values or vector ranges to specific VGPR/SGPR ranges, or at least to reserve named register regions.

Example use case:

```text
v32-v95    S buffers
v96-v159   O accumulators
v160-v191  persistent Q
v192-v255  K/V preread
s63-s67    precomputed DMA m0 values
```

Even a limited "best effort" or "expert mode" register allocation annotation would help.
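To make the request more concrete, here is a minimal sketch of how such a layout could be declared and sanity-checked on the Python side. `RegRegion` and `check_layout` are hypothetical names for illustration only, not existing FlyDSL APIs, and the 256-VGPR budget is an assumption about the architectural VGPR range available to one wave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegRegion:
    """Hypothetical named register region (not a FlyDSL type)."""
    name: str
    bank: str   # "v" for VGPR, "s" for SGPR
    first: int
    last: int   # inclusive

    @property
    def size(self) -> int:
        return self.last - self.first + 1


def check_layout(regions, vgpr_budget=256):
    """Reject overlapping regions and VGPR indices beyond the assumed budget.

    Returns the number of registers reserved per bank.
    """
    usage = {}
    for bank in ("v", "s"):
        in_bank = sorted((r for r in regions if r.bank == bank),
                         key=lambda r: r.first)
        for a, b in zip(in_bank, in_bank[1:]):
            if b.first <= a.last:
                raise ValueError(f"{a.name!r} overlaps {b.name!r}")
        usage[bank] = sum(r.size for r in in_bank)
    highest_vgpr = max((r.last for r in regions if r.bank == "v"), default=-1)
    if highest_vgpr >= vgpr_budget:
        raise ValueError(f"v{highest_vgpr} exceeds budget of {vgpr_budget}")
    return usage


# The layout from the example use case above.
LAYOUT = [
    RegRegion("S buffers", "v", 32, 95),
    RegRegion("O accumulators", "v", 96, 159),
    RegRegion("persistent Q", "v", 160, 191),
    RegRegion("K/V preread", "v", 192, 255),
    RegRegion("precomputed DMA m0 values", "s", 63, 67),
]

usage = check_layout(LAYOUT)
```

A declaration like this would only be useful if the backend honored it; the sketch shows the kind of input an "expert mode" annotation might accept.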

### 2. Better inline assembly integration

`llvm.inline_asm` works, but for full-kernel optimization it would help to have first-class helpers for common AMD instructions and constraints:

- `buffer_load_dwordx4 ... offen lds`
- `ds_read_b128`
- `ds_read_b64_tr_b16`
- `v_permlane32_swap_b32`
- `v_permlane16_swap_b32`
- precise `s_waitcnt vmcnt(N) lgkmcnt(M)`
- `s_mov_b32 m0, ...`

Ideally these helpers should make it clear which operands are SGPR/VGPR/immediate and should avoid accidental lowering changes.
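As an illustration of the kind of helper we have in mind, the sketch below builds the text of a precise `s_waitcnt` and validates the counters before anything reaches inline asm. The function name is hypothetical, and the counter limits (6-bit `vmcnt`, 4-bit `lgkmcnt`) are our assumption for gfx9-family encodings; a real helper should take them from the target description.

```python
# Assumed counter widths for gfx9-family targets (labelled assumption).
VMCNT_MAX = 63    # 6-bit field
LGKMCNT_MAX = 15  # 4-bit field


def s_waitcnt(vmcnt=None, lgkmcnt=None):
    """Return assembly text for s_waitcnt with only the requested counters.

    Hypothetical helper: it only formats and range-checks, it does not
    emit anything into a kernel.
    """
    parts = []
    if vmcnt is not None:
        if not 0 <= vmcnt <= VMCNT_MAX:
            raise ValueError(f"vmcnt({vmcnt}) out of range 0..{VMCNT_MAX}")
        parts.append(f"vmcnt({vmcnt})")
    if lgkmcnt is not None:
        if not 0 <= lgkmcnt <= LGKMCNT_MAX:
            raise ValueError(f"lgkmcnt({lgkmcnt}) out of range 0..{LGKMCNT_MAX}")
        parts.append(f"lgkmcnt({lgkmcnt})")
    if not parts:
        raise ValueError("at least one counter must be given")
    return "s_waitcnt " + " ".join(parts)


asm_text = s_waitcnt(vmcnt=2, lgkmcnt=0)
```

The resulting string could be passed to `llvm.inline_asm` today; a first-class version would additionally tell the compiler not to merge or relax the wait.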

### 3. More control over final instruction scheduling

For peak CDNA kernels, instruction placement inside MFMA co-execute windows is critical. It would be helpful to have stronger scheduling primitives than the current `sched_group_barrier`/inline asm combination.

Desired capability:

- group exactly N MFMAs with M VALU/EXP/DS/VMEM instructions
- prevent LLVM from moving selected instructions across a region
- preserve ordering of hand-scheduled blocks
- inspect or assert final generated instruction counts/order
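The last bullet can be approximated outside the compiler today. The sketch below classifies instructions in a marked region of an ISA dump by mnemonic prefix and checks the MFMA-to-other ratio. The `; BEGIN hot_loop` / `; END hot_loop` markers and the sample listing are assumptions for illustration; FlyDSL does not emit such markers as far as we know.

```python
def classify(mnemonic):
    """Map an AMDGPU mnemonic to a coarse category by prefix."""
    if mnemonic.startswith("v_mfma_"):
        return "MFMA"
    if mnemonic.startswith("v_exp_"):
        return "EXP"
    if mnemonic.startswith("ds_"):
        return "DS"
    if mnemonic.startswith(("buffer_", "global_", "flat_")):
        return "VMEM"
    if mnemonic.startswith("v_"):
        return "VALU"
    if mnemonic.startswith("s_"):
        return "SALU"
    return "OTHER"


def region_sequence(isa_text, begin, end):
    """Return instruction categories, in order, between two marker lines."""
    inside = False
    seq = []
    for raw in isa_text.splitlines():
        line = raw.strip()
        if line == begin:
            inside = True
            continue
        if line == end:
            break
        if not inside:
            continue
        code = line.split(";", 1)[0].strip()  # drop trailing comments
        if not code or code.endswith(":") or code.startswith("."):
            continue  # skip blanks, labels, and directives
        seq.append(classify(code.split()[0]))
    return seq


def check_group(seq, n_mfma, n_other):
    """True if the region holds exactly n_mfma MFMAs and n_other other ops."""
    mfma = sum(1 for kind in seq if kind == "MFMA")
    return mfma == n_mfma and len(seq) - mfma == n_other


SAMPLE = """\
; BEGIN hot_loop
v_mfma_f32_32x32x16_bf16 a[0:15], v[0:3], v[4:7], a[0:15]
ds_read_b128 v[8:11], v12
v_exp_f32 v13, v14
v_mfma_f32_32x32x16_bf16 a[16:31], v[0:3], v[4:7], a[16:31]
buffer_load_dwordx4 v[16:19], v20, s[0:3], 0 offen
; END hot_loop
"""

seq = region_sequence(SAMPLE, "; BEGIN hot_loop", "; END hot_loop")
ok = check_group(seq, n_mfma=2, n_other=3)
```

A built-in equivalent could run after final code generation and fail the build when a hand-scheduled block has been reordered.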

### 4. First-class byte-offset buffer store/load APIs

`buffer_ops.buffer_store(..., offset_is_bytes=True)` helps, but output store optimization is still tricky. A lower-level API for exact AMD buffer store forms would help implement dwordx4 output paths without extra address arithmetic or unwanted cache modifiers.

Useful forms:

- `buffer_store_dword`
- `buffer_store_dwordx2`
- `buffer_store_dwordx4`
- explicit `voffset`, `soffset`, `offset`, `aux/cache` fields
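A rough sketch of what an explicit store descriptor could look like is shown below. `BufferStore` is a hypothetical class, not part of `buffer_ops`; the rendered operand order follows the usual MUBUF assembly form, and the 0..4095 immediate range is our assumption for gfx9-family targets. Cache modifiers are passed through as opaque strings so that none are added implicitly.

```python
from dataclasses import dataclass

WIDTH_SUFFIX = {1: "dword", 2: "dwordx2", 4: "dwordx4"}
MAX_IMM_OFFSET = 4095  # assumed 12-bit immediate byte offset


@dataclass(frozen=True)
class BufferStore:
    """Hypothetical explicit description of one buffer store."""
    dwords: int       # 1, 2, or 4
    vdata: int        # first VGPR holding the data
    voffset: int      # VGPR holding the per-lane byte offset
    srsrc: int        # first SGPR of the 4-SGPR resource descriptor
    soffset: int      # SGPR holding the wave-uniform byte offset
    offset: int = 0   # immediate byte offset
    aux: tuple = ()   # cache modifier strings, emitted verbatim

    def render(self):
        if self.dwords not in WIDTH_SUFFIX:
            raise ValueError(f"unsupported width: {self.dwords} dwords")
        if not 0 <= self.offset <= MAX_IMM_OFFSET:
            raise ValueError(f"offset {self.offset} out of range")
        if self.dwords == 1:
            data = f"v{self.vdata}"
        else:
            data = f"v[{self.vdata}:{self.vdata + self.dwords - 1}]"
        text = (f"buffer_store_{WIDTH_SUFFIX[self.dwords]} {data}, "
                f"v{self.voffset}, s[{self.srsrc}:{self.srsrc + 3}], "
                f"s{self.soffset} offen")
        if self.offset:
            text += f" offset:{self.offset}"
        for flag in self.aux:
            text += f" {flag}"
        return text


store_text = BufferStore(dwords=4, vdata=0, voffset=4,
                         srsrc=8, soffset=12, offset=16).render()
```

The point of the explicit form is that every field maps to one operand, so no extra address arithmetic is needed between the Python call and the emitted instruction.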

### 5. Easier compiler option tuning

FlyDSL already supports `llvm_options` and `maxnreg`, but it would be helpful to document and expose recommended AMD options for performance tuning, for example:

- `enable-post-misched`
- `greedy-reverse-local-assignment`
- `amdgpu-early-inline-all`
- `amdgpu-function-calls`
- `unroll-count`
- `maxnreg` / `--amdgpu-num-vgpr`

A small official example showing how to sweep these options for ROCm kernels would be useful.
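Such a sweep could look roughly like the sketch below. It only enumerates option combinations and picks the fastest one; `build_and_time` is a placeholder callable that we assume compiles the kernel with the given options and returns a runtime. How the option dictionary maps onto FlyDSL's `llvm_options` argument is deliberately left out, and the fake timer exists only so the example runs on its own.

```python
import itertools


def option_grid(choices):
    """Yield one dict per combination of the given option values."""
    keys = list(choices)
    for values in itertools.product(*(choices[k] for k in keys)):
        yield dict(zip(keys, values))


def sweep(choices, build_and_time):
    """Return (best_time, best_options) over the full grid."""
    results = [(build_and_time(opts), opts) for opts in option_grid(choices)]
    return min(results, key=lambda r: r[0])


# Option names come from the list above; the value ranges are illustrative.
CHOICES = {
    "enable-post-misched": [False, True],
    "unroll-count": [1, 2, 4],
}


def fake_build_and_time(opts):
    """Stand-in for compile + benchmark; returns a made-up cost."""
    cost = 10.0
    if opts["enable-post-misched"]:
        cost -= 2.0
    cost += abs(opts["unroll-count"] - 2)
    return cost


best_time, best_opts = sweep(CHOICES, fake_build_and_time)
```

A full grid grows quickly, so an official example might also show how to prune it or sweep one option at a time.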

### 6. ISA dump / analysis workflow as a first-class feature

`FLYDSL_DUMP_IR=1` is very useful. It would be even better if FlyDSL had a documented way to dump:

- final ISA
- VGPR/SGPR/LDS usage
- `s_waitcnt` counts
- MFMA counts
- VMEM/LDS/store counts
- kernel metadata

This would make it easier to compare FlyDSL kernels against CK/AITER/HipKittens/native ISA kernels.
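Until something like this is built in, a small script over the disassembly gets part of the way there. The sketch below counts a few instruction classes and picks up resource fields from the kernel metadata. It assumes the dump is AMDGPU assembly text whose metadata contains `.vgpr_count`, `.sgpr_count`, and `.group_segment_fixed_size` lines; the sample listing is made up for illustration.

```python
import re
from collections import Counter

META_KEYS = ("vgpr_count", "sgpr_count", "group_segment_fixed_size")
META_RE = re.compile(r"\.(\w+):\s*(\d+)$")


def summarize_isa(text):
    """Return (instruction counts, metadata values) for an ISA dump."""
    counts = Counter()
    meta = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments
        if not line:
            continue
        m = META_RE.match(line)
        if m:
            if m.group(1) in META_KEYS:
                meta[m.group(1)] = int(m.group(2))
            continue
        op = line.split()[0]
        if op.startswith("v_mfma_"):
            counts["mfma"] += 1
        elif op == "s_waitcnt":
            counts["waitcnt"] += 1
        elif op.startswith("ds_"):
            counts["lds"] += 1
        elif op.startswith("buffer_store"):
            counts["vmem_store"] += 1
        elif op.startswith(("buffer_load", "global_load")):
            counts["vmem_load"] += 1
    return dict(counts), meta


SAMPLE = """\
v_mfma_f32_32x32x16_bf16 a[0:15], v[0:3], v[4:7], a[0:15]
s_waitcnt vmcnt(0) lgkmcnt(0)
ds_read_b128 v[8:11], v12
buffer_store_dwordx4 v[0:3], v4, s[0:3], 0 offen
    .vgpr_count:     128
    .sgpr_count:     40
    .group_segment_fixed_size: 65536
"""

counts, meta = summarize_isa(SAMPLE)
```

Running the same summary over a FlyDSL dump and a native ISA dump gives a quick first diff before reading the listings line by line.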

## Why this matters

FlyDSL can already express many advanced CDNA concepts, but peak attention kernels need more control over the final ISA than typical compiler-generated GPU code provides. HipKittens/AITER/native ISA kernels achieve substantially higher performance on the same hardware, mainly through exact register layout and instruction scheduling.

If FlyDSL can expose a small set of expert-level controls, it could become a much stronger path for writing maintainable kernels that still approach handwritten ISA performance.

## Current workaround

The workaround is to leave FlyDSL and use native ISA codegen directly. That works for performance, but it loses many benefits of FlyDSL: Python-level composition, easier integration, MLIR-level structure, and better maintainability.

We would prefer to stay in FlyDSL if the framework can expose enough low-level control for expert AMD kernel tuning.

### Operating System

_No response_

### GPU

_No response_

### ROCm Component

_No response_
