Optimizations for Decode Attn fp8 kernel MI350 by mycpuorg · Pull Request #75 · ROCm/xformers

mycpuorg · 2025-09-19T19:40:16Z

Optimizations for Decode Attn fp8 kernel

  - Baseline (MI350 FP8 split‑K forward)
      - Runtime ~125 µs (BF16 path was 110 µs).
      - rocprof showed the Triton JIT emitted a 512‑VGPR kernel with 56 spills, so each SIMD carried only a single wavefront.
      - The inner loop kept every K/V fragment plus dequant buffers resident, and the HIP autotuner could pick even heavier configs (≥150 VGPR), cementing the one‑wave bottleneck.
  - Streaming K and On‑Demand V (xformers/ops/fmha/_triton/splitk_kernels.py)
      - Rewrote the FP8 branch so each quantization group loads K directly into the dot product and reloads V only when updating the accumulator.
      - Removed the persistent register lists for K/V, collapsing VGPR usage to 108 with zero spills; MI350 can now run two waves per SIMD.
      - Result: FP8 runtime dropped to ~105 µs, now beating BF16’s 110 µs.
  - HIP Autotune Guardrails (same file)
      - Constrained the HIP autotuner to tiles ≤64×64 and ≤4 warps, preventing Triton from revisiting the high‑VGPR plans.
      - Keeps every new launch in the low‑register regime that the streaming change made reachable.
  - Forced HIP FP8 Launch Parameters (xformers/ops/fmha/triton_splitk.py)
      - Added FwOp.force_kernel_config, which by default returns the measured best tuple (BLOCK_M=16, BLOCK_N=64, num_stages=2, num_warps=1) whenever FP8 scale/shift tensors are present.
      - Prevents the heuristics from drifting at runtime and locks in the ~105 µs profile.
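
For readers who want the shape of the "Streaming K and On‑Demand V" change without reading the Triton source, here is a rough pure-Python sketch of the idea, not the kernel itself. It assumes one quantization group per K/V row and an affine dequantization `x = q * scale + shift`; the function names and data layout are hypothetical. Each K row is dequantized and consumed by the dot product right away, and the matching V row is only materialized when the accumulator is updated, so no list of K/V fragments stays live across iterations.

```python
import math


def dequant(q_vals, scale, shift):
    # Assumed affine scheme: x = q * scale + shift (one group per row).
    return [q * scale + shift for q in q_vals]


def decode_attention_streaming(q, k_quant, v_quant, k_params, v_params):
    """Single-query attention that holds one K row and one V row at a time."""
    d = len(q)
    sm_scale = 1.0 / math.sqrt(d)
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * d
    for k_row, v_row, (ks, kz), (vs, vz) in zip(k_quant, v_quant, k_params, v_params):
        # K is dequantized and fed straight into the dot product.
        k = dequant(k_row, ks, kz)
        score = sm_scale * sum(qi * ki for qi, ki in zip(q, k))
        # Online-softmax rescaling of what has been accumulated so far.
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)
        p = math.exp(score - new_max)
        # V is dequantized only now, when the accumulator needs it.
        v = dequant(v_row, vs, vz)
        acc = [a * correction + p * vi for a, vi in zip(acc, v)]
        running_sum = running_sum * correction + p
        running_max = new_max
    return [a / running_sum for a in acc]


def decode_attention_reference(q, k_quant, v_quant, k_params, v_params):
    """Same result, but with every dequantized K/V row kept resident."""
    d = len(q)
    k = [dequant(r, s, z) for r, (s, z) in zip(k_quant, k_params)]
    v = [dequant(r, s, z) for r, (s, z) in zip(v_quant, v_params)]
    scores = [sum(a * b for a, b in zip(q, row)) / math.sqrt(d) for row in k]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [sum(wi * row[j] for wi, row in zip(w, v)) / total for j in range(d)]
```

The reference version mirrors the baseline's behaviour of keeping all fragments alive at once. The two functions should agree to floating-point tolerance; only the amount of state that is live at any one time differs.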
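
The guardrail and forced-config items can be pictured with a small standalone sketch. This is an illustration under stated assumptions, not the code in this PR: `KernelConfig`, `apply_hip_guardrails`, and `pick_launch_config` are hypothetical names, and only the limits (tiles ≤64×64, ≤4 warps) and the tuple (16, 64, 2, 1) come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class KernelConfig:
    # Hypothetical stand-in for one autotune candidate.
    block_m: int
    block_n: int
    num_stages: int
    num_warps: int


MAX_TILE = 64   # tiles limited to <=64x64 on HIP
MAX_WARPS = 4   # at most 4 warps on HIP
FP8_HIP_CONFIG = KernelConfig(block_m=16, block_n=64, num_stages=2, num_warps=1)


def apply_hip_guardrails(candidates: List[KernelConfig]) -> List[KernelConfig]:
    """Drop candidates that exceed the tile or warp limits."""
    return [
        c
        for c in candidates
        if c.block_m <= MAX_TILE and c.block_n <= MAX_TILE and c.num_warps <= MAX_WARPS
    ]


def pick_launch_config(
    candidates: List[KernelConfig],
    is_hip: bool,
    has_fp8_scale_shift: bool,
) -> Optional[KernelConfig]:
    """Return a fixed config for HIP FP8, or None to leave the choice to tuning."""
    if is_hip and has_fp8_scale_shift:
        return FP8_HIP_CONFIG
    return None
```

Filtering the candidate list first and pinning the FP8 case second keeps the two mechanisms independent: the filter still protects any path that is not pinned.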

mycpuorg · 2025-09-19T19:43:35Z

dup of #74

scxiao and others added 12 commits August 21, 2025 17:21

attn decode optimization

0ddeb13

update CK to tip of tree develop

46d716f

move CK to 26d33009306b0e77d3f51f071f8367f4c5bdf353

579d05f

move CK to 352f87e6841f04c83a86eeab6c9718a99f7aad84

582d522

move CK to b0a97498b0965d1b33cf90d117f9783989ef9ccb

6e3778c

move CK to 2622ff06cb2aabfd94df191083777b4caeb03966

9667e68

tune splitk for good perf of fp8 seq length 8193

39d5de1

update to fp8 mi350 format

428bd6a

change num_stages to 2 for fp8

0f5612a

turn on the HIP flag in calling the splitk kernel

d16d252

mycpuorg closed this Sep 19, 2025

mycpuorg wants to merge 12 commits into ROCm:develop from mycpuorg:manrao/decode_attn_fp8_sept_19
