Scheduling 2 large du by sebvince · Pull Request #28 · b-shi/rocm-libraries

sebvince · 2026-04-10T08:11:52Z

Support large DepthU (>256) for subtile-based scheduler with MXFP4 scales

Description:

Summary

Add subtileK dimension support to the subtile-based scheduler, enabling DepthU values larger than 2×MI_K (e.g., DU=512 for FP4 with MI_K=128). Previously the scheduler assumed localSubtileGrid[1]==1; now it iterates over multiple K-layers within each partition.
Fix scale double-buffering race condition: scale DTL loads are now emitted on the last GR of an MT iteration (instead of the first), so that all scale LR reads from the current bank complete before the DTL overwrites that bank.
Fix scale set swap logic: the needsScaleUnroll calculation now accounts for numSubtileK flips per partition (total flips = numPartitions × numSubtileK), not just partition count.
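
As a rough illustration of the first item, the sketch below enumerates K steps in partition → subtileK → localK order for the DU=512, MI_K=128 case. It is not taken from the PR: the relation `numSubtileK = DU / (subtileShapeK × MI_K)` with `subtileShapeK = 2` is an assumption inferred from the "2×MI_K" limit and from the DU=512 test using 2 subtileK layers, `num_partitions=2` is a made-up value, and all function and field names are hypothetical.

```python
from typing import Iterator, NamedTuple


class KStep(NamedTuple):
    partition: int
    subtile_k: int  # K-layer index, 0 <= subtile_k < num_subtile_k
    local_k: int    # intra-subtile K step, 0 <= local_k < subtile_shape_k
    flat_k: int     # assumed flat index: subtile_k * subtile_shape_k + local_k


def num_subtile_k_layers(depth_u: int, mi_k: int, subtile_shape_k: int = 2) -> int:
    """Assumed relation: one K-layer covers subtile_shape_k * mi_k elements of K."""
    layer_k = subtile_shape_k * mi_k
    if depth_u % layer_k != 0:
        raise ValueError(f"depth_u={depth_u} is not a multiple of {layer_k}")
    return depth_u // layer_k


def iter_k_steps(num_partitions: int, num_subtile_k: int,
                 subtile_shape_k: int) -> Iterator[KStep]:
    """Yield K steps with partition outermost and localK innermost."""
    for partition in range(num_partitions):
        for subtile_k in range(num_subtile_k):
            for local_k in range(subtile_shape_k):
                flat_k = subtile_k * subtile_shape_k + local_k
                yield KStep(partition, subtile_k, local_k, flat_k)


layers = num_subtile_k_layers(depth_u=512, mi_k=128)  # 512 // (2 * 128) == 2
steps = list(iter_k_steps(num_partitions=2, num_subtile_k=layers, subtile_shape_k=2))
for step in steps:
    print(step)
```

With `num_subtile_k == 1` the middle loop collapses to a single pass, which corresponds to the earlier `localSubtileGrid[1]==1` assumption.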

Key changes

SubtileBasedScheduler.py:

Rename numSubIterK → subtileShapeK (intra-subtile K steps) and introduce numSubtileK (K-layers from localSubtileGrid[1])
Add subtileK field to MFMAOp, GROp, LROp, SubIterKSchedule — all ops now track which K-layer they target
_buildSubIterK loop becomes subtileK × localK (flat K indexing for VGPR allocation)
_insertGROps emits one GR per (M-chunk, subtileK), spread across subIterK steps
_loadTargetsHalfPrefetch advances localK → subtileK → partition in sequence
Scale LR/MFMA emit use subtileK for correct LDS offset calculation (dsOffset = groupStride * (scaleGroupIdx * numSubtileK + subtileK))
emitGR passes subtileK to emitSingleBufferLoad; scale DTL moved from firstForMT to lastForMT
_emitLoop swaps scaleSet/scaleLRSet at subtileK boundaries
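
Two of the items above can be sketched in isolation. The first function evaluates the `dsOffset` formula exactly as quoted; the second shows one way to advance a load target in localK → subtileK → partition order. Both are standalone sketches, not the scheduler's code: the function names, the stride value of 64, and the wrap-around of the partition index are assumptions made for illustration.

```python
from typing import Tuple


def scale_ds_offset(group_stride: int, scale_group_idx: int,
                    num_subtile_k: int, subtile_k: int) -> int:
    """dsOffset = groupStride * (scaleGroupIdx * numSubtileK + subtileK)."""
    if not 0 <= subtile_k < num_subtile_k:
        raise ValueError("subtile_k out of range")
    return group_stride * (scale_group_idx * num_subtile_k + subtile_k)


def advance_load_target(partition: int, subtile_k: int, local_k: int,
                        num_partitions: int, num_subtile_k: int,
                        subtile_shape_k: int) -> Tuple[int, int, int]:
    """Odometer-style step: localK first, then subtileK, then partition.

    Wrapping the partition index back to 0 is an assumption of this sketch.
    """
    local_k += 1
    if local_k == subtile_shape_k:
        local_k = 0
        subtile_k += 1
        if subtile_k == num_subtile_k:
            subtile_k = 0
            partition = (partition + 1) % num_partitions
    return partition, subtile_k, local_k


# Hypothetical stride of 64 bytes, 2 scale groups, 2 K-layers.
offsets = [scale_ds_offset(64, g, 2, k) for g in range(2) for k in range(2)]
print(offsets)  # [0, 64, 128, 192]

print(advance_load_target(0, 0, 1, 2, 2, 2))  # (0, 1, 0)
print(advance_load_target(0, 1, 1, 2, 2, 2))  # (1, 0, 0)
```

Note that with `num_subtile_k == 1` the offset formula reduces to `group_stride * scale_group_idx`, so each scale group's K-layers occupy consecutive `group_stride`-sized slots only when more than one layer is present.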

SubtileBasedKernel.py:

scaleFlipsPerIter = numPartitions * numSubtileK replaces numPartitions for unroll/NLL set logic
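
The effect of counting flips as `numPartitions * numSubtileK` can be seen in a small parity sketch. The flip count comes from the description above; the rule that unrolling is needed when an iteration does not end on the scale set it started on (with two sets, an odd flip count) is an assumed reading of the unroll logic, not the kernel's actual condition, and the function names are hypothetical.

```python
def scale_flips_per_iter(num_partitions: int, num_subtile_k: int) -> int:
    """Scale-set flips in one loop iteration, as stated in the PR description."""
    return num_partitions * num_subtile_k


def scale_set_after(start_set: int, flips: int, num_scale_sets: int = 2) -> int:
    """Scale set that is active after the given number of flips."""
    return (start_set + flips) % num_scale_sets


def needs_scale_unroll(num_partitions: int, num_subtile_k: int,
                       num_scale_sets: int = 2) -> bool:
    """Assumed rule: unroll when an iteration ends on a different set than it began."""
    flips = scale_flips_per_iter(num_partitions, num_subtile_k)
    return scale_set_after(0, flips, num_scale_sets) != 0


# Hypothetical case: 3 partitions. Counting partitions alone gives an odd flip
# count, but with 2 K-layers per partition the true count is 6, which is even.
print(needs_scale_unroll(num_partitions=3, num_subtile_k=1))  # True
print(needs_scale_unroll(num_partitions=3, num_subtile_k=2))  # False
```

Under this reading, using the partition count alone would give the wrong answer whenever `numSubtileK` is even and `numPartitions` is odd.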

Test YAML files:

subtile_bf16.yaml / subtile_mxfp4.yaml: add PrefetchGlobalRead: [0, 2] across all groups to test PGR2 scheduler path; add DU=512 group for MXFP4

Unit tests:

Add test_PGR2_128_128_DU512_fp4_schedule for large DU with 2 subtileK layers
Update all expected schedule outputs for new subtileK field in op labels
Fix gr_inc expected dependency counts
Update exact instruction sequence expectations

Test plan

subtile_bf16.yaml: all pass (PGR=0 and PGR=2)
subtile_mxfp4.yaml: all pass (PGR=0 and PGR=2) except non-aligned M/N sizes, which fail due to a pre-existing issue
Unit tests: pytest test_SubtileBasedScheduler.py

sebvince added 11 commits April 9, 2026 08:41

Initial support for DU>256

deedf54

Renaming

a13c43a

add option to do DU=512 in the tests

499c677

blocked K-major for scale

7f43f05

Change scaleSet swap logic

cbf0224

Update print functions

71b17bd

Put scales after values for avoid race conditions

dc095f7

Fix tests

34f1cd9

more test

cd7fe13

tweak printschedule display

2dd0e34

Add PGR2 in the yaml tests

9e3dc98

sebvince wants to merge 11 commits into b-shi:subtile_mx from sebvince:scheduling_2_large_du
