Scheduling 2 large du #28

Draft
sebvince wants to merge 11 commits into b-shi:subtile_mx from sebvince:scheduling_2_large_du

sebvince (Collaborator) commented Apr 10, 2026
Support large DepthU (>256) for subtile-based scheduler with MXFP4 scales

Description:

Summary

  • Add subtileK dimension support to the subtile-based scheduler, enabling DepthU values larger than 2×MI_K (e.g., DU=512 for FP4 with MI_K=128). Previously the scheduler assumed localSubtileGrid[1]==1; now it iterates over multiple K-layers within each partition.
  • Fix scale double-buffering race condition: scale DTL loads are now emitted on the last GR of an MT iteration (instead of the first), so that all scale LR reads from the current bank complete before the DTL overwrites that bank.
  • Fix scale set swap logic: the needsScaleUnroll calculation now accounts for numSubtileK flips per partition (total flips = numPartitions × numSubtileK), not just partition count.

Key changes

SubtileBasedScheduler.py:

  • Rename numSubIterK → subtileShapeK (intra-subtile K steps) and introduce numSubtileK (K-layers from localSubtileGrid[1])
  • Add subtileK field to MFMAOp, GROp, LROp, SubIterKSchedule — all ops now track which K-layer they target
  • _buildSubIterK loop becomes subtileK × localK (flat K indexing for VGPR allocation)
  • _insertGROps emits one GR per (M-chunk, subtileK), spread across subIterK steps
  • _loadTargetsHalfPrefetch advances localK → subtileK → partition in sequence
  • Scale LR/MFMA emission uses subtileK for correct LDS offset calculation (dsOffset = groupStride * (scaleGroupIdx * numSubtileK + subtileK))
  • emitGR passes subtileK to emitSingleBufferLoad; scale DTL moved from firstForMT to lastForMT
  • _emitLoop swaps scaleSet/scaleLRSet at subtileK boundaries
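
The indexing described in the list above can be illustrated with a small standalone sketch. This is not the code from SubtileBasedScheduler.py: the function names (`flat_k_index`, `advance_load_target`, `scale_ds_offset`) and the example sizes are hypothetical. Only the dsOffset expression and the localK → subtileK → partition order come from the description. The flat index formula and the wrap-around of the partition counter are assumptions.

```python
def flat_k_index(subtile_k, local_k, subtile_shape_k):
    """Flat K index for a (subtileK, localK) pair.

    Assumes subtileK is the outer loop and localK the inner loop,
    with subtile_shape_k K steps inside each subtile.
    """
    return subtile_k * subtile_shape_k + local_k


def advance_load_target(partition, subtile_k, local_k,
                        num_partitions, num_subtile_k, subtile_shape_k):
    """Step to the next target: localK first, then subtileK, then partition.

    Wrapping the partition back to 0 is an assumption made for this sketch.
    """
    local_k += 1
    if local_k == subtile_shape_k:
        local_k = 0
        subtile_k += 1
        if subtile_k == num_subtile_k:
            subtile_k = 0
            partition = (partition + 1) % num_partitions
    return partition, subtile_k, local_k


def scale_ds_offset(group_stride, scale_group_idx, num_subtile_k, subtile_k):
    """dsOffset = groupStride * (scaleGroupIdx * numSubtileK + subtileK)."""
    return group_stride * (scale_group_idx * num_subtile_k + subtile_k)


# Example sizes are assumptions: 2 K-layers, 2 K steps per subtile.
order = [flat_k_index(sk, lk, 2) for sk in range(2) for lk in range(2)]
print(order)                         # [0, 1, 2, 3]
print(scale_ds_offset(64, 1, 2, 1))  # 64 * (1 * 2 + 1) = 192
```

With numSubtileK = 1 the dsOffset expression reduces to groupStride * scaleGroupIdx, which matches the earlier single-layer assumption (localSubtileGrid[1] == 1).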

SubtileBasedKernel.py:

  • scaleFlipsPerIter = numPartitions * numSubtileK replaces numPartitions for unroll/NLL set logic
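
To see why the flip count matters, here is a rough sketch of the bookkeeping for a double-buffered scale set. The exact condition behind needsScaleUnroll is not shown in this description, so the parity rule below is an assumption used for illustration, and all function names are hypothetical. Only the product numPartitions * numSubtileK is taken from the change itself.

```python
def scale_flips_per_iter(num_partitions, num_subtile_k):
    """Total scale-set flips in one loop iteration, per the PR description."""
    return num_partitions * num_subtile_k


def starting_scale_sets(num_partitions, num_subtile_k, num_iters, num_sets=2):
    """Scale set active at the start of each iteration (assumes set 0 first)."""
    flips = scale_flips_per_iter(num_partitions, num_subtile_k)
    return [(i * flips) % num_sets for i in range(num_iters)]


def start_set_alternates(num_partitions, num_subtile_k, num_sets=2):
    """True when consecutive iterations start on different scale sets.

    Assumption: this is the situation in which separate loop bodies
    (an unroll) would be needed to keep set selection static.
    """
    return scale_flips_per_iter(num_partitions, num_subtile_k) % num_sets != 0


# One partition, two K-layers: two flips per iteration, so every
# iteration starts on the same set.
print(starting_scale_sets(1, 2, 3))   # [0, 0, 0]
# Counting partitions only (one flip) would predict alternating sets.
print(starting_scale_sets(1, 1, 3))   # [0, 1, 0]
```

Under these assumptions, counting only partitions gives the wrong parity whenever numSubtileK is even, which is consistent with the set swap fix listed in the summary.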

Test YAML files:

  • subtile_bf16.yaml / subtile_mxfp4.yaml: add PrefetchGlobalRead: [0, 2] across all groups to test PGR2 scheduler path; add DU=512 group for MXFP4

Unit tests:

  • Add test_PGR2_128_128_DU512_fp4_schedule for large DU with 2 subtileK layers
  • Update all expected schedule outputs for new subtileK field in op labels
  • Fix gr_inc expected dependency counts
  • Update exact instruction sequence expectations

Test plan

  • subtile_bf16.yaml: all pass (PGR=0 and PGR=2)
  • subtile_mxfp4.yaml: all pass (PGR=0 and PGR=2), except non-aligned M/N sizes, which fail due to a pre-existing issue
  • Unit tests: pytest test_SubtileBasedScheduler.py
