[Hipblaslt] Subtile/TDM (gfx1250): load dedup, multi-partition LDS fix, WMMA matrix-A reuse #7

Draft
sebvince wants to merge 8 commits into schedmode from test_tdm

@sebvince

Subtile/TDM (gfx1250): load dedup, multi-partition LDS fix, WMMA matrix-A reuse

Changes

  • Remove duplicated TDM loads: global loads were split only across each tensor's own wave axis, so waves sharing an axis id re-issued identical tensor_load_to_lds instructions. Loads are now split across all waves at all three coordinated sites: global address, LDS write offset, and descriptor tile1 extent.
  • Fix multi-partition double-buffer LDS collision: GR(MT n+2) reuses the LDS buffer read by MT n, but the conflict check was scoped per partition. Collisions can span partitions, so the check now runs in flat execution order across partitions and also compares the tile-id range. Adds regression tests.
  • Add matrix_a_reuse/matrix_b_reuse WMMA hints to rocisa: new reuseA/reuseB flags on the MFMA/WMMA emitters append the gfx1250 reuse modifiers; both default to false.
  • Enable matrix-A reuse on gfx1250: a setMatrixAReuse post-pass sets reuseA where safe (identical consecutive WMMA instructions), operating on the final post-schedule order.
  • Arch-dependent ds_read→waitcnt gap: the gap is now 8 on gfx1250 and 4 elsewhere.
  • Tests: add large macro-tile cases to subtile_bf16_gfx1250.yaml.
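
To illustrate the matrix-A reuse item above, here is a minimal sketch of what such a post-pass could look like. It is not the code in this PR. The `Wmma` record, the `set_matrix_a_reuse` helper, and the operand names are hypothetical. The sketch also assumes that "safe" means the instruction immediately following in the final order is another WMMA reading the same A operand, and that the hint goes on the earlier of the two instructions. Both are assumptions, not confirmed by the description.

```python
from dataclasses import dataclass


@dataclass
class Wmma:
    """Hypothetical stand-in for a scheduled WMMA instruction."""
    dst: str
    a: str
    b: str
    reuse_a: bool = False  # mirrors the "default false" behaviour of the new flag


def set_matrix_a_reuse(insts):
    """Flag a WMMA when the very next instruction is a WMMA with the same A operand.

    Runs on the final instruction order, so any non-WMMA instruction that the
    scheduler placed in between (modelled here as a plain string) blocks the hint.
    """
    for cur, nxt in zip(insts, insts[1:]):
        if isinstance(cur, Wmma) and isinstance(nxt, Wmma) and cur.a == nxt.a:
            cur.reuse_a = True
    return insts


schedule = [
    Wmma("c0", "a0", "b0"),
    Wmma("c1", "a0", "b1"),
    "ds_read",               # interleaved non-WMMA instruction
    Wmma("c2", "a0", "b2"),
    Wmma("c3", "a1", "b2"),
]
set_matrix_a_reuse(schedule)
flags = [i.reuse_a for i in schedule if isinstance(i, Wmma)]
print(flags)  # [True, False, False, False]
```

Only the first WMMA is flagged: the second is followed by a non-WMMA instruction, and the third is followed by a WMMA that reads a different A operand. Running the pass after scheduling matters because the adjacency it depends on only exists in the final order.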

sebvince added 7 commits June 12, 2026 14:25
The subtile TDM path split each tensor's global load only across its own
wave axis (mt/numWavesThisAxis), so waves sharing an axis id re-issued
bit-identical tensor_load_to_lds (same global source + same LDS dest).

Split across the full wave count (mt/numWaves) instead, in all three
coordinated sites: global address, LDS write offset, and descriptor
tile1 extent. The global->LDS mapping stays identity, so the local-read
consumer is unchanged; cross-wave LDS visibility is covered by the
existing WaitGROp(has_sync) barrier before local reads.
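
The difference between the two splits can be shown with a small model. This is a sketch under stated assumptions, not the generator code: it assumes a 2x2 wave grid, a tensor tiled along the M wave axis, and hypothetical helpers `per_axis_split`, `full_wave_split`, and `duplicate_count`. Each load is modelled as a `(global_row, lds_row, extent)` triple, with the global-to-LDS mapping kept as identity.

```python
from collections import Counter


def per_axis_split(mt, waves_m, waves_n):
    """Old scheme: split mt only across the tensor's own wave axis (M here)."""
    rows = mt // waves_m                     # mt / numWavesThisAxis
    loads = []
    for wm in range(waves_m):
        for wn in range(waves_n):            # wn does not affect the slice
            start = wm * rows
            loads.append((start, start, rows))
    return loads


def full_wave_split(mt, waves_m, waves_n):
    """New scheme: split mt across every wave in the workgroup."""
    num_waves = waves_m * waves_n
    rows = mt // num_waves                   # mt / numWaves
    loads = []
    for wave_id in range(num_waves):
        start = wave_id * rows               # same offset for global and LDS
        loads.append((start, start, rows))
    return loads


def duplicate_count(loads):
    """Number of loads that repeat an already-issued (source, dest, extent)."""
    return sum(n - 1 for n in Counter(loads).values())


def covered_rows(loads):
    """Set of macro-tile rows written to LDS by the given loads."""
    return {r for start, _, extent in loads for r in range(start, start + extent)}


old = per_axis_split(256, 2, 2)
new = full_wave_split(256, 2, 2)
print(duplicate_count(old), duplicate_count(new))  # 2 0
```

Both schemes write the same 256 rows to LDS, but the full-wave split issues four distinct quarter-size loads instead of two pairs of identical half-size loads. Because each wave now fills only part of the buffer, consumers depend on the cross-wave barrier mentioned above before reading.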