[Hipblaslt] Subtile/TDM (gfx1250): load dedup, multi-partition LDS fix, WMMA matrix-A reuse by sebvince · Pull Request #7 · sebvince/rocm-libraries

sebvince · 2026-06-15T10:30:22Z

Subtile/TDM (gfx1250): load dedup, multi-partition LDS fix, WMMA matrix-A reuse

Changes

Remove duplicated TDM loads: global loads were split only across each tensor's own wave axis, so waves sharing an axis id re-issued identical tensor_load_to_lds instructions. Loads are now split across all waves, at all three coordinated sites (global address, LDS write offset, descriptor tile1 extent).
Fix multi-partition double-buffer LDS collision: GR(MT n+2) reuses the LDS buffer read by MT n, but the conflict check was scoped per partition. Collisions can span partitions, so the check now runs in flat execution order across partitions and also compares the tile-id range. Regression tests are added.
Add matrix_a_reuse/matrix_b_reuse WMMA hints to rocisa: new reuseA/reuseB flags on the MFMA/WMMA emitters append the gfx1250 reuse modifiers; both default to false.
Enable matrix-A reuse on gfx1250: the setMatrixAReuse post-pass sets reuseA where safe (identical consecutive WMMA), operating on the final post-schedule order.
Arch-dependent ds_read→waitcnt gap: the gap is now 8 on gfx1250 and 4 elsewhere.
Tests: add large macro-tile cases to subtile_bf16_gfx1250.yaml.
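To illustrate why a per-partition conflict check can miss a collision, here is a minimal sketch of the double-buffer rule described above. It is not the code from this PR: the `Op` record, `NUM_BUFFERS`, and both function names are hypothetical, and the tile-id range comparison is simplified to a single tile id per operation.

```python
from dataclasses import dataclass

NUM_BUFFERS = 2  # double buffering: tile n and tile n+2 share an LDS buffer


@dataclass(frozen=True)
class Op:
    partition: int  # scheduling partition the op belongs to (hypothetical model)
    kind: str       # "GR" writes an LDS buffer, "LR" reads one
    tile: int       # macro-tile id (MT n)


def collisions(ops):
    """Scan ops in flat execution order and return (write, read) pairs where
    a GR overwrites a buffer that a later LR of an older tile still reads."""
    found = []
    for i, write in enumerate(ops):
        if write.kind != "GR":
            continue
        for read in ops[i + 1:]:
            same_buffer = read.tile % NUM_BUFFERS == write.tile % NUM_BUFFERS
            if read.kind == "LR" and read.tile < write.tile and same_buffer:
                found.append((write, read))
    return found


def collisions_per_partition(ops):
    """Same rule, but applied to each partition in isolation."""
    found = []
    for part in sorted({op.partition for op in ops}):
        found += collisions([op for op in ops if op.partition == part])
    return found


# GR(MT 2) in partition 0 is scheduled before LR(MT 0) in partition 1.
bad = [Op(0, "GR", 2), Op(1, "LR", 0)]
# Same ops with the read first: no hazard.
good = [Op(1, "LR", 0), Op(0, "GR", 2)]

print(len(collisions(bad)), len(collisions_per_partition(bad)), len(collisions(good)))
```

In the `bad` schedule the write and the read sit in different partitions, so only the flat-order scan reports the hazard.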
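The matrix-A reuse post-pass can be pictured with the sketch below. It is an assumption-laden model, not the actual setMatrixAReuse implementation: the `Inst` record, the opcode strings, and the exact safety condition (the next instruction in the final order is a WMMA with the same opcode and the same A registers) are illustrative guesses at what "identical consecutive WMMA" means.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Inst:
    opcode: str           # placeholder mnemonic, used only for matching
    a_regs: tuple = ()    # register range holding matrix A (hypothetical field)
    reuse_a: bool = False


def set_matrix_a_reuse(schedule):
    """Return a copy of the post-schedule instruction list with reuse_a set on
    every WMMA that is immediately followed by a WMMA reading the same A."""
    out = list(schedule)
    for i in range(len(out) - 1):
        cur, nxt = out[i], out[i + 1]
        if (
            cur.opcode.startswith("v_wmma")
            and nxt.opcode == cur.opcode
            and nxt.a_regs == cur.a_regs
        ):
            out[i] = replace(cur, reuse_a=True)
    return out


WMMA = "v_wmma_f32_16x16x32_bf16"  # placeholder opcode string
schedule = [
    Inst(WMMA, a_regs=(0, 7)),
    Inst(WMMA, a_regs=(0, 7)),
    Inst("ds_load_b128"),          # any non-WMMA instruction breaks the run
    Inst(WMMA, a_regs=(0, 7)),
    Inst(WMMA, a_regs=(8, 15)),    # different A operand
]
flags = [inst.reuse_a for inst in set_matrix_a_reuse(schedule)]
print(flags)
```

Running the pass on the final order matters in this model: if the scheduler later moved an instruction between two WMMAs, a hint set earlier would no longer describe adjacent instructions.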

The subtile TDM path split each tensor's global load only across its own wave axis (mt/numWavesThisAxis), so waves sharing an axis id re-issued bit-identical tensor_load_to_lds (same global source + same LDS dest). Split across the full wave count (mt/numWaves) instead, in all three coordinated sites: global address, LDS write offset, and descriptor tile1 extent. The global->LDS mapping stays identity, so the local-read consumer is unchanged; cross-wave LDS visibility is covered by the existing WaitGROp(has_sync) barrier before local reads.
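The difference between the two splits can be sketched as follows. This is a simplified model rather than the kernel generator's code: the wave-grid layout, the `wave % waves_this_axis` mapping from wave id to axis id, and the function names are assumptions, and the macro-tile extent is taken to be divisible by the wave count. Each load is represented as `(offset, extent)`, where the offset stands for both the global address and the LDS write offset (identity mapping) and the extent stands for the descriptor tile extent.

```python
def split_per_axis(mt, waves_this_axis, num_waves):
    """Old behaviour: each wave loads mt / numWavesThisAxis elements,
    indexed only by its id along this tensor's wave axis."""
    extent = mt // waves_this_axis
    return [((wave % waves_this_axis) * extent, extent) for wave in range(num_waves)]


def split_all_waves(mt, num_waves):
    """New behaviour: each wave loads mt / numWaves elements,
    indexed by its flat wave id."""
    extent = mt // num_waves
    return [(wave * extent, extent) for wave in range(num_waves)]


def duplicate_loads(loads):
    """Number of loads that repeat an (offset, extent) already issued."""
    return len(loads) - len(set(loads))


# 2x2 wave grid, macro-tile extent 256 along this tensor's axis.
old = split_per_axis(mt=256, waves_this_axis=2, num_waves=4)
new = split_all_waves(mt=256, num_waves=4)

print(old, duplicate_loads(old))
print(new, duplicate_loads(new))
```

In this model the old split issues four loads of 128 elements, two of which are repeats, while the new split issues four disjoint loads of 64 elements that together still cover the full extent.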

sebvince added 7 commits June 12, 2026 14:25

Fix multi-partition double LDS buffer colision

e0ec2d1

Increase DS_READ_TO_WAIT delay (To be Removed)

ba54503

Add reuse flas to rocisa

349798e

Post process to set reuse MatA

0e0ad86

Add large MTs to the tests

d91416c

Conditional MIN_MFMA_GAP_DS_READ_TO_WAIT (Arch dep)

44f61e1

github-actions Bot added project: hipblaslt project: hipsparselt ci:hipsparselt-fast labels Jun 15, 2026

Update some comments

725207e


sebvince wants to merge 8 commits into schedmode from test_tdm
