Fix regression on MI300 caused by Async LDS optimization for MI350#156

Open
aryaman-gupta wants to merge 2 commits into abokovoi/async-lds-inference-opt from aryaman/async-lds-mi300-fix

No description provided.

aryaman-gupta and others added 2 commits May 13, 2026 06:57
cp_async_zfill_cg is asynchronous on Ampere+ and gfx950 but synchronous
everywhere else. Inlining the synchronous fallback into the per-iteration
row-load loop breaks load pipelining: each store depends on its own load,
so the loop needs N waitcnts instead of one. It also adds wave divergence
on warps with mixed row validity. Measured up to 19% lower bandwidth on
MI300 (gfx942) for weighted L=20/L=50.

Wrap the row-store section in an #if that matches the helper's dispatch:
gfx950/Ampere+ keep the fused cp_async_zfill_cg loop; every other target
gets the original two-loop pattern (load all rows, then masked store).
The helper and the gfx950 paths are untouched.
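
The dispatch described above can be sketched on the host in plain C++. This is a minimal illustration under stated assumptions, not the PR's actual diff: `SKETCH_HAS_ASYNC_COPY`, `copy_or_zfill`, `store_rows`, `kRows`, and `kWidth` are hypothetical names, the real `#if` condition is not shown on this page, and reading row 0 for invalid entries is an assumed way to keep the load loop branch-free.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the helper's dispatch condition
// (async copy available on the target or not).
#ifndef SKETCH_HAS_ASYNC_COPY
#define SKETCH_HAS_ASYNC_COPY 0
#endif

constexpr std::size_t kRows = 4;   // rows handled per iteration (assumed)
constexpr std::size_t kWidth = 8;  // elements per row (assumed)
using Row = std::array<float, kWidth>;

// Host-side model of a copy-with-zero-fill helper: copy the row when it is
// valid, otherwise write zeros. On a real GPU the async variant would not
// block the issuing thread.
inline void copy_or_zfill(Row& dst, const Row& src, bool valid) {
    dst = valid ? src : Row{};
}

inline void store_rows(std::array<Row, kRows>& dst,
                       const std::vector<Row>& table,
                       const std::array<std::size_t, kRows>& idx,
                       const std::array<bool, kRows>& valid) {
    assert(!table.empty());
#if SKETCH_HAS_ASYNC_COPY
    // Fused pattern: one copy-or-zero-fill per row. Cheap when the copy is
    // asynchronous, because no iteration waits on its own load.
    for (std::size_t i = 0; i < kRows; ++i) {
        copy_or_zfill(dst[i], table[valid[i] ? idx[i] : 0], valid[i]);
    }
#else
    // Two-loop pattern: issue every load first so they can overlap...
    std::array<Row, kRows> regs{};
    for (std::size_t i = 0; i < kRows; ++i) {
        regs[i] = table[valid[i] ? idx[i] : 0];  // invalid rows read row 0
    }
    // ...then do the masked store once all loads are in flight or complete.
    for (std::size_t i = 0; i < kRows; ++i) {
        dst[i] = valid[i] ? regs[i] : Row{};
    }
#endif
}
```

Both branches produce the same destination contents; the sketch only shows where the load-to-store dependency sits in each pattern, and it cannot reproduce the bandwidth difference reported above.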

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>