Bypass LDS for scale B operand for skinny gemms by plognjen · Pull Request #817 · ROCm/triton

plognjen · 2025-05-29T15:20:48Z

Skip LDS for the scale B tensor when warpsPerCTA is {1, numWarps} and
the load layout matches the expected layout for scale B in the dotScaled op.
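
As a rough illustration of the condition described above, the following is a minimal, self-contained sketch and not the code from this PR. `ToyLayout` and `canBypassLdsForScaleB` are hypothetical names: the real pass compares `LinearLayout` objects, which are stood in for here by a plain vector of integers.

```cpp
#include <array>
#include <vector>

// Hypothetical stand-in for a distributed layout. The real pass compares
// mlir::triton::LinearLayout objects; here a layout is just a list of ints.
using ToyLayout = std::vector<int>;

// Returns true when the scale B load may skip LDS under the two conditions
// from the PR description:
//   1. warpsPerCTA is {1, numWarps}
//   2. the load layout already equals the layout dotScaled expects for scale B
inline bool canBypassLdsForScaleB(const std::array<unsigned, 2> &warpsPerCTA,
                                  unsigned numWarps,
                                  const ToyLayout &loadLayout,
                                  const ToyLayout &expectedScaleLayout) {
  const bool warpsAreOneByN =
      warpsPerCTA[0] == 1 && warpsPerCTA[1] == numWarps;
  return warpsAreOneByN && loadLayout == expectedScaleLayout;
}
```

With 4 warps, `{1, 4}` and identical layouts pass the check, while `{2, 2}` or a mismatched layout falls back to the LDS path.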

guacamoleo · 2025-05-30T20:12:45Z

+          mlir::triton::LinearLayout scaleBLayout =
+              mlir::triton::gpu::toLinearLayout(scaleBTy.getShape(),
+                                                scaleBTy.getEncoding());
+          bypassLDS = bypassLDS ||


What is this doing here? Is it checking if bypassing LDS succeeded?

zhanglx13 · 2025-06-02

I think @plognjen wanted to restore the previous condition, i.e. bypass LDS when width < 32.
If so, maybe we can store the value of (width < 32) in a separate variable rather than in bypassLDS, to avoid confusion.

plognjen · 2025-06-06

Yes, this was to restore the previous condition. I will change the name.
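
One way the renaming could look, as a sketch only. `LoadFacts`, `isSubDwordLoad`, and `shouldBypassLds` are hypothetical names, and the second operand of the `||` is assumed to be the scale B layout check from the PR description, since the quoted diff is truncated at that point.

```cpp
// Hypothetical summary of what the pass knows about one load.
struct LoadFacts {
  unsigned widthBits;        // max contiguous bits readable per thread
  bool warpsAreOneByN;       // warpsPerCTA == {1, numWarps}
  bool layoutMatchesScaleB;  // load layout equals the expected scale B layout
};

// The "previous condition": loads narrower than a dword stay in registers.
inline bool isSubDwordLoad(unsigned widthBits) { return widthBits < 32; }

// Keeps the two reasons for bypassing LDS in separately named values, so
// neither is mistaken for the final decision.
inline bool shouldBypassLds(const LoadFacts &facts) {
  const bool subDword = isSubDwordLoad(facts.widthBits);
  const bool scaleBDirect = facts.warpsAreOneByN && facts.layoutMatchesScaleB;
  return subDword || scaleBDirect;
}
```

Naming each reason separately makes it clear that `bypassLDS` is the combined decision rather than a record of whether an earlier bypass succeeded.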

guacamoleo · 2025-05-30T20:15:33Z

      loadInfo.usedByDot = true;
      // If the max continugous bits we can read is < 32, buffer in registers.
-      if (width >= 32) {
+      bool bypassLDS = width < 32;


So we're only bypassing LDS when we're loading less than a dword, such as with buffer_load_short or buffer_load_ushort?
Are there other cases when bypass LDS could be beneficial? If so, let's add a comment reminding us of those additional scenarios.

zhanglx13 · 2025-06-02

Due to preshuffling, width is guaranteed to be >= 32, so it's confusing to enable bypassLDS only when width < 32.
More generally, bypassLDS should not check width at all. Later, the pass checks whether the loaded layout is the same as the scale layout, and that check already ensures width = 32.
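
To make the width arithmetic concrete, here is a small sketch under two assumptions that the thread does not state: that width is the number of contiguous elements a thread can read multiplied by the element bit width, and that scale elements are 8 bits wide. The function names are hypothetical.

```cpp
// Assumption: width = contiguous elements per thread * element bit width.
constexpr unsigned maxContiguousBits(unsigned contiguousElems,
                                     unsigned elemBits) {
  return contiguousElems * elemBits;
}

// A load is at least one dword wide when it covers 32 bits or more.
constexpr bool isAtLeastDword(unsigned widthBits) { return widthBits >= 32; }

// Four contiguous 8-bit scales fill exactly one dword.
static_assert(maxContiguousBits(4, 8) == 32, "4 x 8-bit = one dword");
// Two contiguous 8-bit scales would be a 16-bit (sub-dword) load.
static_assert(!isAtLeastDword(maxContiguousBits(2, 8)), "2 x 8-bit < dword");
```

Under these assumptions, a preshuffled scale tensor with at least four contiguous 8-bit elements per thread never takes the `width < 32` branch, which is why the width check alone cannot enable the bypass for scale B.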

Bypass LDS for scale B operand for skinny gemms

522e8e3

plognjen marked this pull request as ready for review May 29, 2025 15:21

plognjen requested review from antiagainst and zhanglx13 as code owners May 29, 2025 15:21

guacamoleo reviewed May 30, 2025

oplavsic added 3 commits June 6, 2025 14:01

Change block layout of scale B load

92f130c

Remove coalesceOp function

9331d6b

Fix remove layout conversion pass canonicalizer

f5a1263

plognjen wants to merge 4 commits into shared/triton-gfx950-launch from shared/bypassLDS
