[DRAFT] Shared/aggregate load by alefimov-amd · Pull Request #804 · ROCm/triton

alefimov-amd · 2025-05-21T15:01:31Z

Introduces the AggregateLoad pass, which aggregates multiple small loads inside a loop into one wide load and moves it to an outer loop:

for i in range(0, 32):
  val = global_load: tensor<8x8>

transforms to:

for i in range(0, 2):
  wide_load = global_load: tensor<8x128>
  smem = local_alloc(wide_load)
  for j in range(0, 16):
    view = subview(smem)
    val = local_load(view): tensor<8x8>

If all iterations can be aggregated, the outer loop is not created.
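The rewrite described above can be modeled in plain Python to check that both loop forms produce the same tiles. This is only an illustrative sketch, not the pass itself: the function names, the list-of-rows stand-in for global memory, and the `factor` parameter are assumptions made for the example, and the sketch assumes the factor divides the trip count evenly.

```python
def small_loads(global_mem, num_iters, tile_k):
    """Original loop: one small load of `tile_k` columns per iteration."""
    return [[row[i * tile_k:(i + 1) * tile_k] for row in global_mem]
            for i in range(num_iters)]


def aggregated_loads(global_mem, num_iters, tile_k, factor):
    """Transformed loop: one wide load per outer iteration, followed by
    `factor` small reads from a staging buffer that stands in for shared memory."""
    assert num_iters % factor == 0, "sketch assumes factor divides the trip count"
    wide_k = tile_k * factor
    tiles = []
    for outer in range(num_iters // factor):
        # wide global load + local_alloc: copy an 8 x wide_k slab into "smem"
        smem = [row[outer * wide_k:(outer + 1) * wide_k] for row in global_mem]
        for inner in range(factor):
            # subview + local_load: read one 8 x tile_k tile out of the slab
            tiles.append([row[inner * tile_k:(inner + 1) * tile_k] for row in smem])
    return tiles


# 8 rows x 256 columns, matching 32 iterations of 8x8 tiles.
mem = [[r * 256 + c for c in range(256)] for r in range(8)]
reference = small_loads(mem, 32, 8)
two_outer_iters = aggregated_loads(mem, 32, 8, 16)   # 2 wide loads of 8x128
no_outer_loop = aggregated_loads(mem, 32, 8, 32)     # everything aggregated at once
```

With a factor of 16 the model issues two wide loads instead of 32 small ones; with a factor of 32 the outer loop runs once, which corresponds to the case where no outer loop is created.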

For now, this pass is applied only to the scale parameters of the scaled_dot operation.

IR is ready, need to debug correctness issue

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

- Generalize implementation to handle case where shared memory cannot hoist entire operand A (taking into account size of operand B).
- Fix stream pipeliner prologue to hoist out of nested loop by setting location to src pointer.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

- remove auxiliary env variables from compiler.py
- add TRITON_HIP_AGGREGATE_LOAD_FACTOR env variable in addition to kernel option
- remove auxiliary mlir files and scripts for debug
- remove limitation on warp==1 in scaled dot
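As a rough illustration of how a factor supplied through either the kernel option or the TRITON_HIP_AGGREGATE_LOAD_FACTOR environment variable might be resolved, here is a small sketch. The helper names, the precedence (kernel option first), and the default of 0 are assumptions made for this example and are not taken from the PR; only the environment variable name and the "skip pass if factor is 0 or 1" rule come from the commit messages.

```python
import os

ENV_VAR = "TRITON_HIP_AGGREGATE_LOAD_FACTOR"  # name taken from the commit message


def resolve_aggregate_factor(kernel_option=None, environ=None):
    """Hypothetical helper: prefer the kernel option, else fall back to the env variable.

    The precedence and the default of 0 are assumptions for this sketch.
    """
    environ = os.environ if environ is None else environ
    if kernel_option is not None:
        return int(kernel_option)
    return int(environ.get(ENV_VAR, "0"))


def pass_enabled(factor):
    # Mirrors the commit "skip pass if factor is 0 or 1".
    return factor > 1


factor_from_env = resolve_aggregate_factor(environ={ENV_VAR: "16"})
factor_from_option = resolve_aggregate_factor(kernel_option=4, environ={ENV_VAR: "16"})
factor_default = resolve_aggregate_factor(environ={})
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.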

alefimov-amd requested review from antiagainst and zhanglx13 as code owners May 21, 2025 15:01

alefimov-amd changed the title ~~Shared/aggregate load~~ [DRAFT] Shared/aggregate load May 21, 2025

alefimov-amd marked this pull request as draft May 21, 2025 15:03

alefimov-amd force-pushed the shared/aggregate_load branch from 347992c to 6c36ed1 Compare May 22, 2025 20:13

antiagainst force-pushed the shared/triton-gfx950-launch branch from 77c00fa to a259f0a Compare May 26, 2025 17:58

zhanglx13 and others added 24 commits May 29, 2025 20:04

[AMD] Add AggregateLoad pass

f5ddf28

IR is ready, need to debug correctness issue

Fix a bug regarding buffer offset

26e7a86

Fix smemBase computation

4612c43

Support mask along M dim when load opA

3f2b3bd

Add README of aggregate load pass

5a8a69b

Create standalone MOE example

f50fcaf

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

Fix aggregate load pass

cdacf12

debug scaffolds

01fd3c7

hoisting A in dotOp worked

d3045e4

add scaled moe kernel

ee772e6

hoisting scale load works

534d1bf

cleanup

a2600c5

add benchmarking#1

36240c3

support cvt; optimize swizzling

dcc253c

observe perf improvement on larger block

f0d1e0b

better support for weird K dim by padding LDS

afcd887

Fix moe_kernel.py to make sanity check and benchmark share parameters

25ebdaa

add reproducer

b0737aa

add more examples

eff53dc

fix repr2

080a0a8

process strided K dim

644ea5a

add m=1 case

5392e76

support 1d pointer case

63e1f3a

binarman added 25 commits May 29, 2025 20:04

add option for aggregate factor

c597136

skip pass if factor is 0 or 1

94b793e

add minimal lit tests

d705936

fix

99b45aa

add reproducer to numerical issue

ed25c72

take into account K strides when building outer loop

74c361a

fix

e15cb8d

fix

5ea9cc5

try coalesce aggregated load

9ead4c8

post rebase fix

6d07de8

add reproducer for warp=1

e1beab1

enable warp=1 in accelerate amd matmul

9531baa

remove limitation on outer loop

bc67839

fix option application

10d8978

add negative lit test checks un%aggregate_factor == 0

1832fcb

cleanup:

9c2525b

- remove auxiliary env variables from compiler.py
- add TRITON_HIP_AGGREGATE_LOAD_FACTOR env variable in addition to kernel option
- remove auxiliary mlir files and scripts for debug
- remove limitation on warp==1 in scaled dot

add pytest tests for warp = 1

271890c

support dynamic strides

696acb1

fix block offset computation

6f14849

fix order of arguments

36c76a8

do not call coalesce pass if aggregate load is disabled

00d9541

general fix

9707d3d

fix env variable control

046b37f

support async loads in aggregate_load

d5c70eb

temporary disable some checks

5e47bef

alefimov-amd force-pushed the shared/aggregate_load branch from c65837e to 5e47bef Compare May 29, 2025 21:35

binarman added 2 commits June 3, 2025 18:40

support preshuffle kernel

f7fe25c

add lit test for preshuffle kernel

2d9a15e

alefimov-amd wants to merge 51 commits into shared/triton-gfx950-launch from shared/aggregate_load

5 participants
