[AMD] Improve Scheduling for Async BF16 GEMM #812

Merged
raikonenfnu merged 1 commit into shared/triton-gfx950-launch from raikonenfnu/updatedBetterAsyncBF16Schedule
May 28, 2025

@raikonenfnu
Member

  • Use a single AsyncWait (1030 -> 1070)
  • Move the local load before the global load to hide latency (1076)
  • Move the slice local load(3) into the cluster before dot(3) (1080.5)
  • Replace clusterBarrier with schedBarrier + s_barrier + schedBarrier (1086)
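
For readers unfamiliar with this kind of reordering, here is a toy sketch of two of the changes listed above: hoisting the local load ahead of the global load, and expanding a cluster barrier into schedBarrier + s_barrier + schedBarrier. The list-of-strings "schedule", the op names, and both helper functions are illustrative assumptions. They are not the Triton AMD backend's data structures or API, and the sketch says nothing about the measured numbers in parentheses.

```python
# Toy model only: the op names and the list-based "schedule" are hypothetical
# and exist purely to illustrate the reordering described in the PR text.

def hoist_before(schedule, op, anchor):
    """Return a copy of `schedule` with `op` moved to sit right before `anchor`.

    If `op` already precedes `anchor`, the order is left unchanged.
    """
    i, j = schedule.index(op), schedule.index(anchor)
    out = list(schedule)
    if i < j:
        return out
    out.pop(i)          # remove op from its later position
    out.insert(j, op)   # re-insert it immediately before the anchor
    return out


def expand_cluster_barrier(schedule):
    """Replace each 'cluster_barrier' with sched_barrier + s_barrier + sched_barrier."""
    out = []
    for op in schedule:
        if op == "cluster_barrier":
            out.extend(["sched_barrier", "s_barrier", "sched_barrier"])
        else:
            out.append(op)
    return out


before = ["global_load", "local_load", "cluster_barrier", "dot"]
after = expand_cluster_barrier(hoist_before(before, "local_load", "global_load"))
print(after)
```

In this model the scheduling barriers on either side of `s_barrier` stand for fences that the instruction scheduler should not move ops across. Whether that matches the backend's exact semantics should be checked against the actual diff.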


Signed-off-by: Stanley Winata <stanley.winata@amd.com>
@raikonenfnu raikonenfnu changed the base branch from main to shared/triton-gfx950-launch May 27, 2025 14:29
@raikonenfnu
Member Author

Closing the original PR #802 in favor of this one, since I wanted to keep a checkpoint on the old branch.

@jungpark-mlir jungpark-mlir left a comment

LGTM!

@raikonenfnu raikonenfnu merged commit 1c1ea34 into shared/triton-gfx950-launch May 28, 2025
3 of 16 checks passed