MLAS/POWER10: Optimize Sgemm PackA kernel using VSX intrinsics and assembly. by BODAPATIMAHESH · Pull Request #27575 · microsoft/onnxruntime

BODAPATIMAHESH · 2026-03-06T05:20:36Z

Description

Introduce an optimized POWER10 PackA implementation leveraging VSX builtins and assembly to pre-pack 8 rows of matrix A, packing 64 bytes per row per iteration.
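
To make the packing step concrete, here is a minimal scalar sketch of what pre-packing a block of rows of A could look like. 64 bytes per row is 16 single-precision floats, so each main-loop iteration consumes a 16-column slice of every row. The function name `PackAReference`, the k-major interleaved output layout, and the tail handling are assumptions for illustration; the PR's actual layout and its VSX/assembly implementation are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference, not the PR's kernel.
// Packs `Rows` rows of row-major A (leading dimension lda) into Pa so that
// the Rows values that share one k index are contiguous:
//   Pa[k * Rows + r] = A[r * lda + k]
// This layout is an assumption chosen for illustration.
constexpr std::size_t kFloatsPerStep = 64 / sizeof(float);  // 64 bytes per row per step

void PackAReference(const float* A, std::size_t lda, std::size_t Rows,
                    std::size_t CountK, float* Pa)
{
    assert(Rows == 4 || Rows == 8);
    std::size_t k = 0;
    // Main loop: one 64-byte slice of every row per iteration.
    for (; k + kFloatsPerStep <= CountK; k += kFloatsPerStep) {
        for (std::size_t kk = 0; kk < kFloatsPerStep; ++kk) {
            for (std::size_t r = 0; r < Rows; ++r) {
                Pa[(k + kk) * Rows + r] = A[r * lda + k + kk];
            }
        }
    }
    // Tail: remaining columns when CountK is not a multiple of the step.
    for (; k < CountK; ++k) {
        for (std::size_t r = 0; r < Rows; ++r) {
            Pa[k * Rows + r] = A[r * lda + k];
        }
    }
}
```

A vectorized version would replace the inner loops with vector loads and permutes, but the index mapping above is what any such version would have to reproduce under this assumed layout.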

Motivation and Context

Performance improvements observed in prompt processing:

14% speedup (batch size 1)
6% speedup (batch size 4)
4% speedup (batch size 8)

Tested with granite-3.1-8b

BODAPATIMAHESH · 2026-03-10T08:08:45Z

Could you review this PR, @hariharans29?

hariharans29 · 2026-03-10T17:15:21Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-03-10T17:15:43Z

Azure Pipelines successfully started running 4 pipeline(s).

Copilot

Pull request overview

This PR improves POWER10 SGEMM (single-precision GEMM) performance by introducing an explicit PackA stage and updating the MMA compute kernel to consume the packed-A layout, with an optimized assembly PackA implementation on non-AIX platforms.

Changes:

Add a new POWER10 PackA implementation (C++ fallback + optional assembly fast-path) and route SGEMM through it for the MMA kernel paths.
Refactor MlasSgemmMMAProcessCount to consume packed A (Pa) instead of reading A directly with lda.
Update the MLAS CMake configuration to enable ASM and build the new .S file on non-AIX POWER10 builds.
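
The difference between reading A directly with `lda` and consuming a packed buffer can be sketched in scalar C++. This is only an illustration under stated assumptions: the function names are hypothetical, B is taken as plain row-major, and the packed layout `Pa[k * Rows + r]` is assumed rather than taken from the PR. The real `MlasSgemmMMAProcessCount` uses MMA builtins and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration only; names and layouts are assumptions.
// Direct form: reads A in place, striding by lda to move between rows.
void GemmBlockDirect(const float* A, std::size_t lda,
                     const float* B, std::size_t ldb,
                     float* C, std::size_t ldc,
                     std::size_t Rows, std::size_t CountK, std::size_t CountN)
{
    for (std::size_t r = 0; r < Rows; ++r) {
        for (std::size_t n = 0; n < CountN; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < CountK; ++k) {
                acc += A[r * lda + k] * B[k * ldb + n];
            }
            C[r * ldc + n] = acc;
        }
    }
}

// Packed form: Pa[k * Rows + r] holds A[r][k], so the kernel needs no lda
// and reads the Rows values for each k from consecutive addresses.
void GemmBlockPacked(const float* Pa,
                     const float* B, std::size_t ldb,
                     float* C, std::size_t ldc,
                     std::size_t Rows, std::size_t CountK, std::size_t CountN)
{
    assert(Rows <= 8);
    float acc[8];
    for (std::size_t n = 0; n < CountN; ++n) {
        for (std::size_t r = 0; r < Rows; ++r) acc[r] = 0.0f;
        for (std::size_t k = 0; k < CountK; ++k) {
            const float* a = Pa + k * Rows;  // contiguous column of the block
            const float b = B[k * ldb + n];
            for (std::size_t r = 0; r < Rows; ++r) acc[r] += a[r] * b;
        }
        for (std::size_t r = 0; r < Rows; ++r) C[r * ldc + n] = acc[r];
    }
}
```

Both functions accumulate over k in the same order, so the packed form changes only the memory access pattern, not the arithmetic.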

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `onnxruntime/core/mlas/lib/power/SgemmKernelPOWER10.cpp` | Switch MMA kernel to packed-A input; add C++ PackA routine, prefetching, and assembly PackA hook. |
| `onnxruntime/core/mlas/lib/power/SgemmKernelPackA.S` | New POWER10 assembly kernel to pack A efficiently for 4- or 8-row blocks. |
| `onnxruntime/core/mlas/lib/power/asmmacro.h` | New shared assembly macro header providing a function entry macro. |
| `cmake/onnxruntime_mlas.cmake` | Enable ASM for POWER10 and conditionally compile the new PackA assembly source (non-AIX). |

hariharans29 · 2026-03-10T19:25:27Z

Can you please address Copilot's comments?

BODAPATIMAHESH · 2026-03-11T09:15:07Z

> Can you please address Copilot's comments?

Thanks, @hariharans29. I have addressed Copilot's comments. Please review.

hariharans29 · 2026-03-11T21:29:46Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-03-11T21:30:05Z

Azure Pipelines successfully started running 4 pipeline(s).

hariharans29 · 2026-03-11T22:10:58Z

Please wait for #27618 and then rebase to include it; I think that will fix the failing build.

…sembly

Introduce an optimized POWER10 PackA implementation leveraging VSX builtins and assembly to pre-pack 8 rows of matrix A, packing 64 bytes per row per iteration.

Performance improvements observed in prompt processing:
- 14% speedup (batch size 1)
- 6% speedup (batch size 4)
- 4% speedup (batch size 8)

Tested with granite-3.1-8b

Signed-off-by: Mahesh Bodapati <bmahi496@linux.ibm.com>

1. Removed the memset, which is unnecessary for CountM == 8.
2. Replaced CountM with explicit literals 8 and 4 in the PackAKernelPOWER10 calls; purely a readability fix, no behavioral change.
3. Updated the header comment of SgemmKernelPackA.S.
4. Updated the PackAKernelPOWER10 declaration.
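
Items 1 and 2 can be illustrated with a small dispatch sketch. It assumes, hypothetically, that an 8-row block is used when at least 8 rows remain and a 4-row block otherwise, and that packing a full block overwrites every element of the packed buffer, which is why zero-filling it beforehand would be redundant. `PackBlock` and `PackRows` are illustrative names and do not reflect the real `PackAKernelPOWER10` signature.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a packing kernel (not the PR's code).
// Assumed layout: Pa[k * Rows + r] = A[r * lda + k].
void PackBlock(const float* A, std::size_t lda, std::size_t Rows,
               std::size_t CountK, float* Pa)
{
    assert(Rows == 4 || Rows == 8);
    for (std::size_t k = 0; k < CountK; ++k) {
        for (std::size_t r = 0; r < Rows; ++r) {
            Pa[k * Rows + r] = A[r * lda + k];  // every slot is written once
        }
    }
}

// Returns the number of rows packed. Each call site passes a literal row
// count, so the block size is visible where the call is made.
std::size_t PackRows(const float* A, std::size_t lda, std::size_t CountM,
                     std::size_t CountK, float* Pa)
{
    if (CountM >= 8) {
        PackBlock(A, lda, 8, CountK, Pa);  // no memset: all 8 * CountK slots are overwritten
        return 8;
    }
    if (CountM >= 4) {
        PackBlock(A, lda, 4, CountK, Pa);
        return 4;
    }
    return 0;  // assumption: fewer than 4 rows are handled by another path
}
```

Because the packed buffer holds exactly `Rows * CountK` elements and the loops visit each index once, no element is left uninitialized after a full block is packed.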

BODAPATIMAHESH · 2026-03-12T05:43:27Z

> Please wait for #27618 and then rebase to include it; I think that will fix the failing build.

Thanks. I have rebased my branch.

hariharans29 · 2026-03-12T18:05:35Z

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2026-03-12T18:05:57Z

Azure Pipelines successfully started running 4 pipeline(s).

BODAPATIMAHESH · 2026-03-17T09:56:59Z

@hariharans29 Thanks. I’d like to understand whether backporting patches to past releases is allowed. If so, could you please clarify what kinds of changes are eligible for backporting and what the process looks like?

hariharans29 · 2026-03-17T15:27:10Z

> @hariharans29 Thanks. I’d like to understand whether backporting patches to past releases is allowed. If so, could you please clarify what kinds of changes are eligible for backporting and what the process looks like?

Generally, backporting to an existing release is not allowed. We take commits only when we plan a new patch release on top of an existing release, and the bar for a patch release is very high.

hariharans29 requested a review from Copilot March 10, 2026 17:14

Copilot started reviewing on behalf of hariharans29 March 10, 2026 17:16

Copilot AI reviewed Mar 10, 2026

Comment thread onnxruntime/core/mlas/lib/power/SgemmKernelPOWER10.cpp

Comment thread onnxruntime/core/mlas/lib/power/SgemmKernelPOWER10.cpp

Comment thread onnxruntime/core/mlas/lib/power/SgemmKernelPackA.S Outdated

Comment thread onnxruntime/core/mlas/lib/power/SgemmKernelPOWER10.cpp Outdated

hariharans29 approved these changes Mar 11, 2026

hariharans29 enabled auto-merge (squash) March 11, 2026 21:30

BODAPATIMAHESH added 2 commits March 12, 2026 00:26

auto-merge was automatically disabled March 12, 2026 05:40
Head branch was pushed to by a user without write access

BODAPATIMAHESH force-pushed the main_Sgemm_PackA branch from 9edea2c to a7b3d36 March 12, 2026 05:40

hariharans29 approved these changes Mar 12, 2026

hariharans29 enabled auto-merge (squash) March 12, 2026 18:06

hariharans29 merged commit 5274c19 into microsoft:main Mar 12, 2026
89 checks passed

BrewTestBot mentioned this pull request Apr 20, 2026

onnxruntime 1.25.0 Homebrew/homebrew-core#278543

Merged


3 participants
