MLAS/POWER10: Optimize Sgemm PackA kernel using VSX intrinsics and assembly. #27575

Open
BODAPATIMAHESH wants to merge 1 commit into microsoft:main from BODAPATIMAHESH:main_Sgemm_PackA

@BODAPATIMAHESH (Contributor)

Description

Introduce an optimized POWER10 PackA implementation that uses VSX builtins and assembly to pre-pack 8 rows of matrix A at a time, packing 64 bytes (16 single-precision floats) per row per iteration.
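
For readers unfamiliar with A-panel packing, the following is a portable scalar sketch of the idea, not the code in this PR. The function name `PackA8RowsReference`, the constants, and in particular the packed layout (for each column, the 8 row values stored contiguously) are assumptions made for illustration; the actual kernel uses VSX builtins and assembly and may lay the data out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constants for illustration; these are not MLAS identifiers.
constexpr std::size_t kPackRows = 8;        // rows of A packed together
constexpr std::size_t kFloatsPerStep = 16;  // 64 bytes / sizeof(float)

// Packs an 8 x k_count row-major panel of A (leading dimension lda) into
// `packed`. Assumed layout: for each column k, the 8 row values are stored
// contiguously, so packed[k * 8 + r] == a[r * lda + k].
void PackA8RowsReference(const float* a, std::size_t lda,
                         std::size_t k_count, float* packed) {
    assert(lda >= k_count);
    std::size_t k = 0;

    // Main loop: consume 16 floats (64 bytes) from each of the 8 rows
    // per iteration. A vectorized kernel would load these with vector
    // loads and transpose in registers instead of copying element-wise.
    for (; k + kFloatsPerStep <= k_count; k += kFloatsPerStep) {
        for (std::size_t kk = 0; kk < kFloatsPerStep; ++kk) {
            for (std::size_t r = 0; r < kPackRows; ++r) {
                packed[(k + kk) * kPackRows + r] = a[r * lda + k + kk];
            }
        }
    }

    // Tail: remaining columns when k_count is not a multiple of 16.
    for (; k < k_count; ++k) {
        for (std::size_t r = 0; r < kPackRows; ++r) {
            packed[k * kPackRows + r] = a[r * lda + k];
        }
    }
}
```

Packing A this way lets the GEMM inner loop read the 8 row values for a given column from consecutive memory, which is the usual motivation for a PackA step.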

Motivation and Context

Performance improvements observed in prompt processing:

  • 14% speedup (batch size 1)
  • 6% speedup (batch size 4)
  • 4% speedup (batch size 8)

Tested with granite-3.1-8b

Signed-off-by: Mahesh Bodapati <bmahi496@linux.ibm.com>