arm64 SME2 SMOPA int8 matmul kernel (sme_qmmm_i32_32x32) by czoli1976 · Pull Request #2279 · sonos/tract

czoli1976 · 2026-05-24T20:47:05Z

Adds sme_qmmm_i32_32x32, a 32×32 int8 → i32 matmul kernel using SME2 SMOPA (i8 outer-product). On compute-bound int8 GEMM it runs roughly 1.5–5× as fast as the SDOT kernel (#2278) on Apple M4.

Stacked on #2278 → #2277 (3 commits; SMOPA is the last). It needs the PackedI8K4 packing + lowering from #2278 and the dispatch fix from #2277 — review/merge those first.

What's here

sme_qmmm_i32_32x32: 32×32 SMOPA i8→i32 kernel, runtime-gated on has_sme2(). Bit-exact int8 quant fuse ops (q_scale / rounding-shift / shift-left) via spill-ZA→scratch→reload; only LeakyRelu is unsupported. Same PackedI8K4 (K=4-inner) packing as #2278's SDOT kernel.
sme.rs: selects it for qmmm_i32 when SME2 is present (over SDOT/SMLAL). Assembles + gates off cleanly on non-SME2 arm64.
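For readers unfamiliar with SMOPA, the scalar model below sketches what a 4-way int8 outer-product accumulate computes when fed K=4-inner packed panels. It is a plain-Rust reference model, not the kernel's assembly: the function names `outer_product_k4` and `matmul_packed_k4` and the row-major accumulator layout are illustrative assumptions, not tract APIs.

```rust
/// Scalar model of one 4-way i8 outer-product-accumulate step (hypothetical helper).
/// `a` holds `mr` rows x 4 consecutive k values, `b` holds `nr` columns x 4.
/// Each accumulator cell receives the 4-element dot product of its row and column.
fn outer_product_k4(acc: &mut [i32], a: &[i8], b: &[i8], mr: usize, nr: usize) {
    assert_eq!(a.len(), mr * 4);
    assert_eq!(b.len(), nr * 4);
    assert_eq!(acc.len(), mr * nr);
    for m in 0..mr {
        for n in 0..nr {
            let mut dot = 0i32;
            for i in 0..4 {
                // Widen to i32 before multiplying so the product cannot overflow.
                dot += a[m * 4 + i] as i32 * b[n * 4 + i] as i32;
            }
            acc[m * nr + n] += dot;
        }
    }
}

/// Reference i8 -> i32 matmul over K=4-inner packed panels (hypothetical helper).
/// Both panels use the layout out[(k/4)*r*4 + m*4 + (k%4)]; `k` must be a multiple of 4.
fn matmul_packed_k4(a: &[i8], b: &[i8], mr: usize, nr: usize, k: usize) -> Vec<i32> {
    assert_eq!(k % 4, 0);
    let mut acc = vec![0i32; mr * nr];
    for g in 0..k / 4 {
        // One group of 4 k values is contiguous for every row / column.
        let a_grp = &a[g * mr * 4..(g + 1) * mr * 4];
        let b_grp = &b[g * nr * 4..(g + 1) * nr * 4];
        outer_product_k4(&mut acc, a_grp, b_grp, mr, nr);
    }
    acc
}
```

Calling `matmul_packed_k4` with `mr = nr = 32` models the tile shape of the kernel; a model like this can serve as a bit-exact oracle when comparing kernels, since integer accumulation in i32 has no rounding.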

Validation

sme_qmmm: 114/114 on M4 (SME2, SVL=512), bit-exact vs the NEON kernels. tract-linalg 3931/3931 · tract-core 244/244 · cargo fmt --check clean · no new clippy.
Non-SME2 arm64 (M1): kernel present but runtime-gated off; build + regression green.

Performance (Apple M4, single MatMulInteger, vs #2278's SDOT)

GEMM	SDOT	SMOPA	Speedup
1024×1024×1024	4.67 ms	0.95 ms	4.9×
512×512×512	0.70 ms	0.17 ms	4.0×
128×768×3072	1.80 ms	1.17 ms	1.54×
32×2048×2048	1.33 ms	0.90 ms	1.47×
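To put the table in throughput terms, the sketch below converts a GEMM shape and a wall time into tera-ops per second, counting one multiply and one add per multiply-accumulate. The helper names are hypothetical, and because the table's times are rounded, ratios recomputed from them can differ slightly from the speedups listed.

```rust
/// Integer operations in a GEMM with the given three dimensions:
/// one multiply plus one add per multiply-accumulate (hypothetical helper).
fn gemm_ops(dims: [u64; 3]) -> u64 {
    2 * dims[0] * dims[1] * dims[2]
}

/// Effective throughput in tera-ops per second for a run of `millis` milliseconds.
fn tera_ops_per_sec(dims: [u64; 3], millis: f64) -> f64 {
    gemm_ops(dims) as f64 / (millis * 1e-3) / 1e12
}

/// Speedup of a new timing over a baseline timing (same unit for both).
fn speedup(baseline_ms: f64, new_ms: f64) -> f64 {
    baseline_ms / new_ms
}
```

For the 1024×1024×1024 row, `tera_ops_per_sec([1024, 1024, 1024], 0.95)` comes out at about 2.26, and `speedup(4.67, 0.95)` at about 4.9. The product of the three dimensions does not depend on their order, so the result holds whichever of M, K, N each column denotes.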

Honest scope: a wash on small / overhead-bound matmuls (MiniLM/InceptionV1 at seq=128, where SDOT already saturates); the win is on compute-bound int8 GEMM (large batch, big hidden dims, LLM prompt-processing). A real-model e2e demo on a larger compute-bound int8 model is a planned follow-up.

EinSumMatMul kernel selection (strategize) routed every 2D int8 matmul on arm64 to arm64simd_mmm_i32_64x1 -- a GEMV kernel (nr=1) -- instead of a square matrix kernel.

Quant matmul (i8 in, i32 acc) skips the ops().mmm() fast path (operating_dt i32 != input i8) and hit two tie-breaks that do not distinguish a GEMV kernel from a matrix kernel:
- concrete n>1: max_by_key((pe.is_none(), nr*mr)) -- i8 candidates 64x1/8x8 all tie at nr*mr=64, so the last (64x1) won.
- symbolic n (dynamic shapes): i8 kernels use distinct packings, so no group forms a (GEMV + matrix) VecVsMat pair; the fallback ordered by mmm.mr and 64x1 (mr=64) won.

f32/f16/block-quant are unaffected (each has a packing group forming a proper GEMV+matrix pair, so the first tie-break key already decides).

Fix: demote nr==1 in the concrete branch; prefer nr>1 matrix kernels in the symbolic branch; filter list_impls by is_supported_here() (added to the MatMatMul trait) so platform-gated kernels are not selected on a CPU that would trap.

Bit-exact vs the previous kernel (concrete + dynamic int8 models). core 244/244, linalg 3703/3703. On Apple M4, choosing 8x8 over 64x1 gives ~1.1x e2e on MiniLM/InceptionV1; the fix also unblocks faster int8 kernels (e.g. FEAT_DotProd SDOT) being selected for 2D matmuls, where the gain is ~2x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
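The concrete-n tie-break described in this commit message can be reproduced in a few lines. The sketch below is a simplified model, not tract's strategize code: the `Kernel` struct and both picker functions are hypothetical, and the `pe.is_none()` component of the real key is omitted. It relies on the documented behaviour of Rust's `Iterator::max_by_key`, which returns the last element when several are equally maximal.

```rust
/// Minimal stand-in for a matmul kernel candidate (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Kernel {
    name: &'static str,
    mr: usize,
    nr: usize,
}

/// Pre-fix model: rank by tile area only.
/// 8x8 and 64x1 both score 64, and max_by_key keeps the last maximal element.
fn pick_by_area(candidates: &[Kernel]) -> Kernel {
    *candidates.iter().max_by_key(|k| k.mr * k.nr).unwrap()
}

/// Post-fix model: demote GEMV kernels (nr == 1) before comparing area.
/// Tuples compare lexicographically and `true > false`, so any nr > 1 kernel wins.
fn pick_demoting_gemv(candidates: &[Kernel]) -> Kernel {
    *candidates
        .iter()
        .max_by_key(|k| (k.nr > 1, k.mr * k.nr))
        .unwrap()
}
```

With the candidate list `[8x8, 64x1]`, `pick_by_area` returns the 64x1 GEMV kernel while `pick_demoting_gemv` returns the 8x8 matrix kernel, which mirrors the before/after behaviour the commit describes for concrete n>1.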

Adds arm64simd_mmm_i32_8x8_dot, an int8->i32 8x8 matmul kernel using SDOT (FEAT_DotProd, ARMv8.2) -- ~4x the SMLAL 8x8 kernel at the matmul level -- plus the K=4-inner PackedI8K4 packing it consumes and the matmul/conv lowering to route PackedI8K4 through OptMatMulPack, einsum and conv-im2col.

Builds on the dispatch fix (sonos#2277): without it, 2D int8 matmuls select the 64x1 GEMV kernel and this kernel is never chosen; with it, SDOT is selected.

- arm64simd_mmm_i32_8x8_dot: gated on has_dotprod() (TRACT_DOTPROD_DISABLE=1 forces the SMLAL fallback). Same v16..v31 tile layout as the SMLAL 8x8, so it reuses the i32 fuse/store/q_scale machinery.
- PackedI8K4: K=4-inner packing out[(k/4)*r*4 + m*4 + (k%4)], k_alignment=4; one-pass KOut4Writer + cache-friendly pack_view.
- Lowering: OptMatMulPack / einsum / conv-im2col generalized from PackedFormat to Box<dyn MMMInputFormat>; QSumB reads the K=4 layout. PackedFormat paths stay byte-identical (monomorphic dispatch).

Bit-exact vs the SMLAL kernel. core 244/244, linalg 3817/3817. Apple M4 e2e vs the SMLAL 8x8 the dispatch fix selects: MiniLM 44.4->24.8 ms (1.79x), InceptionV1 51.6->28.4 ms (1.82x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
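The K=4-inner index formula quoted in this commit message, out[(k/4)*r*4 + m*4 + (k%4)], can be illustrated with a naive packer. This is a sketch only: `pack_k4` is a hypothetical name, the row-major source layout is assumed, and zero-padding k up to a multiple of 4 is an assumption about how k_alignment=4 is satisfied (zeros are neutral for dot products). It is not the one-pass KOut4Writer from the PR.

```rust
/// Pack a row-major `r x k` i8 panel into K=4-inner layout (hypothetical helper):
/// element (m, kk) lands at out[(kk / 4) * r * 4 + m * 4 + (kk % 4)].
fn pack_k4(src: &[i8], r: usize, k: usize) -> Vec<i8> {
    assert_eq!(src.len(), r * k);
    // Round k up to the next multiple of 4; the tail stays zero (assumed padding).
    let k_padded = (k + 3) / 4 * 4;
    let mut out = vec![0i8; r * k_padded];
    for m in 0..r {
        for kk in 0..k {
            out[(kk / 4) * r * 4 + m * 4 + (kk % 4)] = src[m * k + kk];
        }
    }
    out
}
```

After packing, the four k values a 4-way dot-product instruction needs for one row sit in adjacent bytes, and all `r` rows of one k-group are contiguous, so a kernel can stream through the buffer group by group.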

Adds sme_qmmm_i32_32x32, a 32x32 int8->i32 matmul kernel using SME2 SMOPA (i8 outer-product), selected for qmmm_i32 when FEAT_SME2 is present. Consumes the same K=4-inner PackedI8K4 packing as the SDOT kernel and implements the int8 quant fuse ops (q_scale / rounding-shift / shift-left) bit-exactly (spill-ZA->scratch->reload); only LeakyRelu is unsupported.

Builds on the SDOT kernel (sonos#2278) and dispatch fix (sonos#2277): needs PackedI8K4 plus the matmul/conv lowering from sonos#2278.

sme_qmmm 114/114 on M4 (SME2, SVL=512), bit-exact vs the NEON kernels. core 244/244, linalg 3931/3931. Assembles + gates off cleanly on non-SME2 arm64 (kernel present, runtime-gated; M1 build + regression green).

Apple M4 e2e, single MatMulInteger, vs the SDOT kernel (sonos#2278): 1024^3 4.67->0.95 ms (4.9x), 512^3 0.70->0.17 ms (4.0x), 128x768x3072 1.80->1.17 ms (1.54x), 32x2048x2048 1.33->0.90 ms (1.47x). Wash on small/overhead-bound matmuls (MiniLM/InceptionV1 seq=128); the win is compute-bound int8 GEMM (large batch/hidden, LLM prompt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kali · 2026-05-26T06:30:00Z

Converting to draft while we're dealing with dependent PRs.

czoli1976 · 2026-05-27T09:27:42Z

Do you have A78AE in prod already?

kali · 2026-05-27T09:42:28Z

Just in the lab. It is live again, catching up on historical benches (starting from the oldest, scheduling is not the smartest). ETA one day or two for benches.

czoli1976 and others added 3 commits May 24, 2026 20:38

kali marked this pull request as draft May 26, 2026 06:29

czoli1976 wants to merge 3 commits into sonos:main from czoli1976:feat/int8-sme-smopa-kernel
