Fix int8 matmul kernel selection picking the 64x1 GEMV for 2D matmuls by czoli1976 · Pull Request #2277 · sonos/tract

czoli1976 · 2026-05-24T19:39:06Z

Problem

einsum::kernel_selection::strategize selects arm64simd_mmm_i32_64x1 — a GEMV kernel (nr=1) — for every 2D int8 matmul on arm64, instead of a square matrix kernel. The GEMV kernel processes one output column per pass.

Quantized matmul (i8 inputs, i32 accumulator) skips the ops().mmm() fast path (it requires operating_dt == input dt, here i32 ≠ i8) and reaches two tie-breaks that don't distinguish a GEMV kernel from a matrix kernel:

- concrete n > 1: the key is max_by_key((pe.is_none(), nr*mr)). The i8 candidates 64x1 and 8x8 both have nr*mr == 64, and max_by_key returns the last of several equal maxima, so 64x1 wins.
- symbolic n (dynamic shapes, the common ONNX-export case): the i8 kernels use distinct packings, so no group forms a (GEMV + matrix) VecVsMat pair; the fallback then orders by mmm.mr, and 64x1 (mr=64) wins.
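
The tie in the concrete branch can be reproduced in isolation. The sketch below is not tract's code: `Candidate`, `pick_old` and `i8_candidates` are hypothetical names, and the candidate order (GEMV listed last) is assumed from the description above. It only shows that `Iterator::max_by_key` keeps the last of several equal maxima.

```rust
/// Hypothetical stand-in for a kernel candidate; not tract's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Candidate {
    name: &'static str,
    mr: usize,
    nr: usize,
    /// Mirrors `pe.is_none()` in the key quoted above (assumed true for both i8 kernels).
    pe_is_none: bool,
}

/// Selects with the pre-fix key `(pe.is_none(), nr * mr)`.
fn pick_old(candidates: &[Candidate]) -> Option<Candidate> {
    // Iterator::max_by_key returns the last element when several are equally maximal.
    candidates
        .iter()
        .copied()
        .max_by_key(|c| (c.pe_is_none, c.nr * c.mr))
}

/// The two i8 candidates named in the PR, with the GEMV listed last (assumed order).
fn i8_candidates() -> Vec<Candidate> {
    vec![
        Candidate { name: "8x8", mr: 8, nr: 8, pe_is_none: true },
        Candidate { name: "64x1", mr: 64, nr: 1, pe_is_none: true },
    ]
}
```

With this order `pick_old(&i8_candidates())` returns the `64x1` entry; reversing the list makes it return `8x8`, which shows the old choice depended only on list order.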

f32/f16/block-quant are unaffected — each has a packing group that forms a proper GEMV+matrix pair (e.g. f32 32x1/32x3, Apple AMX 32x1/32x32), so the first tie-break key already selects correctly.

Fix

- concrete n>1: demote nr == 1 so a square tile wins the tie.
- symbolic n: prefer a group whose matrix-role kernel is a real matrix (nr > 1).
- filter list_impls by is_supported_here() (added to the MatMatMul trait) so platform-gated kernels are never selected on a CPU that would trap. Previously such kernels were candidates and were avoided only because 64x1 won the tie.
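
A minimal sketch of the concrete-n part of the fix together with the support filter, under stated assumptions: `Candidate`, `pick_matrix_kernel`, `example_candidates` and the `8x8_gated` kernel are hypothetical, the boolean field stands in for whatever `is_supported_here()` returns, and the key order `(mr * nr, nr > 1)` is one way to let a matrix tile win a tie, not necessarily the key used in the patch.

```rust
/// Hypothetical stand-in for a kernel candidate; not tract's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Candidate {
    name: &'static str,
    mr: usize,
    nr: usize,
    /// Stand-in for the result of `is_supported_here()` on the running CPU.
    supported_here: bool,
}

/// Picks a kernel for a matmul with concrete n > 1.
fn pick_matrix_kernel(candidates: &[Candidate]) -> Option<Candidate> {
    candidates
        .iter()
        .copied()
        // Drop kernels the current CPU cannot execute before ranking.
        .filter(|c| c.supported_here)
        // Rank by tile area first; on equal area, nr > 1 (true) beats nr == 1 (false).
        .max_by_key(|c| (c.mr * c.nr, c.nr > 1))
}

/// Three equal-area candidates; `8x8_gated` is an invented platform-gated kernel.
fn example_candidates(gated_supported: bool) -> Vec<Candidate> {
    vec![
        Candidate { name: "8x8", mr: 8, nr: 8, supported_here: true },
        Candidate { name: "8x8_gated", mr: 8, nr: 8, supported_here: gated_supported },
        Candidate { name: "64x1", mr: 64, nr: 1, supported_here: true },
    ]
}
```

With `example_candidates(false)` the function returns `8x8`. Without the filter, the demotion alone would let `8x8_gated` win the tie among matrix kernels, which is why the two changes go together.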

Validation

- Bit-exact output vs the previous kernel (concrete and dynamic-shape int8 models, max|diff| = 0).
- cargo test -p tract-core: 244/244. cargo test -p tract-linalg: 3703/3703.
- f32 dynamic-n verified unchanged (still AMX 32x1/32x32).
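
Bit-exactness is the expected outcome here because i32 accumulation of i8 products is exact, so the order in which a kernel visits the tile does not change the result (barring overflow). The sketch below illustrates the kind of max|diff| check described above on two reference loops; it is not the validation harness used for this PR, and all function names are hypothetical.

```rust
/// Reference int8 matmul, row-major: C[m x n] = A[m x k] * B[k x n], i32 accumulators.
fn matmul_i8(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
        }
    }
    c
}

/// Same product computed one output column at a time, in the spirit of a GEMV kernel.
fn matmul_i8_by_columns(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for j in 0..n {
        for p in 0..k {
            let bv = b[p * n + j] as i32;
            for i in 0..m {
                c[i * n + j] += a[i * k + p] as i32 * bv;
            }
        }
    }
    c
}

/// Largest absolute element-wise difference; 0 means the outputs are bit-exact.
fn max_abs_diff(x: &[i32], y: &[i32]) -> u32 {
    assert_eq!(x.len(), y.len(), "outputs must have the same length");
    x.iter().zip(y).map(|(p, q)| p.abs_diff(*q)).max().unwrap_or(0)
}
```

Running both loops on the same inputs and passing the results to `max_abs_diff` should give 0 for any shapes small enough that the i32 accumulators do not overflow.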

Impact (Apple M4, e2e)

With the int8 kernels currently in tree (8x8 SMLAL), choosing the proper matrix kernel instead of the 64x1 GEMV is a modest e2e win:

model	before (`64x1`)	after (`8x8`)	speedup
all-MiniLM-L6-v2 (int8, seq=128)	49.3 ms	44.4 ms	1.11×
InceptionV1 (int8)	54.9 ms	51.6 ms	1.07×

More importantly, the fix lets faster int8 kernels be selected at all for 2D matmuls. With a FEAT_DotProd SDOT kernel (downstream), the same fix yields ~2× e2e on these models.

EinSumMatMul kernel selection (strategize) routed every 2D int8 matmul on arm64 to arm64simd_mmm_i32_64x1 -- a GEMV kernel (nr=1) -- instead of a square matrix kernel.

Quant matmul (i8 in, i32 acc) skips the ops().mmm() fast path (operating_dt i32 != input i8) and hit two tie-breaks that do not distinguish a GEMV kernel from a matrix kernel:

- concrete n>1: max_by_key((pe.is_none(), nr*mr)) -- i8 candidates 64x1/8x8 all tie at nr*mr=64, so the last (64x1) won.
- symbolic n (dynamic shapes): i8 kernels use distinct packings, so no group forms a (GEMV + matrix) VecVsMat pair; the fallback ordered by mmm.mr and 64x1 (mr=64) won.

f32/f16/block-quant are unaffected (each has a packing group forming a proper GEMV+matrix pair, so the first tie-break key already decides).

Fix: demote nr==1 in the concrete branch; prefer nr>1 matrix kernels in the symbolic branch; filter list_impls by is_supported_here() (added to the MatMatMul trait) so platform-gated kernels are not selected on a CPU that would trap.

Bit-exact vs the previous kernel (concrete + dynamic int8 models). core 244/244, linalg 3703/3703.

On Apple M4, choosing 8x8 over 64x1 gives ~1.1x e2e on MiniLM/InceptionV1; the fix also unblocks faster int8 kernels (e.g. FEAT_DotProd SDOT) being selected for 2D matmuls, where the gain is ~2x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

czoli1976 · 2026-05-24T20:10:13Z

Heads-up — I've stacked a follow-up on this: #2278 adds an SDOT (FEAT_DotProd) int8 kernel on top of this dispatch fix.

The split in impact (Apple M4, e2e):

- This PR alone selects the proper 8x8 SMLAL kernel instead of the 64x1 GEMV → ~1.1× (MiniLM 49.3→44.4 ms, InceptionV1 54.9→51.6 ms). It is mostly a correctness fix for the kernels in tree today.
- With #2278 (arm64 SDOT int8 matmul kernel (FEAT_DotProd) + PackedI8K4 packing), the same matmuls can pick the SDOT kernel → ~2× combined (MiniLM 50.5→24.8 ms, InceptionV1 53.6→28.4 ms).

So this one is the base; suggest reviewing it first. Both are bit-exact, cargo fmt --check clean, and green on tract-core (244) + tract-linalg.

arm64simd_mmm_i32_8x8_dot: an int8->i32 8x8 matmul kernel using SDOT (FEAT_DotProd, ARMv8.2), ~4x the SMLAL 8x8 at the matmul level. Same v16..v31 tile layout as the SMLAL 8x8, so it reuses the existing i32 fuse/store/q_scale machinery, and consumes the K=4-inner PackedI8K4 packing now upstream (sonos#2281).

- Gated on has_dotprod() (Apple M1+/A11+; Linux HWCAP_ASIMDDP). TRACT_DOTPROD_DISABLE=1 forces the SMLAL 8x8 fallback so callers can A/B on one binary.
- Wired into qmmm_i32: int8 matmul/conv pick SDOT when FEAT_DotProd is present, SMLAL 8x8 otherwise. Relies on the merged dispatch fix (sonos#2277) to route 2D int8 matmuls to a matrix kernel instead of the 64x1 GEMV.
- Adds linalg/benches/qmmm_i8.rs (SDOT vs SMLAL microbench).

Bit-exact vs the SMLAL kernel: linalg 114/114 (i8i8 + i32i32 fuse/frame + q_scale), core int8 matmul 25/25.

Apple M4 e2e (kernel unchanged from the original PR): MiniLM 44.4->24.8 ms (1.79x), InceptionV1 51.6->28.4 ms (1.82x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
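
For readers unfamiliar with SDOT, the scalar model below shows what one 32-bit lane of the instruction computes and why a K=4-inner packing suits it: each lane adds four i8*i8 products to its i32 accumulator in one step. This is an illustrative sketch only. `sdot_lane` and `tile_8x8_k4` are hypothetical names, and the panel layout (one group of four K values per row or column) is an assumption, not the actual PackedI8K4 format.

```rust
/// Scalar model of one 32-bit SDOT lane: acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3].
fn sdot_lane(acc: i32, a: [i8; 4], b: [i8; 4]) -> i32 {
    a.iter()
        .zip(b.iter())
        .fold(acc, |sum, (&x, &y)| sum.wrapping_add(x as i32 * y as i32))
}

/// Updates an 8x8 i32 tile with one group of four K values.
/// `a_rows[i]` holds the four A values for output row i, and
/// `b_cols[j]` holds the four B values for output column j (assumed layout).
fn tile_8x8_k4(tile: &mut [[i32; 8]; 8], a_rows: &[[i8; 4]; 8], b_cols: &[[i8; 4]; 8]) {
    for i in 0..8 {
        for j in 0..8 {
            tile[i][j] = sdot_lane(tile[i][j], a_rows[i], b_cols[j]);
        }
    }
}
```

For example, `sdot_lane(10, [1, 2, 3, 4], [5, 6, 7, 8])` evaluates to 10 + 5 + 12 + 21 + 32 = 80. A full matmul would call `tile_8x8_k4` once per group of four along K.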

czoli1976 mentioned this pull request May 24, 2026

arm64 SDOT int8 matmul kernel (FEAT_DotProd) + PackedI8K4 packing #2278

Open

czoli1976 mentioned this pull request May 24, 2026

arm64 SME2 SMOPA int8 matmul kernel (sme_qmmm_i32_32x32) #2279

Draft

kali approved these changes May 26, 2026


kali merged commit f43dc4c into sonos:main May 26, 2026
55 checks passed

kali merged 1 commit into sonos:main from czoli1976:fix/int8-matmul-dispatch
