linalg/arm64/sme: experimental f16 SME kernels for Apple M5/A19 by czoli1976 · Pull Request #2273 · sonos/tract

czoli1976 · 2026-05-22T13:16:17Z

Summary

Addendum to the merged f32 SME backend (#2230): experimental half-precision SME kernels for the mmm_f16 / mmv_f16 slots, using the non-widening fmopa za.h path (FEAT_SME_F16F16).

Warning

This is effectively Apple M5 / A19-only and has NOT been validated on real hardware. It is QEMU-correctness-validated only and needs community testing on an actual M5 / A19 (iPhone 17). It cannot regress anything else (see Build safety), so it's offered as a low-risk, opt-in-by-hardware addition.

Why "M5 / A19-only" — the LLVM smoking gun

FEAT_SME_F16F16 (non-widening f16→f16 in ZA.H) is an optional SME2 sub-feature. Among shipping silicon, only the Apple M5 / A19 implement it. The proof is upstream LLVM's own CPU definition — llvm/llvm-project commit f85494f6afeb "Define apple-m5/a19 CPUs":

def TuneAppleM5 : SubtargetFeature<"apple-m5", ...,
    FeatureSME, FeatureSME2, FeatureSMEF64F64, FeatureSMEI16I64,
    FeatureSME2p1, FeatureSMEB16B16, FeatureSMEF16F16, ...>;
//  --print-enabled-extensions -mcpu=apple-m5  →  FEAT_SME_F16F16
//  apple-a19 is an alias of apple-m5

In LLVM's entire AArch64 CPU table, FeatureSMEF16F16 appears for apple-m5/a19 and nothing else:

| Silicon | SME level | `FEAT_SME_F16F16` | These kernels |
| --- | --- | --- | --- |
| Apple M5 / A19 (iPhone 17) | SME2p1 | yes | run natively |
| Apple M4 / A18 | SME2 | no (verified `sysctl=0` on an M4) | dormant → AMX f16 |
| Arm C1 / Lumex (Exynos 2600), Cortex‑X925 | SME2 | no | dormant |
| Qualcomm Oryon (SD 8 Elite Gen 5) | SME (v1) | no | dormant |
| Neoverse / Graviton | none | n/a | n/a |

So this can only be exercised on an M5/A19, which I don't have — hence experimental, community testing requested.

What it adds

sme_mmm_f16_32x32 (GEMM): one 32×32 ZA.H tile, one fmopa za.h per K‑step. At SVL=512 a single f16 FMOPA covers the whole 32×32 tile (1024 MACs), which equals the MAC count of the f32 kernel's four 16×16 tiles. Gated by where(SME_F16F16).
sme_mmv_f16_64x1 (GEMV, N==1): vgx2 ZA.H group, using the SME2 multi‑vec fmla za.h[w8,0,vgx2]. Gated by where(SME2 && SME_F16F16).
Both consume tract's native K‑major f16 packing — no custom packer. CAN_FUSE matches the f32 kernels.
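The tile arithmetic and the per-K-step outer product described above can be sketched in scalar Rust. This is only an illustrative model, not the kernel code from this PR: f32 stands in for f16, the function names are hypothetical, and the layout assumed for "K-major" (each K-step stores its M values, then the next K-step) is an assumption about the packing rather than something taken from tract's packer.

```rust
// Illustrative model only: f32 stands in for f16 and all names are hypothetical.

/// Number of elements of `elem_bits` width in one streaming vector of `svl_bits`.
const fn lanes(svl_bits: usize, elem_bits: usize) -> usize {
    svl_bits / elem_bits
}

/// Multiply-accumulates performed by one FMOPA into a square ZA tile.
const fn macs_per_fmopa(svl_bits: usize, elem_bits: usize) -> usize {
    let n = lanes(svl_bits, elem_bits);
    n * n
}

/// Scalar model of the GEMM inner loop: one rank-1 (outer product) update per K-step.
/// Assumed layout: `a` holds `k` slices of `m` values, `b` holds `k` slices of `n` values.
fn outer_product_gemm(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), n * k);
    let mut tile = vec![0.0f32; m * n]; // stands in for the ZA tile, row-major
    for step in 0..k {
        let a_col = &a[step * m..(step + 1) * m];
        let b_row = &b[step * n..(step + 1) * n];
        for (i, &av) in a_col.iter().enumerate() {
            for (j, &bv) in b_row.iter().enumerate() {
                tile[i * n + j] += av * bv;
            }
        }
    }
    tile
}

fn main() {
    // SVL = 512 bits: 32 f16 lanes versus 16 f32 lanes.
    println!("f16: {} lanes, {} MACs per FMOPA", lanes(512, 16), macs_per_fmopa(512, 16));
    println!("f32: {} lanes, {} MACs over 4 tiles", lanes(512, 32), 4 * macs_per_fmopa(512, 32));

    // Tiny 2x2 example with K = 2.
    let tile = outer_product_gemm(&[1.0, 2.0, 3.0, 4.0], &[5.0, 6.0, 7.0, 8.0], 2, 2, 2);
    println!("{:?}", tile);
}
```

A scalar model of this shape is also a convenient reference when comparing kernel output under QEMU, provided the comparison accounts for f16 rounding in the real accumulator.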

Build safety (cannot regress the f32 SME backend)

The f16 unit needs the +sme-f16f16 assembler extension. A separate dummy_sme_f16f16.S probe, checked by assembler_supports_sme_f16f16(), decides whether the f16 kernels are compiled; when they are, they build as their own object behind a new tract_sme_f16f16 cfg. A toolchain with base SME but not f16f16 still builds the f32 SME kernels exactly as before: only the f16 unit (and all its Rust registration / detection / plug() wiring) is gated off. Runtime detection uses HWCAP2_SME_F16F16 (bit 42) on Linux and the hw.optional.arm.FEAT_SME_F16F16 sysctl on macOS, plus the existing 512‑bit SVL check.
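The runtime gating can be sketched as a pure function over the Linux AT_HWCAP2 word and the streaming vector length. This is a hedged sketch, not the detection code in this PR: the struct and function names are hypothetical, the SME2 bit position (37) is taken from the Linux arm64 uapi hwcap header rather than from the PR text, and treating the SVL check as an exact match on 512 is an assumption. Reading the actual value (for example via getauxval(AT_HWCAP2)) is left out.

```rust
// Hypothetical names throughout; only the bit positions come from the Linux uapi header.
const HWCAP2_SME2: u64 = 1 << 37;
const HWCAP2_SME_F16F16: u64 = 1 << 42; // bit 42, as quoted in the PR text

/// Streaming vector length the kernels are written for (assumed to be an exact match).
const REQUIRED_SVL_BITS: u32 = 512;

/// Which of the two f16 kernels may be plugged in.
#[derive(Debug, PartialEq, Eq)]
struct F16Kernels {
    mmm: bool, // sme_mmm_f16_32x32: needs SME_F16F16
    mmv: bool, // sme_mmv_f16_64x1: needs SME2 && SME_F16F16
}

fn select_f16_kernels(hwcap2: u64, svl_bits: u32) -> F16Kernels {
    let svl_ok = svl_bits == REQUIRED_SVL_BITS;
    let f16f16 = (hwcap2 & HWCAP2_SME_F16F16) != 0;
    let sme2 = (hwcap2 & HWCAP2_SME2) != 0;
    F16Kernels {
        mmm: svl_ok && f16f16,
        mmv: svl_ok && sme2 && f16f16,
    }
}

fn main() {
    // Simulated capability words; a real caller would read AT_HWCAP2 from the OS.
    let m5_like = HWCAP2_SME2 | HWCAP2_SME_F16F16;
    let m4_like = HWCAP2_SME2;
    println!("F16F16 present: {:?}", select_f16_kernels(m5_like, 512));
    println!("F16F16 absent:  {:?}", select_f16_kernels(m4_like, 512));
}
```

Keeping the decision separate from the OS query makes both the Linux and macOS paths feed the same logic, and lets the gating be unit-tested on machines without SME.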

Validation

QEMU only (no F16F16 hardware): under qemu-aarch64 -cpu max,sme512=on the full SME auto-test surface passes 220/220. That covers the two new f16 kernels (matmul proptest, every fuse op, store layouts, frame) plus the existing f32 GEMM/GEMV, with no regression.
Standalone probes confirmed fmopa za.h and the multi-vec fmla za.h vgx2 forms are bit-correct.
Builds clean on macOS (Apple clang) and debian:sid (gcc 15).
Not run on real FEAT_SME_F16F16 silicon — that's the ask.

Request

If you (or anyone) have an M5 Mac or an iPhone 17 (A19), a run of cargo test -p tract-linalg arm64::sme::test_sme_mmm_f16 / test_sme_mmv_f16 (or an f16 model A/B vs the AMX path) would confirm real-hardware correctness and let this graduate from experimental.

🤖 Generated with Claude Code

…_SME_F16F16)

Adds half-precision SME kernels for the mmm_f16 / mmv_f16 slots, the f16 companions to the merged f32 SME backend (sonos#2230). They use the non-widening half-precision SME path (`fmopa za.h`, FEAT_SME_F16F16): f16 inputs, f16 accumulate in ZA.H, consuming tract's native K-major f16 packing directly.

- sme_mmm_f16_32x32 (GEMM): one 32x32 ZA.H tile, `fmopa za.h` per K-step (at SVL=512, one f16 FMOPA covers the whole 32x32 tile). where(SME_F16F16).
- sme_mmv_f16_64x1 (GEMV, N==1): vgx2 ZA.H group, SME2 multi-vec `fmla za.h[w8,0,vgx2]`. where(SME2 && SME_F16F16).

EXPERIMENTAL / effectively Apple M5 + A19 only, and unvalidated on real hardware. FEAT_SME_F16F16 is an optional SME2 feature that, among shipping silicon, only the Apple M5 / A19 implement. The "smoking gun" is upstream LLVM's CPU definition (llvm/llvm-project commit f85494f6afeb, "Define apple-m5/a19"):

def TuneAppleM5 : SubtargetFeature<"apple-m5", ...,
    FeatureSME, FeatureSME2, FeatureSMEF64F64, FeatureSMEI16I64,
    FeatureSME2p1, FeatureSMEB16B16, FeatureSMEF16F16, ...>

In LLVM's whole AArch64 CPU table FeatureSMEF16F16 appears for apple-m5/a19 and nothing else: the Apple M4/A18 report it as 0 (verified by sysctl on an M4), and the newest non-Apple SME2 cores (Arm C1/Lumex in Exynos 2600, Cortex-X925, Qualcomm Oryon) have SME/SME2 but not FEAT_SME_F16F16. So this needs community testing on an actual M5 / A19 (iPhone 17); the maintainers' M4 cannot exercise it (it falls back to AMX f16).

Build gating (so the f32 SME backend is never regressed): the f16 unit uses the `+sme-f16f16` assembler extension. A new dummy_sme_f16f16.S probe + assembler_supports_sme_f16f16() compiles the f16 kernels as a separate object gated on the `tract_sme_f16f16` cfg; the f32 SME kernels keep building on toolchains that have base SME but not f16f16. The Rust f16 registrations, detection (HWCAP2_SME_F16F16 bit 42 on Linux / sysctl on macOS, plus the existing 512-bit SVL check), and plug() wiring are all behind that cfg.

Validated under QEMU only (no SME_F16F16 hardware available): with `qemu-aarch64 -cpu max,sme512=on` the full SME auto-test surface passes 220/220 (the two new f16 kernels: matmul proptest, every fuse op, store layouts, frame; plus the existing f32 GEMM/GEMV, no regression). Builds clean on macOS (Apple clang) and debian:sid (gcc 15).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

czoli1976 wants to merge 1 commit into sonos:main from czoli1976:feature/sme-f16-m5-experimental
