linalg/arm64/sme: experimental f16 SME kernels for Apple M5/A19 #2273

Draft
czoli1976 wants to merge 1 commit into sonos:main from czoli1976:feature/sme-f16-m5-experimental

@czoli1976
Contributor

Summary

Addendum to the merged f32 SME backend (#2230): experimental half-precision SME kernels for the mmm_f16 / mmv_f16 slots, using the non-widening fmopa za.h path (FEAT_SME_F16F16).

Warning

This is effectively Apple M5 / A19-only and has NOT been validated on real hardware. It is QEMU-correctness-validated only and needs community testing on an actual M5 / A19 (iPhone 17). It cannot regress anything else (see Build safety), so it's offered as a low-risk, opt-in-by-hardware addition.

Why "M5 / A19-only" — the LLVM smoking gun

FEAT_SME_F16F16 (non-widening f16→f16 in ZA.H) is an optional SME2 sub-feature. Among shipping silicon, only the Apple M5 / A19 implement it. The evidence is upstream LLVM's own CPU definition, llvm/llvm-project commit f85494f6afeb "Define apple-m5/a19 CPUs":

def TuneAppleM5 : SubtargetFeature<"apple-m5", ...,
    FeatureSME, FeatureSME2, FeatureSMEF64F64, FeatureSMEI16I64,
    FeatureSME2p1, FeatureSMEB16B16, FeatureSMEF16F16, ...>;
//  --print-enabled-extensions -mcpu=apple-m5  →  FEAT_SME_F16F16
//  apple-a19 is an alias of apple-m5

In LLVM's entire AArch64 CPU table, FeatureSMEF16F16 appears for apple-m5/a19 and nothing else:

| Silicon | SME level | FEAT_SME_F16F16 | These kernels |
| --- | --- | --- | --- |
| Apple M5 / A19 (iPhone 17) | SME2p1 | yes | run natively |
| Apple M4 / A18 | SME2 | no (verified sysctl=0 on an M4) | dormant → AMX f16 |
| Arm C1 / Lumex (Exynos 2600), Cortex‑X925 | SME2 | no | dormant |
| Qualcomm Oryon (SD 8 Elite Gen 5) | SME (v1) | no | dormant |
| Neoverse / Graviton | none | | |

So this can only be exercised on an M5/A19, which I don't have — hence experimental, community testing requested.

What it adds

  • sme_mmm_f16_32x32 (GEMM): one 32×32 ZA.H tile, fmopa za.h per K‑step. At SVL=512 a single f16 FMOPA covers the whole 32×32 tile, the same MAC count (1024) as the f32 kernel's four 16×16 tiles. where(SME_F16F16).
  • sme_mmv_f16_64x1 (GEMV, N==1): vgx2 ZA.H group, SME2 multi‑vec fmla za.h[w8,0,vgx2]. where(SME2 && SME_F16F16).
  • Both consume tract's native K‑major f16 packing — no custom packer. CAN_FUSE matches the f32 kernels.
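The tile arithmetic and the per-K-step outer product described above can be sketched as a scalar reference model. This is only an illustration, not the kernel code from this PR: `lanes`, `outer_product_step` and `gemm_k_major` are hypothetical names, f32 stands in for f16 (stable Rust has no built-in f16 type), and the packed layout (for each k, `mr` contiguous values of A and `nr` contiguous values of B) is an assumption about what "K-major" means here.

```rust
// Scalar reference model of an FMOPA-style GEMM tile. Assumptions:
// f32 stands in for f16, and panels are packed K-major as described
// in the lead-in. All names here are hypothetical.

/// Number of elements per vector for a streaming vector length (in bits)
/// and an element width (in bits).
const fn lanes(svl_bits: usize, elem_bits: usize) -> usize {
    svl_bits / elem_bits
}

/// One K-step: tile[i][j] += a[i] * b[j]. This models what a single
/// outer-product-accumulate instruction does to one ZA tile.
fn outer_product_step(tile: &mut [f32], a: &[f32], b: &[f32]) {
    let nr = b.len();
    for (i, &ai) in a.iter().enumerate() {
        for (j, &bj) in b.iter().enumerate() {
            tile[i * nr + j] += ai * bj;
        }
    }
}

/// Accumulate an mr x nr tile (row-major) from K-major packed panels.
fn gemm_k_major(pa: &[f32], pb: &[f32], mr: usize, nr: usize, k: usize) -> Vec<f32> {
    let mut tile = vec![0.0f32; mr * nr];
    for kk in 0..k {
        let a = &pa[kk * mr..(kk + 1) * mr]; // column kk of the A panel
        let b = &pb[kk * nr..(kk + 1) * nr]; // row kk of the B panel
        outer_product_step(&mut tile, a, b);
    }
    tile
}
```

With these helpers, `lanes(512, 16)` gives the 32 rows and columns of the f16 tile and `lanes(512, 32)` gives the 16 of an f32 tile, so one f16 step performs 32 × 32 multiply-accumulates, the same as four 16 × 16 f32 tiles.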

Build safety (cannot regress the f32 SME backend)

The f16 unit needs the +sme-f16f16 assembler extension. A separate dummy_sme_f16f16.S probe + assembler_supports_sme_f16f16() compiles the f16 kernels as their own object behind a new tract_sme_f16f16 cfg. A toolchain with base SME but not f16f16 still builds the f32 SME kernels exactly as before — only the f16 unit (and all its Rust registration / detection / plug() wiring) is gated off. Detection is HWCAP2_SME_F16F16 (bit 42) on Linux / the hw.optional.arm.FEAT_SME_F16F16 sysctl on macOS, plus the existing 512‑bit SVL check.
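As a rough sketch of the Linux side of the detection described above, the gating reduces to bit tests on the AT_HWCAP2 auxiliary-vector word plus the SVL check. This is not the detection code from this PR: `CpuCaps`, `can_run_mmm_f16` and `can_run_mmv_f16` are hypothetical names, bit 37 for SME2 is taken from the Linux arm64 hwcap header rather than from this description, and treating the SVL check as "exactly 64 bytes" is an assumption. On a real system `hwcap2` would come from something like `getauxval(AT_HWCAP2)`.

```rust
// Hypothetical model of the runtime gating; not the PR's actual code.
const HWCAP2_SME2: u64 = 1 << 37; // assumption: bit from Linux's arm64 hwcap.h
const HWCAP2_SME_F16F16: u64 = 1 << 42; // bit 42, as stated in the description
const REQUIRED_SVL_BYTES: usize = 64; // 512-bit streaming vector length

/// Capabilities gathered once at startup (hypothetical container).
struct CpuCaps {
    hwcap2: u64,      // value of AT_HWCAP2 on Linux
    svl_bytes: usize, // streaming vector length in bytes
}

/// GEMM kernel gate: where(SME_F16F16), plus the 512-bit SVL check.
fn can_run_mmm_f16(caps: &CpuCaps) -> bool {
    caps.hwcap2 & HWCAP2_SME_F16F16 != 0 && caps.svl_bytes == REQUIRED_SVL_BYTES
}

/// GEMV kernel gate: where(SME2 && SME_F16F16), plus the SVL check.
fn can_run_mmv_f16(caps: &CpuCaps) -> bool {
    can_run_mmm_f16(caps) && caps.hwcap2 & HWCAP2_SME2 != 0
}
```

Keeping the predicates pure (taking the raw word rather than querying the OS) makes them testable on any host, including ones without SME.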

Validation

  • QEMU only (no F16F16 hardware): under qemu-aarch64 -cpu max,sme512=on the full SME auto-test surface passes 220/220. That covers the two new f16 kernels (matmul proptest, every fuse op, store layouts, frame) plus the existing f32 GEMM/GEMV, with no regression.
  • Standalone probes confirmed fmopa za.h and the multi-vec fmla za.h vgx2 forms are bit-correct.
  • Builds clean on macOS (Apple clang) and debian:sid (gcc 15).
  • Not run on real FEAT_SME_F16F16 silicon — that's the ask.

Request

If you (or anyone) have an M5 Mac or an iPhone 17 (A19), a run of cargo test -p tract-linalg arm64::sme::test_sme_mmm_f16 / test_sme_mmv_f16 (or an f16 model A/B vs the AMX path) would confirm real-hardware correctness and let this graduate from experimental.

🤖 Generated with Claude Code

…_SME_F16F16)

Adds half-precision SME kernels for the mmm_f16 / mmv_f16 slots, the f16
companions to the merged f32 SME backend (sonos#2230). They use the non-widening
half-precision SME path (`fmopa za.h`, FEAT_SME_F16F16): f16 inputs, f16
accumulate in ZA.H, consuming tract's native K-major f16 packing directly.

- sme_mmm_f16_32x32 (GEMM): one 32x32 ZA.H tile, `fmopa za.h` per K-step (at
  SVL=512, one f16 FMOPA covers the whole 32x32 tile). where(SME_F16F16).
- sme_mmv_f16_64x1 (GEMV, N==1): vgx2 ZA.H group, SME2 multi-vec
  `fmla za.h[w8,0,vgx2]`. where(SME2 && SME_F16F16).

EXPERIMENTAL / effectively Apple M5 + A19 only, and unvalidated on real
hardware. FEAT_SME_F16F16 is an optional SME2 feature that, among shipping
silicon, only the Apple M5 / A19 implement. The "smoking gun" is upstream LLVM's
CPU definition (llvm/llvm-project commit f85494f6afeb, "Define apple-m5/a19"):

    def TuneAppleM5 : SubtargetFeature<"apple-m5", ...,
        FeatureSME, FeatureSME2, FeatureSMEF64F64, FeatureSMEI16I64,
        FeatureSME2p1, FeatureSMEB16B16, FeatureSMEF16F16, ...>

In LLVM's whole AArch64 CPU table FeatureSMEF16F16 appears for apple-m5/a19 and
nothing else: the Apple M4/A18 report it as 0 (verified by sysctl on an M4), and
the newest non-Apple SME2 cores (Arm C1/Lumex in Exynos 2600, Cortex-X925,
Qualcomm Oryon) have SME/SME2 but not FEAT_SME_F16F16. So this needs community
testing on an actual M5 / A19 (iPhone 17) — the maintainers' M4 cannot exercise
it (it falls back to AMX f16).

Build gating (so the f32 SME backend is never regressed): the f16 unit uses the
`+sme-f16f16` assembler extension. A new dummy_sme_f16f16.S probe +
assembler_supports_sme_f16f16() compiles the f16 kernels as a separate object
gated on the `tract_sme_f16f16` cfg; the f32 SME kernels keep building on
toolchains that have base SME but not f16f16. The Rust f16 registrations,
detection (HWCAP2_SME_F16F16 bit 42 on Linux / sysctl on macOS, plus the
existing 512-bit SVL check), and plug() wiring are all behind that cfg.

Validated under QEMU only (no SME_F16F16 hardware available): with
`qemu-aarch64 -cpu max,sme512=on` the full SME auto-test surface passes 220/220
(the two new f16 kernels: matmul proptest, every fuse op, store layouts, frame;
plus the existing f32 GEMM/GEMV, no regression). Builds clean on macOS (Apple
clang) and debian:sid (gcc 15).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>