linalg/mmm: cache-adaptive 2D-blocking for the single-thread tile walk #2274
czoli1976 wants to merge 2 commits into
The single-thread MMM tile walk used a naive nested loop that re-streams the full inner operand (all of A in col-outer, all of B in row-outer) once per outer panel. At large k this is memory/L1-bound. The multithread path already 2D-blocks the panel grid (chunk_grid); this brings the same blocking to the single-thread path. The block edge is cache-derived (detected L2 size divided by 3, conservative 256 KiB fallback), so the working set stays L2-resident across hardware and never over-blocks a cache it cannot see.

Bit-identical: it only reorders independent tiles (each computes its full-k reduction into a disjoint C region). The block-edge floor of 1 degrades exactly to the naive loop; the cap of 16 matches the multithread chunk_grid blocking already shipped on all platforms. Frame-level, so all kernels benefit.

+20-45% at large k on Apple Silicon (single-thread); small / GEMV / multithreaded shapes are unchanged.

Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against the naive reference (the existing frame proptests only reach 3 panels).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
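To make the reordering argument concrete, here is a minimal sketch of a naive col-outer walk next to a 2D-blocked one. The function names `naive_walk` and `blocked_walk` are hypothetical and are not the PR's code; the sketch only records the order in which `(m_panel, n_panel)` tiles would be visited, under the assumption that each tile is independent.

```rust
/// Naive col-outer walk: for each n panel, visit every m panel.
/// Returns the visit order as (m_panel, n_panel) pairs.
fn naive_walk(m_panels: usize, n_panels: usize) -> Vec<(usize, usize)> {
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for n in 0..n_panels {
        for m in 0..m_panels {
            order.push((m, n));
        }
    }
    order
}

/// Blocked walk: visit the same tiles in blk x blk blocks, keeping the
/// col-outer order inside each block. A blk of 0 is treated as 1.
fn blocked_walk(m_panels: usize, n_panels: usize, blk: usize) -> Vec<(usize, usize)> {
    let blk = blk.max(1);
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for n0 in (0..n_panels).step_by(blk) {
        for m0 in (0..m_panels).step_by(blk) {
            // Within a block, only blk A-panels and blk B-panels are touched.
            for n in n0..(n0 + blk).min(n_panels) {
                for m in m0..(m0 + blk).min(m_panels) {
                    order.push((m, n));
                }
            }
        }
    }
    order
}
```

On a 17×17 panel grid with `blk = 16`, the two functions return the same set of tiles in a different order, and with `blk = 1` the blocked walk produces exactly the naive order. That mirrors the floor-of-1 claim above.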
Context, caveats & follow-ups (for review)

Where this came from. Profiling tract's MMM vs Apple's Accelerate (

Activation scope (why it's low-risk). The blocked walk is bit-identical to the old loop unless the inner panel dimension exceeds the block edge, i.e. only large 2D single-thread matmuls are reordered. GEMV (

Validation honesty. Perf is verified on Apple Silicon only. M1 Pro is reliable (consistent re-runs). M4 is P/E-core-scheduling confounded over SSH (the same binary measured ~1772 on the P-cluster vs ~838 on the E-cluster at k=2048; macOS has no user-facing P-core pin

Known remaining tail.

Other follow-ups (not in this PR):
The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant (it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt --check` failed in CI. Pure formatting; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
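For readers unfamiliar with the Linux side, a sysfs L2 lookup can be sketched roughly as follows. This is not the PR's `detect_l2_bytes`; the helper names `parse_cache_size` and `read_l2_from_sysfs` are hypothetical, and the sketch assumes the usual `/sys/devices/system/cpu/cpu0/cache/indexN/{level,size}` layout with sizes written like `512K`.

```rust
use std::fs;

/// Parse a sysfs cache size string such as "512K" or "4M" into bytes.
/// Returns None for empty or non-numeric input.
fn parse_cache_size(s: &str) -> Option<usize> {
    let s = s.trim();
    let (digits, mult) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1024),
        'M' | 'm' => (&s[..s.len() - 1], 1024 * 1024),
        _ => (s, 1),
    };
    digits.parse::<usize>().ok()?.checked_mul(mult)
}

/// Scan cpu0's cache indices for the entry whose `level` is 2 and return
/// its size in bytes. Returns None if sysfs is absent or unreadable
/// (for example on non-Linux systems).
fn read_l2_from_sysfs() -> Option<usize> {
    for idx in 0..8 {
        let base = format!("/sys/devices/system/cpu/cpu0/cache/index{idx}");
        let level = fs::read_to_string(format!("{base}/level")).ok();
        if level.as_deref().map(str::trim) == Some("2") {
            let size = fs::read_to_string(format!("{base}/size")).ok()?;
            return parse_cache_size(&size);
        }
    }
    None
}
```

Returning `Option` keeps the "L2 can't be read" case explicit, so the caller decides which fallback to apply.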

@kali this needs verification on Cortex A5x, thanks!

Nice, this is an option I had considered in the initial design, since it is central in BLIS, which was more or less my model. Let's see what it does.

On A53 and A55, results are in the noise bracket on audio loads and image classifiers.

Good.

Besides Apple Silicon, would you have access to more recent ARM cores like Cortex-A76/A78/A715/A720?

I have a plan to generalize benching to Graviton instances. I also think I can run on an A78AE, but it will take a while: the bench runner was down, so it will have to catch up on a lot of history.

Can't wait to see how it goes with Graviton.

Summary

The single-thread MMM tile walk uses a naive `for n_panel { for m_panel }` loop, so at large `k` it re-streams the entire inner operand (all of A in col-outer, all of B in row-outer) once per outer panel, which makes it memory/L1-bound. The multithread path already avoids this by 2D-blocking the panel grid (`chunk_grid`, 16-panel chunks). This PR brings the same blocking to the single-thread path, with the block size cache-derived so it stays L2-resident across hardware.

The change

- `run_single_thread_blocked`: walk the `m×n` panel grid in `BLK×BLK` blocks (col/row-outer preserved as the within-block inner order).
- `st_block_edge`: `BLK` sized so the block's A+B sub-panels (~`BLK·(mr+nr)·k·elem`) fit a working-set budget = detected L2 / 3 (`sysctl hw.perflevel0.l2cachesize` on macOS, sysfs on Linux), clamped to `[1, 16]`, with a conservative 256 KiB fallback when L2 can't be read.

Results (single-thread, m=512, n=2048, f32; GFLOP/s, M1 Pro)

[Results table not preserved; one of its columns was `cblas_sgemm`.]

The gain is largest where the old loop thrashed cache; `k≤512` is within noise. Frame-level, so all kernels (NEON/AMX/SME/…) benefit.

Correctness / safety

- Bit-identical: it only reorders independent tiles. Each computes its full-`k` reduction into a disjoint C region, so order changes no result (the multithread path already iterates in a different, chunked order).
- `BLK` clamps to `1` ⇒ exactly the naive nested loop, so a small/unknown cache can never over-block.
- `BLK ≤ 16` = the `chunk_grid` blocking already shipped on x86/WASM/ARM via the multithread path.
- Adds 5 large-shape frame tests in `packed_packed.rs` exercising the blocked path (16²- and 17²-panel boundaries) against the naive reference. The existing frame proptests only reach 3 panels, below the block threshold. Full `tract-linalg` suite green.

Scope / non-goals

Touches only the single-thread arms of `run_with_scratch_space_{col,row}_outer`; the multithread path is unchanged. No `kc`-loop (k-blocking) yet; see the follow-up comment. No new dependencies.

🤖 Generated with Claude Code
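
The block-edge sizing described under "The change" can be sketched as below. This is an illustration of the stated formula, not the PR's `st_block_edge`: the function name and signature are hypothetical, and the sketch assumes the 256 KiB fallback stands in for the L2 size (the description does not say whether it is the assumed L2 size or the budget itself).

```rust
/// Upper bound on the block edge, matching the 16-panel cap in the description.
const MAX_BLK: usize = 16;
/// Assumption: used as the L2 size when detection fails.
const FALLBACK_L2_BYTES: usize = 256 * 1024;

/// Pick a block edge BLK such that roughly BLK * (mr + nr) * k * elem_bytes
/// fits in a budget of L2 / 3, clamped to [1, MAX_BLK].
fn st_block_edge_sketch(
    l2_bytes: Option<usize>,
    mr: usize,
    nr: usize,
    k: usize,
    elem_bytes: usize,
) -> usize {
    let budget = l2_bytes.unwrap_or(FALLBACK_L2_BYTES) / 3;
    // Bytes of A and B sub-panels contributed by one unit of block edge.
    let per_blk = (mr + nr)
        .saturating_mul(k)
        .saturating_mul(elem_bytes)
        .max(1); // avoid dividing by zero on degenerate shapes
    (budget / per_blk).clamp(1, MAX_BLK)
}
```

With a hypothetical 4 MiB L2 and an 8×8 f32 kernel, this yields 10 at `k = 2048` and hits the cap of 16 at `k = 512`. With no detected L2 and `k = 2048` it falls to 1, i.e. the naive loop.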