A high-performance General Matrix Multiply (GEMM) implementation targeting x86-64 processors with AVX-512F and AVX-512BF16 instruction support. Achieves 347.69 GFLOPs on 1024×1024×1024 matrices on a single thread or 5,563.04 GFLOPs on my 16 core mini-PC.
This is just a POC and so only supports high-end x86-64 CPUs.
This mutiplies two matrices in column major order, the first of which is transposed. The output is currently f32, but we could also quantise the result.
When building LLMs, it would probably be wise to transpose the weight matrices to avoid permutes and broadcasts.
It is also possible to double the performance by using u8 and i8, but this is out of scope for this POC.
cargo build --releasecargo run --release --example gemm_1kExpected output:
Execution time: ~6.2 ms
Performance: ~347 GFLOPs
Sample output C[0]: 1024
Expected (approximately): 1024
The implementation uses a three-level tiling strategy to maximize cache reuse and instruction-level parallelism:
-
Supertile (8×16): Outer loop level for L3 cache optimization
- Height: 8 rows
- Width: 16 columns
- Fits working set into L3 cache for large matrices
-
Tile (4×4): Inner loop level with full manual unrolling
- Height: 4 rows
- Width: 4 columns
- Generates 16 VDPBF16PS instructions per k-loop iteration
- Compiler schedules instructions optimally across execution ports
-
VDPBF16PS: AVX-512BF16 dot product instruction
- BF16 format: 16-bit floating-point (1 sign, 8 exponent, 7 mantissa)
- Computes 16 dot products (BF16 pairs → F32 scalars) per instruction
- Accumulation into 512-bit ZMM registers
- A (k×m): Column-major, leading dimension
lda - B (k×n): Column-major, leading dimension
ldb - C (m×n): Column-major, leading dimension
ldc - Computation: C = α·A^T·B + β·C
Inner loops are manually unrolled using a Rust macro:
macro_rules! dpbf16_step {
($ii_local:expr, $jj_local:expr, ...) => {
// Single 4×4 tile VDPBF16PS operation
};
}
// Expanded to 16 explicit calls at compile time
dpbf16_step!(0, 0, ...);
dpbf16_step!(1, 0, ...);
// ... (14 more)
dpbf16_step!(3, 3, ...);This eliminates loop overhead and enables better instruction scheduling.
- Intel: 3rd Gen Xeon Scalable (Ice Lake), 4th Gen Xeon Scalable (Sapphire Rapids)
- AMD: EPYC 7004 series (Bergamo) with AVX-512F and AVX-512BF16 support
- AMD 9955HX, inexpensive mini PC with 16 cores.
- x86-64 architecture
- AVX-512F extension (core instruction set)
- AVX-512BF16 extension (BF16 dot product instructions)
- CPU: AMD Zen5 (5 GHz, dual-issue AVX-512)
- RAM: 8+ GB
- Build: Rust 1.70+ with LLVM backend
- 1024×1024×1024 matrices: 347.69 GFLOPs (single-iteration warm-cache)
- Execution time: ~6.2 ms
Assuming a 5 GHz AMD Zen5 with dual-issue AVX-512 (single core):
Single-core capacity:
- VDPBF16PS throughput: 2 instructions/cycle (dual-issue)
- FLOPs per instruction: 32 (16 BF16 pairs → 16 F32 results, with fused multiply-add)
- Per-cycle: 2 × 32 = 64 GFLOPs/cycle
- Single-core: 64 × 5 GHz = 320 GFLOPs/core theoretical maximum
- Achieved: 347.69 GFLOPs (single-threaded, 1024×1024×1024)
- Speedup vs. theoretical baseline: 1.09× (347.69 / 320 GFLOPs)
- Primary factors enabling high performance:
- Hierarchical tiling exploits L3 cache reuse
- Manual macro unrolling improves instruction scheduling
- Memory access patterns reduce cache misses
- Horizontal reduction minimized through careful register allocation
- Baseline (reference 4×4 tile): ~336 GFLOPs
- With 8×8 supertile: 230 GFLOPs (register pressure, cache conflicts)
- Current (8×16 supertile + macro unroll): 347.69 GFLOPs ✓
src/lib.rs - Core GEMM implementation
examples/gemm_1k.rs - Benchmark for 1024×1024×1024 matrices
Cargo.toml - Rust dependencies (half::bf16)
.gitignore - Ignores /target and /artifacts
pub unsafe fn gemm_bf16(
m: usize, // Output rows
n: usize, // Output columns
k: usize, // Reduction dimension
alpha: f32, // Scaling factor for A·B
a: &[bf16], // Matrix A (column-major)
lda: usize, // Leading dimension of A
b: &[bf16], // Matrix B (column-major)
ldb: usize, // Leading dimension of B
beta: f32, // Scaling factor for C
c: &mut [f32], // Output matrix C (column-major)
ldc: usize, // Leading dimension of C
)m % 8 == 0(multiple of supertile height)n % 16 == 0(multiple of supertile width)k % 32 == 0(multiple of VDPBF16PS element count)
pub unsafe fn gemm_bf16_with_tiling(
m: usize, n: usize, k: usize,
alpha: f32, a: &[bf16], lda: usize,
b: &[bf16], ldb: usize,
beta: f32, c: &mut [f32], ldc: usize,
supertile_i: usize, // Custom supertile height (multiple of 4)
supertile_j: usize, // Custom supertile width (multiple of 4)
)Run the example benchmark:
cargo run --release --example gemm_1kCustomization via environment variables:
- Modify
examples/gemm_1k.rsfor different matrix sizes - Build in release mode for accurate performance measurements
- Macro-based unrolling: Eliminates loop overhead, enables compiler optimization
- Hierarchical tiling: Maximizes cache hit rates
- Column-major storage: Matches GEMM computation pattern (contiguous loads)
- Horizontal reduction: Uses VDPBF16PS accumulation +
_mm512_reduce_add_ps - Alpha/beta scaling: Applied during writeback to reduce memory traffic
-O3: Optimization level 3 (enabled in release builds)--target-cpu=native: Use native CPU capabilities (recommended)
Build for a specific CPU:
RUSTFLAGS="-C target-cpu=skylake-avx512" cargo build --release- Single-threaded: No OpenMP/Rayon parallelization
- Fixed precision: BF16 inputs, F32 outputs only
- No transpose support: Hardcoded
trans_a='T',trans_b='N' - x86-64 only: No fallback for non-AVX-512 systems
- AVX-512 Intrinsics: Intel Intrinsics Guide
- BF16 Specification: Google Brain Floating Point Format
- GEMM Optimization: Anatomy of High-Performance Matrix Multiplication