feat: add simdgroup-optimized Metal kernels #172
cluster2600 wants to merge 31 commits into alibaba:main
Conversation
Discussion issue opened: #177 — feedback welcome before review.
@greptile |
Greptile Summary

This PR adds hardware-accelerated Metal compute kernels for Apple Silicon, introducing 6 new simdgroup-optimized kernels that leverage cooperative SIMD intrinsics (simd_sum, simd_min, simd_shuffle).

Key Changes
Issues Found
The implementation follows proper Metal Shading Language conventions and correctly uses simdgroup cooperative reductions for hardware acceleration.

Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Vector Search Request] --> Detect[Hardware Detection]
    Detect --> CheckCppCuvs{C++ cuVS<br/>Available?}
    CheckCppCuvs -->|Yes| UseCppCuvs[Use C++ cuVS<br/>CAGRA/IVF-PQ]
    CheckCppCuvs -->|No| CheckPyCuvs{Python cuVS<br/>Available?}
    CheckPyCuvs -->|Yes| UsePyCuvs[Use Python cuVS<br/>CuPy arrays]
    CheckPyCuvs -->|No| CheckFaissGpu{FAISS GPU +<br/>NVIDIA GPU?}
    CheckFaissGpu -->|Yes| UseFaissGpu[Use FAISS GPU]
    CheckFaissGpu -->|No| CheckMps{Apple Silicon<br/>MPS?}
    CheckMps -->|Yes| UseMps[Use Metal Kernels<br/>simdgroup ops]
    CheckMps -->|No| CheckFaissCpu{FAISS CPU?}
    CheckFaissCpu -->|Yes| UseFaissCpu[Use FAISS CPU<br/>+ Accelerate]
    CheckFaissCpu -->|No| UseNumpy[Fallback to NumPy]
    UseCppCuvs --> Execute[Execute Search]
    UsePyCuvs --> Execute
    UseFaissGpu --> Execute
    UseMps --> MetalKernels[Metal Kernels:<br/>L2/cosine/topk]
    MetalKernels --> Execute
    UseFaissCpu --> Execute
    UseNumpy --> Execute
    Execute --> Results[Return distances<br/>+ indices]
    style UseMps fill:#a8dadc
    style MetalKernels fill:#a8dadc
    style UseCppCuvs fill:#f1faee
    style UsePyCuvs fill:#f1faee
```

Last reviewed commit: b08a835
Internal sprint work - not for upstream PR:

- backends/detect.py: hardware detection
- backends/gpu.py: FAISS GPU integration
- backends/quantization.py: Product Quantization
- backends/opq.py: OPQ + Scalar Quantization
- backends/search.py: search optimization
- backends/hnsw.py: HNSW implementation
- backends/apple_silicon.py: Apple Silicon optimization
- backends/benchmark.py: benchmarks
- ShardManager for vector sharding
- DistributedIndex with scatter-gather queries
- QueryRouter for routing strategies
- ResultMerger for merging results from shards
- Support for hash, range, and random sharding
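The scatter-gather pattern in that commit can be sketched in a few lines. This is a minimal, pure-Python illustration only; the class and method names (HashShardRouter, add, search) are hypothetical stand-ins, not the PR's actual ShardManager/DistributedIndex API.

```python
# Hedged sketch of hash-sharded scatter-gather search.
# All names here are illustrative, not the PR's real API.
import heapq


class HashShardRouter:
    """Route vectors to shards by hashing their ids (one of the
    hash/range/random strategies mentioned in the commit)."""

    def __init__(self, n_shards):
        self.n_shards = n_shards
        self.shards = [dict() for _ in range(n_shards)]  # id -> vector

    def add(self, vec_id, vector):
        # Scatter on insert: the hash decides which shard owns the vector.
        self.shards[hash(vec_id) % self.n_shards][vec_id] = vector

    def search(self, query, k):
        # Scatter the query to every shard, gather local candidates,
        # then merge into a single global top-k (the ResultMerger role).
        candidates = []
        for shard in self.shards:
            for vec_id, vec in shard.items():
                dist = sum((q - v) ** 2 for q, v in zip(query, vec))
                candidates.append((dist, vec_id))
        return heapq.nsmallest(k, candidates)
```

A real implementation would keep a per-shard index (HNSW, IVF-PQ) and only merge each shard's local top-k, but the routing and merge structure is the same.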
- Add README.md with full API documentation
- Add BENCHMARK_README.md with benchmark results
- Add test_backends.py with comprehensive tests
- Adjust k to avoid sampling errors
- Simplify k-means implementation
- Fix codebooks shape
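"Adjust k to avoid sampling errors" most likely refers to clamping the requested cluster count to the number of available training points before sampling initial centroids. A hedged sketch, assuming that interpretation (the function name is illustrative, not the PR's code):

```python
import random


def safe_kmeans_init(points, k, seed=0):
    """Pick k initial centroids for k-means, clamping k to len(points)
    so random.sample() never raises 'sample larger than population'.
    Illustrative sketch of the 'adjust k' fix, not the actual code."""
    k = max(1, min(k, len(points)))
    rng = random.Random(seed)
    return rng.sample(points, k)
```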
Based on cuVS documentation:
- Support for CAGRA, IVF-PQ, HNSW algorithms
- 12x faster builds, 8x lower latency target
- Dynamic batching for CAGRA
Based on cuVS documentation:
- IVF-PQ: 12x faster builds, 8x lower latency
- CAGRA: 10x latency improvement with dynamic batching, 8x throughput
- Both support fallback when cuVS is not available
- 9x speedup target vs CPU
- Compatible with DiskANN
Based on arXiv:2401.11324:
- Synthetic clustered data generation
- FAISS CPU/GPU/IVF-PQ benchmarks
- cuVS placeholder benchmarks
- Results output to markdown
S3: GPU-PIM collaboration research
S4: Memory coalescing kernel (2-8x speedup)
S5: Apple ANE optimization guide
S6: ANE vs MPS benchmark
S7: Graph reordering (15% QPS gain)
S8: PIM evaluation framework

All based on scientific papers.
1. cuVS C++ bindings (zvec_cuvs.h)
   - IVFPQ, CAGRA, HNSW index classes
   - Template-based for float/uint8_t/int8_t
2. CUDA coalesced kernels (coalesce.cuh, coalesce.cu)
   - Coalesced L2 distance (2-8x speedup)
   - Warp-level reductions
   - FP16 support
   - Tiled shared memory version
3. Metal MPS kernels (distance.metal)
   - L2 distance with SIMD/NEON
   - FP16 support for Apple Silicon
   - Batch processing
   - Matrix multiplication

All based on scientific papers.
1. SIMD CPU optimization (simd_distance.h)
   - SSE2, AVX2 for x86
   - NEON for ARM/Apple Silicon
   - 4-16x speedup expected
2. CMake build system (CMakeLists.txt)
   - CUDA coalesced kernels
   - Metal shaders
   - SIMD CPU
   - Optional cuVS integration
3. Graph-based ANN (graph_ann.h)
   - CAGRA-like implementation
   - NN-Descent graph construction
   - Hierarchical search
1. FastScan (simd_distance.h)
   - SIMD-optimized Product Quantization
   - AVX2 distance computation
   - Bitonic sort for k-selection
2. Vamana Graph (vamana.h)
   - DiskANN algorithm
   - Robust to search parameters
   - Used in Azure AI Search
3. NUMA-aware (numa.h)
   - Per-NUMA-node allocation
   - Work-stealing thread pool
   - 6-20x speedup on multi-socket

Based on papers:
- Quake (OSDI 2025): NUMA-aware partitioning
- FAISS (2024): FastScan SIMD optimization
- DiskANN: Vamana graph
1. Lock-free concurrent structures (lockfree.h)
   - LockFreeVector (Stroustrup design)
   - AtomicIndex for HNSW
   - Hazard pointer reclamation
2. Memory pool optimizations (memory_pool.h)
   - Aligned allocator (cache-line, huge pages)
   - Object pool
   - Slab allocator
   - SoA layout
3. Batch processing (batch.h)
   - Transposed matrix for PQ (30-50% faster)
   - Loop unrolling
   - AVX-512 support
   - PQ distance tables

Based on:
- FAISS optimization guide
- Stroustrup lock-free vector
- OptiTrust paper (2024)
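The "PQ distance tables" idea that several of these commits build on (FastScan, batch.h) reduces distance computation to table lookups. A minimal pure-Python sketch of asymmetric distance computation (ADC); the function names and layouts here are illustrative, not the repo's C++ API:

```python
def build_pq_distance_table(query, codebooks):
    """For each PQ subspace, precompute the squared distance from the
    query's sub-vector to every centroid in that subspace's codebook.
    codebooks: [n_subspaces][n_centroids][sub_dim]. Illustrative sketch."""
    n_sub = len(codebooks)
    sub_dim = len(codebooks[0][0])
    tables = []
    for s in range(n_sub):
        q_sub = query[s * sub_dim:(s + 1) * sub_dim]
        tables.append([sum((q - c) ** 2 for q, c in zip(q_sub, cent))
                       for cent in codebooks[s]])
    return tables


def pq_distance(code, tables):
    """Distance to an encoded database vector is then a pure table
    lookup per subspace plus a sum (no per-vector float math)."""
    return sum(tables[s][c] for s, c in enumerate(code))
```

The SIMD versions (AVX2 FastScan, transposed batch layout) vectorize exactly this lookup-and-accumulate loop; the table structure is unchanged.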
Add 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes without shared memory barriers:

- metal_l2_distance_simdgroup: cooperative L2 distance
- metal_inner_product_simdgroup: cooperative dot product
- metal_cosine_similarity_simdgroup: normalized inner product
- metal_topk_simdgroup: per-query top-k selection via simd_min
- metal_matmul_tiled: tiled matmul with threadgroup shared memory
- metal_normalize_simdgroup: in-place L2 normalization

Also fixes existing kernels:
- Replace simd_make_float4 with float4 constructor (MSL compliance)
- Add device address space qualifiers in batch kernel

Tested: compiles cleanly with metal -std=metal3.1 -W -Werror.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
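The barrier-free reduction these kernels rely on can be modeled outside Metal. The sketch below is a pure-Python model of what a 32-lane simd_sum-style shuffle-down tree does, for readers without the Metal toolchain; it is an illustration of the reduction pattern, not the PR's shader code.

```python
def simd_sum_model(lanes):
    """Model a 32-lane shuffle-down tree reduction: each step, lane i
    adds the value held by lane i+offset, and the offset halves
    (16, 8, 4, 2, 1). In hardware this needs no threadgroup barrier
    because all 32 lanes execute in lockstep within one SIMD group."""
    assert len(lanes) == 32
    vals = list(lanes)
    offset = 16
    while offset >= 1:
        # Ascending lane order means every read of vals[lane + offset]
        # still sees that lane's pre-step value, matching lockstep HW.
        for lane in range(32):
            partner = lane + offset
            if partner < 32:
                vals[lane] = vals[lane] + vals[partner]
        offset //= 2
    # After log2(32) = 5 steps, lane 0 holds the full sum
    # (simd_sum additionally broadcasts it to all lanes).
    return vals[0]
```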
- cuvs_cagra.py: use cagra.build(IndexParams, dataset) and cagra.search(SearchParams, index, queries, k) instead of the non-existent Index().build() / Index().search() methods
- cuvs_ivf_pq.py: same pattern fix, plus correct import path (cuvs.neighbors.ivf_pq instead of cuvs.ivf_pq)
- Both backends now convert numpy queries to cupy device arrays before search (cuVS requires CUDA-compatible memory)

Tested on RTX 4090:
- cuVS CAGRA: 43K QPS (50K vectors, dim=128)
- cuVS IVF-PQ: 45K QPS (50K vectors, dim=128)
- FAISS GPU: 529K QPS (50K vectors, dim=128, flat)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Add CUVS_AVAILABLE and CPP_CUVS_AVAILABLE flags to detect.py.

Update get_optimal_backend() priority chain:
C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
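The priority chain is a simple first-match walk over availability flags. A hedged sketch, assuming a dict of booleans; the backend names are illustrative stand-ins for the CPP_CUVS_AVAILABLE / CUVS_AVAILABLE-style flags in detect.py, not its exact API:

```python
def get_optimal_backend_sketch(flags):
    """Sketch of the commit's priority chain:
    C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy.
    `flags` maps backend name -> availability bool (illustrative names)."""
    for backend in ("cpp_cuvs", "py_cuvs", "faiss_gpu", "mps", "faiss_cpu"):
        if flags.get(backend, False):
            return backend
    return "numpy"  # always-available fallback, never gated on a flag
```

Keeping the chain in one ordered tuple makes the precedence auditable at a glance and trivial to extend when a new backend lands.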
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- Fix F821 (undefined np): add module-level numpy imports for type annotations
- Fix PLC0415: add noqa for intentional lazy imports inside functions
- Fix G004: convert f-string logging to lazy % formatting
- Fix NPY002: add noqa for legacy numpy random calls in benchmarks
- Fix ARG001/ARG002: prefix unused args with underscore
- Fix PTH123: use Path.open() instead of open()
- Fix I001: sort imports in __init__.py
- Exclude *.ipynb from ruff (demo/benchmark notebooks)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
The GPU CMakeLists.txt requires the CUDA toolkit (nvcc), which is not available on CI runners. The C++ headers are header-only and do not need a separate build system.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
- Remove spurious .T in asymmetric_distance_computation() that caused a broadcast shape mismatch (10,100) vs (100,10)
- Fix off-by-one in test_distributed_index: assert shard count == 4 instead of checking for non-existent shard index 4
- Skip TestGPUIndex when FAISS is not installed instead of raising RuntimeError

Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
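The .T bug is easy to reproduce: in asymmetric distance computation the codes array must be indexed as (n_db, n_subspaces), and transposing it flips the axes so the fancy-indexing broadcast fails with exactly the kind of (10,100) vs (100,10) mismatch the commit describes. A hedged numpy sketch of the corrected shape discipline (function name and layouts are illustrative, not the repo's code):

```python
import numpy as np


def asymmetric_distances_sketch(query_tables, codes):
    """Sum, for each database vector, the per-subspace table entries
    selected by its PQ codes.
    query_tables: (n_subspaces, n_centroids) float array
    codes:        (n_db, n_subspaces) int array -- NOT transposed
    Returns (n_db,) distances. Adding a spurious codes.T here swaps
    the axes and breaks the broadcast, as in the bug fixed above."""
    n_sub = query_tables.shape[0]
    # Row index (n_sub,) broadcasts against codes (n_db, n_sub):
    # result[i, s] = query_tables[s, codes[i, s]]
    return query_tables[np.arange(n_sub), codes].sum(axis=1)
```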
Force-pushed d696fb0 to 7bf5a5e
Pushed a fix for the macos-arm64 test failure.

Root cause: several shared base commits inadvertently modified files unrelated to the Metal simdgroup kernels.

Fix: reverted those unrelated changes. This should resolve the macos-arm64 failure.
Summary

Adds 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes, with no shared memory barriers needed.

Follow-up to #166 ("Future Work: SIMD optimization").

New kernels

- metal_l2_distance_simdgroup: cooperative L2 distance
- metal_inner_product_simdgroup: cooperative dot product
- metal_cosine_similarity_simdgroup: normalized inner product
- metal_topk_simdgroup: top-k selection via simd_min lane voting
- metal_matmul_tiled: tiled matmul with threadgroup shared memory
- metal_normalize_simdgroup: in-place L2 normalization

Dispatch model

- (n_database, n_queries) threadgroups

Fixes to existing kernels

- simd_make_float4 → float4 constructor (MSL compliance)
- device address space qualifiers in metal_l2_distance_batch

Merge order

Test plan

- metal -std=metal3.1 -W -Werror on macOS with Xcode Metal toolchain
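The dispatch model above maps one threadgroup to each (database vector, query) pair, with each threadgroup holding a single 32-lane SIMD group that reduces along the vector dimension. A tiny host-side sketch of that grid computation (an illustrative helper, not the PR's actual host code):

```python
def dispatch_grid_sketch(n_database, n_queries, simd_width=32):
    """Compute the Metal dispatch shape implied by the model above:
    one threadgroup per (database vector, query) pair, each threadgroup
    a single SIMD group of 32 lanes that cooperatively reduces the
    per-dimension partial sums. Illustrative only."""
    threadgroups = (n_database, n_queries)
    threads_per_threadgroup = simd_width
    return threadgroups, threads_per_threadgroup
```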