feat: add simdgroup-optimized Metal kernels #172
cluster2600 wants to merge 31 commits into alibaba:main
Conversation
Discussion issue opened: #177 — feedback welcome before review.
@greptile |
Greptile Summary

This PR adds hardware-accelerated Metal compute kernels for Apple Silicon, introducing 6 new simdgroup-optimized kernels that leverage cooperative SIMD intrinsics (simd_sum, simd_min, simd_shuffle).

Key Changes
Issues Found
The implementation follows proper Metal Shading Language conventions and correctly uses simdgroup cooperative reductions for hardware acceleration.

Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Vector Search Request] --> Detect[Hardware Detection]
    Detect --> CheckCppCuvs{C++ cuVS<br/>Available?}
    CheckCppCuvs -->|Yes| UseCppCuvs[Use C++ cuVS<br/>CAGRA/IVF-PQ]
    CheckCppCuvs -->|No| CheckPyCuvs{Python cuVS<br/>Available?}
    CheckPyCuvs -->|Yes| UsePyCuvs[Use Python cuVS<br/>CuPy arrays]
    CheckPyCuvs -->|No| CheckFaissGpu{FAISS GPU +<br/>NVIDIA GPU?}
    CheckFaissGpu -->|Yes| UseFaissGpu[Use FAISS GPU]
    CheckFaissGpu -->|No| CheckMps{Apple Silicon<br/>MPS?}
    CheckMps -->|Yes| UseMps[Use Metal Kernels<br/>simdgroup ops]
    CheckMps -->|No| CheckFaissCpu{FAISS CPU?}
    CheckFaissCpu -->|Yes| UseFaissCpu[Use FAISS CPU<br/>+ Accelerate]
    CheckFaissCpu -->|No| UseNumpy[Fallback to NumPy]
    UseCppCuvs --> Execute[Execute Search]
    UsePyCuvs --> Execute
    UseFaissGpu --> Execute
    UseMps --> MetalKernels[Metal Kernels:<br/>L2/cosine/topk]
    MetalKernels --> Execute
    UseFaissCpu --> Execute
    UseNumpy --> Execute
    Execute --> Results[Return distances<br/>+ indices]
    style UseMps fill:#a8dadc
    style MetalKernels fill:#a8dadc
    style UseCppCuvs fill:#f1faee
    style UsePyCuvs fill:#f1faee
```

Last reviewed commit: b08a835
Internal sprint work - not for upstream PR:

- backends/detect.py: hardware detection
- backends/gpu.py: FAISS GPU integration
- backends/quantization.py: Product Quantization
- backends/opq.py: OPQ + Scalar Quantization
- backends/search.py: search optimization
- backends/hnsw.py: HNSW implementation
- backends/apple_silicon.py: Apple Silicon optimization
- backends/benchmark.py: benchmarks
- ShardManager for vector sharding
- DistributedIndex with scatter-gather queries
- QueryRouter for routing strategies
- ResultMerger for merging results from shards
- Support for hash, range, and random sharding
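The scatter-gather pattern in that commit can be sketched in a few lines. This is a minimal, pure-Python illustration only; the class and method names (HashShardRouter, add, search) are hypothetical stand-ins, not the PR's actual ShardManager/DistributedIndex API.

```python
# Hedged sketch of hash-sharded scatter-gather search.
# All names here are illustrative, not the PR's real API.
import heapq


class HashShardRouter:
    """Route vectors to shards by hashing their ids (one of the
    hash/range/random strategies mentioned in the commit)."""

    def __init__(self, n_shards):
        self.n_shards = n_shards
        self.shards = [dict() for _ in range(n_shards)]  # id -> vector

    def add(self, vec_id, vector):
        # Scatter on insert: the hash decides which shard owns the vector.
        self.shards[hash(vec_id) % self.n_shards][vec_id] = vector

    def search(self, query, k):
        # Scatter the query to every shard, gather local candidates,
        # then merge into a single global top-k (the ResultMerger role).
        candidates = []
        for shard in self.shards:
            for vec_id, vec in shard.items():
                dist = sum((q - v) ** 2 for q, v in zip(query, vec))
                candidates.append((dist, vec_id))
        return heapq.nsmallest(k, candidates)
```

A real implementation would keep a per-shard index (HNSW, IVF-PQ) and only merge each shard's local top-k, but the routing and merge structure is the same.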
- Add README.md with full API documentation
- Add BENCHMARK_README.md with benchmark results
- Add test_backends.py with comprehensive tests
- Adjust k to avoid sampling errors
- Simplify k-means implementation
- Fix codebooks shape
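"Adjust k to avoid sampling errors" most likely refers to clamping the requested cluster count to the number of available training points before sampling initial centroids. A hedged sketch, assuming that interpretation (the function name is illustrative, not the PR's code):

```python
import random


def safe_kmeans_init(points, k, seed=0):
    """Pick k initial centroids for k-means, clamping k to len(points)
    so random.sample() never raises 'sample larger than population'.
    Illustrative sketch of the 'adjust k' fix, not the actual code."""
    k = max(1, min(k, len(points)))
    rng = random.Random(seed)
    return rng.sample(points, k)
```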
Based on cuVS documentation:
- Support for CAGRA, IVF-PQ, HNSW algorithms
- 12x faster builds, 8x lower latency target
- Dynamic batching for CAGRA
Based on cuVS documentation:
- IVF-PQ: 12x faster builds, 8x lower latency
- CAGRA: 10x latency improvement with dynamic batching, 8x throughput
- Both support fallback when cuVS is not available
- 9x speedup target vs CPU
- Compatible with DiskANN
Based on arXiv:2401.11324:
- Synthetic clustered data generation
- FAISS CPU/GPU/IVF-PQ benchmarks
- cuVS placeholder benchmarks
- Results output to markdown
S3: GPU-PIM collaboration research
S4: Memory coalescing kernel (2-8x speedup)
S5: Apple ANE optimization guide
S6: ANE vs MPS benchmark
S7: Graph reordering (15% QPS gain)
S8: PIM evaluation framework

All based on scientific papers.
1. cuVS C++ bindings (zvec_cuvs.h)
   - IVFPQ, CAGRA, HNSW index classes
   - Template-based for float/uint8_t/int8_t
2. CUDA coalesced kernels (coalesce.cuh, coalesce.cu)
   - Coalesced L2 distance (2-8x speedup)
   - Warp-level reductions
   - FP16 support
   - Tiled shared memory version
3. Metal MPS kernels (distance.metal)
   - L2 distance with SIMD/NEON
   - FP16 support for Apple Silicon
   - Batch processing
   - Matrix multiplication

All based on scientific papers.
1. SIMD CPU optimization (simd_distance.h)
   - SSE2, AVX2 for x86
   - NEON for ARM/Apple Silicon
   - 4-16x speedup expected
2. CMake build system (CMakeLists.txt)
   - CUDA coalesced kernels
   - Metal shaders
   - SIMD CPU
   - Optional cuVS integration
3. Graph-based ANN (graph_ann.h)
   - CAGRA-like implementation
   - NN-Descent graph construction
   - Hierarchical search
1. FastScan (simd_distance.h)
   - SIMD-optimized Product Quantization
   - AVX2 distance computation
   - Bitonic sort for k-selection
2. Vamana Graph (vamana.h)
   - DiskANN algorithm
   - Robust to search parameters
   - Used in Azure AI Search
3. NUMA-aware (numa.h)
   - Per-NUMA-node allocation
   - Work-stealing thread pool
   - 6-20x speedup on multi-socket

Based on papers:
- Quake (OSDI 2025): NUMA-aware partitioning
- FAISS (2024): FastScan SIMD optimization
- DiskANN: Vamana graph
1. Lock-free concurrent structures (lockfree.h)
   - LockFreeVector (Stroustrup design)
   - AtomicIndex for HNSW
   - Hazard pointer reclamation
2. Memory pool optimizations (memory_pool.h)
   - Aligned allocator (cache-line, huge pages)
   - Object pool
   - Slab allocator
   - SoA layout
3. Batch processing (batch.h)
   - Transposed matrix for PQ (30-50% faster)
   - Loop unrolling
   - AVX-512 support
   - PQ distance tables

Based on:
- FAISS optimization guide
- Stroustrup lock-free vector
- OptiTrust paper (2024)
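The "PQ distance tables" idea that several of these commits build on (FastScan, batch.h) reduces distance computation to table lookups. A minimal pure-Python sketch of asymmetric distance computation (ADC); the function names and layouts here are illustrative, not the repo's C++ API:

```python
def build_pq_distance_table(query, codebooks):
    """For each PQ subspace, precompute the squared distance from the
    query's sub-vector to every centroid in that subspace's codebook.
    codebooks: [n_subspaces][n_centroids][sub_dim]. Illustrative sketch."""
    n_sub = len(codebooks)
    sub_dim = len(codebooks[0][0])
    tables = []
    for s in range(n_sub):
        q_sub = query[s * sub_dim:(s + 1) * sub_dim]
        tables.append([sum((q - c) ** 2 for q, c in zip(q_sub, cent))
                       for cent in codebooks[s]])
    return tables


def pq_distance(code, tables):
    """Distance to an encoded database vector is then a pure table
    lookup per subspace plus a sum (no per-vector float math)."""
    return sum(tables[s][c] for s, c in enumerate(code))
```

The SIMD versions (AVX2 FastScan, transposed batch layout) vectorize exactly this lookup-and-accumulate loop; the table structure is unchanged.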
Add 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes without shared memory barriers:

- metal_l2_distance_simdgroup: cooperative L2 distance
- metal_inner_product_simdgroup: cooperative dot product
- metal_cosine_similarity_simdgroup: normalized inner product
- metal_topk_simdgroup: per-query top-k selection via simd_min
- metal_matmul_tiled: tiled matmul with threadgroup shared memory
- metal_normalize_simdgroup: in-place L2 normalization

Also fixes existing kernels:
- Replace simd_make_float4 with float4 constructor (MSL compliance)
- Add device address space qualifiers in batch kernel

Tested: compiles cleanly with metal -std=metal3.1 -W -Werror.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
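The barrier-free reduction these kernels rely on can be modeled outside Metal. The sketch below is a pure-Python model of what a 32-lane simd_sum-style shuffle-down tree does, for readers without the Metal toolchain; it is an illustration of the reduction pattern, not the PR's shader code.

```python
def simd_sum_model(lanes):
    """Model a 32-lane shuffle-down tree reduction: each step, lane i
    adds the value held by lane i+offset, and the offset halves
    (16, 8, 4, 2, 1). In hardware this needs no threadgroup barrier
    because all 32 lanes execute in lockstep within one SIMD group."""
    assert len(lanes) == 32
    vals = list(lanes)
    offset = 16
    while offset >= 1:
        # Ascending lane order means every read of vals[lane + offset]
        # still sees that lane's pre-step value, matching lockstep HW.
        for lane in range(32):
            partner = lane + offset
            if partner < 32:
                vals[lane] = vals[lane] + vals[partner]
        offset //= 2
    # After log2(32) = 5 steps, lane 0 holds the full sum
    # (simd_sum additionally broadcasts it to all lanes).
    return vals[0]
```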
- cuvs_cagra.py: use cagra.build(IndexParams, dataset) and cagra.search(SearchParams, index, queries, k) instead of the non-existent Index().build() / Index().search() methods
- cuvs_ivf_pq.py: same pattern fix, plus correct import path (cuvs.neighbors.ivf_pq instead of cuvs.ivf_pq)
- Both backends now convert numpy queries to cupy device arrays before search (cuVS requires CUDA-compatible memory)

Tested on RTX 4090:
- cuVS CAGRA: 43K QPS (50K vectors, dim=128)
- cuVS IVF-PQ: 45K QPS (50K vectors, dim=128)
- FAISS GPU: 529K QPS (50K vectors, dim=128, flat)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Add CUVS_AVAILABLE and CPP_CUVS_AVAILABLE flags to detect.py.

Update get_optimal_backend() priority chain:
C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
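The priority chain is a simple first-match walk over availability flags. A hedged sketch, assuming a dict of booleans; the backend names are illustrative stand-ins for the CPP_CUVS_AVAILABLE / CUVS_AVAILABLE-style flags in detect.py, not its exact API:

```python
def get_optimal_backend_sketch(flags):
    """Sketch of the commit's priority chain:
    C++ cuVS > Python cuVS > FAISS GPU > MPS > FAISS CPU > NumPy.
    `flags` maps backend name -> availability bool (illustrative names)."""
    for backend in ("cpp_cuvs", "py_cuvs", "faiss_gpu", "mps", "faiss_cpu"):
        if flags.get(backend, False):
            return backend
    return "numpy"  # always-available fallback, never gated on a flag
```

Keeping the chain in one ordered tuple makes the precedence auditable at a glance and trivial to extend when a new backend lands.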
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
- Fix F821 (undefined np): add module-level numpy imports for type annotations
- Fix PLC0415: add noqa for intentional lazy imports inside functions
- Fix G004: convert f-string logging to lazy % formatting
- Fix NPY002: add noqa for legacy numpy random calls in benchmarks
- Fix ARG001/ARG002: prefix unused args with underscore
- Fix PTH123: use Path.open() instead of open()
- Fix I001: sort imports in __init__.py
- Exclude *.ipynb from ruff (demo/benchmark notebooks)

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
The GPU CMakeLists.txt requires the CUDA toolkit (nvcc), which is not available on CI runners. The C++ headers are header-only and do not need a separate build system.

Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
- Remove spurious .T in asymmetric_distance_computation() that caused a broadcast shape mismatch (10,100) vs (100,10)
- Fix off-by-one in test_distributed_index: assert shard count == 4 instead of checking for non-existent shard index 4
- Skip TestGPUIndex when FAISS is not installed instead of raising RuntimeError

Signed-off-by: Maxime Kawawa-Beaudan <maxkb@meta.com>
Signed-off-by: Maxime Grenu <maxime.grenu@gmail.com>
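The .T bug is easy to reproduce: in asymmetric distance computation the codes array must be indexed as (n_db, n_subspaces), and transposing it flips the axes so the fancy-indexing broadcast fails with exactly the kind of (10,100) vs (100,10) mismatch the commit describes. A hedged numpy sketch of the corrected shape discipline (function name and layouts are illustrative, not the repo's code):

```python
import numpy as np


def asymmetric_distances_sketch(query_tables, codes):
    """Sum, for each database vector, the per-subspace table entries
    selected by its PQ codes.
    query_tables: (n_subspaces, n_centroids) float array
    codes:        (n_db, n_subspaces) int array -- NOT transposed
    Returns (n_db,) distances. Adding a spurious codes.T here swaps
    the axes and breaks the broadcast, as in the bug fixed above."""
    n_sub = query_tables.shape[0]
    # Row index (n_sub,) broadcasts against codes (n_db, n_sub):
    # result[i, s] = query_tables[s, codes[i, s]]
    return query_tables[np.arange(n_sub), codes].sum(axis=1)
```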
Force-pushed d696fb0 to 7bf5a5e
Pushed a fix for the macos-arm64 test failure.

Root cause: several shared base commits inadvertently modified files unrelated to the Metal simdgroup kernels.

Fix: reverted those unrelated changes. This should resolve the macos-arm64 failure.
Summary

Adds 6 new Metal compute kernels using simdgroup cooperative intrinsics (simd_sum, simd_min, simd_shuffle) for hardware-accelerated reductions across 32 SIMD lanes, with no shared memory barriers needed.

Follow-up to #166 ("Future Work: SIMD optimization").

New kernels

- metal_l2_distance_simdgroup: cooperative L2 distance
- metal_inner_product_simdgroup: cooperative dot product
- metal_cosine_similarity_simdgroup: normalized inner product
- metal_topk_simdgroup: top-k selection via simd_min lane voting
- metal_matmul_tiled: tiled matmul with threadgroup shared memory
- metal_normalize_simdgroup: in-place L2 normalization

Dispatch model

- (n_database, n_queries) threadgroups

Fixes to existing kernels

- simd_make_float4 → float4 constructor (MSL compliance)
- device address space qualifiers in metal_l2_distance_batch

Merge order

Test plan

- metal -std=metal3.1 -W -Werror on macOS with Xcode Metal toolchain
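The dispatch model above maps one threadgroup to each (database vector, query) pair, with each threadgroup holding a single 32-lane SIMD group that reduces along the vector dimension. A tiny host-side sketch of that grid computation (an illustrative helper, not the PR's actual host code):

```python
def dispatch_grid_sketch(n_database, n_queries, simd_width=32):
    """Compute the Metal dispatch shape implied by the model above:
    one threadgroup per (database vector, query) pair, each threadgroup
    a single SIMD group of 32 lanes that cooperatively reduces the
    per-dimension partial sums. Illustrative only."""
    threadgroups = (n_database, n_queries)
    threads_per_threadgroup = simd_width
    return threadgroups, threads_per_threadgroup
```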