
Add MPS (Apple Metal) platform support#1

Draft
robtaylor wants to merge 6 commits into main from mps-platform-support

Conversation

@robtaylor

Summary

  • Add MPS platform detection so vLLM uses the Apple Silicon GPU instead of falling back to CPU on macOS
  • Pure PyTorch attention backend with paged KV cache (no C++ extensions needed)
  • MPS worker and model runner extending the GPU base classes with CUDA stub wrappers
  • CustomOp dispatch for forward_mps() falling back to forward_native()
  • CI updated to use macos-15-xlarge runner with MPS platform assertion

New files

  • vllm/platforms/mps.py — MPS platform class
  • vllm/v1/attention/backends/mps_attn.py — Pure PyTorch attention with paged KV cache
  • vllm/v1/worker/mps_model_runner.py — MPS model runner
  • vllm/v1/worker/mps_worker.py — MPS worker

Modified files

  • vllm/platforms/interface.py — PlatformEnum.MPS, is_mps()
  • vllm/platforms/__init__.py — MPS plugin, CPU plugin mutual exclusion fix
  • vllm/model_executor/custom_op.py — forward_mps() dispatch
  • vllm/v1/attention/backends/registry.py — MPS_ATTN enum
  • vllm/config/device.py — "mps" in Device literal
  • .github/workflows/macos-smoke-test.yml — xlarge runner, PR trigger, MPS verification
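The CustomOp dispatch change above can be illustrated with a minimal sketch. This is not the actual vLLM implementation; the class body and the `platform` argument are stand-ins showing the fallback pattern: `forward_mps()` defaults to `forward_native()` unless an op overrides it.

```python
class CustomOp:
    """Sketch of platform dispatch with a native fallback (illustrative)."""

    def forward_native(self, x):
        # Pure-PyTorch reference path; a stand-in computation here.
        return x * 2

    def forward_mps(self, x):
        # MPS override point; by default, fall back to the native path.
        return self.forward_native(x)

    def forward(self, x, platform="mps"):
        # Minimal dispatch: take the MPS path when running on Apple Metal.
        if platform == "mps":
            return self.forward_mps(x)
        return self.forward_native(x)
```

Ops that have an MPS-specific kernel override `forward_mps()`; everything else transparently reuses the native implementation.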

Test plan

  • CI: MPS platform detection assertion passes on macos-15-xlarge
  • CI: vllm serve with dummy weights starts and responds on MPS
  • Local: verified current_platform.is_mps() == True on Apple Silicon
  • Local: all new and modified files pass py_compile
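The platform resolution behind the detection assertion and the CPU/MPS mutual exclusion fix can be sketched as a simple priority function (hypothetical logic, not the actual plugin code):

```python
def select_platform(mps_available: bool, cuda_available: bool) -> str:
    """Illustrative resolution: CUDA wins if present; on macOS with an
    Apple Silicon GPU, MPS is chosen over the CPU plugin so the two
    plugins never both claim the machine."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

With this ordering, `current_platform.is_mps()` is true exactly when MPS is available and no CUDA device is.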

@robtaylor robtaylor force-pushed the mps-platform-support branch 9 times, most recently from fd871bc to fa9b5e4 Compare March 10, 2026 00:40
Add a minimal viable MPS platform so vLLM can detect and use Apple
Silicon GPUs via the Metal Performance Shaders backend. This enables
model loading and inference on macOS without CUDA.

New files:
- vllm/platforms/mps.py: MPS platform class (device detection, memory
  APIs, config validation)
- vllm/v1/attention/backends/mps_attn.py: Pure PyTorch attention with
  paged KV cache (no C++ extensions needed)
- vllm/v1/worker/mps_model_runner.py: MPS model runner extending
  GPUModelRunner with CUDA stub wrappers
- vllm/v1/worker/mps_worker.py: MPS worker with gloo distributed
  backend
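The paged KV cache bookkeeping in the pure-PyTorch attention backend boils down to mapping logical token positions through a block table. A minimal sketch of that index arithmetic (names and block size are illustrative):

```python
def physical_slot(block_table, token_pos, block_size=16):
    """Map a logical token position to a physical cache slot: the block
    table gives the physical block, the remainder the offset within it."""
    logical_block, offset = divmod(token_pos, block_size)
    return block_table[logical_block] * block_size + offset
```

Because this is plain index arithmetic, it runs as ordinary PyTorch tensor ops on MPS with no C++ extension.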

Modified files:
- PlatformEnum.MPS added to interface.py with is_mps() method
- MPS platform plugin in __init__.py; CPU plugin updated to avoid
  mutual exclusion on macOS
- forward_mps() dispatch added to CustomOp
- MPS_ATTN registered in attention backend registry
- "mps" added to Device literal type

Co-developed-by: Claude Code v2.1.50 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
- test_llama_7b_bfloat16_generation: Run Llama-7B inference with BF16 on MPS
- test_llama_7b_float16_generation: Run Llama-7B inference with FP16 on MPS
- These tests validate real-world inference performance with Metal kernels
- Includes memory utilization and generation quality checks

These are the primary E2E validation tests for the vLLM MPS platform
integration with Hub Metal kernels.

Co-developed-by: Claude Code v2.0.76 (claude-haiku-4-5-20251001)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
- benchmark_mps_vs_llamacpp.py: Measure throughput, latency, memory usage
- Supports BF16, FP16, FP32 precision
- Configurable prompt/token count for flexible benchmarking
- Outputs metrics: tokens/sec, ms/token, peak GPU memory
- Includes instructions for running equivalent llama.cpp benchmark

This enables quantitative E2E validation against llama.cpp Metal backend.
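The two headline metrics the benchmark reports are straightforward to derive; a sketch of the computation (function name is illustrative):

```python
def throughput_metrics(num_tokens, elapsed_s):
    """Headline benchmark metrics: overall throughput and per-token latency."""
    return {
        "tokens_per_sec": num_tokens / elapsed_s,
        "ms_per_token": elapsed_s * 1000.0 / num_tokens,
    }
```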

Co-developed-by: Claude Code v2.0.76 (claude-haiku-4-5-20251001)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
Branch AWQ apply() and GPTQ process_weights_after_loading()/apply()
on is_mps() to use dequant+matmul instead of CUDA-only fused kernels.

On MPS, GPTQ skips gptq_shuffle (exllama reorder) and dequantizes
from the original checkpoint layout. AWQ uses its native interleaved
bit order directly.
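The branching described above reduces to choosing between two code paths at apply time. A hedged sketch with callables standing in for the real kernels (none of these names are the actual vLLM functions):

```python
def quantized_apply(x, layer, *, is_mps, fused_matmul, dequant_matmul):
    """On MPS there are no CUDA fused kernels, so dequantize to a dense
    weight and use a plain matmul; elsewhere keep the fused path."""
    if is_mps:
        return dequant_matmul(x, layer)
    return fused_matmul(x, layer)
```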

The mps_dequant.py wrapper tries to import the dequant_int4 Metal
kernel package for GPU-accelerated dequant, falling back to pure
PyTorch bitwise operations when the package isn't installed.
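The pure-PyTorch bitwise fallback amounts to splitting each packed byte into two 4-bit values and applying scale and zero point. A pure-Python stand-in (nibble order and the zero point of 8 are illustrative assumptions, not the checkpoint's actual layout):

```python
def dequant_int4_fallback(packed, scale, zero=8):
    """Unpack two 4-bit values per byte (low nibble first, assumed),
    subtract the zero point, and apply the group scale."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append((nibble - zero) * scale)
    return out
```

The real fallback does the same with vectorized torch bitwise ops so it runs on the MPS device rather than element by element.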

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
Add Metal kernel path for GGUF quantized models on MPS (Apple Metal).
Implements dequant+matmul for Q4_0, Q8_0, and Q4_K types via the
dequant_gguf kernel package, with a numpy-based fallback using the
gguf Python library.
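For the simplest of the supported types, Q4_0, the dequantization of one 32-element block can be sketched as follows (the low-nibbles-first layout and the offset of 8 follow the GGUF/ggml convention, treated here as an assumption):

```python
def dequant_q4_0(scale, qs):
    """Dequantize one GGUF Q4_0 block: 16 packed bytes hold 32 weights,
    low nibbles are elements 0..15, high nibbles 16..31, each offset by
    8 and multiplied by the block's fp16 scale."""
    lo = [((b & 0xF) - 8) * scale for b in qs]
    hi = [((b >> 4) - 8) * scale for b in qs]
    return lo + hi
```

Q8_0 and Q4_K follow the same dequant-then-matmul shape with different block layouts; the Metal kernel path fuses these steps on device.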

Changes:
- gguf.py: Add MPS branch in _fused_mul_mat_gguf and _apply_gguf_embedding
  to route through gguf_dequant_on_mps instead of CUDA ops
- gguf.py: Fix get_supported_act_dtypes and get_min_capability for MPS
- mps_dequant.py: Add GGUF section with Metal kernel import, numpy
  fallback, and gguf_dequant_on_mps entry point

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
Add MPS as a GPU backend tab in the installation docs alongside
CUDA, ROCm, and XPU. Covers requirements, building from source,
optional Metal quantization kernels, usage examples, performance
expectations, memory guidelines, and troubleshooting.

Update cpu.apple.inc.md to point to the new GPU/MPS docs instead
of the external vllm-metal project.

Co-developed-by: Claude Code v2.1.58 (claude-opus-4-6)
Signed-off-by: Rob Taylor <rob.taylor@chipflow.io>
@robtaylor robtaylor force-pushed the mps-platform-support branch from fa9b5e4 to 6102f77 Compare March 10, 2026 18:43