Cross‑platform AI inference engine — runs on every CPU and GPU, every OS.
Based on flash-moe by Daniel Woods — extended by AHX47 to work universally.
Flash-MoE is a bare‑metal inference engine that can run giant AI models (397 billion parameters) locally, without cloud, without Python, without frameworks. The original works only on Apple Silicon Macs.
This fork makes it universal:
| Platform | CPU | GPU |
|---|---|---|
| macOS Intel (x86_64) | ✅ AVX2+FMA | ✅ OpenCL |
| macOS Apple Silicon (arm64) | ✅ NEON | ✅ Metal |
| Linux x86_64 | ✅ AVX2+FMA / AVX‑512 | ✅ CUDA / OpenCL / Vulkan |
| Linux ARM64 (Raspberry Pi, Jetson) | ✅ NEON | ✅ OpenCL |
| Linux ARMv7 | ✅ NEON (32‑bit) | — |
| Linux RISC‑V | ✅ Scalar | — |
| Windows x86_64 | ✅ AVX2+FMA | ✅ CUDA / OpenCL |
| Windows ARM64 | ✅ NEON | ✅ OpenCL |
| Model | Params | Active | RAM | Type |
|---|---|---|---|---|
| Qwen3.5-397B-A17B | 397B | 17B/tok | 48GB | MoE |
| Qwen2.5-72B | 72B | 72B | 24GB+ | Dense |
| Qwen2.5-7B | 7B | 7B | 8GB | Dense |
| Qwen2.5-1.5B | 1.5B | 1.5B | 2GB | Dense |
| Llama-3.1-70B | 70B | 70B | 24GB+ | Dense |
| Llama-3.1-8B | 8B | 8B | 8GB | Dense |
| Llama-3.2-3B | 3B | 3B | 4GB | Dense |
| Mistral-7B | 7B | 7B | 8GB | Dense |
| Mixtral-8x7B | 47B | 12.9B | 16GB | MoE |
| Phi-4 | 14B | 14B | 12GB | Dense |
| Gemma-2-9B | 9B | 9B | 8GB | Dense |
And any HuggingFace model with config.json.
# Linux / macOS
git clone https://github.com/AHX47/flash-moe-universal
cd flash-moe-universal
./build.sh # auto‑detects best backend
# Windows
build.bat
# Manual (any platform)
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)# Download a model (requires huggingface-cli)
pip install huggingface_hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ./qwen2.5-7b
# Extract weights to binary format
python scripts/extract_weights.py --model ./qwen2.5-7b --output ./qwen2.5-7b
# For MoE models, also repack experts:
python scripts/repack_experts.py --model ./qwen3.5-397b --output ./qwen3.5-397b# Single prompt
./build/flash_moe --model ./qwen2.5-7b --prompt "Explain quantum computing"
# Interactive chat
./build/flash_moe --model ./qwen2.5-7b --chat
# HTTP server (OpenAI‑compatible API)
./build/flash_moe --model ./qwen2.5-7b --serve 8080
# Force CPU only
./build/flash_moe --model ./qwen2.5-7b --backend cpu --prompt "Hello"
# With specific SIMD/GPU
./build/flash_moe --model ./qwen2.5-7b --backend opencl --prompt "Hello"
# System info
./build/flash_moe --infopip install PyQt5
python gui/app.pyThe GUI provides:
- Model Manager – browse local, download from HuggingFace
- Chat interface with streaming output
- Settings – temperature, top‑p, backend, threads
- System info and build console
flash-moe-universal/
├── src/
│ ├── platform/
│ │ └── platform.h # OS/Arch abstraction (mmap, pread, threads, timing)
│ ├── core/
│ │ ├── model_config.h # Dynamic config (any model 1M–397B)
│ │ ├── inference_engine.h # Public API
│ │ └── inference_engine.c # Forward pass, MoE routing, KV cache
│ ├── backends/
│ │ ├── cpu_backend.h # x86 AVX2, ARM NEON, scalar fallback
│ │ ├── kernels.cl # OpenCL GPU kernels
│ │ ├── metal_backend.m # Apple Metal (original flash-moe shaders)
│ │ └── cuda_backend.cu # NVIDIA CUDA kernels
│ └── cli/
│ └── main.c # CLI entry point
├── gui/
│ └── app.py # PyQt5 GUI
├── scripts/
│ ├── extract_weights.py # HF safetensors → binary
│ ├── repack_experts.py # MoE expert repacking
│ └── export_tokenizer.py # Tokenizer export
├── CMakeLists.txt # Cross‑platform build
├── build.sh # Linux/macOS build script
├── build.bat # Windows build script
├── LICENSE
├── README.md
└── CONTRIBUTING.md
| Hardware | Model | Speed |
|---|---|---|
| Intel i9-13900K (AVX2) | Qwen2.5-7B 4‑bit | ~15‑25 tok/s |
| Intel i7-1260P (AVX2) | Qwen2.5-3B 4‑bit | ~20‑35 tok/s |
| AMD Ryzen 9 7950X (AVX2) | Qwen2.5-7B 4‑bit | ~18‑30 tok/s |
| Apple M3 Max (NEON, no Metal) | Qwen2.5-7B 4‑bit | ~30‑50 tok/s |
| ARM64 Jetson Orin (NEON) | Llama-3.2-3B 4‑bit | ~8‑15 tok/s |
| Raspberry Pi 5 (ARM64) | Qwen2.5-0.5B 4‑bit | ~2‑5 tok/s |
| Hardware | Model | Speed |
|---|---|---|
| Apple M3 Max (Metal) | Qwen3.5-397B 4‑bit | 4.4 tok/s |
| NVIDIA RTX 4090 (CUDA) | Qwen2.5-70B 4‑bit | ~25 tok/s |
| NVIDIA RTX 3060 (CUDA) | Qwen2.5-7B 4‑bit | ~35 tok/s |
| AMD RX 7900 XTX (OpenCL) | Qwen2.5-7B 4‑bit | ~20 tok/s |
- Expert streaming from SSD – Only load the 4 active experts per token, not all 512
- Trust the OS – No custom expert cache; OS page cache achieves ~71% hit rate naturally
- Pipeline GPU+CPU+IO – Submit GPU commands without waiting; overlap with next layer’s I/O
- FMA optimization –
fma(nibble, scale*x, bias*x)reduces dequant+multiply to one instruction - Multi‑architecture SIMD – AVX2 on x86, NEON on ARM, scalar fallback everywhere
#include "src/core/inference_engine.h"
// 1. Set up config (auto‑detect from model name)
ModelConfig cfg;
model_config_init(&cfg);
model_config_apply_preset(&cfg, "qwen2.5-7");
// or: model_config_load_json(&cfg, "path/to/config.json");
// 2. Create context
InferOptions opts = { .backend = BACKEND_AUTO, .num_threads = 8 };
InferContext *ctx = infer_create(&cfg, &opts);
// 3. Load weights
infer_load_weights(ctx, "/path/to/model");
// 4. Generate
infer_generate(ctx, "Hello, world!", 100, 0.7f, 0.9f,
[](const InferToken *tok, void *ud) -> bool {
printf("%s", tok->token_text);
return !tok->is_eos;
}, NULL);
// 5. Cleanup
infer_destroy(ctx);| Feature | Original | Universal (AHX47) |
|---|---|---|
| OS | macOS only | macOS, Linux, Windows |
| CPU | Apple Silicon ARM64 | Intel x86_64, AMD64, ARM64, ARMv7, RISC‑V |
| GPU | Metal only | Metal, OpenCL, CUDA, Vulkan |
| Language | Objective‑C + Metal | C11 + OpenCL/CUDA/Vulkan + optional Metal |
| Build | Makefile (macOS) | CMake (cross‑platform) |
| Models | Qwen3.5‑397B only | Any HuggingFace model 1M–397B |
| Config | Hardcoded constants | Dynamic from config.json |
| GUI | None | PyQt5 (all platforms) |
| Threads | GCD | pthreads / std::thread |
| I/O | pread + GCD | pread (POSIX) / ReadFile (Windows) |
Build:
- CMake 3.15+
- C11 compiler: GCC 9+, Clang 10+, MSVC 2019+
- (Optional) OpenCL SDK for GPU acceleration
- (Optional) CUDA Toolkit 11+ for NVIDIA
Python scripts:
- Python 3.8+
torch,safetensors,numpy
GUI:
- Python 3.8+
PyQt5- (Optional)
psutilfor RAM info
MIT – same as original flash-moe.
Credits:
- Daniel Woods – original Flash-MoE
- AHX47 – universal port, cross‑platform extensions, GUI, CMake build system
- Claude – coding partner
Contributions are welcome! Please read CONTRIBUTING.md to get started.
You can report issues or request features via GitHub Issues.
