🚀 Flash-MoE Universal(UniFlash-MoE)

Cross‑platform AI inference engine — runs on every CPU and GPU, every OS.

Based on flash-moe by Daniel Woods — extended by AHX47 to work universally.

What is this?

Flash-MoE is a bare‑metal inference engine that can run giant AI models (397 billion parameters) locally, without cloud, without Python, without frameworks. The original works only on Apple Silicon Macs.

This fork makes it universal:

Platform	CPU	GPU
macOS Intel (x86_64)	✅ AVX2+FMA	✅ OpenCL
macOS Apple Silicon (arm64)	✅ NEON	✅ Metal
Linux x86_64	✅ AVX2+FMA / AVX‑512	✅ CUDA / OpenCL / Vulkan
Linux ARM64 (Raspberry Pi, Jetson)	✅ NEON	✅ OpenCL
Linux ARMv7	✅ NEON (32‑bit)	—
Linux RISC‑V	✅ Scalar	—
Windows x86_64	✅ AVX2+FMA	✅ CUDA / OpenCL
Windows ARM64	✅ NEON	✅ OpenCL

Models Supported

Model	Params	Active	RAM	Type
Qwen3.5-397B-A17B	397B	17B/tok	48GB	MoE
Qwen2.5-72B	72B	72B	24GB+	Dense
Qwen2.5-7B	7B	7B	8GB	Dense
Qwen2.5-1.5B	1.5B	1.5B	2GB	Dense
Llama-3.1-70B	70B	70B	24GB+	Dense
Llama-3.1-8B	8B	8B	8GB	Dense
Llama-3.2-3B	3B	3B	4GB	Dense
Mistral-7B	7B	7B	8GB	Dense
Mixtral-8x7B	47B	12.9B	16GB	MoE
Phi-4	14B	14B	12GB	Dense
Gemma-2-9B	9B	9B	8GB	Dense

And any HuggingFace model with config.json.

Quick Start

1. Build

# Linux / macOS
git clone https://github.com/AHX47/flash-moe-universal
cd flash-moe-universal
./build.sh         # auto‑detects best backend

# Windows
build.bat

# Manual (any platform)
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)

2. Prepare Model

# Download a model (requires huggingface-cli)
pip install huggingface_hub
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir ./qwen2.5-7b

# Extract weights to binary format
python scripts/extract_weights.py --model ./qwen2.5-7b --output ./qwen2.5-7b

# For MoE models, also repack experts:
python scripts/repack_experts.py --model ./qwen3.5-397b --output ./qwen3.5-397b

3. Run

# Single prompt
./build/flash_moe --model ./qwen2.5-7b --prompt "Explain quantum computing"

# Interactive chat
./build/flash_moe --model ./qwen2.5-7b --chat

# HTTP server (OpenAI‑compatible API)
./build/flash_moe --model ./qwen2.5-7b --serve 8080

# Force CPU only
./build/flash_moe --model ./qwen2.5-7b --backend cpu --prompt "Hello"

# With specific SIMD/GPU
./build/flash_moe --model ./qwen2.5-7b --backend opencl --prompt "Hello"

# System info
./build/flash_moe --info

4. GUI (PyQt5)

pip install PyQt5
python gui/app.py

The GUI provides:

Model Manager – browse local, download from HuggingFace
Chat interface with streaming output
Settings – temperature, top‑p, backend, threads
System info and build console

Architecture

flash-moe-universal/
├── src/
│   ├── platform/
│   │   └── platform.h          # OS/Arch abstraction (mmap, pread, threads, timing)
│   ├── core/
│   │   ├── model_config.h      # Dynamic config (any model 1M–397B)
│   │   ├── inference_engine.h  # Public API
│   │   └── inference_engine.c  # Forward pass, MoE routing, KV cache
│   ├── backends/
│   │   ├── cpu_backend.h       # x86 AVX2, ARM NEON, scalar fallback
│   │   ├── kernels.cl          # OpenCL GPU kernels
│   │   ├── metal_backend.m     # Apple Metal (original flash-moe shaders)
│   │   └── cuda_backend.cu     # NVIDIA CUDA kernels
│   └── cli/
│       └── main.c              # CLI entry point
├── gui/
│   └── app.py                  # PyQt5 GUI
├── scripts/
│   ├── extract_weights.py      # HF safetensors → binary
│   ├── repack_experts.py       # MoE expert repacking
│   └── export_tokenizer.py     # Tokenizer export
├── CMakeLists.txt              # Cross‑platform build
├── build.sh                    # Linux/macOS build script
├── build.bat                   # Windows build script
├── LICENSE
├── README.md
└── CONTRIBUTING.md

Performance

CPU (no GPU)

Hardware	Model	Speed
Intel i9-13900K (AVX2)	Qwen2.5-7B 4‑bit	~15‑25 tok/s
Intel i7-1260P (AVX2)	Qwen2.5-3B 4‑bit	~20‑35 tok/s
AMD Ryzen 9 7950X (AVX2)	Qwen2.5-7B 4‑bit	~18‑30 tok/s
Apple M3 Max (NEON, no Metal)	Qwen2.5-7B 4‑bit	~30‑50 tok/s
ARM64 Jetson Orin (NEON)	Llama-3.2-3B 4‑bit	~8‑15 tok/s
Raspberry Pi 5 (ARM64)	Qwen2.5-0.5B 4‑bit	~2‑5 tok/s

GPU (planned / coming soon)

Hardware	Model	Speed
Apple M3 Max (Metal)	Qwen3.5-397B 4‑bit	4.4 tok/s
NVIDIA RTX 4090 (CUDA)	Qwen2.5-70B 4‑bit	~25 tok/s
NVIDIA RTX 3060 (CUDA)	Qwen2.5-7B 4‑bit	~35 tok/s
AMD RX 7900 XTX (OpenCL)	Qwen2.5-7B 4‑bit	~20 tok/s

Key Design Principles (from original flash-moe)

Expert streaming from SSD – Only load the 4 active experts per token, not all 512
Trust the OS – No custom expert cache; OS page cache achieves ~71% hit rate naturally
Pipeline GPU+CPU+IO – Submit GPU commands without waiting; overlap with next layer’s I/O
FMA optimization – fma(nibble, scale*x, bias*x) reduces dequant+multiply to one instruction
Multi‑architecture SIMD – AVX2 on x86, NEON on ARM, scalar fallback everywhere

API

#include "src/core/inference_engine.h"

// 1. Set up config (auto‑detect from model name)
ModelConfig cfg;
model_config_init(&cfg);
model_config_apply_preset(&cfg, "qwen2.5-7");
// or: model_config_load_json(&cfg, "path/to/config.json");

// 2. Create context
InferOptions opts = { .backend = BACKEND_AUTO, .num_threads = 8 };
InferContext *ctx = infer_create(&cfg, &opts);

// 3. Load weights
infer_load_weights(ctx, "/path/to/model");

// 4. Generate
infer_generate(ctx, "Hello, world!", 100, 0.7f, 0.9f,
    [](const InferToken *tok, void *ud) -> bool {
        printf("%s", tok->token_text);
        return !tok->is_eos;
    }, NULL);

// 5. Cleanup
infer_destroy(ctx);

Differences from Original flash-moe

Feature	Original	Universal (AHX47)
OS	macOS only	macOS, Linux, Windows
CPU	Apple Silicon ARM64	Intel x86_64, AMD64, ARM64, ARMv7, RISC‑V
GPU	Metal only	Metal, OpenCL, CUDA, Vulkan
Language	Objective‑C + Metal	C11 + OpenCL/CUDA/Vulkan + optional Metal
Build	Makefile (macOS)	CMake (cross‑platform)
Models	Qwen3.5‑397B only	Any HuggingFace model 1M–397B
Config	Hardcoded constants	Dynamic from config.json
GUI	None	PyQt5 (all platforms)
Threads	GCD	pthreads / std::thread
I/O	pread + GCD	pread (POSIX) / ReadFile (Windows)

Requirements

Build:

CMake 3.15+
C11 compiler: GCC 9+, Clang 10+, MSVC 2019+
(Optional) OpenCL SDK for GPU acceleration
(Optional) CUDA Toolkit 11+ for NVIDIA

Python scripts:

Python 3.8+
torch, safetensors, numpy

GUI:

Python 3.8+
PyQt5
(Optional) psutil for RAM info

License

MIT – same as original flash-moe.

Credits:

Daniel Woods – original Flash-MoE
AHX47 – universal port, cross‑platform extensions, GUI, CMake build system
Claude – coding partner

Contributing

Contributions are welcome! Please read CONTRIBUTING.md to get started.
You can report issues or request features via GitHub Issues.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.github/workflows		.github/workflows
docs		docs
gui		gui
scripts		scripts
src		src
CMakeLists.txt		CMakeLists.txt
README.md		README.md
build.bat		build.bat
build.sh		build.sh
has-anyone-tried-this-flash-moe-running-a-397b-parameter-v0-aipsbDL6RzWMY8YaJMD_OPdhrYChduqs4Rjw5we_vhg.webp		has-anyone-tried-this-flash-moe-running-a-397b-parameter-v0-aipsbDL6RzWMY8YaJMD_OPdhrYChduqs4Rjw5we_vhg.webp
launch_gui.bat		launch_gui.bat
launch_gui.sh		launch_gui.sh
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🚀 Flash-MoE Universal(UniFlash-MoE)

What is this?

Models Supported

Quick Start

1. Build

2. Prepare Model

3. Run

4. GUI (PyQt5)

Architecture

Performance

CPU (no GPU)

GPU (planned / coming soon)

Key Design Principles (from original flash-moe)

API

Differences from Original flash-moe

Requirements

License

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🚀 Flash-MoE Universal(UniFlash-MoE)

What is this?

Models Supported

Quick Start

1. Build

2. Prepare Model

3. Run

4. GUI (PyQt5)

Architecture

Performance

CPU (no GPU)

GPU (planned / coming soon)

Key Design Principles (from original flash-moe)

API

Differences from Original flash-moe

Requirements

License

Contributing

About

Topics

Resources

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages