rMLX

Rust-native, single-binary MLX inference + conversion backend for Apple Silicon.

No Python at runtime. One cargo build --release artifact. The widest weight × KV-quantization matrix any MLX server ships, including rotation-based KV families (TurboQuant, IsoQuant, PlanarQuant, RotorQuant, ParoQuant) that no other MLX server offers.

Status: 0.1.0 — first feature-complete native MLX backend. Apple Silicon only (Metal). See What works.

Why

| Pain | rMLX answer |
| --- | --- |
| `mlx_lm.server`: Python venv juggling, slow startup, no KV rotation | Rust + lifted Metal kernels, instant warm-start, zero Python at runtime |
| Multi-model Python servers: heavy deps, always-on | Single binary, load-on-demand / unload-on-idle lifecycle |
| Experimental quant forks (TurboQuant / PlanarQuant / ParoQuant) live in separate llama.cpp or Python trees | All first-class on one MLX path |

What works

Text generation — OpenAI-compatible /v1/chat/completions and /v1/completions, plus an Anthropic-compatible surface. Temperature, top-k/p, penalties, thinking-budget, constrained / schema-guided decoding.
Image input — vision-capable models (Gemma 4 SigLIP tower, Qwen3-VL-MoE deepstack) accept images via image_url content parts (data-URI, http, file path, or base64).
Audio input — audio transcription / translation endpoints for audio-capable models.
Embeddings — /v1/embeddings, including multimodal (text + image) jina-v4.
Tool / function calling — OpenAI tool_calls and Anthropic tool_use, multi-turn, multiple emit formats (Qwen XML, Hermes-JSON, Gemma).
Quantization — affine 2–8 bit, mxfp4 / mxfp8, nvfp4, ParoQuant weights; KV-cache quant incl. fp8, TurboQuant, RotorQuant, PlanarQuant, IsoQuant, paged-KV, mixed / asymmetric K/V, and an SSD KV tier.
Speculative decoding — MTP, DFlash, and Eagle3 drafters.
Prompt caching — automatic prefix caching with block hashing.
Conversion — rmlx convert re-quantizes / repacks MLX → MLX.
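
As one illustration of the tool-calling surface listed above, an OpenAI-style request declares its functions in a `tools` array. The following is a minimal sketch, assuming a server is already listening on port 8080 as described in the Run section. The `tool_payload` helper, the `get_weather` function, and its schema are invented for illustration, and the model name is a placeholder.

```shell
# Sketch only: an OpenAI-style request body that declares one tool.
# tool_payload, get_weather, and its parameters are hypothetical.
tool_payload() {
  cat <<'JSON'
{
  "model": "gemma-4-e4b",
  "messages": [{"role": "user", "content": "What is the weather in Oslo?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up current weather for a city.",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}
JSON
}

# Example call (not run here):
#   tool_payload | curl http://localhost:8080/v1/chat/completions \
#     -H 'content-type: application/json' -d @-
```

In the OpenAI format, a model that decides to call the tool answers with a `tool_calls` array in the assistant message instead of plain text content; the caller runs the function and sends the result back in a follow-up turn.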

Continuously smoke-tested end-to-end. The first four families carry committed golden-token decode gates (temp=0, exact token-id match); embeddings and the speculative drafters are validated end-to-end via their serving endpoints.

| Family | Example snapshot(s) | Arch |
| --- | --- | --- |
| Gemma 4 | `gemma-4-e2b/e4b-it-mxfp8`, `gemma-4-26b-a4b-it-mxfp8` (MoE), `gemma-4-31b-it-mxfp8` (dense) | `Gemma4ForConditionalGeneration` |
| Qwen 3.6 | `Qwen3.6-35B-A3B-8bit` | `Qwen3_5MoeForConditionalGeneration` |
| Bonsai | `Ternary-Bonsai-8B-mlx-2bit` | `Qwen3ForCausalLM` |
| BitNet | `bitnet-b1.58-2B-4T` | `BitNetForCausalLM` |
| Embeddings | `jina-embeddings-v4` (text + image) | `JinaEmbeddingsV4Model` |

Speculative-decoding drafters are validated against their verifiers via --draft-kind mtp: the Qwen 3.6 MTP sidecar (Qwen3.6-35B-A3B-MTP-5bit, verifier Qwen3.6-35B-A3B-8bit) and the Gemma 4 assistant drafter (gemma-4-E2B-it-assistant-bf16, verifier gemma-4-e2b-it-mxfp8).

Requirements

Apple Silicon Mac (M-series). Metal only — no CUDA / ROCm / x86.
Rust stable (1.95+).
MLX + mlx-c installed locally. rMLX links the stable mlx-c C ABI; it does not vendor or build MLX itself.

brew install mlx-c          # provides the MLX + mlx-c libraries

If your MLX install is not under the default Homebrew Cellar path, point the build at it:

export MLX_C_PREFIX="$(brew --prefix mlx-c)"   # dir containing lib/libmlxc.dylib + include/
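
Before building, it can help to confirm that the prefix really has the layout described in the comment above. This is a minimal sketch and not part of rMLX: `check_mlx_prefix` is a hypothetical helper name, and it only checks for `lib/libmlxc.dylib` and an `include/` directory.

```shell
# Hypothetical helper (not shipped with rMLX): verify that a directory
# looks like an mlx-c prefix, i.e. contains lib/libmlxc.dylib and include/.
check_mlx_prefix() {
  prefix="$1"
  if [ ! -f "$prefix/lib/libmlxc.dylib" ]; then
    echo "missing $prefix/lib/libmlxc.dylib" >&2
    return 1
  fi
  if [ ! -d "$prefix/include" ]; then
    echo "missing $prefix/include/" >&2
    return 1
  fi
  echo "ok: $prefix"
}
```

Running `check_mlx_prefix "$MLX_C_PREFIX"` prints `ok:` followed by the path on success, or names the missing piece on stderr and returns a non-zero status.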

Install

All paths build from source — rMLX links the system MLX/mlx-c libraries, so MLX must be present (brew install mlx-c). The build targets the installing machine's own chip, so a single method serves every Apple Silicon generation (M1–M5).

Script (ensures Rust + MLX, then builds):

curl -fsSL https://raw.githubusercontent.com/Pushkinist/rMLX/main/install.sh | bash

Prefer to inspect first (recommended for any curl | bash):

curl -fsSL https://raw.githubusercontent.com/Pushkinist/rMLX/main/install.sh -o install.sh
less install.sh && bash install.sh

Homebrew (via tap):

brew tap Pushkinist/rmlx
brew install rmlx

Cargo:

brew install mlx-c
MLX_C_PREFIX="$(brew --prefix mlx-c)" \
  cargo install --git https://github.com/Pushkinist/rMLX --bin rmlx rmlx-cli

Build

For development / from a clone:

git clone https://github.com/Pushkinist/rMLX
cd rMLX
cp .env.example .env          # set RMLX_O_MODELS_ROOT to your models folder
cargo build --release        # → target/release/rmlx

Or use the Makefile wrapper (keeps the local gate identical to CI):

make build      # cargo build --workspace --release
make ci         # fmt-check + clippy + test + deny + audit (pre-merge gate)

Run

Serve an MLX-format model directory (the mlx-community safetensors layout):

target/release/rmlx serve --model /path/to/mlx-community__gemma-4-e4b-it-mxfp8 --port 8080

Then call it like any OpenAI endpoint:

curl http://localhost:8080/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{
    "model": "gemma-4-e4b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "temperature": 0
  }'
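
For the vision-capable models listed under What works, the same endpoint accepts images through `image_url` content parts. The sketch below builds such a request with the image inlined as a data-URI. The `image_payload` helper name, the `image/png` MIME type, and the prompt text are illustrative assumptions, not something rMLX ships.

```shell
# Sketch only: build a chat-completions payload that carries an image as a
# data-URI in an image_url content part. The helper name, the image/png MIME
# type, and the prompt text are illustrative assumptions.
image_payload() {
  model="$1"
  image_file="$2"
  # Encode the file and strip newlines so the base64 fits on one JSON line.
  b64="$(base64 < "$image_file" | tr -d '\n')"
  printf '{"model":"%s","messages":[{"role":"user","content":[' "$model"
  printf '{"type":"text","text":"Describe this image."},'
  printf '{"type":"image_url","image_url":{"url":"data:image/png;base64,%s"}}' "$b64"
  printf ']}]}\n'
}

# Example call against the server started above (not run here):
#   image_payload gemma-4-e4b photo.png |
#     curl http://localhost:8080/v1/chat/completions \
#       -H 'content-type: application/json' -d @-
```

Piping the payload through `-d @-` keeps large base64 strings off the command line, where they could exceed the shell's argument length limit.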

Inspect a model's architecture + quantization without serving:

target/release/rmlx info --model /path/to/snapshot

See docs/CLI.md for the full subcommand / flag reference.

Documentation

| Doc | Topic |
| --- | --- |
| `docs/CLI.md` | CLI subcommands, flags, env vars |
| `docs/SERVER.md` | HTTP server: OpenAI / Anthropic compat, routes, tool calling |
| `docs/MODELS.md` | Per-architecture model reference |
| `docs/WEIGHT_QUANTS.md` | Weight quantization formats |
| `docs/KV_QUANT.md` | KV-cache quantization variants |
| `docs/KV_CACHE.md` | KV cache architecture |
| `docs/SPECULATIVE.md` | Speculative decoding (MTP / DFlash / Eagle3) |
| `docs/PROMPT_CACHE.md` | Prompt + automatic prefix caching |
| `docs/SAMPLING.md` | Per-token sampling + constrained decoding |
| `docs/FFI.md` | rmlx-mlx ↔ mlx-c FFI bridge |
| `docs/METRICS_DB.md` | Metrics DB schema + `rmlx metrics` |

CLAUDE.md carries the architecture overview and the workspace crate graph.

Non-goals

Not a GGUF runtime (that is llama.cpp's lane). MLX-format only; rMLX can re-quantize / convert MLX → MLX but never reads GGUF.
No training / fine-tune / fuse / LoRA-merge. Quantization and format conversion are in scope; training is not.
Multi-LoRA hot-swap per request is out of scope — fuse externally and load the merged snapshot.
Apple Silicon only — no CUDA, ROCm, or x86 SIMD paths.

Releasing

The version lives in exactly one place: [workspace.package].version in the root Cargo.toml. Member crates inherit it via version.workspace = true, and internal path deps omit a version (deny.toml sets allow-wildcard-paths).

1. Bump version in Cargo.toml under [workspace.package].
2. Run make ci and confirm it is green.
3. Run make tag, which derives v<version> from Cargo.toml and creates the annotated tag.
4. Run git push origin v<version>, then cut the GitHub release from the tag.
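
The tag derivation in the `make tag` step can be approximated in a few lines of shell. This is a sketch under stated assumptions: `workspace_tag` is a hypothetical helper, it expects the `version = "..."` line to sit directly under a `[workspace.package]` header with no trailing comment, and the real Makefile target may be implemented differently.

```shell
# Sketch only: read the version from [workspace.package] in a Cargo.toml and
# print the matching tag name. workspace_tag is a hypothetical helper; the
# real `make tag` target may be implemented differently.
workspace_tag() {
  awk '
    # Track whether the current TOML table is [workspace.package].
    /^\[/ { in_section = ($0 == "[workspace.package]") }
    in_section && /^version[[:space:]]*=/ {
      split($0, parts, "\"")   # parts[2] is the text between the quotes
      print "v" parts[2]
      exit
    }
  ' "$1"
}
```

The printed name can then be passed to `git tag -a` to create the annotated tag by hand, for example when checking what `make tag` would produce.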

License

Dual-licensed under either of

Apache License, Version 2.0 (LICENSE-APACHE)
MIT license (LICENSE-MIT)

at your option.
