GitHub - cklxx/arle: Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.

ARLE
Pure-Rust runtime for serving, local agents, training, and evaluation. infer is the OpenAI-compatible serving binary; arle is the unified front door.

Quick Start · HTTP API · Support Matrix · Architecture · Roadmap · Changelog · Contributing

English · 简体中文

Quick Start

1. Install

Apple Silicon — Homebrew (recommended):

brew install cklxx/tap/arle
arle --doctor

Apple Silicon or Linux x86_64 — one-line installer:

curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

The script grabs the matching tarball from the latest GitHub Release, SHA256-verifies it, and drops the binaries into ~/.local/bin (override with INSTALL_DIR=...). See docs/install.md for the full matrix, env-var overrides, and uninstall steps.
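To make the SHA256 step concrete, here is a minimal sketch of how a downloaded tarball can be checked against a published checksum. It is not the logic of install.sh itself; the function names `sha256_of` and `verify` are hypothetical, and how the expected checksum is obtained is left open.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so a large tarball need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter(callable, sentinel) keeps reading until read() returns b"" (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True when the file's digest matches the expected hex string.

    Surrounding whitespace and letter case in the expected value are ignored.
    """
    return sha256_of(path) == expected_hex.strip().lower()
```

A caller would abort the install when `verify(...)` returns False rather than unpacking the archive.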

Linux + NVIDIA — pull the published Docker image, no compile:

docker run --rm --gpus all -p 8000:8000 \
  -v /path/to/Qwen3.5-4B:/model:ro \
  ghcr.io/cklxx/arle:latest \
  serve --backend cuda --model-path /model --port 8000

The :latest tag tracks the newest non-prerelease release image. Tagged releases are published as ghcr.io/cklxx/arle:X.Y.Z (note: no v prefix, because the docker metadata-action strips it). The current release is ghcr.io/cklxx/arle:0.1.5.
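The tag rule above can be sketched as a small helper for scripts that pin an image from a release tag. This assumes release tags are named like v0.1.5; the helper name `image_ref` is hypothetical and not part of ARLE.

```python
def image_ref(git_tag, repo="ghcr.io/cklxx/arle"):
    """Map a release tag to its image reference by dropping one leading 'v'."""
    # Assumption: release tags look like "v0.1.5"; a tag without the prefix passes through.
    version = git_tag[1:] if git_tag.startswith("v") else git_tag
    return f"{repo}:{version}"
```

For example, `image_ref("v0.1.5")` yields the reference quoted above for the current release.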

From source (any backend; needed for cpu, CUDA/TileLang, or local hacking):

git clone https://github.com/cklxx/arle && cd arle
# Apple Silicon:
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle
# Linux + NVIDIA:
cargo build --release --features cuda --bin arle

2. Serve a model

arle serve --backend metal \
  --model-path mlx-community/Qwen3.5-0.8B-MLX-4bit --port 8000   # Apple Silicon
arle serve --backend cuda \
  --model-path /path/to/Qwen3.5-4B --port 8000                   # Linux + NVIDIA

3. Talk to it

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)

Or with curl: see examples/curl_chat.sh. More copy-paste paths: examples/.
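If you would rather not install the openai package, the same request can be sketched with the standard library only. This assumes the server accepts the usual OpenAI chat-completions JSON body at /v1/chat/completions without an Authorization header, and that the response has the same shape the SDK example reads; the helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_text, stream=False):
    """Build a POST request for an OpenAI-style chat completion (nothing is sent yet)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": stream,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_text, timeout=60.0):
    """Send the request and return the first choice's message text."""
    req = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumed response shape: the same fields the SDK example above accesses.
    return payload["choices"][0]["message"]["content"]


# Usage, with a server from step 2 listening on port 8000:
#   print(chat("http://localhost:8000/v1", "qwen3.5-4b", "Hello from ARLE"))
```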

4. Run the local agent

arle                                                       # interactive REPL with built-in tools
arle --model-path /path/to/Qwen3.5-4B run --prompt "Summarize this repo"   # one-shot
arle --doctor --json                                       # self-check, machine-readable

CPU-only smoke build (no GPU required, source build):

cargo build --release --no-default-features --features cpu,no-cuda,cli --bin arle
./target/release/arle --doctor

Status at a glance

Backend	Platform	Status	Notes
CUDA	Linux + NVIDIA	Stable	Continuous batching, paged KV, radix-backed reuse, TileLang BF16 attention, CUDA Graph decode. L4 / Qwen3.5-4B BF16 + FP8 KV: 197 tok/s @ c=16 / 4k-in.
Metal	Apple Silicon	Beta	Scheduler-backed serving, chunked prefill, replay prefix reuse. Qwen3.5-0.8B MLX-4bit step-driver: 305.5 tok/s on M4 Pro 20c.
Metal DFlash	Apple Silicon	Beta — default-on	Speculative decode for Qwen3.5. Qwen3.5-4B-4bit bit-identical, c=1..8.
CPU	Portable	Dev-only	Smoke tests and request-path validation; not a perf target.

Models: Qwen3.5 family (0.8B / 4B / 30B-A3B / 35B; dense, hybrid linear-attn, and MoE; GGUF Q4_K_M and 4B hybrid attention) on CUDA + Metal. Qwen3.6 / Qwen3.5-MoE have a narrow Metal Beta path; the CUDA path is stubbed. Next-model queue: DeepSeek V4 (#1) → Qwen3.6 (#2); see ROADMAP.md. DeepSeek V2/V3/R1 are intentionally out of scope.

Authoritative matrix (HTTP API tiers, quantization, agent / train / eval surfaces): docs/support-matrix.md. Stability tiers: docs/stability-policy.md.

Why ARLE

In agent and RL workloads every turn pays a prefill tax: system prompt + history + tool results must be re-processed. As context grows, prefill dominates latency. ARLE treats this as the core problem in both serving and agent / RL loops:

Multi-turn KV reuse. Slot-sticky reuse keeps prior-turn KV hot for the next turn. CUDA also includes a radix-backed tiered-KV path (T0 GPU → T1 host pinned → T2 local disk → T3 cluster-shared) for full-block reuse and staged readmission, so only the new user message requires prefill each turn when the prefix stays reusable.
Paged KV pool. Main CUDA KV formats use page_size=16 with direct GPU page attach and tail-page CoW on shared prefixes — predictable accounting, reusable full blocks, cheaper prefix sharing.
Shared runtime authority. infer, arle, and the in-tree train / eval jobs resolve models and reuse the same Rust runtime / model contracts. Serving, local agent work, and RL tooling stay on one code path instead of drifting across separate stacks.
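The accounting behind prefix reuse can be illustrated with a little arithmetic. The sketch below is a simplified model, not ARLE's allocator: it assumes full pages inside the shared prefix are attached by reference, that a partially filled tail page is copied before the new turn appends to it, and that only tokens past the shared prefix need prefill. Only page_size=16 comes from the text above; the function names are hypothetical.

```python
PAGE_SIZE = 16  # from the text: main CUDA KV formats use page_size=16


def pages_needed(num_tokens, page_size=PAGE_SIZE):
    """Ceiling division: pages required to hold num_tokens of KV."""
    return -(-num_tokens // page_size)


def reuse_plan(shared_prefix_tokens, prompt_tokens, page_size=PAGE_SIZE):
    """Split a prompt into reused and new work under the simplified model above."""
    shared = min(shared_prefix_tokens, prompt_tokens)
    attached_pages = shared // page_size  # full pages shared by reference
    tail_tokens = shared % page_size      # tokens in the copied (CoW) tail page
    return {
        "attached_pages": attached_pages,
        "cow_tail_tokens": tail_tokens,
        "prefill_tokens": prompt_tokens - shared,
        # Pages this sequence must own itself, including the copied tail page.
        "new_pages": pages_needed(prompt_tokens, page_size) - attached_pages,
    }


plan = reuse_plan(shared_prefix_tokens=1000, prompt_tokens=1040)
print(plan)
```

With a 1000-token reusable prefix and a 40-token new message, this model attaches 62 full pages, copies an 8-token tail, and prefills 40 tokens instead of 1040.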

Architecture deep-dive: docs/architecture.md · docs/codebase-map.md. Latest benchmark snapshots (per change, dated): docs/experience/wins/ · run your own with scripts/bench_guidellm.sh.

Entry surfaces

arle is the single binary users interact with:

Command	What it does
`arle` (no args)	Interactive agent REPL with built-in `python` and `shell` tools (sandboxed).
`arle run --prompt "…"` / `--stdin --json`	Script-friendly one-shot agent prompt. Use `--no-tools` to disable tool execution.
`arle serve --backend {cuda,metal,cpu} --model-path …`	Launch the OpenAI-compatible HTTP server through an ARLE-native backend.
`arle train {pretrain,sft,grpo,multi-turn,eval}`	In-tree training and RL workflows on the same runtime.
`arle data {download,convert}`	Dataset utilities.
`arle --doctor [--json] [--strict]`	Self-check: backend, hardware, HF cache, model resolution. CI-friendly.
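Since the doctor command is described as CI-friendly, a pipeline step might wrap it as in the sketch below. Two things here are assumptions rather than documented behavior: that the JSON report is written to stdout, and that a failed check produces a non-zero exit code. The report's fields are not specified here, so the sketch does not read any of them. `run_json_check` is a hypothetical helper.

```python
import json
import subprocess


def run_json_check(cmd):
    """Run cmd (a list of argv strings) and return (exit_code, parsed_stdout).

    stdout that is not valid JSON is returned under "unparsed_stdout" so the
    caller can still log it.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if not proc.stdout.strip():
        return proc.returncode, {}
    try:
        return proc.returncode, json.loads(proc.stdout)
    except json.JSONDecodeError:
        return proc.returncode, {"unparsed_stdout": proc.stdout}


# Usage in a CI step (flags taken from the table above):
#   code, report = run_json_check(["arle", "--doctor", "--json", "--strict"])
#   if code != 0:
#       raise SystemExit(f"arle doctor failed: {report}")
```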

The REPL persists line history at ~/.arle-history and exposes slash commands: /help, /reset, /clear, /tools, /model, /stats, /models, /save, /load, /export.

Operators who want only the native serving binary can use infer directly (cargo build -p infer --release --features cuda on Linux, --features metal,no-cuda on Apple Silicon) — same HTTP contract, without the agent / train / data surface.

📰 Latest Updates

2026-05-15 — DSv4 DeepEP decode lands default B=1 padded BF16 reduce-scatter combine, fused local-expert prepare kernel, and broad scratch-reuse cleanup. Real 8xH20 on DeepSeek-V4-Flash: decode64 holds 12.05 post-first tok/s; isolated single-token nsys wave 105.2 → 87.7 ms, cuMemsetD8Async calls 3,640 → 544, arithmetic exact (410/406). Remaining stack: NCCL SendRecv/AllReduce, FP8/FP4 expert GEMV (awaits true grouped GEMM/DeepGEMM), launch churn, D2H route-count readback. Evidence: docs/trace-artifacts/2026-05-15-dsv4-deepep/, docs/trace-artifacts/2026-05-14-dsv4-deepep/, docs/experience/errors/2026-05-14-dsv4-decode-nccl-bottleneck.md.
2026-05-10 — W4-hybrid prefill graph capture closes the 4k/c=4 SGLang +76.6% gap via Path B.2 bucketed allocation key. Engine-side TTFT p50 2000 → 150 ms (-92.5%), throughput +632% in 60s on RTX 4070 Ti SUPER 16GB. Capture-key churn 388 → 7 unique (98.5% LRU reuse). Opt-in via INFER_PREFILL_GRAPH=1 + INFER_HYBRID_W4A8_PREFILL=1. Also lands RoPE YARN/Linear/NtkAware scaling. Evidence: docs/experience/wins/2026-05-10-bench-40-pathB2-tier1-strong-proceed.md, docs/experience/wins/2026-05-10-m-rope-yarn-scaling-phase1-phase2-landed.md.

Full history: CHANGELOG.md. Next up: ROADMAP.md.

Documentation map

docs/http-api.md — HTTP route contract, streaming behavior, boundary guarantees
docs/support-matrix.md — backend / model / quant / API support tiers
docs/stability-policy.md — stability levels and compatibility posture
docs/architecture.md — package boundaries and dependency direction
docs/codebase-map.md — workspace layout and main execution paths
docs/environment.md — environment variables and runtime knobs
docs/troubleshooting.md — common build / runtime errors and fixes
docs/comparison.md — how ARLE compares to vLLM / SGLang / mistral.rs / llama.cpp
docs/release-checklist.md · docs/perf-and-correctness-gates.md
CONTRIBUTING.md — contributor setup, validation, release expectations
SECURITY.md — vulnerability reporting policy
examples/ — copy-paste smoke paths (curl, OpenAI SDK, Docker, Metal, train fixtures)
docs/index.md — maintainer-facing PARA index, plans, and experience logs

License

MIT

