cklxx/arle

ARLE
Pure-Rust runtime for serving, local agents, training, and evaluation. infer is the OpenAI-compatible serving binary; arle is the unified front door.


Quick Start · HTTP API · Support Matrix · Architecture · Roadmap · Changelog · Contributing

English · 简体中文


Quick Start

1. Install

Apple Silicon — Homebrew (recommended):

brew install cklxx/tap/arle
arle --doctor

Apple Silicon or Linux x86_64 — one-line installer:

curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

The script grabs the matching tarball from the latest GitHub Release, SHA256-verifies it, and drops the binaries into ~/.local/bin (override with INSTALL_DIR=...). See docs/install.md for the full matrix, env-var overrides, and uninstall steps.
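If you download a release tarball by hand instead of using the installer, the same kind of check can be reproduced locally. The following is a minimal sketch, not the installer's actual logic: the function names are hypothetical, and it assumes you already have the expected checksum as a hex string from the release page.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a published checksum (hypothetical helper)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

Call `verify(Path("<downloaded tarball>"), "<published checksum>")` and refuse to unpack the archive if it returns `False`.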

Linux + NVIDIA — pull the published Docker image, no compile:

docker run --rm --gpus all -p 8000:8000 \
  -v /path/to/Qwen3.5-4B:/model:ro \
  ghcr.io/cklxx/arle:latest \
  serve --backend cuda --model-path /model --port 8000

The :latest tag tracks the newest non-prerelease release image. Tagged releases are published as ghcr.io/cklxx/arle:X.Y.Z (note: no v prefix - the docker metadata-action strips it). For the current release: ghcr.io/cklxx/arle:0.1.5.
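For scripts that pin an image, the tag convention above can be captured in a few lines. This is only a sketch: `image_ref` is a hypothetical helper, and it assumes the Git release tags carry a leading `v` (for example `v0.1.5`) that the image tag drops.

```python
def image_ref(git_tag: str, repo: str = "ghcr.io/cklxx/arle") -> str:
    """Map a release tag to its image reference by dropping a leading 'v'."""
    # Assumption: release tags look like vX.Y.Z; tags without 'v' pass through unchanged.
    version = git_tag[1:] if git_tag.startswith("v") else git_tag
    return f"{repo}:{version}"
```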

From source (any backend; needed for cpu, CUDA/TileLang, or local hacking):

git clone https://github.com/cklxx/arle && cd arle
# Apple Silicon:
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle
# Linux + NVIDIA:
cargo build --release --features cuda --bin arle

2. Serve a model

arle serve --backend metal \
  --model-path mlx-community/Qwen3.5-0.8B-MLX-4bit --port 8000   # Apple Silicon
arle serve --backend cuda \
  --model-path /path/to/Qwen3.5-4B --port 8000                   # Linux + NVIDIA

3. Talk to it

# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)

Or with curl: see examples/curl_chat.sh. More copy-paste paths: examples/.
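If you would rather not install the openai package, the same request can be sent with the Python standard library. The sketch below assumes the server exposes the usual OpenAI-style `/v1/chat/completions` route with the standard response shape; `build_chat_request` and `chat` are hypothetical helper names, not part of ARLE.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen3.5-4b",
                       base_url="http://localhost:8000/v1"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the first choice's message text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With a server running from step 2, `print(chat("Hello from ARLE"))` should print the model's reply.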

4. Run the local agent

arle                                                       # interactive REPL with built-in tools
arle --model-path /path/to/Qwen3.5-4B run --prompt "Summarize this repo"   # one-shot
arle --doctor --json                                       # self-check, machine-readable
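The machine-readable self-check can be wrapped for CI or setup scripts. The sketch below makes two assumptions that the README does not confirm: that `arle --doctor --json` prints a single JSON document on stdout, and that a non-zero exit code signals a failed check. The report's field names are not documented here, so the helper returns the parsed object as-is.

```python
import json
import subprocess


def run_doctor(cmd=("arle", "--doctor", "--json")):
    """Run the self-check and return (exit_code, parsed_report_or_None)."""
    proc = subprocess.run(list(cmd), capture_output=True, text=True)
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        # Assumption: stdout is one JSON document; anything else is treated as no report.
        report = None
    return proc.returncode, report
```

A caller might treat `code != 0 or report is None` as a failed environment check and print `proc.stderr` equivalents from its own logging.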

CPU-only smoke build (no GPU required, source build):

cargo build --release --no-default-features --features cpu,no-cuda,cli --bin arle
./target/release/arle --doctor

Status at a glance

| Backend | Platform | Status | Notes |
| --- | --- | --- | --- |
| CUDA | Linux + NVIDIA | Stable | Continuous batching, paged KV, radix-backed reuse, TileLang BF16 attention, CUDA Graph decode. L4 / Qwen3.5-4B BF16 + FP8 KV: 197 tok/s @ c=16 / 4k-in. |
| Metal | Apple Silicon | Beta | Scheduler-backed serving, chunked prefill, replay prefix reuse. Qwen3.5-0.8B MLX-4bit step-driver: 305.5 tok/s on M4 Pro 20c. |
| Metal DFlash | Apple Silicon | Beta (default-on) | Speculative decode for Qwen3.5. Qwen3.5-4B-4bit bit-identical, c=1..8. |
| CPU | Portable | Dev-only | Smoke tests and request-path validation; not a perf target. |

Models: Qwen3.5 family (0.8B / 4B / 30B-A3B / 35B; dense, hybrid linear-attn, and MoE; GGUF Q4_K_M and 4B hybrid attention) on CUDA + Metal. Qwen3.6 / Qwen3.5-MoE have a narrow Metal Beta path; the CUDA path is stubbed. Next-model queue: DeepSeek V4 (#1), Qwen 3.6 (#2); see ROADMAP.md. DeepSeek V2/V3/R1 are intentionally out of scope.

Authoritative matrix (HTTP API tiers, quantization, agent / train / eval surfaces): docs/support-matrix.md. Stability tiers: docs/stability-policy.md.


Why ARLE

In agent and RL workloads every turn pays a prefill tax: system prompt + history + tool results must be re-processed. As context grows, prefill dominates latency. ARLE treats this as the core problem in both serving and agent / RL loops:

  • Multi-turn KV reuse. Slot-sticky reuse keeps prior-turn KV hot for the next turn. CUDA also includes a radix-backed tiered-KV path (T0 GPU → T1 host pinned → T2 local disk → T3 cluster-shared) for full-block reuse and staged readmission, so only the new user message requires prefill each turn when the prefix stays reusable.
  • Paged KV pool. Main CUDA KV formats use page_size=16 with direct GPU page attach and tail-page CoW on shared prefixes — predictable accounting, reusable full blocks, cheaper prefix sharing.
  • Shared runtime authority. infer, arle, and the in-tree train / eval jobs resolve models and reuse the same Rust runtime / model contracts. Serving, local agent work, and RL tooling stay on one code path instead of drifting across separate stacks.
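To make the full-block reuse idea concrete, here is a toy model of it. It is a sketch under stated assumptions, not ARLE's implementation: `PrefixCache` and `prefill_cost` are hypothetical names, a plain set of page-aligned prefixes stands in for the radix structure, and slot-sticky reuse, tiering, and tail-page CoW are all ignored. Only the page size of 16 is taken from the text above.

```python
PAGE_SIZE = 16  # page_size quoted above for the main CUDA KV formats


class PrefixCache:
    """Toy stand-in for a radix-backed cache keyed by page-aligned prefixes."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self._prefixes = set()

    def _page_ends(self, tokens):
        # Offsets where a full page ends; a partial tail page is never included.
        return range(self.page_size, len(tokens) + 1, self.page_size)

    def insert(self, tokens):
        """Record every full-page prefix of a processed prompt."""
        for end in self._page_ends(tokens):
            self._prefixes.add(tuple(tokens[:end]))

    def reusable_tokens(self, tokens):
        """Length of the longest cached prefix made of full pages."""
        reused = 0
        for end in self._page_ends(tokens):
            if tuple(tokens[:end]) not in self._prefixes:
                break
            reused = end
        return reused


def prefill_cost(cache, tokens):
    """Tokens that still need prefill for this turn; then cache the turn."""
    reused = cache.reusable_tokens(tokens)
    cache.insert(tokens)
    return len(tokens) - reused
```

In this model a 100-token first turn costs 100 tokens of prefill. A second turn that appends 40 tokens (140 total) costs 44: six full pages (96 tokens) are reused, and the 4-token partial tail page plus the 40 new tokens are recomputed. A prompt that diverges at the first token reuses nothing.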

Architecture deep-dive: docs/architecture.md · docs/codebase-map.md. Latest benchmark snapshots (per change, dated): docs/experience/wins/ · run your own with scripts/bench_guidellm.sh.


Entry surfaces

arle is the single binary users interact with:

| Command | What it does |
| --- | --- |
| arle (no args) | Interactive agent REPL with built-in python and shell tools (sandboxed). |
| arle run --prompt "…" / --stdin --json | Script-friendly one-shot agent prompt. Use --no-tools to disable tool execution. |
| arle serve --backend {cuda,metal,cpu} --model-path … | Launch the OpenAI-compatible HTTP server through an ARLE-native backend. |
| arle train {pretrain,sft,grpo,multi-turn,eval} | In-tree training and RL workflows on the same runtime. |
| arle data {download,convert} | Dataset utilities. |
| arle --doctor [--json] [--strict] | Self-check: backend, hardware, HF cache, model resolution. CI-friendly. |

The REPL persists line history at ~/.arle-history and exposes slash commands: /help, /reset, /clear, /tools, /model, /stats, /models, /save, /load, /export.

Operators who want only the native serving binary can use infer directly (cargo build -p infer --release --features cuda on Linux, --features metal,no-cuda on Apple Silicon) — same HTTP contract, without the agent / train / data surface.


📰 Latest Updates

Full history: CHANGELOG.md. Next up: ROADMAP.md.


Documentation map


License

MIT

About

Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.
