A sandbox for running pi — a minimal terminal coding agent — against local models on Apple Silicon. Swap models, swap inference engines, A/B them on your own hardware. No API keys. No cloud round-trips. All inference on your machine.
The documented default is Qwen3-Coder-30B-A3B-Instruct via llama.cpp, with wired-up alternates for Qwen3-Coder-Next-80B, gpt-oss-20b, and GLM-4.5-Air. See docs/02-models.md for the trade-offs and what didn't work.
- Local models on Apple Silicon are practical now. A modern MoE in Q5 quant (~12–20 GB) runs at 50–60 tok/s decode on an M1 Max with no cloud round-trip. Coding-agent latency is workable; cost is electricity.
- pi is an OpenAI-API-compatible coding agent, so it talks to a local inference server (llama.cpp, mistral.rs, vLLM, …) the same way it talks to any cloud provider. Drop-in by design.
- llama.cpp has the most mature Metal backend and ships precompiled via Homebrew — no Xcode required. The repo also has scaffolding to swap in mistral.rs as a Rust-native alternative.
- Sandbox by intent. The scripts and configs make it cheap to try a new model: download a GGUF, drop in a serve wrapper, add a
models.jsonentry, runtool-call-test. The Tested and rejected section is what that workflow's failure cases look like in practice.
- Apple M1 Max, 64 GB RAM, macOS 26.3 (
arm64) - llama.cpp via Homebrew (
b9100) - pi
0.74.0 - Quant:
Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf(~21 GB on disk, ~37 GB resident with the default 131 k context)
Measured throughput on this hardware (M1 Max, llama.cpp b9100, -fa 1 -ngl 99 -r 3):
| test | Qwen3-Coder-30B-A3B (Q5_K_M, 20 GiB) |
Qwen3-Coder-Next-80B-A3B (Q3_K_M, 36 GiB) |
gpt-oss-20b (MXFP4, 11 GiB) |
GLM-4.5-Air (UD-Q3_K_XL, 51 GiB) |
|---|---|---|---|---|
| pp512 (prefill) | 593.80 ± 4.38 | 407.39 ± 0.89 | 755.59 ± 0.90 | 160.62 ± 1.25 |
| pp2048 | 554.40 ± 0.51 | 398.95 ± 3.05 | 741.53 ± 1.33 | 150.45 ± 1.76 |
| pp8192 | 409.13 ± 7.86 | 381.04 ± 1.87 | 650.61 ± 7.65 | 116.98 ± 0.47 |
| tg128 (decode) | 50.76 ± 0.21 | 31.93 ± 0.10 | 59.67 ± 0.46 | 20.57 ± 0.08 |
| tg512 | 50.00 ± 0.11 | 31.94 ± 0.26 | 60.40 ± 0.31 | 19.82 ± 0.30 |
| pp8192+tg128 | 356.51 ± 2.71 | 323.00 ± 2.03 | 544.15 ± 3.87 | 104.77 ± 1.50 |
See docs/05-benchmarking.md for raw output, sweep parameters, and a full "reading the numbers" breakdown.
Should work on any Apple Silicon Mac with ≥ 32 GB RAM. Bigger context windows or higher-bit quants need more.
Linux/CPU path (X270-class): Validated on a ThinkPad X270 (i7-7600U, 14 GiB, no GPU, Ubuntu 26.04) running Qwen3-4B-Instruct-2507 Q4_K_M (~2.5 GB) via apt-shipped llama.cpp. tool-call-test passes and real pi sessions reliably emit structured tool calls — 4B-Instruct-2507 is non-thinking by design, so there's no thinking-vs-speed trade-off. Decode is ~3.3 tok/s; first-turn prefill is the pain point, mitigated by --cache-reuse 256 (turn-2+ skips the system-prompt portion). Setup walkthrough in docs/12-linux-cpu.md; design notes — including why smaller Qwen3-1.7B and Qwen2.5-Coder-1.5B were tried and rejected — in docs/TODO_x270.md.
┌─────────┐ HTTP /v1/chat/completions ┌──────────────┐ Metal ┌──────────┐
│ pi │ ──────────────────────────────▶│ llama-server │ ──────────▶ │ GPU │
│ (agent) │ │ (llama.cpp) │ │ (M-chip) │
└─────────┘ └──────────────┘ └──────────┘
▲
│ reads
▼
~/models/...Q5_K_M.gguf
pi sees an OpenAI-compatible endpoint. llama-server does the actual inference on Metal. Your model file sits on disk and is memory-mapped at load time.
The full lifecycle — system setup → model install → daily use → model switching — looks like this. Diagram authored by GLM-4.5-Air running locally in pi, using the project's mermaid skill (format.sh + validate.sh iterate loop) — see docs/codebase-workflow.mmd for the source.
flowchart TD
A[Start: System Setup] --> B[Install Prerequisites]
B --> C[Install llama.cpp via Homebrew]
B --> D[Install pi via curl]
B --> E[Install uv via curl]
C --> F[Clone Repository]
D --> F
E --> F
F --> G[Run uv sync]
G --> H[Base Install: ./install/base.sh]
H --> I[Copy helper scripts to ~/bin]
H --> J[Copy pi provider config to ~/.pi/agent/]
I --> K[Choose Model & Install]
J --> K
K --> L{Model Selection}
L --> M[Qwen3-Coder-30B<br>./install/qwen3-coder-30b.sh]
L --> N[Qwen3-Coder-Next-80B<br>./install/qwen3-coder-next-80b.sh]
L --> O[GPT-OSS-20B<br>./install/gpt-oss-20b.sh]
L --> P[GLM-4.5-Air<br>./install/glm-4.5-air.sh]
M --> QWEN30B_DOWNLOAD["Download Model (~21 GB)"]
N --> QWENNEXT_DOWNLOAD["Download Model (~38 GB)"]
O --> GPTOSS_DOWNLOAD["Download Model (~12 GB)"]
P --> GLMAIR_DOWNLOAD["Download Model (~55 GB)"]
QWEN30B_DOWNLOAD --> U[Daily Use Workflow]
QWENNEXT_DOWNLOAD --> U
GPTOSS_DOWNLOAD --> U
GLMAIR_DOWNLOAD --> U
U --> V[Terminal 1: Start Server]
U --> W[Terminal 2: Test & Run pi]
V --> X{Choose Server Script}
X --> Y[qwen-serve<br>Qwen3-Coder-30B-A3B]
X --> Z[qwennext-serve<br>Qwen3-Coder-Next-80B-A3B]
X --> AA[gptoss-serve<br>GPT-OSS-20B]
X --> AB[glmair-serve<br>GLM-4.5-Air]
Y --> AC[llama-server with Qwen flags]
Z --> AD[llama-server with Next flags]
AA --> AE[llama-server with GPT-OSS flags]
AB --> AF[llama-server with GLM flags]
AC --> AG[Server running on port 8080]
AD --> AG
AE --> AG
AF --> AG
W --> AH[Smoke Test: qwen-test]
W --> AI[Tool Call Test: tool-call-test]
W --> AJ[Run pi: pi --model <alias>]
AH --> AK[Reply with hello]
AI --> AL[Verify structured tool_calls]
AJ --> AM[Interactive pi session]
AK --> AN[✓ Server working]
AL --> AN
AN --> AM
AM --> AO[Code with pi<br>Exit with /exit or Ctrl-D]
AO --> AP{Switch Models?}
AP --> |Yes| AQ[Terminal 1: serve-stop]
AP -->|No| AR[Continue using current model]
AQ --> AS[Free port 8080]
AS --> AT[Choose different server script]
AT --> X
AR --> AU[Daily workflow complete]
style A fill:#e1f5fe
style U fill:#f3e5f5
style X fill:#fff3e0
style AP fill:#e8f5e8
Two commands in two terminals:
# Terminal 1 — start the server (leave it running)
qwen-serve
# Terminal 2 — drop into pi against the local model
cd /path/to/your/project
pi --model qwen3-coder-30b-a3bExit pi with /exit or Ctrl-D. The server keeps running across pi sessions; stop it with Ctrl-C or serve-stop. Switch models with serve-stop then a different *-serve.
- 01 — Quickstart — install llama.cpp + pi + uv, download the default model, first run.
- 02 — Models — the four working candidates, tested-and-rejected, next trials, choosing a quant.
- 03 — Serving — the
*-servescripts and helpers (qwen-test,tool-call-test,serve-stop,fetch-template). - 04 — Tool calling — chat-template fix, verification flow,
pi-web-accessextension, future MCP for git/GitHub. - 05 — Benchmarking —
llama-benchsweeps, raw output for all four models, reading the numbers. - 06 — Troubleshooting — common failure modes (slow downloads, npm permissions, OOM at load, Metal wired-memory cap, template bugs).
- 07 — Alternative engines — mistral.rs (primary alternate), candle-vllm, Crane, vllm-mlx.
- 08 — Recording the demo —
vhsscript for the README GIF. - 09 — Skills — what a skill is, where pi looks for them, adding project-local and global skills, reusing Claude Code / Codex skills.
- 10 — Extensions — TypeScript extensions, install patterns, what's wired into this repo (
pi-web-access+ thepi-hooksbundle), writing your own, skill vs. extension recap. - 11 — Prompt engineering — evidence-based prompt tips per local model, reasoning levels, prompting via skills and extensions, empirical findings from this sandbox.
pi_sandbox/
├── README.md # this file — summary + docs index
├── LICENSE # MIT
├── docs/ # detailed docs, one per stage
│ ├── 01-quickstart.md
│ ├── 02-models.md
│ ├── 03-serving.md
│ ├── 04-tool-calling.md
│ ├── 05-benchmarking.md
│ ├── 06-troubleshooting.md
│ ├── 07-engines.md
│ ├── 08-demo.md
│ ├── 09-skills.md
│ ├── 10-extensions.md
│ └── 11-prompt-engineering.md
├── install/ # one-shot install scripts (idempotent)
│ ├── base.sh # copies scripts + models.json into place
│ ├── qwen3-coder-30b.sh # default coder model, ~21 GB
│ ├── qwen3-coder-next-80b.sh # long-context coder, ~38 GB
│ ├── gpt-oss-20b.sh # fastest, generalist, ~12 GB
│ ├── glm-4.5-air.sh # largest, agent-tuned, ~55 GB
│ ├── all-models.sh # base + all four models in sequence
│ └── vllm-mlx.sh # alternative engine (Python+MLX, no Xcode)
├── scripts/
│ ├── qwen-serve # start llama-server for Qwen3-Coder-30B-A3B
│ ├── qwennext-serve # alternate: Qwen3-Coder-Next-80B-A3B
│ ├── gptoss-serve # alternate: OpenAI gpt-oss-20b
│ ├── glmair-serve # alternate: Z.ai GLM-4.5-Air
│ ├── serve-stop # kill whatever llama-server is on port 8080
│ ├── qwen-test # one-shot chat-completion smoke test
│ ├── tool-call-test # model-agnostic check that pi-style tool_calls fire
│ └── fetch-template # fetch Qwen's official chat template (fixes tool calls)
├── bench/
│ └── throughput.sh # llama-bench wrapper for pp/tg tok/s
├── demo/
│ ├── pi-qwen.tape # vhs script for the README demo
│ └── smoke.tape # minimal vhs script to verify the toolchain
└── config/
└── models.json # pi provider config (copy to ~/.pi/agent/)
- Qwen team for the base model
- Unsloth for the GGUF quants used here
- ggml-org/llama.cpp for the inference engine
- pi for the agent
Three upstream catalogs that source the skills and extensions referenced in docs/09-skills.md:
- anthropics/skills — Anthropic's official skills (docx, pdf, mcp-builder, skill-creator, frontend-design, …).
- badlogic/pi-skills — community pi skills (brave-search, browser-tools, Google APIs, transcribe, vscode, youtube-transcript).
- qualisero/awesome-pi-agent — meta-catalog of pi extensions, skills, tools, and prompt templates across the ecosystem.
MIT — see LICENSE.
This README was co-written by Claude Code and the Qwen3-Coder-30B-A3B model with Pi. The workflow diagram above was authored by GLM-4.5-Air running locally in pi via the project's mermaid skill — first-pass attempts on smaller models (gpt-oss-20b) failed the iterate-on-validator-feedback loop; GLM-Air converged with one steering message. See docs/11-prompt-engineering.md for the model-vs-task fit table this empirically supports.
