pi_sandbox

A sandbox for running pi — a minimal terminal coding agent — against local models on Apple Silicon. Swap models, swap inference engines, A/B them on your own hardware. No API keys. No cloud round-trips. All inference on your machine.

The documented default is Qwen3-Coder-30B-A3B-Instruct via llama.cpp, with wired-up alternates for Qwen3-Coder-Next-80B, gpt-oss-20b, and GLM-4.5-Air. See docs/02-models.md for the trade-offs and what didn't work.

Why this stack

Local models on Apple Silicon are practical now. A modern MoE in Q5 quant (~12–20 GB) runs at 50–60 tok/s decode on an M1 Max with no cloud round-trip. Coding-agent latency is workable; cost is electricity.
pi is an OpenAI-API-compatible coding agent, so it talks to a local inference server (llama.cpp, mistral.rs, vLLM, …) the same way it talks to any cloud provider. Drop-in by design.
llama.cpp has the most mature Metal backend and ships precompiled via Homebrew — no Xcode required. The repo also has scaffolding to swap in mistral.rs as a Rust-native alternative.
Sandbox by intent. The scripts and configs make it cheap to try a new model: download a GGUF, drop in a serve wrapper, add a models.json entry, run tool-call-test. The Tested and rejected section is what that workflow's failure cases look like in practice.

Tested on

Apple M1 Max, 64 GB RAM, macOS 26.3 (arm64)
llama.cpp via Homebrew (b9100)
pi 0.74.0
Quant: Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf (~21 GB on disk, ~37 GB resident with the default 131 k context)

Measured throughput on this hardware (M1 Max, llama.cpp b9100, -fa 1 -ngl 99 -r 3):

test	Qwen3-Coder-30B-A3B (Q5_K_M, 20 GiB)	Qwen3-Coder-Next-80B-A3B (Q3_K_M, 36 GiB)	gpt-oss-20b (MXFP4, 11 GiB)	GLM-4.5-Air (UD-Q3_K_XL, 51 GiB)
pp512 (prefill)	593.80 ± 4.38	407.39 ± 0.89	755.59 ± 0.90	160.62 ± 1.25
pp2048	554.40 ± 0.51	398.95 ± 3.05	741.53 ± 1.33	150.45 ± 1.76
pp8192	409.13 ± 7.86	381.04 ± 1.87	650.61 ± 7.65	116.98 ± 0.47
tg128 (decode)	50.76 ± 0.21	31.93 ± 0.10	59.67 ± 0.46	20.57 ± 0.08
tg512	50.00 ± 0.11	31.94 ± 0.26	60.40 ± 0.31	19.82 ± 0.30
pp8192+tg128	356.51 ± 2.71	323.00 ± 2.03	544.15 ± 3.87	104.77 ± 1.50

See docs/05-benchmarking.md for raw output, sweep parameters, and a full "reading the numbers" breakdown.

Should work on any Apple Silicon Mac with ≥ 32 GB RAM. Bigger context windows or higher-bit quants need more.

Linux/CPU path (X270-class): Validated on a ThinkPad X270 (i7-7600U, 14 GiB, no GPU, Ubuntu 26.04) running Qwen3-4B-Instruct-2507 Q4_K_M (~2.5 GB) via apt-shipped llama.cpp. tool-call-test passes and real pi sessions reliably emit structured tool calls — 4B-Instruct-2507 is non-thinking by design, so there's no thinking-vs-speed trade-off. Decode is ~3.3 tok/s; first-turn prefill is the pain point, mitigated by --cache-reuse 256 (turn-2+ skips the system-prompt portion). Setup walkthrough in docs/12-linux-cpu.md; design notes — including why smaller Qwen3-1.7B and Qwen2.5-Coder-1.5B were tried and rejected — in docs/TODO_x270.md.

How the pieces fit

┌─────────┐    HTTP /v1/chat/completions   ┌──────────────┐    Metal    ┌──────────┐
│   pi    │ ──────────────────────────────▶│ llama-server │ ──────────▶ │   GPU    │
│ (agent) │                                │  (llama.cpp) │             │ (M-chip) │
└─────────┘                                └──────────────┘             └──────────┘
                                                  ▲
                                                  │ reads
                                                  ▼
                                         ~/models/...Q5_K_M.gguf

pi sees an OpenAI-compatible endpoint. llama-server does the actual inference on Metal. Your model file sits on disk and is memory-mapped at load time.

Workflow

The full lifecycle — system setup → model install → daily use → model switching — looks like this. Diagram authored by GLM-4.5-Air running locally in pi, using the project's mermaid skill (format.sh + validate.sh iterate loop) — see docs/codebase-workflow.mmd for the source.

flowchart TD
    A[Start: System Setup] --> B[Install Prerequisites]
    B --> C[Install llama.cpp via Homebrew]
    B --> D[Install pi via curl]
    B --> E[Install uv via curl]

    C --> F[Clone Repository]
    D --> F
    E --> F

    F --> G[Run uv sync]
    G --> H[Base Install: ./install/base.sh]
    H --> I[Copy helper scripts to ~/bin]
    H --> J[Copy pi provider config to ~/.pi/agent/]

    I --> K[Choose Model & Install]
    J --> K

    K --> L{Model Selection}
    L --> M[Qwen3-Coder-30B<br>./install/qwen3-coder-30b.sh]
    L --> N[Qwen3-Coder-Next-80B<br>./install/qwen3-coder-next-80b.sh]
    L --> O[GPT-OSS-20B<br>./install/gpt-oss-20b.sh]
    L --> P[GLM-4.5-Air<br>./install/glm-4.5-air.sh]

    M --> QWEN30B_DOWNLOAD["Download Model (~21 GB)"]
    N --> QWENNEXT_DOWNLOAD["Download Model (~38 GB)"]
    O --> GPTOSS_DOWNLOAD["Download Model (~12 GB)"]
    P --> GLMAIR_DOWNLOAD["Download Model (~55 GB)"]

    QWEN30B_DOWNLOAD --> U[Daily Use Workflow]
    QWENNEXT_DOWNLOAD --> U
    GPTOSS_DOWNLOAD --> U
    GLMAIR_DOWNLOAD --> U

    U --> V[Terminal 1: Start Server]
    U --> W[Terminal 2: Test & Run pi]

    V --> X{Choose Server Script}
    X --> Y[qwen-serve<br>Qwen3-Coder-30B-A3B]
    X --> Z[qwennext-serve<br>Qwen3-Coder-Next-80B-A3B]
    X --> AA[gptoss-serve<br>GPT-OSS-20B]
    X --> AB[glmair-serve<br>GLM-4.5-Air]

    Y --> AC[llama-server with Qwen flags]
    Z --> AD[llama-server with Next flags]
    AA --> AE[llama-server with GPT-OSS flags]
    AB --> AF[llama-server with GLM flags]

    AC --> AG[Server running on port 8080]
    AD --> AG
    AE --> AG
    AF --> AG

    W --> AH[Smoke Test: qwen-test]
    W --> AI[Tool Call Test: tool-call-test]
    W --> AJ[Run pi: pi --model <alias>]

    AH --> AK[Reply with hello]
    AI --> AL[Verify structured tool_calls]
    AJ --> AM[Interactive pi session]

    AK --> AN[✓ Server working]
    AL --> AN
    AN --> AM

    AM --> AO[Code with pi<br>Exit with /exit or Ctrl-D]

    AO --> AP{Switch Models?}
    AP --> |Yes| AQ[Terminal 1: serve-stop]
    AP -->|No| AR[Continue using current model]

    AQ --> AS[Free port 8080]
    AS --> AT[Choose different server script]
    AT --> X

    AR --> AU[Daily workflow complete]

    style A fill:#e1f5fe
    style U fill:#f3e5f5
    style X fill:#fff3e0
    style AP fill:#e8f5e8

Daily use

Two commands in two terminals:

# Terminal 1 — start the server (leave it running)
qwen-serve

# Terminal 2 — drop into pi against the local model
cd /path/to/your/project
pi --model qwen3-coder-30b-a3b

Exit pi with /exit or Ctrl-D. The server keeps running across pi sessions; stop it with Ctrl-C or serve-stop. Switch models with serve-stop then a different *-serve.

Docs

01 — Quickstart — install llama.cpp + pi + uv, download the default model, first run.
02 — Models — the four working candidates, tested-and-rejected, next trials, choosing a quant.
03 — Serving — the *-serve scripts and helpers (qwen-test, tool-call-test, serve-stop, fetch-template).
04 — Tool calling — chat-template fix, verification flow, pi-web-access extension, future MCP for git/GitHub.
05 — Benchmarking — llama-bench sweeps, raw output for all four models, reading the numbers.
06 — Troubleshooting — common failure modes (slow downloads, npm permissions, OOM at load, Metal wired-memory cap, template bugs).
07 — Alternative engines — mistral.rs (primary alternate), candle-vllm, Crane, vllm-mlx.
08 — Recording the demo — vhs script for the README GIF.
09 — Skills — what a skill is, where pi looks for them, adding project-local and global skills, reusing Claude Code / Codex skills.
10 — Extensions — TypeScript extensions, install patterns, what's wired into this repo (pi-web-access + the pi-hooks bundle), writing your own, skill vs. extension recap.
11 — Prompt engineering — evidence-based prompt tips per local model, reasoning levels, prompting via skills and extensions, empirical findings from this sandbox.

Layout of this repo

pi_sandbox/
├── README.md           # this file — summary + docs index
├── LICENSE             # MIT
├── docs/               # detailed docs, one per stage
│   ├── 01-quickstart.md
│   ├── 02-models.md
│   ├── 03-serving.md
│   ├── 04-tool-calling.md
│   ├── 05-benchmarking.md
│   ├── 06-troubleshooting.md
│   ├── 07-engines.md
│   ├── 08-demo.md
│   ├── 09-skills.md
│   ├── 10-extensions.md
│   └── 11-prompt-engineering.md
├── install/            # one-shot install scripts (idempotent)
│   ├── base.sh                    # copies scripts + models.json into place
│   ├── qwen3-coder-30b.sh         # default coder model, ~21 GB
│   ├── qwen3-coder-next-80b.sh    # long-context coder, ~38 GB
│   ├── gpt-oss-20b.sh             # fastest, generalist, ~12 GB
│   ├── glm-4.5-air.sh             # largest, agent-tuned, ~55 GB
│   ├── all-models.sh              # base + all four models in sequence
│   └── vllm-mlx.sh                # alternative engine (Python+MLX, no Xcode)
├── scripts/
│   ├── qwen-serve      # start llama-server for Qwen3-Coder-30B-A3B
│   ├── qwennext-serve  # alternate: Qwen3-Coder-Next-80B-A3B
│   ├── gptoss-serve    # alternate: OpenAI gpt-oss-20b
│   ├── glmair-serve    # alternate: Z.ai GLM-4.5-Air
│   ├── serve-stop      # kill whatever llama-server is on port 8080
│   ├── qwen-test       # one-shot chat-completion smoke test
│   ├── tool-call-test  # model-agnostic check that pi-style tool_calls fire
│   └── fetch-template  # fetch Qwen's official chat template (fixes tool calls)
├── bench/
│   └── throughput.sh   # llama-bench wrapper for pp/tg tok/s
├── demo/
│   ├── pi-qwen.tape    # vhs script for the README demo
│   └── smoke.tape      # minimal vhs script to verify the toolchain
└── config/
    └── models.json     # pi provider config (copy to ~/.pi/agent/)

Credits

Qwen team for the base model
Unsloth for the GGUF quants used here
ggml-org/llama.cpp for the inference engine
pi for the agent

License

MIT — see LICENSE.

Acknowledgments

This README was co-written by Claude Code and the Qwen3-Coder-30B-A3B model with Pi. The workflow diagram above was authored by GLM-4.5-Air running locally in pi via the project's mermaid skill — first-pass attempts on smaller models (gpt-oss-20b) failed the iterate-on-validator-feedback loop; GLM-Air converged with one steering message. See docs/11-prompt-engineering.md for the model-vs-task fit table this empirically supports.

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
.pi		.pi
bench		bench
config		config
demo		demo
docs		docs
install		install
scripts		scripts
src		src
.gitignore		.gitignore
AGENTS.md		AGENTS.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pi_sandbox

Why this stack

Tested on

How the pieces fit

Workflow

Daily use

Docs

Layout of this repo

Credits

Further reading

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

pi_sandbox

Why this stack

Tested on

How the pieces fit

Workflow

Daily use

Docs

Layout of this repo

Credits

Further reading

License

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages