Skip to content

carecodeconnect/pi_sandbox

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

88 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pi_sandbox

A sandbox for running pi — a minimal terminal coding agent — against local models on Apple Silicon. Swap models, swap inference engines, A/B them on your own hardware. No API keys. No cloud round-trips. All inference on your machine.

The documented default is Qwen3-Coder-30B-A3B-Instruct via llama.cpp, with wired-up alternates for Qwen3-Coder-Next-80B, gpt-oss-20b, and GLM-4.5-Air. See docs/02-models.md for the trade-offs and what didn't work.

pi + Qwen3-Coder demo

Why this stack

  • Local models on Apple Silicon are practical now. A modern MoE in Q5 quant (~12–20 GB) runs at 50–60 tok/s decode on an M1 Max with no cloud round-trip. Coding-agent latency is workable; cost is electricity.
  • pi is an OpenAI-API-compatible coding agent, so it talks to a local inference server (llama.cpp, mistral.rs, vLLM, …) the same way it talks to any cloud provider. Drop-in by design.
  • llama.cpp has the most mature Metal backend and ships precompiled via Homebrew — no Xcode required. The repo also has scaffolding to swap in mistral.rs as a Rust-native alternative.
  • Sandbox by intent. The scripts and configs make it cheap to try a new model: download a GGUF, drop in a serve wrapper, add a models.json entry, run tool-call-test. The Tested and rejected section is what that workflow's failure cases look like in practice.

Tested on

  • Apple M1 Max, 64 GB RAM, macOS 26.3 (arm64)
  • llama.cpp via Homebrew (b9100)
  • pi 0.74.0
  • Quant: Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf (~21 GB on disk, ~37 GB resident with the default 131 k context)

Measured throughput on this hardware (M1 Max, llama.cpp b9100, -fa 1 -ngl 99 -r 3):

test Qwen3-Coder-30B-A3B
(Q5_K_M, 20 GiB)
Qwen3-Coder-Next-80B-A3B
(Q3_K_M, 36 GiB)
gpt-oss-20b
(MXFP4, 11 GiB)
GLM-4.5-Air
(UD-Q3_K_XL, 51 GiB)
pp512 (prefill) 593.80 ± 4.38 407.39 ± 0.89 755.59 ± 0.90 160.62 ± 1.25
pp2048 554.40 ± 0.51 398.95 ± 3.05 741.53 ± 1.33 150.45 ± 1.76
pp8192 409.13 ± 7.86 381.04 ± 1.87 650.61 ± 7.65 116.98 ± 0.47
tg128 (decode) 50.76 ± 0.21 31.93 ± 0.10 59.67 ± 0.46 20.57 ± 0.08
tg512 50.00 ± 0.11 31.94 ± 0.26 60.40 ± 0.31 19.82 ± 0.30
pp8192+tg128 356.51 ± 2.71 323.00 ± 2.03 544.15 ± 3.87 104.77 ± 1.50

See docs/05-benchmarking.md for raw output, sweep parameters, and a full "reading the numbers" breakdown.

Should work on any Apple Silicon Mac with ≥ 32 GB RAM. Bigger context windows or higher-bit quants need more.

Linux/CPU path (X270-class): Validated on a ThinkPad X270 (i7-7600U, 14 GiB, no GPU, Ubuntu 26.04) running Qwen3-4B-Instruct-2507 Q4_K_M (~2.5 GB) via apt-shipped llama.cpp. tool-call-test passes and real pi sessions reliably emit structured tool calls — 4B-Instruct-2507 is non-thinking by design, so there's no thinking-vs-speed trade-off. Decode is ~3.3 tok/s; first-turn prefill is the pain point, mitigated by --cache-reuse 256 (turn-2+ skips the system-prompt portion). Setup walkthrough in docs/12-linux-cpu.md; design notes — including why smaller Qwen3-1.7B and Qwen2.5-Coder-1.5B were tried and rejected — in docs/TODO_x270.md.

How the pieces fit

┌─────────┐    HTTP /v1/chat/completions   ┌──────────────┐    Metal    ┌──────────┐
│   pi    │ ──────────────────────────────▶│ llama-server │ ──────────▶ │   GPU    │
│ (agent) │                                │  (llama.cpp) │             │ (M-chip) │
└─────────┘                                └──────────────┘             └──────────┘
                                                  ▲
                                                  │ reads
                                                  ▼
                                         ~/models/...Q5_K_M.gguf

pi sees an OpenAI-compatible endpoint. llama-server does the actual inference on Metal. Your model file sits on disk and is memory-mapped at load time.

Workflow

The full lifecycle — system setup → model install → daily use → model switching — looks like this. Diagram authored by GLM-4.5-Air running locally in pi, using the project's mermaid skill (format.sh + validate.sh iterate loop) — see docs/codebase-workflow.mmd for the source.

flowchart TD
    A[Start: System Setup] --> B[Install Prerequisites]
    B --> C[Install llama.cpp via Homebrew]
    B --> D[Install pi via curl]
    B --> E[Install uv via curl]

    C --> F[Clone Repository]
    D --> F
    E --> F

    F --> G[Run uv sync]
    G --> H[Base Install: ./install/base.sh]
    H --> I[Copy helper scripts to ~/bin]
    H --> J[Copy pi provider config to ~/.pi/agent/]

    I --> K[Choose Model & Install]
    J --> K

    K --> L{Model Selection}
    L --> M[Qwen3-Coder-30B<br>./install/qwen3-coder-30b.sh]
    L --> N[Qwen3-Coder-Next-80B<br>./install/qwen3-coder-next-80b.sh]
    L --> O[GPT-OSS-20B<br>./install/gpt-oss-20b.sh]
    L --> P[GLM-4.5-Air<br>./install/glm-4.5-air.sh]

    M --> QWEN30B_DOWNLOAD["Download Model (~21 GB)"]
    N --> QWENNEXT_DOWNLOAD["Download Model (~38 GB)"]
    O --> GPTOSS_DOWNLOAD["Download Model (~12 GB)"]
    P --> GLMAIR_DOWNLOAD["Download Model (~55 GB)"]

    QWEN30B_DOWNLOAD --> U[Daily Use Workflow]
    QWENNEXT_DOWNLOAD --> U
    GPTOSS_DOWNLOAD --> U
    GLMAIR_DOWNLOAD --> U

    U --> V[Terminal 1: Start Server]
    U --> W[Terminal 2: Test & Run pi]

    V --> X{Choose Server Script}
    X --> Y[qwen-serve<br>Qwen3-Coder-30B-A3B]
    X --> Z[qwennext-serve<br>Qwen3-Coder-Next-80B-A3B]
    X --> AA[gptoss-serve<br>GPT-OSS-20B]
    X --> AB[glmair-serve<br>GLM-4.5-Air]

    Y --> AC[llama-server with Qwen flags]
    Z --> AD[llama-server with Next flags]
    AA --> AE[llama-server with GPT-OSS flags]
    AB --> AF[llama-server with GLM flags]

    AC --> AG[Server running on port 8080]
    AD --> AG
    AE --> AG
    AF --> AG

    W --> AH[Smoke Test: qwen-test]
    W --> AI[Tool Call Test: tool-call-test]
    W --> AJ[Run pi: pi --model <alias>]

    AH --> AK[Reply with hello]
    AI --> AL[Verify structured tool_calls]
    AJ --> AM[Interactive pi session]

    AK --> AN[✓ Server working]
    AL --> AN
    AN --> AM

    AM --> AO[Code with pi<br>Exit with /exit or Ctrl-D]

    AO --> AP{Switch Models?}
    AP --> |Yes| AQ[Terminal 1: serve-stop]
    AP -->|No| AR[Continue using current model]

    AQ --> AS[Free port 8080]
    AS --> AT[Choose different server script]
    AT --> X

    AR --> AU[Daily workflow complete]

    style A fill:#e1f5fe
    style U fill:#f3e5f5
    style X fill:#fff3e0
    style AP fill:#e8f5e8
Loading

Daily use

Two commands in two terminals:

# Terminal 1 — start the server (leave it running)
qwen-serve

# Terminal 2 — drop into pi against the local model
cd /path/to/your/project
pi --model qwen3-coder-30b-a3b

Exit pi with /exit or Ctrl-D. The server keeps running across pi sessions; stop it with Ctrl-C or serve-stop. Switch models with serve-stop then a different *-serve.

Docs

  • 01 — Quickstart — install llama.cpp + pi + uv, download the default model, first run.
  • 02 — Models — the four working candidates, tested-and-rejected, next trials, choosing a quant.
  • 03 — Serving — the *-serve scripts and helpers (qwen-test, tool-call-test, serve-stop, fetch-template).
  • 04 — Tool calling — chat-template fix, verification flow, pi-web-access extension, future MCP for git/GitHub.
  • 05 — Benchmarkingllama-bench sweeps, raw output for all four models, reading the numbers.
  • 06 — Troubleshooting — common failure modes (slow downloads, npm permissions, OOM at load, Metal wired-memory cap, template bugs).
  • 07 — Alternative engines — mistral.rs (primary alternate), candle-vllm, Crane, vllm-mlx.
  • 08 — Recording the demovhs script for the README GIF.
  • 09 — Skills — what a skill is, where pi looks for them, adding project-local and global skills, reusing Claude Code / Codex skills.
  • 10 — Extensions — TypeScript extensions, install patterns, what's wired into this repo (pi-web-access + the pi-hooks bundle), writing your own, skill vs. extension recap.
  • 11 — Prompt engineering — evidence-based prompt tips per local model, reasoning levels, prompting via skills and extensions, empirical findings from this sandbox.

Layout of this repo

pi_sandbox/
├── README.md           # this file — summary + docs index
├── LICENSE             # MIT
├── docs/               # detailed docs, one per stage
│   ├── 01-quickstart.md
│   ├── 02-models.md
│   ├── 03-serving.md
│   ├── 04-tool-calling.md
│   ├── 05-benchmarking.md
│   ├── 06-troubleshooting.md
│   ├── 07-engines.md
│   ├── 08-demo.md
│   ├── 09-skills.md
│   ├── 10-extensions.md
│   └── 11-prompt-engineering.md
├── install/            # one-shot install scripts (idempotent)
│   ├── base.sh                    # copies scripts + models.json into place
│   ├── qwen3-coder-30b.sh         # default coder model, ~21 GB
│   ├── qwen3-coder-next-80b.sh    # long-context coder, ~38 GB
│   ├── gpt-oss-20b.sh             # fastest, generalist, ~12 GB
│   ├── glm-4.5-air.sh             # largest, agent-tuned, ~55 GB
│   ├── all-models.sh              # base + all four models in sequence
│   └── vllm-mlx.sh                # alternative engine (Python+MLX, no Xcode)
├── scripts/
│   ├── qwen-serve      # start llama-server for Qwen3-Coder-30B-A3B
│   ├── qwennext-serve  # alternate: Qwen3-Coder-Next-80B-A3B
│   ├── gptoss-serve    # alternate: OpenAI gpt-oss-20b
│   ├── glmair-serve    # alternate: Z.ai GLM-4.5-Air
│   ├── serve-stop      # kill whatever llama-server is on port 8080
│   ├── qwen-test       # one-shot chat-completion smoke test
│   ├── tool-call-test  # model-agnostic check that pi-style tool_calls fire
│   └── fetch-template  # fetch Qwen's official chat template (fixes tool calls)
├── bench/
│   └── throughput.sh   # llama-bench wrapper for pp/tg tok/s
├── demo/
│   ├── pi-qwen.tape    # vhs script for the README demo
│   └── smoke.tape      # minimal vhs script to verify the toolchain
└── config/
    └── models.json     # pi provider config (copy to ~/.pi/agent/)

Credits

Further reading

Three upstream catalogs that source the skills and extensions referenced in docs/09-skills.md:

  • anthropics/skills — Anthropic's official skills (docx, pdf, mcp-builder, skill-creator, frontend-design, …).
  • badlogic/pi-skills — community pi skills (brave-search, browser-tools, Google APIs, transcribe, vscode, youtube-transcript).
  • qualisero/awesome-pi-agent — meta-catalog of pi extensions, skills, tools, and prompt templates across the ecosystem.

License

MIT — see LICENSE.

Acknowledgments

This README was co-written by Claude Code and the Qwen3-Coder-30B-A3B model with Pi. The workflow diagram above was authored by GLM-4.5-Air running locally in pi via the project's mermaid skill — first-pass attempts on smaller models (gpt-oss-20b) failed the iterate-on-validator-feedback loop; GLM-Air converged with one steering message. See docs/11-prompt-engineering.md for the model-vs-task fit table this empirically supports.

About

A sandbox for running pi locally against different models (Qwen3-Coder, gpt-oss, Devstral, ...) and inference engines (llama.cpp, mistral.rs) on Apple Silicon — no API keys, all on-device.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors