This guide covers how to run inference using the flashchat CLI wrapper.

    ./flashchat

Running flashchat with no arguments launches an interactive menu where you can:
- Start a new chat session
- Resume an existing session
- Start the API server
- Configure settings
- Manage model storage
- View status
Before running Flashchat, you need:
- An Apple Silicon Mac with at least 16GB of RAM
- Enough free internal SSD space for the selected model and generated expert data
- Xcode Command Line Tools for building the Metal inference binaries
On first run, flashchat will guide you through the rest:
- Create a Python virtual environment with NumPy for setup scripts
- Build the required binaries when needed (`infer`, `chat`)
- Download the model from HuggingFace (if not present)
- Create a default configuration file
- Prompt you through extracting weights and expert data

    ./flashchat

Launches an interactive menu with options for chat, server, configuration, and more.

    ./flashchat chat                 # Start new chat session
    ./flashchat chat --resume <id>   # Resume existing session

Starts the interactive chat TUI. Automatically starts the server if not running.

    ./flashchat opencode
    ./flashchat opencode --port 8080

Starts the local Flashchat server if needed, checks for an OpenCode config at ~/.config/opencode/opencode.jsonc, and launches opencode from the repo root.

    ./flashchat serve                     # Start API server
    ./flashchat serve --stop              # Stop running server
    ./flashchat serve --stop --external   # Stop an external Flashchat infer server on the configured port
    ./flashchat serve --port 8080         # Start on specific port

Starts the OpenAI-compatible HTTP server. Server runs persistently until stopped.
If the interactive menu shows Running (external), Flashchat sees a listener on the configured server port but does not have a pid file for it. The [O]n/[O]ff menu option can offer to stop it, defaulting to no. The CLI equivalent is ./flashchat serve --stop --external, which refuses to stop processes that do not look like Flashchat's infer server.
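The status decision described above can be sketched as a small function. This is a minimal illustration, not Flashchat's actual implementation: `port_is_listening`, `server_state`, and the "Stopped" and "Running" labels are hypothetical names, and only "Running (external)" is taken from the menu text described here.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def server_state(listening: bool, has_pid_file: bool) -> str:
    """Classify the server from two observations (labels are assumptions)."""
    if not listening:
        return "Stopped"
    # A listener without a pid file was not started by this wrapper.
    return "Running" if has_pid_file else "Running (external)"


print(server_state(True, False))  # prints "Running (external)"
```

In practice the first argument would come from something like `port_is_listening("127.0.0.1", 8000)` and the second from checking whether the wrapper's pid file exists.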
When infer runs in server mode, it also appends timestamped server activity to:

    ~/.config/flashchat/logs/server.log

This is useful when flashchat starts the server in the background and you want to review request timing or errors afterward. You can tail it directly:

    tail -f ~/.config/flashchat/logs/server.log

To override the log path for a single run:

    FLASHCHAT_SERVER_LOG=/tmp/flashchat-server.log ./flashchat serve

Two server-side logging features can be enabled independently through the flashchat configuration wizard:
- `SERVER_DEBUG=1`: writes prompt/debug artifacts such as raw request bodies, assembled prompts, and final system prompts
- `SERVER_HTTP_LOG=1`: appends raw API traffic to `~/.config/flashchat/logs/http.log`

The HTTP log is useful for debugging frontend compatibility problems, SSE formatting, and unexpected request payloads without enabling the heavier prompt artifact dumps.

    ./flashchat prompt "Hello world"
    ./flashchat prompt "Explain quantum computing" --tokens 50

Runs a single prompt and prints the response.

    ./flashchat benchmark             # Show available benchmarks
    ./flashchat benchmark run         # Single expert forward pass
    ./flashchat benchmark verify      # Metal vs CPU verification
    ./flashchat benchmark bench       # Single expert benchmark (10 iterations)
    ./flashchat benchmark moe         # MoE forward (K experts, single layer)
    ./flashchat benchmark moebench    # MoE benchmark (10 iterations)
    ./flashchat benchmark full        # Full model forward (K=4)
    ./flashchat benchmark fullbench   # Full benchmark (3 iterations)

Runs performance benchmarks. Uses configuration for model paths.

    ./metal_infer/infer --render-request request.json --render-output debug/rendered-request
    ./metal_infer/infer --parse-tool-call tool_call.txt
    make tool-template-smoke

The render path parses an OpenAI-compatible request and writes the exact native Qwen system prompt, conversation text, assembled prompt, and summary counts without loading model weights or starting the server. Use it to compare nanocode/opencode request logs with Flashchat's prompt rendering before debugging live model behavior.

    ./flashchat config                # View configuration
    ./flashchat config --reset        # Re-run setup wizard (keeps sessions)
    ./flashchat config --full-reset   # Delete all data and start fresh

View and edit settings. If no config exists, defaults are used automatically. The reset option allows you to reconfigure while preserving chat sessions.
The configuration wizard selects from the models in assets/model_configs.json and shows the local setup state for each model, including downloaded HuggingFace files and generated files under <model>/flashchat/.

    ./flashchat models

Lists supported models and their local setup status. This command is read-only.

    ./flashchat manage
    ./flashchat manage --list

Manages local and offloaded model storage. The manage view shows each supported model's local/offloaded status, runtime readiness, original blob size, and total storage footprint.
Available storage actions:
- Remove original HuggingFace safetensors blobs after the generated runtime files are complete
- Delete a local model cache repo
- Offload a whole HuggingFace cache repo to `OFFLOAD_DIR`
- Fully reload an offloaded model back to the local HuggingFace cache
- Restore only the generated `<model>/flashchat/` runtime files from offload storage
Destructive actions require typing the exact model ID. Offload storage uses one global directory configured as OFFLOAD_DIR in ~/.config/flashchat/config or overridden with FLASHCHAT_OFFLOAD_DIR.

    ./flashchat status

Shows system status including model, paths, server status, and generation settings.

    ./flashchat sessions                 # List all sessions
    ./flashchat sessions --delete <id>   # Delete a session

Manage chat sessions.
Configuration is loaded from (priority highest to lowest):
- `--config FILE` (explicit override)
- `~/.config/flashchat/config` (user)
- Environment variables (`FLASHCHAT_*`)
- Registry/default values
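The lookup order listed above can be modelled as layered dictionaries. The sketch below is an illustration only, not Flashchat's loader: `env_overrides` and `resolve_settings` are hypothetical helpers, and it assumes that a `FLASHCHAT_X` environment variable maps to the config key `X`.

```python
from collections import ChainMap

PREFIX = "FLASHCHAT_"


def env_overrides(environ: dict) -> dict:
    """Keep only FLASHCHAT_* variables and strip the prefix (assumed mapping)."""
    return {k[len(PREFIX):]: v for k, v in environ.items() if k.startswith(PREFIX)}


def resolve_settings(explicit: dict, user: dict, env: dict, defaults: dict) -> dict:
    """Merge the four layers; ChainMap returns the first match, left to right."""
    return dict(ChainMap(explicit, user, env, defaults))


defaults = {"SERVER_PORT": "8000", "SERVER_HOST": "127.0.0.1"}
environ = {"FLASHCHAT_SERVER_PORT": "9000", "PATH": "/usr/bin"}
user = {"SERVER_PORT": "8080"}

settings = resolve_settings({}, user, env_overrides(environ), defaults)
print(settings["SERVER_PORT"])  # prints 8080: the user config outranks the environment
print(settings["SERVER_HOST"])  # prints 127.0.0.1: falls through to the default
```

`ChainMap` keeps each layer intact, so it stays easy to report which source supplied a given value.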
| Variable | Description | Default |
|---|---|---|
| `FLASHCHAT_MODEL` | Supported model ID | `qwen3.6-35B-A3B` |
| `FLASHCHAT_MODEL_CONFIG` | Model registry path, including the default model and active setup scripts | `assets/model_configs.json` |
| `FLASHCHAT_MODEL_PATH` | Override model path | Auto-detected |
| `FLASHCHAT_OFFLOAD_DIR` | Unified root for offloaded HuggingFace model cache repos | unset |
| `FLASHCHAT_SERVER_PORT` | Server port | `8000` |
| `FLASHCHAT_SERVER_HOST` | Server host | `127.0.0.1` |
| `FLASHCHAT_WEIGHTS_DIR` | Weights directory | `<model>/flashchat` |
| `FLASHCHAT_EXPERTS_DIR` | Experts directory | `<model>/flashchat/packed_experts` |

    # ~/.config/flashchat/config

    # Model Settings
    MODEL="qwen3.6-35B-A3B"

    # Storage Settings
    OFFLOAD_DIR=""

    # Generation Defaults
    MAX_TOKENS="8192"
    SAMPLING_PROFILE="instruct"
    REASONING="0"
    TEMPERATURE="0.7"
    TOP_P="0.8"
    TOP_K="20"
    MIN_P="0.0"
    PRESENCE_PENALTY="1.5"
    REPETITION_PENALTY="1.0"

    # Server Settings
    SERVER_PORT="8000"
    SERVER_HOST="127.0.0.1"
    SERVER_LOG_PATH="$HOME/.config/flashchat/logs/server.log"
    SYSTEM_PROMPT_CACHE="1"
    SYSTEM_PROMPT_CACHE_MAX_ENTRIES="2"

    # UI Settings
    SHOW_THINKING="0"
    COLOR_OUTPUT="1"

SERVER_LOG_PATH may be a .log file or a directory. Extensionless paths entered in the configuration wizard are treated as directories and will receive server.log plus debug artifacts when debug logging is enabled.
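The file-or-directory rule for SERVER_LOG_PATH can be illustrated with a short sketch. `resolve_server_log` is a hypothetical helper, not part of Flashchat. How paths with an extension other than `.log` are handled is not documented here, so the sketch rejects them as an explicit assumption.

```python
import os
from pathlib import Path


def resolve_server_log(raw: str) -> Path:
    """Map a SERVER_LOG_PATH value to the log file it would refer to."""
    # Expand $HOME-style variables and a leading ~ before inspecting the path.
    path = Path(os.path.expandvars(raw)).expanduser()
    if path.suffix == ".log":
        return path  # already names a log file
    if path.suffix == "":
        return path / "server.log"  # extensionless: treated as a directory
    # Assumption: other extensions are rejected because the behaviour is unspecified.
    raise ValueError(f"unsupported log path: {raw}")


print(resolve_server_log("/tmp/flashchat-logs"))        # /tmp/flashchat-logs/server.log
print(resolve_server_log("/tmp/flashchat-server.log"))  # /tmp/flashchat-server.log
```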
SYSTEM_PROMPT_CACHE stores compressed, model-local snapshots under <model>/flashchat/system_prompt_cache/ so repeated server runs with the same harness/system prompt can skip the expensive system prompt prefill. The cache is bounded by SYSTEM_PROMPT_CACHE_MAX_ENTRIES.
SAMPLING_PROFILE selects a model-supported generation profile from assets/model_configs.json. Use custom to edit REASONING, TEMPERATURE, TOP_P, TOP_K, MIN_P, PRESENCE_PENALTY, and REPETITION_PENALTY directly.
Flashchat records a runtime signature for servers it starts. If model selection, model registry data, server-affecting settings, or infer source/binary state changes while the server is running, the next Flashchat-managed server use restarts the owned server before reuse.
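One way to picture such a signature is a hash over every input that should force a restart. The sketch below is an assumption about the general technique, not Flashchat's actual format: `runtime_signature`, `needs_restart`, and the `infer_state` string (for example a build identifier) are hypothetical.

```python
import hashlib
import json


def runtime_signature(model: str, registry_text: str, settings: dict, infer_state: str) -> str:
    """Hash the inputs that affect a running server into one hex digest."""
    payload = json.dumps(
        {
            "model": model,
            "registry": hashlib.sha256(registry_text.encode("utf-8")).hexdigest(),
            "settings": settings,
            "infer": infer_state,
        },
        sort_keys=True,  # stable key order so equal inputs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_restart(recorded: str, current: str) -> bool:
    """An owned server is restarted when its recorded signature is stale."""
    return recorded != current


before = runtime_signature("qwen3.6-35B-A3B", "{}", {"SERVER_PORT": "8000"}, "build-1")
after = runtime_signature("qwen3.6-35B-A3B", "{}", {"SERVER_PORT": "8080"}, "build-1")
print(needs_restart(before, after))  # prints True: the port changed
```

Comparing a single digest keeps the check cheap, at the cost of not saying which input changed.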
When running the server (./flashchat serve):
| Endpoint | Method | Description |
|---|---|---|
| `/v1/chat/completions` | POST | Chat completions (SSE streaming) |
| `/v1/responses` | POST | Responses API compatibility endpoint |
| `/v1` | GET | Lightweight service info / compatibility probe |
| `/v1/models` | GET | List available models |
| `/health` | GET | Health check |
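For clients that consume the streaming form of `/v1/chat/completions`, the following sketch shows one way to collect text from SSE lines. It assumes the chunks follow the usual OpenAI chat-completions shape (`choices[].delta.content`, terminated by `data: [DONE]`); Flashchat's exact chunk fields are not documented here, and `iter_stream_text` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```

The same generator can be fed decoded lines from any HTTP client that exposes the response body line by line.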
The examples below assume the default SERVER_HOST="127.0.0.1" and SERVER_PORT="8000".

    curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3.5-397b-a17b",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 100,
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0
      }'

    curl -X POST http://localhost:8000/v1/responses \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3.5-397b-a17b",
        "input": "Summarize why Flashchat works on Apple Silicon.",
        "max_output_tokens": 256,
        "temperature": 0.2
      }'

    curl -X POST http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3.5-397b-a17b",
        "messages": [{"role": "user", "content": "Give a short answer only."}],
        "reasoning": false,
        "stream": false
      }'

From the project root:

    make cli-smoke
    make api-smoke
    make test

make cli-smoke runs the Flashchat CLI smoke test.
make api-smoke checks:
- `GET /health`
- `GET /v1`
- `GET /v1/models`
- `POST /v1/chat/completions` with and without streaming
- `POST /v1/responses` with and without streaming
- tool-call round trips for both endpoints

make test runs both smoke tests.

If nothing is already listening on the configured port, the script starts metal_infer/infer --serve automatically.

When enabled, it also appends lightweight timing rows to:

    assets/api_perf_log.tsv

Each row records the date, branch, commit, endpoint scenario, request mode, duration, and derived stream tok/s when available. This is meant for spotting regressions over time, not for scientific benchmarking. The log also records the hostname, hardware model, RAM size, and a compact CPU/GPU core summary so results from different Apple Silicon machines can be compared later.
For OpenCode, a working provider entry looks like:

    {
      "$schema": "https://opencode.ai/config.json",
      "provider": {
        "flashchat": {
          "name": "flashchat",
          "npm": "@ai-sdk/openai-compatible",
          "models": {
            "mlx-community/Qwen3.5-397B-A17B-4bit": {
              "name": "Qwen3.5-397B-A17B-4bit",
              "tools": true,
              "limit": {
                "context": 220000,
                "output": 16000
              }
            }
          },
          "options": {
            "baseURL": "http://127.0.0.1:8000/v1"
          }
        }
      }
    }

If you changed SERVER_HOST or SERVER_PORT, use those configured values in baseURL.
| File | Size | Description |
|---|---|---|
| `<model>/flashchat/model_weights.bin` | 5.5GB | Non-expert weights (mmap'd) |
| `<model>/flashchat/model_weights.json` | 371KB | Manifest for weight loading |
| `<model>/flashchat/vocab.bin` | 7.8MB | Tokenizer vocabulary |
| `<model>/flashchat/expert_index.json` | - | Safetensors expert lookup index |
| `<model>/flashchat/packed_experts/` | 218GB | Expert weights |
| `~/.config/flashchat/config` | - | User configuration |
| `~/.config/flashchat/sessions/` | - | Chat session history |
| `~/.config/flashchat/history` | - | Interactive prompt history |
| `~/.config/flashchat/system.md` | - | Optional custom system prompt |
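
The sizes in the table above give a rough lower bound on the free space the generated files need. The sketch below checks that bound with the standard library. `free_space_ok` is a hypothetical helper, the figures are the ones listed in the table (other models will differ), and it assumes decimal gigabytes.

```python
import shutil

GB = 10**9  # assumption: the table uses decimal gigabytes


def free_space_ok(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# model_weights.bin (5.5GB) + packed_experts/ (218GB); the smaller files are ignored.
required = int((5.5 + 218) * GB)
print(required)  # prints 223500000000
print(free_space_ok(".", required))
```

The original HuggingFace download needs additional space on top of this until the safetensors blobs are removed.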