dusterbloom · dusterbloom · May 4, 2026 · Apr 24, 2026 · Apr 24, 2026 · Apr 24, 2026
diff --git a/README.md b/README.md
@@ -61,6 +61,174 @@ higgs serve --model mlx-community/Qwen3.6-35B-A3B-4bit
 Send a request to the local endpoint:
 
 ```bash
+<<<<<<< HEAD
+=======
+higgs init        # create ~/.config/higgs/config.toml
+higgs serve       # start with config
+higgs start       # start as background daemon
+higgs attach      # attach TUI dashboard to running daemon
+higgs stop        # stop daemon
+```
+
+### Profiles
+
+Named profiles let you maintain multiple configurations and run multiple instances simultaneously:
+
+```bash
+higgs init --profile dev              # create config.dev.toml
+higgs init --profile prod             # create config.prod.toml
+higgs serve --profile dev             # foreground with dev config
+higgs start --profile dev             # daemon with dev config (separate PID/log)
+higgs start --profile prod            # daemon with prod config (different port)
+higgs attach --profile dev            # attach TUI to dev instance
+higgs stop --profile dev              # stop only the dev instance
+higgs doctor --profile prod           # validate prod config
+```
+
+Each profile gets isolated runtime files (`higgs.<profile>.pid`, `higgs.<profile>.log`, `metrics.<profile>.jsonl`). Profiles must use different ports (configured in each profile's config file). `--profile` and `--config` are mutually exclusive.
+
+## Features
+
+### Local inference
+- **OpenAI + Anthropic APIs** -- chat completions, text completions, embeddings, messages
+- **Structured output** -- `json_schema` response format (100% schema compliance)
+- **Reasoning models** -- `<think>` tag extraction to `reasoning_content`
+- **Continuous batching** -- 755 tok/s aggregate at 8 concurrent requests
+- **Radix tree prefix cache** -- shared prefix reuse across requests
+- **Vision** -- multimodal image+text (LLaVA-Qwen2)
+- **11 architectures** -- LLaMA, Mistral, Qwen2/3, Qwen3-MoE, Qwen3-Next, Gemma 2, Phi-3, Starcoder2, DeepSeek-V2, LLaVA-Qwen2
+
+### Gateway
+- **Remote providers** -- proxy requests to OpenAI, Anthropic, Ollama, or any OpenAI-compatible API
+- **Format translation** -- send OpenAI requests to Anthropic providers (and vice versa) with automatic conversion of request/response formats, including streaming
+- **Pattern routing** -- regex-based model name matching to route requests to the right provider
+- **Model rewriting** -- map model aliases to upstream model names
+- **Auto-router** -- classify requests using a local LLM to pick the best provider
+- **Metrics dashboard** -- TUI with live request rates, latency, token throughput, and error tracking
+- **Daemon mode** -- `higgs start`/`stop`/`attach` for background operation
+- **Config management** -- `higgs config get/set`, `higgs doctor` for validation
+
+## Configuration
+
+### Simple mode (CLI flags)
+
+| CLI Flag | Env Variable | Default | Description |
+|---|---|---|---|
+| `--model` | `HIGGS_MODELS` | *(required)* | Model path or HF ID (repeatable) |
+| `--host` | `HIGGS_HOST` | `0.0.0.0` | Bind address |
+| `--port` | `HIGGS_PORT` | `8000` | Bind port |
+| `--max-tokens` | `HIGGS_MAX_TOKENS` | `32768` | Max generation tokens |
+| `--api-key` | `HIGGS_API_KEY` | *(none)* | Bearer token for auth |
+| `--rate-limit` | `HIGGS_RATE_LIMIT` | `0` | Requests/min per client |
+| `--timeout` | `HIGGS_TIMEOUT` | `300` | Request timeout (seconds) |
+| `--batch` | -- | `false` | Enable continuous batching |
+
+### Gateway mode (config file)
+
+Run `higgs init` to create `~/.config/higgs/config.toml`:
+
+```toml
+[server]
+host = "0.0.0.0"
+port = 8000
+# max_tokens = 32768
+# timeout = 300.0
+# api_key = "sk-..."
+
+# --- Local models ---
+[[models]]
+path = "mlx-community/Llama-3.2-1B-Instruct-4bit"
+# name = "llama"     # optional friendly name (used as engine key and for auto_router lookup)
+# batch = false
+# draft_model = "mlx-community/Llama-3.2-1B-Instruct-4bit"  # speculative decoding
+# num_draft = 8      # draft tokens per speculative cycle (default: 8)
+
+# --- Remote providers ---
+[provider.anthropic]
+url = "https://api.anthropic.com"
+format = "anthropic"
+
+[provider.openai]
+url = "https://api.openai.com"
+format = "openai"
+
+[provider.ollama]
+url = "http://localhost:11434"
+strip_auth = true
+
+# --- Routes ---
+# First regex match wins. Requests matching a local model name are served locally.
+
+[[routes]]
+pattern = "claude-.*"
+provider = "anthropic"
+
+[[routes]]
+pattern = "gpt-.*"
+provider = "openai"
+
+# Model rewriting: requests for "my-alias" are sent to the provider as "actual-model-name"
+# [[routes]]
+# pattern = "my-alias"
+# provider = "openai"
+# model = "gpt-4o"
+
+# --- Default route ---
+[default]
+provider = "higgs"   # "higgs" = local models only; set to a provider name to proxy unmatched requests
+
+# --- Auto router (optional) ---
+# Classify requests with a local LLM to pick the best provider automatically.
+# The model field can reference a model by name or path.
+# [auto_router]
+# enabled = true
+# model = "llama"    # matches [[models]] name or path
+# timeout_ms = 2000
+
+# --- Metrics & dashboard ---
+[retention]
+enabled = true
+minutes = 60
+
+[logging.metrics]
+enabled = true
+# path = "~/.config/higgs/logs/metrics.jsonl"
+# max_size_mb = 50
+# max_files = 5
+```
+
+#### Provider options
+
+| Field | Type | Default | Description |
+|---|---|---|---|
+| `url` | string | *(required)* | Base URL of the upstream API |
+| `format` | `"openai"` or `"anthropic"` | `"openai"` | API format the provider speaks |
+| `api_key` | string | *(none)* | API key to inject into proxied requests |
+| `strip_auth` | bool | `false` | Remove the client's Authorization header before proxying |
+| `stub_count_tokens` | bool | `false` | Return a stub for `/v1/messages/count_tokens` |
+
+#### Route options
+
+| Field | Type | Description |
+|---|---|---|
+| `pattern` | regex | Match against the `model` field in requests |
+| `provider` | string | Provider name to forward to |
+| `model` | string | Rewrite the model field before forwarding |
+| `name` | string | Human-readable label (used by auto-router) |
+| `description` | string | Route description (used by auto-router for classification) |
+
+## API
+
+- **OpenAI**: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`
+- **Anthropic**: `/v1/messages`, `/v1/messages/count_tokens`
+- **Metrics**: `/metrics` (JSON)
+- **Health**: `/health`
+
+Format translation is transparent: an OpenAI-format request sent to higgs is converted to Anthropic format when the matched route points to an Anthropic provider (and vice versa), including streaming responses.
+
+```bash
+# Local model
+>>>>>>> feef8e47 (feat(doctor): validate draft_model path and batch incompatibility)
 curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{

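The routing order described in the README's config comments (local model names served locally, first matching `[[routes]]` entry wins, `[default]` as the fallback) can be sketched roughly as follows. This is a std-only illustration, not the gateway's code: `Target`, `Route`, and `resolve` are hypothetical names, plain predicates stand in for the compiled regexes, and checking local model names before the route table is an assumption the README does not spell out.

```rust
// Sketch of the routing order described in the README config comments.
// All names here are hypothetical and not taken from the higgs source.

/// Where a request ends up.
#[derive(Debug, PartialEq)]
enum Target {
    /// Handed to the local engine.
    Local,
    /// Proxied to a named provider, with the model name to send upstream.
    Remote { provider: String, model: String },
}

/// One `[[routes]]` entry. `matches` stands in for the compiled `pattern` regex.
struct Route {
    matches: fn(&str) -> bool,
    provider: &'static str,
    /// Optional rewrite of the model field (the `model` key of a route).
    rewrite: Option<&'static str>,
}

fn resolve(
    model: &str,
    local_models: &[&str],
    routes: &[Route],
    default_provider: &str,
) -> Target {
    // Assumption: local model names are checked before the route table.
    if local_models.contains(&model) {
        return Target::Local;
    }
    // First matching route wins; later routes are never consulted.
    if let Some(route) = routes.iter().find(|r| (r.matches)(model)) {
        return Target::Remote {
            provider: route.provider.to_string(),
            model: route.rewrite.unwrap_or(model).to_string(),
        };
    }
    // No route matched: "higgs" means local only (the local engine may still
    // reject an unknown model); any other value names a provider to proxy to.
    if default_provider == "higgs" {
        Target::Local
    } else {
        Target::Remote {
            provider: default_provider.to_string(),
            model: model.to_string(),
        }
    }
}
```

With a route table that lists a `my-alias` rewrite ahead of a broader `my-` prefix route, `resolve("my-alias", ...)` returns the first entry's provider and the rewritten model name, which is why route order in the config file matters.
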
diff --git a/crates/higgs-engine/src/lib.rs b/crates/higgs-engine/src/lib.rs
@@ -13,6 +13,7 @@ pub mod reasoning_parser;
 pub mod scheduler;
 pub mod simple;
 pub mod spec_prefill;
+pub mod speculative;
 pub mod tool_parser;
 
 pub use tokenizers;
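
For readers unfamiliar with what the `draft_model` and `num_draft` options in the README control, here is a minimal sketch of one greedy speculative decoding cycle. It is the textbook greedy variant written against toy closures, not the contents of the new `speculative` module: `speculative_step` and its signature are hypothetical, and the real engine may use a sampling-based acceptance rule and a single batched verification pass.

```rust
// Toy greedy speculative decoding cycle. Names are hypothetical; this is not
// the code in crates/higgs-engine/src/speculative.rs.

/// Runs one speculative cycle and returns the tokens to append to `prefix`.
/// A "model" here is any function from a token prefix to the next token.
fn speculative_step(
    prefix: &[u32],
    draft: &dyn Fn(&[u32]) -> u32,
    target: &dyn Fn(&[u32]) -> u32,
    num_draft: usize,
) -> Vec<u32> {
    // 1. The draft model proposes `num_draft` tokens autoregressively.
    let mut ctx = prefix.to_vec();
    let mut proposed = Vec::with_capacity(num_draft);
    for _ in 0..num_draft {
        let token = draft(ctx.as_slice());
        ctx.push(token);
        proposed.push(token);
    }

    // 2. The target model checks each proposed position in order. A real
    //    engine would score all positions in one forward pass; the loop is
    //    written out here for clarity.
    let mut ctx = prefix.to_vec();
    let mut accepted = Vec::new();
    for &token in &proposed {
        let expected = target(ctx.as_slice());
        if expected != token {
            // First mismatch: keep the target's token and end the cycle.
            accepted.push(expected);
            return accepted;
        }
        accepted.push(token);
        ctx.push(token);
    }

    // 3. Every draft token matched, so the target adds one extra token.
    accepted.push(target(ctx.as_slice()));
    accepted
}
```

Under this greedy rule the output is identical to what the target model would have generated on its own. Each cycle yields between 1 and `num_draft + 1` tokens, so the speedup depends on how often the draft model agrees with the target.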