Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
168 changes: 168 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,174 @@ higgs serve --model mlx-community/Qwen3.6-35B-A3B-4bit
Send a request to the local endpoint:

```bash
<<<<<<< HEAD
=======
higgs init # create ~/.config/higgs/config.toml
higgs serve # start with config
higgs start # start as background daemon
higgs attach # attach TUI dashboard to running daemon
higgs stop # stop daemon
```

### Profiles

Named profiles let you maintain multiple configurations and run multiple instances simultaneously:

```bash
higgs init --profile dev # create config.dev.toml
higgs init --profile prod # create config.prod.toml
higgs serve --profile dev # foreground with dev config
higgs start --profile dev # daemon with dev config (separate PID/log)
higgs start --profile prod # daemon with prod config (different port)
higgs attach --profile dev # attach TUI to dev instance
higgs stop --profile dev # stop only the dev instance
higgs doctor --profile prod # validate prod config
```

Each profile gets isolated runtime files (`higgs.<profile>.pid`, `higgs.<profile>.log`, `metrics.<profile>.jsonl`). Profiles must use different ports (configured in each profile's config file). `--profile` and `--config` are mutually exclusive.

## Features

### Local inference
- **OpenAI + Anthropic APIs** -- chat completions, text completions, embeddings, messages
- **Structured output** -- `json_schema` response format (100% schema compliance)
- **Reasoning models** -- `<think>` tag extraction to `reasoning_content`
- **Continuous batching** -- 755 tok/s aggregate at 8 concurrent requests
- **Radix tree prefix cache** -- shared prefix reuse across requests
- **Vision** -- multimodal image+text (LLaVA-Qwen2)
- **11 architectures** -- LLaMA, Mistral, Qwen2/3, Qwen3-MoE, Qwen3-Next, Gemma 2, Phi-3, Starcoder2, DeepSeek-V2, LLaVA-Qwen2

### Gateway
- **Remote providers** -- proxy requests to OpenAI, Anthropic, Ollama, or any OpenAI-compatible API
- **Format translation** -- send OpenAI requests to Anthropic providers (and vice versa) with automatic conversion of request/response formats, including streaming
- **Pattern routing** -- regex-based model name matching to route requests to the right provider
- **Model rewriting** -- map model aliases to upstream model names
- **Auto-router** -- classify requests using a local LLM to pick the best provider
- **Metrics dashboard** -- TUI with live request rates, latency, token throughput, and error tracking
- **Daemon mode** -- `higgs start`/`stop`/`attach` for background operation
- **Config management** -- `higgs config get/set`, `higgs doctor` for validation

## Configuration

### Simple mode (CLI flags)

| CLI Flag | Env Variable | Default | Description |
|---|---|---|---|
| `--model` | `HIGGS_MODELS` | *(required)* | Model path or HF ID (repeatable) |
| `--host` | `HIGGS_HOST` | `0.0.0.0` | Bind address |
| `--port` | `HIGGS_PORT` | `8000` | Bind port |
| `--max-tokens` | `HIGGS_MAX_TOKENS` | `32768` | Max generation tokens |
| `--api-key` | `HIGGS_API_KEY` | *(none)* | Bearer token for auth |
| `--rate-limit` | `HIGGS_RATE_LIMIT` | `0` | Requests/min per client |
| `--timeout` | `HIGGS_TIMEOUT` | `300` | Request timeout (seconds) |
| `--batch` | -- | `false` | Enable continuous batching |

### Gateway mode (config file)

Run `higgs init` to create `~/.config/higgs/config.toml`:

```toml
[server]
host = "0.0.0.0"
port = 8000
# max_tokens = 32768
# timeout = 300.0
# api_key = "sk-..."

# --- Local models ---
[[models]]
path = "mlx-community/Llama-3.2-1B-Instruct-4bit"
# name = "llama" # optional friendly name (used as engine key and for auto_router lookup)
# batch = false
# draft_model = "mlx-community/Llama-3.2-1B-Instruct-4bit" # speculative decoding
# num_draft = 8 # draft tokens per speculative cycle (default: 8)

# --- Remote providers ---
[provider.anthropic]
url = "https://api.anthropic.com"
format = "anthropic"

[provider.openai]
url = "https://api.openai.com"
format = "openai"

[provider.ollama]
url = "http://localhost:11434"
strip_auth = true

# --- Routes ---
# First regex match wins. Requests matching a local model name are served locally.

[[routes]]
pattern = "claude-.*"
provider = "anthropic"

[[routes]]
pattern = "gpt-.*"
provider = "openai"

# Model rewriting: requests for "my-alias" are sent to the provider as "actual-model-name"
# [[routes]]
# pattern = "my-alias"
# provider = "openai"
# model = "gpt-4o"

# --- Default route ---
[default]
provider = "higgs" # "higgs" = local models only; set to a provider name to proxy unmatched requests

# --- Auto router (optional) ---
# Classify requests with a local LLM to pick the best provider automatically.
# The model field can reference a model by name or path.
# [auto_router]
# enabled = true
# model = "llama" # matches [[models]] name or path
# timeout_ms = 2000

# --- Metrics & dashboard ---
[retention]
enabled = true
minutes = 60

[logging.metrics]
enabled = true
# path = "~/.config/higgs/logs/metrics.jsonl"
# max_size_mb = 50
# max_files = 5
```

#### Provider options

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | *(required)* | Base URL of the upstream API |
| `format` | `"openai"` or `"anthropic"` | `"openai"` | API format the provider speaks |
| `api_key` | string | *(none)* | API key to inject into proxied requests |
| `strip_auth` | bool | `false` | Remove the client's Authorization header before proxying |
| `stub_count_tokens` | bool | `false` | Return a stub for `/v1/messages/count_tokens` |

#### Route options

| Field | Type | Description |
|---|---|---|
| `pattern` | regex | Match against the `model` field in requests |
| `provider` | string | Provider name to forward to |
| `model` | string | Rewrite the model field before forwarding |
| `name` | string | Human label (used by auto-router) |
| `description` | string | Route description (used by auto-router for classification) |

## API

**OpenAI**: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`
**Anthropic**: `/v1/messages`, `/v1/messages/count_tokens`
**Metrics**: `/metrics` (JSON)
**Health**: `/health`

Format translation works transparently: send an OpenAI-format request to higgs and it will translate to Anthropic format if the matched route points to an Anthropic provider (and vice versa), including streaming responses.

```bash
# Local model
>>>>>>> feef8e47 (feat(doctor): validate draft_model path and batch incompatibility)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
Expand Down
1 change: 1 addition & 0 deletions crates/higgs-engine/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ pub mod reasoning_parser;
pub mod scheduler;
pub mod simple;
pub mod spec_prefill;
pub mod speculative;
pub mod tool_parser;

pub use tokenizers;
Loading