A production-ready LLM inference load balancer that fronts a heterogeneous backend cluster of Ollama and vLLM inference nodes, providing a unified OpenAI-compatible API surface with native Ollama compatibility.
For comprehensive documentation covering the full application, API reference, configuration, and more, see docs/index.md.
Interactive API docs are also available at /docs (Swagger UI) and /redoc (ReDoc) when the application is running.
- Unified API Gateway: OpenAI-compatible
/v1/*, Ollama/api/*, and Anthropic/anthropic/v1/*endpoints - API Dialect Translation: Automatic translation between Ollama and vLLM formats
- Fair-Share Scheduling: Weighted Deficit Round Robin with burst credits
- Multi-Modal Support: Text, embeddings, and vision-language models
- Structured Outputs: JSON schema validation across all backends
- Quota Management: Per-user token budgets with role-based weights
- Node/Backend Architecture: Separate physical GPU nodes from inference endpoints — one sidecar poll per node, GPU-to-backend assignment
- GPU Sidecar Agent: Lightweight per-node agent for real-time GPU metrics (utilization, memory, temperature, power)
- Real-Time Telemetry: GPU/memory/utilization monitoring per node and per backend
- Drain Mode: Gracefully take backends offline — stop new requests while in-flight requests finish, then auto-disable
- Health Alerts: Admin dashboard shows a prominent warning banner when any backend is unhealthy or any node is offline
- Tool Calling: Function calling support across all API surfaces with cross-engine translation
- Thinking/Reasoning Mode: Control reasoning depth on supported models (qwen3.5, qwen3, gpt-oss)
- Web Search: Standalone search API (
/v1/search) with pluggable providers (Brave Search, SearXNG) - Agentic AI Integrations: MCP servers and agent skills for Claude Code, ForgeCode, OpenCode, Codex, Cursor, and others (
agentic_ai/) - Azure AD SSO: Optional single sign-on with JIT user provisioning from Microsoft Entra ID
- Voice API: Public TTS and STT endpoints (
/v1/audio/speech,/v1/audio/transcriptions) with API key auth and quota tracking - Full Audit Logging: All prompts, responses, and artifacts stored for review
- Dual Dashboards: Public status + authenticated user/admin interfaces with dark mode
- Docker and Docker Compose
- Python 3.11+ (for local development)
# Copy environment template
cp .env.example .env
# Edit .env with your settings (database passwords, secret key, etc.)
nano .env
# Start all services
docker compose up --build
# In another terminal, run migrations
docker compose exec app alembic upgrade head
# Seed development data
docker compose exec app python scripts/seed_dev_data.pyThe gateway runs on http://localhost:8000 by default.
# Chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
# Embeddings
curl -X POST http://localhost:8000/v1/embeddings \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"input": "Hello world"
}'
# List models
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer your-api-key"# Chat via Anthropic Messages API
curl -X POST http://localhost:8000/anthropic/v1/messages \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"max_tokens": 500,
"messages": [{"role": "user", "content": "Hello!"}]
}'Anthropic SDK clients (Python, TypeScript) can point directly at MindRouter:
import anthropic
client = anthropic.Anthropic(
base_url="http://localhost:8000/anthropic",
api_key="your-api-key",
)
message = client.messages.create(
model="llama3.2",
max_tokens=500,
messages=[{"role": "user", "content": "Hello!"}],
)# Chat via Ollama API
curl -X POST http://localhost:8000/api/chat \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
# Generate via Ollama API
curl -X POST http://localhost:8000/api/generate \
-H "Authorization: Bearer your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Why is the sky blue?"
}'- Public Dashboard:
http://localhost:8000/- Cluster status, request API key - User Dashboard:
http://localhost:8000/dashboard- Usage, keys, quota requests - Chat Interface:
http://localhost:8000/chat- Full-featured chat with file upload, web search, streaming - Admin Dashboard:
http://localhost:8000/admin- Full system control
After running the seed script:
| User | Password | Role |
|---|---|---|
| admin | admin123 | admin |
| faculty1 | faculty123 | faculty |
| staff1 | staff123 | staff |
| student1 | student123 | student |
┌─────────────────────────────────────────────────────────────────┐
│ MindRouter Gateway │
├─────────────────────────────────────────────────────────────────┤
│ ┌────────┐ ┌────────┐ ┌───────────┐ ┌───────┐ ┌──────────────┐ │
│ │ OpenAI │ │ Ollama │ │ Anthropic │ │ Admin │ │ Dashboard │ │
│ │ /v1/* │ │ /api/* │ │/anthropic/│ │ API │ │ (Bootstrap) │ │
│ └───┬────┘ └───┬────┘ └─────┬─────┘ └─-──┬──┘ └───-───┬──────┘ │
│ │ │ │ │ │ │
│ └──────────┴────────────┴────────────┴────────────┘ │
│ │ │
│ ┌───────────────────────────┴───────────────────────────────┐ │
│ │ Translation Layer │ │
│ │ OpenAI/Ollama/Anthropic ←→ Canonical ←→ Ollama/vLLM │ │
│ └───────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┴───────────────────────────────┐ │
│ │ Fair-Share Scheduler (WDRR) │ │
│ │ • Per-user queues with deficit counters │ │
│ │ • Role-based weights (faculty > staff > student) │ │
│ │ • Burst credits for idle cluster utilization │ │
│ └───────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌───────────────────────────┴───────────────────────────────┐ │
│ │ Backend Registry │ │
│ │ • Node + Backend separation (1 node → N backends) │ │
│ │ • Per-node sidecar polling (deduplicated) │ │
│ │ • GPU-to-backend assignment via gpu_indices │ │
│ │ • Health monitoring & model residency tracking │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
┌──────────────────────┼──────────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ 4x A100 │ │ 2x L40S │ │ 2x RTX4090 │
│ ┌─────────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Sidecar │ │ │ │ Sidecar │ │ │ │ Sidecar │ │
│ │ :8007 │ │ │ │ :8007 │ │ │ │ :8007 │ │
│ └─────────┘ │ │ └─────────┘ │ │ └─────────┘ │
│ ┌────┬────┐ │ │ ┌─────────┐ │ │ ┌─────────┐ │
│ │vLLM│vLLM│ │ │ │ Ollama │ │ │ │ Ollama │ │
│ │0,1 │2,3 │ │ │ │ (all) │ │ │ │ (all) │ │
│ └────┴────┘ │ │ └─────────┘ │ │ └─────────┘ │
└─────────────┘ └─────────────┘ └─────────────┘
Key concept: A Node represents a physical GPU server running a sidecar agent. A Backend is an inference endpoint (Ollama or vLLM instance) running on a node. Multiple backends can share a node, each assigned specific GPUs via gpu_indices.
See .env.example for all available configuration options.
Key settings:
| Variable | Description | Default |
|---|---|---|
DATABASE_URL |
MariaDB connection string | Required |
SECRET_KEY |
Session/JWT signing key | Required |
REDIS_URL |
Redis for rate limiting (optional) | None |
ARTIFACT_STORAGE_PATH |
Path for uploaded files | /data/artifacts |
SCHEDULER_FAIRNESS_WINDOW |
Rolling window for usage tracking | 300 (5 min) |
GPU_AGENT_PORT |
Sidecar agent listen port (sidecar-side env var, not a MindRouter setting) | 8007 |
MindRouter separates nodes (physical GPU servers) from backends (inference endpoints). Register a node first, then attach backends to it.
# Step 1: Register a GPU node (physical server running the sidecar agent)
curl -X POST http://localhost:8000/api/admin/nodes/register \
-H "Authorization: Bearer admin-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "gpu-server-1",
"hostname": "gpu1.example.com",
"sidecar_url": "http://gpu1.example.com:8007"
}'
# Step 2: Register a backend on that node (all GPUs)
curl -X POST http://localhost:8000/api/admin/backends/register \
-H "Authorization: Bearer admin-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "ollama-gpu1",
"url": "http://gpu1.example.com:11434",
"engine": "ollama",
"max_concurrent": 4,
"node_id": 1
}'
# Or assign specific GPUs to a backend (multi-backend node)
curl -X POST http://localhost:8000/api/admin/backends/register \
-H "Authorization: Bearer admin-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "vllm-gpu1-01",
"url": "http://gpu1.example.com:8000",
"engine": "vllm",
"max_concurrent": 16,
"node_id": 1,
"gpu_indices": [0, 1]
}'Backends without a node_id still work as standalone endpoints (no GPU telemetry).
The max_concurrent value registered in MindRouter must match the concurrency limit configured on the inference engine itself:
| Engine | Engine Setting | MindRouter Setting |
|---|---|---|
| vLLM | --max-num-seqs N |
"max_concurrent": N |
| Ollama | OLLAMA_NUM_PARALLEL=N |
"max_concurrent": N |
Why this matters: MindRouter uses max_concurrent to decide how many requests to route to a backend. If MindRouter thinks a backend can handle 8 concurrent requests but vLLM is configured with --max-num-seqs 4, the extra requests queue silently inside vLLM. MindRouter can't see this hidden queue, so it keeps routing requests there instead of spreading load to other backends. The result is uneven load distribution and unpredictable latency — the fair-share scheduler is effectively bypassed for those excess requests.
MindRouter auto-discovers each model's context length and caps it at min(model_max_context, 32768) to prevent small models from consuming excessive VRAM. For every Ollama request, MindRouter injects num_ctx matching the configured context_length. By default, user-supplied num_ctx values are overridden to prevent GPU memory oversubscription — this can be toggled in Site Settings. Admins can also set context_length_override per model in the admin UI. Ollama 0.17+ automatically adjusts downward if the requested context doesn't fit in GPU memory.
# Create virtual environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install -e ".[dev]"
# Start MariaDB (via docker)
docker compose up -d mariadb redis
# Run migrations
alembic upgrade head
# Seed data
python scripts/seed_dev_data.py
# Start development server
uvicorn backend.app.main:app --reload --host 0.0.0.0 --port 8000# Unit tests
pytest backend/app/tests/unit -v
# Integration tests (requires docker)
pytest backend/app/tests/integration -v
# End-to-end tests
pytest backend/app/tests/e2e -v
# All tests with coverage
pytest --cov=backend/app --cov-report=htmlmake dev # Start development server
make test # Run all tests
make test-unit # Run unit tests only
make lint # Run linters
make format # Format code
make migrate # Run database migrations
make seed # Seed development data
make docker-up # Start docker compose stack
make docker-down # Stop docker compose stack
make migrate-down # Rollback one Alembic migration
make demo # Run fairness demo script
make docker-shell # Open bash shell in the app container
make docker-seed # Seed development data via docker-compose
make docker-migrate # Run Alembic migrations via docker-compose| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions |
OpenAI-compatible chat |
| POST | /v1/completions |
OpenAI-compatible completion |
| POST | /v1/embeddings |
OpenAI-compatible embeddings |
| POST | /v1/rerank |
Rerank documents against a query |
| POST | /v1/score |
Score similarity between text pairs |
| GET | /v1/models |
List available models |
| POST | /api/chat |
Ollama-compatible chat |
| POST | /api/generate |
Ollama-compatible generate |
| GET | /api/tags |
Ollama-compatible model list |
| POST | /anthropic/v1/messages |
Anthropic Messages API compatible |
| Method | Path | Description |
|---|---|---|
| POST | /v1/audio/speech |
OpenAI-compatible text-to-speech |
| POST | /v1/audio/transcriptions |
OpenAI-compatible speech-to-text |
| Method | Path | Description |
|---|---|---|
| GET | /healthz |
Liveness probe |
| GET | /readyz |
Readiness probe |
| GET | /metrics |
Prometheus metrics |
| GET | /status |
Cluster status summary (JSON) |
| GET | /api/cluster/total-tokens |
Total tokens served |
| GET | /api/cluster/trends |
Token and user trends |
| GET | /api/cluster/throughput |
Real-time token throughput |
| Method | Path | Description |
|---|---|---|
| POST | /api/admin/nodes/register |
Register a GPU node |
| GET | /api/admin/nodes |
List all nodes |
| DELETE | /api/admin/nodes/{id} |
Remove a node |
| POST | /api/admin/nodes/{id}/refresh |
Refresh node sidecar |
| POST | /api/admin/backends/register |
Register new backend |
| POST | /api/admin/backends/{id}/disable |
Disable backend |
| POST | /api/admin/backends/{id}/enable |
Enable backend |
| POST | /admin/backends/{id}/drain |
Drain backend (dashboard-only) |
| POST | /api/admin/backends/{id}/refresh |
Refresh capabilities |
| POST | /api/admin/backends/{id}/ollama/pull |
Pull a model to Ollama backend |
| POST | /api/admin/backends/{id}/ollama/delete |
Delete a model from Ollama backend |
| GET | /api/admin/telemetry/overview |
Cluster telemetry overview |
| GET | /api/admin/telemetry/nodes/{id}/history |
Node GPU history |
| GET | /api/admin/telemetry/export |
Export telemetry as JSON or CSV |
| GET | /api/admin/queue |
View scheduler queue |
| GET | /api/admin/audit/search |
Search audit logs |
| GET | /api/admin/conversations/export |
Export conversations |
MindRouter implements Weighted Deficit Round Robin (WDRR) for fair resource allocation:
- Share Weights: faculty=3, staff=2, student=1, admin=10
- Deficit Counters: Track service debt per user
- Burst Credits: Allow full cluster use when idle
- Backend Scoring: Model residency, GPU utilization, queue depth
See docs/scheduler.md for detailed algorithm specification.
Each GPU node runs a lightweight sidecar agent (sidecar/gpu_agent.py) that exposes per-GPU hardware metrics via HTTP. MindRouter's backend registry polls the sidecar once per node (not per backend) to collect telemetry.
- GPU utilization and memory utilization (%)
- Memory usage (used/free/total GB)
- Temperature (GPU and memory)
- Power draw and power limit (watts)
- Fan speed, SM/memory clocks
- Running processes per GPU
- Driver version and CUDA version
The sidecar must run on each physical GPU server. It requires NVIDIA drivers and the NVIDIA Container Toolkit. A SIDECAR_SECRET_KEY environment variable is required — the sidecar will refuse to start without it. Generate one with: python -c "import secrets; print(secrets.token_hex(32))"
The sidecar container requires GPU access via --gpus all. Install the NVIDIA Container Toolkit on each GPU node:
# Debian/Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# RHEL/Rocky Linux
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
| sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install -y nvidia-container-toolkit
# Configure and restart Docker
sudo nvidia-ctk runtime configure --driver=docker
sudo systemctl restart docker
# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smiDocker's default bridge network uses 172.17.0.0/16, which can collide with campus or institutional routing. Configure each GPU node's Docker daemon to use 10.x.x.x address space before deploying the sidecar:
# /etc/docker/daemon.json — create or merge on each GPU node
{
"runtimes": {
"nvidia": {
"args": [],
"path": "nvidia-container-runtime"
}
},
"bip": "10.77.0.1/16",
"default-address-pools": [
{ "base": "10.78.0.0/16", "size": 24 }
]
}After creating or updating daemon.json, restart Docker: sudo systemctl restart docker
Option A: Docker Compose (development)
# Set the key in .env or export it
export SIDECAR_SECRET_KEY=your-generated-key
# Start with the gpu profile (requires NVIDIA GPU on this machine)
docker compose --profile gpu up gpu-sidecarOption B: Build directly from GitHub (production — run on each GPU node)
Build and deploy a specific version directly from the repository without cloning.
First, create the sidecar configuration directory and env file (once per node):
sudo mkdir -p /etc/mindrouter
python3 -c "import secrets; print('SIDECAR_SECRET_KEY=' + secrets.token_hex(32))" | sudo tee /etc/mindrouter/sidecar.env
sudo chmod 600 /etc/mindrouter/sidecar.envThen build and run:
# Build a specific release tag
docker build -t mindrouter-sidecar:v2.0.0 \
-f Dockerfile.sidecar \
https://github.com/ui-insight/MindRouter.git#v2.0.0:sidecar
# Or build latest from master
docker build -t mindrouter-sidecar:latest \
-f Dockerfile.sidecar \
https://github.com/ui-insight/MindRouter.git:sidecar
# Run bound to localhost only (nginx will proxy external traffic)
docker run -d --name gpu-sidecar \
--gpus all \
-p 127.0.0.1:18007:8007 \
--env-file /etc/mindrouter/sidecar.env \
--restart unless-stopped \
mindrouter-sidecar:v2.0.0To upgrade an existing sidecar to a new version:
docker build -t mindrouter-sidecar:v2.0.0 \
-f Dockerfile.sidecar \
https://github.com/ui-insight/MindRouter.git#v2.0.0:sidecar
docker stop gpu-sidecar && docker rm gpu-sidecar
docker run -d --name gpu-sidecar \
--gpus all \
-p 127.0.0.1:18007:8007 \
--env-file /etc/mindrouter/sidecar.env \
--restart unless-stopped \
mindrouter-sidecar:v2.0.0Then configure nginx to proxy external port 8007 to the container:
# /etc/nginx/conf.d/sidecar-proxy.conf
server {
listen 8007;
listen [::]:8007;
server_name _;
location / {
proxy_pass http://127.0.0.1:18007;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Sidecar-Key $http_x_sidecar_key;
proxy_connect_timeout 5s;
proxy_read_timeout 10s;
}
}sudo systemctl enable --now nginxOption C: Direct Python (no Docker)
# On each GPU server:
pip install fastapi uvicorn nvidia-ml-py
cd sidecar/
SIDECAR_SECRET_KEY=your-generated-key GPU_AGENT_PORT=8007 python gpu_agent.pyAfter starting the sidecar on a GPU server, register the node in MindRouter. Include the same sidecar_key that was set as SIDECAR_SECRET_KEY on the sidecar:
curl -X POST http://mindrouter:8000/api/admin/nodes/register \
-H "Authorization: Bearer admin-api-key" \
-H "Content-Type: application/json" \
-d '{
"name": "gpu-server-1",
"hostname": "gpu1.example.com",
"sidecar_url": "http://gpu1.example.com:8007",
"sidecar_key": "your-generated-key"
}'Or use the admin dashboard at /admin/nodes to register nodes and view GPU telemetry.
A single GPU server can host multiple inference endpoints. Assign specific GPUs to each backend using gpu_indices:
Node: gpu-server-1 (4x A100-80GB, sidecar at :8007)
├── Backend: vllm-large (gpu_indices: [0, 1]) ← uses GPUs 0-1
├── Backend: vllm-small (gpu_indices: [2]) ← uses GPU 2
└── Backend: ollama-misc (gpu_indices: [3]) ← uses GPU 3
Each backend's telemetry (utilization, memory) is aggregated only from its assigned GPUs. Omit gpu_indices to use all GPUs on the node.
- API keys are stored hashed (Argon2)
- Role-based access control (RBAC)
- Rate limiting per key (RPM) and per user (concurrency)
- All admin actions are audited
- Request/response content logged for compliance review
- GPU sidecar endpoints are authenticated via shared secret key (
X-Sidecar-Keyheader, constant-time comparison) - Sidecar containers bind to localhost only; nginx reverse proxy handles external access on port 8007
- Docker daemon on GPU nodes uses 10.x.x.x address space to avoid routing collisions with campus/institutional networks
Apache License 2.0 - See LICENSE for details.
- Fork the repository
- Create a feature branch
- Make your changes with tests
- Submit a pull request
Please follow conventional commit messages.
This project was developed with support from the National Science Foundation (NSF Award #2427549).