diff --git a/README.md b/README.md index c922ef1..d739b01 100644 --- a/README.md +++ b/README.md @@ -46,7 +46,6 @@ - [CLI](#cli) - [SaaS control plane (optional)](#saas-control-plane-optional) - [Project layout](#project-layout) -- [Contributing](#contributing) - [License](#license) --- @@ -84,16 +83,22 @@ If you have an existing gateway, the question is whether Fairvisor adds anything **If nginx `limit_req` is enough for you**, use it. It has zero overhead and is the right tool for simple per-IP global throttling. Fairvisor becomes relevant when you need per-tenant awareness, JWT-claim-based bucketing, or cost/token tracking that `limit_req` has no model for. -**If you are already running Kong**, the built-in rate limiting plugin stores counters in Redis or Postgres — every decision is a network call. Fairvisor can run alongside Kong as an `auth_request` decision service with no external state. +**If you are already running Kong**, the built-in rate limiting plugin stores counters in Redis or Postgres — every decision is a network call. Fairvisor can run alongside Kong as an `auth_request` decision service with no external state. See [Kong / Traefik integration →](https://docs.fairvisor.com/docs/gateway/) -**If you are running Envoy**, the [global rate limit service](https://github.com/envoyproxy/ratelimit) requires deploying a separate Redis-backed service with its own config language. Fairvisor is one container, one JSON file, and integrates via `ext_authz` in the same position. +**If you are running Envoy**, the [global rate limit service](https://github.com/envoyproxy/ratelimit) requires deploying a separate Redis-backed service with its own config language. Fairvisor is one container, one JSON file, and integrates via `ext_authz` in the same position. See [Envoy ext_authz integration →](https://docs.fairvisor.com/docs/gateway/envoy/) **If you are on Cloudflare or Akamai**, per-JWT-claim limits, LLM token budgets, and cost caps are not in the platform's model. 
If your limits are tenant-aware or cost-aware, you need something that runs in your own stack. -Fairvisor integrates *alongside* Kong, nginx, and Envoy — it is not a replacement. See [docs/gateway-integration.md](docs/gateway-integration.md) for integration patterns. +Fairvisor integrates *alongside* Kong, nginx, and Envoy — it is not a replacement. See [nginx auth_request →](https://docs.fairvisor.com/docs/gateway/nginx/) · [Envoy ext_authz →](https://docs.fairvisor.com/docs/gateway/envoy/) · [Kong / Traefik →](https://docs.fairvisor.com/docs/gateway/) for integration patterns. ## Quick start +> **Runnable quickstart:** `examples/quickstart/` — `docker compose up -d` and run your first enforce/reject test in under a minute. See [`examples/quickstart/README.md`](examples/quickstart/README.md). +> +> **Recipes:** `examples/recipes/` — deployable team budgets, runaway agent guard, and circuit-breaker examples. +> +> **Sample artifacts:** `fixtures/` — canonical request/response fixtures for enforce, reject (TPM, TPD, prompt-too-large), and provider-native error bodies (OpenAI, Anthropic, Gemini). + ### 1. Create a policy ```bash @@ -156,11 +161,15 @@ curl -s -w "\nHTTP %{http_code}\n" \ ## LLM token budget in 30 seconds +The fastest path is **wrapper mode**: Fairvisor sits in front of the LLM API, enforces budgets, and strips the upstream key from the client. No gateway changes needed — just point your client at Fairvisor instead of OpenAI. + +**1. Policy** — one rule, per-org TPM + daily cap: + ```json { "id": "llm-budget", "spec": { - "selector": { "pathPrefix": "/v1/chat" }, + "selector": { "pathPrefix": "/" }, "mode": "enforce", "rules": [ { @@ -178,9 +187,29 @@ curl -s -w "\nHTTP %{http_code}\n" \ } ``` -Each organization (from the JWT `org_id` claim) gets its own independent 60k TPM / 1.2M TPD budget. Requests over the limit return a `429` with an OpenAI-compatible error body — no client changes needed. +**2. 
Call the API** — token format `Bearer <CLIENT_JWT>:<UPSTREAM_KEY>`: + +```bash +curl https://your-fairvisor-host/openai/v1/chat/completions -H "Authorization: Bearer eyJhbGc...:sk-proj-..." -H "Content-Type: application/json" -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello"}]}' +``` + +Fairvisor parses the JWT claims (no signature validation — the JWT is trusted as-is), extracts `org_id`, charges tokens against the budget, strips the `Authorization` header, and forwards with the upstream key. The upstream never sees the client JWT. + +When the budget is exhausted: + +```http +HTTP/1.1 429 Too Many Requests +X-Fairvisor-Reason: tpm_exceeded +Retry-After: 12 +RateLimit-Limit: 60000 +RateLimit-Remaining: 0 +``` + +Each organization gets its own independent 60k TPM / 1.2M TPD budget. Works with OpenAI, Anthropic, Azure OpenAI, Mistral, and any OpenAI-compatible endpoint. + +The selector matches the incoming wrapper path. Use `pathPrefix: "/"` to cover all providers, or `pathPrefix: "/openai"` to limit to one provider only. -Works with OpenAI, Anthropic, Azure OpenAI, Mistral, and any OpenAI-compatible endpoint. +> **Decision service / reverse proxy mode:** if you already have a gateway, use `selector: { "pathPrefix": "/v1/chat" }` and call `POST /v1/decision` from your existing `auth_request` or `ext_authz` hook instead. ## How a request flows @@ -188,7 +217,9 @@ Works with OpenAI, Anthropic, Azure OpenAI, Mistral, and any OpenAI-compatible e **Reverse proxy mode** — Fairvisor sits inline. Traffic arrives at Fairvisor directly, gets evaluated, and is proxied to the upstream if allowed. No separate gateway needed. -Both modes use the same policy bundle and return the same rejection headers. +**Wrapper mode** — Fairvisor acts as a transparent LLM proxy. Clients send requests to Fairvisor's OpenAI-compatible endpoint (`/openai/v1/chat/completions`, `/anthropic/v1/messages`, `/gemini/v1/generateContent`). 
Fairvisor enforces token budgets and cost limits, strips the client auth header, injects the upstream API key, and forwards the request. No changes needed in the client — swap the base URL and you're done. + +All three modes use the same policy bundle and return the same rejection headers. When a request is rejected: @@ -206,40 +237,72 @@ Headers follow [RFC 9333 RateLimit Fields](https://www.rfc-editor.org/rfc/rfc933 ### Architecture -**Decision service mode** (sidecar — your gateway calls `/v1/decision`, handles forwarding itself): - -``` - Client ──► Your gateway (nginx / Envoy / Kong) - │ - │ POST /v1/decision - │ (auth_request / ext_authz) - ▼ - ┌─────────────────────┐ - │ Fairvisor Edge │ - │ decision_service │ - │ │ - │ rule_engine │ - │ ngx.shared.dict │ ◄── no Redis, no network - └──────────┬──────────┘ - │ - 204 allow │ 429 reject - ▼ - gateway proxies or returns rejection +**Decision service mode** — sidecar: your gateway calls `/v1/decision`, handles forwarding itself. + +```mermaid +sequenceDiagram + participant C as Client + participant G as Your Gateway
<br/>(nginx / Envoy / Kong) + participant F as Fairvisor Edge
<br/>decision_service + participant U as Upstream service + + C->>G: Request + G->>F: POST /v1/decision
<br/>(auth_request / ext_authz) + alt allow + F-->>G: 204 No Content + G->>U: Forward request + U-->>G: Response + G-->>C: Response + else reject + F-->>G: 429 + RateLimit headers + G-->>C: 429 Too Many Requests + end ``` -**Reverse proxy mode** (inline — Fairvisor handles proxying): +No Redis, no external state — all counters live in `ngx.shared.dict`. + +**Reverse proxy mode** — inline: Fairvisor handles both enforcement and proxying. + +```mermaid +sequenceDiagram + participant C as Client + participant F as Fairvisor Edge
<br/>reverse_proxy + participant U as Upstream service + C->>F: Request + alt allow + F->>U: Forward request + U-->>F: Response + F-->>C: Response + else reject + F-->>C: 429 + RFC 9333 headers + end ``` - Client ──► Fairvisor Edge (reverse_proxy) - │ - │ access.lua → rule_engine - │ ngx.shared.dict - │ - allow ──► upstream service - reject ──► 429 + RFC 9333 headers + +**Wrapper mode** — transparent LLM proxy: swap the base URL, no other client changes needed. + +```mermaid +sequenceDiagram + participant C as Client + participant F as Fairvisor Edge
<br/>wrapper + participant U as Upstream LLM
<br/>(OpenAI / Anthropic / Gemini) + + C->>F: POST /openai/v1/chat/completions
<br/>Authorization: Bearer CLIENT_JWT:UPSTREAM_KEY + F->>F: 1. Parse JWT claims (org_id, user_id) + F->>F: 2. Enforce TPM / TPD / cost budget + alt budget ok + F->>U: POST /v1/chat/completions
<br/>Authorization: Bearer UPSTREAM_KEY + U-->>F: 200 OK + token usage + F->>F: 3. Count tokens · refund unused reservation + F-->>C: 200 OK (Authorization stripped from reply) + else budget exceeded + F-->>C: 429 X-Fairvisor-Reason: tpm_exceeded + end ``` -Both modes use the same policy bundle and produce the same rejection headers. +Supported upstream paths: `/openai/*`, `/anthropic/*`, `/gemini/*`, `/grok/*`. + +All three modes use the same policy bundle and produce the same rejection headers. ## Enforcement capabilities @@ -274,17 +337,6 @@ Policies are versioned JSON — commit them to Git, review changes in PRs, roll ## Performance -### Benchmark methodology (March 2026) - -- **Hosts:** 2 × AWS `c7i.xlarge` (4 vCPU, 8 GiB each), cluster placement group, eu-central-1 -- **OS:** Ubuntu 24.04 LTS -- **Runtime:** OpenResty 1.29.2.1, Fairvisor latest `main` (no Docker) -- **Load tool:** `k6` v0.54.0, `constant-arrival-rate`, 10,000 RPS for 60s, 10s warmup -- **Benchmark script:** `run-all.sh` from `fairvisor/benchmark` -- **Topology:** two-host — Fairvisor and k6 on separate machines (VPC private network) -- **Decision endpoint contract:** `POST /v1/decision` with `X-Original-Method` and `X-Original-URI` -- **Note:** reverse proxy numbers include policy evaluation and upstream proxy hop to backend nginx. - ### Latest measured latency @ 10,000 RPS | Percentile | Decision service | Reverse proxy | Raw nginx (baseline) | @@ -305,7 +357,8 @@ Policies are versioned JSON — commit them to Git, review changes in PRs, roll **No external datastore.** All enforcement state lives in in-process shared memory (`ngx.shared.dict`). No Redis, no Postgres, no network round-trips in the decision path. -> Reproduce: `git clone https://github.com/fairvisor/benchmark && cd benchmark && bash run-all.sh` +Reproduce: see [fairvisor/benchmark](https://github.com/fairvisor/benchmark) — the canonical source for Fairvisor Edge performance numbers. 
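The reserve-then-refund accounting shown in the wrapper diagram — and the in-process, no-Redis counters behind these numbers — can be sketched in a few lines of Python. A plain dict stands in for `ngx.shared.dict`, and every name here (`reserve`, `refund`, the bucket layout) is illustrative, not Fairvisor's actual internals:

```python
import time

# Sketch of the reserve-then-refund accounting from the wrapper diagram.
# A plain dict stands in for ngx.shared.dict; all names are illustrative.

TPM_LIMIT = 60_000           # tokens_per_minute from the example policy
DEFAULT_MAX_COMPLETION = 50  # pessimistic reservation when max_tokens is unset

buckets = {}  # org_id -> {"window": minute index, "used": tokens charged}

def _bucket(org_id, now):
    window = int(now // 60)                 # current minute window
    b = buckets.get(org_id)
    if b is None or b["window"] != window:  # new minute: counter resets
        b = {"window": window, "used": 0}
        buckets[org_id] = b
    return b

def reserve(org_id, prompt_tokens, max_tokens=None, now=None):
    """Charge prompt + worst-case completion up front; False means 429 tpm_exceeded."""
    now = time.time() if now is None else now
    want = prompt_tokens + (max_tokens or DEFAULT_MAX_COMPLETION)
    b = _bucket(org_id, now)
    if b["used"] + want > TPM_LIMIT:
        return False
    b["used"] += want
    return True

def refund(org_id, reserved, actual_total, now=None):
    """After the upstream reports real usage, return the unused reservation."""
    now = time.time() if now is None else now
    b = _bucket(org_id, now)
    b["used"] -= max(0, reserved - actual_total)

# One request: 10 prompt tokens, no max_tokens -> 60 reserved, 19 actually used.
ok = reserve("org-1", 10, now=120.0)   # reserves 10 + 50 = 60 tokens
refund("org-1", 60, 19, now=121.0)     # refunds the 41 unused tokens
```

Reserving the worst case up front is what lets an over-large `max_tokens` request be rejected before any upstream spend, as the quickstart's `over_limit_request.json` demonstrates.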
+ ## Deployment @@ -349,25 +402,16 @@ If the SaaS is unreachable, the edge keeps enforcing with the last-known policy ## Project layout ``` -src/fairvisor/ runtime modules (OpenResty/LuaJIT) -cli/ command-line tooling -spec/ unit and integration tests (busted) -tests/e2e/ Docker-based E2E tests (pytest) -examples/ sample policy bundles -helm/ Helm chart -docker/ Docker artifacts -docs/ reference documentation -``` - -## Contributing - -See [CONTRIBUTING.md](CONTRIBUTING.md). Bug reports, issues, and pull requests welcome. - -Run the test suite: - -```bash -busted spec # unit + integration -pytest tests/e2e -v # E2E (requires Docker) +src/fairvisor/ runtime modules (OpenResty/LuaJIT) +cli/ command-line tooling +spec/ unit and integration tests (busted) +tests/e2e/ Docker-based E2E tests (pytest) +examples/quickstart/ runnable quickstart (docker compose up -d) +examples/recipes/ deployable policy recipes (team budgets, agent guard, failover) +fixtures/ canonical request/response sample artifacts +helm/ Helm chart +docker/ Docker artifacts +docs/ reference documentation ``` ## License @@ -377,3 +421,4 @@ pytest tests/e2e -v # E2E (requires Docker) --- **Docs:** [docs.fairvisor.com](https://docs.fairvisor.com/docs/) · **Website:** [fairvisor.com](https://fairvisor.com) · **Quickstart:** [5 minutes to enforcement](https://docs.fairvisor.com/docs/quickstart/) + diff --git a/examples/quickstart/README.md b/examples/quickstart/README.md new file mode 100644 index 0000000..3acb943 --- /dev/null +++ b/examples/quickstart/README.md @@ -0,0 +1,108 @@ +# Fairvisor Edge — Quickstart + +Go from `git clone` to working policy enforcement in one step. 
+ +## Prerequisites + +- Docker with Compose V2 (`docker compose version`) +- Port 8080 free on localhost + +## Start + +```bash +docker compose up -d +``` + +Wait for the edge service to report healthy: + +```bash +docker compose ps +# edge should show "healthy" +``` + +## Verify enforcement + +This quickstart runs in `FAIRVISOR_MODE=reverse_proxy`. Requests to `/v1/*` +are enforced by the TPM policy and forwarded to a local mock LLM backend. +No real API keys are required. + +**Allowed request** — should return `200`: + +```bash +curl -s -X POST http://localhost:8080/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d @../../fixtures/normal_request.json +``` + +Expected response body shape matches `../../fixtures/allow_response.json`. + +**Over-limit request** — should return `429`: + +```bash +curl -s -X POST http://localhost:8080/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d @../../fixtures/over_limit_request.json +``` + +Expected response body shape: `../../fixtures/reject_tpm_exceeded.json`. +The response will also include: +- `X-Fairvisor-Reason: tpm_exceeded` +- `Retry-After: 60` +- `RateLimit-Limit: 100` (matches the quickstart policy `tokens_per_minute`) +- `RateLimit-Remaining: 0` + +## How the policy works + +The quickstart policy (`policy.json`) enforces a TPM limit keyed on `ip:address`: + +- `tokens_per_minute: 100` — allows roughly 2 small requests per minute +- `tokens_per_day: 1000` — daily cap +- `default_max_completion: 50` — pessimistic reservation per request when `max_tokens` is not set + +Sending `over_limit_request.json` (which sets `max_tokens: 200000`) immediately +exceeds the 100-token per-minute budget and triggers a `429`. + +## Wrapper mode (real provider routing) + +Wrapper mode routes requests to real upstream providers using provider-prefixed paths +and a composite Bearer token. It requires real provider API keys and cannot be +demonstrated with this mock stack. 
+ +**Path and auth format:** + +``` +POST /openai/v1/chat/completions +Authorization: Bearer CLIENT_JWT:UPSTREAM_KEY +``` + +Where: +- `CLIENT_JWT` — signed JWT identifying the calling client/tenant (used for policy enforcement) +- `UPSTREAM_KEY` — real upstream API key forwarded to the provider (e.g. `sk-...` for OpenAI) + +Fairvisor strips the composite header, injects the correct provider auth before forwarding, +and **never returns upstream auth headers to the caller** +(see `../../fixtures/allow_response.json`). + +**Provider-prefixed paths:** + +| Path prefix | Upstream | Auth header injected | +|---|---|---| +| `/openai/v1/...` | `https://api.openai.com/v1/...` | `Authorization: Bearer UPSTREAM_KEY` | +| `/anthropic/v1/...` | `https://api.anthropic.com/v1/...` | `x-api-key: UPSTREAM_KEY` | +| `/gemini/v1beta/...` | `https://generativelanguage.googleapis.com/v1beta/...` | `x-goog-api-key: UPSTREAM_KEY` | + +To run in wrapper mode, change the compose env to `FAIRVISOR_MODE: wrapper` and +supply real credentials in the `Authorization` header. 
+ +## Teardown + +```bash +docker compose down +``` + +## Next steps + +- See `../recipes/` for team budgets, runaway agent guard, and provider failover scenarios +- See `../../fixtures/` for all sample request/response artifacts +- See [fairvisor/benchmark](https://github.com/fairvisor/benchmark) for performance benchmarks +- See [docs/install/](../../docs/install/) for Kubernetes, VM, and SaaS deployment options diff --git a/examples/quickstart/docker-compose.yml b/examples/quickstart/docker-compose.yml new file mode 100644 index 0000000..870812d --- /dev/null +++ b/examples/quickstart/docker-compose.yml @@ -0,0 +1,58 @@ +# Fairvisor Edge — Quickstart stack (standalone + reverse proxy mode) +# +# Usage: +# docker compose up -d +# curl -s http://localhost:8080/readyz # health check +# curl -s -X POST http://localhost:8080/v1/chat/completions \ +# -H "Content-Type: application/json" \ +# -d @../../fixtures/normal_request.json # expect 200 +# curl -s -X POST http://localhost:8080/v1/chat/completions \ +# -H "Content-Type: application/json" \ +# -d @../../fixtures/over_limit_request.json # expect 429 +# +# This stack runs in FAIRVISOR_MODE=reverse_proxy — requests to /v1/* are +# enforced by policy then forwarded to the local mock LLM backend. +# No real API keys required. +# +# Wrapper mode (routing by provider prefix, real upstream keys) is documented +# in README.md under "Wrapper mode". It requires real provider credentials and +# cannot be demonstrated with this mock stack. +# +# This file is also the base for the e2e-smoke CI check. +# CI extends it via tests/e2e/docker-compose.test.yml; do not diverge the +# service name, port, or volume contract without updating CI as well. 
+ +services: + edge: + image: ghcr.io/fairvisor/fairvisor-edge:latest + ports: + - "8080:8080" + environment: + FAIRVISOR_CONFIG_FILE: /etc/fairvisor/policy.json + FAIRVISOR_MODE: reverse_proxy + FAIRVISOR_BACKEND_URL: http://mock_llm:80 + FAIRVISOR_SHARED_DICT_SIZE: 32m + FAIRVISOR_LOG_LEVEL: info + FAIRVISOR_WORKER_PROCESSES: "1" + volumes: + - ./policy.json:/etc/fairvisor/policy.json:ro + depends_on: + mock_llm: + condition: service_healthy + healthcheck: + test: ["CMD", "curl", "-sf", "http://127.0.0.1:8080/readyz"] + interval: 2s + timeout: 2s + retries: 15 + start_period: 5s + + mock_llm: + image: nginx:1.27-alpine + volumes: + - ./mock-llm.conf:/etc/nginx/nginx.conf:ro + healthcheck: + test: ["CMD", "wget", "-q", "-O", "-", "http://127.0.0.1:80/"] + interval: 2s + timeout: 2s + retries: 10 + start_period: 5s diff --git a/examples/quickstart/mock-llm.conf b/examples/quickstart/mock-llm.conf new file mode 100644 index 0000000..26603ab --- /dev/null +++ b/examples/quickstart/mock-llm.conf @@ -0,0 +1,10 @@ +events {} +http { + server { + listen 80; + location / { + default_type application/json; + return 200 '{"id":"chatcmpl-qs","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"Hello from the mock backend!"},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":8,"total_tokens":18}}'; + } + } +} diff --git a/examples/quickstart/policy.json b/examples/quickstart/policy.json new file mode 100644 index 0000000..fb9b375 --- /dev/null +++ b/examples/quickstart/policy.json @@ -0,0 +1,31 @@ +{ + "bundle_version": 1, + "issued_at": "2026-01-01T00:00:00Z", + "expires_at": "2030-01-01T00:00:00Z", + "policies": [ + { + "id": "quickstart-tpm-policy", + "spec": { + "selector": { + "pathPrefix": "/v1/", + "methods": ["POST"] + }, + "mode": "enforce", + "rules": [ + { + "name": "tpm-limit", + "limit_keys": ["ip:address"], + "algorithm": "token_bucket_llm", + "algorithm_config": { + "tokens_per_minute": 100, + 
"tokens_per_day": 1000, + "burst_tokens": 100, + "default_max_completion": 50 + } + } + ] + } + } + ], + "kill_switches": [] +} diff --git a/examples/recipes/circuit-breaker/README.md b/examples/recipes/circuit-breaker/README.md new file mode 100644 index 0000000..ad1227e --- /dev/null +++ b/examples/recipes/circuit-breaker/README.md @@ -0,0 +1,43 @@ +# Recipe: Circuit Breaker — Cost Spike Auto-Shutdown + +Automatically block all LLM traffic when the aggregate token spend rate +exceeds a budget threshold, then self-reset after a cooldown period. + +## How it works + +- Normal traffic: per-org TPM limit enforced (`100 000 tokens/min`) +- Spike detection: if the rolling spend rate hits `500 000 tokens/min` + the circuit breaker opens and **all requests return `429`** with + `X-Fairvisor-Reason: circuit_breaker_open` +- Auto-reset: after 10 minutes without breaker-triggering load, the + circuit resets automatically — no manual intervention needed +- `alert: true` logs the trip event to the Fairvisor audit log + +## Deploy + +```bash +cp policy.json /etc/fairvisor/policy.json +``` + +## Expected behaviour + +```bash +# Normal request — passes +curl -s -o /dev/null -w "%{http_code}" \ + -H "Authorization: Bearer <CLIENT_JWT>:<UPSTREAM_KEY>" \ + http://localhost:8080/v1/chat/completions \ + -d '{"model":"gpt-4o","messages":[{"role":"user","content":"hi"}]}' +# → 200 + +# After spend spike trips the breaker: +# → 429 X-Fairvisor-Reason: circuit_breaker_open +# Retry-After: 600 +``` + +## Tuning + +| Field | Description | +|---|---| +| `spend_rate_threshold_per_minute` | Tokens/min rolling spend that opens the breaker | +| `auto_reset_after_minutes` | Cooldown before automatic reset (0 = manual only) | +| `tokens_per_minute` | Per-org steady-state limit (independent of breaker) | diff --git a/examples/recipes/circuit-breaker/policy.json b/examples/recipes/circuit-breaker/policy.json new file mode 100644 index 0000000..7d58c8d --- /dev/null +++ b/examples/recipes/circuit-breaker/policy.json @@ -0,0 
+1,37 @@ +{ + "bundle_version": 1, + "issued_at": "2026-01-01T00:00:00Z", + "expires_at": "2030-01-01T00:00:00Z", + "policies": [ + { + "id": "cost-spike-guard", + "spec": { + "selector": { + "pathPrefix": "/v1/", + "methods": ["POST"] + }, + "mode": "enforce", + "rules": [ + { + "name": "per-org-tpm", + "limit_keys": ["jwt:org_id"], + "algorithm": "token_bucket_llm", + "algorithm_config": { + "tokens_per_minute": 100000, + "burst_tokens": 100000, + "default_max_completion": 2048 + } + } + ], + "circuit_breaker": { + "enabled": true, + "spend_rate_threshold_per_minute": 500000, + "action": "reject", + "alert": true, + "auto_reset_after_minutes": 10 + } + } + } + ], + "kill_switches": [] +} diff --git a/examples/recipes/runaway-agent-guard/README.md b/examples/recipes/runaway-agent-guard/README.md new file mode 100644 index 0000000..7b34491 --- /dev/null +++ b/examples/recipes/runaway-agent-guard/README.md @@ -0,0 +1,50 @@ +# Recipe: Runaway Agent Guard + +Stop runaway agentic workflows before they exhaust your token budget or +billing limit. + +## Problem + +Autonomous agents (LangChain, AutoGPT, custom loops) can enter retry storms +or infinite planning loops. Without enforcement, a single runaway agent +can consume thousands of dollars of API budget in minutes. + +## How it works + +Two rules cooperate: + +1. **Loop detector** — counts requests per `agent_id` in a sliding window. + If the agent fires more than 30 requests in 60 seconds, it trips a + 120-second cooldown. This catches tight retry loops. + +2. **TPM guard** — caps tokens per minute per agent. A burst-heavy agent + that passes the loop check still cannot drain the token pool. 
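With the stated config (`window_seconds: 60`, `max_requests: 30`, `cooldown_seconds: 120`), the loop detector above can be sketched as a timestamp-window counter. This is a simplified Python model with illustrative names, not Fairvisor's actual implementation:

```python
import time

# Sketch of the loop-detector rule: window_seconds=60, max_requests=30,
# cooldown_seconds=120. A timestamp list per agent is the simplest sliding
# window; names and storage here are illustrative, not Fairvisor's internals.

WINDOW_SECONDS = 60
MAX_REQUESTS = 30
COOLDOWN_SECONDS = 120

state = {}  # agent_id -> {"hits": [timestamps], "cooldown_until": float}

def allow(agent_id, now=None):
    """True if the request passes; False while the cooldown is active."""
    now = time.time() if now is None else now
    s = state.setdefault(agent_id, {"hits": [], "cooldown_until": 0.0})
    if now < s["cooldown_until"]:            # tripped: reject until cooldown ends
        return False
    s["hits"] = [t for t in s["hits"] if now - t < WINDOW_SECONDS]
    s["hits"].append(now)
    if len(s["hits"]) > MAX_REQUESTS:        # 31st request inside 60 s trips it
        s["cooldown_until"] = now + COOLDOWN_SECONDS
        return False
    return True

# A tight retry loop: 31 requests within a third of a second.
results = [allow("autoagent-prod-7", now=1000.0 + i * 0.01) for i in range(31)]
# The first 30 pass; the 31st is rejected and starts the 120 s cooldown.
```

A per-request timestamp list is the simplest sliding-window variant; a production counter would more likely use fixed sub-windows in shared memory, but the trip-and-cooldown behaviour is the same.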
+ +## Deploy + +```bash +cp policy.json /etc/fairvisor/policy.json +``` + +## JWT shape expected + +```json +{ + "sub": "user-456", + "agent_id": "autoagent-prod-7", + "exp": 9999999999 +} +``` + +## Kill switch for incidents + +If an agent causes an incident, flip a kill switch without restarting edge: + +```bash +# Via CLI +fairvisor kill-switch enable agent-id=autoagent-prod-7 + +# Or update the policy bundle with a kill_switch entry and hot-reload +``` + +See `docs/cookbook/kill-switch-incident-response.md` for the full incident playbook. diff --git a/examples/recipes/runaway-agent-guard/policy.json b/examples/recipes/runaway-agent-guard/policy.json new file mode 100644 index 0000000..38de248 --- /dev/null +++ b/examples/recipes/runaway-agent-guard/policy.json @@ -0,0 +1,40 @@ +{ + "bundle_version": 1, + "issued_at": "2026-01-01T00:00:00Z", + "expires_at": "2030-01-01T00:00:00Z", + "policies": [ + { + "id": "runaway-agent-guard", + "spec": { + "selector": { + "pathPrefix": "/", + "methods": ["POST"] + }, + "mode": "enforce", + "rules": [ + { + "name": "loop-detection", + "limit_keys": ["jwt:agent_id"], + "algorithm": "loop_detector", + "algorithm_config": { + "window_seconds": 60, + "max_requests": 30, + "cooldown_seconds": 120 + } + }, + { + "name": "agent-tpm-guard", + "limit_keys": ["jwt:agent_id"], + "algorithm": "token_bucket_llm", + "algorithm_config": { + "tokens_per_minute": 50000, + "burst_tokens": 50000, + "default_max_completion": 512 + } + } + ] + } + } + ], + "kill_switches": [] +} diff --git a/examples/recipes/team-budgets/README.md b/examples/recipes/team-budgets/README.md new file mode 100644 index 0000000..54c1551 --- /dev/null +++ b/examples/recipes/team-budgets/README.md @@ -0,0 +1,45 @@ +# Recipe: Team Budgets + +Enforce per-team token and cost limits using JWT claims. + +## How it works + +Each request carries a JWT with a `team_id` claim. Fairvisor uses this as +the bucket key for two independent rules: + +1. 
**TPM/TPD limit** — token-rate enforcement per minute and per day +2. **Monthly cost budget** — cumulative cost cap with staged warn/throttle/reject + +## Deploy + +```bash +# Copy policy to your edge config path +cp policy.json /etc/fairvisor/policy.json + +# Or use with docker compose (standalone mode): +FAIRVISOR_CONFIG_FILE=./policy.json FAIRVISOR_MODE=wrapper docker compose up -d +``` + +## JWT shape expected + +```json +{ + "sub": "user-123", + "team_id": "engineering", + "plan": "pro", + "exp": 9999999999 +} +``` + +## Staged actions at cost budget thresholds + +| Threshold | Action | +|---|---| +| 80% | Warn (allow, log, emit business event) | +| 95% | Throttle (allow with 500 ms delay) | +| 100% | Reject (429, `budget_exceeded`) | + +## Related fixtures + +- `../../../fixtures/reject_tpd_exceeded.json` — TPD reject body +- `../../../fixtures/reject_tpm_exceeded.json` — TPM reject body diff --git a/examples/recipes/team-budgets/policy.json b/examples/recipes/team-budgets/policy.json new file mode 100644 index 0000000..87d7c63 --- /dev/null +++ b/examples/recipes/team-budgets/policy.json @@ -0,0 +1,47 @@ +{ + "bundle_version": 1, + "issued_at": "2026-01-01T00:00:00Z", + "expires_at": "2030-01-01T00:00:00Z", + "policies": [ + { + "id": "team-token-budget", + "spec": { + "selector": { + "pathPrefix": "/openai/", + "methods": ["POST"] + }, + "mode": "enforce", + "rules": [ + { + "name": "per-team-tpm", + "limit_keys": ["jwt:team_id"], + "algorithm": "token_bucket_llm", + "algorithm_config": { + "tokens_per_minute": 120000, + "tokens_per_day": 2000000, + "burst_tokens": 120000, + "default_max_completion": 1024 + } + }, + { + "name": "per-team-cost-budget", + "limit_keys": ["jwt:team_id"], + "algorithm": "cost_based", + "algorithm_config": { + "budget": 50000, + "period": "30d", + "cost_key": "fixed", + "fixed_cost": 1, + "staged_actions": [ + { "threshold_percent": 80, "action": "warn" }, + { "threshold_percent": 95, "action": "throttle", "delay_ms": 500 }, + { 
"threshold_percent": 100, "action": "reject" } + ] + } + } + ] + } + } + ], + "kill_switches": [] +} diff --git a/fixtures/allow_response.json b/fixtures/allow_response.json new file mode 100644 index 0000000..7cc0312 --- /dev/null +++ b/fixtures/allow_response.json @@ -0,0 +1,28 @@ +{ + "_comment": "Sample 200 response for an allowed request in wrapper mode. Note: no Authorization, x-api-key, or x-goog-api-key headers — upstream auth is stripped on the response side.", + "_status": 200, + "_headers": { + "Content-Type": "application/json", + "X-Fairvisor-Reason": null, + "Authorization": null, + "x-api-key": null, + "x-goog-api-key": null + }, + "id": "chatcmpl-example", + "object": "chat.completion", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "Hello! How can I help you today?" + }, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 10, + "completion_tokens": 9, + "total_tokens": 19 + } +} diff --git a/fixtures/anthropic_normal_request.json b/fixtures/anthropic_normal_request.json new file mode 100644 index 0000000..bcffdbf --- /dev/null +++ b/fixtures/anthropic_normal_request.json @@ -0,0 +1,10 @@ +{ + "model": "claude-3-5-haiku-20241022", + "max_tokens": 20, + "messages": [ + { + "role": "user", + "content": "Say hello in one sentence." + } + ] +} diff --git a/fixtures/normal_request.json b/fixtures/normal_request.json new file mode 100644 index 0000000..049a4e4 --- /dev/null +++ b/fixtures/normal_request.json @@ -0,0 +1,10 @@ +{ + "model": "gpt-4o-mini", + "messages": [ + { + "role": "user", + "content": "Say hello in one sentence." + } + ], + "max_tokens": 20 +} diff --git a/fixtures/over_limit_request.json b/fixtures/over_limit_request.json new file mode 100644 index 0000000..b3b554f --- /dev/null +++ b/fixtures/over_limit_request.json @@ -0,0 +1,10 @@ +{ + "model": "gpt-4o", + "messages": [ + { + "role": "user", + "content": "Say hello in one sentence." 
+ } + ], + "max_tokens": 200000 +} diff --git a/fixtures/reject_anthropic.json b/fixtures/reject_anthropic.json new file mode 100644 index 0000000..bdf468f --- /dev/null +++ b/fixtures/reject_anthropic.json @@ -0,0 +1,13 @@ +{ + "_comment": "Anthropic-native 429 reject body. Used for /anthropic/* paths.", + "_headers": { + "X-Fairvisor-Reason": "tpm_exceeded", + "Retry-After": "60", + "Content-Type": "application/json" + }, + "type": "error", + "error": { + "type": "rate_limit_error", + "message": "Token budget exceeded for this tenant." + } +} diff --git a/fixtures/reject_gemini.json b/fixtures/reject_gemini.json new file mode 100644 index 0000000..f0df901 --- /dev/null +++ b/fixtures/reject_gemini.json @@ -0,0 +1,13 @@ +{ + "_comment": "Gemini-native 429 reject body. Used for /gemini/* paths.", + "_headers": { + "X-Fairvisor-Reason": "tpm_exceeded", + "Retry-After": "60", + "Content-Type": "application/json" + }, + "error": { + "code": 429, + "message": "Token budget exceeded for this tenant.", + "status": "RESOURCE_EXHAUSTED" + } +} diff --git a/fixtures/reject_openai.json b/fixtures/reject_openai.json new file mode 100644 index 0000000..eabd023 --- /dev/null +++ b/fixtures/reject_openai.json @@ -0,0 +1,14 @@ +{ + "_comment": "OpenAI-native 429 reject body. 
Used for /openai/* paths and OpenAI-compatible providers.", + "_headers": { + "X-Fairvisor-Reason": "tpm_exceeded", + "Retry-After": "60", + "Content-Type": "application/json" + }, + "error": { + "type": "rate_limit_error", + "code": "tpm_exceeded", + "message": "Token budget exceeded for this tenant.", + "param": null + } +} diff --git a/fixtures/reject_prompt_too_large.json b/fixtures/reject_prompt_too_large.json new file mode 100644 index 0000000..9c4cf8c --- /dev/null +++ b/fixtures/reject_prompt_too_large.json @@ -0,0 +1,13 @@ +{ + "_comment": "429 body returned when the request exceeds max_prompt_tokens.", + "_headers": { + "X-Fairvisor-Reason": "prompt_too_large", + "Content-Type": "application/json" + }, + "error": { + "type": "rate_limit_error", + "code": "prompt_too_large", + "message": "Request prompt exceeds the maximum allowed token count for this policy.", + "param": null + } +} diff --git a/fixtures/reject_tpd_exceeded.json b/fixtures/reject_tpd_exceeded.json new file mode 100644 index 0000000..83cb2ea --- /dev/null +++ b/fixtures/reject_tpd_exceeded.json @@ -0,0 +1,16 @@ +{ + "_comment": "Illustrative 429 body returned when the per-day token budget is exhausted. RateLimit-Limit reflects the policy's tokens_per_day value.", + "_headers": { + "X-Fairvisor-Reason": "tpd_exceeded", + "Retry-After": "", + "RateLimit-Limit": "", + "RateLimit-Remaining": "0", + "Content-Type": "application/json" + }, + "error": { + "type": "rate_limit_error", + "code": "tpd_exceeded", + "message": "Token budget exceeded for this tenant.", + "param": null + } +} diff --git a/fixtures/reject_tpm_exceeded.json b/fixtures/reject_tpm_exceeded.json new file mode 100644 index 0000000..0805778 --- /dev/null +++ b/fixtures/reject_tpm_exceeded.json @@ -0,0 +1,17 @@ +{ + "_comment": "Illustrative 429 body returned when the per-minute token budget is exhausted. 
RateLimit-Limit reflects the policy's tokens_per_minute value.", + "_headers": { + "X-Fairvisor-Reason": "tpm_exceeded", + "Retry-After": "60", + "RateLimit-Limit": "", + "RateLimit-Remaining": "0", + "RateLimit-Reset": "", + "Content-Type": "application/json" + }, + "error": { + "type": "rate_limit_error", + "code": "tpm_exceeded", + "message": "Token budget exceeded for this tenant.", + "param": null + } +}