-
-
Notifications
You must be signed in to change notification settings - Fork 0
Configuration
Nenya reads its configuration from a JSON file or directory (default: /etc/nenya/). See Architecture for the request flow, Providers for provider-specific settings, and Secrets for credential configuration.
- Environment variables
- Top-level sections
- Configuration file structure
- Multi-file configuration
- Key configuration blocks
- Billing configuration
- Hot reload
- Configuration validation
- Migration guide
- Processing pipeline order
| Variable | Default | Effect |
|---|---|---|
PORT |
8080 |
Listening port (overrides server.listen_addr). Validated via net.LookupPort. |
HOST |
— | Optional bind address (e.g. 127.0.0.1). Only used when combined with PORT. |
NENYA_CONFIG_DIR |
/etc/nenya/ |
Config root directory |
NENYA_CONFIG_FILE |
— | Single JSON config file (takes precedence over NENYA_CONFIG_DIR) |
NENYA_SECRETS_DIR |
— | Secrets directory for containers (see Secrets) |
After flags are parsed, NENYA_CONFIG_DIR and NENYA_CONFIG_FILE override -config-dir and -config if set. If both env vars are set, NENYA_CONFIG_FILE still wins at load time (single-file mode).
| Section | JSON key | Description |
|---|---|---|
| Server | server |
Listen address, body limits, token estimation |
| Context | context |
Truncation, TF-IDF relevance scoring, context management |
| Governance | governance |
Rate limiting, retries, routing policy |
| Bouncer | bouncer |
PII redaction, entropy detection, engine interception |
| Prefix Cache | prefix_cache |
System prompt and tool caching |
| Compaction | compaction |
JSON minification, whitespace collapse, tool pruning |
| Window | window |
Sliding context window with summarization |
| Response Cache | response_cache |
Response caching with LRU eviction |
| Agents | agents |
Model lists, strategies, circuit breakers, MCP config |
| Discovery | discovery |
Dynamic model discovery, auto-agents |
| Providers | providers |
Upstream API endpoints |
All /v1/* and /proxy/* routes require Authorization: Bearer <client_token> from secrets.
When a directory is specified, all *.json files (excluding secrets.json) are loaded in alphabetical order and deep-merged. Map fields (agents, providers, mcp_servers) merge per-key; struct fields use last-file-wins. Defaults are applied once after the merge.
config.json vs config.d/: Under the config root directory, if config.d/ exists and contains at least one *.json file, those files are merged and config.json in the parent directory is not read. If config.d/ exists but has no JSON files, the loader falls back to config.json at the parent level.
When a file is specified, only that file is loaded (single-file mode).
When -config points to a directory (the default), Nenya loads all *.json files in sorted order and deep-merges them:
/etc/nenya/
├── config.d/
│ ├── 01-server.json # server, governance, bouncer, compaction
│ ├── 02-providers.json # provider URL or auth overrides
│ ├── 03-agents.json # agent definitions
│ └── 04-mcp.json # MCP server definitions
└── secrets.json # EXCLUDED (loaded via systemd credential)
Merge rules:
| Field Type | Behavior |
|---|---|
agents (map) |
Per-key merge — later files add or override individual agents |
providers (map) |
Per-key merge — later files add or override individual providers |
mcp_servers (map) |
Per-key merge |
server, governance, bouncer, etc. (struct) |
Last file wins — if multiple files set the same field, the last one in alphabetical order takes precedence |
This lets you split configuration however makes sense for your deployment:
{
"server": {
"listen_addr": ":8080"
},
"bouncer": {
"enabled": true,
"engine": {
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
}{
"agents": {
"plan": {
"strategy": "fallback",
"models": ["deepseek-reasoner"]
},
"build": {
"strategy": "fallback",
"models": ["gemini-3-flash"]
}
}
}{
"server": {
"listen_addr": ":8080",
"max_body_bytes": 10485760,
"log_level": "info",
"secure_memory_required": true,
"user_agent": "nenya/1.0"
}
}| Field | Default | Description |
|---|---|---|
listen_addr |
":8080" |
Bind address and port |
max_body_bytes |
10485760 (10 MB) |
Maximum incoming request body size |
log_level |
"info" |
Log level: "debug", "info", "warn", or "error". The -verbose flag overrides this to "debug". |
secure_memory_required |
true |
Require mlock-backed secure memory for tokens. When true, gateway fails to start if mlock is unavailable. Set to false to allow heap fallback (e.g., macOS development). |
user_agent |
"nenya/1.0" |
User-Agent header sent to upstream providers |
Unified configuration for context management and truncation.
The interceptor implements a 3-tier pipeline for the last user message content, with limits derived from the target model's max_context (characters, not tokens). If the model has no max_context, fallback defaults of 4000/24000 are used.
-
Tier 1 (pass-through): content below
soft_limitrunes -
Tier 2 (engine summarization): content between
soft_limitandhard_limitrunes -
Tier 3 (truncation + engine): content above
hard_limitrunes. Truncation uses the strategy selected bytruncation_strategy:-
"middle-out"(default): positional — keeps first/last percentages, discards middle - When
tfidf_query_sourceis set: TF-IDF scoring — splits content into blocks (paragraphs + code fences), scores each block's relevance to the user's prior messages or the start of the current message, and greedily keeps the most relevant blocks within budget. First/last blocks are pinned as a safety net. If TF-IDF reduces the payload belowsoft_limit, the engine call is skipped entirely (zero network overhead).
-
| Field | Default | Description |
|---|---|---|
truncation_strategy |
"middle-out" |
Truncation method. "middle-out" (positional) or any value — TF-IDF is activated by setting tfidf_query_source instead. |
truncation_keep_first_pct |
15.0 |
Percentage of blocks to pin from the start when truncating (safety net for both middle-out and TF-IDF) |
truncation_keep_last_pct |
25.0 |
Percentage of blocks to pin from the end when truncating (safety net for both middle-out and TF-IDF) |
tfidf_query_source |
"" (disabled) |
Enable TF-IDF relevance-scored truncation for Tier 3. "" = disabled (use middle-out). "prior_messages" = use previous user messages as query terms. "self" = use first 500 runes of the massive message as query terms. When enabled, if TF-IDF reduces the payload below soft_limit, the engine call is skipped entirely. |
auto_context_skip |
false |
Automatically skip models that do not meet context requirements for the current request. When enabled, models with max_context smaller than the request's input token count are excluded from routing, preventing errors and improving latency. |
auto_reorder_by_latency |
false |
Dynamically sort targets based on historical response times. When enabled, targets are reordered by median latency (fastest first) with ±5% jitter to prevent thundering herd. |
hard_limit_tokens |
0 (auto) |
Hard token limit — if payload exceeds this after all pipeline steps, trim by dropping oldest non-system messages and apply middle-out truncation. 0 (default) uses soft_limit × 2 (backward-compatible). Non-zero values set an absolute token budget. |
Rate limiting, routing weights, and circuit breaker configuration.
| Field | Default | Description |
|---|---|---|
ratelimit_max_tpm |
250000 |
Max tokens per minute per upstream host (0 = disabled) |
ratelimit_max_rpm |
15 |
Max requests per minute per upstream host (0 = disabled) |
routing_strategy |
"" (latency) |
Routing strategy when auto_reorder_by_latency is enabled. "" or "latency" = latency-only sorting. "balanced" = weighted scoring using latency, cost, capability matching, and per-model score bonus. |
routing_latency_weight |
1.0 |
Weight for latency normalization in balanced scoring (0.0-10.0). Higher = prioritize faster models. |
routing_cost_weight |
0.0 |
Weight for cost normalization in balanced scoring (0.0-10.0). Higher = prioritize cheaper models. |
max_cost_per_request |
0 (disabled) |
Maximum allowed cost in USD per request. 0 = no limit. Logged but not yet enforced. |
max_retry_attempts |
3 |
Max retry attempts |
half_open_max_requests |
3 |
Max requests in half-open state during circuit recovery |
retryable_status_codes |
[429, 500, 502, 503, 504] |
HTTP status codes that trigger fallback to the next model in an agent chain. Warning: setting this field REPLACES the built-in defaults entirely. You must include all codes you want retryable (including the standard ones). Per-provider override available via providers.<name>.retryable_status_codes (provider-level replaces global for that provider). |
empty_stream_as_error |
true |
Treat upstream responses with 200 OK and zero-byte body as errors. When enabled, an SSE error payload is emitted to the client (code: empty_response), which OpenCode recognizes as a retryable error, allowing fallback to the next target. The metric nenya_empty_stream_total is incremented. Set to false to preserve backward compatibility (empty streams treated as successful responses, resulting in empty assistant messages). |
auto_retry_on_context_limit |
false |
Automatically retry the request with reduced max_tokens when the upstream provider returns a context limit exceeded error. When enabled, the gateway halves the max_tokens value and retries up to max_retry_attempts times before giving up. |
cost_mode |
"balanced" |
Cost optimization strategy for balanced routing: "economy" (cheapest first), "balanced" (default tradeoff), or "quality" (quality/scoring priority). Controls cost weight scaling. |
billing_economy_scale |
1.5 |
Multiplier for cost weight in "economy" mode |
billing_quality_scale |
0.0 |
Multiplier for cost weight in "quality" mode |
Tier-0 regex-based secret redaction runs on every request, before any other pipeline step. Includes configurable engine for privacy filtering and optional Shannon entropy detection for unknown high-entropy tokens.
| Field | Default | Description |
|---|---|---|
enabled |
true |
Enable/disable the filter. Defaults to true if redact_patterns are provided but field omitted. |
redact_patterns |
[]string (9 built-in) | Custom regex patterns. Replaces built-in patterns if set. Built-in patterns match: AWS keys, GitHub tokens, Google OAuth, sk- API keys, PEM private keys, AWS credential file lines, password/key assignments, Docker tokens, SendGrid keys. |
redaction_label |
"[REDACTED]" |
Replacement string for matched secrets |
redact_output |
false |
Enable stream output filtering (secret redaction and execution policy blocking on responses) |
redact_output_window |
4096 |
Sliding window size (in chars) for cross-chunk pattern matching in output streams |
fail_open |
true |
When the engine (Ollama/cloud) is unreachable, skip summarization and forward the original payload. If false, hard-limit payloads are truncated even when the engine fails. |
entropy_enabled |
false |
Enable Shannon entropy-based secret detection. Catches high-entropy tokens that don't match regex patterns (JWTs, opaque API keys, base64 credentials). |
entropy_threshold |
4.5 |
Shannon entropy threshold in bits/character. Tokens above this value are redacted. English text: ~3.5, hex secrets: ~4.0, base64 tokens: ~5.5, random API keys: ~4.5-5.5. |
entropy_min_token |
20 |
Minimum token length (in characters) to evaluate for entropy. Shorter tokens are skipped to reduce false positives. |
engine |
string or object | (see below) |
The bouncer implements a 3-tier pipeline:
- Tier 1 (pass-through): content below calculated soft limit
- Tier 2 (engine summarization): content between soft and hard limit
- Tier 3 (truncation + engine): content above hard limit. Uses middle-out or TF-IDF strategy.
Engine supports two forms: agent reference ("engine": "summarizer") or inline object ({"provider": "...", "model": "..."}).
Both bouncer.engine and window.engine support two forms:
References a named agent by name. The agent's model list becomes the engine's fallback chain. The agent's system_prompt / system_prompt_file are used as defaults (overridable by inline fields on the EngineRef).
{
"bouncer": {
"engine": "summarizer"
}
}Full engine configuration with explicit provider, model, system prompt, and optional inline fallback chain.
{
"bouncer": {
"engine": {
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"system_prompt": "Summarize the following text...",
"system_prompt_file": "/path/to/system.txt",
"models": ["qwen2.5-coder:7b", "phi-3.5-mini-instruct"]
}
}
}Structured logging: Engine calls log the caller (bouncer or window), agent name (or inline), provider, model, and attempt/total for observability.
Optimizations to improve upstream provider prefix cache hit rates by stabilizing the prompt structure.
| Field | Default | Description |
|---|---|---|
enabled |
true (auto) |
Master toggle. Auto-enabled when any sub-field is explicitly set to true. |
pin_system_first |
true |
Reorder all system role messages to the top of the messages array |
stable_tools |
true |
Sort tools[] array by function.name for deterministic ordering |
skip_redaction_on_system |
true |
Skip Tier-0 regex redaction on system messages to preserve prefix byte-identity |
Text compaction applied to all message content (both string and multi-part content arrays).
| Field | Default | Description |
|---|---|---|
enabled |
true (auto) |
Master toggle. Auto-enabled when any sub-field is explicitly set to true. |
normalize_line_endings |
true |
Convert CRLF to LF |
trim_trailing_whitespace |
true |
Remove trailing spaces/tabs from each line |
collapse_blank_lines |
true |
Collapse runs of 3+ blank lines to max 2 |
compaction_preset |
"" |
Compaction preset: "aggressive" (all features), "balanced" (whitespace + JSON minify), or "minimal" (disabled). Individual fields override preset values. |
json_minify |
true |
Minify the final JSON body with json.Compact
|
prune_stale_tools |
false |
Compact old assistant+tool response pairs into summary placeholders |
tool_protection_window |
4 |
Number of most recent messages to protect from tool call pruning |
prune_thoughts |
false |
Strip reasoning blocks from assistant messages to save context tokens |
Compaction runs after redaction, before engine interception. JSON minify runs at the very end of the pipeline.
When prune_stale_tools is enabled, the gateway scans the messages array backwards (from oldest to newest) for completed tool execution pairs: an assistant message containing tool_calls, immediately followed by one or more tool messages with the results. When such a pair is found outside the protection window, both the assistant message and its tool responses are replaced with a single summary message:
[System] Tool 'tool_name' was executed previously. Result compacted to save context window.
The tool name is extracted from the first tool call's function.name field. If unavailable, the tool_call_id is used as a fallback.
Protection window: The last tool_protection_window messages (default: 4) are never modified, preserving the LLM's immediate reasoning context including the most recent tool calls.
Safety: Orphaned tool calls (assistant with tool_calls but missing corresponding tool response, e.g., due to stream interruption) are left untouched. The pruning is skipped entirely for IDE clients.
When prune_thoughts is enabled, the gateway strips reasoning blocks from all assistant messages in the conversation history. This targets <think.../think> tags used by reasoning models (DeepSeek, OpenRouter, Groq, Gemini):
Text tag pruning: Inside the content string, the gateway looks for the <think opening tag and </think> closing tag. When found:
- Both tags and everything between them are removed.
- The removed block is replaced with
[Reasoning pruned by gateway]. - If the opening tag exists but the closing tag is missing (stream interruption), everything from
<thinkto the end of the string is replaced. - Multiple reasoning blocks in a single message are all pruned.
The structured reasoning_content field is not stripped by thought pruning. It is preserved in the shared pipeline and stripped per-target during request sanitization — only for providers that do not support reasoning.
Uses strings.Index (not regex) for zero-allocation scanning of large payloads.
Sliding window conversation compaction for long conversations. When the estimated token count exceeds max_context * trigger_ratio, older messages are summarized (or truncated) and replaced with a single system summary message.
| Field | Default | Description |
|---|---|---|
enabled |
false |
Master toggle (off by default) |
mode |
"summarize" |
"summarize" (engine), "truncate" (hard cut), or "tfidf" (relevance-scored, zero network calls) |
active_messages |
6 |
Number of recent messages to preserve unchanged |
trigger_ratio |
0.8 |
Trigger when tokens exceed max_context * ratio (0.0-1.0) |
summary_max_runes |
4000 |
Maximum length of the generated summary |
max_context |
128000 |
Context window size. Overridden by agent model max_context when routing through agents. |
engine |
string or object | Agent name reference or inline engine configuration for window summarization |
In-memory LRU cache for deterministic response caching. Responses are cached by SHA-256 fingerprint of the request payload. On cache hit, the stored SSE stream is replayed to the client with X-Nenya-Cache-Status: HIT header.
| Field | Default | Description |
|---|---|---|
enabled |
false |
Master toggle (off by default) |
max_entries |
512 |
Maximum number of cached responses (LRU eviction) |
max_entry_bytes |
1048576 (1 MB) |
Maximum size per cached response |
ttl_seconds |
3600 (1 hour) |
Time-to-live for cached entries |
evict_every_seconds |
300 (5 minutes) |
Background eviction sweep interval |
force_refresh_header |
"x-nenya-cache-force-refresh" |
HTTP header name that bypasses cache when present |
Cache key: Deterministic SHA-256 computed from model, messages, temperature, top_p, max_tokens, tools, tool_choice, response_format, stop, stream.
Bypass: Send any non-empty value for the configured force_refresh_header to force a cache miss.
{
"debug": {
"pprof_enabled": false
}
}| Field | Default | Description |
|---|---|---|
pprof_enabled |
false |
Enable Go pprof endpoints at /debug/pprof/. Requires auth. |
{
"agents": {
"default": {
"strategy": "fallback",
"models": ["gemini-3-flash", "deepseek-chat"]
},
"build": {
"strategy": "fallback",
"cooldown_seconds": 60,
"failure_threshold": 5,
"models": [
"gemini-3-flash",
{ "provider": "ollama", "model": "qwen2.5-coder:7b" }
]
}
}
}Model entries support flexible selectors: plain strings (registry lookup), objects with provider+model, or regex patterns (provider_rgx/model_rgx) for dynamic catalog expansion. See Model Selector Syntax for the full syntax reference.
Models listed in the built-in Model Registry can be specified as plain strings. Provider and max_context are resolved automatically:
{
"agents": {
"build": {
"strategy": "fallback",
"models": ["gemini-3-flash", "deepseek-reasoner"]
}
}
}For custom or local models (not in the registry), or to override registry defaults, use full objects:
{
"agents": {
"build": {
"strategy": "fallback",
"models": [
"gemini-3-flash",
{
"provider": "ollama",
"model": "qwen2.5-coder:7b",
"max_context": 32000,
"url": "http://localhost:11434/v1/chat/completions"
},
{
"provider": "zen",
"model": "claude-opus-4-7",
"format": "anthropic",
"max_context": 200000
}
]
}
}
}Both styles can be mixed in the same models array.
Model entries support flexible selectors that expand at runtime against the discovery catalog:
{
"agents": {
"all-deepseek": { "models": [{ "provider": "deepseek" }] },
"all-claude-opus": { "models": [{ "model": "claude-opus" }] },
"zen-reasoning": { "models": [{ "provider_rgx": "zen", "model_rgx": ".*-reasoner" }] }
}
}Selector precedence (highest to lowest): exact provider+model (1), exact provider+model_rgx (2), provider_rgx+exact model (3), exact provider (4), exact model (5), provider_rgx+model_rgx (6), exact provider_rgx (7), exact model_rgx (8). First match wins.
{
"providers": {
"openai": {
"url": "https://api.openai.com/v1/chat/completions",
"auth_style": "bearer",
"timeout_seconds": 30,
"ratelimit_max_rpm": 500,
"ratelimit_max_tpm": 2000000,
"retryable_status_codes": [429, 500, 502, 503, 504],
"auto_retry_on_context_limit": false
}
}
}API keys are loaded via provider_keys (keyed by provider name). See Secrets for details.
| Field | Default | Description |
|---|---|---|
ratelimit_max_rpm |
— | Per-provider override for max requests per minute |
ratelimit_max_tpm |
— | Per-provider override for max tokens per minute |
max_retry_attempts |
— | Per-provider override for max retry attempts (takes precedence over global governance.max_retry_attempts) |
retryable_status_codes |
— | Provider-level override for retryable statuses (replaces global) |
format_urls |
— | Maps wire format to endpoint URL (e.g., {"anthropic": "..."}) |
accounts |
— | Multi-account credential pool with LRU selection |
billing |
— | Billing model, quota tracking, free model detection |
thinking |
— | Per-provider thinking/reasoning mode configuration |
| Style | Header(s) | Used By |
|---|---|---|
bearer |
Authorization: Bearer <key> |
OpenAI, DeepSeek, Groq, Together, SambaNova, Cerebras, GitHub, z.ai, z.ai Coding Plan, Mistral, xAI, Perplexity, Cohere, DeepInfra, Moonshot, Qwen, MiniMax |
bearer+x-goog |
Both Authorization: Bearer + x-goog-api-key
|
Gemini |
anthropic |
x-api-key: <key> + anthropic-version: 2023-06-01
|
Anthropic |
azure |
api-key: <key> |
Azure OpenAI |
For high-volume providers with multiple API keys:
{
"providers": {
"openai": {
"accounts": [
{ "id": "account-1", "type": "apikey", "credential": "sk-proj-xxxxx" },
{ "id": "account-2", "type": "apikey", "credential": "sk-proj-yyyyy" }
]
}
}
}AccountPool: LRU selection with 6 error classes, exponential backoff (±5% jitter), model-level locks. State persisted in <provider>.accounts.json.
{
"providers": {
"zai": {
"thinking": {
"enabled": true,
"clear_thinking": false
}
}
}
}| Field | Default | Description |
|---|---|---|
enabled |
true |
Enable thinking mode for reasoning-capable models |
clear_thinking |
false |
Strip reasoning_content from responses to save output tokens |
Note: Per-model thinking metadata (min, max, zero_allowed, dynamic_allowed, levels) is defined in the internal ModelRegistry and is not user-configurable. Model entries can override provider defaults via thinking field.
Configuration for local Ollama model lifecycle management:
{
"local_engine": {
"base_url": "http://127.0.0.1:11434",
"timeout_seconds": 120,
"max_sessions": 3,
"auto_load": false,
"startup_models": ["qwen2.5-coder:7b"]
}
}| Field | Default | Description |
|---|---|---|
base_url |
http://127.0.0.1:11434 |
Ollama API endpoint |
timeout_seconds |
120 |
Per-operation timeout |
max_sessions |
3 |
Maximum loaded models with LRU eviction |
auto_load |
false |
Automatically load models when referenced |
startup_models |
[] |
Models to preload on gateway startup |
Per-provider billing and quota tracking configuration for cost-aware routing.
{
"providers": {
"openrouter": {
"billing": {
"model": "mixed",
"period_hours": 730,
"included_usd": 10.0,
"quota_source": "headers",
"quota_extraction": {
"mode": "headers",
"remaining_header": "X-RateLimit-Remaining",
"limit_header": "X-RateLimit-Limit",
"reset_header": "X-RateLimit-Reset"
},
"free_models": ["gpt-4o-mini-free"]
}
},
"zai": {
"billing": {
"model": "credit",
"quota_source": "api",
"quota_url": "https://api.zai.com/v1/billing/quota",
"quota_interval": "1h",
"quota_timeout_seconds": 10,
"quota_extraction": {
"mode": "simple_json",
"balance_path": "credits_remaining",
"reset_field": "credits_reset_at",
"reset_unit": "unix_seconds"
}
}
}
}
}| Field | Type | Description |
|---|---|---|
model |
string |
Billing model: subscription, credit, free, mixed
|
period_hours |
int |
Period length in hours (for period reset automation) |
included_usd |
float64 |
Included credit amount for computing utilization ratio |
balance_usd |
float64 |
Static balance (only used if quota_source: none) |
quota_source |
string |
Quota source: none, api, headers
|
quota_url |
string |
URL to fetch quota (for api source) |
quota_interval |
string |
Poll interval (e.g., 1h, 30m) |
quota_timeout_seconds |
int |
Timeout for quota fetch (default 10s) |
quota_extraction |
object |
Extraction config (see below) |
free_only |
bool |
Strip paid models from target list (only for model: free) |
free_models |
[]string |
Explicit list of free model IDs for scoring bonus |
simple_json — Extract balance from JSON response:
{
"mode": "simple_json",
"balance_path": "data.credits_remaining",
"reset_field": "data.credits_reset_at",
"reset_unit": "unix_seconds"
}-
balance_path— JSON pointer to balance field -
reset_field— JSON pointer to reset timestamp -
reset_unit—unix_secondsorrfc3339
max_from_array — Extract max value from array:
{
"mode": "max_from_array",
"array_path": "data.accounts",
"value_field": "credits_remaining",
"value_divide_by": 100,
"reset_field": "reset_at",
"level_field": "tier"
}-
value_divide_by— Divide extracted value by this (e.g., cents to dollars)
headers — Extract from response headers:
{
"mode": "headers",
"remaining_header": "X-Remaining-Credits",
"limit_header": "X-Max-Credits",
"reset_header": "X-Reset-Time"
}{
"agents": {
"my-agent": {
"models": ["gemini-3-flash"],
"budget_limit_usd": 50.0
}
}
}The budget_limit_usd field enforces per-agent spend limits independent of provider-level exhaustion.
Nenya dynamically fetches model catalogs from upstream providers at startup and on SIGHUP reload. This enables automatic discovery of custom models (e.g., Ollama) and reduces the need for manual registry updates.
-
Startup/Reload — For each configured provider with an API key, fetch
/v1/modelsin parallel (10s timeout per provider) - Provider-specific parsing — Each provider has a dedicated parser for its response format
- Three-tier merge — Discovered models are merged with static registry (config overrides take precedence)
- Catalog update — The merged catalog is used for all subsequent model resolution
| Priority | Source | Description |
|---|---|---|
| 1 | Config overrides | Agent model entries with explicit provider, max_context, max_output, or format fields |
| 2 | Discovered models | Models fetched from provider /v1/models endpoints at startup/reload |
| 3 | Static registry | Built-in ModelRegistry fallback for known models |
This allows:
- Custom local models (Ollama) to be discovered automatically
- Provider-specific overrides without code changes
- Graceful fallback when discovery fails (static registry still works)
The discovery package enforces strict security boundaries:
- Response body limits — 10 MB max per provider response (DoS protection)
-
JSON decode limits — 10 MB max with
DisallowUnknownFields(malformed JSON rejection) -
Content-type validation — Only
application/jsonresponses are parsed - Model ID sanitization — Max 256 chars, printable characters only (XSS prevention)
- Per-provider timeouts — 10s context timeout per fetch (no hanging)
- Panic recovery — Goroutines have defer/recover to prevent crashes
-
Auth header injection — Gemini uses
x-goog-api-keyheader (not query params) - Shared HTTP client — Reused with proper TLS timeouts (no resource leaks)
If discovery fails for any provider:
- The provider is skipped with a warning log
- Static registry models for that provider still work
- Other providers' discovered models are still used
When discovery.auto_agents is enabled, Nenya automatically generates agent definitions from discovered models:
| Agent | Filter | Strategy |
|---|---|---|
auto_fast |
≤32k context, ≤4k output | round-robin |
auto_reasoning |
reasoning + ≥128k context | fallback |
auto_vision |
vision capability | round-robin |
auto_tools |
tool_calls capability | round-robin |
auto_large |
≥200k context | fallback |
auto_balanced |
32k–128k context | round-robin |
auto_coding |
tool_calls + coding prefix | fallback |
User-defined agents take precedence over auto-generated ones.
Enable server.log_level: "debug" to see discovery details:
DEBUG discovery catalog providers=[anthropic:3 gemini:5 openrouter:42]
systemctl reload nenyaReloads config and re-discovers model catalogs. Preserves UsageTracker, Metrics, and caches. On validation failure, continues with old config.
Validate config without starting the gateway:
nenya -validate -config /etc/nenya/config.d/The validator checks:
- Required fields:
agentsmust have at least one entry - Model references: agent model entries resolve to valid providers
- Provider configs: auth_style is recognized (bearer, bearer+x-goog, anthropic, azure)
- Bouncer engine: agent reference resolves to a valid agent name
- Secrets file:
secrets.jsonexits and has expected keys - Mutually exclusive options: config.d/ and config.json not both set
Configuration format changed from TOML to JSON with semantic grouping. Old interceptor, ollama, ratelimit, and filter sections are now unified under governance and bouncer with engine abstraction.
The old security_filter top-level section has been renamed to bouncer:
-
security_filter.patterns→bouncer.redact_patterns -
security_filter.replacement→bouncer.redaction_label -
security_filter.fail_open→bouncer.fail_open
The engine configuration was moved from a separate ollama section into the bouncer:
{
"bouncer": {
"engine": {
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
}| Step | Action | Condition |
|---|---|---|
| 1 | Response cache lookup | if enabled |
| 2 | MCP auto-search | if agent has mcp.auto_search
|
| 3 | MCP tool injection | if agent has MCP servers |
| 4 | Prefix cache optimizations | pin system messages, sort tools |
| 5 | Agent system prompt injection | if no existing system message |
| 6 | Tier-0 regex redaction | secret patterns via bouncer
|
| 6b | Shannon entropy redaction | if entropy_enabled
|
| 7 | Text compaction | normalize, trim, collapse blanks |
| 8 | Stale tool call pruning | if prune_stale_tools enabled |
| 9 | Thought pruning | if prune_thoughts enabled |
| 10 | Window compaction | if enabled and threshold exceeded |
| 11 | Interceptor chain execution | Priority-ordered Redact/Entropy/TFIDF/Bouncer interceptors |
| 11b | Engine interception | 3-tier summarization with TF-IDF fallback |
| 12 | Format-aware body conversion | if model has format: "anthropic"
|
| 13 | JSON minification | final body compaction |
| 14 | Response cache store | if enabled |
| 15 | MCP auto-save | if agent has mcp.auto_save (async) |
- Providers — Full provider reference table and special behaviors
- Routing — Balanced scoring algorithm
- Model Discovery — Dynamic model catalog fetching
- Secrets — Secrets format
- Architecture — Request lifecycle and pipeline order
The following table lists all models in the built-in registry with context windows, output limits, and pricing:
| Model | Provider | Context | Max Output | Input ($/1M) | Output ($/1M) |
|---|---|---|---|---|---|
glm-4.6v-flash |
zai | 200,000 | 128,000 | $0.10/M | $0.10/M |
glm-4.6v-flashx |
zai | 200,000 | 128,000 | $0.10/M | $0.10/M |
glm-4-32b-0414-128k |
zai | 128,000 | 16,000 | $0.50/M | $2.00/M |
nemotron-3-super |
nvidia_free | 4,000 | 1,024 | $0.10/M | $0.10/M |
qwen-3.6-plus |
qwen_free | 8,000 | 8,192 | $0.10/M | $0.10/M |
minimax-m2.5 |
minimax_free | 8,000 | 4,096 | $0.10/M | $0.10/M |
llama-3.3-70b-versatile |
groq | 131,072 | 8,192 | $0.59/M | $0.79/M |
mixtral-8x7b-32768 |
groq | 32,768 | 8,192 | $0.27/M | $0.27/M |
llama-3.1-405b-instruct |
sambanova | 128,000 | 4,096 | $0.10/M | $0.10/M |
llama-3.3-70b |
cerebras | 8,192 | 8,192 | $0.10/M | $0.10/M |
gpt-4o |
github | 8,000 | 4,096 | $2.50/M | $10.00/M |
phi-3.5-mini-instruct |
github | 128,000 | 4,096 | $0.10/M | $0.10/M |
qwen2.5-72b-turbo |
together | 32,768 | 4,096 | $0.90/M | $0.90/M |
claude-opus-4-5 |
anthropic | 200,000 | 64,000 | $5.00/M | $25.00/M |
claude-opus-4-0 |
anthropic | 200,000 | 32,000 | $15.00/M | $75.00/M |
claude-sonnet-4-5 |
anthropic | 200,000 | 64,000 | $3.00/M | $15.00/M |
claude-sonnet-4-0 |
anthropic | 200,000 | 64,000 | $3.00/M | $15.00/M |
claude-haiku-4-5 |
anthropic | 200,000 | 64,000 | $1.00/M | $5.00/M |
claude-3-7-sonnet-20250219 |
anthropic | 128,000 | 8,192 | $3.00/M | $15.00/M |
claude-3-5-sonnet-20241022 |
anthropic | 200,000 | 64,000 | $3.00/M | $15.00/M |
claude-3-5-haiku-latest |
anthropic | 200,000 | 8,192 | $0.25/M | $1.25/M |
mistral-large-latest |
mistral | 256,000 | 262,144 | $4.00/M | $12.00/M |
mistral-small-latest |
mistral | 256,000 | 256,000 | $0.20/M | $0.60/M |
mistral-medium-latest |
mistral | 256,000 | 16,384 | $2.70/M | $8.10/M |
codestral-latest |
mistral | 128,000 | 4,096 | $0.30/M | $0.30/M |
devstral-medium-latest |
mistral | 256,000 | 262,144 | $0.20/M | $0.60/M |
magistral-medium-latest |
mistral | 128,000 | 16,384 | $2.50/M | $7.50/M |
pixtral-large-latest |
mistral | 128,000 | 128,000 | $0.20/M | $0.60/M |
grok-4 |
xai | 256,000 | 64,000 | $5.00/M | $15.00/M |
grok-4-fast |
xai | 2,000,000 | 32,000 | $0.50/M | $5.00/M |
grok-3 |
xai | 131,072 | 8,192 | $3.00/M | $12.00/M |
grok-3-fast |
xai | 131,072 | 8,192 | $0.50/M | $5.00/M |
grok-3-mini |
xai | 131,072 | 8,192 | $0.50/M | $5.00/M |
sonar-pro |
perplexity | 200,000 | 8,192 | $3.00/M | $15.00/M |
sonar-reasoning-pro |
perplexity | 128,000 | 4,096 | $2.00/M | $8.00/M |
sonar-deep-research |
perplexity | 128,000 | 32,768 | $2.00/M | $8.00/M |
sonar |
perplexity | 128,000 | 4,096 | $1.00/M | $1.00/M |
qwen3.5-plus |
zen | 131,072 | 8,192 | — | — |
minimax-m2.7 |
zen | 200,000 | 8,192 | — | — |
minimax-m2.5-free |
zen | 200,000 | 8,192 | — | — |
kimi-k2.6 |
zen | 262,144 | 65,536 | — | — |
kimi-k2.5 |
zen | 131,072 | 32,768 | — | — |
big-pickle |
zen | 200,000 | 8,192 | — | — |
ling-2.6-flash-free |
zen | 200,000 | 8,192 | — | — |
hy3-preview-free |
zen | 131,072 | 8,192 | — | — |
nemotron-3-super-free |
zen | 4,000 | 1,024 | — | — |
gpt-5-nano |
zen | 200,000 | 8,192 | — | — |
claude-opus-4-1-20250805 |
anthropic | 200,000 | 32,000 | $15.00/M | $75.00/M |
gemini-3.5-flash |
gemini | 1,048,576 | 65,536 | $0.075/M | $0.30/M |
gemini-2.5-pro |
gemini | 1,048,576 | 65,536 | $1.25/M | $10.00/M |
gemini-3.1-pro-preview |
gemini | 1,048,576 | 65,536 | $2.00/M | $15.00/M |
gpt-5.2 |
openai | 400,000 | 128,000 | — | — |
gpt-5.3-codex |
openai | 400,000 | 128,000 | — | — |
gpt-5.3-codex-spark |
openai | 128,000 | 128,000 | — | — |
gpt-5.4 |
openai | 1,050,000 | 128,000 | — | — |
gpt-5.4-mini |
openai | 400,000 | 128,000 | — | — |
gpt-5.5 |
openai | 272,000 | 128,000 | — | — |
codex-auto-review |
openai | 272,000 | 128,000 | — | — |
kimi-k2-thinking |
zen | 131,072 | 32,768 | — | — |
kimi-k2 |
moonshot | 131,072 | 32,768 | — | — |
claude-opus-4-7 |
anthropic | 1,000,000 | 128,000 | $5.00/M | $25.00/M |
claude-opus-4-6 |
anthropic | 1,000,000 | 128,000 | $5.00/M | $25.00/M |
claude-sonnet-4-6 |
anthropic | 200,000 | 64,000 | $3.00/M | $15.00/M |
Getting Started
- Home — Project overview
- Quick Start — Install and run in 5 minutes
- Client Setup — OpenCode, Cursor, and other clients
- Deployment — Bare metal, container, Kubernetes
Core Concepts
- Configuration — Config reference and examples
- Providers — 24 providers, capabilities, special behaviors
- Routing — Latency-aware routing and fallback chains
- Architecture — Package overview and request lifecycle
- MCP Integration — MCP server integration
Reference
- Passthrough Proxy — Raw provider endpoint proxying
- Secrets — Systemd credentials and container secrets
- Model Discovery — Dynamic model catalog fetching
- API Endpoints — Endpoint reference
- Adapters — Provider adapter system
- Billing — Billing-aware routing and quota tracking
- Caching — Exact-match and semantic caching
- Provider Capabilities — Service kinds matrix
- Unknown MaxContext — Unknown context window behavior
Operations
- Demo — Test all pipeline tiers
- Troubleshooting — Common issues and solutions
- FAQ — Frequently asked questions
- Security — Security policy and vulnerability reporting
Project
- Roadmap — Planned features
- Disclaimer — Legal disclaimer