Configuration

Configuration Reference

Nenya reads its configuration from a JSON file or directory (default: /etc/nenya/). See Architecture for the request flow, Providers for provider-specific settings, and Secrets for credential configuration.

Environment variables
Top-level sections
Configuration file structure
Multi-file configuration
Key configuration blocks
- Server
- Context
- Governance
- Bouncer
- Prefix Cache
- Compaction
- Window
- Response Cache
- Agents
- Discovery
- Providers
- local_engine
Billing configuration
Hot reload
Configuration validation
Migration guide
Processing pipeline order

Environment Variables

Variable	Default	Effect
`PORT`	`8080`	Listening port (overrides `server.listen_addr`). Validated via `net.LookupPort`.
`HOST`	—	Optional bind address (e.g. `127.0.0.1`). Only used when combined with `PORT`.
`NENYA_CONFIG_DIR`	`/etc/nenya/`	Config root directory
`NENYA_CONFIG_FILE`	—	Single JSON config file (takes precedence over `NENYA_CONFIG_DIR`)
`NENYA_SECRETS_DIR`	—	Secrets directory for containers (see Secrets)

After flags are parsed, NENYA_CONFIG_DIR and NENYA_CONFIG_FILE override -config-dir and -config if set. If both env vars are set, NENYA_CONFIG_FILE still wins at load time (single-file mode).

Top-Level Sections

Section	JSON key	Description
Server	`server`	Listen address, body limits, token estimation
Context	`context`	Truncation, TF-IDF relevance scoring, context management
Governance	`governance`	Rate limiting, retries, routing policy
Bouncer	`bouncer`	PII redaction, entropy detection, engine interception
Prefix Cache	`prefix_cache`	System prompt and tool caching
Compaction	`compaction`	JSON minification, whitespace collapse, tool pruning
Window	`window`	Sliding context window with summarization
Response Cache	`response_cache`	Response caching with LRU eviction
Agents	`agents`	Model lists, strategies, circuit breakers, MCP config
Discovery	`discovery`	Dynamic model discovery, auto-agents
Providers	`providers`	Upstream API endpoints

Configuration File Structure

All /v1/* and /proxy/* routes require Authorization: Bearer <client_token> from secrets.

When a directory is specified, all *.json files (excluding secrets.json) are loaded in alphabetical order and deep-merged. Map fields (agents, providers, mcp_servers) merge per-key; struct fields use last-file-wins. Defaults are applied once after the merge.

config.json vs config.d/: Under the config root directory, if config.d/ exists and contains at least one *.json file, those files are merged and config.json in the parent directory is not read. If config.d/ exists but has no JSON files, the loader falls back to config.json at the parent level.

When a file is specified, only that file is loaded (single-file mode).

Multi-File Configuration (Directory Mode)

When -config points to a directory (the default), Nenya loads all *.json files in sorted order and deep-merges them:

/etc/nenya/
├── config.d/
│   ├── 01-server.json       # server, governance, bouncer, compaction
│   ├── 02-providers.json    # provider URL or auth overrides
│   ├── 03-agents.json       # agent definitions
│   └── 04-mcp.json          # MCP server definitions
└── secrets.json             # EXCLUDED (loaded via systemd credential)

Merge rules:

Field Type	Behavior
`agents` (map)	Per-key merge — later files add or override individual agents
`providers` (map)	Per-key merge — later files add or override individual providers
`mcp_servers` (map)	Per-key merge
`server`, `governance`, `bouncer`, etc. (struct)	Last file wins — if multiple files set the same field, the last one in alphabetical order takes precedence

This lets you split configuration however makes sense for your deployment:

Example: `01-server.json`

{
  "server": {
    "listen_addr": ":8080"
  },
  "bouncer": {
    "enabled": true,
    "engine": {
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  }
}

Example: `03-agents.json`

{
  "agents": {
    "plan": {
      "strategy": "fallback",
      "models": ["deepseek-reasoner"]
    },
    "build": {
      "strategy": "fallback",
      "models": ["gemini-3-flash"]
    }
  }
}

Key Configuration Blocks

Server

{
  "server": {
    "listen_addr": ":8080",
    "max_body_bytes": 10485760,
    "log_level": "info",
    "secure_memory_required": true,
    "user_agent": "nenya/1.0"
  }
}

Field	Default	Description
`listen_addr`	`":8080"`	Bind address and port
`max_body_bytes`	`10485760` (10 MB)	Maximum incoming request body size
`log_level`	`"info"`	Log level: `"debug"`, `"info"`, `"warn"`, or `"error"`. The `-verbose` flag overrides this to `"debug"`.
`secure_memory_required`	`true`	Require mlock-backed secure memory for tokens. When `true`, gateway fails to start if `mlock` is unavailable. Set to `false` to allow heap fallback (e.g., macOS development).
`user_agent`	`"nenya/1.0"`	User-Agent header sent to upstream providers

Context

Unified configuration for context management and truncation.

The interceptor implements a 3-tier pipeline for the last user message content, with limits derived from the target model's max_context (characters, not tokens). If the model has no max_context, fallback defaults of 4000/24000 are used.

Tier 1 (pass-through): content below soft_limit runes
Tier 2 (engine summarization): content between soft_limit and hard_limit runes
Tier 3 (truncation + engine): content above hard_limit runes. Truncation uses the strategy selected by truncation_strategy:
- "middle-out" (default): positional — keeps first/last percentages, discards middle
- When tfidf_query_source is set: TF-IDF scoring — splits content into blocks (paragraphs + code fences), scores each block's relevance to the user's prior messages or the start of the current message, and greedily keeps the most relevant blocks within budget. First/last blocks are pinned as a safety net. If TF-IDF reduces the payload below soft_limit, the engine call is skipped entirely (zero network overhead).

Field	Default	Description
`truncation_strategy`	`"middle-out"`	Truncation method. `"middle-out"` (positional) or any value — TF-IDF is activated by setting `tfidf_query_source` instead.
`truncation_keep_first_pct`	`15.0`	Percentage of blocks to pin from the start when truncating (safety net for both middle-out and TF-IDF)
`truncation_keep_last_pct`	`25.0`	Percentage of blocks to pin from the end when truncating (safety net for both middle-out and TF-IDF)
`tfidf_query_source`	`""` (disabled)	Enable TF-IDF relevance-scored truncation for Tier 3. `""` = disabled (use middle-out). `"prior_messages"` = use previous user messages as query terms. `"self"` = use first 500 runes of the massive message as query terms. When enabled, if TF-IDF reduces the payload below `soft_limit`, the engine call is skipped entirely.
`auto_context_skip`	`false`	Automatically skip models that do not meet context requirements for the current request. When enabled, models with `max_context` smaller than the request's input token count are excluded from routing, preventing errors and improving latency.
`auto_reorder_by_latency`	`false`	Dynamically sort targets based on historical response times. When enabled, targets are reordered by median latency (fastest first) with ±5% jitter to prevent thundering herd.
`hard_limit_tokens`	`0` (auto)	Hard token limit — if payload exceeds this after all pipeline steps, trim by dropping oldest non-system messages and apply middle-out truncation. `0` (default) uses `soft_limit × 2` (backward-compatible). Non-zero values set an absolute token budget.

Governance

Rate limiting, routing weights, and circuit breaker configuration.

Field	Default	Description
`ratelimit_max_tpm`	`250000`	Max tokens per minute per upstream host (0 = disabled)
`ratelimit_max_rpm`	`15`	Max requests per minute per upstream host (0 = disabled)
`routing_strategy`	`""` (latency)	Routing strategy when `auto_reorder_by_latency` is enabled. `""` or `"latency"` = latency-only sorting. `"balanced"` = weighted scoring using latency, cost, capability matching, and per-model score bonus.
`routing_latency_weight`	`1.0`	Weight for latency normalization in balanced scoring (0.0-10.0). Higher = prioritize faster models.
`routing_cost_weight`	`0.0`	Weight for cost normalization in balanced scoring (0.0-10.0). Higher = prioritize cheaper models.
`max_cost_per_request`	`0` (disabled)	Maximum allowed cost in USD per request. 0 = no limit. Logged but not yet enforced.
`max_retry_attempts`	`3`	Max retry attempts
`half_open_max_requests`	`3`	Max requests in half-open state during circuit recovery
`retryable_status_codes`	`[429, 500, 502, 503, 504]`	HTTP status codes that trigger fallback to the next model in an agent chain. Warning: setting this field REPLACES the built-in defaults entirely. You must include all codes you want retryable (including the standard ones). Per-provider override available via `providers.<name>.retryable_status_codes` (provider-level replaces global for that provider).
`empty_stream_as_error`	`true`	Treat upstream responses with `200 OK` and zero-byte body as errors. When enabled, an SSE error payload is emitted to the client (code: `empty_response`), which OpenCode recognizes as a retryable error, allowing fallback to the next target. The metric `nenya_empty_stream_total` is incremented. Set to `false` to preserve backward compatibility (empty streams treated as successful responses, resulting in empty assistant messages).
`auto_retry_on_context_limit`	`false`	Automatically retry the request with reduced max_tokens when the upstream provider returns a context limit exceeded error. When enabled, the gateway halves the max_tokens value and retries up to `max_retry_attempts` times before giving up.
`cost_mode`	`"balanced"`	Cost optimization strategy for balanced routing: `"economy"` (cheapest first), `"balanced"` (default tradeoff), or `"quality"` (quality/scoring priority). Controls cost weight scaling.
`billing_economy_scale`	`1.5`	Multiplier for cost weight in `"economy"` mode
`billing_quality_scale`	`0.0`	Multiplier for cost weight in `"quality"` mode

Bouncer

Tier-0 regex-based secret redaction runs on every request, before any other pipeline step. Includes configurable engine for privacy filtering and optional Shannon entropy detection for unknown high-entropy tokens.

Field	Default	Description
`enabled`	`true`	Enable/disable the filter. Defaults to `true` if `redact_patterns` are provided but field omitted.
`redact_patterns`	[]string (9 built-in)	Custom regex patterns. Replaces built-in patterns if set. Built-in patterns match: AWS keys, GitHub tokens, Google OAuth, sk- API keys, PEM private keys, AWS credential file lines, password/key assignments, Docker tokens, SendGrid keys.
`redaction_label`	`"[REDACTED]"`	Replacement string for matched secrets
`redact_output`	`false`	Enable stream output filtering (secret redaction and execution policy blocking on responses)
`redact_output_window`	`4096`	Sliding window size (in chars) for cross-chunk pattern matching in output streams
`fail_open`	`true`	When the engine (Ollama/cloud) is unreachable, skip summarization and forward the original payload. If `false`, hard-limit payloads are truncated even when the engine fails.
`entropy_enabled`	`false`	Enable Shannon entropy-based secret detection. Catches high-entropy tokens that don't match regex patterns (JWTs, opaque API keys, base64 credentials).
`entropy_threshold`	`4.5`	Shannon entropy threshold in bits/character. Tokens above this value are redacted. English text: ~3.5, hex secrets: ~4.0, base64 tokens: ~5.5, random API keys: ~4.5-5.5.
`entropy_min_token`	`20`	Minimum token length (in characters) to evaluate for entropy. Shorter tokens are skipped to reduce false positives.
`engine`	string or object	(see below)

The bouncer implements a 3-tier pipeline:

Tier 1 (pass-through): content below calculated soft limit
Tier 2 (engine summarization): content between soft and hard limit
Tier 3 (truncation + engine): content above hard limit. Uses middle-out or TF-IDF strategy.

Engine supports two forms: agent reference ("engine": "summarizer") or inline object ({"provider": "...", "model": "..."}).

Engine Configuration

Both bouncer.engine and window.engine support two forms:

Form 1: Agent Reference (string)

References a named agent by name. The agent's model list becomes the engine's fallback chain. The agent's system_prompt / system_prompt_file are used as defaults (overridable by inline fields on the EngineRef).

{
  "bouncer": {
    "engine": "summarizer"
  }
}

Form 2: Inline Object

Full engine configuration with explicit provider, model, system prompt, and optional inline fallback chain.

{
  "bouncer": {
    "engine": {
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "system_prompt": "Summarize the following text...",
      "system_prompt_file": "/path/to/system.txt",
      "models": ["qwen2.5-coder:7b", "phi-3.5-mini-instruct"]
    }
  }
}

Structured logging: Engine calls log the caller (bouncer or window), agent name (or inline), provider, model, and attempt/total for observability.

Prefix Cache

Optimizations to improve upstream provider prefix cache hit rates by stabilizing the prompt structure.

Field	Default	Description
`enabled`	`true` (auto)	Master toggle. Auto-enabled when any sub-field is explicitly set to `true`.
`pin_system_first`	`true`	Reorder all `system` role messages to the top of the messages array
`stable_tools`	`true`	Sort `tools[]` array by `function.name` for deterministic ordering
`skip_redaction_on_system`	`true`	Skip Tier-0 regex redaction on system messages to preserve prefix byte-identity

Compaction

Text compaction applied to all message content (both string and multi-part content arrays).

Field	Default	Description
`enabled`	`true` (auto)	Master toggle. Auto-enabled when any sub-field is explicitly set to `true`.
`normalize_line_endings`	`true`	Convert CRLF to LF
`trim_trailing_whitespace`	`true`	Remove trailing spaces/tabs from each line
`collapse_blank_lines`	`true`	Collapse runs of 3+ blank lines to max 2
`compaction_preset`	`""`	Compaction preset: `"aggressive"` (all features), `"balanced"` (whitespace + JSON minify), or `"minimal"` (disabled). Individual fields override preset values.
`json_minify`	`true`	Minify the final JSON body with `json.Compact`
`prune_stale_tools`	`false`	Compact old assistant+tool response pairs into summary placeholders
`tool_protection_window`	`4`	Number of most recent messages to protect from tool call pruning
`prune_thoughts`	`false`	Strip reasoning blocks from assistant messages to save context tokens

Compaction runs after redaction, before engine interception. JSON minify runs at the very end of the pipeline.

Stale Tool Call Pruning

When prune_stale_tools is enabled, the gateway scans the messages array backwards (from oldest to newest) for completed tool execution pairs: an assistant message containing tool_calls, immediately followed by one or more tool messages with the results. When such a pair is found outside the protection window, both the assistant message and its tool responses are replaced with a single summary message:

[System] Tool 'tool_name' was executed previously. Result compacted to save context window.

The tool name is extracted from the first tool call's function.name field. If unavailable, the tool_call_id is used as a fallback.

Protection window: The last tool_protection_window messages (default: 4) are never modified, preserving the LLM's immediate reasoning context including the most recent tool calls.

Safety: Orphaned tool calls (assistant with tool_calls but missing corresponding tool response, e.g., due to stream interruption) are left untouched. The pruning is skipped entirely for IDE clients.

Thought Pruning

When prune_thoughts is enabled, the gateway strips reasoning blocks from all assistant messages in the conversation history. This targets <think.../think> tags used by reasoning models (DeepSeek, OpenRouter, Groq, Gemini):

Text tag pruning: Inside the content string, the gateway looks for the <think opening tag and </think> closing tag. When found:

Both tags and everything between them are removed.
The removed block is replaced with [Reasoning pruned by gateway].
If the opening tag exists but the closing tag is missing (stream interruption), everything from <think to the end of the string is replaced.
Multiple reasoning blocks in a single message are all pruned.

The structured reasoning_content field is not stripped by thought pruning. It is preserved in the shared pipeline and stripped per-target during request sanitization — only for providers that do not support reasoning.

Uses strings.Index (not regex) for zero-allocation scanning of large payloads.

Window

Sliding window conversation compaction for long conversations. When the estimated token count exceeds max_context * trigger_ratio, older messages are summarized (or truncated) and replaced with a single system summary message.

Field	Default	Description
`enabled`	`false`	Master toggle (off by default)
`mode`	`"summarize"`	`"summarize"` (engine), `"truncate"` (hard cut), or `"tfidf"` (relevance-scored, zero network calls)
`active_messages`	`6`	Number of recent messages to preserve unchanged
`trigger_ratio`	`0.8`	Trigger when tokens exceed `max_context * ratio` (0.0-1.0)
`summary_max_runes`	`4000`	Maximum length of the generated summary
`max_context`	`128000`	Context window size. Overridden by agent model `max_context` when routing through agents.
`engine`	string or object	Agent name reference or inline engine configuration for window summarization

Response Cache

In-memory LRU cache for deterministic response caching. Responses are cached by SHA-256 fingerprint of the request payload. On cache hit, the stored SSE stream is replayed to the client with X-Nenya-Cache-Status: HIT header.

Field	Default	Description
`enabled`	`false`	Master toggle (off by default)
`max_entries`	`512`	Maximum number of cached responses (LRU eviction)
`max_entry_bytes`	`1048576` (1 MB)	Maximum size per cached response
`ttl_seconds`	`3600` (1 hour)	Time-to-live for cached entries
`evict_every_seconds`	`300` (5 minutes)	Background eviction sweep interval
`force_refresh_header`	`"x-nenya-cache-force-refresh"`	HTTP header name that bypasses cache when present

Cache key: Deterministic SHA-256 computed from model, messages, temperature, top_p, max_tokens, tools, tool_choice, response_format, stop, stream.

Bypass: Send any non-empty value for the configured force_refresh_header to force a cache miss.

Debug

{
  "debug": {
    "pprof_enabled": false
  }
}

Field	Default	Description
`pprof_enabled`	`false`	Enable Go pprof endpoints at `/debug/pprof/`. Requires auth.

Agents

{
  "agents": {
    "default": {
      "strategy": "fallback",
      "models": ["gemini-3-flash", "deepseek-chat"]
    },
    "build": {
      "strategy": "fallback",
      "cooldown_seconds": 60,
      "failure_threshold": 5,
      "models": [
        "gemini-3-flash",
        { "provider": "ollama", "model": "qwen2.5-coder:7b" }
      ]
    }
  }
}

Model entries support flexible selectors: plain strings (registry lookup), objects with provider+model, or regex patterns (provider_rgx/model_rgx) for dynamic catalog expansion. See Model Selector Syntax for the full syntax reference.

Model Shorthand

Models listed in the built-in Model Registry can be specified as plain strings. Provider and max_context are resolved automatically:

{
  "agents": {
    "build": {
      "strategy": "fallback",
      "models": ["gemini-3-flash", "deepseek-reasoner"]
    }
  }
}

Model Object Notation

For custom or local models (not in the registry), or to override registry defaults, use full objects:

{
  "agents": {
    "build": {
      "strategy": "fallback",
      "models": [
        "gemini-3-flash",
        {
          "provider": "ollama",
          "model": "qwen2.5-coder:7b",
          "max_context": 32000,
          "url": "http://localhost:11434/v1/chat/completions"
        },
        {
          "provider": "zen",
          "model": "claude-opus-4-7",
          "format": "anthropic",
          "max_context": 200000
        }
      ]
    }
  }
}

Both styles can be mixed in the same models array.

Model Selector Syntax

Model entries support flexible selectors that expand at runtime against the discovery catalog:

{
  "agents": {
    "all-deepseek": { "models": [{ "provider": "deepseek" }] },
    "all-claude-opus": { "models": [{ "model": "claude-opus" }] },
    "zen-reasoning": { "models": [{ "provider_rgx": "zen", "model_rgx": ".*-reasoner" }] }
  }
}

Selector precedence (highest to lowest): exact provider+model (1), exact provider+model_rgx (2), provider_rgx+exact model (3), exact provider (4), exact model (5), provider_rgx+model_rgx (6), exact provider_rgx (7), exact model_rgx (8). First match wins.

Providers

{
  "providers": {
    "openai": {
      "url": "https://api.openai.com/v1/chat/completions",
      "auth_style": "bearer",
      "timeout_seconds": 30,
      "ratelimit_max_rpm": 500,
      "ratelimit_max_tpm": 2000000,
      "retryable_status_codes": [429, 500, 502, 503, 504],
      "auto_retry_on_context_limit": false
    }
  }
}

API keys are loaded via provider_keys (keyed by provider name). See Secrets for details.

Field	Default	Description
`ratelimit_max_rpm`	—	Per-provider override for max requests per minute
`ratelimit_max_tpm`	—	Per-provider override for max tokens per minute
`max_retry_attempts`	—	Per-provider override for max retry attempts (takes precedence over global `governance.max_retry_attempts`)
`retryable_status_codes`	—	Provider-level override for retryable statuses (replaces global)
`format_urls`	—	Maps wire format to endpoint URL (e.g., `{"anthropic": "..."}`)
`accounts`	—	Multi-account credential pool with LRU selection
`billing`	—	Billing model, quota tracking, free model detection
`thinking`	—	Per-provider thinking/reasoning mode configuration

Provider Auth Styles

Style	Header(s)	Used By
`bearer`	`Authorization: Bearer <key>`	OpenAI, DeepSeek, Groq, Together, SambaNova, Cerebras, GitHub, z.ai, z.ai Coding Plan, Mistral, xAI, Perplexity, Cohere, DeepInfra, Moonshot, Qwen, MiniMax
`bearer+x-goog`	Both `Authorization: Bearer` + `x-goog-api-key`	Gemini
`anthropic`	`x-api-key: <key>` + `anthropic-version: 2023-06-01`	Anthropic
`azure`	`api-key: <key>`	Azure OpenAI

Multi-Account Per-Provider Keys

For high-volume providers with multiple API keys:

{
  "providers": {
    "openai": {
      "accounts": [
        { "id": "account-1", "type": "apikey", "credential": "sk-proj-xxxxx" },
        { "id": "account-2", "type": "apikey", "credential": "sk-proj-yyyyy" }
      ]
    }
  }
}

AccountPool: LRU selection with 6 error classes, exponential backoff (±5% jitter), model-level locks. State persisted in <provider>.accounts.json.

Thinking Configuration

{
  "providers": {
    "zai": {
      "thinking": {
        "enabled": true,
        "clear_thinking": false
      }
    }
  }
}

Field	Default	Description
`enabled`	`true`	Enable thinking mode for reasoning-capable models
`clear_thinking`	`false`	Strip `reasoning_content` from responses to save output tokens

Note: Per-model thinking metadata (min, max, zero_allowed, dynamic_allowed, levels) is defined in the internal ModelRegistry and is not user-configurable. Model entries can override provider defaults via thinking field.

local_engine

Configuration for local Ollama model lifecycle management:

{
  "local_engine": {
    "base_url": "http://127.0.0.1:11434",
    "timeout_seconds": 120,
    "max_sessions": 3,
    "auto_load": false,
    "startup_models": ["qwen2.5-coder:7b"]
  }
}

Field	Default	Description
`base_url`	`http://127.0.0.1:11434`	Ollama API endpoint
`timeout_seconds`	`120`	Per-operation timeout
`max_sessions`	`3`	Maximum loaded models with LRU eviction
`auto_load`	`false`	Automatically load models when referenced
`startup_models`	`[]`	Models to preload on gateway startup

Billing Configuration

Per-provider billing and quota tracking configuration for cost-aware routing.

Provider Billing Config

{
  "providers": {
    "openrouter": {
      "billing": {
        "model": "mixed",
        "period_hours": 730,
        "included_usd": 10.0,
        "quota_source": "headers",
        "quota_extraction": {
          "mode": "headers",
          "remaining_header": "X-RateLimit-Remaining",
          "limit_header": "X-RateLimit-Limit",
          "reset_header": "X-RateLimit-Reset"
        },
        "free_models": ["gpt-4o-mini-free"]
      }
    },
    "zai": {
      "billing": {
        "model": "credit",
        "quota_source": "api",
        "quota_url": "https://api.zai.com/v1/billing/quota",
        "quota_interval": "1h",
        "quota_timeout_seconds": 10,
        "quota_extraction": {
          "mode": "simple_json",
          "balance_path": "credits_remaining",
          "reset_field": "credits_reset_at",
          "reset_unit": "unix_seconds"
        }
      }
    }
  }
}

Billing Fields

Field	Type	Description
`model`	`string`	Billing model: `subscription`, `credit`, `free`, `mixed`
`period_hours`	`int`	Period length in hours (for period reset automation)
`included_usd`	`float64`	Included credit amount for computing utilization ratio
`balance_usd`	`float64`	Static balance (only used if `quota_source: none`)
`quota_source`	`string`	Quota source: `none`, `api`, `headers`
`quota_url`	`string`	URL to fetch quota (for `api` source)
`quota_interval`	`string`	Poll interval (e.g., `1h`, `30m`)
`quota_timeout_seconds`	`int`	Timeout for quota fetch (default 10s)
`quota_extraction`	`object`	Extraction config (see below)
`free_only`	`bool`	Strip paid models from target list (only for `model: free`)
`free_models`	`[]string`	Explicit list of free model IDs for scoring bonus

Quota Extraction Modes

simple_json — Extract balance from JSON response:

{
  "mode": "simple_json",
  "balance_path": "data.credits_remaining",
  "reset_field": "data.credits_reset_at",
  "reset_unit": "unix_seconds"
}

balance_path — JSON pointer to balance field
reset_field — JSON pointer to reset timestamp
reset_unit — unix_seconds or rfc3339

max_from_array — Extract max value from array:

{
  "mode": "max_from_array",
  "array_path": "data.accounts",
  "value_field": "credits_remaining",
  "value_divide_by": 100,
  "reset_field": "reset_at",
  "level_field": "tier"
}

value_divide_by — Divide extracted value by this (e.g., cents to dollars)

headers — Extract from response headers:

{
  "mode": "headers",
  "remaining_header": "X-Remaining-Credits",
  "limit_header": "X-Max-Credits",
  "reset_header": "X-Reset-Time"
}

Agent Budget Config

{
  "agents": {
    "my-agent": {
      "models": ["gemini-3-flash"],
      "budget_limit_usd": 50.0
    }
  }
}

The budget_limit_usd field enforces per-agent spend limits independent of provider-level exhaustion.

Model Discovery

Nenya dynamically fetches model catalogs from upstream providers at startup and on SIGHUP reload. This enables automatic discovery of custom models (e.g., Ollama) and reduces the need for manual registry updates.

Discovery Process

Startup/Reload — For each configured provider with an API key, fetch /v1/models in parallel (10s timeout per provider)
Provider-specific parsing — Each provider has a dedicated parser for its response format
Three-tier merge — Discovered models are merged with static registry (config overrides take precedence)
Catalog update — The merged catalog is used for all subsequent model resolution

Three-Tier Model Resolution

Priority	Source	Description
1	Config overrides	Agent model entries with explicit `provider`, `max_context`, `max_output`, or `format` fields
2	Discovered models	Models fetched from provider `/v1/models` endpoints at startup/reload
3	Static registry	Built-in ModelRegistry fallback for known models

This allows:

Custom local models (Ollama) to be discovered automatically
Provider-specific overrides without code changes
Graceful fallback when discovery fails (static registry still works)

Security Hardening

The discovery package enforces strict security boundaries:

Response body limits — 10 MB max per provider response (DoS protection)
JSON decode limits — 10 MB max with DisallowUnknownFields (malformed JSON rejection)
Content-type validation — Only application/json responses are parsed
Model ID sanitization — Max 256 chars, printable characters only (XSS prevention)
Per-provider timeouts — 10s context timeout per fetch (no hanging)
Panic recovery — Goroutines have defer/recover to prevent crashes
Auth header injection — Gemini uses x-goog-api-key header (not query params)
Shared HTTP client — Reused with proper TLS timeouts (no resource leaks)

Graceful Degradation

If discovery fails for any provider:

The provider is skipped with a warning log
Static registry models for that provider still work
Other providers' discovered models are still used

Auto-Agents

When discovery.auto_agents is enabled, Nenya automatically generates agent definitions from discovered models:

Agent	Filter	Strategy
`auto_fast`	≤32k context, ≤4k output	round-robin
`auto_reasoning`	reasoning + ≥128k context	fallback
`auto_vision`	vision capability	round-robin
`auto_tools`	tool_calls capability	round-robin
`auto_large`	≥200k context	fallback
`auto_balanced`	32k–128k context	round-robin
`auto_coding`	tool_calls + coding prefix	fallback

User-defined agents take precedence over auto-generated ones.

Debug Logging

Enable server.log_level: "debug" to see discovery details:

DEBUG discovery catalog providers=[anthropic:3 gemini:5 openrouter:42]

Hot Reload

systemctl reload nenya

Reloads config and re-discovers model catalogs. Preserves UsageTracker, Metrics, and caches. On validation failure, continues with old config.

Configuration Validation

Validate config without starting the gateway:

nenya -validate -config /etc/nenya/config.d/

The validator checks:

Required fields: agents must have at least one entry
Model references: agent model entries resolve to valid providers
Provider configs: auth_style is recognized (bearer, bearer+x-goog, anthropic, azure)
Bouncer engine: agent reference resolves to a valid agent name
Secrets file: secrets.json exits and has expected keys
Mutually exclusive options: config.d/ and config.json not both set

Migration Guide

From TOML to JSON

Configuration format changed from TOML to JSON with semantic grouping. Old interceptor, ollama, ratelimit, and filter sections are now unified under governance and bouncer with engine abstraction.

From `security_filter` to `bouncer`

The old security_filter top-level section has been renamed to bouncer:

security_filter.patterns → bouncer.redact_patterns
security_filter.replacement → bouncer.redaction_label
security_filter.fail_open → bouncer.fail_open

From `ollama.engine` to `bouncer.engine`

The engine configuration was moved from a separate ollama section into the bouncer:

{
  "bouncer": {
    "engine": {
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  }
}

Processing Pipeline Order

Step	Action	Condition
1	Response cache lookup	if enabled
2	MCP auto-search	if agent has `mcp.auto_search`
3	MCP tool injection	if agent has MCP servers
4	Prefix cache optimizations	pin system messages, sort tools
5	Agent system prompt injection	if no existing system message
6	Tier-0 regex redaction	secret patterns via `bouncer`
6b	Shannon entropy redaction	if `entropy_enabled`
7	Text compaction	normalize, trim, collapse blanks
8	Stale tool call pruning	if `prune_stale_tools` enabled
9	Thought pruning	if `prune_thoughts` enabled
10	Window compaction	if enabled and threshold exceeded
11	Interceptor chain execution	Priority-ordered Redact/Entropy/TFIDF/Bouncer interceptors
11b	Engine interception	3-tier summarization with TF-IDF fallback
12	Format-aware body conversion	if model has `format: "anthropic"`
13	JSON minification	final body compaction
14	Response cache store	if enabled
15	MCP auto-save	if agent has `mcp.auto_save` (async)

Model Reference Table

The following table lists all models in the built-in registry with context windows, output limits, and pricing:

Model	Provider	Context	Max Output	Input ($/1M)	Output ($/1M)
`glm-4.6v-flash`	zai	200,000	128,000	$0.10/M	$0.10/M
`glm-4.6v-flashx`	zai	200,000	128,000	$0.10/M	$0.10/M
`glm-4-32b-0414-128k`	zai	128,000	16,000	$0.50/M	$2.00/M
`nemotron-3-super`	nvidia_free	4,000	1,024	$0.10/M	$0.10/M
`qwen-3.6-plus`	qwen_free	8,000	8,192	$0.10/M	$0.10/M
`minimax-m2.5`	minimax_free	8,000	4,096	$0.10/M	$0.10/M
`llama-3.3-70b-versatile`	groq	131,072	8,192	$0.59/M	$0.79/M
`mixtral-8x7b-32768`	groq	32,768	8,192	$0.27/M	$0.27/M
`llama-3.1-405b-instruct`	sambanova	128,000	4,096	$0.10/M	$0.10/M
`llama-3.3-70b`	cerebras	8,192	8,192	$0.10/M	$0.10/M
`gpt-4o`	github	8,000	4,096	$2.50/M	$10.00/M
`phi-3.5-mini-instruct`	github	128,000	4,096	$0.10/M	$0.10/M
`qwen2.5-72b-turbo`	together	32,768	4,096	$0.90/M	$0.90/M
`claude-opus-4-5`	anthropic	200,000	64,000	$5.00/M	$25.00/M
`claude-opus-4-0`	anthropic	200,000	32,000	$15.00/M	$75.00/M
`claude-sonnet-4-5`	anthropic	200,000	64,000	$3.00/M	$15.00/M
`claude-sonnet-4-0`	anthropic	200,000	64,000	$3.00/M	$15.00/M
`claude-haiku-4-5`	anthropic	200,000	64,000	$1.00/M	$5.00/M
`claude-3-7-sonnet-20250219`	anthropic	128,000	8,192	$3.00/M	$15.00/M
`claude-3-5-sonnet-20241022`	anthropic	200,000	64,000	$3.00/M	$15.00/M
`claude-3-5-haiku-latest`	anthropic	200,000	8,192	$0.25/M	$1.25/M
`mistral-large-latest`	mistral	256,000	262,144	$4.00/M	$12.00/M
`mistral-small-latest`	mistral	256,000	256,000	$0.20/M	$0.60/M
`mistral-medium-latest`	mistral	256,000	16,384	$2.70/M	$8.10/M
`codestral-latest`	mistral	128,000	4,096	$0.30/M	$0.30/M
`devstral-medium-latest`	mistral	256,000	262,144	$0.20/M	$0.60/M
`magistral-medium-latest`	mistral	128,000	16,384	$2.50/M	$7.50/M
`pixtral-large-latest`	mistral	128,000	128,000	$0.20/M	$0.60/M
`grok-4`	xai	256,000	64,000	$5.00/M	$15.00/M
`grok-4-fast`	xai	2,000,000	32,000	$0.50/M	$5.00/M
`grok-3`	xai	131,072	8,192	$3.00/M	$12.00/M
`grok-3-fast`	xai	131,072	8,192	$0.50/M	$5.00/M
`grok-3-mini`	xai	131,072	8,192	$0.50/M	$5.00/M
`sonar-pro`	perplexity	200,000	8,192	$3.00/M	$15.00/M
`sonar-reasoning-pro`	perplexity	128,000	4,096	$2.00/M	$8.00/M
`sonar-deep-research`	perplexity	128,000	32,768	$2.00/M	$8.00/M
`sonar`	perplexity	128,000	4,096	$1.00/M	$1.00/M
`qwen3.5-plus`	zen	131,072	8,192	—	—
`minimax-m2.7`	zen	200,000	8,192	—	—
`minimax-m2.5-free`	zen	200,000	8,192	—	—
`kimi-k2.6`	zen	262,144	65,536	—	—
`kimi-k2.5`	zen	131,072	32,768	—	—
`big-pickle`	zen	200,000	8,192	—	—
`ling-2.6-flash-free`	zen	200,000	8,192	—	—
`hy3-preview-free`	zen	131,072	8,192	—	—
`nemotron-3-super-free`	zen	4,000	1,024	—	—
`gpt-5-nano`	zen	200,000	8,192	—	—
`claude-opus-4-1-20250805`	anthropic	200,000	32,000	$15.00/M	$75.00/M
`gemini-3.5-flash`	gemini	1,048,576	65,536	$0.075/M	$0.30/M
`gemini-2.5-pro`	gemini	1,048,576	65,536	$1.25/M	$10.00/M
`gemini-3.1-pro-preview`	gemini	1,048,576	65,536	$2.00/M	$15.00/M
`gpt-5.2`	openai	400,000	128,000	—	—
`gpt-5.3-codex`	openai	400,000	128,000	—	—
`gpt-5.3-codex-spark`	openai	128,000	128,000	—	—
`gpt-5.4`	openai	1,050,000	128,000	—	—
`gpt-5.4-mini`	openai	400,000	128,000	—	—
`gpt-5.5`	openai	272,000	128,000	—	—
`codex-auto-review`	openai	272,000	128,000	—	—
`kimi-k2-thinking`	zen	131,072	32,768	—	—
`kimi-k2`	moonshot	131,072	32,768	—	—
`claude-opus-4-7`	anthropic	1,000,000	128,000	$5.00/M	$25.00/M
`claude-opus-4-6`	anthropic	1,000,000	128,000	$5.00/M	$25.00/M
`claude-sonnet-4-6`	anthropic	200,000	64,000	$3.00/M	$15.00/M

Nenya on GitHub | Report an Issue | Apache 2.0 License

Getting Started

Home — Project overview
Quick Start — Install and run in 5 minutes
Client Setup — OpenCode, Cursor, and other clients
Deployment — Bare metal, container, Kubernetes

Core Concepts

Configuration — Config reference and examples
Providers — 24 providers, capabilities, special behaviors
Routing — Latency-aware routing and fallback chains
Architecture — Package overview and request lifecycle
MCP Integration — MCP server integration

Reference

Passthrough Proxy — Raw provider endpoint proxying
Secrets — Systemd credentials and container secrets
Model Discovery — Dynamic model catalog fetching
API Endpoints — Endpoint reference
Adapters — Provider adapter system
Billing — Billing-aware routing and quota tracking
Caching — Exact-match and semantic caching
Provider Capabilities — Service kinds matrix
Unknown MaxContext — Unknown context window behavior

Operations

Demo — Test all pipeline tiers
Troubleshooting — Common issues and solutions
FAQ — Frequently asked questions
Security — Security policy and vulnerability reporting

Project

Roadmap — Planned features
Disclaimer — Legal disclaimer

Uh oh!

Configuration

Configuration Reference

Table of Contents

Environment Variables

Top-Level Sections

Configuration File Structure

Multi-File Configuration (Directory Mode)

Example: 01-server.json

Example: 03-agents.json

Key Configuration Blocks

Server

Context

Governance

Bouncer

Engine Configuration

Form 1: Agent Reference (string)

Form 2: Inline Object

Prefix Cache

Compaction

Stale Tool Call Pruning

Thought Pruning

Window

Response Cache

Debug

Agents

Model Shorthand

Model Object Notation

Model Selector Syntax

Providers

Provider Auth Styles

Multi-Account Per-Provider Keys

Thinking Configuration

local_engine

Billing Configuration

Provider Billing Config

Billing Fields

Quota Extraction Modes

Agent Budget Config

Model Discovery

Discovery Process

Three-Tier Model Resolution

Security Hardening

Graceful Degradation

Auto-Agents

Debug Logging

Hot Reload

Configuration Validation

Migration Guide

From TOML to JSON

From security_filter to bouncer

From ollama.engine to bouncer.engine

Processing Pipeline Order

See Also

Model Reference Table

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Example: `01-server.json`

Example: `03-agents.json`

From `security_filter` to `bouncer`

From `ollama.engine` to `bouncer.engine`