Skip to content

Configuration

Rafael Gumieri edited this page Jun 15, 2026 · 9 revisions

Configuration Reference

Nenya reads its configuration from a JSON file or directory (default: /etc/nenya/). See Architecture for the request flow, Providers for provider-specific settings, and Secrets for credential configuration.

Table of Contents

Environment Variables

Variable Default Effect
PORT 8080 Listening port (overrides server.listen_addr). Validated via net.LookupPort.
HOST Optional bind address (e.g. 127.0.0.1). Only used when combined with PORT.
NENYA_CONFIG_DIR /etc/nenya/ Config root directory
NENYA_CONFIG_FILE Single JSON config file (takes precedence over NENYA_CONFIG_DIR)
NENYA_SECRETS_DIR Secrets directory for containers (see Secrets)

After flags are parsed, NENYA_CONFIG_DIR and NENYA_CONFIG_FILE override -config-dir and -config if set. If both env vars are set, NENYA_CONFIG_FILE still wins at load time (single-file mode).

Top-Level Sections

Section JSON key Description
Server server Listen address, body limits, token estimation
Context context Truncation, TF-IDF relevance scoring, context management
Governance governance Rate limiting, retries, routing policy
Bouncer bouncer PII redaction, entropy detection, engine interception
Prefix Cache prefix_cache System prompt and tool caching
Compaction compaction JSON minification, whitespace collapse, tool pruning
Window window Sliding context window with summarization
Response Cache response_cache Response caching with LRU eviction
Agents agents Model lists, strategies, circuit breakers, MCP config
Discovery discovery Dynamic model discovery, auto-agents
Providers providers Upstream API endpoints

Configuration File Structure

All /v1/* and /proxy/* routes require Authorization: Bearer <client_token> from secrets.

When a directory is specified, all *.json files (excluding secrets.json) are loaded in alphabetical order and deep-merged. Map fields (agents, providers, mcp_servers) merge per-key; struct fields use last-file-wins. Defaults are applied once after the merge.

config.json vs config.d/: Under the config root directory, if config.d/ exists and contains at least one *.json file, those files are merged and config.json in the parent directory is not read. If config.d/ exists but has no JSON files, the loader falls back to config.json at the parent level.

When a file is specified, only that file is loaded (single-file mode).

Multi-File Configuration (Directory Mode)

When -config points to a directory (the default), Nenya loads all *.json files in sorted order and deep-merges them:

/etc/nenya/
├── config.d/
│   ├── 01-server.json       # server, governance, bouncer, compaction
│   ├── 02-providers.json    # provider URL or auth overrides
│   ├── 03-agents.json       # agent definitions
│   └── 04-mcp.json          # MCP server definitions
└── secrets.json             # EXCLUDED (loaded via systemd credential)

Merge rules:

Field Type Behavior
agents (map) Per-key merge — later files add or override individual agents
providers (map) Per-key merge — later files add or override individual providers
mcp_servers (map) Per-key merge
server, governance, bouncer, etc. (struct) Last file wins — if multiple files set the same field, the last one in alphabetical order takes precedence

This lets you split configuration however makes sense for your deployment:

Example: 01-server.json

{
  "server": {
    "listen_addr": ":8080"
  },
  "bouncer": {
    "enabled": true,
    "engine": {
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  }
}

Example: 03-agents.json

{
  "agents": {
    "plan": {
      "strategy": "fallback",
      "models": ["deepseek-reasoner"]
    },
    "build": {
      "strategy": "fallback",
      "models": ["gemini-3-flash"]
    }
  }
}

Key Configuration Blocks

Server

{
  "server": {
    "listen_addr": ":8080",
    "max_body_bytes": 10485760,
    "log_level": "info",
    "secure_memory_required": true,
    "user_agent": "nenya/1.0"
  }
}
Field Default Description
listen_addr ":8080" Bind address and port
max_body_bytes 10485760 (10 MB) Maximum incoming request body size
log_level "info" Log level: "debug", "info", "warn", or "error". The -verbose flag overrides this to "debug".
secure_memory_required true Require mlock-backed secure memory for tokens. When true, gateway fails to start if mlock is unavailable. Set to false to allow heap fallback (e.g., macOS development).
user_agent "nenya/1.0" User-Agent header sent to upstream providers

Context

Unified configuration for context management and truncation.

The interceptor implements a 3-tier pipeline for the last user message content, with limits derived from the target model's max_context (characters, not tokens). If the model has no max_context, fallback defaults of 4000/24000 are used.

  • Tier 1 (pass-through): content below soft_limit runes
  • Tier 2 (engine summarization): content between soft_limit and hard_limit runes
  • Tier 3 (truncation + engine): content above hard_limit runes. Truncation uses the strategy selected by truncation_strategy:
    • "middle-out" (default): positional — keeps first/last percentages, discards middle
    • When tfidf_query_source is set: TF-IDF scoring — splits content into blocks (paragraphs + code fences), scores each block's relevance to the user's prior messages or the start of the current message, and greedily keeps the most relevant blocks within budget. First/last blocks are pinned as a safety net. If TF-IDF reduces the payload below soft_limit, the engine call is skipped entirely (zero network overhead).
Field Default Description
truncation_strategy "middle-out" Truncation method. "middle-out" (positional) or any value — TF-IDF is activated by setting tfidf_query_source instead.
truncation_keep_first_pct 15.0 Percentage of blocks to pin from the start when truncating (safety net for both middle-out and TF-IDF)
truncation_keep_last_pct 25.0 Percentage of blocks to pin from the end when truncating (safety net for both middle-out and TF-IDF)
tfidf_query_source "" (disabled) Enable TF-IDF relevance-scored truncation for Tier 3. "" = disabled (use middle-out). "prior_messages" = use previous user messages as query terms. "self" = use first 500 runes of the massive message as query terms. When enabled, if TF-IDF reduces the payload below soft_limit, the engine call is skipped entirely.
auto_context_skip false Automatically skip models that do not meet context requirements for the current request. When enabled, models with max_context smaller than the request's input token count are excluded from routing, preventing errors and improving latency.
auto_reorder_by_latency false Dynamically sort targets based on historical response times. When enabled, targets are reordered by median latency (fastest first) with ±5% jitter to prevent thundering herd.
hard_limit_tokens 0 (auto) Hard token limit — if payload exceeds this after all pipeline steps, trim by dropping oldest non-system messages and apply middle-out truncation. 0 (default) uses soft_limit × 2 (backward-compatible). Non-zero values set an absolute token budget.

Governance

Rate limiting, routing weights, and circuit breaker configuration.

Field Default Description
ratelimit_max_tpm 250000 Max tokens per minute per upstream host (0 = disabled)
ratelimit_max_rpm 15 Max requests per minute per upstream host (0 = disabled)
routing_strategy "" (latency) Routing strategy when auto_reorder_by_latency is enabled. "" or "latency" = latency-only sorting. "balanced" = weighted scoring using latency, cost, capability matching, and per-model score bonus.
routing_latency_weight 1.0 Weight for latency normalization in balanced scoring (0.0-10.0). Higher = prioritize faster models.
routing_cost_weight 0.0 Weight for cost normalization in balanced scoring (0.0-10.0). Higher = prioritize cheaper models.
max_cost_per_request 0 (disabled) Maximum allowed cost in USD per request. 0 = no limit. Logged but not yet enforced.
max_retry_attempts 3 Max retry attempts
half_open_max_requests 3 Max requests in half-open state during circuit recovery
retryable_status_codes [429, 500, 502, 503, 504] HTTP status codes that trigger fallback to the next model in an agent chain. Warning: setting this field REPLACES the built-in defaults entirely. You must include all codes you want retryable (including the standard ones). Per-provider override available via providers.<name>.retryable_status_codes (provider-level replaces global for that provider).
empty_stream_as_error true Treat upstream responses with 200 OK and zero-byte body as errors. When enabled, an SSE error payload is emitted to the client (code: empty_response), which OpenCode recognizes as a retryable error, allowing fallback to the next target. The metric nenya_empty_stream_total is incremented. Set to false to preserve backward compatibility (empty streams treated as successful responses, resulting in empty assistant messages).
auto_retry_on_context_limit false Automatically retry the request with reduced max_tokens when the upstream provider returns a context limit exceeded error. When enabled, the gateway halves the max_tokens value and retries up to max_retry_attempts times before giving up.
cost_mode "balanced" Cost optimization strategy for balanced routing: "economy" (cheapest first), "balanced" (default tradeoff), or "quality" (quality/scoring priority). Controls cost weight scaling.
billing_economy_scale 1.5 Multiplier for cost weight in "economy" mode
billing_quality_scale 0.0 Multiplier for cost weight in "quality" mode

Bouncer

Tier-0 regex-based secret redaction runs on every request, before any other pipeline step. Includes configurable engine for privacy filtering and optional Shannon entropy detection for unknown high-entropy tokens.

Field Default Description
enabled true Enable/disable the filter. Defaults to true if redact_patterns are provided but field omitted.
redact_patterns []string (9 built-in) Custom regex patterns. Replaces built-in patterns if set. Built-in patterns match: AWS keys, GitHub tokens, Google OAuth, sk- API keys, PEM private keys, AWS credential file lines, password/key assignments, Docker tokens, SendGrid keys.
redaction_label "[REDACTED]" Replacement string for matched secrets
redact_output false Enable stream output filtering (secret redaction and execution policy blocking on responses)
redact_output_window 4096 Sliding window size (in chars) for cross-chunk pattern matching in output streams
fail_open true When the engine (Ollama/cloud) is unreachable, skip summarization and forward the original payload. If false, hard-limit payloads are truncated even when the engine fails.
entropy_enabled false Enable Shannon entropy-based secret detection. Catches high-entropy tokens that don't match regex patterns (JWTs, opaque API keys, base64 credentials).
entropy_threshold 4.5 Shannon entropy threshold in bits/character. Tokens above this value are redacted. English text: ~3.5, hex secrets: ~4.0, base64 tokens: ~5.5, random API keys: ~4.5-5.5.
entropy_min_token 20 Minimum token length (in characters) to evaluate for entropy. Shorter tokens are skipped to reduce false positives.
engine string or object (see below)

The bouncer implements a 3-tier pipeline:

  • Tier 1 (pass-through): content below calculated soft limit
  • Tier 2 (engine summarization): content between soft and hard limit
  • Tier 3 (truncation + engine): content above hard limit. Uses middle-out or TF-IDF strategy.

Engine supports two forms: agent reference ("engine": "summarizer") or inline object ({"provider": "...", "model": "..."}).

Engine Configuration

Both bouncer.engine and window.engine support two forms:

Form 1: Agent Reference (string)

References a named agent by name. The agent's model list becomes the engine's fallback chain. The agent's system_prompt / system_prompt_file are used as defaults (overridable by inline fields on the EngineRef).

{
  "bouncer": {
    "engine": "summarizer"
  }
}
Form 2: Inline Object

Full engine configuration with explicit provider, model, system prompt, and optional inline fallback chain.

{
  "bouncer": {
    "engine": {
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "system_prompt": "Summarize the following text...",
      "system_prompt_file": "/path/to/system.txt",
      "models": ["qwen2.5-coder:7b", "phi-3.5-mini-instruct"]
    }
  }
}

Structured logging: Engine calls log the caller (bouncer or window), agent name (or inline), provider, model, and attempt/total for observability.

Prefix Cache

Optimizations to improve upstream provider prefix cache hit rates by stabilizing the prompt structure.

Field Default Description
enabled true (auto) Master toggle. Auto-enabled when any sub-field is explicitly set to true.
pin_system_first true Reorder all system role messages to the top of the messages array
stable_tools true Sort tools[] array by function.name for deterministic ordering
skip_redaction_on_system true Skip Tier-0 regex redaction on system messages to preserve prefix byte-identity

Compaction

Text compaction applied to all message content (both string and multi-part content arrays).

Field Default Description
enabled true (auto) Master toggle. Auto-enabled when any sub-field is explicitly set to true.
normalize_line_endings true Convert CRLF to LF
trim_trailing_whitespace true Remove trailing spaces/tabs from each line
collapse_blank_lines true Collapse runs of 3+ blank lines to max 2
compaction_preset "" Compaction preset: "aggressive" (all features), "balanced" (whitespace + JSON minify), or "minimal" (disabled). Individual fields override preset values.
json_minify true Minify the final JSON body with json.Compact
prune_stale_tools false Compact old assistant+tool response pairs into summary placeholders
tool_protection_window 4 Number of most recent messages to protect from tool call pruning
prune_thoughts false Strip reasoning blocks from assistant messages to save context tokens

Compaction runs after redaction, before engine interception. JSON minify runs at the very end of the pipeline.

Stale Tool Call Pruning

When prune_stale_tools is enabled, the gateway scans the messages array backwards (from oldest to newest) for completed tool execution pairs: an assistant message containing tool_calls, immediately followed by one or more tool messages with the results. When such a pair is found outside the protection window, both the assistant message and its tool responses are replaced with a single summary message:

[System] Tool 'tool_name' was executed previously. Result compacted to save context window.

The tool name is extracted from the first tool call's function.name field. If unavailable, the tool_call_id is used as a fallback.

Protection window: The last tool_protection_window messages (default: 4) are never modified, preserving the LLM's immediate reasoning context including the most recent tool calls.

Safety: Orphaned tool calls (assistant with tool_calls but missing corresponding tool response, e.g., due to stream interruption) are left untouched. The pruning is skipped entirely for IDE clients.

Thought Pruning

When prune_thoughts is enabled, the gateway strips reasoning blocks from all assistant messages in the conversation history. This targets <think.../think> tags used by reasoning models (DeepSeek, OpenRouter, Groq, Gemini):

Text tag pruning: Inside the content string, the gateway looks for the <think opening tag and </think> closing tag. When found:

  • Both tags and everything between them are removed.
  • The removed block is replaced with [Reasoning pruned by gateway].
  • If the opening tag exists but the closing tag is missing (stream interruption), everything from <think to the end of the string is replaced.
  • Multiple reasoning blocks in a single message are all pruned.

The structured reasoning_content field is not stripped by thought pruning. It is preserved in the shared pipeline and stripped per-target during request sanitization — only for providers that do not support reasoning.

Uses strings.Index (not regex) for zero-allocation scanning of large payloads.

Window

Sliding window conversation compaction for long conversations. When the estimated token count exceeds max_context * trigger_ratio, older messages are summarized (or truncated) and replaced with a single system summary message.

Field Default Description
enabled false Master toggle (off by default)
mode "summarize" "summarize" (engine), "truncate" (hard cut), or "tfidf" (relevance-scored, zero network calls)
active_messages 6 Number of recent messages to preserve unchanged
trigger_ratio 0.8 Trigger when tokens exceed max_context * ratio (0.0-1.0)
summary_max_runes 4000 Maximum length of the generated summary
max_context 128000 Context window size. Overridden by agent model max_context when routing through agents.
engine string or object Agent name reference or inline engine configuration for window summarization

Response Cache

In-memory LRU cache for deterministic response caching. Responses are cached by SHA-256 fingerprint of the request payload. On cache hit, the stored SSE stream is replayed to the client with X-Nenya-Cache-Status: HIT header.

Field Default Description
enabled false Master toggle (off by default)
max_entries 512 Maximum number of cached responses (LRU eviction)
max_entry_bytes 1048576 (1 MB) Maximum size per cached response
ttl_seconds 3600 (1 hour) Time-to-live for cached entries
evict_every_seconds 300 (5 minutes) Background eviction sweep interval
force_refresh_header "x-nenya-cache-force-refresh" HTTP header name that bypasses cache when present

Cache key: Deterministic SHA-256 computed from model, messages, temperature, top_p, max_tokens, tools, tool_choice, response_format, stop, stream.

Bypass: Send any non-empty value for the configured force_refresh_header to force a cache miss.

Debug

{
  "debug": {
    "pprof_enabled": false
  }
}
Field Default Description
pprof_enabled false Enable Go pprof endpoints at /debug/pprof/. Requires auth.

Agents

{
  "agents": {
    "default": {
      "strategy": "fallback",
      "models": ["gemini-3-flash", "deepseek-chat"]
    },
    "build": {
      "strategy": "fallback",
      "cooldown_seconds": 60,
      "failure_threshold": 5,
      "models": [
        "gemini-3-flash",
        { "provider": "ollama", "model": "qwen2.5-coder:7b" }
      ]
    }
  }
}

Model entries support flexible selectors: plain strings (registry lookup), objects with provider+model, or regex patterns (provider_rgx/model_rgx) for dynamic catalog expansion. See Model Selector Syntax for the full syntax reference.

Model Shorthand

Models listed in the built-in Model Registry can be specified as plain strings. Provider and max_context are resolved automatically:

{
  "agents": {
    "build": {
      "strategy": "fallback",
      "models": ["gemini-3-flash", "deepseek-reasoner"]
    }
  }
}

Model Object Notation

For custom or local models (not in the registry), or to override registry defaults, use full objects:

{
  "agents": {
    "build": {
      "strategy": "fallback",
      "models": [
        "gemini-3-flash",
        {
          "provider": "ollama",
          "model": "qwen2.5-coder:7b",
          "max_context": 32000,
          "url": "http://localhost:11434/v1/chat/completions"
        },
        {
          "provider": "zen",
          "model": "claude-opus-4-7",
          "format": "anthropic",
          "max_context": 200000
        }
      ]
    }
  }
}

Both styles can be mixed in the same models array.

Model Selector Syntax

Model entries support flexible selectors that expand at runtime against the discovery catalog:

{
  "agents": {
    "all-deepseek": { "models": [{ "provider": "deepseek" }] },
    "all-claude-opus": { "models": [{ "model": "claude-opus" }] },
    "zen-reasoning": { "models": [{ "provider_rgx": "zen", "model_rgx": ".*-reasoner" }] }
  }
}

Selector precedence (highest to lowest): exact provider+model (1), exact provider+model_rgx (2), provider_rgx+exact model (3), exact provider (4), exact model (5), provider_rgx+model_rgx (6), exact provider_rgx (7), exact model_rgx (8). First match wins.

Providers

{
  "providers": {
    "openai": {
      "url": "https://api.openai.com/v1/chat/completions",
      "auth_style": "bearer",
      "timeout_seconds": 30,
      "ratelimit_max_rpm": 500,
      "ratelimit_max_tpm": 2000000,
      "retryable_status_codes": [429, 500, 502, 503, 504],
      "auto_retry_on_context_limit": false
    }
  }
}

API keys are loaded via provider_keys (keyed by provider name). See Secrets for details.

Field Default Description
ratelimit_max_rpm Per-provider override for max requests per minute
ratelimit_max_tpm Per-provider override for max tokens per minute
max_retry_attempts Per-provider override for max retry attempts (takes precedence over global governance.max_retry_attempts)
retryable_status_codes Provider-level override for retryable statuses (replaces global)
format_urls Maps wire format to endpoint URL (e.g., {"anthropic": "..."})
accounts Multi-account credential pool with LRU selection
billing Billing model, quota tracking, free model detection
thinking Per-provider thinking/reasoning mode configuration

Provider Auth Styles

Style Header(s) Used By
bearer Authorization: Bearer <key> OpenAI, DeepSeek, Groq, Together, SambaNova, Cerebras, GitHub, z.ai, z.ai Coding Plan, Mistral, xAI, Perplexity, Cohere, DeepInfra, Moonshot, Qwen, MiniMax
bearer+x-goog Both Authorization: Bearer + x-goog-api-key Gemini
anthropic x-api-key: <key> + anthropic-version: 2023-06-01 Anthropic
azure api-key: <key> Azure OpenAI

Multi-Account Per-Provider Keys

For high-volume providers with multiple API keys:

{
  "providers": {
    "openai": {
      "accounts": [
        { "id": "account-1", "type": "apikey", "credential": "sk-proj-xxxxx" },
        { "id": "account-2", "type": "apikey", "credential": "sk-proj-yyyyy" }
      ]
    }
  }
}

AccountPool: LRU selection with 6 error classes, exponential backoff (±5% jitter), model-level locks. State persisted in <provider>.accounts.json.

Thinking Configuration

{
  "providers": {
    "zai": {
      "thinking": {
        "enabled": true,
        "clear_thinking": false
      }
    }
  }
}
Field Default Description
enabled true Enable thinking mode for reasoning-capable models
clear_thinking false Strip reasoning_content from responses to save output tokens

Note: Per-model thinking metadata (min, max, zero_allowed, dynamic_allowed, levels) is defined in the internal ModelRegistry and is not user-configurable. Model entries can override provider defaults via thinking field.

local_engine

Configuration for local Ollama model lifecycle management:

{
  "local_engine": {
    "base_url": "http://127.0.0.1:11434",
    "timeout_seconds": 120,
    "max_sessions": 3,
    "auto_load": false,
    "startup_models": ["qwen2.5-coder:7b"]
  }
}
Field Default Description
base_url http://127.0.0.1:11434 Ollama API endpoint
timeout_seconds 120 Per-operation timeout
max_sessions 3 Maximum loaded models with LRU eviction
auto_load false Automatically load models when referenced
startup_models [] Models to preload on gateway startup

Billing Configuration

Per-provider billing and quota tracking configuration for cost-aware routing.

Provider Billing Config

{
  "providers": {
    "openrouter": {
      "billing": {
        "model": "mixed",
        "period_hours": 730,
        "included_usd": 10.0,
        "quota_source": "headers",
        "quota_extraction": {
          "mode": "headers",
          "remaining_header": "X-RateLimit-Remaining",
          "limit_header": "X-RateLimit-Limit",
          "reset_header": "X-RateLimit-Reset"
        },
        "free_models": ["gpt-4o-mini-free"]
      }
    },
    "zai": {
      "billing": {
        "model": "credit",
        "quota_source": "api",
        "quota_url": "https://api.zai.com/v1/billing/quota",
        "quota_interval": "1h",
        "quota_timeout_seconds": 10,
        "quota_extraction": {
          "mode": "simple_json",
          "balance_path": "credits_remaining",
          "reset_field": "credits_reset_at",
          "reset_unit": "unix_seconds"
        }
      }
    }
  }
}

Billing Fields

Field Type Description
model string Billing model: subscription, credit, free, mixed
period_hours int Period length in hours (for period reset automation)
included_usd float64 Included credit amount for computing utilization ratio
balance_usd float64 Static balance (only used if quota_source: none)
quota_source string Quota source: none, api, headers
quota_url string URL to fetch quota (for api source)
quota_interval string Poll interval (e.g., 1h, 30m)
quota_timeout_seconds int Timeout for quota fetch (default 10s)
quota_extraction object Extraction config (see below)
free_only bool Strip paid models from target list (only for model: free)
free_models []string Explicit list of free model IDs for scoring bonus

Quota Extraction Modes

simple_json — Extract balance from JSON response:

{
  "mode": "simple_json",
  "balance_path": "data.credits_remaining",
  "reset_field": "data.credits_reset_at",
  "reset_unit": "unix_seconds"
}
  • balance_path — JSON pointer to balance field
  • reset_field — JSON pointer to reset timestamp
  • reset_unitunix_seconds or rfc3339

max_from_array — Extract max value from array:

{
  "mode": "max_from_array",
  "array_path": "data.accounts",
  "value_field": "credits_remaining",
  "value_divide_by": 100,
  "reset_field": "reset_at",
  "level_field": "tier"
}
  • value_divide_by — Divide extracted value by this (e.g., cents to dollars)

headers — Extract from response headers:

{
  "mode": "headers",
  "remaining_header": "X-Remaining-Credits",
  "limit_header": "X-Max-Credits",
  "reset_header": "X-Reset-Time"
}

Agent Budget Config

{
  "agents": {
    "my-agent": {
      "models": ["gemini-3-flash"],
      "budget_limit_usd": 50.0
    }
  }
}

The budget_limit_usd field enforces per-agent spend limits independent of provider-level exhaustion.

Model Discovery

Nenya dynamically fetches model catalogs from upstream providers at startup and on SIGHUP reload. This enables automatic discovery of custom models (e.g., Ollama) and reduces the need for manual registry updates.

Discovery Process

  1. Startup/Reload — For each configured provider with an API key, fetch /v1/models in parallel (10s timeout per provider)
  2. Provider-specific parsing — Each provider has a dedicated parser for its response format
  3. Three-tier merge — Discovered models are merged with static registry (config overrides take precedence)
  4. Catalog update — The merged catalog is used for all subsequent model resolution

Three-Tier Model Resolution

Priority Source Description
1 Config overrides Agent model entries with explicit provider, max_context, max_output, or format fields
2 Discovered models Models fetched from provider /v1/models endpoints at startup/reload
3 Static registry Built-in ModelRegistry fallback for known models

This allows:

  • Custom local models (Ollama) to be discovered automatically
  • Provider-specific overrides without code changes
  • Graceful fallback when discovery fails (static registry still works)

Security Hardening

The discovery package enforces strict security boundaries:

  • Response body limits — 10 MB max per provider response (DoS protection)
  • JSON decode limits — 10 MB max with DisallowUnknownFields (malformed JSON rejection)
  • Content-type validation — Only application/json responses are parsed
  • Model ID sanitization — Max 256 chars, printable characters only (XSS prevention)
  • Per-provider timeouts — 10s context timeout per fetch (no hanging)
  • Panic recovery — Goroutines have defer/recover to prevent crashes
  • Auth header injection — Gemini uses x-goog-api-key header (not query params)
  • Shared HTTP client — Reused with proper TLS timeouts (no resource leaks)

Graceful Degradation

If discovery fails for any provider:

  • The provider is skipped with a warning log
  • Static registry models for that provider still work
  • Other providers' discovered models are still used

Auto-Agents

When discovery.auto_agents is enabled, Nenya automatically generates agent definitions from discovered models:

Agent Filter Strategy
auto_fast ≤32k context, ≤4k output round-robin
auto_reasoning reasoning + ≥128k context fallback
auto_vision vision capability round-robin
auto_tools tool_calls capability round-robin
auto_large ≥200k context fallback
auto_balanced 32k–128k context round-robin
auto_coding tool_calls + coding prefix fallback

User-defined agents take precedence over auto-generated ones.

Debug Logging

Enable server.log_level: "debug" to see discovery details:

DEBUG discovery catalog providers=[anthropic:3 gemini:5 openrouter:42]

Hot Reload

systemctl reload nenya

Reloads config and re-discovers model catalogs. Preserves UsageTracker, Metrics, and caches. On validation failure, continues with old config.

Configuration Validation

Validate config without starting the gateway:

nenya -validate -config /etc/nenya/config.d/

The validator checks:

  • Required fields: agents must have at least one entry
  • Model references: agent model entries resolve to valid providers
  • Provider configs: auth_style is recognized (bearer, bearer+x-goog, anthropic, azure)
  • Bouncer engine: agent reference resolves to a valid agent name
  • Secrets file: secrets.json exits and has expected keys
  • Mutually exclusive options: config.d/ and config.json not both set

Migration Guide

From TOML to JSON

Configuration format changed from TOML to JSON with semantic grouping. Old interceptor, ollama, ratelimit, and filter sections are now unified under governance and bouncer with engine abstraction.

From security_filter to bouncer

The old security_filter top-level section has been renamed to bouncer:

  • security_filter.patternsbouncer.redact_patterns
  • security_filter.replacementbouncer.redaction_label
  • security_filter.fail_openbouncer.fail_open

From ollama.engine to bouncer.engine

The engine configuration was moved from a separate ollama section into the bouncer:

{
  "bouncer": {
    "engine": {
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  }
}

Processing Pipeline Order

Step Action Condition
1 Response cache lookup if enabled
2 MCP auto-search if agent has mcp.auto_search
3 MCP tool injection if agent has MCP servers
4 Prefix cache optimizations pin system messages, sort tools
5 Agent system prompt injection if no existing system message
6 Tier-0 regex redaction secret patterns via bouncer
6b Shannon entropy redaction if entropy_enabled
7 Text compaction normalize, trim, collapse blanks
8 Stale tool call pruning if prune_stale_tools enabled
9 Thought pruning if prune_thoughts enabled
10 Window compaction if enabled and threshold exceeded
11 Interceptor chain execution Priority-ordered Redact/Entropy/TFIDF/Bouncer interceptors
11b Engine interception 3-tier summarization with TF-IDF fallback
12 Format-aware body conversion if model has format: "anthropic"
13 JSON minification final body compaction
14 Response cache store if enabled
15 MCP auto-save if agent has mcp.auto_save (async)

See Also

Model Reference Table

The following table lists all models in the built-in registry with context windows, output limits, and pricing:

Model Provider Context Max Output Input ($/1M) Output ($/1M)
glm-4.6v-flash zai 200,000 128,000 $0.10/M $0.10/M
glm-4.6v-flashx zai 200,000 128,000 $0.10/M $0.10/M
glm-4-32b-0414-128k zai 128,000 16,000 $0.50/M $2.00/M
nemotron-3-super nvidia_free 4,000 1,024 $0.10/M $0.10/M
qwen-3.6-plus qwen_free 8,000 8,192 $0.10/M $0.10/M
minimax-m2.5 minimax_free 8,000 4,096 $0.10/M $0.10/M
llama-3.3-70b-versatile groq 131,072 8,192 $0.59/M $0.79/M
mixtral-8x7b-32768 groq 32,768 8,192 $0.27/M $0.27/M
llama-3.1-405b-instruct sambanova 128,000 4,096 $0.10/M $0.10/M
llama-3.3-70b cerebras 8,192 8,192 $0.10/M $0.10/M
gpt-4o github 8,000 4,096 $2.50/M $10.00/M
phi-3.5-mini-instruct github 128,000 4,096 $0.10/M $0.10/M
qwen2.5-72b-turbo together 32,768 4,096 $0.90/M $0.90/M
claude-opus-4-5 anthropic 200,000 64,000 $5.00/M $25.00/M
claude-opus-4-0 anthropic 200,000 32,000 $15.00/M $75.00/M
claude-sonnet-4-5 anthropic 200,000 64,000 $3.00/M $15.00/M
claude-sonnet-4-0 anthropic 200,000 64,000 $3.00/M $15.00/M
claude-haiku-4-5 anthropic 200,000 64,000 $1.00/M $5.00/M
claude-3-7-sonnet-20250219 anthropic 128,000 8,192 $3.00/M $15.00/M
claude-3-5-sonnet-20241022 anthropic 200,000 64,000 $3.00/M $15.00/M
claude-3-5-haiku-latest anthropic 200,000 8,192 $0.25/M $1.25/M
mistral-large-latest mistral 256,000 262,144 $4.00/M $12.00/M
mistral-small-latest mistral 256,000 256,000 $0.20/M $0.60/M
mistral-medium-latest mistral 256,000 16,384 $2.70/M $8.10/M
codestral-latest mistral 128,000 4,096 $0.30/M $0.30/M
devstral-medium-latest mistral 256,000 262,144 $0.20/M $0.60/M
magistral-medium-latest mistral 128,000 16,384 $2.50/M $7.50/M
pixtral-large-latest mistral 128,000 128,000 $0.20/M $0.60/M
grok-4 xai 256,000 64,000 $5.00/M $15.00/M
grok-4-fast xai 2,000,000 32,000 $0.50/M $5.00/M
grok-3 xai 131,072 8,192 $3.00/M $12.00/M
grok-3-fast xai 131,072 8,192 $0.50/M $5.00/M
grok-3-mini xai 131,072 8,192 $0.50/M $5.00/M
sonar-pro perplexity 200,000 8,192 $3.00/M $15.00/M
sonar-reasoning-pro perplexity 128,000 4,096 $2.00/M $8.00/M
sonar-deep-research perplexity 128,000 32,768 $2.00/M $8.00/M
sonar perplexity 128,000 4,096 $1.00/M $1.00/M
qwen3.5-plus zen 131,072 8,192
minimax-m2.7 zen 200,000 8,192
minimax-m2.5-free zen 200,000 8,192
kimi-k2.6 zen 262,144 65,536
kimi-k2.5 zen 131,072 32,768
big-pickle zen 200,000 8,192
ling-2.6-flash-free zen 200,000 8,192
hy3-preview-free zen 131,072 8,192
nemotron-3-super-free zen 4,000 1,024
gpt-5-nano zen 200,000 8,192
claude-opus-4-1-20250805 anthropic 200,000 32,000 $15.00/M $75.00/M
gemini-3.5-flash gemini 1,048,576 65,536 $0.075/M $0.30/M
gemini-2.5-pro gemini 1,048,576 65,536 $1.25/M $10.00/M
gemini-3.1-pro-preview gemini 1,048,576 65,536 $2.00/M $15.00/M
gpt-5.2 openai 400,000 128,000
gpt-5.3-codex openai 400,000 128,000
gpt-5.3-codex-spark openai 128,000 128,000
gpt-5.4 openai 1,050,000 128,000
gpt-5.4-mini openai 400,000 128,000
gpt-5.5 openai 272,000 128,000
codex-auto-review openai 272,000 128,000
kimi-k2-thinking zen 131,072 32,768
kimi-k2 moonshot 131,072 32,768
claude-opus-4-7 anthropic 1,000,000 128,000 $5.00/M $25.00/M
claude-opus-4-6 anthropic 1,000,000 128,000 $5.00/M $25.00/M
claude-sonnet-4-6 anthropic 200,000 64,000 $3.00/M $15.00/M

Getting Started

Core Concepts

Reference

Operations

  • Demo — Test all pipeline tiers
  • Troubleshooting — Common issues and solutions
  • FAQ — Frequently asked questions
  • Security — Security policy and vulnerability reporting

Project

Clone this wiki locally