Adaptive inference control plane for AI apps and agents.
Agents no longer make one model call. They run loops, resend huge context, retry tools, call overpowered models, and burn compute invisibly. Every serious agent runtime needs a compute control plane:
- cache before recompute
- route before escalate
- verify before continue
- budget before inference
- evidence after execution
Start the local gateway:
TOKENOPS_PORT=8787 pnpm --filter @nerve/server start
Run against a real Groq LLM:
GROQ_API_KEY=... \
TOKENOPS_PROVIDER=groq \
GROQ_MODEL=llama-3.3-70b-versatile \
TOKENOPS_PORT=8787 \
pnpm --filter @nerve/server start
Point OpenAI-compatible clients at:
OPENAI_BASE_URL=http://localhost:8787/v1
Smoke test:
curl -sS -X POST http://127.0.0.1:8787/v1/chat/completions \
-H 'content-type: application/json' \
-d '{"model":"llama-3.3-70b-versatile","messages":[{"role":"user","content":"docs quickstart"}]}'
curl -sS -X POST http://127.0.0.1:8787/v1/responses \
-H 'content-type: application/json' \
-d '{"model":"llama-3.3-70b-versatile","input":"docs quickstart"}'
curl -sS -X POST http://127.0.0.1:8787/v1/embeddings \
-H 'content-type: application/json' \
-d '{"model":"text-embedding-3-small","input":["TokenOps cache policy","Adaptive inference control plane"],"dimensions":32}'
Implemented endpoints:
- POST /plan
- POST /v1/tokenops/plan
- POST /v1/chat/completions
- POST /v1/responses
- POST /v1/embeddings
- GET /v1/models
- GET /health
- GET /ready
- GET /stats
- GET /traces
- GET /traces/:id
- POST /replay
- GET /benchmark/results
- GET /cache/stats
- POST /cache/clear
- GET /budget/status
- POST /policy/simulate
- GET /rate-limit/status
- GET /runtime/stats
- GET /analyze
- GET /routing/policy
- GET /routing/slo
- GET /routing/slo-impact
- GET /routing/model-impact
- GET /providers/health
POST /replay runs local benchmark datasets through the baseline-vs-optimized TokenOps runner and stores results in SQLite.
POST /v1/tokenops/plan runs the request normalizer, profiler, policy firewall,
cache checks, router, and AIS compute planner without calling a provider or
writing a trace. Use it as a preflight control-plane decision before burning
inference compute.
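A minimal sketch of calling the preflight endpoint from TypeScript. The helper name `preflightPlan`, the `FetchLike` type, and the chat-completions-style request body are assumptions for illustration; the source does not document the plan request or response schema, so the result is returned as untyped JSON.

```typescript
// Hypothetical helper: the request body shape and the response contents are
// assumptions, not the documented contract of POST /v1/tokenops/plan.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
}

async function preflightPlan(
  baseUrl: string,
  request: ChatRequest,
  fetchImpl: FetchLike,
): Promise<unknown> {
  // Strip one trailing slash so "http://host" and "http://host/" both work.
  const url = `${baseUrl.replace(/\/$/, "")}/v1/tokenops/plan`;
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(request),
  });
  if (!res.ok) throw new Error(`plan request failed with HTTP ${res.status}`);
  return res.json();
}
```

Against a running gateway, the global `fetch` from Node 18+ can be passed as `fetchImpl`. Injecting it keeps the helper testable without a network.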
App / Agent / Coding Tool
-> OpenAI-Compatible Gateway
-> Request Normalizer
-> Workload Profiler
-> Policy + Budget Firewall
-> Cache / Reuse Layer
-> AIS Compute Planner
-> Model / Provider Router
-> Verifier / Eval Gate
-> Trace + Cost Ledger
-> Replay Benchmark
TokenOps packages live under packages/core, packages/gateway, packages/cache, packages/profiler, packages/router, packages/policy, packages/ais, packages/providers, packages/ledger, packages/verifier, and packages/benchmark.
The repo still ships the legacy nerve CLI. TokenOps commands are available through the same entrypoint today:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts replay --all
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts replay benchmark/datasets/docs-qa.jsonl --persist
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts export ./tokenops-snapshot.json
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts import ./tokenops-snapshot.json
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts reconcile ./provider-usage.jsonl
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts traces export --format otel --out ./tokenops-spans.jsonl
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts prune --keep-traces 1000 --keep-benchmarks 100 --keep-idempotency 1000
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts load --requests 40 --concurrency 10 --duplicate-ratio 0.5
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts load-shedding
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts batch --requests 32 --batch-size 8
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts failover --requests 12
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts throughput mock --requests 24 --concurrency 6
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts throughput groq --requests 4 --concurrency 2
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts proof --include-groq
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts verify readiness --json
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts doctor
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts demo
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts demo --json
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts serve
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts gateway smoke --url http://127.0.0.1:8787
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts compare groq
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts compare openai
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts verify eval
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts verify eval --dataset benchmark/verifier/basic.jsonl
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts verify routing
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts cache eval
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts cache eval --sweep --thresholds 0.2,0.3,0.5
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts routing policy
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts routing slo
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts routing slo-benchmark
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts providers health
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts providers health --slo-max-p95-ms 1000
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts providers attempts
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts providers usage --out provider-usage.jsonl
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts policy simulate --daily-budget-usd 2 --max-request-cost-usd 0.01
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts smoke ollama
Once linked/installed, apps/cli/bin/tokenops exposes:
- tokenops serve
- tokenops replay <dataset>
- tokenops replay --all
- tokenops replay <dataset> --persist
- tokenops export <file>
- tokenops import <file>
- tokenops reconcile <provider-usage.jsonl>
- tokenops traces export --format otel [--out file]
- tokenops prune --keep-traces <n> --keep-benchmarks <n> --keep-idempotency <n>
- tokenops load
- tokenops load-shedding
- tokenops batch
- tokenops failover
- tokenops throughput [mock|ollama|groq]
- tokenops proof
- tokenops doctor
- tokenops verify readiness [--json]
- tokenops gateway smoke [--url http://127.0.0.1:8787] [--admission] [--require-provider groq]
- tokenops stats
- tokenops trace <id>
- tokenops cache stats
- tokenops cache clear
- tokenops cache eval [dataset] [--sweep --thresholds 0.2,0.3,0.5]
- tokenops analyze
- tokenops analyze --trace <id>
- tokenops budget status
- tokenops policy simulate --daily-budget-usd <usd> --max-request-cost-usd <usd>
- tokenops routing policy
- tokenops routing slo
- tokenops routing slo-benchmark
- tokenops routing slo-impact --candidates groq,mock --max-p95-ms 1000
- tokenops routing model-impact [--target-model gpt-5-mini]
- tokenops routing arbitrage --provider <name> --candidates groq,openai,mock
- tokenops providers health
- tokenops providers attempts
- tokenops providers usage [--trace <id>] [--out provider-usage.jsonl]
- tokenops verify eval
- tokenops verify eval --dataset <jsonl>
- tokenops verify routing [dataset]
- tokenops compare groq
- tokenops compare openai
- tokenops smoke ollama
- tokenops demo [--json]
Dataset: long-prefix.jsonl
Baseline: requests=2 model_calls=2 estimated_cost=$0.019
Optimized: model_calls=2 estimated_cost=$0.00409 cost_reduction=78.5%
Cache: exact=0.0% semantic=0.0% tool=0.0% context=50.0%
Routing: downgraded=100.0% verifier_escalation=0.0%
Tokens: input_saved=0 output_saved=994 prefix_eligible=708
Latency: p50=1ms p95=1ms
Wrong-cache incidents: 0
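The cost_reduction figure in the report above can be reproduced from the two estimated costs. This is a small worked check, not the runner's own code; the function name `costReductionPct` is hypothetical.

```typescript
// Relative cost reduction between a baseline run and an optimized run.
function costReductionPct(baselineUsd: number, optimizedUsd: number): number {
  if (baselineUsd <= 0) throw new Error("baseline cost must be positive");
  return ((baselineUsd - optimizedUsd) / baselineUsd) * 100;
}

// Figures from the long-prefix.jsonl report: $0.019 baseline, $0.00409 optimized.
const reduction = costReductionPct(0.019, 0.00409);
console.log(`${reduction.toFixed(1)}%`); // prints "78.5%"
```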
Run:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts replay --all
Runtime load benchmark:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts load \
--requests 40 \
--concurrency 10 \
--duplicate-ratio 0.5 \
--provider-latency-ms 25
This stresses the local inference runtime and reports provider calls avoided by in-flight request coalescing, p50/p95 latency, scheduler queue stats, and circuit state.
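In-flight request coalescing can be sketched as a map from request key to pending promise: a duplicate that arrives while the first call is still running reuses its promise instead of triggering another provider call. The class and function names below are hypothetical and this is not the TokenOps runtime's implementation.

```typescript
// Hypothetical coalescer: concurrent calls with the same key share one promise.
class InFlightCoalescer<T> {
  private inFlight = new Map<string, Promise<T>>();
  providerCalls = 0;

  run(key: string, call: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing) return existing; // piggyback on the pending call
    this.providerCalls += 1;
    // Drop the entry once the call settles so later requests run fresh.
    const pending = call().finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, pending);
    return pending;
  }
}

// Simulated provider with a fixed 25 ms latency (stands in for a real call).
const slowProvider = (prompt: string) =>
  new Promise<string>((resolve) => setTimeout(() => resolve(`answer:${prompt}`), 25));

async function demoCoalescing(): Promise<number> {
  const coalescer = new InFlightCoalescer<string>();
  // Four requests, two distinct prompts: each duplicate arrives while its twin is pending.
  const prompts = ["a", "a", "b", "b"];
  await Promise.all(prompts.map((p) => coalescer.run(p, () => slowProvider(p))));
  return coalescer.providerCalls;
}

demoCoalescing().then((calls) => console.log(`provider calls: ${calls}`)); // prints "provider calls: 2"
```

Coalescing differs from caching: nothing is stored after the call settles, so it only saves work for duplicates that overlap in time.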
Micro-batching benchmark:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts batch \
--requests 32 \
--batch-size 8 \
--batch-window-ms 5 \
--per-batch-overhead-ms 20 \
--per-item-latency-ms 2
This simulates provider-side batching economics: fixed per-call overhead plus per-item work. It reports batch count, largest batch, average batch size, baseline wall time, batched wall time, and estimated latency reduction.
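The economics of fixed per-call overhead plus per-item work can be estimated by hand. The sketch below assumes calls run strictly one after another and ignores the batch window; the actual benchmark may model timing differently, and the names here are hypothetical.

```typescript
interface BatchModel {
  requests: number;
  batchSize: number;
  perBatchOverheadMs: number;
  perItemLatencyMs: number;
}

// Assumed cost model: sequential calls, each paying a fixed overhead plus
// per-item work. Unbatched, every request pays the overhead on its own.
function batchingEstimate(m: BatchModel) {
  const batchCount = Math.ceil(m.requests / m.batchSize);
  const baselineMs = m.requests * (m.perBatchOverheadMs + m.perItemLatencyMs);
  const batchedMs = batchCount * m.perBatchOverheadMs + m.requests * m.perItemLatencyMs;
  const reductionPct = ((baselineMs - batchedMs) / baselineMs) * 100;
  return { batchCount, baselineMs, batchedMs, reductionPct };
}

// Same parameters as the CLI invocation above.
const estimate = batchingEstimate({
  requests: 32,
  batchSize: 8,
  perBatchOverheadMs: 20,
  perItemLatencyMs: 2,
});
console.log(estimate);
// batchCount 4, baselineMs 704, batchedMs 144, reductionPct about 79.5
```

Under this model the saving comes entirely from paying the 20 ms overhead 4 times instead of 32 times; the per-item work is unchanged.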
Provider failover benchmark:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts failover \
--requests 12 \
--primary-failures-before-success 12 \
--circuit-failure-threshold 2
This proves the local inference runtime opens the primary provider circuit after repeated failures, bypasses the broken provider, and recovers every request through a fallback provider. The report includes primary calls, fallback calls, circuit-rejected requests, failed responses, and runtime circuit state.
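The behaviour described above can be illustrated with a deliberately simple circuit breaker. This is a sketch under stated assumptions, not the runtime's code: the class name and counters are hypothetical, requests run sequentially, and the circuit never half-opens to probe for recovery.

```typescript
type Provider = () => Promise<string>;

// Hypothetical breaker: opens after `threshold` consecutive primary failures.
class FailoverRuntime {
  primaryCalls = 0;
  fallbackCalls = 0;
  circuitRejected = 0;
  private consecutiveFailures = 0;
  private primary: Provider;
  private fallback: Provider;
  private threshold: number;

  constructor(primary: Provider, fallback: Provider, threshold: number) {
    this.primary = primary;
    this.fallback = fallback;
    this.threshold = threshold;
  }

  get circuitOpen(): boolean {
    return this.consecutiveFailures >= this.threshold;
  }

  async request(): Promise<string> {
    if (this.circuitOpen) {
      this.circuitRejected += 1; // skip the broken primary entirely
    } else {
      this.primaryCalls += 1;
      try {
        const out = await this.primary();
        this.consecutiveFailures = 0;
        return out;
      } catch {
        this.consecutiveFailures += 1;
      }
    }
    // Reached when the primary failed or the circuit is open.
    this.fallbackCalls += 1;
    return this.fallback();
  }
}

async function demoFailover() {
  const runtime = new FailoverRuntime(
    async () => { throw new Error("primary down"); },
    async () => "fallback ok",
    2,
  );
  const results: string[] = [];
  for (let i = 0; i < 12; i++) results.push(await runtime.request());
  return { runtime, results };
}
```

With 12 sequential requests, a primary that always fails, and a threshold of 2, this sketch makes 2 primary calls, rejects 10 requests at the circuit, and serves all 12 from the fallback.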
Provider throughput benchmark:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts throughput mock \
--requests 24 \
--concurrency 6 \
--provider-latency-ms 5
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts throughput ollama \
--requests 8 \
--concurrency 2 \
--model llama3.2
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts throughput groq \
--requests 4 \
--concurrency 2 \
--model llama-3.3-70b-versatile
This measures provider calls, total input/output tokens, output tokens/sec, total tokens/sec, p50/p95 latency, errors, and runtime scheduler/circuit stats. The Ollama and Groq paths are availability-aware: if ollama serve or GROQ_API_KEY is unavailable, the command returns a skipped report instead of failing the benchmark run.
Product-readiness proof report:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts proof --include-groq
This writes:
- docs/experiments/tokenops-product-readiness-report.json
- docs/experiments/tokenops-product-readiness.md
The report combines replay savings, runtime coalescing, micro-batching, mock throughput, optional live Groq throughput, pass/fail gates, and known gaps.
Product demo summary:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts demo
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts demo --json
The TokenOps demo summarizes replay savings, exact/semantic cache safety, provider SLO routing, provider arbitrage, verifier escalation, policy blocks, AIS foreground/background planning, runtime priority scheduling, load shedding, and could-have-been-cheaper analyzer output.
HTTP gateway smoke, against a running gateway. This checks /ready, /v1/models, OpenAI-compatible chat shape, exact cache reuse, and runtime stats:
TOKENOPS_PORT=8787 pnpm --filter @nerve/server start
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts gateway smoke --url http://127.0.0.1:8787
For HTTP-level admission-control proof, start with a constrained local runtime and include --admission:
TOKENOPS_PROVIDER=mock \
TOKENOPS_MOCK_DELAY_MS=40 \
TOKENOPS_MAX_CONCURRENT_INFERENCE=1 \
TOKENOPS_MAX_INFERENCE_QUEUE=1 \
pnpm --filter @nerve/server start
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts gateway smoke \
--url http://127.0.0.1:8787 \
--admission
For live-provider proof, start the gateway with Groq and require the first uncached call to use Groq instead of silently passing through mock or fallback:
GROQ_API_KEY=... TOKENOPS_PROVIDER=groq pnpm --filter @nerve/server start
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts gateway smoke \
--url http://127.0.0.1:8787 \
--require-provider groq
Safety/eval gates:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts cache eval
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts cache eval --sweep --thresholds 0.2,0.3,0.5
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts verify routing
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts verify eval --dataset benchmark/verifier/basic.jsonl
cache eval runs adversarial semantic-cache cases and reports false unsafe hits and safe misses. The sweep mode compares thresholds and recommends the safest threshold that preserves safe reuse on the local corpus. verify routing runs cheap-then-verify regression cases and reports expected escalations, false escalations, and missed escalations.
Provider SLO routing benchmark:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts routing slo-benchmark
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts routing slo-impact --candidates groq,mock --max-p95-ms 1000
The benchmark generates synthetic provider traces where Groq violates the local SLO window and mock remains eligible, then proves the router reroutes from groq to mock. The impact command analyzes local traces and reports unhealthy providers, eligible fallbacks, impacted requests, and impacted optimized cost.
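The rerouting decision can be sketched as a p95 check over a window of recent latencies. The function names, the nearest-rank p95 definition, and the latency values below are assumptions for illustration; the real policy also weighs error rate and average cost, which this sketch omits.

```typescript
// Nearest-rank p95 over a window of latencies (one of several p95 definitions).
function p95(latenciesMs: number[]): number {
  if (latenciesMs.length === 0) return 0;
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length);
  return sorted[rank - 1];
}

// Pick the first candidate, in preference order, whose p95 is within the SLO.
// A provider with no traces is treated as eligible (p95 of 0) in this sketch.
function pickProvider(
  candidates: string[],
  traces: Record<string, number[]>,
  maxP95Ms: number,
): string | undefined {
  return candidates.find((provider) => p95(traces[provider] ?? []) <= maxP95Ms);
}

// Synthetic latency windows: groq violates a 1000 ms p95 SLO, mock does not.
const latencyWindow = {
  groq: [300, 400, 1800, 2200, 2500],
  mock: [5, 6, 7, 8, 9],
};
console.log(pickProvider(["groq", "mock"], latencyWindow, 1000)); // prints "mock"
```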
Local setup doctor:
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts doctor
This reports git/remote status, .env ignore safety, provider configuration presence, proof readiness, and whether the gateway port is already occupied. It never prints provider secret values.
Before pushing or tagging, run the checklist in docs/release-checklist.md.
LiteLLM abstracts providers.
TokenOps controls compute spend. It decides whether the call should happen at all, whether cache/reuse is safe, whether a cheaper model is enough, whether a stronger verifier is required, and what evidence should be written afterward.
Langfuse and LangSmith observe traces.
TokenOps acts before compute is spent. It can serve cache, downgrade, block, verify, and then log the decision for replay.
Production-like:
- OpenAI-compatible chat completions endpoint with streaming chunk support
- minimal OpenAI-compatible non-streaming Responses endpoint
- deterministic normalizer and stable request hash
- SQLite-backed exact cache with user/agent isolation
- SQLite-backed semantic cache with safety classifier
- prefix cache simulator
- SQLite-backed tool-result and context-block cache primitives
- workload profiler
- model router
- budget firewall
- AIS compute plan
- request trace and cost ledger
- SQLite-backed provider attempt ledger for fallback/retry debugging
- SQLite-backed idempotency records for retry-safe chat completions, including SSE replay
- snapshot import/export for TokenOps traces and benchmark results
- provider usage JSONL reconciliation against the trace ledger
- OpenTelemetry-style JSONL trace export
- retention pruning for local TokenOps traces, benchmark results, and idempotency records
- benchmark datasets and replay runner
- mock provider
- Groq provider over OpenAI-compatible HTTP
- OpenAI provider over OpenAI-compatible HTTP
- deterministic local /v1/embeddings compatibility for RAG and semantic-cache experiments
- Ollama local provider over /api/chat
- provider fallback chains such as TOKENOPS_PROVIDER=groq,ollama,mock
- trace-derived adaptive routing policy endpoint and optional runtime application
- verifier pass-rate gated adaptive routing
- provider health scoring from traces
- trace-derived provider arbitrage across configured provider candidates
- verifier eval harness with confusion metrics
- cheap-then-verify routing benchmark with false/missed escalation counts
- semantic-cache safety eval with adversarial cache-reuse fixtures
- semantic-cache threshold sweep for local safety/recall tradeoff checks
- env-configurable max request cost, daily budget, daily user/agent quota, per-minute rate limit, and agent loop limiter
- budget policy shadow mode for would-block rollout analysis
- provider failure traces with redaction for common API key and authorization formats
- inference runtime with concurrency admission, priority-aware bounded queueing, provider circuit breaker, provider timeout aborts, and in-flight request coalescing
- local background task queue for AIS verification, context, and trace-maintenance work
- rolling SLO routing policy from local traces, with error-rate, p95 latency, and average cost thresholds
- optional provider arbitrage with TOKENOPS_PROVIDER_ARBITRAGE=1 and TOKENOPS_PROVIDER_CANDIDATES=groq,openai,mock
- SLO rerouting benchmark that proves unhealthy providers are avoided when an eligible fallback exists
- first-class providerLatencyMs in TokenOps request traces for SLO routing and provider health
- tests and typecheck
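The "deterministic normalizer and stable request hash" item in the list above is what makes an exact cache possible: two requests that differ only in key order should hash identically. A minimal sketch, assuming canonical JSON with sorted keys and SHA-256; the function names are hypothetical and the gateway's real normalizer may apply further rules.

```typescript
import { createHash } from "node:crypto";

// Recursively sort object keys so logically equal requests serialize identically.
// Array order is preserved because message order is meaningful.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    for (const key of Object.keys(obj).sort()) sorted[key] = canonicalize(obj[key]);
    return sorted;
  }
  return value;
}

// Hypothetical stable hash: SHA-256 over the canonical JSON form.
function requestHash(request: unknown): string {
  return createHash("sha256").update(JSON.stringify(canonicalize(request))).digest("hex");
}

const a = requestHash({ model: "m", messages: [{ role: "user", content: "hi" }] });
const b = requestHash({ messages: [{ content: "hi", role: "user" }], model: "m" });
console.log(a === b); // prints "true"
```

One way to get user/agent isolation with such a key would be to include the user and agent identifiers in the hashed payload; whether TokenOps does it that way is not stated in the source.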
Prototype/mock:
- Anthropic/Gemini/vLLM adapters are scaffolded
- semantic similarity is lexical
- embeddings are deterministic local hashed vectors, not provider-grade embedding models
- request traces and benchmark results are SQLite-backed
- pricing is configurable estimate data, not billing truth
- verifier gate supports heuristic and model-backed judging, but production quality still depends on eval coverage
- benchmark datasets are small deterministic fixtures
- rate limiting is local in-memory per gateway process
- circuit breaker and coalescing are local per gateway process; distributed coordination is future work
- idempotency is implemented for non-streaming chat completions and generated SSE stream replay; token-level upstream streaming remains future work
Roadmap:
- token-level streaming support with real provider deltas
- fuller Anthropic/Gemini/vLLM adapters
- provider-backed or production embedding models for semantic cache
- redaction policy hooks
- larger eval-backed cache safety corpus
- direct Langfuse export adapter
- signed receipts via FIDES-style evidence chain
The original nerve compile/learn/replay loop remains below.
The inference compiler agents call before they call a model.
Agents should not hardcode model = "claude-opus-4-7". They should HTTP POST /v1/compile-task with {task, budget, context} and get back a ComputePlan — which model, what context, which verifier, when to fall back, what budget to respect, and which prior lessons apply — that gets better every week because traces and corrections feed back into the compiler.
Like DSPy, but online, agent-facing, and provider-agnostic.
Every other LLM tool is human-facing:
| Tool | Category | What it does |
|---|---|---|
| OpenRouter, Helicone, LiteLLM, Portkey, Vercel AI Gateway | Gateways | Route a call between providers. |
| Langfuse, LangSmith, Braintrust, Arize Phoenix | Observability / Eval | Trace + experiment workbench for humans. |
| DSPy | Compiler (offline) | Compile prompts via metric optimization — Python lib for researchers. |
| LangGraph, OpenAI Agents SDK, Mastra | Frameworks | Compose agents into graphs. |
| Mem0, vector DBs | Memory | Store and retrieve memories. |
| E2B, Daytona, Modal | Sandboxes | Execute code safely. |
Nothing sits before the model call, at runtime, exposing planning and learning as agent-facing verbs. That is the seam.
nerve sits there. It does not terminate provider keys, does not mark up tokens, does not ship a dashboard. It tells the agent what to call — the agent (or LiteLLM/OpenRouter/Vercel) does the calling.
task ─► /compile-task ─► ComputePlan (model + context + teachings + verifier + budget + fallback)
exec ─► /record-trace ─► Trace (events, cost, latency, outcome)
trace ─► mine ─► FailureCluster
cluster─► /learn ─► TeachingObject + PatchCandidate
cluster─► /generate-evals ─► EvalCase
patches × evals ─► /replay ─► before/after delta ─► human approve ─► live policy
Every approved patch is reflected in the next /compile-task response. The compiler gets smarter with usage.
For TokenOps, the fastest proof is:
pnpm install
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts demo --json
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts replay --all
TOKENOPS_CLI=1 npx tsx apps/cli/src/index.ts proof
Start the OpenAI-compatible gateway when you want to connect a client:
TOKENOPS_PORT=8787 pnpm --filter @nerve/server start
Then use:
OPENAI_BASE_URL=http://127.0.0.1:8787/v1
The legacy Nerve learning-loop demo is still available:
pnpm install
pnpm demo
The demo:
- seeds ~/.nerve/nerve.db and starts the API on :7777,
- imports 50 OpenAI-style JSONL traces from examples/openai-traces/,
- mines 3 failure clusters (schema_hallucination, missing_join, wrong_agg),
- runs learn → 7 teachings, 3 patch candidates,
- runs generate-evals → 9 eval cases,
- runs replay → baseline pass_rate=0.67 → candidate pass_rate=1.0, +0.333 with zero regressions,
- approves the patches, POSTs a fresh task to /v1/compile-task, and returns a ComputePlan that visibly contains the 3 approved teachings.
Total wall-clock on a clean Mac: ~8 seconds.
baseline pass_rate=0.667 cost=$0.0019 p50=356ms
candidate pat_…ATFC3T8S8R pass_rate=1.0 (+0.333) cost=$0.0019 (Δ$0.00) regressions=0
candidate pat_…6S0HVPNR4G pass_rate=1.0 (+0.333) cost=$0.0019 (Δ$0.00) regressions=0
candidate pat_…PTD62KBV6T pass_rate=1.0 (+0.333) cost=$0.0019 (Δ$0.00) regressions=0
nerve init # create ~/.nerve, print token
nerve serve # start API on :7777
nerve import <file|glob> # ingest JSONL traces
nerve mine # cluster failures
nerve clusters # list clusters
nerve learn [--cluster <id>] # generate teachings + patch candidates
nerve evals gen [--cluster <id>] # generate eval cases
nerve replay --patch <id|all> [--sample N]
nerve patches list [--status proposed|live]
nerve patches approve <id> [<id>...]
nerve compile <file.json> # compile a TaskEnvelope JSON file
nerve db # print db path + counts
All requests/responses are JSON. Bodies are typed against the IR in packages/ir/. Every response carries X-Receipt-Id; the receipt is retrievable at GET /v1/receipts/:id.
Turn an agent's intent into a budget-aware compute plan.
curl -X POST http://127.0.0.1:7777/v1/compile-task \
-H 'content-type: application/json' \
-d '{
"task": {
"schema_version": "0.1",
"task_id": "tsk_demo_1",
"agent_id": "sql_agent",
"intent": "Generate a SQL query that joins users and orders…",
"modality": "code",
"risk_class": "medium",
"inputs": { "schema": { "users": ["id","name"], "orders": ["id","user_id"] } },
"budget_hint": { "max_usd": 0.02, "max_latency_ms": 5000 },
"context_refs": ["schema.users","schema.orders"],
"created_at": "2026-05-24T00:00:00Z"
}
}'
Returns {plan, context_pack, teaching_program, receipt_id}.
Persist execution evidence. Triggers async mining.
Cluster → regression EvalCases.
Cluster → TeachingObjects + PatchCandidates (status proposed).
Run the verifier suite from a plan (or ad-hoc) against an output.
Patches × evals → before/after delta. Output: {baseline, candidates: [{patch_id, pass_rate, cost_usd, delta, regressions}], replay_id}.
- GET /v1/clusters: list failure clusters
- GET /v1/patches?status=proposed|approved|rejected|live
- POST /v1/patches/:id/approve
- GET /v1/receipts/:id
TokenOps gateway client:
import { TokenOpsClient } from "@nerve/sdk-ts";
const tokenops = new TokenOpsClient({ base_url: "http://127.0.0.1:8787" });
const completion = await tokenops.chatCompletions(
{
model: "gpt-5-mini",
messages: [{ role: "user", content: "Explain the cache policy." }],
},
{ idempotencyKey: "request-123" },
);
console.log(completion.choices[0].message.content);
console.log(await tokenops.stats());
console.log(await tokenops.cacheStats());
TokenOpsClient accepts either a gateway root URL such as http://localhost:8787
or an OpenAI-style base URL such as http://localhost:8787/v1. It also exposes
chatCompletionsStream, budgetStatus, traces, trace, and policySimulate
helpers for local control-plane workflows.
Legacy Nerve compiler client:
import { NerveClient } from "@nerve/sdk-ts";
const nerve = new NerveClient({ base_url: "http://127.0.0.1:7777" });
const { plan, teaching_program } = await nerve.compileTask(envelope);
// ...agent runs plan against any provider/gateway, captures a trace...
await nerve.recordTrace(trace);
A complete 30-line agent: examples/sdk-demo/agent.ts.
┌──────────────────────────────────────────────────────────────────────┐
│ agent (your code, your framework) │
└────────────────┬──────────────────────────────────┬──────────────────┘
│ /compile-task │ /record-trace
▼ ▼
┌────────────────────────────────────────────────────────────────────────┐
│ apps/server (Fastify, single process) │
│ ├─ planner classify→context→teachings→model→verifiers→budget→… │
│ ├─ miner cluster failures by signature │
│ ├─ learner cluster → TeachingObject + PatchCandidate │
│ ├─ verifiers schema / regex / exec / llm_judge │
│ ├─ replay patches × evals → delta │
│ └─ store better-sqlite3 (WAL) — one file at ~/.nerve/nerve.db │
└────────────────────────────────────────────────────────────────────────┘
Every component is a workspace package under packages/. The server is a thin Fastify wrapper that wires them up. The CLI is a parallel surface over the same internal functions.
nerve/
├── apps/
│ ├── server/ — Fastify HTTP service, six verbs
│ └── cli/ — `nerve` command
├── packages/
│ ├── ir/ — Zod IR (single source of truth)
│ ├── store/ — better-sqlite3 DAOs + migrations
│ ├── planner/ — compile-task: heuristic classifier, model table, teaching selector
│ ├── miner/ — failure clustering by deterministic signature
│ ├── learner/ — teachings + patch candidates per cluster
│ ├── replay/ — simulated runner with honest pass/fail grading
│ ├── verifiers/ — schema/regex/exec/llm_judge + grader API
│ ├── importers/ — native JSONL + OpenAI-style chat log → Trace
│ └── sdk-ts/ — typed client
├── examples/
│ ├── openai-traces/ — 50 seeded JSONL traces across 3 failure clusters
│ └── sdk-demo/ — agent.ts (30 lines) + task.json
├── research/ — design docs (thesis-grounding)
├── scripts/ — demo.sh, gen-fixtures.ts
└── THESIS.md — product thesis, ICP, defensibility, YC framing
This is the smallest thing that proves the loop. It deliberately does not yet do:
- No real LLM calls inside the planner. Heuristic classifier; the LLM seam is documented and ready in packages/planner/src/. Drop in a model call when you want LLM-quality classification; the IR contract is unchanged.
- No vector retrieval. ContextPack honors context_refs the agent passes in; full RAG (sqlite-vss / FAISS) is v0.2.
- No gateway proxy. nerve tells the agent what to call; it does not intercept provider traffic. Use OpenRouter / LiteLLM / Vercel AI Gateway downstream.
- No multi-tenant SaaS / billing / RBAC. One SQLite file, one process, one team. Self-host only.
- No dashboard / UI. Read API + JSON only. Pipe to Langfuse/Grafana if you want graphs.
- No auto-promotion of patches. Every patch requires patches approve.
- Replay uses a simulated model, deterministically tied to cluster signatures, so the demo runs without API keys. The grader interface (packages/verifiers/) is the same one a real-model replay would use; swap replay/src/index.ts:simulate() for a real call when you bring keys.
- TokenOps gateway SSE is simulated from complete responses. Real provider token-level delta proxying is still future work.
- No fine-tuning. nerve learns by changing policy and context, not weights.
What is honest:
- 19/19 tests pass (pnpm test).
- The demo runs from a clean install in ~8s, deterministically.
- The replay numbers reflect real Zod-validated grader output, not hardcoded values.
- Every API call writes a Receipt with inputs_hash, outputs_hash, cost, latency.
| Tool | Relationship |
|---|---|
| OpenRouter / LiteLLM / Vercel AI Gateway | Downstream of nerve. nerve emits model.primary; you route to it. We will publish a nerve-routed adapter for LiteLLM. |
| Langfuse / LangSmith | Upstream of nerve. We import their trace formats and re-emit the same span schema. Bring your existing logs. |
| Braintrust | Complementary. nerve generates EvalCases; pipe to Braintrust as a JSON dataset if you want their experiment UI. |
| DSPy | Spiritual ancestor. DSPy compiles offline in Python; nerve compiles online over HTTP, agent-facing, language-agnostic. We will wrap DSPy's MIPROv2 inside /learn in v0.2. |
| OpenAI Agents SDK / LangGraph / Mastra | Frameworks call us. Add a compileTask() step before each model call. |
| E2B / Daytona / Modal | Substrate, not competition. We will run verifier execs in E2B for v0.2. |
The product was designed with re-use in mind:
| Repo | What it gives nerve |
|---|---|
| capsule (TS) | CapsuleReceipt → our Receipt shape; CapabilityMap/SupportLevel → model capability declaration; store-sqlite pattern. |
| fides (TS) | EvidenceChain (hash-chained + Merkle + Ed25519) → signed receipts (v0.2); PolicyBundle.evaluatePolicy() → patch gating; DelegationToken → task-budget mandate. |
| switchboard (Rust) | sb-events + sb-replay + sb-memory shapes ported to TS for the trace store + deterministic replay + scoped lesson store. |
| OAPS | JSON schemas (intent.json, task.json, evidence-event.json) align with TaskEnvelope / Trace wire format. |
| OSP | cost-summary + usage-report + service-manifest → budget + provider declaration shapes. |
| agentbox / agit / sardis | Concepts only (three-bucket policy classifier, content-addressed state DAG, mandate→policy→execution→signed receipt). |
See THESIS.md and research/local-repo-map.md for the full archaeology.
v0.1 (alpha, 2026-05). Working loop, working tests, working demo. Not yet production-ready.
Roadmap: real-LLM planner classifier · vector retrieval · DSPy wrap in /learn · LiteLLM/OpenRouter adapter · Python SDK · auto-promotion gate with stricter eval thresholds · OTel/Langfuse trace export.
License: MIT.