TokenSense
Production-grade LLMOps observability — proxy ingestion, deterministic rules, AI-powered analysis, and real-time cost intelligence for engineering teams at scale.
Live Demo · Upgrade Roadmap · CLI Reference · Smart Router · Architecture
TokenSense is an open-source LLMOps intelligence platform built for engineering teams who need to understand, attribute, and optimize their AI infrastructure costs. It combines three systems that most observability tools keep separate:
| System | What it does |
|---|---|
| 🔍 Deterministic Rules Engine | Fires zero-false-positive alerts for model misuse, prompt bloat, cache waste, traffic spikes, and cloud over-provisioning |
| 🤖 AI Analyst | Consumes rule findings and returns structured root-cause analysis, action plans, and monthly savings estimates via Gemini |
| 🔀 Smart Router SDK | A drop-in LLM gateway that classifies task complexity and routes to the cheapest capable model automatically |
Current Status: Phase 1 Complete. The monorepo is structured, the API security vulnerability is resolved (Gemini calls moved server-side), real proxy ingestion is wired up, and the shared @tokensense/types and @tokensense/rules-engine packages are extracted. The dashboard degrades gracefully to mock data when the server is offline.
- Architecture
- Monorepo Structure
- Quick Start
- Running the Stack
- Environment Variables
- The Rules Engine
- Smart Router SDK
- CLI Tool
- Ask TokenSense AI
- Upgrade Roadmap
- Contributing
Raw Telemetry (LLM + Cloud)
│
▼
┌──────────────────────────────┐ ┌─────────────────────┐
│ Proxy Gateway │ │ SDK Wrap / OTEL │
│ /v1/proxy/openai │ │ @tokensense/sdk │
│ /v1/proxy/anthropic │ └──────────┬──────────┘
│ /v1/proxy/google │ │
└──────────────┬───────────────┘ │
└──────────┬──────────────────────┘
▼
┌─────────────────────────┐
│ Express API Server │ :3001
│ apps/server │
└──────────┬──────────────┘
│ async / non-blocking
▼
┌─────────────────────────┐
│ Rules Engine │ @tokensense/rules-engine
│ AI Analyst Service │ server-side Gemini
└──────────┬──────────────┘
│
▼
┌─────────────────────────┐
│ React Dashboard │ :5173 / :3000
│ apps/web │
└─────────────────────────┘
Key architectural principles:
- Proxy-first ingestion: teams change one baseURL and get full observability with zero instrumentation code
- Non-blocking write path: the proxy returns to the caller immediately; logging happens asynchronously via a queue
- Server-side AI: the Gemini API key never touches the browser bundle
- Graceful degradation: the dashboard falls back to mock data if the server is unreachable, so demos always work
- Shared packages: @tokensense/types and @tokensense/rules-engine run identically in browser and Node.js
tokensense/
├── apps/
│ ├── web/ ← React 19 + Vite 8 dashboard
│ │ ├── src/
│ │ │ ├── components/ ← All UI components
│ │ │ ├── router/ ← Smart Router SDK (15 modules)
│ │ │ ├── App.tsx
│ │ │ ├── mockData.ts ← Offline fallback, always available
│ │ │ ├── recommendations.ts ← Calls server; falls back gracefully
│ │ │ └── cli.ts ← CLI report generator
│ │ └── package.json (@tokensense/web)
│ │
│ └── server/ ← Express API server
│ ├── src/
│ │ ├── proxy/ ← LLM proxy gateway (OpenAI / Anthropic / Google)
│ │ ├── ingestion/ ← OTEL receiver + direct ingest endpoint
│ │ ├── queues/ ← Telemetry + webhook job queues
│ │ ├── routes/ ← REST API routes
│ │ ├── services/ ← AI Analyst, analytics, webhook dispatcher
│ │ └── middleware/ ← Auth, RBAC, audit
│ └── package.json (@tokensense/server)
│
├── packages/
│ ├── types/ ← @tokensense/types (shared interfaces)
│ │ └── src/index.ts
│ └── rules-engine/ ← @tokensense/rules-engine (runs everywhere)
│ └── src/
│ ├── rules.ts
│ ├── aggregator.ts
│ └── config/ruleThresholds.ts
│
├── turbo.json
├── pnpm-workspace.yaml
└── package.json (monorepo root)
| Tool | Version | Install |
|---|---|---|
| Node.js | ≥ 20.19.0 | nodejs.org |
| pnpm | ≥ 9.0 | npm i -g pnpm |
| Git | any | git-scm.com |
Optional: A Google AI Studio API key unlocks the AI Analyst. Without it, the platform runs in deterministic-only mode — all rules still fire and structured recommendations are still generated.
git clone https://github.com/your-org/tokensense.git
cd tokensense
# Install all workspace dependencies in one shot
pnpm install
# Copy the example env (required for both apps)
cp apps/web/.env.example apps/web/.env
cp apps/server/.env.example apps/server/.env
Then open apps/web/.env:
# Server URL (where the Express API runs)
VITE_API_URL=http://localhost:3001
And apps/server/.env:
# Required
GEMINI_API_KEY=your_google_ai_studio_key_here
JWT_SECRET=replace_with_64_char_random_hex
# Phase 2+ (leave blank for now — server starts fine without them)
DATABASE_URL=
REDIS_URL=
CLICKHOUSE_URL=
# Run everything in parallel (dashboard + server)
pnpm dev
# Or run individually:
pnpm --filter @tokensense/web dev # Dashboard → http://localhost:5173
pnpm --filter @tokensense/server dev # API server → http://localhost:3001
That's it. Open http://localhost:5173.
# Full stack — Turborepo fans out tasks to both apps simultaneously
pnpm dev
Turborepo handles dependency ordering automatically. The server starts first; the dashboard connects on boot.
The dashboard works without the server. When http://localhost:3001 is unreachable, recommendations.ts falls back to the deterministic offline generator. All rule evaluations, cost charts, and team metrics still render from mockData.ts.
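The fallback described here can be sketched roughly as follows. This is a minimal illustration, not the contents of recommendations.ts: the `Recommendation` shape, `loadRecommendations`, and the 2-second timeout are assumptions, and the fetcher is injected so the sketch runs without a network.

```typescript
// Hypothetical shape; the real interfaces live in @tokensense/types.
interface Recommendation {
  rule: string;
  team: string;
  suggestedFix: string;
}

type Source = "server" | "offline";
type Fetcher = () => Promise<Recommendation[]>;

// Try the server first; on any failure or timeout, use the offline generator.
async function loadRecommendations(
  fetchFromServer: Fetcher,
  offlineGenerator: () => Recommendation[],
  timeoutMs = 2000, // assumed value, not taken from the project
): Promise<{ source: Source; items: Recommendation[] }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("server timeout")), timeoutMs);
  });
  try {
    const items = await Promise.race([fetchFromServer(), timeout]);
    return { source: "server", items };
  } catch {
    // Network error, timeout, or rejected fetch: fall back to local data.
    return { source: "offline", items: offlineGenerator() };
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Returning the `source` alongside the data lets the UI show whether it is rendering live or offline results.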
pnpm --filter @tokensense/web dev
# → Full dashboard at http://localhost:5173 (no server needed)
pnpm --filter @tokensense/server dev
# → Express at http://localhost:3001
pnpm build
# Builds all packages in correct dependency order:
# packages/types → packages/rules-engine → apps/web + apps/server
pnpm lint
| Variable | Required | Description |
|---|---|---|
| VITE_API_URL | Yes | URL of the Express API server. Default: http://localhost:3001 |
⚠️ Security note: AI keys are stored locally in the browser using the BYOK config modal. VITE_-prefixed variables are never used for API keys.
| Variable | Required | Description |
|---|---|---|
| GEMINI_API_KEY | Recommended | Google AI Studio key. Without it, AI Analyst returns deterministic fallbacks only |
| JWT_SECRET | Yes (Phase 2+) | 64-byte hex secret for session tokens |
| ENCRYPTION_KEY | Yes (Phase 2+) | 32-byte hex key for API key encryption at rest |
| DATABASE_URL | Phase 2+ | PostgreSQL connection string |
| REDIS_URL | Phase 1+ | Redis connection string (queue + cache) |
| CLICKHOUSE_URL | Phase 4+ | ClickHouse HTTP endpoint |
| ENABLE_PROXY_GATEWAY | Optional | true to activate the LLM proxy endpoints |
| ENABLE_WEBHOOKS | Optional | true to activate webhook delivery |
| ENABLE_CLICKHOUSE | Optional | true to swap mock data for ClickHouse queries |
The Rules Engine is the core of TokenSense. It is deterministic, runs locally with no model call, and is designed for zero false positives: a rule fires only when its threshold is actually crossed. It runs both in the browser and on the server via the @tokensense/rules-engine package.
| Rule | Trigger | Signal |
|---|---|---|
| MODEL_MISUSE | Premium model (gpt-4o, claude-3-opus) producing avg < 100 output tokens | Task is too simple for the model tier |
| LONG_PROMPTS | Average input token count > 2,000 | Bloated context, RAG opportunity |
| CACHE_WASTE | Same prompt hash seen ≥ 50 times | Repeated identical API calls, zero caching |
| SPIKE | Daily call count > 3× rolling average | Runaway retry loop or misconfigured batch job |
| RAM_OVER_PROVISION | Cloud function RAM utilization < 40% | Over-allocated Lambda / Worker |
| MODEL_DOWNGRADE | Opus/Pro model on sub-100-token tasks | Premium model doing economy work |
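To make two of these triggers concrete, here is a minimal sketch of how MODEL_MISUSE and SPIKE could be evaluated. It is not the code in rules.ts: the `LlmLog` and `Finding` shapes and the function names are hypothetical, and "rolling average" is assumed to mean the mean of the days before the latest one.

```typescript
// Hypothetical shapes; the real interfaces live in @tokensense/types.
interface LlmLog {
  team: string;
  model: string;
  outputTokens: number;
}

interface Finding {
  rule: "MODEL_MISUSE" | "SPIKE";
  team: string;
  evidence: string;
}

const PREMIUM_MODELS = new Set(["gpt-4o", "claude-3-opus"]);

// MODEL_MISUSE: a team's premium-model calls average fewer than 100 output tokens.
function detectModelMisuse(logs: LlmLog[], maxAvgOutputTokens = 100): Finding[] {
  const byTeam = new Map<string, { calls: number; tokens: number }>();
  for (const log of logs) {
    if (!PREMIUM_MODELS.has(log.model)) continue;
    const agg = byTeam.get(log.team) ?? { calls: 0, tokens: 0 };
    agg.calls += 1;
    agg.tokens += log.outputTokens;
    byTeam.set(log.team, agg);
  }
  const findings: Finding[] = [];
  byTeam.forEach((agg, team) => {
    const avg = agg.tokens / agg.calls;
    if (avg < maxAvgOutputTokens) {
      findings.push({
        rule: "MODEL_MISUSE",
        team,
        evidence: `${agg.calls} calls @ ${avg.toFixed(0)} avg tokens`,
      });
    }
  });
  return findings;
}

// SPIKE: the latest day's call count exceeds multiplier × the mean of earlier days.
function detectSpike(team: string, dailyCalls: number[], multiplier = 3): Finding | null {
  if (dailyCalls.length < 2) return null;
  const history = dailyCalls.slice(0, -1);
  const latest = dailyCalls[dailyCalls.length - 1];
  const avg = history.reduce((sum, n) => sum + n, 0) / history.length;
  if (avg > 0 && latest > multiplier * avg) {
    return { rule: "SPIKE", team, evidence: `${latest} calls (${(latest / avg).toFixed(1)}x avg)` };
  }
  return null;
}
```

Both checks are plain threshold comparisons over aggregated counts, which is what makes the output reproducible for a given input.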
import { runAllRules } from '@tokensense/rules-engine';
const insights = runAllRules(
llmLogs, // Log[]
teamMetrics, // TeamMetrics[]
cloudMetrics, // CloudResourceMetrics[]
{
projectId: 'proj_abc', // optional: tenant scope
ruleOverrides: { SPIKE_MULTIPLIER: 5 } // optional: project-specific thresholds
}
);
// insights: Insight[], each has { rule, team, severity, evidence, suggestedFix }
The same call works identically in the browser and on the Node.js server.
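One way the ruleOverrides option could be resolved against defaults is sketched below. Only SPIKE_MULTIPLIER appears in the example above; the other threshold names, the `resolveThresholds` helper, and the validation policy are assumptions for illustration, with default values taken from the trigger table.

```typescript
// Hypothetical threshold names, except SPIKE_MULTIPLIER which the README uses.
interface RuleThresholds {
  SPIKE_MULTIPLIER: number;
  LONG_PROMPT_TOKENS: number;
  CACHE_WASTE_REPEATS: number;
  RAM_UTILIZATION_FLOOR: number;
}

const DEFAULT_THRESHOLDS: Readonly<RuleThresholds> = {
  SPIKE_MULTIPLIER: 3,
  LONG_PROMPT_TOKENS: 2000,
  CACHE_WASTE_REPEATS: 50,
  RAM_UTILIZATION_FLOOR: 0.4,
};

// Merge project-specific overrides over the defaults without mutating either.
function resolveThresholds(overrides: Partial<RuleThresholds> = {}): RuleThresholds {
  const resolved: RuleThresholds = { ...DEFAULT_THRESHOLDS };
  (Object.keys(overrides) as (keyof RuleThresholds)[]).forEach((key) => {
    const value = overrides[key];
    // Missing, NaN, or non-positive values are ignored and the default is kept.
    if (typeof value === "number" && value > 0) resolved[key] = value;
  });
  return resolved;
}

const projectThresholds = resolveThresholds({ SPIKE_MULTIPLIER: 5 });
console.log(projectThresholds.SPIKE_MULTIPLIER);
```

Keeping the defaults read-only and returning a fresh object means one tenant's overrides cannot leak into another tenant's evaluation.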
A production-quality LLM routing SDK embedded in apps/web/src/router/ (15 modules). It classifies task type and complexity, routes to the optimal model, caches identical prompts, and tracks every dollar.
import { TokenSenseRouter } from '@tokensense/web/router';
const router = new TokenSenseRouter({
providers: {
openai: { apiKey: process.env.OPENAI_API_KEY! },
anthropic: { apiKey: process.env.ANTHROPIC_API_KEY! },
google: { apiKey: process.env.GOOGLE_API_KEY! },
},
defaults: { preferCost: true },
cache: { ttlMs: 300_000 }, // 5-minute semantic cache
});
// One call — automatic task classification + optimal routing
const response = await router.route({
prompt: "Classify this support ticket as: billing, technical, or general",
// ↑ Router detects: task=classification, complexity=simple
// Routes to: gemini-2.0-flash (96% cheaper than gpt-4o)
});
console.log(router.getMetrics());
// → { totalCost: 0.0000042, cacheHitRate: 0.15, avgLatencyMs: 248 }
| Model | Provider | Tier | Input $/1M | Output $/1M |
|---|---|---|---|---|
| gpt-4o | OpenAI | Premium | $2.50 | $10.00 |
| gpt-4o-mini | OpenAI | Economy | $0.15 | $0.60 |
| gpt-3.5-turbo | OpenAI | Economy | $0.50 | $1.50 |
| claude-sonnet-4 | Anthropic | Premium | $3.00 | $15.00 |
| claude-3.5-haiku | Anthropic | Economy | $0.80 | $4.00 |
| gemini-2.5-pro | Google | Premium | $1.25 | $10.00 |
| gemini-2.0-flash | Google | Economy | $0.10 | $0.40 |
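The per-request cost arithmetic behind these prices can be sketched as below. The prices are copied from the pricing table; the `requestCostUsd` and `savingsPercent` helpers and the 1,000-in / 50-out token counts are illustrative assumptions, not the costTracker.ts implementation.

```typescript
interface ModelPrice {
  inputPerMillion: number; // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Subset of the pricing table.
const PRICES: Record<string, ModelPrice> = {
  "gpt-4o": { inputPerMillion: 2.5, outputPerMillion: 10.0 },
  "gpt-4o-mini": { inputPerMillion: 0.15, outputPerMillion: 0.6 },
  "gemini-2.0-flash": { inputPerMillion: 0.1, outputPerMillion: 0.4 },
};

// cost = (input tokens × input price + output tokens × output price) / 1,000,000
function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  return (inputTokens * price.inputPerMillion + outputTokens * price.outputPerMillion) / 1_000_000;
}

// Percentage saved by sending the same request to a cheaper model.
function savingsPercent(from: string, to: string, inputTokens: number, outputTokens: number): number {
  const before = requestCostUsd(from, inputTokens, outputTokens);
  const after = requestCostUsd(to, inputTokens, outputTokens);
  return (1 - after / before) * 100;
}

// 1,000 input + 50 output tokens: about $0.003 on gpt-4o vs about $0.00012 on gemini-2.0-flash.
console.log(savingsPercent("gpt-4o", "gemini-2.0-flash", 1000, 50).toFixed(0) + "% cheaper");
```

Because gemini-2.0-flash is 96% cheaper than gpt-4o on both input and output, the saving stays at 96% for any token mix.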
| Module | Purpose |
|---|---|
| TokenSenseRouter.ts | Main entry point: orchestrates the full request lifecycle |
| routingEngine.ts | Model selection logic: tier, strength, latency, budget scoring |
| modelRegistry.ts | Model catalog with pricing and capability metadata |
| taskClassifier.ts | NLP-based prompt → task type inference (code, reasoning, chat, etc.) |
| complexityEstimator.ts | Prompt → complexity scoring (simple / moderate / complex) |
| cache.ts | LRU + TTL semantic response cache |
| costTracker.ts | Per-request cost ledger + daily budget tracking |
| latencyTracker.ts | p95 / p99 latency tracking per model |
| rateLimiter.ts | Token bucket rate limiter per user |
| promptOptimizer.ts | Removes filler phrases, normalizes whitespace, enforces output format |
| parameterTuner.ts | Task-aware temperature / max_tokens tuning |
| responseValidator.ts | Output quality checks per task type |
| providers.ts | OpenAI / Anthropic / Google provider clients |
| promptTemplates.ts | System prompt library per task type |
| logger.ts | Structured request logging with level filtering |
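As a rough illustration of how complexity estimation and model selection could fit together, consider the sketch below. The heuristics, word-count cutoffs, and the rule that only complex prompts need the Premium tier are assumptions; the actual complexityEstimator.ts and routingEngine.ts also score strength, latency, and budget.

```typescript
type Complexity = "simple" | "moderate" | "complex";
type Tier = "Economy" | "Premium";

interface ModelEntry {
  id: string;
  tier: Tier;
  inputPerMillion: number;
  outputPerMillion: number;
}

// Subset of the model catalog, with prices from the pricing table.
const REGISTRY: ModelEntry[] = [
  { id: "gpt-4o", tier: "Premium", inputPerMillion: 2.5, outputPerMillion: 10.0 },
  { id: "gpt-4o-mini", tier: "Economy", inputPerMillion: 0.15, outputPerMillion: 0.6 },
  { id: "claude-sonnet-4", tier: "Premium", inputPerMillion: 3.0, outputPerMillion: 15.0 },
  { id: "gemini-2.0-flash", tier: "Economy", inputPerMillion: 0.1, outputPerMillion: 0.4 },
];

// Hypothetical heuristic: keyword hints plus prompt length in words.
function estimateComplexity(prompt: string): Complexity {
  const words = prompt.trim().split(/\s+/).length;
  const looksAnalytical = /\b(prove|derive|refactor|architecture|step[- ]by[- ]step)\b/i.test(prompt);
  if (looksAnalytical || words > 400) return "complex";
  if (words > 80) return "moderate";
  return "simple";
}

// Pick the cheapest model in the tier the prompt is assumed to need.
function pickModel(prompt: string, registry: ModelEntry[] = REGISTRY): ModelEntry {
  const requiredTier: Tier = estimateComplexity(prompt) === "complex" ? "Premium" : "Economy";
  const candidates = registry.filter((m) => m.tier === requiredTier);
  if (candidates.length === 0) throw new Error(`No ${requiredTier} model registered`);
  // Blended price = input + output price per 1M tokens.
  return candidates.reduce((best, m) =>
    m.inputPerMillion + m.outputPerMillion < best.inputPerMillion + best.outputPerMillion ? m : best,
  );
}

console.log(pickModel("Classify this support ticket as: billing, technical, or general").id);
```

With this registry subset, the support-ticket prompt resolves to gemini-2.0-flash, matching the routing shown in the usage example earlier in this section.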
A headless terminal report — perfect for CI/CD pipelines, Slack digests, or engineering standups.
# From the web app package
pnpm --filter @tokensense/web cli
# Or directly
cd apps/web && npx tsx src/cli.ts
Example output:
╔══════════════════════════════════════════════════════════════╗
║ TOKENSENSE — AI COST OPTIMIZATION REPORT ║
║ Report Date: 2025-10-30 ║
╚══════════════════════════════════════════════════════════════╝
EXECUTIVE SUMMARY
────────────────────────────────────────────────────────────────
● Analyzed 79,290 LLM calls and 180 cloud resource entries
● Total LLM Spend: $487.32
● Total Cloud Spend: $241.15
● 6 issues detected
● Potential savings: $18,200.00/month ($218,400.00/year)
ISSUES DETECTED
────────────────────────────────────────────────────────────────
[HIGH] MODEL_MISUSE Payments 12,390 calls @ 42 avg tokens
[HIGH] MODEL_DOWNGRADE Analytics 150 calls/day, claude-3-opus
[MEDIUM] LONG_PROMPTS Marketing 3,200 avg input tokens
[MEDIUM] CACHE_WASTE Engineering 200 repeat hashes/day
[LOW] SPIKE Data 328 calls (3.1x avg) on Oct 19
[LOW] RAM_OVER_PROVISION Data 512MB alloc → 80MB used
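For CI/CD use, the issue lines in a report like this could be parsed and turned into a pass/fail gate. The sketch below is an assumption about how that might be done, not a feature of cli.ts: the line format is inferred from the sample output, team names are assumed to be single words, and `parseIssues` and `shouldFailBuild` are hypothetical helpers.

```typescript
type Severity = "HIGH" | "MEDIUM" | "LOW";

interface ReportIssue {
  severity: Severity;
  rule: string;
  team: string;
  detail: string;
}

// Matches lines such as: [HIGH] MODEL_MISUSE Payments 12,390 calls @ 42 avg tokens
const ISSUE_LINE = /^\[(HIGH|MEDIUM|LOW)\]\s+([A-Z_]+)\s+(\S+)\s+(.*)$/;

function parseIssues(report: string): ReportIssue[] {
  const issues: ReportIssue[] = [];
  for (const line of report.split("\n")) {
    const match = ISSUE_LINE.exec(line.trim());
    if (match) {
      issues.push({
        severity: match[1] as Severity,
        rule: match[2],
        team: match[3],
        detail: match[4],
      });
    }
  }
  return issues;
}

// True when any issue is at or above the chosen severity.
function shouldFailBuild(issues: ReportIssue[], failOn: Severity = "HIGH"): boolean {
  const rank: Record<Severity, number> = { LOW: 0, MEDIUM: 1, HIGH: 2 };
  return issues.some((issue) => rank[issue.severity] >= rank[failOn]);
}
```

A pipeline step could pipe the CLI output into `parseIssues` and exit non-zero when `shouldFailBuild` returns true.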
The floating chat interface in the bottom-right corner gives you a context-aware AI analyst. It has your full 30-day telemetry baked into the system prompt, so you can ask it anything about your cost picture:
- "Which team should I fix first for maximum ROI?"
- "Write me a Redis caching wrapper for the Engineering cache waste"
- "What's my projected yearly spend if nothing changes?"
- "Explain the Payments MODEL_MISUSE issue for my CTO"
The chat widget uses the Bring Your Own Key (BYOK) architecture. You configure your own OpenAI or compatible API key directly in the UI, and the browser makes direct calls to the provider. Keys never touch the server.
Teams can send real LLM telemetry to TokenSense by changing one line:
// Before
const client = new OpenAI({ baseURL: "https://api.openai.com/v1" });
// After — full observability, zero other changes
const client = new OpenAI({
baseURL: "http://localhost:3001/v1/proxy/openai",
defaultHeaders: { "TS-API-Key": "ts_live_xxxx" }
});
Supported proxy targets:
| Endpoint | Forwards to |
|---|---|
| POST /v1/proxy/openai/* | api.openai.com |
| POST /v1/proxy/anthropic/* | api.anthropic.com |
| POST /v1/proxy/google/* | generativelanguage.googleapis.com |
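The path-to-upstream mapping in this table might be implemented along these lines. This is a sketch under assumptions: the README names only the upstream hosts, so the base paths (such as `/v1` for OpenAI) and the `resolveUpstreamUrl` helper are hypothetical.

```typescript
// Assumed upstream base URLs; only the hosts come from the table.
const UPSTREAMS = new Map<string, string>([
  ["openai", "https://api.openai.com/v1"],
  ["anthropic", "https://api.anthropic.com"],
  ["google", "https://generativelanguage.googleapis.com"],
]);

const PROXY_PREFIX = "/v1/proxy/";

// Map "/v1/proxy/<provider>/<rest>" to "<upstream base>/<rest>", or null if unsupported.
function resolveUpstreamUrl(requestPath: string): string | null {
  if (requestPath.indexOf(PROXY_PREFIX) !== 0) return null;
  const rest = requestPath.slice(PROXY_PREFIX.length);
  const slash = rest.indexOf("/");
  const provider = slash === -1 ? rest : rest.slice(0, slash);
  const remainder = slash === -1 ? "" : rest.slice(slash);
  const base = UPSTREAMS.get(provider);
  return base === undefined ? null : base + remainder;
}

console.log(resolveUpstreamUrl("/v1/proxy/openai/chat/completions"));
```

Using a `Map` rather than a plain object keeps lookups for unexpected provider names (for example `constructor`) from matching inherited properties.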
Every proxied request is enqueued without blocking the response path, then batch-flushed to the analytics store. The proxy adds < 50ms overhead and returns to the caller before any logging work runs.
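A minimal sketch of such a non-blocking write path is shown below. The in-memory buffer, the `TelemetryQueue` class, the event shape, and the drop-on-failure policy are all assumptions for illustration; the real queues under apps/server/src/queues may be Redis-backed and retry failed batches.

```typescript
// Hypothetical event shape for illustration.
interface TelemetryEvent {
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

type FlushFn = (batch: TelemetryEvent[]) => Promise<void>;

class TelemetryQueue {
  private buffer: TelemetryEvent[] = [];
  private inFlight: Promise<void> | null = null;
  private readonly flushFn: FlushFn;
  private readonly batchSize: number;

  constructor(flushFn: FlushFn, batchSize = 100) {
    this.flushFn = flushFn;
    this.batchSize = batchSize;
  }

  // Synchronous and cheap: the request handler never awaits the analytics store.
  enqueue(event: TelemetryEvent): void {
    this.buffer.push(event);
    if (this.buffer.length >= this.batchSize) void this.flush();
  }

  // Drains the buffer in batches; concurrent callers share one in-flight drain.
  flush(): Promise<void> {
    if (this.inFlight === null) {
      this.inFlight = this.drain().then(() => {
        this.inFlight = null;
      });
    }
    return this.inFlight;
  }

  get pending(): number {
    return this.buffer.length;
  }

  private async drain(): Promise<void> {
    while (this.buffer.length > 0) {
      const batch = this.buffer.splice(0, this.batchSize);
      try {
        await this.flushFn(batch);
      } catch {
        // Best-effort in this sketch: a failed batch is dropped, not retried.
      }
    }
  }
}
```

A proxy handler would call `enqueue` after forwarding the provider response, and a timer or shutdown hook would call `flush` to push any partial batch.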
For teams who don't want to change baseURL:
import { TokenSense } from '@tokensense/sdk';
const ts = new TokenSense({ apiKey: 'ts_live_xxxx' });
const response = await ts.wrap(
() => openai.chat.completions.create({ model: 'gpt-4o', messages }),
{ feature: 'invoice-classifier', userId: 'user_123' }
);
// Captures request + response + latency, posts to /api/ingest async
TokenSense is on a phased path from demo dashboard to enterprise LLMOps platform. See project_upgrade.md for the full technical spec.
| Phase | Status | Milestone |
|---|---|---|
| Phase 0 | ✅ Done | Monorepo (Turborepo + pnpm), security fix, shared packages |
| Phase 1 | ✅ Done | Real data ingestion — proxy gateway, SDK wrap, OTEL receiver |
| Phase 2 | 🔲 Next | Multi-tenancy + RBAC + authentication (PostgreSQL + JWT) |
| Phase 3 | 🔲 Planned | Public REST API + TypeScript SDK + Python SDK + OpenAPI spec |
| Phase 4 | 🔲 Planned | ClickHouse analytics layer — retire mock aggregator |
| Phase 5 | 🔲 Planned | Webhooks + Slack / email / PagerDuty alert delivery |
| Phase 6 | 🔲 Planned | Configurable rules + LLM-as-a-judge evaluators |
| Phase 7 | 🔲 Planned | Executive dashboards + custom widgets + data export |
| Phase 8 | 🔲 Planned | Docker Compose + Kubernetes Helm chart (self-hosting) |
| Phase 9 | 🔲 Planned | SSO (SAML/OIDC) + SCIM + audit logs |
Architectural inspiration: Helicone (proxy-first, ClickHouse at scale) and Langfuse (SDK-first, five-level RBAC, Postgres → ClickHouse migration path).
| Layer | Technology | Notes |
|---|---|---|
| Frontend | React 19, TypeScript 5.9, Vite 8 | Production-quality SPA |
| Charting | Recharts | Area charts, bar charts, pie charts |
| Icons | Lucide React | |
| AI | Google Gemini (server-side) | Falls back to deterministic generator |
| API Server | Express + TypeScript | apps/server |
| Monorepo | Turborepo + pnpm workspaces | Parallel builds, shared packages |
| Shared Types | @tokensense/types | Single source of truth for all interfaces |
| Rules Engine | @tokensense/rules-engine | Runs in browser and Node.js |
| CLI Runtime | tsx + Node.js | Zero-config TypeScript execution |
| Styling | CSS Modules + custom design system | Space Grotesk + Space Mono |
# Fork, then clone your fork
git clone https://github.com/your-username/tokensense.git
cd tokensense
pnpm install
# Create a feature branch
git checkout -b feat/your-feature
# Run the full stack
pnpm dev
# Run linting before committing
pnpm lint
# Open a PR
Good places to contribute:
- New detection rules in packages/rules-engine/src/rules.ts
- Additional LLM provider clients in apps/web/src/router/providers.ts
- Improved complexity estimation heuristics in complexityEstimator.ts
- Real data adapters (OpenTelemetry, Datadog, CloudWatch) in apps/server/src/ingestion/
Distributed under the MIT License. See LICENSE for details.
Built for engineering teams who care about what they ship — and what it costs.
TokenSense — Stop guessing. Start owning.