KVGate

Multimodal LLM inference gateway with KV-cache-aware routing and offload.

KVGate is an open-source, OpenAI-compatible gateway that sits in front of a fleet of vLLM replicas. Unlike provider-aggregation proxies, it works at the KV-cache layer: it routes each request to the replica that already holds its prefix (including the image bytes) warm, and offloads overflow KV cache to CPU RAM or Redis via LMCache. It also handles the production basics: response caching, rate limiting, per-tenant budgets, failover, and Prometheus/Grafana observability.

It runs out of the box with zero API keys thanks to a built-in deterministic mock provider, so you can clone, start, stream, and load-test in under a minute.

Results (measured on GPU)

Benchmarked on Llava-OneVision-7B (a vision-language model) on rented A40 GPUs. All numbers are measured; full methodology and caveats are in docs/BENCHMARK_REPORT.md.

Capability	Result	Setup
Prefix-aware routing	1.84x lower tail latency (TTFT p95 2783 to 1516 ms), +14% throughput, 98.6% routing-affinity, load stays balanced	2 GPUs, one replica each, 12 images, KV capped
LMCache CPU KV offload	up to 2.0x lower TTFT and +65% throughput under memory pressure	single GPU, working set overflows GPU KV
Gateway overhead	about 1 ms p50 and 1.7 ms p99 added latency per request, negligible next to multimodal TTFT of 250 to 2700 ms	zero-latency mock backend, single worker

Why it works. Round-robin scatters each image across replicas, so each one keeps re-prefilling roughly 6.5k vision tokens; prefix-aware routing keeps each image's KV resident on a single replica, with a load guard so popular prompts do not snowball onto one replica. LMCache spills overflow KV to CPU RAM and recovers it faster than recomputing it; the gain appears only under memory pressure (when the working set fits in GPU, the engine's own cache suffices).

Interactive results dashboard

A Next.js dashboard renders these experiments as before/after charts, with a detailed docs tab and a live-operations view.

cd dashboard && npm install && npm run dev   # http://localhost:3000

Why KVGate

Serving open-source LLMs in production means re-solving the same problems every time: which replica gets each request, how to avoid recomputing KV cache another replica already holds, how to stop one tenant starving the rest, and how to see latency, cost, and throughput. KVGate packages those into one drop-in gateway.

Capability	What it does
OpenAI-compatible API	`POST /v1/chat/completions` (streaming and blocking) and `/v1/models`. Existing OpenAI SDKs work unchanged by pointing `base_url` at the gateway.
KV / prefix-aware routing	`prefix_kv_aware` routes requests sharing a prompt prefix, and the same image, to the replica that already has that prefix's KV cache warm. Multimodal-aware and needs no cooperation from the engine.
Smart routing strategies	Route a logical model across deployments by `latency` (EWMA), `least_busy`, `cost`, `weighted`, or `round_robin`.
Failover and circuit breaking	Retryable upstream errors fail over to the next-best deployment; repeatedly-failing deployments are ejected and given a cooldown.
Exact and semantic caching	Identical requests hit an exact cache; reworded prompts hit a semantic cache (cosine similarity). The Redis backend enables cross-replica reuse.
Rate limiting	Token-bucket limits per API key or tenant, in-memory or Redis.
Per-tenant budgets	Spend caps per tenant or API key; once a tenant's estimated spend exceeds its cap in the window, requests are rejected with HTTP 402 until reset.
Observability	Prometheus metrics at `/metrics`, a ready-made Grafana dashboard, and `/admin/stats` for live routing state.
Runs with no keys	A deterministic mock provider and a dependency-free hashing embedder mean everything works offline for demos, CI, and load tests.

How KVGate compares

Most "LLM gateways" are provider-aggregation proxies: one API in front of OpenAI, Anthropic, and Gemini, with response caching and cost routing. KVGate operates one layer deeper, at the KV-cache layer of a self-hosted vLLM fleet.

	Provider-aggregation gateways (LiteLLM, InferXgate, ...)	vLLM scheduling sidecars	KVGate
Primary job	Fan out to many hosted providers	Admission control and scheduling	Maximize KV-cache reuse across a replica fleet
Caching	Response cache (exact / semantic)	none	Response cache plus KV-cache offload (GPU to CPU to Redis)
Routing	By cost or availability	none	Prefix / KV-aware (including per-image hash) to the warm replica
Multimodal	Text-focused	varies	Vision-language first-class (image-hash routing)
Evidence	Feature lists	research	Measured GPU benchmarks (this repo)

KVGate also includes the table-stakes gateway features, but its differentiator is the KV-cache work, which is where the measured wins come from. It is complementary to a provider proxy, not a replacement for one.

Architecture

flowchart LR
    C["OpenAI SDK / curl"] --> A
    subgraph GW["KVGate gateway"]
      direction LR
      A["Auth<br/>(tenant)"] --> RL["Rate limit<br/>+ budget"] --> CA["Cache<br/>(exact + semantic)"] --> RT["Router<br/>(prefix-aware / latency / cost)"] --> FO["Failover +<br/>circuit breaker"]
    end
    FO --> MK["mock provider"]
    FO --> VL["vLLM / TGI fleet<br/>(self-hosted GPU)"]
    FO --> OA["OpenAI API"]
    FO --> AN["Anthropic API"]
    GW -. "/metrics, /admin/stats" .-> PM["Prometheus + Grafana"]

How KV / prefix-aware routing works

When many LLM replicas sit behind a load balancer, a normal balancer is KV-blind: it may send a request to a replica that never saw its prefix, wasting the KV cache another replica already holds. prefix_kv_aware fixes this:

Fingerprint the prompt into a chain of block hashes. Each image is hashed into the key, so same-text / different-image diverges and the same image collapses.
Score each replica by how long a leading prefix it has served recently (a per-replica affinity table with TTL and LRU), balanced against load.
Route to the best replica, with a load guard so a shared system prompt cannot snowball all traffic onto one replica.

It infers affinity from traffic the gateway already sees, needing no cooperation from the inference engine, which keeps it lightweight and vendor-neutral. Full explanation in docs/HOW_ROUTING_WORKS.md.

Quickstart (no API keys needed)

git clone https://github.com/rishika7006/kvgate.git
cd kvgate
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

kvgate run            # starts on http://localhost:8080 with the built-in mock providers

In another terminal:

# Blocking request
curl -s localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"demo","messages":[{"role":"user","content":"Hello"}]}' | jq

# Streaming (Server-Sent Events)
curl -N localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"demo","stream":true,"messages":[{"role":"user","content":"Stream this"}]}'

Use it from the OpenAI Python SDK unchanged:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="demo",
    messages=[{"role": "user", "content": "What is KVGate?"}],
)
print(resp.choices[0].message.content)
print(resp.kvgate)  # gateway metadata: cache status, provider, latency, cost

Send the same request twice and the second response comes back from cache (kvgate.cache == "exact"); send a reworded version to hit the semantic cache.

Run the full stack (gateway + Redis + Prometheus + Grafana)

cp .env.example .env        # optional: add real API keys
docker compose up --build

Service	URL
KVGate API	http://localhost:8080 (`/docs` for Swagger)
Prometheus	http://localhost:9090
Grafana	http://localhost:3000 (admin / admin), KVGate dashboard pre-loaded

Connect real backends

Copy and edit the config, then point --config at it:

cp config/config.example.yaml config/config.yaml
export OPENAI_API_KEY=sk-...        # secrets via ${ENV} expansion, never in the file
kvgate run -c config/config.yaml

A model is a logical name clients request; it fans out to one or more deployments (provider plus upstream model). For example, serve a logical gpt-4o mostly from your own vLLM box and overflow to hosted OpenAI:

routing:
  strategy: cost            # prefer the cheapest healthy deployment
models:
  - name: gpt-4o
    deployments:
      - { provider: vllm-local, model: meta-llama/Llama-3.1-8B-Instruct, cost_per_1k_input: 0.0,   cost_per_1k_output: 0.0 }
      - { provider: openai,     model: gpt-4o,                           cost_per_1k_input: 0.005, cost_per_1k_output: 0.015 }

Supported provider types: mock, openai, openai_compatible (vLLM, TGI, Ollama, LocalAI), anthropic.

API reference

Endpoint	Description
`POST /v1/chat/completions`	Chat completions, `stream: true` or `false`. OpenAI-compatible.
`GET /v1/models`	List logical models the gateway serves.
`GET /healthz`, `GET /readyz`	Liveness and readiness probes.
`GET /metrics`	Prometheus exposition format.
`GET /admin/stats`	Live routing state: per-deployment latency, in-flight, circuit status, cost.
`GET /docs`	Swagger UI.

Configuration is documented inline in config/config.example.yaml. Validate any config with kvgate validate -c config/config.yaml.

Development

pip install -e ".[dev]"
pytest                 # run the test suite
ruff check .           # lint
ruff format .          # format
mypy src               # type-check

See CONTRIBUTING.md.

Roadmap

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.github		.github
config		config
dashboard		dashboard
deploy		deploy
docs		docs
loadtest		loadtest
results		results
scripts		scripts
src/kvgate		src/kvgate
tests		tests
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

KVGate

Results (measured on GPU)

Interactive results dashboard

Why KVGate

How KVGate compares

Architecture

How KV / prefix-aware routing works

Quickstart (no API keys needed)

Run the full stack (gateway + Redis + Prometheus + Grafana)

Connect real backends

API reference

Development

Roadmap

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

KVGate

Results (measured on GPU)

Interactive results dashboard

Why KVGate

How KVGate compares

Architecture

How KV / prefix-aware routing works

Quickstart (no API keys needed)

Run the full stack (gateway + Redis + Prometheus + Grafana)

Connect real backends

API reference

Development

Roadmap

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages