The traffic layer for AI-native services
Single binary. ACL config. Hot reload. Built for LLM streaming, scale-to-zero, and safe model rollouts.
Why A3S Gateway • Quick Start • Features • Configuration • Architecture • Deployment • API Reference • Stability
AI services break the assumptions baked into Web-era gateways:
| Assumption | Web services | AI services |
|---|---|---|
| Response size | Small, bounded | Unbounded (streaming tokens) |
| Latency | Milliseconds | Seconds (model inference) |
| Idle cost | Cheap | Expensive (GPU memory) |
| Deployment risk | Low | High (model quality regression) |
| Protocol | HTTP request/response | SSE, WebSocket, gRPC |
nginx, Caddy, and Traefik were built for the left column. A3S Gateway is built for the right:
- SSE/Streaming: chunked transfer without response buffering; the first token reaches the client as soon as the model emits it
- Scale-to-zero with request buffering: when a model is cold, incoming requests are held in memory and replayed as soon as the replica is ready, rather than being dropped or rejected with a 503
- Revision traffic splitting: send 5% of live traffic to a new model version and automatically roll back if the error rate or p99 latency crosses a threshold
- Traffic mirroring: shadow-test a new model with real requests before it serves a single production response
- WebSocket multiplexing: named pub/sub channels over a single connection for real-time AI interactions
Everything else (routing, TLS, rate limiting, circuit breaker, Prometheus) is table-stakes infrastructure packaged so you don't need a second tool.
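
To make the revision traffic splitting idea concrete, here is a minimal sketch of a percentage-based pick. It is an illustration only: `Revision` and `pick_revision` are hypothetical names, not types from the A3S Gateway crate, and the `roll` input is assumed to come from something like a request-ID hash modulo 100.

```rust
/// A revision and the share of traffic (in percent) it should receive.
/// Hypothetical type for illustration; not part of the a3s-gateway API.
#[derive(Debug, Clone, PartialEq)]
pub struct Revision {
    pub name: &'static str,
    pub percent: u32,
}

/// Pick a revision for one request.
/// `roll` is expected to be in 0..100 (for example, a request hash modulo 100).
pub fn pick_revision(revisions: &[Revision], roll: u32) -> Option<&Revision> {
    let mut upper = 0;
    for rev in revisions {
        // Each revision owns the half-open range [previous upper, upper).
        upper += rev.percent;
        if roll < upper {
            return Some(rev);
        }
    }
    None // the percentages did not cover this roll
}
```

With `v1` at 95 and `v2` at 5, rolls 0 through 94 land on `v1` and rolls 95 through 99 land on `v2`. Deriving the roll from a stable request attribute keeps a given client on the same revision.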
1069 tests | 80 source files | ~37,000 lines of Rust | Single statically-linked binary | MSRV 1.88
# Install
brew install a3s-lab/tap/a3s-gateway
# Or via cargo
cargo install a3s-gateway
# Start
a3s-gateway --config gateway.acl
# gateway.acl - proxy all traffic to an LLM service
entrypoints "web" {
address = "0.0.0.0:8080"
}
routers "llm" {
rule = "PathPrefix(`/v1`)"
service = "llm-backend"
middlewares = ["rate-limit", "auth-jwt"]
}
services "llm-backend" {
load_balancer {
strategy = "least-connections"
request_timeout = "60s"
servers = [
{ url = "http://127.0.0.1:8001" },
{ url = "http://127.0.0.1:8002" }
]
health_check { path = "/health" }
}
# Scale to zero — buffer requests during cold start
scaling {
min_replicas = 0
max_replicas = 4
container_concurrency = 10
buffer_enabled = true
executor = "box"
}
}
middlewares "rate-limit" { type = "rate-limit"; rate = 60; burst = 10 }
middlewares "auth-jwt" { type = "jwt"; value = env("JWT_SECRET") }
use a3s_gateway::{Gateway, config::GatewayConfig};
use std::sync::Arc;
#[tokio::main]
async fn main() -> a3s_gateway::Result<()> {
let config = GatewayConfig::from_file("gateway.acl").await?;
let gateway = Arc::new(Gateway::new(config)?);
gateway.start().await?;
gateway.wait_for_shutdown().await;
Ok(())
}

| Feature | How it works |
|---|---|
| SSE / Streaming | Chunked transfer relay — zero response buffering, first token delivered immediately |
| Scale-to-zero | Knative formula: desired = ⌈(in_flight + queue_depth) / (concurrency × utilization)⌉ |
| Cold-start buffering | Requests queue in memory during scale-up; replayed once backend is ready |
| Revision traffic splitting | Route N% to v1, M% to v2 with per-revision health tracking |
| Gradual rollout | Step-by-step traffic shift with automatic rollback on error rate or latency breach |
| Traffic mirroring | Fire-and-forget copy of live traffic to a shadow backend (no client impact) |
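
The scale-to-zero formula in the table can be sketched as a small function. This is a hedged illustration, not the gateway's autoscaler: `desired_replicas` is a hypothetical name, and clamping the result to `min_replicas`/`max_replicas` is an assumption based on the `scaling` config keys.

```rust
/// desired = ceil((in_flight + queue_depth) / (concurrency * utilization)),
/// then clamped to [min_replicas, max_replicas] (the clamp is an assumption).
/// Requires min_replicas <= max_replicas, otherwise `clamp` panics.
pub fn desired_replicas(
    in_flight: u32,
    queue_depth: u32,
    container_concurrency: u32,
    target_utilization: f64,
    min_replicas: u32,
    max_replicas: u32,
) -> u32 {
    let demand = f64::from(in_flight + queue_depth);
    let per_replica = f64::from(container_concurrency) * target_utilization;
    if per_replica <= 0.0 {
        // Guard against division by zero in this sketch.
        return max_replicas;
    }
    let desired = (demand / per_replica).ceil() as u32;
    desired.clamp(min_replicas, max_replicas)
}
```

For example, with `container_concurrency = 10` and `target_utilization = 0.7`, 20 in-flight requests plus 5 queued give 25 / 7 ≈ 3.57, which rounds up to 4 replicas. With no demand the result is 0, which is what allows scale-to-zero when `min_replicas = 0`.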
- Dynamic routing: Traefik-style rule engine with `Host()`, `PathPrefix()`, `Path()`, `Headers()`, `Method()`, `&&`
- Load balancing: Round-robin, weighted, least-connections, random
- Health checks: Active HTTP probes + passive error-count eviction
- Sticky sessions: Cookie-based backend affinity with TTL and LRU eviction
- Failover: Automatic switch to secondary pool when primary has no healthy backends
- Forwarded headers: Upstreams receive normalized `X-Forwarded-For`, `X-Forwarded-Host`, `X-Forwarded-Proto`, and `X-Forwarded-Port`
- Upstream timeouts: Per-service plain HTTP `request_timeout` returns `504 Gateway Timeout` on slow backends
- TLS termination: rustls (pure Rust, no OpenSSL) + ACME/Let's Encrypt (HTTP-01, DNS-01/Cloudflare, DNS-01/Route53)
- Hot reload: File-watch config reload (inotify/kqueue); unchanged HTTP/TCP entrypoints are not rebound
| Protocol | Capability |
|---|---|
| HTTP/1.1 & HTTP/2 | Full reverse proxy, hop-by-hop header filtering, normalized X-Forwarded-* metadata |
| WebSocket | Upgrade detection, bidirectional relay, named-channel multiplexing |
| SSE / Streaming | Chunked transfer, zero buffering — optimized for LLM token streams |
| gRPC | HTTP/2 h2c forwarding with header translation |
| TCP | Raw byte relay, SNI-based routing (HostSNI()), IP filtering |
| UDP | Session-based datagram relay |
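
Hop-by-hop header filtering, mentioned in the HTTP row, can be sketched as follows. This is a simplified illustration over plain `(name, value)` pairs rather than the gateway's real header types; `strip_hop_by_hop` is a hypothetical name, and the header list is the commonly cited set from RFC 2616 section 13.5.1.

```rust
/// Headers that describe a single connection and should not be forwarded.
const HOP_BY_HOP: [&str; 8] = [
    "connection",
    "keep-alive",
    "proxy-authenticate",
    "proxy-authorization",
    "te",
    "trailer",
    "transfer-encoding",
    "upgrade",
];

/// Return a copy of `headers` without hop-by-hop entries.
pub fn strip_hop_by_hop(headers: &[(String, String)]) -> Vec<(String, String)> {
    // Any header named inside `Connection: a, b` is hop-by-hop as well.
    let mut extra: Vec<String> = Vec::new();
    for (name, value) in headers {
        if name.eq_ignore_ascii_case("connection") {
            extra.extend(value.split(',').map(|t| t.trim().to_string()));
        }
    }
    headers
        .iter()
        .filter(|(name, _)| {
            let listed = HOP_BY_HOP.iter().any(|h| h.eq_ignore_ascii_case(name));
            let named = extra.iter().any(|e| e.eq_ignore_ascii_case(name));
            !listed && !named
        })
        .cloned()
        .collect()
}
```

A proxy that relays WebSocket upgrades has to treat `Upgrade` and `Connection` specially on that path; the sketch covers only the plain request/response case.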
| Middleware | Config Keys | Purpose |
|---|---|---|
| `jwt` | `value` | JWT validation (HS256) |
| `api-key` | `header`, `keys` | API key enforcement |
| `basic-auth` | `username`, `password` | HTTP Basic Auth |
| `forward-auth` | `forward_auth_url` | Delegate auth to external IdP |
| `rate-limit` | `rate`, `burst` | Token bucket (in-process) |
| `rate-limit-redis` | `rate`, `burst`, `redis_url` | Distributed rate limiting |
| `cors` | `allowed_origins`, `allowed_methods` | CORS headers |
| `headers` | `request_headers`, `response_headers` | Header manipulation |
| `strip-prefix` | `prefixes` | Path prefix removal |
| `body-limit` | `max_body_bytes` | Request body cap (413 on exceed) |
| `retry` | `max_retries`, `retry_interval_ms` | Retry on upstream failure |
| `circuit-breaker` | `failure_threshold`, `cooldown_secs` | Closed/Open/HalfOpen state machine |
| `ip-allow` | `allowed_ips` | CIDR/IP allowlist |
| `compress` | (none) | brotli/gzip/deflate (br preferred) |
| `tcp-filter` | (none) | Connection limit + IP allowlist for TCP |
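
The `rate-limit` middleware is described as an in-process token bucket. A minimal sketch of that technique is shown below. It assumes `rate` means tokens per second and `burst` means bucket capacity; the README does not state the units, so treat both as assumptions. `TokenBucket` is a hypothetical type, not the crate's implementation.

```rust
/// Minimal token bucket. Time is passed in explicitly so the behavior is
/// deterministic; production code would read a monotonic clock instead.
pub struct TokenBucket {
    rate: f64,  // tokens added per second (assumed unit)
    burst: f64, // maximum tokens the bucket can hold
    tokens: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    /// Start with a full bucket.
    pub fn new(rate: f64, burst: f64, now_secs: f64) -> Self {
        Self { rate, burst, tokens: burst, last_refill_secs: now_secs }
    }

    /// Refill based on elapsed time, then try to take one token.
    pub fn try_acquire(&mut self, now_secs: f64) -> bool {
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill_secs = now_secs;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A bucket created with `rate = 1.0` and `burst = 2.0` admits two requests back to back, rejects a third, and admits one more after a second has passed.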
All observability features are individually configurable — disable any of them to reduce per-request overhead in high-throughput scenarios.
observability {
metrics_enabled = true # Prometheus metrics (default: true)
access_log_enabled = true # Structured JSON access log (default: true)
tracing_enabled = false # W3C Trace Context propagation (default: true)
}
- Prometheus metrics: Per-router/service/backend request counts, latency histograms, error rates, autoscaler state
- Structured access log: JSON entries — timestamp, client IP, method, path, status, duration, backend, router
- Distributed tracing: W3C Trace Context and B3/Zipkin propagation; inject spans into upstream requests
- Management: CLI-first operations plus an optional authenticated/mTLS Dashboard API on a dedicated listener
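
For readers unfamiliar with W3C Trace Context, the sketch below parses a version `00` `traceparent` header (`version-traceid-parentid-flags`). It is an illustration of the header format only; `TraceParent` and `parse_traceparent` are hypothetical names and say nothing about how the gateway's tracing module is written.

```rust
#[derive(Debug, PartialEq)]
pub struct TraceParent {
    pub trace_id: String,  // 32 lowercase hex characters
    pub parent_id: String, // 16 lowercase hex characters
    pub sampled: bool,     // lowest bit of the flags byte
}

fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parse a version-00 traceparent value; returns None if it is malformed.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);
    let valid = version == "00"
        && is_lower_hex(trace_id, 32)
        && is_lower_hex(parent_id, 16)
        && is_lower_hex(flags, 2)
        // All-zero IDs are invalid per the specification.
        && trace_id.bytes().any(|b| b != b'0')
        && parent_id.bytes().any(|b| b != b'0');
    if !valid {
        return None;
    }
    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        sampled: (flag_bits & 0x01) == 1,
    })
}
```

When propagating, a proxy keeps the trace ID, substitutes its own span ID as the parent ID, and forwards the header upstream. The sketch rejects versions other than `00` for simplicity.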
- File provider: ACL with directory watching and hot reload
- DNS provider: Hostname resolution with TTL-based caching
- Docker provider: Auto-discover services from container labels (`a3s.router.rule`, `a3s.service.port`)
- Health-based discovery: Auto-register backends via `/.well-known/a3s-service.json`
- Kubernetes Ingress (`kube` feature): Watch `networking.k8s.io/v1/Ingress` resources
- Kubernetes IngressRoute CRD (`kube` feature): Traefik-style advanced routing
Built for throughput. The proxy hot path uses direct hyper HTTP/1.1 connection pooling with streaming request body passthrough — no intermediate buffering for plain HTTP traffic.
| Metric | Value | Conditions |
|---|---|---|
| Throughput | 67,000 req/s | 200 concurrent connections, Apple Silicon, loopback |
| Latency overhead | 69 µs (p50) | Single connection, measures pure gateway overhead |
| Tail latency | 5.6 ms (p99) | 200 concurrent connections |
| Routing | 90 ns | Match against 100-route table |
| Middleware pipeline | 130 ns/middleware | Pre-compiled at startup, no per-request allocation |
| Config reload | 3 ms | 300-service configuration, hot reload |
Benchmarked with oha (Rust HTTP load generator) against a hyper backend on loopback. Criterion micro-benchmarks included in benches/.
All configuration uses ACL format (.acl files). Changes are picked up automatically when file watching is enabled — no restart required.
# Full example — LLM API gateway with safe rollout
entrypoints "web" { address = "0.0.0.0:80" }
entrypoints "websecure" {
address = "0.0.0.0:443"
tls { cert_file = "/etc/certs/cert.pem"; key_file = "/etc/certs/key.pem" }
}
routers "llm-api" {
rule = "Host(`api.example.com`) && PathPrefix(`/v1`)"
service = "llm-service"
entrypoints = ["websecure"]
middlewares = ["auth-jwt", "rate-limit", "circuit-breaker"]
}
services "llm-service" {
load_balancer {
strategy = "least-connections"
request_timeout = "60s"
servers = [
{ url = "http://127.0.0.1:8001" },
{ url = "http://127.0.0.1:8002" }
]
health_check { path = "/health"; interval = "10s" }
}
# Mirror 5% of traffic to a new model for shadow testing
mirror { service = "llm-canary"; percentage = 5 }
# Scale to zero when idle — buffer requests during cold start
scaling {
min_replicas = 0
max_replicas = 8
container_concurrency = 10
target_utilization = 0.7
buffer_enabled = true
executor = "box"
}
# Shift traffic from v1 to v2 over 10 steps, 1 minute apart
# Auto-rollback if error rate > 5% or p99 > 5s
rollout {
from = "v1"
to = "v2"
step_percent = 10
step_interval_secs = 60
error_rate_threshold = 0.05
latency_threshold_ms = 5000
}
}
middlewares "auth-jwt" { type = "jwt"; value = env("JWT_SECRET") }
middlewares "rate-limit" { type = "rate-limit"; rate = 100; burst = 20 }
middlewares "circuit-breaker" {
type = "circuit-breaker"
failure_threshold = 5
cooldown_secs = 30
success_threshold = 2
}
providers {
file { watch = true; directory = "/etc/gateway/conf.d/" }
}
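
The `rollout` block above describes a step-wise shift with automatic rollback. One way to read those settings is as a decision made at every `step_interval_secs` tick, sketched below. `RolloutPolicy`, `RolloutAction`, and `next_action` are hypothetical names, and the exact comparison semantics (strictly greater than the threshold) are an assumption taken from the config comment.

```rust
#[derive(Debug, PartialEq)]
pub enum RolloutAction {
    /// Shift the target revision to this percentage of traffic.
    Advance(u32),
    /// The target revision already receives 100% and is healthy.
    Complete,
    /// A threshold was crossed; send all traffic back to the old revision.
    Rollback,
}

pub struct RolloutPolicy {
    pub step_percent: u32,
    pub error_rate_threshold: f64,
    pub latency_threshold_ms: u64,
}

/// Decide what to do at one rollout tick, given the metrics observed
/// for the target revision since the previous tick.
pub fn next_action(
    policy: &RolloutPolicy,
    current_percent: u32,
    error_rate: f64,
    p99_ms: u64,
) -> RolloutAction {
    let unhealthy =
        error_rate > policy.error_rate_threshold || p99_ms > policy.latency_threshold_ms;
    if unhealthy {
        return RolloutAction::Rollback;
    }
    if current_percent >= 100 {
        return RolloutAction::Complete;
    }
    RolloutAction::Advance((current_percent + policy.step_percent).min(100))
}
```

With `step_percent = 10`, a healthy rollout advances 0 → 10 → 20 and so on, reaching 100% after ten ticks, which matches the "10 steps, 1 minute apart" comment in the example.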
Backends expose /.well-known/a3s-service.json (RFC 8615) for automatic registration:
{
"name": "llm-service",
"version": "2.1.0",
"routes": [
{ "rule": "PathPrefix(`/v1`)", "middlewares": ["rate-limit"] }
],
"health_path": "/health",
"weight": 1
}

                 ┌──────────────────────────────────────────────┐
                 │                 A3S Gateway                  │
                 │                                              │
Client ──────────┤  Entrypoint (HTTP/HTTPS/TCP/UDP)             │
(HTTP/WS/SSE/    │       │                                      │
 gRPC/TCP/UDP)   │       ▼                                      │
                 │  TLS Termination (rustls + ACME)             │
                 │       │                                      │
                 │       ▼                                      │
                 │  Router ──── Rule Matching                   │
                 │       │      (Host, Path, Headers, SNI)      │
                 │       ▼                                      │
                 │  Middleware Pipeline                         │
                 │  ┌──────┬────────┬──────────┬───────────┐    │
                 │  │Auth  │ Rate   │ Circuit  │ Compress  │    │
                 │  │JWT   │ Limit  │ Breaker  │ CORS      │    │
                 │  └──────┴────────┴──────────┴───────────┘    │
                 │       │                                      │
                 │       ▼                                      │
                 │  Service (LB + Health + Failover + Mirror)   │
                 │       │                                      │
                 │       ▼                                      │
                 │  Scaling (Knative autoscaler + buffer)       │
                 │       │                                      │
                 └───────┼──────────────────────────────────────┘
                         │
            ┌────────────┼────────────┐
            ▼            ▼            ▼
        ┌────────┐   ┌────────┐  ┌──────────┐
        │HTTP    │   │gRPC    │  │TCP/UDP   │
        │Backend │   │Backend │  │Backend   │
        └────────┘   └────────┘  └──────────┘
| Component | Responsibility |
|---|---|
| `Gateway` | Lifecycle orchestrator: owns all subsystems |
| `Entrypoint` | Network listener (HTTP, HTTPS, TCP, UDP) |
| `Router` | Rule-based request matching; O(n) scan with priority ordering |
| `Middleware` | Composable request/response pipeline; pre-compiled per-router at startup |
| `Service` | Backend pool: load balancing, health, mirroring, failover |
| `Scaling` | Knative autoscaler: concurrency tracking, request buffer, revision router |
| `Provider` | Dynamic config sources: file, DNS, discovery, Kubernetes |
| `Proxy` | Protocol forwarder: HTTP, WebSocket, SSE, gRPC, TCP, UDP |
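
The Router row describes an O(n) scan with priority ordering. A minimal sketch of that approach, covering only `Host`, `PathPrefix`, and `&&`, could look like the following. All names here (`Rule`, `Route`, `Request`, `match_route`, `sort_by_priority`) are hypothetical, and "a higher number means a higher priority" is an assumption.

```rust
pub struct Request<'a> {
    pub host: &'a str,
    pub path: &'a str,
}

pub enum Rule {
    Host(&'static str),
    PathPrefix(&'static str),
    And(Box<Rule>, Box<Rule>),
}

impl Rule {
    pub fn matches(&self, req: &Request) -> bool {
        match self {
            // Host names are compared case-insensitively.
            Rule::Host(h) => req.host.eq_ignore_ascii_case(h),
            Rule::PathPrefix(p) => req.path.starts_with(*p),
            Rule::And(a, b) => a.matches(req) && b.matches(req),
        }
    }
}

pub struct Route {
    pub name: &'static str,
    pub priority: i32,
    pub rule: Rule,
}

/// Sort once at startup or reload: highest priority first (stable sort).
pub fn sort_by_priority(routes: &mut [Route]) {
    routes.sort_by(|a, b| b.priority.cmp(&a.priority));
}

/// Linear scan over pre-sorted routes; the first match wins.
pub fn match_route<'r>(routes: &'r [Route], req: &Request) -> Option<&'r Route> {
    routes.iter().find(|r| r.rule.matches(req))
}
```

Sorting once and scanning linearly keeps the per-request path allocation-free, which is a plausible design for tables in the range of a hundred routes.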
Created → Starting → Running ⇄ Reloading → Stopping → Stopped
Hot reload (Reloading) replaces router table and service registry atomically under a shared runtime snapshot. HTTP/TCP entrypoints keep their sockets when listener configuration is unchanged. If HTTP/TCP listeners are added or moved, only changed entrypoints are reconciled; unchanged listeners remain active if the new bind fails. UDP entrypoints still restart explicitly.
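
The lifecycle line above can be read as a small state machine. The sketch below encodes one interpretation of it; `Lifecycle` is a hypothetical enum (the crate exposes its own `GatewayState`), and allowing `Stopping` from both `Running` and `Reloading` is an assumption, since the diagram does not spell that out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Created,
    Starting,
    Running,
    Reloading,
    Stopping,
    Stopped,
}

impl Lifecycle {
    /// True if moving from `self` to `next` follows the lifecycle diagram.
    pub fn can_transition_to(self, next: Lifecycle) -> bool {
        use Lifecycle::*;
        matches!(
            (self, next),
            (Created, Starting)
                | (Starting, Running)
                | (Running, Reloading)
                | (Reloading, Running)
                | (Running, Stopping)
                | (Reloading, Stopping) // assumption: shutdown may interrupt a reload
                | (Stopping, Stopped)
        )
    }
}
```

Rejecting every transition not listed makes misuse, such as calling reload on a stopped gateway, an explicit error rather than undefined behavior.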
# Homebrew
brew install a3s-lab/tap/a3s-gateway
# Helm
helm install gateway deploy/helm/a3s-gateway \
  --set-file config=my-gateway.acl \
  --set service.type=LoadBalancer
# Docker
docker run -v $(pwd)/gateway.acl:/etc/a3s-gateway/gateway.acl \
  -p 8080:8080 ghcr.io/a3s-lab/gateway:latest
# Cargo
cargo install a3s-gateway

| Method | Description |
|---|---|
| `Gateway::new(config)` | Create from `GatewayConfig` |
| `start()` | Bind listeners and begin proxying |
| `shutdown()` | Graceful drain and stop |
| `reload(new_config)` | Atomic hot reload without downtime |
| `health()` | Current health snapshot |
| `metrics()` | Prometheus metrics collector |
| `state()` | `GatewayState` enum |
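
An atomic reload of the kind `reload(new_config)` describes is often built on an immutable snapshot behind a shared pointer. The sketch below shows that general pattern with the standard library only. It is not the crate's implementation: `Snapshot` and `Runtime` are hypothetical, and a production version might use a lock-free swap instead of `RwLock`.

```rust
use std::sync::{Arc, RwLock};

/// Everything a request needs to be routed, built ahead of time.
#[derive(Debug)]
pub struct Snapshot {
    pub version: u64,
    pub routes: Vec<String>,
}

pub struct Runtime {
    current: RwLock<Arc<Snapshot>>,
}

impl Runtime {
    pub fn new(initial: Snapshot) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Per-request read: clone the Arc, then release the lock immediately.
    pub fn load(&self) -> Arc<Snapshot> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Swap in a fully built snapshot. Requests that already hold the old
    /// Arc finish against the old configuration.
    pub fn reload(&self, next: Snapshot) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

Because the new snapshot is built and validated before the swap, a request sees either the old configuration or the new one, never a mix of the two.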
Management is CLI-first by default. The optional Dashboard API runs on a
dedicated listener, so /api/gateway/* on traffic entrypoints remains normal
user traffic and can be routed by your own routers. Management requests require
both a matching allowed_ips entry and a bearer token when auth_token_env is
set. Remote management listeners can also require HTTPS and client certificates.
a3s-gateway validate --config gateway.acl
a3s-gateway config --config gateway.acl summary
a3s-gateway config --config gateway.acl entrypoints
a3s-gateway config --config gateway.acl routes
a3s-gateway config --config gateway.acl services
a3s-gateway config --config gateway.acl middlewares
a3s-gateway config --config gateway.acl providers
a3s-gateway config --config gateway.acl json
a3s-gateway management events --url http://127.0.0.1:9090/api/gateway
a3s-gateway management validate --url http://127.0.0.1:9090/api/gateway --file gateway.acl
a3s-gateway management reload --url http://127.0.0.1:9090/api/gateway --file gateway.acl

Use `--ca-cert`, `--client-cert`, and `--client-key` with `management events` when the management listener requires mTLS.
management {
enabled = true
address = "127.0.0.1:9090"
path_prefix = "/api/gateway"
auth_token_env = "A3S_GATEWAY_ADMIN_TOKEN"
allowed_ips = ["127.0.0.1", "::1"]
tls {
cert_file = "/etc/a3s/admin/server.crt"
key_file = "/etc/a3s/admin/server.key"
client_ca_file = "/etc/a3s/admin/client-ca.crt"
require_client_cert = true
min_version = "1.3"
}
}
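
The access rule for the management listener (an `allowed_ips` match, plus a bearer token when `auth_token_env` is set) can be sketched as a single check. This is an illustration under simplifying assumptions: `is_authorized` is a hypothetical function, the allowlist holds exact addresses only (no CIDR ranges), and the token is assumed to arrive in an `Authorization: Bearer ...` header.

```rust
use std::net::IpAddr;

/// Compare two byte strings without returning early on the first mismatch.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    a.iter().zip(b).fold(0u8, |acc, (x, y)| acc | (x ^ y)) == 0
}

/// `expected_token` is None when no token is configured.
pub fn is_authorized(
    peer: IpAddr,
    authorization: Option<&str>,
    allowed_ips: &[IpAddr],
    expected_token: Option<&str>,
) -> bool {
    // The IP allowlist applies to every request.
    if !allowed_ips.contains(&peer) {
        return false;
    }
    match expected_token {
        None => true,
        Some(expected) => authorization
            .and_then(|h| h.strip_prefix("Bearer "))
            .map_or(false, |t| constant_time_eq(t.as_bytes(), expected.as_bytes())),
    }
}
```

Checking the address first means requests from outside the allowlist are rejected before any token comparison happens.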
| Endpoint | Response |
|---|---|
| `GET /api/gateway/health` | Gateway health (JSON) |
| `GET /api/gateway/metrics` | Prometheus text format |
| `GET /api/gateway/config` | Active configuration (JSON) |
| `GET /api/gateway/routes` | Configured routes |
| `GET /api/gateway/services` | Services with backend health |
| `GET /api/gateway/backends` | All backends with connection counts |
| `GET /api/gateway/events` | Recent management security audit events |
| `GET /api/gateway/version` | Binary version |
| `POST /api/gateway/config/validate` | Validate an ACL payload without applying it |
| `POST /api/gateway/config/reload` | Transactionally reload from an ACL payload |
cargo build -p a3s-gateway
cargo test -p a3s-gateway --all-features # 1069 tests
cargo clippy -p a3s-gateway --all-features -- -D warnings
cargo bench --no-run --all-features       # compile benchmarks

Or use the justfile:
just ci # fmt + lint + test (full gate)
just bench # run criterion benchmarks
just release-check   # full pre-release validation

src/
├── lib.rs, main.rs # Public API + CLI
├── gateway.rs # Lifecycle orchestrator
├── dashboard.rs # Optional dedicated management API
├── entrypoint.rs # HTTP/HTTPS/TCP/UDP listeners + hot path
├── error.rs # GatewayError, Result
│
├── config/ # ACL configuration model
│ └── acl.rs, entrypoint.rs, router.rs, service.rs, scaling.rs, middleware.rs
│
├── router/ # Rule matching engine
│ ├── rule.rs # Host/Path/Header/Method/SNI matchers
│ └── tcp.rs # TCP SNI router
│
├── middleware/ # 15 built-in middleware types
│
├── service/ # Backend pool management
│ └── load_balancer.rs, health_check.rs, passive_health.rs,
│ sticky.rs, mirror.rs, failover.rs
│
├── proxy/ # Protocol forwarders
│ └── http_proxy.rs, websocket.rs, streaming.rs, grpc.rs,
│ tcp.rs, udp.rs, tls.rs, acme*.rs, ws_mux.rs
│
├── observability/
│ └── metrics.rs, access_log.rs, tracing.rs
│
├── scaling/ # Knative-style autoscaler
│ └── executor.rs, autoscaler.rs, buffer.rs,
│ concurrency.rs, revision.rs, rollout.rs
│
└── provider/ # Config providers
└── file_watcher.rs, dns.rs, docker.rs, discovery.rs,
kubernetes.rs, kubernetes_crd.rs (feature: kube)
benches/ # Criterion benchmarks
├── routing.rs # RouterTable::match_request
├── middleware_pipeline.rs # Pipeline::process_request
└── acl_parse.rs # GatewayConfig::from_acl
deploy/
└── helm/a3s-gateway/ # Helm chart
A3S Gateway follows Semantic Versioning. Starting with v1.0.0:
- Stable: Public Rust API (`Gateway`, `GatewayConfig`, `GatewayState`, `HealthStatus`, `GatewayError`), ACL configuration format, Management API endpoints, CLI interface.
- Unstable (may change in minor releases): `#[doc(hidden)]` modules (`router`, `middleware`), internal provider implementations, benchmark infrastructure.
- MSRV: Rust 1.88. May advance in minor releases, staying at least 3 stable versions behind the latest Rust release.
See CHANGELOG.md for release history and RELEASING.md for the release process.
Discord — questions, discussions, updates.
MIT