Tenant-Aware Quotas, Rate Limits, and Usage Metering (+ Billing Exports)
Summary
Introduce first-class multi-tenant controls across the platform: per-tenant quotas (budgets, API calls, events), adaptive rate limiting, metering pipelines, and exportable billing reports. This enables fair usage, protects shared infra, and unlocks paid tiers.
Goals
- Per-tenant quotas (e.g., max budgets/expenses/day, storage, Kafka events/min).
- Smarter rate limits: token-bucket with burst, sliding-window for abuse spikes.
- Usage metering: durable event stream → aggregated daily/hourly metrics.
- Admin UX & APIs to set plans, limits, and view overages.
- Exports: CSV/Parquet to S3/GCS (or local) for finance tools (Stripe/BillingOps).
- Observability: Prometheus metrics + Grafana dashboards for capacity planning.
Non-Goals
- Payments/checkout (can integrate later).
- Price modeling—only usage collection & enforcement here.
Design Overview
1) Tenant & Plan Model
tenants collection/table: { id, name, plan_id, status, createdAt, … }
plans: { id, name, limits: { api_rpm, events_per_day, budgets, storage_mb, … } }
overrides: optional per-tenant limit overrides.
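One way to read the plan/override relationship is that an override replaces individual plan limits and everything else falls through to the plan. The sketch below assumes that merge rule; the `Limits` fields beyond those listed above, the `effectiveLimits` helper, and the sample numbers are all hypothetical.

```typescript
// Hypothetical types mirroring the plan shape listed above.
interface Limits {
  api_rpm: number;
  events_per_day: number;
  budgets: number;
  storage_mb: number;
}

interface Plan {
  id: string;
  name: string;
  limits: Limits;
}

// Assumed merge rule: override values win; keys the override omits fall back to the plan.
function effectiveLimits(plan: Plan, override: Partial<Limits> = {}): Limits {
  const merged: Limits = { ...plan.limits };
  for (const key of Object.keys(override) as (keyof Limits)[]) {
    const value = override[key];
    if (value !== undefined) merged[key] = value;
  }
  return merged;
}

// Illustrative plan values only; not taken from any real pricing.
const pro: Plan = {
  id: "pro",
  name: "Pro",
  limits: { api_rpm: 600, events_per_day: 100000, budgets: 50, storage_mb: 1024 },
};
console.log(effectiveLimits(pro, { api_rpm: 1200 }));
```

Resolving limits once per request (or caching the merged result per tenant) keeps the enforcement path free of plan lookups.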
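2) Gateway Enforcement
New rate-limit middleware (Express):
- Token bucket with burst, per tenant and route (Redis key rate:<tenantId>:<route>).
- Sliding window for abuse spikes, per tenant (Redis key sw:<tenantId>).
Quota checks on mutating endpoints (budgets/expenses/transactions):
- Daily counter per tenant and metric (Redis key q:<tenantId>:<metric>:YYYYMMDD).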
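A token-bucket limiter keyed per tenant and route could look roughly like the following. This is an in-memory sketch only: a shared deployment would presumably keep the same state in Redis, and the class name, the `ratePerMin`/`burst` parameters, and the injected clock are assumptions made for illustration.

```typescript
interface BucketState { tokens: number; updatedAtMs: number; }
interface Decision { allowed: boolean; retryAfterSec: number; }

// In-memory stand-in; keys follow the rate:<tenantId>:<route> pattern.
class TokenBucketLimiter {
  private buckets = new Map<string, BucketState>();
  private ratePerMin: number; // sustained refill rate (hypothetical name)
  private burst: number;      // bucket capacity (hypothetical name)

  constructor(ratePerMin: number, burst: number) {
    this.ratePerMin = ratePerMin;
    this.burst = burst;
  }

  take(tenantId: string, route: string, nowMs: number): Decision {
    const key = `rate:${tenantId}:${route}`;
    const refillPerMs = this.ratePerMin / 60_000;
    const prev = this.buckets.get(key) ?? { tokens: this.burst, updatedAtMs: nowMs };
    // Refill for the elapsed time, capped at the burst capacity.
    const elapsed = Math.max(0, nowMs - prev.updatedAtMs);
    const tokens = Math.min(this.burst, prev.tokens + elapsed * refillPerMs);
    if (tokens >= 1) {
      this.buckets.set(key, { tokens: tokens - 1, updatedAtMs: nowMs });
      return { allowed: true, retryAfterSec: 0 };
    }
    this.buckets.set(key, { tokens, updatedAtMs: nowMs });
    // Time until one whole token is available, rounded up to seconds.
    const waitMs = (1 - tokens) / refillPerMs;
    return { allowed: false, retryAfterSec: Math.ceil(waitMs / 1000) };
  }
}

// 60 requests/min sustained with a burst of 2; the third immediate call is rejected.
const limiter = new TokenBucketLimiter(60, 2);
for (let i = 0; i < 3; i++) {
  console.log(limiter.take("tenant-a", "/api/expenses", 0));
}
```

Passing the clock in as `nowMs` keeps the limiter deterministic under test. With Redis, the read-refill-write sequence would need to be atomic, for example in a Lua script.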
3) Usage Metering Pipeline
- Emit normalized usage events at boundaries (HTTP, Kafka/RabbitMQ consume, DB writes):
  { "ts": "...", "tenantId": "...", "metric": "api_call", "value": 1, "attrs": { "route": "/api/expenses" } }
- Ingest → buffer (Kafka topic usage_events) → aggregator job writes to Postgres usage_hourly/usage_daily.
- Late arrivals handled via upserts (hour bucket).
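The hour-bucket upsert can be illustrated with a small in-memory fold. The sketch assumes buckets are UTC hours and that the upsert is additive, in the spirit of an INSERT ... ON CONFLICT DO UPDATE that adds the incoming value. The `Map` stands in for the usage_hourly table, and `hourBucket` and `upsertHourly` are hypothetical names.

```typescript
interface UsageEvent {
  ts: string; // ISO-8601 timestamp
  tenantId: string;
  metric: string;
  value: number;
  attrs?: Record<string, string>;
}

// Truncate a timestamp to the start of its UTC hour.
function hourBucket(ts: string): string {
  const d = new Date(ts);
  d.setUTCMinutes(0, 0, 0);
  return d.toISOString();
}

// Stand-in for the usage_hourly table, keyed by (tenantId, metric, hour).
type HourlyTable = Map<string, number>;

// Additive upsert: create the bucket if missing, otherwise add to it.
function upsertHourly(table: HourlyTable, ev: UsageEvent): void {
  const key = `${ev.tenantId}|${ev.metric}|${hourBucket(ev.ts)}`;
  table.set(key, (table.get(key) ?? 0) + ev.value);
}

const table: HourlyTable = new Map();
const events: UsageEvent[] = [
  { ts: "2025-09-01T10:05:00Z", tenantId: "t1", metric: "api_call", value: 1 },
  { ts: "2025-09-01T11:01:00Z", tenantId: "t1", metric: "api_call", value: 1 },
  // Late arrival: lands in the 10:00 bucket even though 11:00 is already open.
  { ts: "2025-09-01T10:59:59Z", tenantId: "t1", metric: "api_call", value: 1 },
];
for (const ev of events) upsertHourly(table, ev);
console.log(Array.from(table.entries()));
```

An additive upsert handles late events but double-counts redelivered ones. The event shape shown above has no event id, so deduplication would need an extra field or offset tracking in the consumer.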
4) Admin & Tenant APIs
- GET /api/admin/tenants/:id/usage?from&to&granularity=day|hour
- POST /api/admin/tenants/:id/limits (override)
- GET /api/tenants/me/usage/summary (self-serve visibility)
- 429 + structured error on limit breach with Retry-After.
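A limit breach response might be assembled as below for a daily quota tracked under a key of the form q:<tenantId>:<metric>:YYYYMMDD. The error body fields, the helper names, and the assumption that daily quotas reset at UTC midnight are illustrative and not specified by this proposal. Retry-After is given in whole seconds.

```typescript
interface LimitResponse {
  status: number;
  headers: Record<string, string>;
  body: { error: string; metric: string; limit: number; used: number; retryAfterSec: number };
}

// Daily counter key following the q:<tenantId>:<metric>:YYYYMMDD pattern (UTC day).
function quotaKey(tenantId: string, metric: string, now: Date): string {
  const day = now.toISOString().slice(0, 10).replace(/-/g, "");
  return `q:${tenantId}:${metric}:${day}`;
}

// Assumption: daily quotas reset at the next UTC midnight.
function secondsUntilUtcMidnight(now: Date): number {
  const next = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.ceil((next - now.getTime()) / 1000);
}

// Returns null while under quota, otherwise a 429 description with Retry-After.
function checkQuota(metric: string, used: number, limit: number, now: Date): LimitResponse | null {
  if (used < limit) return null;
  const retryAfterSec = secondsUntilUtcMidnight(now);
  return {
    status: 429,
    headers: { "Retry-After": String(retryAfterSec) },
    body: { error: "quota_exceeded", metric, limit, used, retryAfterSec }, // hypothetical shape
  };
}

const at = new Date("2025-09-30T23:59:30Z");
console.log(quotaKey("t1", "budgets", at));
console.log(checkQuota("budgets", 50, 50, at));
```

In an Express handler the returned object would map onto `res.status(...).set(...).json(...)`.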
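5) Billing Exports
- Nightly job: materialize billable usage by tenant/metric → s3://billing-exports/YYYY/MM/tenantId.csv
- CLI: budget-manager usage export --period 2025-09 --dest s3://... --format parquet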
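As a rough sketch of the CSV side of a billing export, the code below builds an object path of the form s3://billing-exports/YYYY/MM/tenantId.csv and serializes daily usage rows. The column set, the row type, and the helper names are assumptions. Upload and Parquet output are left out.

```typescript
interface DailyUsageRow { tenantId: string; day: string; metric: string; value: number; }

// Object path following the s3://billing-exports/YYYY/MM/tenantId.csv layout.
function exportUri(period: string, tenantId: string): string {
  const [year, month] = period.split("-"); // period is "YYYY-MM"
  return `s3://billing-exports/${year}/${month}/${tenantId}.csv`;
}

// Quote a field only when it contains a comma, quote, or line break.
function csvField(value: string | number): string {
  const s = String(value);
  return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
}

function toCsv(rows: DailyUsageRow[]): string {
  const header = "tenant_id,day,metric,value"; // hypothetical column set
  const lines = rows.map(r => [r.tenantId, r.day, r.metric, r.value].map(csvField).join(","));
  return [header, ...lines].join("\n") + "\n";
}

const rows: DailyUsageRow[] = [
  { tenantId: "t1", day: "2025-09-01", metric: "api_call", value: 42 },
  { tenantId: "t1", day: "2025-09-01", metric: "events", value: 1300 },
];
console.log(exportUri("2025-09", "t1"));
console.log(toCsv(rows));
```

Keeping serialization separate from the storage client makes the export contents easy to check in e2e tests.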
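6) Observability
- Prometheus counters/gauges: tenant_api_requests_total{tenant,route}, tenant_rate_limited_total{tenant,route}, tenant_quota_consumed{tenant,metric}, usage_aggregator_lag_seconds
- Grafana dashboards + alerts on over-limit rates & aggregator lag.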
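To make the label scheme concrete, here is a dependency-free sketch of a labeled counter rendered in the Prometheus text exposition format. A real service would more likely use an existing Prometheus client library. The `LabeledCounter` class is hypothetical and does not escape special characters in label values.

```typescript
// Minimal labeled counter, for illustration only.
class LabeledCounter {
  private values = new Map<string, number>();
  readonly name: string;

  constructor(name: string) {
    this.name = name;
  }

  inc(labels: Record<string, string>, by = 1): void {
    // Sort label names so the same label set always maps to the same series.
    const rendered = Object.keys(labels).sort().map(k => `${k}="${labels[k]}"`).join(",");
    this.values.set(rendered, (this.values.get(rendered) ?? 0) + by);
  }

  // One line per series: name{label="value",...} value
  render(): string {
    return Array.from(this.values.entries())
      .map(([labels, v]) => `${this.name}{${labels}} ${v}`)
      .join("\n");
  }
}

const requests = new LabeledCounter("tenant_api_requests_total");
requests.inc({ tenant: "t1", route: "/api/expenses" });
requests.inc({ tenant: "t1", route: "/api/expenses" });
requests.inc({ tenant: "t2", route: "/api/budgets" });
console.log(requests.render());
```

Per-tenant labels multiply series count, so high tenant counts may call for aggregating or sampling before export.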
Acceptance Criteria
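Tasks
- Data models: tenants, plans, usage_hourly, usage_daily.
- Redis keys & Lua (optional) for atomic rate-limit + quota counters.
- Express middleware for rate limiting + quota checks (config via plan/override).
- Usage event producer wrappers (HTTP, Kafka/RabbitMQ consumers, services layer).
- Aggregator worker (Node) with idempotent upserts and retry w/ jitter.
- Billing export job (CSV/Parquet) + S3/GCS client + retention policy.
- Admin & Tenant REST endpoints + OpenAPI/Swagger docs.
- Grafana dashboards + Prometheus alerts.
- CLI commands:
  budget-manager usage show --tenant <id> --since 7d
  budget-manager usage export --period <YYYY-MM>
  budget-manager limits set --tenant <id> --plan pro
- Tests: unit (limits/quota), integration (burst traffic), e2e (export contents).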
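For the aggregator's retry with jitter, one common option is exponential backoff with full jitter, sketched below. The base delay, cap, attempt count, and function names are assumptions rather than values from this proposal.

```typescript
// Full-jitter backoff: delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 10_000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

// Retry an async operation, sleeping a jittered delay between failed attempts.
async function withRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // No sleep after the final attempt; the error is rethrown instead.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}

// Delay ceilings grow 100, 200, 400, ... ms until they hit the cap.
console.log([0, 1, 2, 3].map(a => backoffDelayMs(a, 100, 10_000, () => 0.5)));
```

Jitter spreads retries from many workers over time, which matters when a shared dependency such as Postgres or Kafka recovers from an outage.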
Risks & Mitigations
- Hot-key contention in Redis → shard keys with route hash; use sliding window when needed.
- Clock skew for hourly/daily buckets → bucket by server time and tolerate late upserts.
- False positives on shared tenants → per-route + per-metric tuning in plans.