Feature: Tenant-Aware Quotas, Rate Limits, and Usage Metering (+ Billing Exports) #11

@hoangsonww

Summary
Introduce first-class multi-tenant controls across the platform: per-tenant quotas (budgets, API calls, events), adaptive rate limiting, metering pipelines, and exportable billing reports. This enables fair usage, protects shared infra, and unlocks paid tiers.


Goals

  • Per-tenant quotas (e.g., max budgets and expenses created per day, storage, Kafka events/min).
  • Smarter rate limits: token-bucket with burst, sliding-window for abuse spikes.
  • Usage metering: durable event stream → aggregated daily/hourly metrics.
  • Admin UX & APIs to set plans, limits, and view overages.
  • Exports: CSV/Parquet to S3/GCS (or local) for finance tools (Stripe/BillingOps).
  • Observability: Prometheus metrics + Grafana dashboards for capacity planning.

Non-Goals

  • Payments/checkout (can integrate later).
  • Price modeling: only usage collection and enforcement are in scope here.

Design Overview

1) Tenant & Plan Model

  • tenants collection/table: { id, name, plan_id, status, createdAt, … }
  • plans: { id, name, limits: { api_rpm, events_per_day, budgets, storage_mb, … } }
  • overrides: optional per-tenant limit overrides.
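To make the plan/override relationship concrete, here is a minimal sketch of how effective limits might be resolved. The field names come from the model above; the field-by-field merge rule, the `effectiveLimits` helper, and the sample numbers are assumptions for illustration, not a decided design.

```typescript
// Limit fields named in the plan model above; the real set may be larger.
interface Limits {
  api_rpm: number;
  events_per_day: number;
  budgets: number;
  storage_mb: number;
}

interface Plan {
  id: string;
  name: string;
  limits: Limits;
}

// Assumption: an override replaces individual fields and everything else
// falls back to the plan's limits.
function effectiveLimits(plan: Plan, override: Partial<Limits> = {}): Limits {
  return { ...plan.limits, ...override };
}

// Hypothetical plan values, for illustration only.
const examplePlan: Plan = {
  id: "pro",
  name: "Pro",
  limits: { api_rpm: 600, events_per_day: 100000, budgets: 50, storage_mb: 1024 },
};

const resolved = effectiveLimits(examplePlan, { api_rpm: 1200 });
```

Resolving limits in one place keeps the middleware unaware of whether a value came from the plan or from an override.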

2) Gateway Enforcement

  • New rate-limit middleware (Express):

    • Redis token-bucket for RPM/RPS (key: rate:<tenantId>:<route>)
    • Sliding window for abuse spikes (key: sw:<tenantId>)
  • Quota checks on mutating endpoints (budgets/expenses/transactions).

    • Redis counters with daily TTL (q:<tenantId>:<metric>:YYYYMMDD).
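The token-bucket decision and the key layout above could be sketched roughly as follows. This version keeps bucket state in plain objects so the logic is easy to follow and test; in Redis the same read-modify-write would have to run atomically (the Tasks section mentions Lua for that). The function names, the burst parameter, and the use of UTC for the daily key are assumptions.

```typescript
interface Bucket {
  tokens: number;     // tokens currently available
  updatedMs: number;  // last time the bucket was refilled
}

interface Decision {
  allowed: boolean;
  bucket: Bucket;         // state to store back under the rate key
  retryAfterSec: number;  // 0 when allowed
}

// Refill by elapsed time, cap at `burst`, then try to spend one token.
function takeToken(prev: Bucket | undefined, nowMs: number, ratePerMin: number, burst: number): Decision {
  const refillPerMs = ratePerMin / 60000;
  const start = prev ?? { tokens: burst, updatedMs: nowMs };
  const elapsed = Math.max(0, nowMs - start.updatedMs);
  const tokens = Math.min(burst, start.tokens + elapsed * refillPerMs);
  if (tokens >= 1) {
    return { allowed: true, bucket: { tokens: tokens - 1, updatedMs: nowMs }, retryAfterSec: 0 };
  }
  // Time until one full token is available again, rounded up to whole seconds.
  const retryAfterSec = Math.ceil((1 - tokens) / refillPerMs / 1000);
  return { allowed: false, bucket: { tokens, updatedMs: nowMs }, retryAfterSec };
}

// Key builders following the patterns listed above.
function rateKey(tenantId: string, route: string): string {
  return `rate:${tenantId}:${route}`;
}

// Assumption: the YYYYMMDD suffix is derived in UTC.
function quotaKey(tenantId: string, metric: string, day: Date): string {
  const yyyymmdd = day.toISOString().slice(0, 10).replace(/-/g, "");
  return `q:${tenantId}:${metric}:${yyyymmdd}`;
}
```

Returning `retryAfterSec` from the same call lets the middleware fill the Retry-After header without a second lookup.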

3) Usage Metering Pipeline

  • Emit normalized usage events at boundaries (HTTP, Kafka/RabbitMQ consume, DB writes):

    { "ts": "...", "tenantId": "...", "metric": "api_call", "value": 1, "attrs": { "route": "/api/expenses" } }
  • Ingest → buffer (Kafka topic usage_events) → aggregator job writes to Postgres usage_hourly/usage_daily.

  • Late arrivals are handled via upserts into the hour bucket of the event's timestamp.
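As an illustration of the hour-bucket upsert, the sketch below aggregates events into an in-memory map keyed by tenant, metric, and hour. The map stands in for the `usage_hourly` table; in Postgres the equivalent would presumably be an `INSERT ... ON CONFLICT ... DO UPDATE` that adds to the stored value. Bucketing in UTC and the helper names are assumptions.

```typescript
// Event shape from the example above.
interface UsageEvent {
  ts: string;
  tenantId: string;
  metric: string;
  value: number;
  attrs?: Record<string, string>;
}

// Truncate an ISO timestamp to the start of its hour (UTC assumed).
function hourBucket(ts: string): string {
  const d = new Date(ts);
  d.setUTCMinutes(0, 0, 0);
  return d.toISOString();
}

// Additive upsert: create the bucket if missing, otherwise add to it.
// A late event simply lands in the bucket of its own timestamp.
function upsertHourly(table: Map<string, number>, ev: UsageEvent): void {
  const key = `${ev.tenantId}|${ev.metric}|${hourBucket(ev.ts)}`;
  table.set(key, (table.get(key) ?? 0) + ev.value);
}

const usageHourly = new Map<string, number>();
const sampleEvents: UsageEvent[] = [
  { ts: "2025-09-01T10:05:00Z", tenantId: "t1", metric: "api_call", value: 1 },
  { ts: "2025-09-01T10:59:59Z", tenantId: "t1", metric: "api_call", value: 1 },
  { ts: "2025-09-01T11:01:00Z", tenantId: "t1", metric: "api_call", value: 1 },
  { ts: "2025-09-01T10:30:00Z", tenantId: "t1", metric: "api_call", value: 1 }, // late arrival
];
sampleEvents.forEach((ev) => upsertHourly(usageHourly, ev));
```

Note that an additive upsert double-counts redelivered events; making it idempotent would need some event identifier, which the event shape shown here does not include.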

4) Admin & Tenant APIs

  • GET /api/admin/tenants/:id/usage?from&to&granularity=day|hour
  • POST /api/admin/tenants/:id/limits (override)
  • GET /api/tenants/me/usage/summary (self-serve visibility)
  • 429 with a Retry-After header and a structured error body on rate-limit breach.
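One possible shape for the 429 response is sketched below. The issue does not define the error body, so the `code`, `message`, and other field names are hypothetical; only the 429 status and the Retry-After header come from the text above.

```typescript
interface LimitResponse {
  status: number;
  headers: Record<string, string>;
  body: {
    error: {
      code: string;           // hypothetical machine-readable code
      message: string;
      tenantId: string;
      route: string;
      retryAfterSec: number;
    };
  };
}

// Build the response for a rate-limit breach. Retry-After is sent as a
// whole number of seconds, so fractional waits are rounded up.
function rateLimitedResponse(tenantId: string, route: string, retryAfterSec: number): LimitResponse {
  const seconds = Math.max(1, Math.ceil(retryAfterSec));
  return {
    status: 429,
    headers: { "Retry-After": String(seconds) },
    body: {
      error: {
        code: "rate_limited",
        message: `Rate limit exceeded for ${route}; retry in ${seconds}s.`,
        tenantId,
        route,
        retryAfterSec: seconds,
      },
    },
  };
}
```

In an Express handler this could be sent with something like `res.status(r.status).set(r.headers).json(r.body)`.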

5) Billing Exports

  • Nightly job: materialize billable usage by tenant/metric →

    • write CSV/Parquet to s3://billing-exports/YYYY/MM/<tenantId>.csv (or .parquet)
    • optional webhook to billing system.
  • CLI:

    • budget-manager usage export --period 2025-09 --dest s3://... --format parquet
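A rough sketch of how the export job might derive the object path from `--period` and serialize rows to CSV follows. The column names, the `BillableRow` shape, and choosing the file extension from the format are assumptions; Parquet output would need a dedicated library and is not shown.

```typescript
// Hypothetical row shape for billable usage.
interface BillableRow {
  tenantId: string;
  metric: string;
  day: string;   // YYYY-MM-DD
  value: number;
}

// Map a YYYY-MM period to the path layout described above.
function exportKey(period: string, tenantId: string, format: "csv" | "parquet"): string {
  const m = /^(\d{4})-(\d{2})$/.exec(period);
  if (!m) throw new Error(`invalid period: ${period}`);
  return `s3://billing-exports/${m[1]}/${m[2]}/${tenantId}.${format}`;
}

// Quote a field only when it contains a comma, quote, or line break.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(rows: BillableRow[]): string {
  const header = "tenant_id,metric,day,value";
  const lines = rows.map((r) =>
    [r.tenantId, r.metric, r.day, String(r.value)].map(csvField).join(",")
  );
  return [header, ...lines].join("\n") + "\n";
}
```

Validating the period up front means a malformed `--period` fails before any object is written.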

6) Observability

  • Prometheus counters/gauges:

    • tenant_api_requests_total{tenant,route}
    • tenant_rate_limited_total{tenant,route}
    • tenant_quota_consumed{tenant,metric}
    • usage_aggregator_lag_seconds
  • Grafana dashboards + alerts on over-limit rates & aggregator lag.
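To show what the labeled counters would look like on the wire, here is a tiny stand-in counter that renders the Prometheus text exposition format. A real service would more likely use a client library such as prom-client; the `LabeledCounter` class is a hypothetical illustration, not that library's API.

```typescript
// Escape backslash, double quote, and newline inside a label value.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

class LabeledCounter {
  name: string;
  labelNames: string[];
  values: Map<string, number> = new Map();

  constructor(name: string, labelNames: string[]) {
    this.name = name;
    this.labelNames = labelNames;
  }

  // Increment the series identified by the given label values.
  inc(labels: Record<string, string>, by: number = 1): void {
    const key = this.labelNames
      .map((n) => `${n}="${escapeLabel(labels[n] ?? "")}"`)
      .join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Render all series in text exposition format.
  render(): string {
    const lines: string[] = [`# TYPE ${this.name} counter`];
    this.values.forEach((v, k) => lines.push(`${this.name}{${k}} ${v}`));
    return lines.join("\n") + "\n";
  }
}

const apiRequests = new LabeledCounter("tenant_api_requests_total", ["tenant", "route"]);
apiRequests.inc({ tenant: "t1", route: "/api/expenses" });
apiRequests.inc({ tenant: "t1", route: "/api/expenses" });
```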


Acceptance Criteria

  • Per-tenant rate limiting enforced on all public routes; returns 429 with Retry-After.
  • Quota enforcement for budgets/expenses/transactions; returns 403 with actionable message when exceeded.
  • Usage events produced for HTTP calls, async consumes, and key DB mutations.
  • Hourly/Daily aggregates persisted; backfills supported for late events.
  • Admin & Tenant APIs deliver correct usage data (unit/integration tests included).
  • Exports (CSV & Parquet) generated on schedule and via CLI; sample file validated.
  • Dashboards show per-tenant usage, rate-limited counts, and export job health.

Tasks

  • Data models: tenants, plans, usage_hourly, usage_daily.

  • Redis keys & Lua (optional) for atomic rate-limit + quota counters.

  • Express middleware for rate limiting + quota checks (config via plan/override).

  • Usage event producer wrappers (HTTP, Kafka/RabbitMQ consumers, services layer).

  • Aggregator worker (Node) with idempotent upserts and retry w/ jitter.

  • Billing export job (CSV/Parquet) + S3/GCS client + retention policy.

  • Admin & Tenant REST endpoints + OpenAPI/Swagger docs.

  • Grafana dashboards + Prometheus alerts.

  • CLI commands:

    • budget-manager usage show --tenant <id> --since 7d
    • budget-manager usage export --period <YYYY-MM>
    • budget-manager limits set --tenant <id> --plan pro
  • Tests: unit (limits/quota), integration (burst traffic), e2e (export contents).
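For the aggregator's "retry w/ jitter" task, one common approach is exponential backoff with full jitter, sketched below. The base delay, cap, attempt count, and function names are assumptions; the random source is injectable so the delay calculation can be tested deterministically.

```typescript
// Delay before retry number `attempt` (0-based): a random value between 0
// and min(capMs, baseMs * 2^attempt).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 200,
  capMs: number = 30000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Run `fn`, retrying on failure with jittered backoff, and rethrow the
// last error once maxAttempts is reached.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts: number = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Jitter spreads retries from many workers over time, so a shared dependency that is recovering is not hit by all of them at once.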


Risks & Mitigations

  • Hot-key contention in Redis → shard keys with route hash; use sliding window when needed.
  • Clock skew for hourly/daily buckets → bucket by server time and tolerate late upserts.
  • False positives on shared tenants → per-route + per-metric tuning in plans.
