Epic: Idempotent & Reliable Operations (Idempotency Keys + Outbox + Retries + DLQs) #10

@hoangsonww

Description

Summary

Introduce end-to-end idempotency and delivery guarantees across REST/GraphQL/gRPC + async pipelines (RabbitMQ/Kafka) to prevent duplicated charges/orders/expenses, message loss, or out-of-order effects. Implement the Transactional Outbox + Inbox patterns, idempotency keys on write APIs, retries with backoff, and dead-letter queues (DLQs)—with full OpenTelemetry traces and metrics.

Why

  • Current microservices (budgets, expenses, orders, notifications, tasks) span MongoDB, PostgreSQL, Redis, RabbitMQ, Kafka. Network glitches and retries can produce duplicates or lost messages.
  • Payment-like flows (orders/transactions) need “exactly-once effects” even when at-least-once delivery is used.
  • This aligns with existing architecture (queues, caches, multiple DBs) and hardens production behavior without changing business features.

Scope

  1. HTTP/gRPC Idempotency Keys
  • Write endpoints (POST /api/orders, POST /api/expenses, POST /api/transactions) accept header Idempotency-Key (UUIDv4).
  • Server stores request hash + normalized response for the key; repeated calls return the same response (HTTP 200) without re-executing side effects.
  • Storage: Redis primary (fast), fallback Mongo collection idempotency_keys with TTL.
  2. Transactional Outbox (producer side)
  • For services that publish events (orders/expenses/transactions), write domain changes + event record in the same DB transaction (Mongo: multi-document transaction via session; Postgres: regular tx).
  • A background outbox dispatcher reads pending events, publishes to RabbitMQ/Kafka, marks them delivered with a monotonic sequence, and propagates trace context.
  3. Inbox/Consumer Idempotency (consumer side)
  • Consumers record processed message IDs (Kafka topic+partition+offset or message key / RabbitMQ messageId) in an inbox store (Redis set + DB table) to achieve exactly-once processing semantics.
  4. Retry & Backoff
  • Standardize the retry policy: exponential backoff with jitter (e.g., base 250 ms, max 30 s, cap of 7 tries), then route to the DLQ (dead-letter exchange / Kafka DLQ topic) with the error reason.
  5. DLQ Handling
  • Unified DLQ topics/queues per service, plus a small CLI to replay single messages or batches after fixes:
    budget-manager dlq:peek --service expenses
    budget-manager dlq:replay --service expenses --since 15m
  6. Ordering & Dedup
  • Prefer message keys (e.g., orderId) for Kafka topics to preserve partition ordering.
  • For RabbitMQ, group by routing keys; consumers perform per-key serialization using a Redis lock (short TTL).
  7. Observability
  • OpenTelemetry traces (HTTP/gRPC → outbox publish → consumer handle) with W3C trace headers.
  • Prometheus metrics: outbox_pending, outbox_published_total, consumer_processed_total, consumer_dedup_hits_total, dlq_messages_total, idempotency_cache_hits_total.
  • Logs include idempotencyKey, eventId, sagaId (when applicable).

Acceptance Criteria

  • ✅ Replaying the same POST /api/orders with an identical payload and Idempotency-Key returns the same response; no duplicate order/transaction rows; the event is published to the queue only once.
  • ✅ Service restarts do not lose pending events (outbox drains on boot) and do not re-apply already processed messages (inbox dedup).
  • ✅ DLQs capture permanently failing messages; CLI can replay single or batched messages.
  • ✅ Grafana dashboard shows the above metrics; traces show end-to-end spans across producers/consumers.
  • ✅ Load test proves zero duplicates across 10k idempotent writes with induced failures (network cuts, consumer crashes).
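
The induced-failure criteria above exercise the retry policy (exponential backoff with jitter, then DLQ). A minimal sketch of that schedule, assuming full jitter; the helper names are illustrative, not part of the codebase:

```javascript
const BASE_MS = 250;   // base delay from the policy
const CAP_MS = 30000;  // 30 s ceiling
const MAX_TRIES = 7;   // attempt budget before dead-lettering

// Delay before retry `attempt` (1-based): random ("full jitter") over an
// exponentially growing ceiling, capped at CAP_MS.
function backoffDelay(attempt, rand = Math.random) {
  const ceiling = Math.min(CAP_MS, BASE_MS * 2 ** (attempt - 1));
  return Math.floor(rand() * ceiling);
}

// Once the attempt budget is spent, route the message to the DLQ instead.
function shouldDeadLetter(attempt) {
  return attempt >= MAX_TRIES;
}
```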

Design Notes

  • Where to put outbox/inbox:

    • Mongo services: collections outbox_events, inbox_messages.
    • Postgres services (transactions): tables outbox_events, inbox_messages.
  • Schemas (sketch):

    // outbox_events
    {
      "_id": "...", "aggType": "order", "aggId": "ord_123",
      "eventType": "ORDER_CREATED", "payload": {...},
      "createdAt": "...", "publishedAt": null,
      "traceparent": "00-...-...", "tracestate": ""
    }
    
    // inbox_messages
    {
      "_id": "kafka:topic@partition@offset" | "amqp:exchange@routingKey@msgId",
      "firstProcessedAt": "...", "lastProcessedAt": "..."
    }
    
    // idempotency_keys
    {
      "_id": "uuid-key",
      "requestHash": "sha256(...normalized body + path + auth...)",
      "response": {...}, "statusCode": 200,
      "createdAt": "...", "ttlAt": "..."
    }
  • CLI additions: extend cli.js with dlq:peek, dlq:replay, outbox:stats.
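
The dispatcher's core loop over the outbox_events sketch can be illustrated as below. The in-memory array and `publish` callback stand in for the real collection and the RabbitMQ/Kafka producer; batching and backpressure are omitted.

```javascript
// Drain unpublished outbox events in insertion order, marking each published
// only after the broker acknowledges it. A crash between publish and mark
// re-publishes on the next drain (at-least-once), which the consumer-side
// inbox dedup then absorbs.
async function drainOutbox(outbox, publish) {
  const pending = outbox.filter((e) => e.publishedAt === null);
  for (const event of pending) {
    await publish(event);
    event.publishedAt = new Date().toISOString();
  }
  return pending.length;
}
```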

Tasks

  • HTTP layer: Add Idempotency-Key middleware (REST & GraphQL mutations); Redis first, Mongo fallback.

  • Normalize requests (stable JSON stringify, redact secrets) to generate requestHash.

  • Outbox write path: Wrap write endpoints to append outbox event in same tx/session.

  • Outbox dispatcher: new worker (pm2/k8s CronJob) with backpressure & batch publishes.

  • Consumers: add inbox check + record; ensure handler is side-effect safe on re-delivery.

  • Retry policy & DLQ for both RabbitMQ (DLX) and Kafka (DLQ topic).

  • Tracing: add OpenTelemetry SDK, instrument producers/consumers, propagate headers.

  • Metrics: expose Prometheus counters/gauges; dashboards JSON in /docs/observability/.

  • CLI: dlq:peek, dlq:replay, outbox:stats.

  • Tests:

    • Unit: idempotency cache hits/misses; outbox format; inbox dedup.
    • Integration: induced failures (kill consumer, drop network) → no dup effects, DLQ populated.
    • Load: 10k requests with 1% random failures → 0 duplicates.
  • Docs: /docs/reliability.md covering patterns, env vars, runbooks.

Risks & Mitigations

  • Cross-DB transactions (Mongo + Postgres): Keep outbox next to the writing service DB; avoid cross-DB tx.
  • Cardinality of idempotency keys: TTL + size caps, evict LRU in Redis.
  • Hot partitions: key by aggregate id; add per-key worker pools to avoid head-of-line blocking.
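
The per-key serialization mentioned above (RabbitMQ consumers, hot-partition mitigation) can be sketched with a promise chain per aggregate id: messages for the same key run sequentially while different keys proceed in parallel. In production this would be a Redis lock with a short TTL; the in-process `Map` here only illustrates the ordering guarantee.

```javascript
// One promise chain per key; enqueueing appends to the chain for that key.
const chains = new Map();

// Run `task` after all previously enqueued tasks for the same key finish.
function runSerializedByKey(key, task) {
  const prev = chains.get(key) || Promise.resolve();
  const next = prev.then(() => task());
  chains.set(key, next.catch(() => {})); // swallow failures so later messages still run
  return next;
}
```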

Assignee: @hoangsonww
Milestone: v1.2.0
Related: CI/CD, Kafka/RabbitMQ configs, Prometheus/Grafana setup

