
[Feature] Generic outbound webhook delivery infrastructure #218

@Polliog

Description


Feature Description

Build a reusable outbound webhook delivery system that handles HMAC payload signing, retry with exponential backoff, dead letter queue, SSRF protection, and delivery logging. Existing alert notification code migrates onto it. The same primitive becomes available for future features that need to deliver events to external systems.

Problem/Use Case

Outbound HTTP delivery exists today inside the alert notification path, but it's specific to that one feature. Several upcoming features will need the same plumbing: digest reports (#155), workflow integrations, custom event subscriptions. Building each one independently would duplicate retry logic, signing, error handling, and SSRF defense — and the result would be inconsistent quality (the first one done well, the rest "good enough").

There's also a security dimension: outbound HTTP from a server-side application is an SSRF vector if not handled carefully. Centralizing it in one well-tested module is much safer than scattering it across features.

Proposed Solution

A webhookDispatcher module:

webhookDispatcher.enqueue({
  url: string,
  payload: unknown,
  organizationId: string,
  eventType: string,
  signingSecret?: string,
  headers?: Record<string, string>,
  metadata?: Record<string, unknown>,
})
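For illustration, the options above map onto a type like the following, combined with the deterministic BullMQ job-ID scheme described under Implementation Details. The type name, the `jobId` helper, and the assumption that an event id accompanies each enqueue are sketches, not the final API:

```typescript
// Sketch only: field names mirror the enqueue signature above; the
// job-id scheme follows the Implementation Details section.
interface EnqueueOptions {
  url: string;
  payload: unknown;
  organizationId: string;
  eventType: string;
  signingSecret?: string;
  headers?: Record<string, string>;
  metadata?: Record<string, unknown>;
}

// Deterministic BullMQ job ID, so retries triggered by upstream errors
// deduplicate and receivers can rely on a stable event id.
function jobId(opts: EnqueueOptions, eventId: string): string {
  return `webhook:${opts.organizationId}:${opts.eventType}:${eventId}`;
}
```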

What it does:

  1. SSRF protection — full DNS resolution of the target host, reject if it resolves to a private/loopback/link-local address. Re-resolve at request time to defend against DNS rebinding.
  2. HMAC signing — when a signing secret is provided, sign the body with HMAC-SHA256 and add X-Logtide-Signature and X-Logtide-Timestamp headers. Standard, documented, easy for receivers to verify.
  3. Retry with backoff — exponential backoff (e.g. 1s, 5s, 25s, 2m, 10m), max attempts configurable, only retry on transient failures (5xx, network errors, timeouts).
  4. Dead letter queue — after final failure, the delivery lands in a DLQ table with full request/response info for inspection and manual replay.
  5. Delivery log — every attempt (success or failure) is recorded with timestamp, status code, duration, response excerpt. Visible in a dashboard view.
  6. Per-organization concurrency limit — prevent one tenant's slow webhook receiver from saturating the worker pool.
Backed by BullMQ with deterministic job IDs to make idempotency easy on the consumer side. Each event has a stable id; receivers can deduplicate.
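A minimal sketch of the private-range check from step 1, assuming the guard runs against already-resolved A records. IPv4 only here; a real implementation also has to cover AAAA records, IPv6 ULA/link-local, and the request-time IP pinning described below:

```typescript
import { isIP } from "node:net";

// Blocked IPv4 ranges: RFC 1918 private space plus loopback,
// link-local, and 0.0.0.0/8. Sketch, not an exhaustive list.
const BLOCKED_V4: Array<[string, number]> = [
  ["10.0.0.0", 8],
  ["172.16.0.0", 12],
  ["192.168.0.0", 16],
  ["127.0.0.0", 8],
  ["169.254.0.0", 16],
  ["0.0.0.0", 8],
];

// Convert a dotted-quad IPv4 string to an unsigned 32-bit integer.
function v4ToInt(ip: string): number {
  return ip.split(".").reduce((acc, octet) => (acc << 8) + Number(octet), 0) >>> 0;
}

// Fail closed: anything that is not a well-formed public IPv4 address
// is treated as blocked.
function isBlockedV4(ip: string): boolean {
  if (isIP(ip) !== 4) return true;
  const addr = v4ToInt(ip);
  return BLOCKED_V4.some(([base, bits]) => {
    const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
    return ((addr & mask) >>> 0) === ((v4ToInt(base) & mask) >>> 0);
  });
}
```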

Alternatives Considered

  • Continue scattering webhook delivery per feature. Worse SSRF posture, inconsistent retry behavior, no central observability. Rejected.
  • Use an external service (e.g. Hookdeck, Svix). Adds a third-party dependency that breaks self-hosted/air-gapped deployments and contradicts the privacy-first philosophy of the project.
  • Build a minimal version now and add observability/DLQ later. Tempting, but the migration cost of moving alert notifications onto the new dispatcher is paid once. Better to land it complete.

Implementation Details (Optional)

  • SSRF guard: resolve DNS, check that no resolved A/AAAA record is in private space (RFC 1918, ULA, link-local, loopback, multicast, reserved). For the actual HTTP request, pin to the resolved IP to prevent TOCTOU rebinding. Allow an opt-in allowPrivateNetworks flag for trusted on-prem deployments.
  • HMAC signing: X-Logtide-Signature: t=<unix>,v1=<hex>, where the signed string is <unix>.<body>. Document the verification snippet for receivers.
  • Job IDs: webhook:${organizationId}:${eventType}:${eventId} — deterministic, deduplicates retries triggered by upstream errors.
  • DLQ: a separate table webhook_deliveries_failed with the full job, last response, last error. A dashboard view lists DLQ entries per org with a "retry" button (which re-enqueues with a fresh job).
  • Delivery log: capacity-bounded — keep last N attempts per webhook (configurable, default 1000). Deeper history would need a separate storage decision.
  • Existing alert notification code becomes the first consumer. No behavioral regression — same delivery semantics, just centralized.
  • Coordinates with the lifecycle hooks issue: beforeWebhookDispatch is the right place for downstream platforms to inspect or reject deliveries.
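As a starting point for the documented verification snippet mentioned above, a receiver-side check of the proposed `t=<unix>,v1=<hex>` format (signed string `<unix>.<body>`) could look like this; the function name and the 5-minute tolerance default are assumptions:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies an X-Logtide-Signature header of the form "t=<unix>,v1=<hex>",
// where the HMAC-SHA256 input is `<unix>.<body>`. Sketch for receivers.
function verifySignature(
  header: string,        // e.g. "t=1700000000,v1=ab12..."
  rawBody: string,       // the exact bytes of the request body
  secret: string,
  toleranceSeconds = 300,
): boolean {
  const parts = Object.fromEntries(
    header.split(",").map((kv) => kv.split("=") as [string, string]),
  );
  const timestamp = Number(parts["t"]);
  const given = parts["v1"];
  if (!Number.isFinite(timestamp) || !given) return false;

  // Reject stale timestamps to limit the replay window.
  const now = Math.floor(Date.now() / 1000);
  if (Math.abs(now - timestamp) > toleranceSeconds) return false;

  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`)
    .digest("hex");

  // Constant-time comparison to avoid timing side channels.
  const a = Buffer.from(given, "hex");
  const b = Buffer.from(expected, "hex");
  return a.length === b.length && timingSafeEqual(a, b);
}
```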

Priority

  • Critical - Blocking my usage of Logtide
  • High - Would significantly improve my workflow
  • Medium - Nice to have
  • Low - Minor enhancement

Target Users

  • Operators integrating Logtide alerts with external systems (PagerDuty, Slack, custom internal tools)
  • Teams building automation around Logtide events (CI triggers, ticket creation, custom workflows)
  • Future features requiring reliable event delivery (digest reports, custom event subscriptions)

Contribution

  • I would like to work on implementing this feature

Metadata

Assignees

No one assigned

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests