feat: resilient background job retry & monitoring (#130) #651

Open
DrGalio wants to merge 1 commit into rohitdash08:main from DrGalio:feat/resilient-background-job-retry

DrGalio commented Mar 26, 2026

Summary

Implements resilient background job retry & monitoring. Closes #130.

Problem

The existing reminder system had a critical reliability issue: run_due marked sent=True regardless of whether send_reminder actually succeeded, so failed deliveries were silently lost and never retried.
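
To make the failure mode concrete, here is a minimal sketch of the pattern before and after the fix. It is not the code from routes/reminders.py: the Reminder class, the function bodies, and the assumption that send_reminder returns a truthy value on success (or raises on failure) are all hypothetical.

```python
from typing import Callable, Iterable, List


class Reminder:
    """Hypothetical stand-in for the app's reminder record."""

    def __init__(self, message: str) -> None:
        self.message = message
        self.sent = False


def run_due_buggy(reminders: Iterable[Reminder],
                  send_reminder: Callable[[Reminder], bool]) -> None:
    """Old pattern: the flag is set whatever the delivery result was."""
    for reminder in reminders:
        try:
            send_reminder(reminder)
        except Exception:
            pass  # failure is swallowed
        reminder.sent = True  # set even when delivery failed


def run_due_fixed(reminders: Iterable[Reminder],
                  send_reminder: Callable[[Reminder], bool]) -> List[Reminder]:
    """Fixed pattern: flag only on success, hand failures back for retry."""
    failed: List[Reminder] = []
    for reminder in reminders:
        try:
            delivered = bool(send_reminder(reminder))
        except Exception:
            delivered = False
        if delivered:
            reminder.sent = True
        else:
            failed.append(reminder)  # caller can enqueue these for retry
    return failed
```

In the fixed variant the returned list is what a job runner would pick up and schedule for another attempt.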

Solution

A production-ready background job runner with:

  • Retry with exponential backoff — configurable RetryPolicy (default: 3 retries, 5s base delay, 2x multiplier, 300s cap)
  • Dead-letter tracking — jobs that exhaust retries move to DEAD status for manual inspection
  • Job lifecycle model: BackgroundJob tracks attempts, errors, and metadata through PENDING → RUNNING → COMPLETED / RETRYING / DEAD
  • Monitoring endpoints:
    • GET /reminders/jobs/stats — aggregated counts by status
    • GET /reminders/jobs/dead — dead-lettered jobs for inspection
    • GET /reminders/jobs/retrying — jobs waiting for next retry
    • POST /reminders/jobs/<id>/retry — manually re-queue dead jobs
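
The retry policy and lifecycle above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of services/job_runner.py: the field names base_delay, multiplier, and max_delay, the delay_for and status_after_attempt helpers, and the reading of "3 retries" as one initial attempt plus three retries are all guesses. Only the default values and the status names come from this description.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    RETRYING = "RETRYING"
    DEAD = "DEAD"


@dataclass(frozen=True)
class RetryPolicy:
    # Defaults taken from the PR description; field names are assumed.
    max_retries: int = 3
    base_delay: float = 5.0    # seconds
    multiplier: float = 2.0
    max_delay: float = 300.0   # seconds

    def delay_for(self, retry_number: int) -> float:
        """Seconds to wait before retry `retry_number` (1-based), capped."""
        raw = self.base_delay * self.multiplier ** (retry_number - 1)
        return min(raw, self.max_delay)


def status_after_attempt(succeeded: bool, failed_attempts: int,
                         policy: RetryPolicy) -> JobStatus:
    """Pick the next status once a RUNNING attempt has finished.

    Assumes the first run is not counted as a retry, so a job is
    dead-lettered once failures exceed max_retries.
    """
    if succeeded:
        return JobStatus.COMPLETED
    if failed_attempts > policy.max_retries:
        return JobStatus.DEAD
    return JobStatus.RETRYING
```

With the default policy this gives waits of 5s, 10s, and 20s before the three retries. The 300s cap only comes into play if max_retries is raised to 7 or more.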

Changes

| File | Description |
| --- | --- |
| services/job_runner.py (new) | Core job runner with retry logic |
| models.py | Added BackgroundJob model and JobStatus enum |
| routes/reminders.py | Integrated job runner, fixed sent-boolean bug, added monitoring routes |
| db/schema.sql | Added background_jobs table with indexes |
| tests/test_job_runner.py (new) | 11 tests covering retry policy, execution, monitoring |

Tests

All 11 new tests pass:

  • Retry policy: exponential backoff, max delay cap
  • Enqueue: creates PENDING job, custom max_retries
  • Execution: success, retry on failure, dead after max retries, succeed after retries, no handler
  • Monitoring: stats, manual retry of dead jobs
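
For readers who want a picture of what the monitoring tests check, here is an in-memory sketch of stats aggregation and manual re-queueing. It is not the PR's implementation, which works against the background_jobs table. The Job dataclass, job_stats, and retry_dead_job are hypothetical names, and resetting the attempt counter on manual retry is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional

STATUSES = ("PENDING", "RUNNING", "COMPLETED", "RETRYING", "DEAD")


@dataclass
class Job:
    """Hypothetical in-memory stand-in for a BackgroundJob row."""
    id: int
    status: str = "PENDING"
    attempts: int = 0
    last_error: Optional[str] = None


def job_stats(jobs: List[Job]) -> Dict[str, int]:
    """Count jobs per status, reporting zero for statuses with no jobs."""
    counts = Counter(job.status for job in jobs)
    return {status: counts.get(status, 0) for status in STATUSES}


def retry_dead_job(jobs: List[Job], job_id: int) -> bool:
    """Re-queue a dead-lettered job; refuse anything that is not DEAD."""
    for job in jobs:
        if job.id == job_id:
            if job.status != "DEAD":
                return False
            job.status = "PENDING"
            job.attempts = 0       # assumption: manual retry starts fresh
            job.last_error = None
            return True
    return False  # unknown id
```

Refusing to re-queue jobs that are not DEAD keeps a manual retry from racing with an attempt that is already scheduled or running.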

Acceptance Criteria (from issue)

  • ✅ Production-ready implementation
  • ✅ Includes tests
  • ✅ Documentation updated (schema.sql)

Add a production-ready background job runner with retry logic, exponential backoff, dead-letter tracking, and monitoring endpoints.

Changes:

- New JobRunner service with configurable RetryPolicy and exponential backoff
- BackgroundJob model tracking full job lifecycle (PENDING/RUNNING/COMPLETED/RETRYING/DEAD)
- Fixed bug: reminder delivery now only marks sent=True on actual success
- Failed deliveries auto-retry with exponential backoff (default 3 retries)
- New monitoring endpoints: /jobs/stats, /jobs/dead, /jobs/retrying, /jobs/<id>/retry
- Manual retry support for dead-lettered jobs
- 11 new tests covering retry policy, execution, monitoring

Closes rohitdash08#130

DrGalio requested a review from rohitdash08 as a code owner on March 26, 2026 at 10:35