feat: Resilient background job retry & monitoring #660

Open
chengyixu wants to merge 1 commit into rohitdash08:main from chengyixu:feat/resilient-background-jobs-130

Conversation

@chengyixu

Summary

Implements a production-ready resilient background job system for FinMind with retry logic, dead-letter queue, and monitoring — resolves #130.

/claim #130

What's Included

Job Execution Engine (app/services/job_manager.py)

  • Job registry — decorator-based handler registration (@register_job("type"))
  • Exponential backoff with jitter — configurable schedule (default 5min / 15min / 45min)
  • Dead-letter queue — permanently failed jobs move to DEAD status with Redis DLQ publishing for external consumers
  • Circuit breaker — Redis-backed circuit breaker pattern for external services (SMTP, Twilio) with configurable failure threshold and recovery timeout
  • Pure dispatch function — dispatch_reminders() is fully testable without DB or scheduler
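The registry decorator and the backoff schedule above can be sketched as follows. This is an illustrative sketch, not the diff itself: the registry dict and the exact `backoff_seconds` signature are assumptions, but the defaults match the documented 5min / 15min / 45min schedule (300s × 3ⁿ, ±10% jitter).

```python
import random
from typing import Callable, Dict

# Hypothetical handler registry behind @register_job("type").
JOB_REGISTRY: Dict[str, Callable] = {}

def register_job(job_type: str):
    """Decorator-based handler registration, as in @register_job("type")."""
    def decorator(func: Callable) -> Callable:
        JOB_REGISTRY[job_type] = func
        return func
    return decorator

def backoff_seconds(retry_count: int,
                    base: float = 300.0,        # JOB_BASE_BACKOFF_SECONDS
                    multiplier: float = 3.0,    # JOB_BACKOFF_MULTIPLIER
                    jitter: float = 0.1) -> float:  # JOB_JITTER_FACTOR
    """Exponential backoff with jitter: base * multiplier**n, +/- jitter.

    With the defaults this yields roughly 5min / 15min / 45min for
    retry counts 0, 1, 2 -- the schedule quoted in the description.
    """
    delay = base * (multiplier ** retry_count)
    return delay * (1.0 + random.uniform(-jitter, jitter))

@register_job("send_reminder")
def send_reminder(payload: dict) -> None:
    ...  # handler body elided
```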

Scheduler Integration (app/services/scheduler.py)

  • APScheduler BackgroundScheduler with three periodic jobs:
    • Reminder dispatch (every 60s)
    • Pending retry processing (every 30s)
    • Health monitoring (every 5min)
  • Automatically suppressed in test environments (FLASK_ENV=testing)

Monitoring API (app/routes/jobs.py)

| Endpoint | Method | Auth | Description |
| --- | --- | --- | --- |
| `/jobs/status` | GET | JWT | List executions with status/type filters and pagination |
| `/jobs/stats` | GET | JWT | Aggregate success rate, counts by type and status |
| `/jobs/health` | GET | None | Health check for external monitors (returns 503 if unhealthy) |
| `/jobs/dead-letters` | GET | JWT | List dead-lettered jobs with optional type filter |
| `/jobs/dead-letters/<id>/retry` | POST | JWT | Reset a dead-letter job for re-execution |
| `/jobs/run-retries` | POST | JWT | Manually trigger pending retry processing |
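The `/jobs/health` contract (503 when unhealthy) reduces to a small pure decision function. The field names and thresholds below are assumptions for illustration; only the status-code behavior is from the table above.

```python
def health_response(pending: int, stuck: int, dead_last_hour: int,
                    dead_threshold: int = 10) -> tuple[dict, int]:
    """Return (JSON body, HTTP status) for a health probe.

    Unhealthy (503) when any job is stuck or too many jobs have
    dead-lettered recently; thresholds are illustrative.
    """
    healthy = stuck == 0 and dead_last_hour < dead_threshold
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "pending": pending,
        "stuck": stuck,
        "dead_last_hour": dead_last_hour,
    }
    return body, (200 if healthy else 503)
```

Keeping the decision separate from the Flask route makes it testable without a running app, in the same spirit as the pure `dispatch_reminders()` noted earlier.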

Observability

  • Prometheus metrics: finmind_job_events_total, finmind_job_duration_seconds, finmind_dead_letter_total
  • Structured JSON logging for all job lifecycle events
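Structured JSON logging for lifecycle events can be done with a small stdlib formatter; the extra fields (`job_type`, `event`) are illustrative assumptions about what the PR logs, not its actual schema.

```python
import json
import logging

class JobJsonFormatter(logging.Formatter):
    """Emit each job lifecycle log record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Extra attributes attached via logger.info(..., extra={...}).
            "job_type": getattr(record, "job_type", None),
            "event": getattr(record, "event", None),
        }
        return json.dumps(payload)
```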

Configuration

All settings configurable via environment variables through pydantic-settings:

  • JOB_MAX_RETRIES (default: 3)
  • JOB_BASE_BACKOFF_SECONDS (default: 300)
  • JOB_BACKOFF_MULTIPLIER (default: 3.0)
  • JOB_JITTER_FACTOR (default: 0.1)
  • JOB_CB_FAILURE_THRESHOLD (default: 5)
  • JOB_CB_RECOVERY_TIMEOUT (default: 300)
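As a stdlib stand-in for the pydantic-settings model, the env-var fallback behavior looks like this (field names mirror the variables above; the real class shape in `app/config.py` may differ):

```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    return os.environ.get(name, default)

@dataclass
class JobSettings:
    """Each field falls back to the documented default when unset."""
    max_retries: int = field(default_factory=lambda: int(_env("JOB_MAX_RETRIES", "3")))
    base_backoff_seconds: int = field(default_factory=lambda: int(_env("JOB_BASE_BACKOFF_SECONDS", "300")))
    backoff_multiplier: float = field(default_factory=lambda: float(_env("JOB_BACKOFF_MULTIPLIER", "3.0")))
    jitter_factor: float = field(default_factory=lambda: float(_env("JOB_JITTER_FACTOR", "0.1")))
    cb_failure_threshold: int = field(default_factory=lambda: int(_env("JOB_CB_FAILURE_THRESHOLD", "5")))
    cb_recovery_timeout: int = field(default_factory=lambda: int(_env("JOB_CB_RECOVERY_TIMEOUT", "300")))
```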

Database

  • New job_executions table with indexes on (status, next_retry_at) and (job_type, status) for efficient polling
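Illustrative DDL for that table, exercised here against in-memory SQLite. The column set is inferred from the commit message (status, retry_count, max_retries, next_retry_at, last_error); the production schema in `schema.sql` may differ in types and extra columns.

```python
import sqlite3

DDL = """
CREATE TABLE job_executions (
    id            INTEGER PRIMARY KEY,
    job_type      TEXT NOT NULL,
    status        TEXT NOT NULL,
    retry_count   INTEGER NOT NULL DEFAULT 0,
    max_retries   INTEGER NOT NULL DEFAULT 3,
    next_retry_at TIMESTAMP,
    last_error    TEXT,
    payload       TEXT
);
-- Composite indexes matching the polling queries described above.
CREATE INDEX ix_job_exec_status_retry ON job_executions (status, next_retry_at);
CREATE INDEX ix_job_exec_type_status  ON job_executions (job_type, status);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

The `(status, next_retry_at)` index lets the retry poller scan only `PENDING` rows whose retry time is due, rather than the whole table.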

Tests — 44 total

| Category | Count | Covers |
| --- | --- | --- |
| Backoff calculation | 5 | Exponential growth, jitter variance, minimum bound |
| Job lifecycle | 5 | Enqueue, execute, success, failure, conditional |
| Retry processing | 3 | Due retries, future skipping, new-job exclusion |
| Dead-letter queue | 5 | Retrieval, filtering, reset, not-found, non-dead rejection |
| Circuit breaker | 5 | Closed/open/half-open states, threshold, reset |
| Dispatcher | 4 | Success, sender failure, exceptions, empty list |
| Stats & health | 4 | Empty DB, mixed statuses, healthy, stuck detection |
| API endpoints | 8 | All 6 endpoints + edge cases |
| Edge cases | 3 | Concurrent retries, Redis failure resilience, null payload |
```shell
$ pytest tests/test_jobs.py -v
```

36 tests pass locally (unit/integration); the 8 API tests require Redis and pass in Docker CI.
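The circuit-breaker scenarios listed above (closed/open/half-open, failure threshold, reset) follow the standard state cycle, sketched here in memory. The real implementation is Redis-backed so the state is shared across processes; this class is an assumption-laden stand-in, not the PR's code.

```python
import time

class CircuitBreaker:
    """Closed -> open after N failures; half-open after the recovery
    timeout; a success in half-open resets back to closed."""
    def __init__(self, failure_threshold: int = 5,      # JOB_CB_FAILURE_THRESHOLD
                 recovery_timeout: float = 300.0):      # JOB_CB_RECOVERY_TIMEOUT
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at: float | None = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            return "half-open"  # allow one trial request through
        return "open"

    def allow_request(self) -> bool:
        return self.state in ("closed", "half-open")

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```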

Files Changed

  • packages/backend/app/models.py — Added JobExecution model and JobStatus enum
  • packages/backend/app/config.py — Added job retry configuration fields
  • packages/backend/app/services/job_manager.py — NEW: Core job engine
  • packages/backend/app/services/scheduler.py — NEW: APScheduler integration
  • packages/backend/app/routes/jobs.py — NEW: Monitoring API endpoints
  • packages/backend/app/routes/__init__.py — Register jobs blueprint
  • packages/backend/app/__init__.py — Initialize scheduler on app creation
  • packages/backend/app/observability.py — Added job Prometheus metrics
  • packages/backend/app/db/schema.sql — Added job_executions table
  • packages/backend/tests/test_jobs.py — NEW: 44 comprehensive tests

…h08#130)

- Add JobExecution model with retry state tracking (status, retry_count,
  max_retries, next_retry_at, last_error)
- Implement exponential backoff with jitter (5m/15m/45m default schedule)
- Dead-letter queue for permanently failed jobs (DB + Redis DLQ)
- Redis-backed circuit breaker for external services (SMTP, Twilio)
- APScheduler BackgroundScheduler for automated retry processing,
  reminder dispatch, and health monitoring
- Job monitoring API endpoints:
  GET /jobs/status - list executions with filters and pagination
  GET /jobs/stats - aggregate success rate and counts by type/status
  GET /jobs/health - unauthenticated health check for external monitors
  GET /jobs/dead-letters - list dead-lettered jobs
  POST /jobs/dead-letters/<id>/retry - reset dead-letter for re-execution
  POST /jobs/run-retries - manually trigger pending retry processing
- Prometheus metrics: job events, execution duration, dead-letter counts
- Configurable via Settings/env vars (JOB_MAX_RETRIES,
  JOB_BASE_BACKOFF_SECONDS, JOB_BACKOFF_MULTIPLIER, etc.)
- 44 tests covering backoff calculation, job lifecycle, retry scheduling,
  dead-letter queue, circuit breaker, dispatcher, stats, health, and
  API endpoints
- Database migration: job_executions table with indexes

/claim rohitdash08#130
