feat: Resilient background job retry & monitoring #660
Open
chengyixu wants to merge 1 commit into rohitdash08:main
…h08#130)

- Add JobExecution model with retry state tracking (status, retry_count, max_retries, next_retry_at, last_error)
- Implement exponential backoff with jitter (5m/15m/45m default schedule)
- Dead-letter queue for permanently failed jobs (DB + Redis DLQ)
- Redis-backed circuit breaker for external services (SMTP, Twilio)
- APScheduler BackgroundScheduler for automated retry processing, reminder dispatch, and health monitoring
- Job monitoring API endpoints:
  GET /jobs/status - list executions with filters and pagination
  GET /jobs/stats - aggregate success rate and counts by type/status
  GET /jobs/health - unauthenticated health check for external monitors
  GET /jobs/dead-letters - list dead-lettered jobs
  POST /jobs/dead-letters/<id>/retry - reset dead-letter for re-execution
  POST /jobs/run-retries - manually trigger pending retry processing
- Prometheus metrics: job events, execution duration, dead-letter counts
- Configurable via Settings/env vars (JOB_MAX_RETRIES, JOB_BASE_BACKOFF_SECONDS, JOB_BACKOFF_MULTIPLIER, etc.)
- 44 tests covering backoff calculation, job lifecycle, retry scheduling, dead-letter queue, circuit breaker, dispatcher, stats, health, and API endpoints
- Database migration: job_executions table with indexes

/claim rohitdash08#130
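The 5m/15m/45m schedule above is consistent with a base delay of 300 seconds multiplied by 3 on each retry. A minimal sketch of that calculation follows; the function name `backoff_delay`, its signature, and the choice of symmetric jitter are assumptions for illustration and may differ from what the PR implements.

```python
import random


def backoff_delay(retry_count, base_seconds=300.0, multiplier=3.0,
                  jitter_factor=0.1, rng=random):
    """Return the delay in seconds before retry number `retry_count` (0-based).

    Hypothetical helper: the defaults mirror the documented settings, but the
    symmetric jitter (+/- jitter_factor of the delay) is an assumption.
    """
    delay = base_seconds * (multiplier ** retry_count)
    spread = delay * jitter_factor
    # Jitter spreads retries out so failed jobs do not all wake at once.
    return delay + rng.uniform(-spread, spread)


# With jitter disabled the schedule is exactly 5, 15 and 45 minutes.
schedule = [backoff_delay(n, jitter_factor=0.0) for n in range(3)]
print([seconds / 60 for seconds in schedule])  # [5.0, 15.0, 45.0]
```

With the default jitter factor of 0.1, the second retry would land somewhere between 810 and 990 seconds rather than at exactly 900.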
Summary
Implements a production-ready resilient background job system for FinMind with retry logic, dead-letter queue, and monitoring — resolves #130.
/claim #130
What's Included
Job Execution Engine (app/services/job_manager.py)
- … (@register_job("type"))
- dispatch_reminders() is fully testable without DB or scheduler

Scheduler Integration (app/services/scheduler.py)
- … (FLASK_ENV=testing)

Monitoring API (app/routes/jobs.py)
- /jobs/status
- /jobs/stats
- /jobs/health
- /jobs/dead-letters
- /jobs/dead-letters/<id>/retry
- /jobs/run-retries

Observability
- finmind_job_events_total, finmind_job_duration_seconds, finmind_dead_letter_total

Configuration
All settings configurable via environment variables through pydantic-settings:
- JOB_MAX_RETRIES (default: 3)
- JOB_BASE_BACKOFF_SECONDS (default: 300)
- JOB_BACKOFF_MULTIPLIER (default: 3.0)
- JOB_JITTER_FACTOR (default: 0.1)
- JOB_CB_FAILURE_THRESHOLD (default: 5)
- JOB_CB_RECOVERY_TIMEOUT (default: 300)

Database
- job_executions table with indexes on (status, next_retry_at) and (job_type, status) for efficient polling

Tests: 44 total
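The individual test cases are not listed here. As a rough illustration of the circuit breaker behaviour implied by the defaults JOB_CB_FAILURE_THRESHOLD=5 and JOB_CB_RECOVERY_TIMEOUT=300, here is a minimal in-memory sketch. The PR describes a Redis-backed breaker; the class below, its method names, and the injectable clock are assumptions made for illustration, not the code in job_manager.py.

```python
import time


class CircuitBreaker:
    """In-memory stand-in for a circuit breaker (hypothetical API)."""

    def __init__(self, failure_threshold=5, recovery_timeout=300.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock          # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Once the recovery timeout has elapsed, let a trial call through.
        return self._clock() - self._opened_at >= self.recovery_timeout

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Open (or re-open) the circuit from the current time.
            self._opened_at = self._clock()


# Drive the breaker with a fake clock instead of real time.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
print(breaker.allow_request())  # False: threshold reached, circuit open
now[0] = 300.0
print(breaker.allow_request())  # True: recovery timeout elapsed
```

Injecting the clock keeps a test of this kind fast and deterministic, since no real waiting is needed to cross the recovery timeout.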
Files Changed
- packages/backend/app/models.py: Added JobExecution model and JobStatus enum
- packages/backend/app/config.py: Added job retry configuration fields
- packages/backend/app/services/job_manager.py: NEW Core job engine
- packages/backend/app/services/scheduler.py: NEW APScheduler integration
- packages/backend/app/routes/jobs.py: NEW Monitoring API endpoints
- packages/backend/app/routes/__init__.py: Register jobs blueprint
- packages/backend/app/__init__.py: Initialize scheduler on app creation
- packages/backend/app/observability.py: Added job Prometheus metrics
- packages/backend/app/db/schema.sql: Added job_executions table
- packages/backend/tests/test_jobs.py: NEW 44 comprehensive tests
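For readers unfamiliar with decorator-based registration such as the `@register_job("type")` form mentioned for job_manager.py, the general pattern can be sketched as below. Everything other than the decorator name is hypothetical: the `JOB_HANDLERS` dict, the `run_job` helper, and the example handler are illustrative only and are not taken from the PR.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a job type string to its handler.
JOB_HANDLERS: Dict[str, Callable[[dict], object]] = {}


def register_job(job_type: str):
    """Decorator that records a handler under the given job type."""
    def decorator(func):
        if job_type in JOB_HANDLERS:
            raise ValueError(f"handler already registered for {job_type!r}")
        JOB_HANDLERS[job_type] = func
        return func  # the function stays callable directly, e.g. in tests
    return decorator


@register_job("reminder")
def send_reminder(payload: dict) -> str:
    # Illustrative handler; a real one would call SMTP or Twilio.
    return f"reminder sent to {payload['user_id']}"


def run_job(job_type: str, payload: dict):
    """Look up the handler for a job type and run it (hypothetical helper)."""
    handler = JOB_HANDLERS.get(job_type)
    if handler is None:
        raise KeyError(f"no handler registered for {job_type!r}")
    return handler(payload)


print(run_job("reminder", {"user_id": 7}))  # reminder sent to 7
```

Because the decorator returns the original function unchanged, handlers can be unit-tested on their own without going through the registry.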