feat: resilient background job retry & monitoring#657

Open
chengyixu wants to merge 1 commit into rohitdash08:main from chengyixu:feat/resilient-job-retry-monitoring
@chengyixu
Summary

Implements a production-ready background job execution framework for FinMind that adds resilience to reminder dispatch and other background tasks.

/claim #130

Features

Job Execution Engine:

  • JobExecution model tracking every background job with full lifecycle metadata (status, retries, errors, timestamps)
  • Configurable maximum retries per job (default: 3)
  • JSON-encoded payload and result storage for audit trail

Exponential Backoff with Jitter:

  • Retry 1: ~5 minutes (300s)
  • Retry 2: ~15 minutes (900s)
  • Retry 3: ~45 minutes (2700s)
  • ±10% jitter prevents thundering-herd on mass retries
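
The schedule above corresponds to a base delay of 300s multiplied by 3 on each retry. A minimal sketch of that calculation follows; the function and constant names are hypothetical and are not taken from job_scheduler.py, and the uniform jitter distribution is an assumption.

```python
import random

# Assumed constants, inferred from the 300s / 900s / 2700s schedule above.
BASE_DELAY_SECONDS = 300
BACKOFF_FACTOR = 3
JITTER_FRACTION = 0.10  # +/-10%


def backoff_delay(retry_number, rng=random):
    """Return the delay in seconds before the given retry (1-based)."""
    if retry_number < 1:
        raise ValueError("retry_number must be >= 1")
    nominal = BASE_DELAY_SECONDS * BACKOFF_FACTOR ** (retry_number - 1)
    # Scale the nominal delay by a random factor in [0.9, 1.1] so that jobs
    # which failed together do not all retry at the same instant.
    jitter = rng.uniform(-JITTER_FRACTION, JITTER_FRACTION)
    return nominal * (1 + jitter)
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.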

Circuit Breaker Pattern:

  • Separate breakers for email and WhatsApp external services
  • CLOSED → OPEN → HALF_OPEN state transitions
  • Configurable failure threshold and reset timeout
  • Prevents cascading failures when external services are down
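
For reviewers unfamiliar with the pattern, the state machine can be sketched roughly as below. This is an illustration only: the class shape, method names, and default threshold/timeout values are assumptions and may differ from the breaker in job_scheduler.py.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    def __init__(self, failure_threshold=5, reset_timeout=60.0,
                 clock=time.monotonic):
        # Default values here are placeholders, not the PR's configuration.
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None
        self._state = self.CLOSED

    @property
    def state(self):
        # An OPEN breaker moves to HALF_OPEN once the reset timeout elapses.
        if (self._state == self.OPEN
                and self._clock() - self._opened_at >= self.reset_timeout):
            self._state = self.HALF_OPEN
        return self._state

    def allow_request(self):
        """Calls are blocked only while the breaker is OPEN."""
        return self.state != self.OPEN

    def record_success(self):
        # Any success (including a HALF_OPEN probe) closes the breaker.
        self._failures = 0
        self._state = self.CLOSED

    def record_failure(self):
        if self.state == self.HALF_OPEN:
            self._trip()  # a failed probe re-opens immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self._state = self.OPEN
        self._opened_at = self._clock()
```

A caller would check `allow_request()` before contacting the email or WhatsApp provider, then report the outcome with `record_success()` or `record_failure()`.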

Dead-Letter Handling:

  • Jobs that exhaust all retries are marked DEAD with full error history
  • Can be manually retried via API
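
A rough sketch of the failure path that leads to DEAD, assuming a job has max_retries retries after its first attempt. The `Job` dataclass is a stand-in for the JobExecution model, the RETRYING and FAILED status names are assumptions (only PENDING and DEAD appear in this description), and resetting retry_count on manual retry is also an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical stand-in for the JobExecution model."""
    status: str = "PENDING"
    retry_count: int = 0
    max_retries: int = 3
    last_error: Optional[str] = None
    next_retry_at: Optional[datetime] = None
    error_history: List[str] = field(default_factory=list)


def record_failure(job, error, now, delay_seconds):
    """Schedule another retry, or dead-letter the job once retries run out."""
    job.last_error = error
    job.error_history.append(error)
    if job.retry_count >= job.max_retries:
        job.status = "DEAD"
        job.next_retry_at = None
    else:
        job.retry_count += 1
        job.status = "RETRYING"  # assumed status name
        job.next_retry_at = now + timedelta(seconds=delay_seconds)
    return job


def manual_retry(job):
    """Put a failed or dead job back in the queue with a fresh retry budget."""
    if job.status not in ("FAILED", "DEAD"):
        raise ValueError(f"cannot retry a job in status {job.status}")
    job.status = "PENDING"
    job.retry_count = 0
    job.next_retry_at = None
    return job
```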

REST API Endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /jobs/status | GET | Paginated job list with status/type filters |
| /jobs/stats | GET | Aggregate stats: success rate, avg duration, counts by status/type |
| /jobs/health | GET | Health check with circuit breaker states (no auth required) |
| /jobs/ | GET | Single job detail |
| /jobs/retry/<id> | POST | Manual retry of failed/dead jobs |
| /jobs/process-retries | POST | Process all pending retries (for cron/scheduler) |

Integration:

  • dispatch_reminder_with_retry() creates tracked jobs for reminder sends
  • Pluggable handler registry for extending to new job types
  • Wired into existing reminder dispatch pipeline
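
To illustrate what a pluggable handler registry can look like, here is a minimal sketch. The decorator name, the "reminder" job type string, and the payload shape are all hypothetical; the registry in job_scheduler.py may be structured differently.

```python
from typing import Any, Callable, Dict

# Maps a job type string to the function that executes jobs of that type.
_HANDLERS: Dict[str, Callable[[dict], Any]] = {}


def register_handler(job_type):
    """Decorator that registers a function as the handler for job_type."""
    def decorator(func):
        _HANDLERS[job_type] = func
        return func
    return decorator


def run_job(job_type, payload):
    """Look up the handler for job_type and call it with the decoded payload."""
    try:
        handler = _HANDLERS[job_type]
    except KeyError:
        raise LookupError(
            f"no handler registered for job type {job_type!r}"
        ) from None
    return handler(payload)


@register_handler("reminder")  # hypothetical job type name
def send_reminder(payload):
    # A real handler would send the email or WhatsApp message here.
    return {"sent_to": payload["user_id"]}
```

Adding a new job type then only requires defining a function and decorating it; the dispatch code itself does not change.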

Test Coverage

30 tests covering:

  • Exponential backoff calculation (2 tests)
  • Circuit breaker state machine (5 tests)
  • Job creation, execution, failure (4 tests)
  • Retry scheduling and processing (2 tests)
  • Manual retry logic (2 tests)
  • Statistics computation (1 test)
  • Reminder integration (1 test)
  • All 6 API endpoints (11 tests)
  • Model serialization (2 tests)

All 30 tests pass.

Files Changed

| File | Change |
|---|---|
| packages/backend/app/models.py | Added JobExecution model and JobStatus enum |
| packages/backend/app/services/job_scheduler.py | New: job engine, backoff, circuit breaker, handler registry |
| packages/backend/app/routes/jobs.py | New: monitoring and management API endpoints |
| packages/backend/app/routes/__init__.py | Register jobs blueprint at /jobs |
| packages/backend/app/db/schema.sql | Added job_executions table with indexes |
| packages/backend/tests/test_jobs.py | 30 comprehensive tests |

Database Migration

CREATE TABLE IF NOT EXISTS job_executions (
  id SERIAL PRIMARY KEY,
  job_type VARCHAR(100) NOT NULL,
  status VARCHAR(20) NOT NULL DEFAULT 'PENDING',
  payload TEXT,
  result TEXT,
  retry_count INT NOT NULL DEFAULT 0,
  max_retries INT NOT NULL DEFAULT 3,
  last_error TEXT,
  next_retry_at TIMESTAMP,
  started_at TIMESTAMP,
  completed_at TIMESTAMP,
  created_at TIMESTAMP NOT NULL DEFAULT NOW(),
  user_id INT REFERENCES users(id) ON DELETE SET NULL
);

No Breaking Changes

  • All existing tests continue to pass
  • No modifications to existing reminder, bill, or expense logic
  • Backward compatible with existing database schema

Implements a production-ready background job execution framework with:

- JobExecution model for tracking job state, retries, and errors
- Exponential backoff with jitter (5min/15min/45min retry intervals)
- Circuit breaker pattern for external services (email, WhatsApp)
- Dead-letter handling for jobs that exhaust all retries
- Manual retry capability for failed/dead jobs
- REST API endpoints for job monitoring:
  - GET /jobs/status — paginated job list with filters
  - GET /jobs/stats — aggregate statistics and success rates
  - GET /jobs/health — circuit breaker and system health
  - POST /jobs/retry/<id> — manual retry of failed jobs
  - POST /jobs/process-retries — process pending retries
- Integration with existing reminder dispatch pipeline
- 30 comprehensive tests covering all functionality

/claim rohitdash08#130

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>