Future Worker Improvements (Before Prod) #4

@DG-20

Description

  1. Add Retry Logic

    • Implement retry count per job (e.g., max_retries = 3–5).
    • Track retry attempts in Redis alongside the job.
    • Apply exponential backoff between retries.
    • Move job to a dead-letter queue (DLQ) after max retries.
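A minimal sketch of the retry decision described above, assuming a retry limit of 3 and a doubling delay. The constants and the names `backoff_delay` and `next_action` are hypothetical, not existing worker code.

```python
import random
from typing import Tuple

MAX_RETRIES = 3            # assumed; the issue suggests 3-5
BASE_DELAY_SECONDS = 2.0   # hypothetical base delay
MAX_DELAY_SECONDS = 60.0   # hypothetical cap so delays stay bounded


def backoff_delay(attempt: int, jitter: bool = True) -> float:
    """Delay before retry number `attempt` (1-based): base * 2^(attempt-1), capped."""
    delay = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * (2 ** (attempt - 1)))
    if jitter:
        # Full jitter: pick uniformly in [0, delay] so workers do not retry in lockstep.
        delay = random.uniform(0, delay)
    return delay


def next_action(retries_so_far: int) -> Tuple[str, float]:
    """Decide what to do after a failure, given the retry count stored with the job."""
    attempt = retries_so_far + 1
    if attempt > MAX_RETRIES:
        return ("dead_letter", 0.0)
    return ("retry", backoff_delay(attempt, jitter=False))
```

The worker would store `retries_so_far` next to the job in Redis and increment it on each failure. Jitter is left off in `next_action` only to keep the example deterministic.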
  2. Dead-Letter Queue (DLQ)

    • Create a Redis-backed DLQ list (e.g., "jobs:dead").
    • Store failed job metadata + error reason.
    • Add an API endpoint or admin page to inspect DLQ items.
    • Allow manual requeue of DLQ jobs.
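One possible shape for the DLQ helpers, sketched against any client that exposes the Redis list commands `rpush`, `lrange`, and `lrem`. `FakeRedis` is an in-memory stand-in so the example runs without a server, and `jobs:pending` is an assumed name for the main queue.

```python
import json
import time
from typing import Any, Dict, List

DLQ_KEY = "jobs:dead"        # key suggested in this issue
QUEUE_KEY = "jobs:pending"   # hypothetical main queue key


class FakeRedis:
    """In-memory stand-in exposing only the list commands used below."""

    def __init__(self) -> None:
        self.lists: Dict[str, List[str]] = {}

    def rpush(self, key: str, value: str) -> int:
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])

    def lrange(self, key: str, start: int, end: int) -> List[str]:
        items = self.lists.get(key, [])
        stop = len(items) if end == -1 else end + 1
        return items[start:stop]

    def lrem(self, key: str, count: int, value: str) -> int:
        items = self.lists.get(key, [])
        if value in items:
            items.remove(value)  # removes the first match, like count=1
            return 1
        return 0


def dead_letter(r, job: Dict[str, Any], error: str) -> None:
    """Store the failed job together with its error reason and a timestamp."""
    entry = {"job": job, "error": error, "failed_at": time.time()}
    r.rpush(DLQ_KEY, json.dumps(entry))


def list_dead(r, limit: int = 50) -> List[Dict[str, Any]]:
    """Return up to `limit` DLQ entries for an inspection endpoint."""
    return [json.loads(raw) for raw in r.lrange(DLQ_KEY, 0, limit - 1)]


def requeue(r, job_id: str) -> bool:
    """Move one job back to the main queue. Not atomic in this sketch."""
    for raw in r.lrange(DLQ_KEY, 0, -1):
        entry = json.loads(raw)
        if entry["job"].get("id") == job_id:
            r.rpush(QUEUE_KEY, json.dumps(entry["job"]))
            r.lrem(DLQ_KEY, 1, raw)
            return True
    return False
```

With a real Redis client the push and remove in `requeue` would probably need to run in a transaction or a Lua script, so a crash between the two calls cannot duplicate the job.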
  3. Improved Error Handling

    • Standardized exception types for unzip, file IO, KM upload, etc.
    • Structured error logging with correlation IDs.
    • More granular job status values (e.g., "unzip_failed", "upload_failed").
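A rough sketch of what the standardized exception types could look like, with each type carrying the job status it maps to. The class names, the `file_io_failed` status, and the `retryable` flags are assumptions for illustration.

```python
from typing import Tuple


class JobError(Exception):
    """Base class for worker failures; subclasses set a granular status."""
    status = "failed"
    retryable = False


class UnzipError(JobError):
    status = "unzip_failed"      # a corrupt archive will not fix itself


class FileIOError(JobError):
    status = "file_io_failed"    # hypothetical status name
    retryable = True


class KMUploadError(JobError):
    status = "upload_failed"
    retryable = True             # assumed transient (network, KM downtime)


def classify(exc: BaseException) -> Tuple[str, bool]:
    """Map any exception to (job status, whether a retry makes sense)."""
    if isinstance(exc, JobError):
        return exc.status, exc.retryable
    return "failed", False
```

Keeping the status on the exception class means the worker's top-level `except` block can set the job status and choose between retry and DLQ without a long `if` chain.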
  4. Worker Scaling

    • Enable multiple worker replicas consuming from the same Redis queue.
    • Ensure job processing is fully idempotent.
    • Add a Redis-based distributed lock if needed for shared-state operations.
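If a lock turns out to be needed, a common pattern is a single `SET` with the `NX` and `EX` options plus a per-owner token. The sketch below assumes a client whose `set(key, value, nx=..., ex=...)` behaves like redis-py's; `FakeRedis` is a stand-in that ignores expiry, and the `lock:` key prefix is hypothetical.

```python
import uuid
from typing import Dict, Optional


class FakeRedis:
    """In-memory stand-in; unlike a real server it never expires keys."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def set(self, key: str, value: str, nx: bool = False, ex: Optional[int] = None):
        if nx and key in self.data:
            return None  # mirrors redis-py: None when NX prevents the write
        self.data[key] = value
        return True

    def get(self, key: str) -> Optional[str]:
        return self.data.get(key)

    def delete(self, key: str) -> int:
        return 1 if self.data.pop(key, None) is not None else 0


def acquire_lock(r, name: str, ttl_seconds: int = 30) -> Optional[str]:
    """Try to take the lock; return an owner token on success, None otherwise."""
    token = uuid.uuid4().hex
    if r.set(f"lock:{name}", token, nx=True, ex=ttl_seconds):
        return token
    return None


def release_lock(r, name: str, token: str) -> bool:
    """Release only if we still own the lock. Not atomic in this sketch."""
    key = f"lock:{name}"
    if r.get(key) == token:
        r.delete(key)
        return True
    return False
```

The TTL keeps a crashed worker from holding the lock forever. Against a real server the check-and-delete in `release_lock` would normally be done in a Lua script, and `get` returns bytes unless the client decodes responses.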
  5. Progress Reporting

    • Report intermediate steps to Redis:
      pending → unzipping → uploading → indexing → completed
    • Include % progress estimate for multi-file uploads.
    • Provide progress messages the frontend can display.
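A small sketch of the progress payload the worker could write to Redis at each step. The stage names come from the list above; the payload fields and the per-file percent calculation are assumptions.

```python
from typing import Any, Dict, Optional

STAGES = ["pending", "unzipping", "uploading", "indexing", "completed"]


def progress_payload(stage: str, files_done: int = 0, files_total: int = 0) -> Dict[str, Any]:
    """Build a status record the frontend can display as-is."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")

    percent: Optional[int] = None
    message = stage.capitalize()
    if stage == "uploading" and files_total > 0:
        # Percent is estimated from file counts only, not bytes transferred.
        percent = int(100 * files_done / files_total)
        message = f"Uploading file {files_done} of {files_total} ({percent}%)"
    elif stage == "completed":
        percent = 100

    return {"status": stage, "percent": percent, "message": message}
```

The returned dict could be stored as a Redis hash or a JSON string under a per-job key; which one fits best depends on how the frontend polls for status.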
  6. File Validation & Security

    • Validate supported file types before unzipping.
    • Reject archives containing > X files or unsupported formats.
    • Sanitize or reject dangerous paths (zip-slip prevention).
    • Enforce file size limits on extracted content.
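The checks above could be run against the archive's directory listing before anything is extracted. This sketch uses only the standard `zipfile` module; the limits and the allowed extensions are placeholder values, not decided ones.

```python
import os
import zipfile
from typing import List

MAX_FILES = 1000                          # placeholder for "X" above
MAX_TOTAL_BYTES = 500 * 1024 * 1024       # placeholder extracted-size limit
ALLOWED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx"}  # placeholder set


def validate_archive(zf: zipfile.ZipFile, dest_dir: str) -> List[str]:
    """Return the member names that are safe to extract, or raise ValueError."""
    infos = [info for info in zf.infolist() if not info.is_dir()]
    if len(infos) > MAX_FILES:
        raise ValueError(f"archive has {len(infos)} files, limit is {MAX_FILES}")

    dest_root = os.path.realpath(dest_dir)
    total = 0
    names = []
    for info in infos:
        # Zip-slip check: the resolved target must stay inside dest_root.
        target = os.path.realpath(os.path.join(dest_root, info.filename))
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError(f"unsafe path in archive: {info.filename}")

        ext = os.path.splitext(info.filename)[1].lower()
        if ext not in ALLOWED_EXTENSIONS:
            raise ValueError(f"unsupported file type: {info.filename}")

        total += info.file_size  # size declared in the zip header
        if total > MAX_TOTAL_BYTES:
            raise ValueError("extracted content would exceed the size limit")
        names.append(info.filename)
    return names
```

`file_size` is taken from the archive's own headers, so a hostile archive can understate it. Counting bytes again while extracting would close that gap.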
  7. Performance Optimizations

    • Stream KM uploads instead of loading entire files into memory.
    • Optional background cleanup of extracted temp files.
    • Pre-check available disk space.
    • Consider batching KM ingestion if many small files are uploaded.
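Two of the items above are easy to sketch with the standard library: reading a file in fixed-size chunks so an upload never holds the whole file in memory, and checking free disk space before extraction. The chunk size and the safety factor are assumed values.

```python
import shutil
from typing import BinaryIO, Iterator

CHUNK_SIZE = 1024 * 1024  # 1 MiB, an assumed default


def iter_chunks(fileobj: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the file's content piece by piece instead of reading it all at once."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield chunk


def has_free_space(path: str, required_bytes: int, safety_factor: float = 1.2) -> bool:
    """True if the filesystem holding `path` has room for the extracted content."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes * safety_factor
```

Many HTTP clients accept an iterator or a file object as the request body and send it in chunks. Whether the KM upload endpoint accepts chunked bodies would need to be confirmed.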
  8. Observability & Monitoring

    • Add structured logs with jobId + userId correlation.
    • Expose Prometheus metrics (jobs processed, failures, retry counts, queue lag).
    • Dashboards for worker health.
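For the structured-logging item, one option is a JSON formatter on top of the standard `logging` module, with a `LoggerAdapter` that attaches `jobId` and `userId` to every record. The logger name `worker` and the field layout are assumptions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by the adapter below; None when logging outside a job.
            "jobId": getattr(record, "jobId", None),
            "userId": getattr(record, "userId", None),
        }
        return json.dumps(payload)


def job_logger(job_id: str, user_id: str) -> logging.LoggerAdapter:
    """Return a logger that stamps every message with the job's correlation IDs."""
    base = logging.getLogger("worker")
    return logging.LoggerAdapter(base, {"jobId": job_id, "userId": user_id})
```

A handler configured with `JsonFormatter()` would be attached once at startup, and each job would then log through `job_logger(job_id, user_id)`.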
  9. Worker Shutdown Safety

    • Handle shutdown signals (SIGTERM/SIGINT) gracefully.
    • Finish current job before terminating.
    • Requeue unprocessed/partial jobs safely.
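A minimal sketch of the shutdown behaviour, assuming a single-threaded worker loop. The signal handler only sets a flag; the loop checks it between jobs, so the current job always finishes and anything not yet taken stays in the queue. The `Worker` class and the in-memory `deque` are illustrative stand-ins for the real loop and the Redis queue.

```python
import signal
from collections import deque
from typing import Any, Callable, Deque


class Worker:
    def __init__(self, queue: Deque[Any], handler: Callable[[Any], None]) -> None:
        self.queue = queue
        self.handler = handler
        self.stopping = False

    def request_stop(self, signum=None, frame=None) -> None:
        """Signal handler: do no work here, just ask the loop to exit."""
        self.stopping = True

    def install_signal_handlers(self) -> None:
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.request_stop)
        signal.signal(signal.SIGINT, self.request_stop)

    def run(self) -> int:
        """Process jobs until the queue is empty or a stop was requested."""
        processed = 0
        while not self.stopping and self.queue:
            job = self.queue.popleft()
            self.handler(job)  # runs to completion even if a signal arrives
            processed += 1
        return processed
```

With Redis, the blocking pop would need a timeout so the loop wakes up regularly to check the flag. Requeuing a job that was interrupted partway through would additionally need the job to be tracked as in progress somewhere, which this sketch does not cover.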
  10. Config Improvements

    • Externalize worker config (timeouts, KM URL, max zip size, retry limits).
    • Use strong typing + validation for configuration.
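One way to get typed, validated config without extra dependencies is a frozen dataclass loaded from environment variables. The field names, variable names, and defaults below are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerConfig:
    km_url: str
    job_timeout_seconds: int = 300
    max_zip_bytes: int = 500 * 1024 * 1024
    max_retries: int = 3

    def __post_init__(self) -> None:
        # Fail fast at startup instead of in the middle of a job.
        if not self.km_url.startswith(("http://", "https://")):
            raise ValueError(f"km_url must be an http(s) URL, got {self.km_url!r}")
        if self.job_timeout_seconds <= 0:
            raise ValueError("job_timeout_seconds must be positive")
        if self.max_zip_bytes <= 0:
            raise ValueError("max_zip_bytes must be positive")
        if self.max_retries < 0:
            raise ValueError("max_retries must not be negative")


def load_config(env: Mapping[str, str] = os.environ) -> WorkerConfig:
    """Build the config from environment variables (names are hypothetical)."""
    return WorkerConfig(
        km_url=env["KM_URL"],
        job_timeout_seconds=int(env.get("JOB_TIMEOUT_SECONDS", "300")),
        max_zip_bytes=int(env.get("MAX_ZIP_BYTES", str(500 * 1024 * 1024))),
        max_retries=int(env.get("MAX_RETRIES", "3")),
    )
```

Passing the environment as a parameter keeps `load_config` easy to test with a plain dict.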
  11. Multi-File Upload Enhancements

    • Allow very large multi-file zips.
    • Allow user-configurable tags per file.
    • Deduplicate files before uploading to KM.
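Deduplication could be done by content hash, so that files with different names but identical bytes are uploaded once. A sketch using SHA-256 from the standard library; the function names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def file_digest(path: str, chunk_size: int = 1024 * 1024) -> str:
    """SHA-256 of a file's content, read in chunks to keep memory use flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_by_content(paths: Iterable[str]) -> List[str]:
    """Keep the first path for each distinct content hash, preserving order."""
    seen = set()
    unique = []
    for path in paths:
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            unique.append(path)
    return unique
```

This only removes duplicates within one upload. Detecting files that already exist in KM would need the digests to be stored or queried on the KM side.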
  12. Admin / Diagnostics Tools

    • Add “requeue all failed jobs” endpoint.
    • KM upload diagnostics (timings, latency).
    • Worker self-test endpoint.
