Evora is a robust distributed job queue system built on top of Postgres with exactly-once idempotency guarantees and scalable worker management.
```mermaid
graph LR
    subgraph Clients
        App([Producer App])
        Dash([Admin Dashboard])
    end
    subgraph Evora Server
        API[HTTP API]
        Dispatcher[Worker Dispatcher]
        Lifecycle[Lifecycle Manager]
        Sweeper[Visibility Sweeper]
        Projector[Job Projector]
    end
    subgraph Storage
        PG[(PostgreSQL<br>jobs & events)]
        Mongo[(MongoDB<br>telemetry)]
    end
    subgraph External Workers
        W1[Worker 1]
        W2[Worker 2]
    end

    %% Producer Flow
    App -->|POST /jobs| API
    API -->|Insert PENDING| PG

    %% Worker Flow
    W1 -->|GET /jobs/poll| API
    W2 -->|POST /jobs/:id/complete| API
    API --> Dispatcher
    API --> Lifecycle
    Dispatcher -->|FOR UPDATE SKIP LOCKED| PG
    Lifecycle -->|Update Status| PG

    %% Background Tasks
    Sweeper -.->|Find Expired Locks| PG

    %% Telemetry
    Lifecycle -.->|Publish Event| Projector
    Projector -->|Upsert Stats| Mongo
    Dash -->|GET /queues/stats| API
    API -->|Read| Mongo
```
This diagram shows how a job transitions through its states within Evora.
```mermaid
stateDiagram-v2
    [*] --> PENDING : Submitted by Client
    PENDING --> RUNNING : Claimed by Worker
    RUNNING --> COMPLETED : Worker finishes successfully
    RUNNING --> PENDING : Worker fails (Attempt < Max)
    RUNNING --> PENDING : Lock expires (Visibility Sweeper)
    RUNNING --> DLQ : Worker fails (Attempt >= Max)
    RUNNING --> DLQ : Lock expires (Attempt >= Max)
    DLQ --> PENDING : Manual Retry from Dashboard
    COMPLETED --> [*]
    DLQ --> [*]
```
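The failure and expiry transitions above reduce to a small pure function over the attempt count. Here is a minimal sketch, assuming a configurable `maxAttempts` cap; the class and method names are illustrative, not Evora's actual API:

```java
// State transition helper mirroring the diagram above. Status names match the
// diagram; the maxAttempts parameter is an assumed retry-policy knob.
class JobStateMachine {
    enum Status { PENDING, RUNNING, COMPLETED, DLQ }

    /** Next state after a worker failure or an expired lock. */
    static Status onFailureOrExpiry(int attemptCount, int maxAttempts) {
        // Requeue while attempts remain; otherwise dead-letter the job.
        return attemptCount < maxAttempts ? Status.PENDING : Status.DLQ;
    }
}
```

The same predicate drives both worker-reported failures and sweeper-detected lock expirations, which is why the two failure paths in the diagram converge on the same pair of targets.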
This sequence diagram illustrates the lifecycle of a single job, including idempotency checks and worker processing.
```mermaid
sequenceDiagram
    participant Client
    participant API as Evora API
    participant DB as Postgres DB
    participant Worker

    Client->>API: POST /jobs (idempotency_key)
    API->>DB: SELECT * WHERE idempotency_key
    alt Job Exists
        DB-->>API: Return existing Job
        API-->>Client: 200 OK (Already Exists)
    else Job Does Not Exist
        API->>DB: INSERT INTO jobs (status=PENDING)
        API-->>Client: 201 Created
    end

    loop Every 10ms
        Worker->>DB: UPDATE jobs SET status=RUNNING... FOR UPDATE SKIP LOCKED
        DB-->>Worker: Returns 1 Claimed Job
    end

    Worker->>Worker: Process Task (Long running)

    loop Every 20 seconds
        Worker->>API: POST /jobs/:id/heartbeat
        API->>DB: UPDATE locked_until = NOW() + 30s
    end

    Worker->>API: POST /jobs/:id/complete
    API->>DB: UPDATE jobs SET status=COMPLETED
```
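The heartbeat loop in the diagram is simple lease bookkeeping. A minimal sketch, assuming the 30-second lock window from the diagram; the class and method names are illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// Lease bookkeeping behind the heartbeat loop. LEASE matches the 30-second
// lock window shown in the diagram; names here are not Evora's actual API.
class Lease {
    static final Duration LEASE = Duration.ofSeconds(30);

    /** New locked_until value when a job is claimed or a heartbeat arrives. */
    static Instant extend(Instant now) {
        return now.plus(LEASE);
    }

    /** A RUNNING job whose lease has lapsed is treated as abandoned. */
    static boolean expired(Instant lockedUntil, Instant now) {
        return lockedUntil.isBefore(now);
    }
}
```

With a 20-second heartbeat against a 30-second lease, a healthy worker always renews with roughly a 10-second margin before expiry, so only a silent (crashed or hung) worker lets the lease lapse.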
Our core mechanic for safely claiming jobs across multiple concurrent workers, without an external broker such as RabbitMQ or Redis, relies on Postgres's native `FOR UPDATE SKIP LOCKED`:
```sql
UPDATE jobs
SET status = 'RUNNING',
    worker_id = ?,
    locked_until = NOW() + (? || ' seconds')::INTERVAL,
    attempt_count = attempt_count + 1
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'PENDING'
      AND queue = ?
      AND scheduled_at <= NOW()
    ORDER BY priority ASC, scheduled_at ASC
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
```

Why this guarantees exactly-once processing:
- **Idempotent Submission:** When a job is submitted, we first check the `idempotency_key`. If it exists, we return the existing job. This prevents the same operation from being queued multiple times.
- **Atomic Claiming:** `FOR UPDATE SKIP LOCKED` allows multiple workers to poll the table simultaneously. Instead of blocking each other waiting for a row lock, they instantly skip over rows already locked by other workers and grab the next available pending job.
- **Visibility Timeouts:** Rather than assuming a worker completed the job, we give it a "lease" (`locked_until`). If the worker crashes before completion, the lease expires. The `VisibilityTimeoutSweeper` will then requeue the job automatically.
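The idempotent-submission check can be illustrated without a database. In this sketch an in-memory map stands in for the `idempotency_key` lookup (in the real system the same effect comes from the check-then-insert shown in the sequence diagram); the class is illustrative:

```java
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the idempotency_key check. putIfAbsent plays the
// role of the database lookup: only the first submission creates a job.
class IdempotentSubmitter {
    private final ConcurrentMap<String, String> jobsByKey = new ConcurrentHashMap<>();

    /** Returns the job id; resubmitting the same key returns the original id. */
    String submit(String idempotencyKey) {
        String fresh = UUID.randomUUID().toString();
        String existing = jobsByKey.putIfAbsent(idempotencyKey, fresh);
        return existing != null ? existing : fresh; // "200 OK" vs "201 Created"
    }
}
```

Retried HTTP requests, double-clicked buttons, and at-least-once upstream events all collapse to a single queued job this way.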
Failure recovery works as follows:

- A worker polls and claims a job. The job status becomes `RUNNING` and a `locked_until` timestamp is set (e.g., 30 seconds into the future).
- The worker processes the job. If processing is slow, the worker periodically sends a heartbeat to extend the `locked_until` timestamp.
- If the worker crashes or hangs, it stops sending heartbeats.
- The `VisibilityTimeoutSweeper` runs periodically (e.g., every 10 seconds), finding jobs where `status = 'RUNNING'` and `locked_until < NOW()`.
- The sweeper resets the job to `PENDING` (requeuing it) and increments the attempt count. If the maximum attempt count is reached, it moves the job to the Dead Letter Queue (DLQ).
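One pass of that sweep can be sketched over in-memory records. The field names follow the `jobs` columns used elsewhere in this document; the classes themselves are illustrative, not Evora's actual implementation:

```java
import java.time.Instant;
import java.util.List;

// Illustrative in-memory version of one VisibilityTimeoutSweeper pass.
class SweeperSketch {
    enum Status { PENDING, RUNNING, COMPLETED, DLQ }

    static class Job {
        Status status;
        Instant lockedUntil;
        int attemptCount;
        Job(Status status, Instant lockedUntil, int attemptCount) {
            this.status = status;
            this.lockedUntil = lockedUntil;
            this.attemptCount = attemptCount;
        }
    }

    /** Requeue expired RUNNING jobs; dead-letter once the attempt cap is hit. */
    static void sweep(List<Job> jobs, Instant now, int maxAttempts) {
        for (Job j : jobs) {
            if (j.status == Status.RUNNING && j.lockedUntil.isBefore(now)) {
                j.attemptCount++; // a timeout counts as a failed attempt
                j.status = j.attemptCount >= maxAttempts ? Status.DLQ : Status.PENDING;
            }
        }
    }
}
```

Jobs with a live lease are untouched, so a slow-but-heartbeating worker is never preempted by the sweeper.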
A few design decisions worth noting:

- **Postgres as a Queue:** While dedicated message brokers are standard, using Postgres simplifies our operational overhead. With `SKIP LOCKED`, Postgres is highly capable of acting as a high-throughput queue while giving us transactional guarantees across business data and queue state.
- **Pull vs. Push:** Workers actively poll the queue instead of the system pushing to them. This inherently provides backpressure: workers only take what they can handle, which prevents downstream systems from being overwhelmed during traffic spikes.
- **Eventually Consistent Telemetry:** We separate operational queuing (Postgres) from telemetry (MongoDB). Domain events like `JobCompletedEvent` are published and projected into Mongo. This offloads expensive aggregation queries from our core Postgres database, keeping queue latency minimal.
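The projection step is an event fold. A minimal sketch of the shape, assuming completion events carry a queue name; the class, method names, and stats schema are illustrative, not Evora's actual projector:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative projector: fold JobCompleted-style events into per-queue
// counters, the kind of pre-aggregated stats the dashboard reads back.
class StatsProjector {
    private final Map<String, Long> completedByQueue = new ConcurrentHashMap<>();

    /** Handle one completion event (queue name assumed to be on the event). */
    void onJobCompleted(String queue) {
        completedByQueue.merge(queue, 1L, Long::sum); // analogue of a Mongo upsert
    }

    long completed(String queue) {
        return completedByQueue.getOrDefault(queue, 0L);
    }
}
```

Because the dashboard reads these rollups instead of scanning the `jobs` table, a heavy stats query never competes with claim transactions for Postgres locks.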
Start the supporting infrastructure:

```shell
docker-compose up -d
```

Run the application:

```shell
mvn clean compile exec:java -Dexec.mainClass="com.evora.EvoraApplication"
```