
Commit 43d4705

Revising some more of my docs.
1 parent ea7e049 commit 43d4705

5 files changed

Lines changed: 273 additions & 265 deletions

additional-docs/sql-process-retry-failure-deep-dive.md

Lines changed: 100 additions & 73 deletions
@@ -2,7 +2,7 @@
## Introduction

-This document provides a focused deep dive into how **Processing, Retry & Failure Orchestration** is implemented in my SpringQueuePro system.
+This document provides a focused deep dive into how **Processing, Retry & Failure Orchestration** is implemented in the SpringQueuePro system.

While the **Task Domain & Persistence** deep dive explains how PostgreSQL enforces correctness and atomic state transitions, and the **Distributed Coordination & Redis Integration** deep dive explains how Redis provides safe mutual exclusion under concurrency, this document focuses on **what happens after a task has been claimed**:

@@ -28,7 +28,7 @@ This document explains and highlights:
- How manual requeueing is supported safely

- Steps taken to mirror production-grade retry semantics

**NOTE**: This document intentionally does **not** re-explain database claim semantics (`QUEUED → IN_PROGRESS`) or Redis lock mechanics. Those topics are covered in the *Task Domain & Persistence* and *Distributed Coordination & Redis Integration* deep dives. The focus here is strictly on **execution flow after claim, failure classification, and retry orchestration**.

---

@@ -43,6 +43,8 @@ This document explains and highlights:
- [Observability of Processing & Retry Behavior](#observability-of-processing--retry-behavior)

- [Steps Taken to Mimic Production Quality](#steps-taken-to-mimic-production-quality)

+---
+
## Why Retry Orchestration Is Critical in SpringQueuePro

SpringQueuePro is designed as a concurrent, multi-worker task system where **tasks may legitimately fail** due to:
SpringQueuePro is designed as a concurrent, multi-worker task system where **tasks may legitimately fail** due to:
@@ -51,7 +53,7 @@ SpringQueuePro is designed as a concurrent, multi-worker task system where **tas
- deterministic failures (simulated via “fail absolute” handlers)

- runtime exceptions during execution

- timeouts or resource contention

-- infrastructure instability (Redis/network/db issues)
+- infrastructure instability (Redis/network/database issues)

In a system like this, retry behavior cannot be an afterthought. If retries are poorly designed:

@@ -63,6 +65,8 @@ In a system like this, retry behavior cannot be an afterthought. If retries are
For these reasons, SpringQueuePro treats retries as a **first-class orchestration concern** and places retry logic in a single authoritative location: `ProcessingService`.

+---
+
## Handlers vs Orchestration: Clear Responsibility Boundaries

SpringQueuePro intentionally distinguishes between:
@@ -77,118 +81,139 @@ public interface TaskHandler {
}
```

-Handlers are intentionally “dumb”:
-- they perform task-specific work (*or at least simulate supposed task-distinct work*)
+Handlers are intentionally simple:
+
+- they perform task-specific work (or simulate task-distinct work)
- they may sleep to simulate real processing latency
- they throw exceptions to signal failure

Handlers do **not**:
+
- mutate database state
- schedule retries
- re-enqueue tasks
- acquire locks
- decide eligibility for retries

-This separation mirrors real production systems where business logic is isolated from runtime policy.
+This separation mirrors real production systems where business logic is isolated from execution policy.
+
+---
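To make the handler/orchestration split concrete, here is a hedged sketch of a handler in this spirit. The document does not show the `TaskHandler` method signature, so the single `handle(String payload)` shape, the class name, and the payload convention below are all assumptions:

```java
// Hypothetical single-method shape; the real TaskHandler signature is not
// shown in this document.
interface TaskHandler {
    void handle(String payload) throws Exception;
}

// A handler only performs (or simulates) work and throws to signal failure.
// It never touches persistence, locks, retries, or queue state.
public class SleepyFailingHandler implements TaskHandler {
    @Override
    public void handle(String payload) throws Exception {
        Thread.sleep(5); // simulate real processing latency
        if (payload.contains("fail")) {
            // a "fail absolute" style handler: deterministic failure signal
            throw new RuntimeException("simulated failure for payload: " + payload);
        }
    }
}
```

Failure here is nothing more than a thrown exception; everything else (marking `FAILED`, backoff, re-enqueueing) belongs to the orchestration layer.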

### Orchestration Layer (Execution Policy)

The orchestration layer (`ProcessingService`) is responsible for:
+
- persisting outcomes (`COMPLETED` / `FAILED`)
- determining retry eligibility
- computing backoff
- scheduling retries safely
- maintaining observability and metrics

This keeps all retry semantics:
+
- centralized
- testable
- predictable
- auditable

+---
+
## ProcessingService as the Policy Engine

`ProcessingService` is the runtime component that converts **a claimed task** into a **deterministic lifecycle outcome**.

-Even though `claimAndProcess` contains claim and locking logic, its most important role is what happens *after execution begins*:
-- wrap handler execution in measurement
-- treat exceptions as control flow signals
-- persist failure explicitly
-- enforce retry policy centrally
+Although `claimAndProcess` contains validation, claim, and locking logic, its most important responsibility begins **after execution starts**:
+
+- wrapping handler execution in measurement
+- treating exceptions as control-flow signals
+- persisting failure deterministically
+- enforcing retry policy centrally
+
+A key design decision is that handler execution occurs inside a tightly controlled boundary:

-A key design decision is that task handlers are executed inside a controlled boundary:
-- execution happens inside a try/catch
-- the orchestration layer is the only place where failure transitions and retries are triggered
-- all state mutations occur via persistence-driven updates and repository methods
+- execution is wrapped in a try/catch
+- orchestration logic exclusively owns failure handling and retries
+- all lifecycle mutations occur via repository-backed persistence
+
+This structure closely resembles execution loops found in:

-This resembles the execution loop found in:
- job queues
-- message consumer frameworks
+- message consumers
- worker schedulers
-- cloud-native retry systems (SQS/Kafka + backoff policies)
+- cloud-native retry systems (e.g., SQS/Kafka-style consumers)
+
+---
## Failure Handling & Deterministic Persistence

When handler execution throws an exception (including `TaskProcessingException`), `ProcessingService` handles it deterministically:

1. The task is marked `FAILED` in the domain model.
-2. The entity is updated from the domain model (`taskMapper.updateEntity(...)`).
+2. The persistence entity is updated via `TaskMapper`.
3. The failed task is persisted immediately (`taskRepository.save(...)`).
-4. The cache is synchronized (`cache.put(...)`).
-5. Failure counters and events are recorded.
+4. The Redis cache is synchronized (`cache.put(...)`).
+5. Failure counters and runtime events are recorded.
+
+This ensures that failures are:

-This means failures are:
- explicit
- durable
- observable
- never silent

-Importantly, failure persistence happens **before retry scheduling**. This ensures that:
-- retries are never scheduled without first recording the failure state
-- dashboards and inspection APIs always reflect real system behavior
+Crucially, failure persistence occurs **before any retry scheduling**. This guarantees that:
+
+- retries are never scheduled without recorded failure
+- dashboards and APIs reflect real system state
- the system remains debuggable during heavy retry activity
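The ordering above can be sketched end to end in a few lines. The `Task` shape and the string audit trail standing in for `taskRepository.save(...)`, `cache.put(...)`, and event recording are hypothetical simplifications, not the real SpringQueuePro types:

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of the failure-handling order described above:
// record the FAILED state first, then (and only then) consider a retry.
public class FailureFlowSketch {
    enum Status { QUEUED, IN_PROGRESS, COMPLETED, FAILED }

    static class Task {
        Status status = Status.IN_PROGRESS;
        int attempts = 1;
        int maxRetries = 3;
    }

    // Returns the audit trail so the ordering is visible.
    static List<String> onFailure(Task task) {
        List<String> audit = new ArrayList<>();
        task.status = Status.FAILED;           // 1. mark FAILED in the domain model
        audit.add("persisted FAILED");         // 2-3. update entity + save (simulated)
        audit.add("cache synced");             // 4. cache.put(...) (simulated)
        if (task.attempts < task.maxRetries) { // retry eligibility, checked last
            task.status = Status.QUEUED;       // explicit FAILED -> QUEUED
            audit.add("RETRY_SCHEDULED");      // 5. counters/events (simulated)
        }
        return audit;
    }

    public static void main(String[] args) {
        Task task = new Task();
        System.out.println(onFailure(task)); // failure recorded before any retry
        System.out.println(task.status);
    }
}
```

Even in this toy version, a crash between "persisted FAILED" and "RETRY_SCHEDULED" leaves a durable, inspectable `FAILED` record rather than a silently lost task.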

---

## Automatic Retries with Exponential Backoff

-After persisting a failure, SpringQueuePro evaluates retry eligibility:
+After persisting a failure, `ProcessingService` evaluates retry eligibility:

```java
if (claimed.getAttempts() < claimed.getMaxRetries()) { ... }
```

-If eligible:
+If the task is eligible for retry:

-1. The retry delay is computed via `computeBackoffMs(attempts)`
+1. A delay is computed using `computeBackoffMs(attempts)`
2. The task is transitioned from `FAILED → QUEUED`
-3. Retry is scheduled using the internal scheduler
+3. Retry execution is scheduled via the scheduler
4. Retry metrics and events are recorded

+---
+
### Why FAILED → QUEUED Is Explicit

-This is not cosmetic. It is a **correctness requirement**.
+This transition is not cosmetic—it is required for correctness.

-The claim step only succeeds for `QUEUED` tasks. If a failed task remained `FAILED`, it would never be eligible for re-claim.
+The claim mechanism only operates on `QUEUED` tasks. A task that remains `FAILED` cannot be reclaimed or retried.

-SpringQueuePro therefore treats “re-queued retry” as a **real state transition**, not as an invisible runtime trick.
+SpringQueuePro therefore treats retries as **real lifecycle transitions**, not invisible runtime shortcuts.
+
+---

### Exponential Backoff Policy

-The backoff function:
+The backoff computation:

```java
return (long) (1000 * Math.pow(2, Math.max(0, attempts - 1)));
```

-This mirrors real-world retry semantics because it:
+This mirrors production retry semantics because it:

- reduces pressure on downstream systems
-- prevents lockstep retry waves
-- avoids hammering infrastructure during failure spikes
-- stabilizes recovery under load
+- avoids synchronized retry waves
+- prevents infrastructure hammering
+- stabilizes recovery under failure spikes

-The goal is not just to “retry”—it is to retry **safely and predictably**.
+The objective is not merely to retry, but to retry **safely and predictably**.

+---
+
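The return expression below is taken verbatim from the document; the surrounding class is illustrative scaffolding (the text names `computeBackoffMs`, but its exact declaration is not shown):

```java
// Exponential backoff: 1 s, 2 s, 4 s, 8 s, ... doubling with each attempt.
public class RetryBackoff {
    static long computeBackoffMs(int attempts) {
        // attempts = 1 waits 1000 ms; Math.max guards against non-positive input
        return (long) (1000 * Math.pow(2, Math.max(0, attempts - 1)));
    }

    public static void main(String[] args) {
        for (int attempt = 1; attempt <= 4; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + computeBackoffMs(attempt) + " ms");
        }
    }
}
```

Successive failures thus wait 1 s, 2 s, 4 s, and so on, spreading retries out instead of re-firing in lockstep.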

### Why Scheduling Lives in ProcessingService


@@ -198,87 +223,89 @@ Retry scheduling is performed via:
scheduler.schedule(() -> queueService.enqueueById(taskId), delayMs, TimeUnit.MILLISECONDS);
```

-This is intentionally routed through:
+This design is intentional:

-- the scheduler for delay control
-- the queue service for controlled submission
-- the same orchestration entry point used for normal execution
+- delays are controlled by the scheduler
+- execution re-enters via `QueueService`
+- retries follow the exact same execution path as initial runs

-This avoids creating hidden “alternate execution paths” for retries.
+This avoids hidden or bypassed execution paths.

---

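The quoted call can be exercised in isolation with the JDK scheduler. The `Runnable` below stands in for `queueService.enqueueById(taskId)`; the real `QueueService` wiring is not reproduced here, and the task id is hypothetical:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Delayed re-enqueue sketch: the delay comes from the backoff policy, and
// the retry re-enters the normal submission path instead of running inline.
public class RetryScheduleSketch {
    private static final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread t = new Thread(runnable);
                t.setDaemon(true); // let this demo JVM exit cleanly
                return t;
            });

    static void scheduleRetry(Runnable reEnqueue, long delayMs) {
        scheduler.schedule(reEnqueue, delayMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        String taskId = "task-42"; // hypothetical id
        scheduleRetry(() -> System.out.println("re-enqueue " + taskId), 50);
        Thread.sleep(200); // give the delayed submission time to fire
    }
}
```

Because the scheduled work is just a submission, a retried task goes through the same claim, lock, and measurement steps as a brand-new one.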

## Manual Requeueing (Operator-Controlled Retry)

-SpringQueuePro supports explicit operator-style retry through:
+SpringQueuePro supports explicit, operator-style retry through:

```java
public boolean manuallyRequeue(String taskId)
```

-Manual requeueing is intentionally strict:
+Manual requeueing is deliberately strict:

-- Only `FAILED` tasks may be manually requeued
-- The state transition is enforced in the database (`FAILED → QUEUED`)
-- Attempts are reset to 0 (to re-enable retry semantics cleanly)
-- Cache is synchronized
-- A `TaskCreatedEvent` is published
+- only `FAILED` tasks are eligible
+- state transition (`FAILED → QUEUED`) is enforced in the database
+- attempts are reset to zero
+- Redis cache is synchronized
+- a `TaskCreatedEvent` is published

-Publishing the event (rather than directly calling `enqueueById`) is deliberate: it ensures runtime submission happens only after the transaction commits, preventing “phantom execution” of tasks whose state hasn’t actually been persisted yet.
+Publishing the event, rather than calling `enqueueById` directly, is deliberate. It ensures execution occurs only **after the transaction commits**, preventing execution of tasks whose state is not yet durable.

-This mirrors real production systems where operational commands (manual retry) are handled through the same lifecycle mechanisms as normal execution.
+This mirrors production systems where operational actions reuse the same lifecycle pathways as normal execution.
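These rules are easy to capture in a small sketch. `Task`, `Status`, and the in-memory mutation below are hypothetical stand-ins for the real entity and its database-enforced transition; cache sync and event publication are reduced to a comment:

```java
// Sketch of the strict manual-requeue rules listed above.
public class ManualRequeueSketch {
    enum Status { QUEUED, IN_PROGRESS, COMPLETED, FAILED }

    static class Task {
        Status status;
        int attempts;
        Task(Status status, int attempts) {
            this.status = status;
            this.attempts = attempts;
        }
    }

    // Mirrors the shape of manuallyRequeue(taskId): false unless FAILED.
    static boolean manuallyRequeue(Task task) {
        if (task.status != Status.FAILED) {
            return false;             // only FAILED tasks are eligible
        }
        task.status = Status.QUEUED;  // enforced FAILED -> QUEUED transition
        task.attempts = 0;            // reset attempts to re-enable retries
        // (real system: sync the Redis cache, then publish TaskCreatedEvent
        //  so submission happens only after the transaction commits)
        return true;
    }

    public static void main(String[] args) {
        Task failed = new Task(Status.FAILED, 3);
        Task running = new Task(Status.IN_PROGRESS, 1);
        System.out.println(manuallyRequeue(failed));  // eligible
        System.out.println(manuallyRequeue(running)); // rejected
    }
}
```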

---

## Observability of Processing & Retry Behavior

-SpringQueuePro makes processing and retry behavior explicitly visible through:
+SpringQueuePro exposes processing and retry behavior explicitly via:

### 1. Metrics

- submitted, claimed, completed, failed, retried counters
- execution timing via `processingTimer.recordCallable(...)`

-This enables:
+These enable:

- throughput analysis
-- failure rate analysis
-- latency distribution inspection
+- failure rate inspection
+- latency distribution tracking
- retry volume monitoring

+---
+
### 2. Runtime Event Log

-A bounded event log captures key execution transitions:
+A bounded in-memory event log captures execution transitions:

- CLAIM_START / CLAIM_SUCCESS
- PROCESSING
- FAILED / RETRY_SCHEDULED
- COMPLETED
- LOCK_RELEASE

-This log exists to support:
+This supports:

- dashboard visibility
-- debugging without logs
-- replaying recent runtime history during demos or incidents
+- debugging without backend logs
+- replaying recent runtime behavior during demos or incidents

---
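A bounded log of this kind can be sketched with a deque. The capacity, event names, and synchronization choice below are illustrative, not the actual SpringQueuePro implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Bounded in-memory event log: oldest entries are evicted so memory use
// stays fixed no matter how long the worker runs.
public class BoundedEventLog {
    private final Deque<String> events = new ArrayDeque<>();
    private final int capacity;

    public BoundedEventLog(int capacity) {
        this.capacity = capacity;
    }

    public synchronized void record(String event) {
        if (events.size() == capacity) {
            events.removeFirst(); // evict the oldest entry to stay bounded
        }
        events.addLast(event);
    }

    public synchronized List<String> snapshot() {
        return new ArrayList<>(events); // copy for safe iteration by dashboards
    }

    public static void main(String[] args) {
        BoundedEventLog log = new BoundedEventLog(3);
        for (String e : new String[]{"CLAIM_START", "CLAIM_SUCCESS", "PROCESSING", "COMPLETED"}) {
            log.record(e);
        }
        System.out.println(log.snapshot()); // oldest event has been evicted
    }
}
```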

## Steps Taken to Mimic Production Quality

### 1. Centralized Retry Policy

-Retries are enforced in one place, not scattered across handlers. This prevents:
+Retries are enforced in one location, not scattered across handlers. This prevents:

-- inconsistent retry semantics
+- inconsistent semantics
- handler complexity
-- hidden execution paths
+- hidden execution behavior

---

### 2. Failure Is Always Persisted First

-Failures are recorded durably before retries are scheduled. This ensures:
+Failures are durably recorded before retries are scheduled, ensuring:

- correctness under crash scenarios
- consistent observability
@@ -288,24 +315,24 @@ Failures are recorded durably before retries are scheduled. This ensures:
### 3. Backoff Is Applied Deliberately

-Exponential backoff prevents retry storms and stabilizes the system under repeated failure. This mirrors real cloud-native patterns.
+Exponential backoff stabilizes the system and prevents retry storms, mirroring cloud-native patterns.

---

### 4. Handlers Are Pure Units of Work

-Handlers do not coordinate retries or persistence. They either:
+Handlers either:

-- complete successfully
+- execute work
- throw exceptions

-This is a clean and scalable separation of concerns.
+They do not coordinate retries or persistence.

---

### 5. Retry Scheduling Uses Controlled Runtime Paths

-Retries re-enter the system through the same submission pipeline (`enqueueById`), ensuring:
+Retries re-enter through the same submission pipeline (`enqueueById`), ensuring:

- consistent execution boundaries
- uniform observability
@@ -317,14 +344,14 @@ Retries re-enter the system through the same submission pipeline (`enqueueById`)
SpringQueuePro treats task processing as an orchestration problem, not a handler problem.

-Handlers do work. Failures are persisted. Retries are policy-driven. Backoff is deliberate. Execution re-enters the pipeline through controlled scheduling.
+Handlers execute work. Failures are persisted. Retries are policy-driven. Backoff is deliberate. Execution re-enters through controlled scheduling.

-This design yields task execution behavior that is:
+This yields processing behavior that is:

- deterministic
- scalable
- observable
-- production-adjacent in its operational semantics
+- production-adjacent in operational semantics

In short:
