Problem
The current install_panic_hook in TaskManager treats all panics equally: any thread panic triggers a full shutdown of the binary. This is correct for critical tasks like the server loop, but overly aggressive for non-critical background tasks (e.g., persistence flush, size calculation), where a panic could be logged and the task retried on the next interval.
Proposed Solution
Replace the global panic hook approach with per-task panic handling using std::panic::catch_unwind (via AssertUnwindSafe) inside spawn_task_loop and spawn_blocking.
pub enum TaskCriticality {
    /// Panic triggers shutdown of all tasks (e.g., server loop)
    Critical,
    /// Panic is logged, task continues on next loop iteration (e.g., persistence, size calc)
    Recoverable,
}
- Critical tasks: on panic, cancel the CancellationToken to trigger graceful shutdown
- Recoverable tasks: on panic, log the error and continue the loop (return TaskState::Continue)
This requires adding a criticality() method to the Task and BlockingTask traits (defaulting to Critical for safety).
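A minimal sketch of how the per-task handling could look, using only std. The Task trait here is a simplified, hypothetical stand-in for the real trait: run_once, run_guarded, and the AtomicBool shutdown flag (standing in for the CancellationToken) are assumptions made for illustration, not the existing API.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskCriticality {
    Critical,
    Recoverable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Continue,
    Stop,
}

/// Hypothetical, simplified stand-in for the real Task trait.
pub trait Task {
    /// One loop iteration (assumed shape; the real trait may differ).
    fn run_once(&mut self) -> TaskState;

    /// Defaults to Critical so existing tasks keep today's behaviour.
    fn criticality(&self) -> TaskCriticality {
        TaskCriticality::Critical
    }
}

/// Runs one iteration and converts a panic into a TaskState.
/// `shutdown` is an assumed stand-in for cancelling the CancellationToken.
pub fn run_guarded<T: Task>(task: &mut T, shutdown: &AtomicBool) -> TaskState {
    match panic::catch_unwind(AssertUnwindSafe(|| task.run_once())) {
        Ok(state) => state,
        Err(payload) => {
            // Panic payloads are usually &str or String; fall back otherwise.
            let msg = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "non-string panic payload".to_string());
            match task.criticality() {
                TaskCriticality::Critical => {
                    eprintln!("critical task panicked: {msg}; shutting down");
                    shutdown.store(true, Ordering::SeqCst);
                    TaskState::Stop
                }
                TaskCriticality::Recoverable => {
                    eprintln!("recoverable task panicked: {msg}; will retry");
                    TaskState::Continue
                }
            }
        }
    }
}
```

Two caveats apply to this sketch. A panic hook runs before unwinding reaches catch_unwind, so a hook that triggers shutdown would still fire for Recoverable tasks unless it is removed or made criticality-aware. catch_unwind also only catches unwinding panics, so it has no effect when the binary is built with panic = "abort". AssertUnwindSafe only silences the compile-time check; a Recoverable task still has to tolerate whatever state a panicking iteration left behind.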
Context
This came out of the papaya panic issue (#318). We added a global panic hook (#323) to prevent the binary from staying in limbo after a worker thread panic. That fix is correct as a baseline, but we should be more granular about which panics are actually fatal.