fix(passive-health): break the 503 deadlock + release 1.0.6 by ZhiXiao-Lin · Pull Request #4 · A3S-Lab/Gateway

ZhiXiao-Lin · 2026-06-01T02:04:19Z

Problem

A single transient burst of SendRequest/5xx errors could take a whole service down indefinitely. Once a backend exceeded the error threshold it was marked unhealthy and dropped from rotation, but recovery only happened inside record_success — and an unhealthy backend receives no traffic, so no success ever arrived. The backend stayed 503 until the gateway was manually restarted.

This is the root cause of the recent edge outage where a3s-web/api returned 503 while the pods themselves were healthy.
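The deadlock described above can be reproduced in a few lines. This is a minimal sketch, not the gateway's code: the `Backend` struct, its fields, the `threshold` parameter and the `pick` function are all hypothetical names chosen for illustration. Only `record_success` and the "unhealthy backends get no traffic" rule come from the description.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Hypothetical stand-in for a backend with a passive health flag.
pub struct Backend {
    healthy: AtomicBool,
    errors: AtomicU32,
}

impl Backend {
    pub fn new() -> Self {
        Self {
            healthy: AtomicBool::new(true),
            errors: AtomicU32::new(0),
        }
    }

    pub fn is_healthy(&self) -> bool {
        self.healthy.load(Ordering::SeqCst)
    }

    /// Count an error; once the (assumed) threshold is reached the
    /// backend is marked unhealthy and leaves rotation.
    pub fn record_error(&self, threshold: u32) {
        if self.errors.fetch_add(1, Ordering::SeqCst) + 1 >= threshold {
            self.healthy.store(false, Ordering::SeqCst);
        }
    }

    /// The only recovery path in the pre-fix design.
    pub fn record_success(&self) {
        self.errors.store(0, Ordering::SeqCst);
        self.healthy.store(true, Ordering::SeqCst);
    }
}

/// Route a request: only healthy backends are eligible.
/// `None` corresponds to the gateway answering 503.
pub fn pick(backends: &[Backend]) -> Option<&Backend> {
    backends.iter().find(|b| b.is_healthy())
}
```

Once `record_error` trips the flag, `pick` never returns that backend again, so nothing can ever call `record_success` on it. The health flag has no other writer, which is why the state persisted until a restart.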

Fix

A background recovery ticker drives a half-open probe: after recovery_time elapses the backend is re-enabled so it receives traffic again. If it is still broken the next errors re-mark it; otherwise record_success keeps it healthy.
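The half-open behaviour can be sketched as follows. This is an illustration under stated assumptions rather than the merged implementation: the `PassiveHealth` type, the field names, the index-based API and the explicit `now` argument (used here so the logic can be exercised without sleeping) are hypothetical. Resetting the error count on recovery is also an assumption; the description only says that further errors re-mark the backend.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical backend exposing only a health flag.
pub struct Backend {
    healthy: AtomicBool,
}

impl Backend {
    pub fn new() -> Self {
        Self { healthy: AtomicBool::new(true) }
    }
    pub fn is_healthy(&self) -> bool {
        self.healthy.load(Ordering::SeqCst)
    }
    fn set_healthy(&self, value: bool) {
        self.healthy.store(value, Ordering::SeqCst);
    }
}

/// Per-backend error state. Holding an Arc<Backend> lets the recovery
/// path flip the flag without a request being in flight.
struct BackendErrors {
    backend: Arc<Backend>,
    errors: u32,
    unhealthy_since: Option<Instant>,
}

pub struct PassiveHealth {
    threshold: u32,
    recovery_time: Duration,
    entries: Mutex<Vec<BackendErrors>>,
}

impl PassiveHealth {
    pub fn new(backends: Vec<Arc<Backend>>, threshold: u32, recovery_time: Duration) -> Self {
        let entries = backends
            .into_iter()
            .map(|backend| BackendErrors { backend, errors: 0, unhealthy_since: None })
            .collect();
        Self { threshold, recovery_time, entries: Mutex::new(entries) }
    }

    /// Count an error; at the threshold, mark unhealthy and stamp the time.
    pub fn record_error(&self, idx: usize, now: Instant) {
        let mut entries = self.entries.lock().unwrap();
        let e = &mut entries[idx];
        e.errors += 1;
        if e.errors >= self.threshold {
            e.backend.set_healthy(false);
            e.unhealthy_since = Some(now);
        }
    }

    /// A success clears the error state and keeps the backend healthy.
    pub fn record_success(&self, idx: usize) {
        let mut entries = self.entries.lock().unwrap();
        let e = &mut entries[idx];
        e.errors = 0;
        e.unhealthy_since = None;
        e.backend.set_healthy(true);
    }

    /// Re-enable every backend whose recovery window has elapsed.
    /// Returns how many were re-enabled (assumed return value).
    pub fn recover_expired(&self, now: Instant) -> usize {
        let mut entries = self.entries.lock().unwrap();
        let mut recovered = 0;
        for e in entries.iter_mut() {
            if let Some(since) = e.unhealthy_since {
                if now.saturating_duration_since(since) >= self.recovery_time {
                    e.backend.set_healthy(true);
                    e.errors = 0; // assumption: the probe starts from a clean count
                    e.unhealthy_since = None;
                    recovered += 1;
                }
            }
        }
        recovered
    }
}
```

The key property is that `recover_expired` depends only on elapsed time, not on traffic, so a backend that has been removed from rotation still gets a chance to prove itself.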

BackendErrors now keeps an Arc<Backend> so the ticker can flip the health flag without live traffic.
recover_expired() re-enables any backend past its recovery window.
spawn_recovery() starts a Weak-ref ticker (exits when the checker is dropped on config reload — no task accumulation) and no-ops when there is no Tokio runtime (unit-test builders).
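The Weak-reference lifetime pattern behind the ticker can be sketched with the standard library alone. Note the substitutions: the PR describes a Tokio task, while this sketch uses `std::thread` to stay dependency-free, and the `Checker` struct with its `ticks` counter is a hypothetical placeholder for the real checker. The Tokio runtime check is not modelled here; presumably it relies on something like `tokio::runtime::Handle::try_current()`, but that is an assumption.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Weak};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical checker; `ticks` counts calls to recover_expired().
pub struct Checker {
    pub ticks: AtomicUsize,
}

impl Checker {
    pub fn recover_expired(&self) {
        self.ticks.fetch_add(1, Ordering::SeqCst);
    }
}

/// Start a ticker that holds only a Weak reference to the checker,
/// so the ticker itself never keeps the checker alive.
pub fn spawn_recovery(checker: &Arc<Checker>, interval: Duration) -> JoinHandle<()> {
    let weak: Weak<Checker> = Arc::downgrade(checker);
    thread::spawn(move || loop {
        thread::sleep(interval);
        // upgrade() returns None once every strong reference is gone,
        // e.g. after a config reload replaced the checker.
        match weak.upgrade() {
            Some(checker) => checker.recover_expired(),
            None => break,
        }
    })
}
```

Because the loop exits on a failed `upgrade()`, each replaced checker takes its ticker down with it, and repeated reloads do not pile up background tasks.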

Verification

cargo fmt --check clean, cargo clippy no warnings.
19 passive_health + 13 builders unit tests pass, incl. new test_recover_expired_reenables_after_recovery_time.

Release 1.0.6 (Cargo.toml + CHANGELOG).

Once a backend exceeded the error threshold it was marked unhealthy and dropped from rotation, but recovery only happened inside record_success. An unhealthy backend receives no traffic, so no success ever arrived and the service stayed 503 until the gateway was restarted — a single transient burst of SendRequest/5xx errors could take a whole service down indefinitely. A background recovery ticker now drives a half-open probe: after recovery_time elapses the backend is re-enabled so it receives traffic again; if still broken the next errors re-mark it, otherwise it stays healthy. BackendErrors keeps an Arc<Backend> so the ticker can flip the flag without live traffic. The ticker holds a Weak ref and exits when its checker is dropped (config reload), avoiding task accumulation. Release 1.0.6.

ZhiXiao-Lin merged commit 710d743 into main Jun 1, 2026
5 of 6 checks passed

ZhiXiao-Lin deleted the fix/passive-health-deadlock branch June 1, 2026 02:10

ZhiXiao-Lin merged 1 commit into main from fix/passive-health-deadlock
