fix(passive-health): break the 503 deadlock + release 1.0.6 #4

Merged
ZhiXiao-Lin merged 1 commit into main from fix/passive-health-deadlock on Jun 1, 2026

@ZhiXiao-Lin
Contributor

Problem

A single transient burst of SendRequest/5xx errors could take a whole service down indefinitely. Once a backend exceeded the error threshold it was marked unhealthy and dropped from rotation, but recovery only happened inside record_success. An unhealthy backend receives no traffic, so no success ever arrived, and the backend kept returning 503 until the gateway was manually restarted.

This is the root cause of the recent edge outage where a3s-web/api returned 503 while the pods themselves were healthy.
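The deadlock described above can be reproduced with a minimal sketch. This is not the gateway's code: the `Backend` fields, the `route` helper, the `threshold` parameter, the `>=` comparison, and the 502 status for a failed upstream call are all assumptions made for illustration. Only the name record_success and the 503 behaviour come from this PR.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Hypothetical stand-in for a gateway backend (not the real type).
pub struct Backend {
    healthy: AtomicBool,
    errors: AtomicU32,
}

impl Backend {
    pub fn new() -> Self {
        Backend {
            healthy: AtomicBool::new(true),
            errors: AtomicU32::new(0),
        }
    }

    pub fn is_healthy(&self) -> bool {
        self.healthy.load(Ordering::SeqCst)
    }

    /// Count one error; mark the backend unhealthy at the threshold.
    pub fn record_error(&self, threshold: u32) {
        let seen = self.errors.fetch_add(1, Ordering::SeqCst) + 1;
        if seen >= threshold {
            self.healthy.store(false, Ordering::SeqCst);
        }
    }

    /// The only recovery path before the fix: it needs a successful request.
    pub fn record_success(&self) {
        self.errors.store(0, Ordering::SeqCst);
        self.healthy.store(true, Ordering::SeqCst);
    }
}

/// Route one request and return the status the client would see.
/// `upstream_ok` says whether the backend would have answered correctly.
pub fn route(backend: &Backend, upstream_ok: bool, threshold: u32) -> u16 {
    if !backend.is_healthy() {
        // Dropped from rotation: the request never reaches the backend,
        // so record_success can never run again.
        return 503;
    }
    if upstream_ok {
        backend.record_success();
        200
    } else {
        backend.record_error(threshold);
        502
    }
}
```

After three failed calls with a threshold of 3, every later call to `route` returns 503 even when `upstream_ok` is true, because the success path is unreachable.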

Fix

A background recovery ticker drives a half-open probe: after recovery_time elapses, the backend is re-enabled so it receives traffic again. If it is still broken, the next errors mark it unhealthy again; otherwise record_success keeps it healthy.

  • BackendErrors now keeps an Arc<Backend> so the ticker can flip the health flag without live traffic.
  • recover_expired() re-enables any backend past its recovery window.
  • spawn_recovery() starts a ticker that holds only a Weak reference to the checker, so it exits when the checker is dropped on config reload and tasks do not accumulate. It is a no-op when there is no Tokio runtime (unit-test builders).
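The three points above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the merged implementation: the `PassiveHealth` struct, its fields, `track`, `mark_unhealthy`, and the `tick` parameter are hypothetical, and a plain thread stands in for the Tokio task the PR describes. Only the names BackendErrors, recover_expired, spawn_recovery, and recovery_time come from the PR.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex, Weak};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical backend with a shared health flag.
pub struct Backend {
    pub healthy: AtomicBool,
}

/// Per-backend state; holding an Arc lets the ticker flip the flag
/// without any live traffic.
struct BackendErrors {
    backend: Arc<Backend>,
    unhealthy_since: Option<Instant>,
}

/// Hypothetical checker type (the real struct layout is not shown in the PR).
pub struct PassiveHealth {
    recovery_time: Duration,
    state: Mutex<HashMap<String, BackendErrors>>,
}

impl PassiveHealth {
    pub fn new(recovery_time: Duration) -> Self {
        PassiveHealth { recovery_time, state: Mutex::new(HashMap::new()) }
    }

    pub fn track(&self, name: &str, backend: Arc<Backend>) {
        let entry = BackendErrors { backend, unhealthy_since: None };
        self.state.lock().unwrap().insert(name.to_string(), entry);
    }

    pub fn mark_unhealthy(&self, name: &str, now: Instant) {
        if let Some(e) = self.state.lock().unwrap().get_mut(name) {
            e.backend.healthy.store(false, Ordering::SeqCst);
            e.unhealthy_since = Some(now);
        }
    }

    /// Re-enable every backend whose recovery window has elapsed.
    /// Returns how many backends were re-enabled.
    pub fn recover_expired(&self, now: Instant) -> usize {
        let mut recovered = 0;
        for e in self.state.lock().unwrap().values_mut() {
            if let Some(since) = e.unhealthy_since {
                if now.saturating_duration_since(since) >= self.recovery_time {
                    e.backend.healthy.store(true, Ordering::SeqCst);
                    e.unhealthy_since = None;
                    recovered += 1;
                }
            }
        }
        recovered
    }
}

/// Start a ticker that holds only a Weak reference to the checker.
/// Once the last Arc is dropped, upgrade() fails and the loop ends.
pub fn spawn_recovery(checker: &Arc<PassiveHealth>, tick: Duration) -> thread::JoinHandle<()> {
    let weak: Weak<PassiveHealth> = Arc::downgrade(checker);
    thread::spawn(move || loop {
        thread::sleep(tick);
        match weak.upgrade() {
            Some(c) => {
                c.recover_expired(Instant::now());
            }
            None => break,
        }
    })
}
```

Passing `now` into `recover_expired` keeps the expiry check deterministic in tests. The Weak reference is what prevents a stale ticker from surviving each config reload: the old checker is dropped and its ticker ends on the next tick.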

Verification

  • cargo fmt --check is clean; cargo clippy reports no warnings.
  • 19 passive_health and 13 builders unit tests pass, including the new test_recover_expired_reenables_after_recovery_time.

Release 1.0.6 (Cargo.toml + CHANGELOG).

Once a backend exceeded the error threshold it was marked unhealthy and
dropped from rotation, but recovery only happened inside record_success.
An unhealthy backend receives no traffic, so no success ever arrived and
the service stayed 503 until the gateway was restarted — a single
transient burst of SendRequest/5xx errors could take a whole service
down indefinitely.

A background recovery ticker now drives a half-open probe: after
recovery_time elapses the backend is re-enabled so it receives traffic
again; if still broken the next errors re-mark it, otherwise it stays
healthy. BackendErrors keeps an Arc<Backend> so the ticker can flip the
flag without live traffic. The ticker holds a Weak ref and exits when
its checker is dropped (config reload), avoiding task accumulation.

Release 1.0.6.
@ZhiXiao-Lin ZhiXiao-Lin merged commit 710d743 into main Jun 1, 2026
5 of 6 checks passed
@ZhiXiao-Lin ZhiXiao-Lin deleted the fix/passive-health-deadlock branch June 1, 2026 02:10