fix(passive-health): break the 503 deadlock + release 1.0.6 #4
Merged
Once a backend exceeded the error threshold it was marked unhealthy and dropped from rotation, but recovery only happened inside record_success. An unhealthy backend receives no traffic, so no success ever arrived and the service stayed 503 until the gateway was restarted — a single transient burst of SendRequest/5xx errors could take a whole service down indefinitely. A background recovery ticker now drives a half-open probe: after recovery_time elapses the backend is re-enabled so it receives traffic again; if still broken the next errors re-mark it, otherwise it stays healthy. BackendErrors keeps an Arc<Backend> so the ticker can flip the flag without live traffic. The ticker holds a Weak ref and exits when its checker is dropped (config reload), avoiding task accumulation. Release 1.0.6.
Problem
A single transient burst of `SendRequest`/5xx errors could take a whole service down indefinitely. Once a backend exceeded the error threshold it was marked unhealthy and dropped from rotation, but recovery only happened inside `record_success`, and an unhealthy backend receives no traffic, so no success ever arrived. The backend stayed `503` until the gateway was manually restarted.

This is the root cause of the recent edge outage where a3s-web/api returned 503 while the pods themselves were healthy.
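The deadlock can be reproduced with a small model. The following is a minimal sketch, not the gateway's code: `PassiveChecker`, `route`, `simulate`, the threshold of 3, the `>=` comparison, and the `502` returned for a failed upstream call are all assumptions made for illustration. Only `record_success` and the "unhealthy backends get no traffic" rule come from the description above.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

// Hypothetical stand-in for the gateway's backend type.
struct Backend {
    healthy: AtomicBool,
}

// Hypothetical passive checker modelling the pre-fix behaviour.
struct PassiveChecker {
    error_threshold: u32,
    errors: AtomicU32,
}

impl PassiveChecker {
    // Count an error and mark the backend unhealthy at the threshold.
    fn record_error(&self, backend: &Backend) {
        let seen = self.errors.fetch_add(1, Ordering::SeqCst) + 1;
        if seen >= self.error_threshold {
            backend.healthy.store(false, Ordering::SeqCst);
        }
    }

    // The only recovery path in this model: it needs a successful request.
    fn record_success(&self, backend: &Backend) {
        self.errors.store(0, Ordering::SeqCst);
        backend.healthy.store(true, Ordering::SeqCst);
    }
}

// Route one request and return the status the client would see.
fn route(checker: &PassiveChecker, backend: &Backend, upstream_ok: bool) -> u16 {
    if !backend.healthy.load(Ordering::SeqCst) {
        // Out of rotation: the upstream is never contacted,
        // so record_success can never run.
        return 503;
    }
    if upstream_ok {
        checker.record_success(backend);
        200
    } else {
        checker.record_error(backend);
        502 // assumed status for a failed upstream call
    }
}

// Three failing requests, then three requests after the upstream has recovered.
fn simulate() -> Vec<u16> {
    let backend = Backend { healthy: AtomicBool::new(true) };
    let checker = PassiveChecker { error_threshold: 3, errors: AtomicU32::new(0) };
    let mut statuses = Vec::new();
    for _ in 0..3 {
        statuses.push(route(&checker, &backend, false));
    }
    for _ in 0..3 {
        statuses.push(route(&checker, &backend, true));
    }
    statuses
}
```

Calling `simulate()` yields `[502, 502, 502, 503, 503, 503]`: the last three requests would have succeeded upstream, but they are rejected before reaching it, so the health flag is never reset.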
Fix
A background recovery ticker drives a half-open probe: after `recovery_time` elapses the backend is re-enabled so it receives traffic again. If it is still broken the next errors re-mark it; otherwise `record_success` keeps it healthy.

- `BackendErrors` now keeps an `Arc<Backend>` so the ticker can flip the health flag without live traffic.
- `recover_expired()` re-enables any backend past its recovery window.
- `spawn_recovery()` starts a `Weak`-ref ticker (it exits when the checker is dropped on config reload, so no tasks accumulate) and no-ops when there is no Tokio runtime (unit-test builders).

Verification

- `cargo fmt --check` clean, `cargo clippy` no warnings.
- `test_recover_expired_reenables_after_recovery_time`.

Release 1.0.6 (Cargo.toml + CHANGELOG).
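For readers who want to see the recovery mechanism from the Fix section in code, here is a rough sketch using only the standard library. It is not the merged implementation. `BackendErrors`, `Arc<Backend>`, `recover_expired`, `recovery_time`, and the `Weak` reference come from this PR; the `PassiveChecker` name, the `state` and `unhealthy_since` fields, `mark_unhealthy`, and `tick` are hypothetical. The real ticker is described as a Tokio task; this sketch leaves out the async loop and passes the current time in explicitly so the behaviour is deterministic.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex, Weak};
use std::time::{Duration, Instant};

// Hypothetical stand-in for the gateway's backend type.
struct Backend {
    healthy: AtomicBool,
}

// Per-backend error state. Holding an Arc<Backend> lets the ticker
// flip the health flag without any request flowing through.
struct BackendErrors {
    backend: Arc<Backend>,
    unhealthy_since: Option<Instant>, // assumed field name
}

struct PassiveChecker {
    recovery_time: Duration,
    state: Mutex<HashMap<String, BackendErrors>>, // assumed layout
}

impl PassiveChecker {
    // Hypothetical helper: take a backend out of rotation at time `now`.
    fn mark_unhealthy(&self, name: &str, backend: Arc<Backend>, now: Instant) {
        backend.healthy.store(false, Ordering::SeqCst);
        self.state.lock().unwrap().insert(
            name.to_string(),
            BackendErrors { backend, unhealthy_since: Some(now) },
        );
    }

    // Re-enable every backend whose recovery window has elapsed.
    // Returns how many backends were put back into rotation.
    fn recover_expired(&self, now: Instant) -> usize {
        let mut state = self.state.lock().unwrap();
        let mut recovered = 0;
        for entry in state.values_mut() {
            if let Some(since) = entry.unhealthy_since {
                if now.duration_since(since) >= self.recovery_time {
                    entry.backend.healthy.store(true, Ordering::SeqCst);
                    entry.unhealthy_since = None;
                    recovered += 1;
                }
            }
        }
        recovered
    }
}

// One iteration of the ticker loop. Returns false once the checker has
// been dropped (for example on config reload), telling the loop to exit.
fn tick(checker: &Weak<PassiveChecker>, now: Instant) -> bool {
    match checker.upgrade() {
        Some(checker) => {
            checker.recover_expired(now);
            true
        }
        None => false,
    }
}
```

In a real ticker, `tick` would be called from a periodic loop that breaks on `false`. Because the loop holds only a `Weak` reference, it does not keep an old checker alive after a reload, which is the property the PR relies on to avoid task accumulation.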