Speed up split-vote elections with the new FAILOVER_AUTH_NACK message#3833
Speed up split-vote elections with the new FAILOVER_AUTH_NACK message#3833enjoy-binbin wants to merge 12 commits into
Conversation
In the current failover protocol a replica sends AUTH_REQUEST exactly once per epoch and each voter casts at most one vote per epoch. Despite the various delay heuristics in clusterHandleReplicaFailover that try to stagger replicas, concurrent elections can still collide on the same epoch. When the votes split and nobody reaches the quorum, the losing replica has no way to learn this in time and must first wait for the election to be declared expired after auth_timeout (2*cluster-node-timeout) and then wait another auth_retry_time (2*auth_timeout) before it is even allowed to start the next election with a higher epoch. Introduce a new message type, FAILOVER_AUTH_NACK, that voters reply with from every refusal branch. The replica counts incoming NACKs; since a voter never re-answers within the same epoch, once size - nack_count drops below the quorum the election is declared unwinnable and a new one is started with a higher epoch right away, shrinking recovery from the auth_timeout + auth_retry_time window to a few cron ticks. Wire compatibility is preserved by gating NACK emission on a new CLUSTER_NODE_FAILOVER_AUTH_NACK_SUPPORTED capability flag advertised in PING/PONG flags via clusterUpdateMyselfFlags. Peers that do not advertise the capability never see the new message type and fall back to the legacy auth_timeout path. Adding a new DEBUG CLUSTER-FAILOVER-DELAY <ms> hook overrides the delay computed in clusterHandleReplicaFailover for testing. Signed-off-by: Binbin <binloveplay1314@qq.com>
|
This is an example without the auth_time reset (just print the logs but not reset failover_auth_time): one replica The other replica: |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds explicit FAILOVER_AUTH_NACK messaging and handling: wire contracts and capability flag, send/receive helpers, denial-path NACKs, NACK counting with election fast-fail, debug-configurable failover delay, and expanded parametrized tests. ChangesCluster Failover Authorization NACK Support
Sequence Diagram(s)sequenceDiagram
participant ComponentA
participant ComponentB
ComponentA->>ComponentB: observable interaction
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
src/cluster_legacy.h (1)
147-150: ⚡ Quick winPin the new wire payload size with a static assert.
This struct is now part of the cluster wire contract. A compile-time size check here would catch accidental padding or field changes before they silently alter the on-the-wire format.
♻️ Proposed fix
typedef struct { uint8_t reason; char notused1[24]; } clusterMsgDataFailoverNack; +static_assert(sizeof(clusterMsgDataFailoverNack) == 25, "unexpected FAILOVER_AUTH_NACK payload size");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cluster_legacy.h` around lines 147 - 150, The new cluster wire struct clusterMsgDataFailoverNack must have its on-the-wire size pinned; add a compile-time assertion (e.g., static_assert or _Static_assert depending on project C standard) that sizeof(clusterMsgDataFailoverNack) == 25 (1 byte for reason + 24 bytes for notused1) with a clear message indicating this pins the wire format, so any accidental padding/field changes will fail to compile; place the assertion next to the struct definition and use the same symbol name in the assertion.src/cluster_legacy.c (1)
70-71: ⚡ Quick winMake the new NACK helpers file-local.
clusterSendFailoverNack()andclusterProcessFailoverAuthNack()are only used inside this translation unit, so keeping them non-staticneedlessly widens the symbol surface.Proposed change
-void clusterProcessFailoverAuthNack(clusterNode *sender, clusterMsg *request); -void clusterSendFailoverNack(clusterNode *node, uint8_t reason); +static void clusterProcessFailoverAuthNack(clusterNode *sender, clusterMsg *request); +static void clusterSendFailoverNack(clusterNode *node, uint8_t reason); ... -void clusterSendFailoverNack(clusterNode *node, uint8_t reason) { +static void clusterSendFailoverNack(clusterNode *node, uint8_t reason) { ... -void clusterProcessFailoverAuthNack(clusterNode *sender, clusterMsg *request) { +static void clusterProcessFailoverAuthNack(clusterNode *sender, clusterMsg *request) {As per coding guidelines "Use static keyword for file-local functions in C code".
Also applies to: 5312-5324, 5443-5471
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cluster_legacy.c` around lines 70 - 71, The two helper functions clusterSendFailoverNack and clusterProcessFailoverAuthNack are only used within this translation unit and should be made file-local: add the static qualifier to their forward declarations and their function definitions (and to any other internally-used helpers referenced in the same areas noted, e.g., the helpers around the 5312-5324 and 5443-5471 regions) so their symbols are not exported from the object file; ensure prototypes and definitions match (both marked static).tests/unit/cluster/failover2.tcl (1)
79-79: ⚡ Quick winConsider using the
pause_processhelper for consistency.Line 79 uses
exec kill -SIGSTOPdirectly, while the rest of the file uses thepause_processhelper (lines 15, 34, 120, 182). If atomic pausing is intentional for same-epoch testing, consider documenting the rationale; otherwise, prefer the framework helper:♻️ Align with test framework pattern
- exec kill -SIGSTOP [srv 0 pid] [srv -1 pid] [srv -2 pid] + pause_process [srv 0 pid] + pause_process [srv -1 pid] + pause_process [srv -2 pid]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cluster/failover2.tcl` at line 79, Replace the direct call to exec kill -SIGSTOP used on the test processes with the existing pause_process helper for consistency with this test suite (the helper is used at lines where pause_process is referenced); locate the line containing exec kill -SIGSTOP [srv 0 pid] [srv -1 pid] [srv -2 pid] and change it to call pause_process for each target process (or document in a short comment why an atomic SIGSTOP across multiple pids is required if you intentionally need simultaneous pausing), ensuring you use the same helper signature and import/context as other uses in this file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/cluster_legacy.c`:
- Around line 70-71: The two helper functions clusterSendFailoverNack and
clusterProcessFailoverAuthNack are only used within this translation unit and
should be made file-local: add the static qualifier to their forward
declarations and their function definitions (and to any other internally-used
helpers referenced in the same areas noted, e.g., the helpers around the
5312-5324 and 5443-5471 regions) so their symbols are not exported from the
object file; ensure prototypes and definitions match (both marked static).
In `@src/cluster_legacy.h`:
- Around line 147-150: The new cluster wire struct clusterMsgDataFailoverNack
must have its on-the-wire size pinned; add a compile-time assertion (e.g.,
static_assert or _Static_assert depending on project C standard) that
sizeof(clusterMsgDataFailoverNack) == 25 (1 byte for reason + 24 bytes for
notused1) with a clear message indicating this pins the wire format, so any
accidental padding/field changes will fail to compile; place the assertion next
to the struct definition and use the same symbol name in the assertion.
In `@tests/unit/cluster/failover2.tcl`:
- Line 79: Replace the direct call to exec kill -SIGSTOP used on the test
processes with the existing pause_process helper for consistency with this test
suite (the helper is used at lines where pause_process is referenced); locate
the line containing exec kill -SIGSTOP [srv 0 pid] [srv -1 pid] [srv -2 pid] and
change it to call pause_process for each target process (or document in a short
comment why an atomic SIGSTOP across multiple pids is required if you
intentionally need simultaneous pausing), ensuring you use the same helper
signature and import/context as other uses in this file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5e5ebad9-589c-475d-8218-e3d48f92cb1b
📒 Files selected for processing (6)
src/cluster_legacy.csrc/cluster_legacy.hsrc/debug.csrc/server.csrc/server.htests/unit/cluster/failover2.tcl
zuiderkwast
left a comment
There was a problem hiding this comment.
Looks good in general.
Have you consider using the light header for NACK? If we use it from the start, we don't need to support the non-light variant.
Yes, i did consider using a light header, but we do need the And another reason, i want to keep the AUTH_REQUEST / AUTH_ACK / AUTH_NACK handling logic in one place. Keeping symmetry across these three election messages. NACK is point-to-point and only emitted during a failed elections, so i think i can accept it. BUT of course, we can use light header for this type. |
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
…the failover" test Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/unit/cluster/failover2.tcl (2)
78-78:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winTypo in comment.
"Killing there primary nodes" should be "Killing three primary nodes".
- # Killing there primary nodes. + # Killing three primary nodes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cluster/failover2.tcl` at line 78, Typo in test comment: update the comment string "Killing there primary nodes." in tests/unit/cluster/failover2.tcl to read "Killing three primary nodes." so the intent is clear; locate the commented line containing that exact phrase and replace "there" with "three".
70-70:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winGrammar issue in test name.
The test name contains "then they are elected" which should be "when they are elected" for grammatical correctness.
- test "Primaries will not time out then they are elected in the same epoch - delay $delay" { + test "Primaries will not time out when they are elected in the same epoch - delay $delay" {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cluster/failover2.tcl` at line 70, The test name string "Primaries will not time out then they are elected in the same epoch - delay $delay" has a grammar mistake; update the test declaration to replace "then" with "when" so it reads "Primaries will not time out when they are elected in the same epoch - delay $delay" (locate the test block that begins with the same quoted title and change the text there).Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/unit/cluster/failover2.tcl`:
- Line 78: Typo in test comment: update the comment string "Killing there
primary nodes." in tests/unit/cluster/failover2.tcl to read "Killing three
primary nodes." so the intent is clear; locate the commented line containing
that exact phrase and replace "there" with "three".
- Line 70: The test name string "Primaries will not time out then they are
elected in the same epoch - delay $delay" has a grammar mistake; update the test
declaration to replace "then" with "when" so it reads "Primaries will not time
out when they are elected in the same epoch - delay $delay" (locate the test
block that begins with the same quoted title and change the text there).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 2f09e5f2-a421-48f6-843d-4f1f753114f9
📒 Files selected for processing (2)
src/cluster_legacy.ctests/unit/cluster/failover2.tcl
🚧 Files skipped from review as they are similar to previous changes (1)
- src/cluster_legacy.c
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #3833 +/- ##
============================================
+ Coverage 76.59% 76.75% +0.16%
============================================
Files 162 162
Lines 80788 80845 +57
============================================
+ Hits 61881 62056 +175
+ Misses 18907 18789 -118
🚀 New features to boost your workflow:
|
Signed-off-by: Binbin <binloveplay1314@qq.com>
Let's use the normal header for NACK now. In the future, we can add light variant of all of these. |
zuiderkwast
left a comment
There was a problem hiding this comment.
LGTM. Some comments/questions. You can check if you agree or not, or just merge.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
sarthakaggarwal97
left a comment
There was a problem hiding this comment.
I did try this out with valkey-fuzzer and it looks to be passing 100 fuzzer runs: https://github.com/valkey-io/valkey-fuzzer/actions/runs/27182824004
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
|
@enjoy-binbin We can merge this. Just out of curiosity, I want to try this bot to see if there are any interesting comments. @valkey-review-bot Please review this PR. |
|
The NACK protocol itself looks solid — the capability gating ( |
Signed-off-by: Binbin <binloveplay1314@qq.com>
In the current failover protocol a replica sends AUTH_REQUEST exactly
once per epoch and each voter casts at most one vote per epoch. Despite
the various delay heuristics in clusterHandleReplicaFailover that try to
stagger replicas, concurrent elections can still collide on the same
epoch. When the votes split and nobody reaches the quorum, the losing
replica has no way to learn this in time and must first wait for the
election to be declared expired after auth_timeout (2 * cluster-node-timeout)
and then wait another auth_retry_time (2 * auth_timeout) before it is even
allowed to start the next election with a higher epoch.
Introduce a new message type, FAILOVER_AUTH_NACK, that voters reply with
from every refusal branch. The replica counts incoming NACKs; since a
voter never re-answers within the same epoch, once size - nack_count
drops below the quorum the election is declared unwinnable and a new
one is started with a higher epoch right away, shrinking recovery from
the auth_timeout + auth_retry_time window to a few cron ticks.
Wire compatibility is preserved by gating NACK emission on a new
CLUSTER_NODE_FAILOVER_AUTH_NACK_SUPPORTED capability flag advertised in
PING/PONG flags via clusterUpdateMyselfFlags. Peers that do not advertise
the capability never see the new message type and fall back to the
legacy auth_timeout path.
Adding a new DEBUG CLUSTER-FAILOVER-DELAY hook overrides the delay
computed in clusterHandleReplicaFailover for testing.