Raft Cluster: Implement stale leader step-down after quorum loss#3916
Conversation
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## cluster-v2 #3916 +/- ##
==============================================
+ Coverage 76.51% 76.72% +0.20%
==============================================
Files 166 166
Lines 82735 82745 +10
==============================================
+ Hits 63306 63483 +177
+ Misses 19429 19262 -167
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/cluster_raft.c (1)
324-326: 💤 Low valueUse new helper consistently throughout the file.
This helper correctly computes quorum. However, lines 1396, 1488, and 1547 still compute quorum inline with
server.cluster->size / 2 + 1. Consider usingclusterRaftQuorum()in those locations for consistency and single-source-of-truth.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cluster_raft.c` around lines 324 - 326, Replace inline quorum calculations "server.cluster->size / 2 + 1" with a call to the helper clusterRaftQuorum() wherever quorum is computed in this file; specifically update the code blocks that currently compute quorum inline (they reference server.cluster->size / 2 + 1) to call clusterRaftQuorum() instead, ensuring all quorum logic uses the single helper function and removing duplicate computation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/cluster_raft.c`:
- Around line 324-326: Replace inline quorum calculations "server.cluster->size
/ 2 + 1" with a call to the helper clusterRaftQuorum() wherever quorum is
computed in this file; specifically update the code blocks that currently
compute quorum inline (they reference server.cluster->size / 2 + 1) to call
clusterRaftQuorum() instead, ensuring all quorum logic uses the single helper
function and removing duplicate computation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: e0e079e6-0758-4d50-8c1b-08058dd948bd
📒 Files selected for processing (2)
src/cluster_raft.ctests/unit/cluster/cluster-raft-proto.tcl
zuiderkwast
left a comment
There was a problem hiding this comment.
Nice! It's pretty strait-forward. My only non-trivial comment is about revising the code about leader-based failure detection. It also has some majority overdue check that becomes redundant (I believe) with this feature.
| serverLog(LL_NOTICE, "Stepping down to follower: %s.", reason); | ||
| } | ||
|
|
||
| static int clusterRaftLeaderHasFreshQuorum(clusterRaftState *rs, mstime_t now) { |
There was a problem hiding this comment.
Let's add a short comment about what it does. "Fresh" isn't immediately obvious if you don't know what this feature is.
Also, don't pass the global raft state around.
| @@ -826,13 +869,8 @@ static void clusterRaftDeferPendingProposals(void) { | |||
| /* Step down to follower if we see a higher term. Returns 1 if stepped down. */ | |||
| static int clusterRaftMaybeStepDown(clusterRaftState *rs, uint64_t term) { | |||
There was a problem hiding this comment.
I see the global clusterRaftState is passed around here. My mistake. Remove it if you want. (Not required since it's not introduced in this PR, but I'd prefer it anyways.)
| static int clusterRaftMaybeStepDown(clusterRaftState *rs, uint64_t term) { | |
| static int clusterRaftMaybeStepDown(uint64_t term) { | |
| clusterRaftState *rs = RAFT_STATE(); |
| if (memcmp(argv[0], myself->name, CLUSTER_NAMELEN) == 0) { | ||
| if (rs->role == RAFT_ROLE_JOINER) { | ||
| rs->role = RAFT_ROLE_FOLLOWER; | ||
| rs->lost_quorum_since = 0; |
There was a problem hiding this comment.
Used by the leader only, so isn't it enough to initialize this only after becoming a leader?
| /* Let quorum-loss step-down proceed without refreshing ack state. */ | ||
| return; | ||
| } | ||
| /* Majority overdue — reset all ack times. */ |
There was a problem hiding this comment.
When we're adding leader stopdown in minority partition, do we need to keep this logic for skipping marking nodes as FAIL if a majority are overdue? It's kind of the same purpose.
I think you should take a look at the whole of this function clusterRaftDetectFailures and think about it. If lost_quorum_since is set, we could skip all of it, and probably we don't need to detect if a majority is overdue and reset the ack times.
This code was added in this commit: 74b2c09
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
| set leader_idx -999 | ||
| set leader_client "" | ||
| foreach {idx client} [list 0 $r0 -1 $r1 -2 $r2] { | ||
| if {[get_cluster_info_field $client cluster_raft_role] eq "leader"} { |
There was a problem hiding this comment.
There is an existing function CI. It's using idx instead of client. It's defined in tests/test_helper.tcl.
| if {[get_cluster_info_field $client cluster_raft_role] eq "leader"} { | |
| if {[CI $idx cluster_raft_role] eq "leader"} { |
| if (rs->role == RAFT_ROLE_LEADER) { | ||
| if (clusterRaftLeaderHasFreshQuorum(now)) { | ||
| rs->lost_quorum_since = 0; | ||
| rs->last_fresh_quorum_time = now; | ||
| } else { | ||
| if (rs->lost_quorum_since == 0) { | ||
| rs->lost_quorum_since = now; | ||
| serverLog(LL_NOTICE, "Leader lost quorum freshness, waiting before step-down."); | ||
| } else if (rs->last_fresh_quorum_time > 0 && | ||
| now - rs->last_fresh_quorum_time > server.cluster_node_timeout) { | ||
| clusterRaftStepDown(now, "lost quorum freshness"); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
How long do we wait before step-down? 🤔
- If we have received ack from the majority of followers during the last
cluster_node_timeout, we setlast_fresh_quorum_time = now; - If we haven't received ack from the majority for the last
cluster_node_timeout, clusterRaftLeaderHasFreshQuorum returns false and we setlost_quorum_since = nowand log "Leader lost quorum freshness". - After waiting another
cluster_node_timeout, step-down.
It looks to me that the leader steps down after 2 * cluster_node_timeout after the last ack from the majority of nodes.
Is this correct? Should we instead step down after only 1 * cluster_node_timeout?
There was a problem hiding this comment.
Thanks for feeback. Good Catch The intended behavior is to step down after roughly 1 * cluster_node_timeout since the last fresh quorum observation, not `2 * cluster_node_timeout 👍
| wait_for_condition 100 50 { | ||
| [get_cluster_info_field $leader_client cluster_raft_role] eq "follower" && | ||
| [get_cluster_info_field $leader_client cluster_raft_leader] eq "" |
There was a problem hiding this comment.
We should test that the leader steps down at the right time. If it should step down after cluster_node_timeout, then we should not wait 5 * cluster_node_timeout. We can wait something like 1.5 * cluster_node_timeout` instead. If it takes too long to step down, it's an error.
We can use a higher cluster_node_timeout (for example 3000 or 5000) in this test if the GitHub CI is slow.
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
| break | ||
| } | ||
| after 50 | ||
| } |
There was a problem hiding this comment.
Why did you change from wait_for_condition to this explicit loop?
IMO, with this style, it's less easy to see how long we're waiting. wait_for_condition is very common in our tests, so we know that it means 100 x 50ms.
Btw, 150 * 50 = 7500, but cluster-node-timeout is 1000 in this test. What's the reasoning?
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
zuiderkwast
left a comment
There was a problem hiding this comment.
LGTM, thanks!
Now, there is a merge conflict though, after I merged #3887. It should be easy to solve. rs->todo_save_config = 1 was added in clusterRaftMaybeStepDown.
# Conflicts: # src/cluster_raft.c
Implement stale leader step-down after quorum loss in the Raft cluster protocol.
Leaders now step down to FOLLOWER when quorum freshness is lost for longer than
cluster-node-timeout. This typically means that the leader is in a minority partition.In the cron path, the leader checks if a quorum of nodes have acked AppendEntries within the last cluster-node-timeout.
closes: #3861