Skip to content

Raft Cluster: Implement stale leader step-down after quorum loss#3916

Merged
zuiderkwast merged 15 commits into
valkey-io:cluster-v2from
bandalgomsu:issue-3861
Jun 11, 2026
Merged

Raft Cluster: Implement stale leader step-down after quorum loss#3916
zuiderkwast merged 15 commits into
valkey-io:cluster-v2from
bandalgomsu:issue-3861

Conversation

@bandalgomsu

@bandalgomsu bandalgomsu commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Implement stale leader step-down after quorum loss in the Raft cluster protocol.
Leaders now step down to FOLLOWER when quorum freshness is lost for longer than cluster-node-timeout. This typically means that the leader is in a minority partition.

In the cron path, the leader checks if a quorum of nodes have acked AppendEntries within the last cluster-node-timeout.

closes: #3861

@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The description is directly related to the changeset, explaining the stale leader step-down mechanism and the logic flow for quorum freshness tracking.
Title check ✅ Passed The title 'cluster-v2 Implement stale leader step-down after quorum loss' directly and clearly describes the main change: implementing logic for leaders to step down when quorum freshness is lost, which is the primary objective of the PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bandalgomsu bandalgomsu marked this pull request as draft June 4, 2026 11:03
Signed-off-by: Su Ko <rhtn1128@gmail.com>
@codecov

codecov Bot commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.72%. Comparing base (ccd1505) to head (513cc3a).
⚠️ Report is 1 commits behind head on cluster-v2.

Additional details and impacted files
@@              Coverage Diff               @@
##           cluster-v2    #3916      +/-   ##
==============================================
+ Coverage       76.51%   76.72%   +0.20%     
==============================================
  Files             166      166              
  Lines           82735    82745      +10     
==============================================
+ Hits            63306    63483     +177     
+ Misses          19429    19262     -167     
Files with missing lines Coverage Δ
src/cluster_raft.c 63.76% <100.00%> (+0.67%) ⬆️

... and 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@bandalgomsu bandalgomsu marked this pull request as ready for review June 6, 2026 03:13

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/cluster_raft.c (1)

324-326: 💤 Low value

Use new helper consistently throughout the file.

This helper correctly computes quorum. However, lines 1396, 1488, and 1547 still compute quorum inline with server.cluster->size / 2 + 1. Consider using clusterRaftQuorum() in those locations for consistency and single-source-of-truth.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cluster_raft.c` around lines 324 - 326, Replace inline quorum
calculations "server.cluster->size / 2 + 1" with a call to the helper
clusterRaftQuorum() wherever quorum is computed in this file; specifically
update the code blocks that currently compute quorum inline (they reference
server.cluster->size / 2 + 1) to call clusterRaftQuorum() instead, ensuring all
quorum logic uses the single helper function and removing duplicate computation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/cluster_raft.c`:
- Around line 324-326: Replace inline quorum calculations "server.cluster->size
/ 2 + 1" with a call to the helper clusterRaftQuorum() wherever quorum is
computed in this file; specifically update the code blocks that currently
compute quorum inline (they reference server.cluster->size / 2 + 1) to call
clusterRaftQuorum() instead, ensuring all quorum logic uses the single helper
function and removing duplicate computation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e0e079e6-0758-4d50-8c1b-08058dd948bd

📥 Commits

Reviewing files that changed from the base of the PR and between 2604112 and 87d649e.

📒 Files selected for processing (2)
  • src/cluster_raft.c
  • tests/unit/cluster/cluster-raft-proto.tcl

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! It's pretty strait-forward. My only non-trivial comment is about revising the code about leader-based failure detection. It also has some majority overdue check that becomes redundant (I believe) with this feature.

Comment thread src/cluster_raft.c
Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c Outdated
serverLog(LL_NOTICE, "Stepping down to follower: %s.", reason);
}

static int clusterRaftLeaderHasFreshQuorum(clusterRaftState *rs, mstime_t now) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a short comment about what it does. "Fresh" isn't immediately obvious if you don't know what this feature is.

Also, don't pass the global raft state around.

Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c Outdated
@@ -826,13 +869,8 @@ static void clusterRaftDeferPendingProposals(void) {
/* Step down to follower if we see a higher term. Returns 1 if stepped down. */
static int clusterRaftMaybeStepDown(clusterRaftState *rs, uint64_t term) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the global clusterRaftState is passed around here. My mistake. Remove it if you want. (Not required since it's not introduced in this PR, but I'd prefer it anyways.)

Suggested change
static int clusterRaftMaybeStepDown(clusterRaftState *rs, uint64_t term) {
static int clusterRaftMaybeStepDown(uint64_t term) {
clusterRaftState *rs = RAFT_STATE();

Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c Outdated
if (memcmp(argv[0], myself->name, CLUSTER_NAMELEN) == 0) {
if (rs->role == RAFT_ROLE_JOINER) {
rs->role = RAFT_ROLE_FOLLOWER;
rs->lost_quorum_since = 0;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used by the leader only, so isn't it enough to initialize this only after becoming a leader?

Comment thread src/cluster_raft.c Outdated
/* Let quorum-loss step-down proceed without refreshing ack state. */
return;
}
/* Majority overdue — reset all ack times. */

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we're adding leader stopdown in minority partition, do we need to keep this logic for skipping marking nodes as FAIL if a majority are overdue? It's kind of the same purpose.

I think you should take a look at the whole of this function clusterRaftDetectFailures and think about it. If lost_quorum_since is set, we could skip all of it, and probably we don't need to detect if a majority is overdue and reset the ack times.

This code was added in this commit: 74b2c09

Comment thread src/cluster_raft.c Outdated
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Comment thread src/cluster_raft.c Outdated
set leader_idx -999
set leader_client ""
foreach {idx client} [list 0 $r0 -1 $r1 -2 $r2] {
if {[get_cluster_info_field $client cluster_raft_role] eq "leader"} {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an existing function CI. It's using idx instead of client. It's defined in tests/test_helper.tcl.

Suggested change
if {[get_cluster_info_field $client cluster_raft_role] eq "leader"} {
if {[CI $idx cluster_raft_role] eq "leader"} {

Comment thread src/cluster_raft.c Outdated
Comment on lines +1744 to +1757
if (rs->role == RAFT_ROLE_LEADER) {
if (clusterRaftLeaderHasFreshQuorum(now)) {
rs->lost_quorum_since = 0;
rs->last_fresh_quorum_time = now;
} else {
if (rs->lost_quorum_since == 0) {
rs->lost_quorum_since = now;
serverLog(LL_NOTICE, "Leader lost quorum freshness, waiting before step-down.");
} else if (rs->last_fresh_quorum_time > 0 &&
now - rs->last_fresh_quorum_time > server.cluster_node_timeout) {
clusterRaftStepDown(now, "lost quorum freshness");
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long do we wait before step-down? 🤔

  1. If we have received ack from the majority of followers during the last cluster_node_timeout, we set last_fresh_quorum_time = now;
  2. If we haven't received ack from the majority for the last cluster_node_timeout, clusterRaftLeaderHasFreshQuorum returns false and we set lost_quorum_since = now and log "Leader lost quorum freshness".
  3. After waiting another cluster_node_timeout, step-down.

It looks to me that the leader steps down after 2 * cluster_node_timeout after the last ack from the majority of nodes.

Is this correct? Should we instead step down after only 1 * cluster_node_timeout?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for feeback. Good Catch The intended behavior is to step down after roughly 1 * cluster_node_timeout since the last fresh quorum observation, not `2 * cluster_node_timeout 👍

Comment on lines +386 to +388
wait_for_condition 100 50 {
[get_cluster_info_field $leader_client cluster_raft_role] eq "follower" &&
[get_cluster_info_field $leader_client cluster_raft_leader] eq ""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should test that the leader steps down at the right time. If it should step down after cluster_node_timeout, then we should not wait 5 * cluster_node_timeout. We can wait something like 1.5 * cluster_node_timeout` instead. If it takes too long to step down, it's an error.

We can use a higher cluster_node_timeout (for example 3000 or 5000) in this test if the GitHub CI is slow.

Comment thread tests/unit/cluster/cluster-raft-proto.tcl Outdated
Comment thread tests/unit/cluster/cluster-raft-proto.tcl Outdated
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Comment thread tests/unit/cluster/cluster-raft.tcl Outdated
break
}
after 50
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change from wait_for_condition to this explicit loop?

IMO, with this style, it's less easy to see how long we're waiting. wait_for_condition is very common in our tests, so we know that it means 100 x 50ms.

Btw, 150 * 50 = 7500, but cluster-node-timeout is 1000 in this test. What's the reasoning?

Comment thread src/cluster_raft.c Outdated
Comment thread tests/unit/cluster/cluster-raft.tcl Outdated
Signed-off-by: Su Ko <rhtn1128@gmail.com>
Signed-off-by: Su Ko <rhtn1128@gmail.com>

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Now, there is a merge conflict though, after I merged #3887. It should be easy to solve. rs->todo_save_config = 1 was added in clusterRaftMaybeStepDown.

@zuiderkwast zuiderkwast linked an issue Jun 11, 2026 that may be closed by this pull request
@zuiderkwast zuiderkwast changed the title cluster-v2 Implement stale leader step-down after quorum loss Raft Cluster: Implement stale leader step-down after quorum loss Jun 11, 2026
@zuiderkwast zuiderkwast merged commit 49322a1 into valkey-io:cluster-v2 Jun 11, 2026
27 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Raft Cluster: Minority Partition Detection & Leader Step-Down

2 participants