Raft Cluster: Implement stale leader step-down after quorum loss by bandalgomsu · Pull Request #3916 · valkey-io/valkey

bandalgomsu · 2026-06-04T11:03:18Z

Implement stale leader step-down after quorum loss in the Raft cluster protocol.
Leaders now step down to FOLLOWER when quorum freshness is lost for longer than cluster-node-timeout. This typically means that the leader is in a minority partition.

In the cron path, the leader checks if a quorum of nodes have acked AppendEntries within the last cluster-node-timeout.
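
A minimal sketch of the check described above, for illustration only. It assumes a hypothetical array of per-follower ack timestamps and that the leader counts itself toward the quorum; the function name, parameters, and data layout are not taken from src/cluster_raft.c.

```c
#include <stddef.h>
#include <stdint.h>

typedef int64_t mstime_t; /* milliseconds, mirroring the type name used in the PR */

/* Hypothetical helper: returns 1 if the leader plus the followers that acked
 * AppendEntries within the last node_timeout ms still form a quorum, else 0.
 * follower_ack_times holds one timestamp per follower; 0 means "never acked". */
static int leaderHasFreshQuorum(const mstime_t *follower_ack_times,
                                size_t num_followers,
                                mstime_t now,
                                mstime_t node_timeout) {
    size_t cluster_size = num_followers + 1; /* followers + the leader itself */
    size_t quorum = cluster_size / 2 + 1;    /* strict majority */
    size_t fresh = 1;                        /* assumption: the leader counts as fresh */

    for (size_t i = 0; i < num_followers; i++) {
        mstime_t t = follower_ack_times[i];
        if (t > 0 && now - t <= node_timeout) fresh++;
    }
    return fresh >= quorum;
}
```

In a three-node cluster this returns 1 as long as at least one follower has acked within the timeout window, and 0 once both followers are stale.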

closes: #3861

coderabbitai · 2026-06-04T11:03:27Z

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description is directly related to the changeset, explaining the stale leader step-down mechanism and the logic flow for quorum freshness tracking.
Title check	✅ Passed	The title 'cluster-v2 Implement stale leader step-down after quorum loss' directly and clearly describes the main change: implementing logic for leaders to step down when quorum freshness is lost, which is the primary objective of the PR.

Signed-off-by: Su Ko <rhtn1128@gmail.com>

codecov · 2026-06-04T12:48:55Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.72%. Comparing base (ccd1505) to head (513cc3a).
⚠️ Report is 1 commits behind head on cluster-v2.

Additional details and impacted files

@@              Coverage Diff               @@
##           cluster-v2    #3916      +/-   ##
==============================================
+ Coverage       76.51%   76.72%   +0.20%     
==============================================
  Files             166      166              
  Lines           82735    82745      +10     
==============================================
+ Hits            63306    63483     +177     
+ Misses          19429    19262     -167

Files with missing lines	Coverage Δ
src/cluster_raft.c	`63.76% <100.00%> (+0.67%)`	⬆️

... and 20 files with indirect coverage changes

coderabbitai

🧹 Nitpick comments (1)

src/cluster_raft.c (1)
324-326: 💤 Low value

Use new helper consistently throughout the file.

This helper correctly computes quorum. However, lines 1396, 1488, and 1547 still compute quorum inline with server.cluster->size / 2 + 1. Consider using clusterRaftQuorum() in those locations for consistency and single-source-of-truth.
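
For reference, the arithmetic behind the helper can be sketched as follows. The function below takes the cluster size as a parameter so that it compiles on its own; that parameter and the name `raftQuorumSketch` are assumptions, since the actual signature of clusterRaftQuorum() is not shown in this thread.

```c
#include <stddef.h>

/* Strict majority of `size` voting nodes: size / 2 + 1 using integer division.
 * Illustrative stand-in for clusterRaftQuorum(); the parameter is hypothetical. */
static size_t raftQuorumSketch(size_t size) {
    return size / 2 + 1;
}
```

With this formula, clusters of 3, 4, and 5 nodes need 2, 3, and 3 nodes respectively to form a quorum.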
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cluster_raft.c` around lines 324 - 326, Replace inline quorum
calculations "server.cluster->size / 2 + 1" with a call to the helper
clusterRaftQuorum() wherever quorum is computed in this file; specifically
update the code blocks that currently compute quorum inline (they reference
server.cluster->size / 2 + 1) to call clusterRaftQuorum() instead, ensuring all
quorum logic uses the single helper function and removing duplicate computation.

📥 Commits

Reviewing files that changed from the base of the PR and between 2604112 and 87d649e.

📒 Files selected for processing (2)

src/cluster_raft.c
tests/unit/cluster/cluster-raft-proto.tcl

zuiderkwast

Nice! It's pretty straightforward. My only non-trivial comment is about revising the code for leader-based failure detection. It also has a majority-overdue check that becomes redundant (I believe) with this feature.

zuiderkwast · 2026-06-08T10:17:17Z

+    serverLog(LL_NOTICE, "Stepping down to follower: %s.", reason);
+}
+
+static int clusterRaftLeaderHasFreshQuorum(clusterRaftState *rs, mstime_t now) {


Let's add a short comment about what it does. "Fresh" isn't immediately obvious if you don't know what this feature is.

Also, don't pass the global raft state around.

zuiderkwast · 2026-06-08T10:33:55Z

@@ -826,13 +869,8 @@ static void clusterRaftDeferPendingProposals(void) {
 /* Step down to follower if we see a higher term. Returns 1 if stepped down. */
 static int clusterRaftMaybeStepDown(clusterRaftState *rs, uint64_t term) {


I see the global clusterRaftState is passed around here. My mistake. Remove it if you want. (Not required since it's not introduced in this PR, but I'd prefer it anyways.)

Suggested change

-static int clusterRaftMaybeStepDown(clusterRaftState *rs, uint64_t term) {
+static int clusterRaftMaybeStepDown(uint64_t term) {
+    clusterRaftState *rs = RAFT_STATE();

zuiderkwast · 2026-06-08T10:35:08Z

            if (memcmp(argv[0], myself->name, CLUSTER_NAMELEN) == 0) {
                if (rs->role == RAFT_ROLE_JOINER) {
                    rs->role = RAFT_ROLE_FOLLOWER;
+                    rs->lost_quorum_since = 0;


Used by the leader only, so isn't it enough to initialize this only after becoming a leader?

zuiderkwast · 2026-06-08T12:25:32Z

+            /* Let quorum-loss step-down proceed without refreshing ack state. */
+            return;
+        }
        /* Majority overdue — reset all ack times. */


Now that we're adding leader step-down in a minority partition, do we need to keep this logic that skips marking nodes as FAIL if a majority are overdue? It serves much the same purpose.

I think you should take a look at the whole of this function clusterRaftDetectFailures and think about it. If lost_quorum_since is set, we could skip all of it, and probably we don't need to detect if a majority is overdue and reset the ack times.

This code was added in this commit: 74b2c09
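
One way to read this suggestion is sketched below. It is a trimmed-down, hypothetical stand-in rather than the real clusterRaftDetectFailures: the struct, the function name, and the simple "overdue" rule are assumptions made so the example compiles on its own.

```c
#include <stdint.h>

typedef int64_t mstime_t;

/* Hypothetical, reduced stand-in for the leader's Raft state. */
typedef struct {
    mstime_t lost_quorum_since; /* 0 while the leader has a fresh quorum */
} raftStateSketch;

/* Returns how many followers would be flagged as failed in this pass.
 * While quorum loss is pending, the leader cannot tell a dead follower
 * from its own isolation, so the whole pass is skipped and no ack times
 * are reset. */
static int detectFailuresSketch(const raftStateSketch *rs,
                                const mstime_t *ack_times,
                                int num_followers,
                                mstime_t now,
                                mstime_t node_timeout) {
    if (rs->lost_quorum_since != 0) return 0;

    int overdue = 0;
    for (int i = 0; i < num_followers; i++) {
        if (now - ack_times[i] > node_timeout) overdue++;
    }
    return overdue;
}
```

With the early return in place, a separate "majority overdue" branch has nothing left to guard against, because that situation is exactly the one in which lost_quorum_since is set.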

Signed-off-by: Su Ko <rhtn1128@gmail.com>

zuiderkwast · 2026-06-08T15:00:16Z

+        set leader_idx -999
+        set leader_client ""
+        foreach {idx client} [list 0 $r0 -1 $r1 -2 $r2] {
+            if {[get_cluster_info_field $client cluster_raft_role] eq "leader"} {


There is an existing function CI. It's using idx instead of client. It's defined in tests/test_helper.tcl.

Suggested change

-if {[get_cluster_info_field $client cluster_raft_role] eq "leader"} {
+if {[CI $idx cluster_raft_role] eq "leader"} {

zuiderkwast · 2026-06-08T19:36:50Z

+        if (rs->role == RAFT_ROLE_LEADER) {
+            if (clusterRaftLeaderHasFreshQuorum(now)) {
+                rs->lost_quorum_since = 0;
+                rs->last_fresh_quorum_time = now;
+            } else {
+                if (rs->lost_quorum_since == 0) {
+                    rs->lost_quorum_since = now;
+                    serverLog(LL_NOTICE, "Leader lost quorum freshness, waiting before step-down.");
+                } else if (rs->last_fresh_quorum_time > 0 &&
+                           now - rs->last_fresh_quorum_time > server.cluster_node_timeout) {
+                    clusterRaftStepDown(now, "lost quorum freshness");
+                }
+            }
+        }


How long do we wait before step-down? 🤔

If we have received ack from the majority of followers during the last cluster_node_timeout, we set last_fresh_quorum_time = now;

If we haven't received ack from the majority for the last cluster_node_timeout, clusterRaftLeaderHasFreshQuorum returns false and we set lost_quorum_since = now and log "Leader lost quorum freshness".

After waiting another cluster_node_timeout, step-down.

It looks to me that the leader steps down after 2 * cluster_node_timeout after the last ack from the majority of nodes.

Is this correct? Should we instead step down after only 1 * cluster_node_timeout?

Thanks for the feedback. Good catch! The intended behavior is to step down after roughly 1 * cluster_node_timeout since the last fresh quorum observation, not 2 * cluster_node_timeout. 👍
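
The timing question above can be checked with a small simulation. This is a sketch under stated assumptions: the last majority ack arrives at t = 0, the cron runs every `tick_ms`, and freshness means "the last majority ack is at most `timeout` ms old". The `two_phase` branch mirrors the logic quoted in the review; the other branch is just one possible way to get a single-timeout step-down, not necessarily what the follow-up commit does. All names here are hypothetical.

```c
#include <stdint.h>

typedef int64_t mstime_t;

typedef struct {
    mstime_t last_fresh_quorum_time;
    mstime_t lost_quorum_since;
} leaderSketch;

/* Assumed freshness rule: the last majority ack is at most `timeout` ms old. */
static int hasFreshQuorumSketch(mstime_t last_majority_ack, mstime_t now, mstime_t timeout) {
    return now - last_majority_ack <= timeout;
}

/* Simulates cron ticks at tick_ms, 2 * tick_ms, ... with the last majority
 * ack at t = 0. Returns the tick at which the leader steps down, or -1 if
 * it has not stepped down within 10 * timeout (or tick_ms is invalid). */
static mstime_t stepDownTime(int two_phase, mstime_t timeout, mstime_t tick_ms) {
    leaderSketch rs = {0, 0};
    if (tick_ms <= 0) return -1;
    for (mstime_t now = tick_ms; now <= 10 * timeout; now += tick_ms) {
        if (hasFreshQuorumSketch(0, now, timeout)) {
            rs.lost_quorum_since = 0;
            rs.last_fresh_quorum_time = now;
            continue;
        }
        if (!two_phase) return now; /* step down as soon as freshness is lost */
        if (rs.lost_quorum_since == 0) {
            rs.lost_quorum_since = now; /* first tick without freshness: only record it */
        } else if (rs.last_fresh_quorum_time > 0 &&
                   now - rs.last_fresh_quorum_time > timeout) {
            return now; /* a second full timeout has elapsed */
        }
    }
    return -1;
}
```

With `timeout = 1000` and `tick_ms = 100`, the two-phase logic steps down at t = 2100, about 2 * cluster_node_timeout after the last majority ack, while the single-phase variant steps down at t = 1100.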

zuiderkwast · 2026-06-08T19:44:13Z

+        wait_for_condition 100 50 {
+            [get_cluster_info_field $leader_client cluster_raft_role] eq "follower" &&
+            [get_cluster_info_field $leader_client cluster_raft_leader] eq ""


We should test that the leader steps down at the right time. If it should step down after cluster_node_timeout, then we should not wait 5 * cluster_node_timeout. We can wait something like 1.5 * cluster_node_timeout instead. If it takes too long to step down, it's an error.

We can use a higher cluster_node_timeout (for example 3000 or 5000) in this test if the GitHub CI is slow.
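
The retry count for a bounded wait follows from simple arithmetic, sketched here in C for consistency with the rest of the PR (the test itself is Tcl). The helper name and its percent-based factor are hypothetical.

```c
/* Number of polls, delay_ms apart, that fit in factor_percent % of
 * node_timeout_ms, rounded up. delay_ms must be positive. */
static int pollRetries(int node_timeout_ms, int factor_percent, int delay_ms) {
    int budget_ms = node_timeout_ms * factor_percent / 100;
    return (budget_ms + delay_ms - 1) / delay_ms;
}
```

For a 1.5 * cluster_node_timeout budget polled every 50 ms, this gives 30 retries at cluster-node-timeout 1000 and 90 retries at 3000, which would translate to `wait_for_condition 30 50` and `wait_for_condition 90 50` respectively.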

Signed-off-by: Su Ko <rhtn1128@gmail.com>

zuiderkwast · 2026-06-09T20:31:22Z

+                break
+            }
+            after 50
+        }


Why did you change from wait_for_condition to this explicit loop?

IMO, with this style, it's less easy to see how long we're waiting. wait_for_condition is very common in our tests, so we know that it means 100 x 50ms.

Btw, 150 * 50 = 7500, but cluster-node-timeout is 1000 in this test. What's the reasoning?

Signed-off-by: Su Ko <rhtn1128@gmail.com>

zuiderkwast

LGTM, thanks!

Now, there is a merge conflict though, after I merged #3887. It should be easy to solve. rs->todo_save_config = 1 was added in clusterRaftMaybeStepDown.

# Conflicts: # src/cluster_raft.c

bandalgomsu marked this pull request as draft June 4, 2026 11:03

github-actions Bot assigned bandalgomsu Jun 4, 2026

bandalgomsu force-pushed the issue-3861 branch from 2e88346 to e3cb11d Compare June 4, 2026 11:08

Implement stale leader step-down after quorum loss

87d649e

Signed-off-by: Su Ko <rhtn1128@gmail.com>

bandalgomsu force-pushed the issue-3861 branch from e3cb11d to 87d649e Compare June 4, 2026 12:24

bandalgomsu marked this pull request as ready for review June 6, 2026 03:13

coderabbitai Bot reviewed Jun 8, 2026

View reviewed changes

zuiderkwast reviewed Jun 8, 2026

View reviewed changes

bandalgomsu added 7 commits June 8, 2026 21:54

Refactor use clusterRaftQuorum() consistently

b305dca

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Refactor use RAFT_STATE() in step-down helpers

3690db2

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Docs clarify step-down helper comments

047bb2c

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Refactor move quorum freshness checks to cron

23f96ba

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Refactor keep quorum-loss state leader-only

4002e83

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Refactor skip failure detection during quorum loss

d4b0da7

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Fix Formatting

0161b54

Signed-off-by: Su Ko <rhtn1128@gmail.com>

zuiderkwast reviewed Jun 8, 2026

View reviewed changes

bandalgomsu added 4 commits June 9, 2026 20:18

Docs modify comment in clusterRaftStepDown

655516f

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Test move cluster_raft step-down test to a non-proto suite

4a843c2

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Fix step down after one quorum-loss timeout

cb08ba7

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Test bound cluster_raft quorum-loss step-down timing

2e5e479

Signed-off-by: Su Ko <rhtn1128@gmail.com>

zuiderkwast reviewed Jun 9, 2026

View reviewed changes

bandalgomsu added 2 commits June 10, 2026 06:09

Test Simplify cluster-raft.tcl

b644e27

Signed-off-by: Su Ko <rhtn1128@gmail.com>

Refactor Simplify clusterRaftStepDown

3dfd390

Signed-off-by: Su Ko <rhtn1128@gmail.com>

zuiderkwast approved these changes Jun 10, 2026

View reviewed changes

Merge remote-tracking branch 'upstream/cluster-v2' into issue-3861

513cc3a

# Conflicts: # src/cluster_raft.c

zuiderkwast linked an issue Jun 11, 2026 that may be closed by this pull request

Raft Cluster: Minority Partition Detection & Leader Step-Down #3861

Closed

zuiderkwast changed the title ~~cluster-v2 Implement stale leader step-down after quorum loss~~ Raft Cluster: Implement stale leader step-down after quorum loss Jun 11, 2026

zuiderkwast merged commit 49322a1 into valkey-io:cluster-v2 Jun 11, 2026
27 of 28 checks passed
