Raft Cluster: fix CLUSTER SHARDS visibility and replica health reporting by zuiderkwast · Pull Request #3910 · valkey-io/valkey

zuiderkwast · 2026-06-03T18:12:29Z

Some fixes for the raft cluster bus:

Add nodes to shards dict on NODE_JOIN apply — Nodes joining the cluster were added to the nodes dict but not the shards dict, making them invisible in CLUSTER SHARDS.
This affected both paths: fresh node creation on followers and MEET-flagged nodes transitioning to full members.
Broadcast repl_offset when it drops to zero — The leader broadcasts a follower's replication offset to peers when it changes from 0 to non-zero (replica finishes sync).
Now also broadcasts when it drops from non-zero to 0 (replica starts full resync), so other nodes correctly report the replica's health as "loading" in CLUSTER SHARDS.
Include zero offsets in periodic REPL_OFFSETS broadcast — The periodic broadcast skipped nodes with offset 0, so a peer that missed the immediate transition broadcast (e.g. due to a brief disconnection) would never learn the correct value. Also fixes the case where the raft leader is a data replica and its offset drops to 0.
Defer first REPL_OFFSETS broadcast — Initialize the broadcast timer to startup time so the first periodic broadcast is deferred by 10 seconds, avoiding unnecessary
traffic during cluster formation.

I discovered these when working on #3887, running the cluster-shards.tcl test suite, involving node restarts. These two fixes are not depending on persistence though.

coderabbitai · 2026-06-03T18:12:38Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d8bc844f-7400-439d-adcb-c462897b3739

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly describes the two main fixes in the changeset: improving CLUSTER SHARDS visibility and replica health reporting in the raft cluster implementation.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The pull request description is directly related to the changeset, providing clear context about raft cluster fixes for CLUSTER SHARDS visibility and replica health reporting.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cluster_raft.c`:
- Around line 1315-1318: The follower transition check only broadcasts when
prev_offset != follower_repl_offset and one side is zero, but the leader's own
path still only emits on 0->non-zero and periodic broadcasts skip zero offsets;
update the leader emission logic to also emit when the leader's replication
offset transitions from non-zero to 0 and ensure periodic broadcasts include
zero offsets so peers are updated. Concretely, modify the code paths that use
prev_offset and follower_repl_offset (and respect node->flags /
CLUSTER_NODE_MEET) to treat both 0->non-zero and non-zero->0 as broadcast-worthy
transitions and adjust the periodic broadcast logic to not omit zero offsets.
Ensure the same conditional symmetry used for followers is applied to the
leader-side emission.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 97ce5d13-d85d-453b-95f0-ade46b707bdb

📥 Commits

Reviewing files that changed from the base of the PR and between e67cc9c and 8091baf.

📒 Files selected for processing (1)

src/cluster_raft.c

Nodes joining via NODE_JOIN were added to the nodes dict but not the shards dict, making them invisible in CLUSTER SHARDS responses. Add clusterAddNodeToShard in both paths: when creating a fresh node (on followers that never saw the MEET) and when transitioning an existing MEET-flagged node to a full member. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Three changes to replication offset propagation: 1. Broadcast when offset drops from non-zero to 0 (replica starts full resync), not just 0 to non-zero. This lets peers report the replica's health as "loading" in CLUSTER SHARDS. 2. Include zero offsets in the periodic broadcast. Previously nodes with offset 0 were skipped, so a peer that missed the immediate broadcast (e.g. brief disconnection) would never learn the correct value. This also affected the raft leader's own offset when it is a data replica. 3. Initialize the broadcast timer to startup time so the first periodic broadcast is deferred by 10 seconds, avoiding unnecessary traffic during cluster formation. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

codecov · 2026-06-03T20:04:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.49%. Comparing base (e67cc9c) to head (d75a8fc).

Additional details and impacted files

@@              Coverage Diff               @@
##           cluster-v2    #3910      +/-   ##
==============================================
+ Coverage       76.47%   76.49%   +0.01%     
==============================================
  Files             166      166              
  Lines           82602    82605       +3     
==============================================
+ Hits            63174    63189      +15     
+ Misses          19428    19416      -12

Files with missing lines	Coverage Δ
src/cluster_raft.c	`61.26% <100.00%> (-0.99%)`	⬇️

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

murphyjacob4

Makes sense to me

github-actions Bot assigned zuiderkwast Jun 3, 2026

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/cluster_raft.c

zuiderkwast force-pushed the raft-fixes branch from bd27872 to 98899ed Compare June 3, 2026 18:54

zuiderkwast added 2 commits June 3, 2026 21:33

zuiderkwast force-pushed the raft-fixes branch from 98899ed to d75a8fc Compare June 3, 2026 19:36

murphyjacob4 approved these changes Jun 3, 2026

View reviewed changes

zuiderkwast merged commit 2604112 into valkey-io:cluster-v2 Jun 4, 2026
26 of 28 checks passed

zuiderkwast deleted the raft-fixes branch June 4, 2026 08:31

coderabbitai Bot mentioned this pull request Jun 8, 2026

Raft Cluster: Implement stale leader step-down after quorum loss #3916

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Raft Cluster: fix CLUSTER SHARDS visibility and replica health reporting#3910

Raft Cluster: fix CLUSTER SHARDS visibility and replica health reporting#3910
zuiderkwast merged 2 commits into
valkey-io:cluster-v2from
zuiderkwast:raft-fixes

zuiderkwast commented Jun 3, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

Review skipped

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

codecov Bot commented Jun 3, 2026

Uh oh!

murphyjacob4 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

zuiderkwast commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented Jun 3, 2026

Codecov Report

Uh oh!

murphyjacob4 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

zuiderkwast commented Jun 3, 2026 •

edited

Loading

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading