Skip to content

Raft Cluster: fix CLUSTER SHARDS visibility and replica health reporting#3910

Merged
zuiderkwast merged 2 commits into
valkey-io:cluster-v2from
zuiderkwast:raft-fixes
Jun 4, 2026
Merged

Raft Cluster: fix CLUSTER SHARDS visibility and replica health reporting#3910
zuiderkwast merged 2 commits into
valkey-io:cluster-v2from
zuiderkwast:raft-fixes

Conversation

@zuiderkwast

@zuiderkwast zuiderkwast commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Some fixes for the raft cluster bus:

  1. Add nodes to shards dict on NODE_JOIN apply — Nodes joining the cluster were added to the nodes dict but not the shards dict, making them invisible in CLUSTER SHARDS.
    This affected both paths: fresh node creation on followers and MEET-flagged nodes transitioning to full members.

  2. Broadcast repl_offset when it drops to zero — The leader broadcasts a follower's replication offset to peers when it changes from 0 to non-zero (replica finishes sync).
    Now also broadcasts when it drops from non-zero to 0 (replica starts full resync), so other nodes correctly report the replica's health as "loading" in CLUSTER SHARDS.

  3. Include zero offsets in periodic REPL_OFFSETS broadcast — The periodic broadcast skipped nodes with offset 0, so a peer that missed the immediate transition broadcast (e.g. due to a brief disconnection) would never learn the correct value. Also fixes the case where the raft leader is a data replica and its offset drops to 0.

  4. Defer first REPL_OFFSETS broadcast — Initialize the broadcast timer to startup time so the first periodic broadcast is deferred by 10 seconds, avoiding unnecessary
    traffic during cluster formation.

I discovered these when working on #3887, running the cluster-shards.tcl test suite, involving node restarts. These two fixes are not depending on persistence though.

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d8bc844f-7400-439d-adcb-c462897b3739

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the two main fixes in the changeset: improving CLUSTER SHARDS visibility and replica health reporting in the raft cluster implementation.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description is directly related to the changeset, providing clear context about raft cluster fixes for CLUSTER SHARDS visibility and replica health reporting.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cluster_raft.c`:
- Around line 1315-1318: The follower transition check only broadcasts when
prev_offset != follower_repl_offset and one side is zero, but the leader's own
path still only emits on 0->non-zero and periodic broadcasts skip zero offsets;
update the leader emission logic to also emit when the leader's replication
offset transitions from non-zero to 0 and ensure periodic broadcasts include
zero offsets so peers are updated. Concretely, modify the code paths that use
prev_offset and follower_repl_offset (and respect node->flags /
CLUSTER_NODE_MEET) to treat both 0->non-zero and non-zero->0 as broadcast-worthy
transitions and adjust the periodic broadcast logic to not omit zero offsets.
Ensure the same conditional symmetry used for followers is applied to the
leader-side emission.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 97ce5d13-d85d-453b-95f0-ade46b707bdb

📥 Commits

Reviewing files that changed from the base of the PR and between e67cc9c and 8091baf.

📒 Files selected for processing (1)
  • src/cluster_raft.c

Comment thread src/cluster_raft.c
Nodes joining via NODE_JOIN were added to the nodes dict but not the
shards dict, making them invisible in CLUSTER SHARDS responses.

Add clusterAddNodeToShard in both paths: when creating a fresh node
(on followers that never saw the MEET) and when transitioning an
existing MEET-flagged node to a full member.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Three changes to replication offset propagation:

1. Broadcast when offset drops from non-zero to 0 (replica starts full
   resync), not just 0 to non-zero. This lets peers report the replica's
   health as "loading" in CLUSTER SHARDS.

2. Include zero offsets in the periodic broadcast. Previously nodes with
   offset 0 were skipped, so a peer that missed the immediate broadcast
   (e.g. brief disconnection) would never learn the correct value. This
   also affected the raft leader's own offset when it is a data replica.

3. Initialize the broadcast timer to startup time so the first periodic
   broadcast is deferred by 10 seconds, avoiding unnecessary traffic
   during cluster formation.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@codecov

codecov Bot commented Jun 3, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.49%. Comparing base (e67cc9c) to head (d75a8fc).

Additional details and impacted files
@@              Coverage Diff               @@
##           cluster-v2    #3910      +/-   ##
==============================================
+ Coverage       76.47%   76.49%   +0.01%     
==============================================
  Files             166      166              
  Lines           82602    82605       +3     
==============================================
+ Hits            63174    63189      +15     
+ Misses          19428    19416      -12     
Files with missing lines Coverage Δ
src/cluster_raft.c 61.26% <100.00%> (-0.99%) ⬇️

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@murphyjacob4 murphyjacob4 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me

@zuiderkwast zuiderkwast merged commit 2604112 into valkey-io:cluster-v2 Jun 4, 2026
26 of 28 checks passed
@zuiderkwast zuiderkwast deleted the raft-fixes branch June 4, 2026 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants