Raft Cluster: fix CLUSTER SHARDS visibility and replica health reporting #3910
Actionable comments posted: 1
Inline comments:
In `@src/cluster_raft.c`:
- Around line 1315-1318: The follower transition check only broadcasts when
prev_offset != follower_repl_offset and one side is zero, but the leader's own
path still only emits on 0->non-zero and periodic broadcasts skip zero offsets;
update the leader emission logic to also emit when the leader's replication
offset transitions from non-zero to 0 and ensure periodic broadcasts include
zero offsets so peers are updated. Concretely, modify the code paths that use
prev_offset and follower_repl_offset (and respect node->flags /
CLUSTER_NODE_MEET) to treat both 0->non-zero and non-zero->0 as broadcast-worthy
transitions and adjust the periodic broadcast logic to not omit zero offsets.
Ensure the same conditional symmetry used for followers is applied to the
leader-side emission.
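The symmetric condition described in this comment can be sketched as a small predicate. This is only an illustration: `offsetTransitionNeedsBroadcast` is a hypothetical name, not a function in `src/cluster_raft.c`, and the sketch assumes offsets are plain 64-bit integers where 0 means "no usable offset".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from src/cluster_raft.c): returns true when a
 * replication offset change crosses the zero boundary in either direction,
 * i.e. 0 -> non-zero or non-zero -> 0. Changes between two non-zero values
 * are left to the periodic broadcast. */
bool offsetTransitionNeedsBroadcast(int64_t prev_offset, int64_t new_offset) {
    if (prev_offset == new_offset) return false; /* nothing changed */
    return prev_offset == 0 || new_offset == 0;  /* one side is zero */
}
```

Applying the same predicate to the leader's own offset and to follower offsets would give the symmetry the comment asks for.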
Nodes joining via NODE_JOIN were added to the nodes dict but not the shards dict, making them invisible in CLUSTER SHARDS responses. Add clusterAddNodeToShard in both paths: when creating a fresh node (on followers that never saw the MEET) and when transitioning an existing MEET-flagged node to a full member.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
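A toy model of the two paths may help show why both need the shard registration. Everything here is an assumption made for illustration: the `toy*` types and functions are hypothetical, the "shards dict" is reduced to a per-node boolean, and `TOY_FLAG_MEET` merely stands in for `CLUSTER_NODE_MEET`. It does not reproduce the real `clusterAddNodeToShard`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_NODES 8
#define TOY_FLAG_MEET 1 /* stand-in for CLUSTER_NODE_MEET */

typedef struct toyNode {
    char name[16];
    char shard_id[16];
    int flags;
    bool in_shard_index; /* models membership in the shards dict */
} toyNode;

typedef struct toyCluster {
    toyNode nodes[TOY_MAX_NODES]; /* models the nodes dict */
    size_t num_nodes;
} toyCluster;

toyNode *toyLookupNode(toyCluster *c, const char *name) {
    for (size_t i = 0; i < c->num_nodes; i++)
        if (strcmp(c->nodes[i].name, name) == 0) return &c->nodes[i];
    return NULL;
}

/* Applies a NODE_JOIN entry. Both paths end with the shard registration. */
toyNode *toyApplyNodeJoin(toyCluster *c, const char *name, const char *shard_id) {
    toyNode *n = toyLookupNode(c, name);
    if (n == NULL) {
        /* Path 1: fresh node, e.g. on a follower that never saw the MEET. */
        assert(c->num_nodes < TOY_MAX_NODES);
        n = &c->nodes[c->num_nodes++];
        memset(n, 0, sizeof(*n));
        snprintf(n->name, sizeof(n->name), "%s", name);
    } else {
        /* Path 2: existing MEET-flagged node becomes a full member. */
        n->flags &= ~TOY_FLAG_MEET;
    }
    snprintf(n->shard_id, sizeof(n->shard_id), "%s", shard_id);
    n->in_shard_index = true; /* the step that was missing before the fix */
    return n;
}

/* Counts nodes a CLUSTER SHARDS-like listing would show for a shard. */
size_t toyCountVisibleInShard(const toyCluster *c, const char *shard_id) {
    size_t count = 0;
    for (size_t i = 0; i < c->num_nodes; i++)
        if (c->nodes[i].in_shard_index && strcmp(c->nodes[i].shard_id, shard_id) == 0) count++;
    return count;
}
```

In this model, a node that is present in `nodes` but never marked `in_shard_index` is exactly the "invisible" case the commit describes.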
Three changes to replication offset propagation:

1. Broadcast when offset drops from non-zero to 0 (replica starts full resync), not just 0 to non-zero. This lets peers report the replica's health as "loading" in CLUSTER SHARDS.
2. Include zero offsets in the periodic broadcast. Previously nodes with offset 0 were skipped, so a peer that missed the immediate broadcast (e.g. brief disconnection) would never learn the correct value. This also affected the raft leader's own offset when it is a data replica.
3. Initialize the broadcast timer to startup time so the first periodic broadcast is deferred by 10 seconds, avoiding unnecessary traffic during cluster formation.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
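Changes 2 and 3 can be sketched together as a small periodic broadcaster. This is a minimal model under stated assumptions: the `toy*` names are hypothetical, time is passed in as milliseconds, and "broadcasting" just copies entries into an output buffer. Only the 10-second deferral and the inclusion of zero offsets come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BROADCAST_PERIOD_MS 10000 /* 10 seconds, per the commit message */

typedef struct toyOffsetEntry {
    int node_id;
    int64_t repl_offset; /* 0 is a valid value and must be reported too */
} toyOffsetEntry;

typedef struct toyBroadcaster {
    int64_t last_broadcast_ms;
} toyBroadcaster;

/* Seeding the timer with the startup time (rather than 0) defers the first
 * periodic broadcast by one full period. */
void toyBroadcasterInit(toyBroadcaster *b, int64_t startup_ms) {
    b->last_broadcast_ms = startup_ms;
}

/* If a broadcast is due, copies every entry (zero offsets included) to
 * `out` and returns the number copied; otherwise returns 0. */
size_t toyPeriodicBroadcast(toyBroadcaster *b, int64_t now_ms,
                            const toyOffsetEntry *nodes, size_t n,
                            toyOffsetEntry *out, size_t out_cap) {
    assert(out_cap >= n);
    if (now_ms - b->last_broadcast_ms < TOY_BROADCAST_PERIOD_MS) return 0;
    for (size_t i = 0; i < n; i++) out[i] = nodes[i]; /* no offset filter */
    b->last_broadcast_ms = now_ms;
    return n;
}
```

Because the loop applies no filter on `repl_offset`, a peer that missed an immediate transition broadcast would still converge on the next period in this model.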
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

@@              Coverage Diff               @@
##           cluster-v2    #3910    +/-     ##
==============================================
+ Coverage       76.47%   76.49%   +0.01%
==============================================
  Files             166      166
  Lines           82602    82605       +3
==============================================
+ Hits            63174    63189      +15
+ Misses          19428    19416      -12
Some fixes for the raft cluster bus:
- Add nodes to shards dict on NODE_JOIN apply: Nodes joining the cluster were added to the nodes dict but not the shards dict, making them invisible in CLUSTER SHARDS. This affected both paths: fresh node creation on followers and MEET-flagged nodes transitioning to full members.
- Broadcast repl_offset when it drops to zero: The leader broadcasts a follower's replication offset to peers when it changes from 0 to non-zero (replica finishes sync). It now also broadcasts when the offset drops from non-zero to 0 (replica starts full resync), so other nodes correctly report the replica's health as "loading" in CLUSTER SHARDS.
- Include zero offsets in periodic REPL_OFFSETS broadcast: The periodic broadcast skipped nodes with offset 0, so a peer that missed the immediate transition broadcast (e.g. due to a brief disconnection) would never learn the correct value. This also fixes the case where the raft leader is a data replica and its offset drops to 0.
- Defer first REPL_OFFSETS broadcast: Initialize the broadcast timer to startup time so the first periodic broadcast is deferred by 10 seconds, avoiding unnecessary traffic during cluster formation.
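To illustrate why a stale non-zero offset leads to wrong health reporting, here is a deliberately simplified sketch of how a peer might map its last known offset to a health value. It assumes health depends only on the replica role and the known offset; the `toy*` names are hypothetical, and the real CLUSTER SHARDS logic also considers other states (such as node failure) that are omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum toyHealth { TOY_HEALTH_ONLINE, TOY_HEALTH_LOADING } toyHealth;

/* Assumption for illustration: a replica whose last known replication
 * offset is 0 is treated as still syncing ("loading"). */
toyHealth toyHealthFromOffset(bool is_replica, int64_t known_repl_offset) {
    if (is_replica && known_repl_offset == 0) return TOY_HEALTH_LOADING;
    return TOY_HEALTH_ONLINE;
}

const char *toyHealthName(toyHealth h) {
    return h == TOY_HEALTH_LOADING ? "loading" : "online";
}
```

Under this model, a peer that never receives the non-zero to 0 update keeps evaluating the old offset and keeps reporting "online" while the replica is in a full resync.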
I discovered these while working on #3887, running the cluster-shards.tcl test suite, which involves node restarts. These two fixes do not depend on persistence, though.