Skip to content

Cluster bus v2 - Added shard level epoch and proposal pre-validation#3899

Open
sushilpaneru1 wants to merge 9 commits into
valkey-io:cluster-v2from
sushilpaneru1:cluster-v2-shard-epoch
Open

Cluster bus v2 - Added shard level epoch and proposal pre-validation#3899
sushilpaneru1 wants to merge 9 commits into
valkey-io:cluster-v2from
sushilpaneru1:cluster-v2-shard-epoch

Conversation

@sushilpaneru1

@sushilpaneru1 sushilpaneru1 commented Jun 2, 2026

Copy link
Copy Markdown

closes #3863, closes #3874

  • Shard Epoch - Introduced fencing token as additional state to guard against stale or out of order mutation of a shard's state. Overall, the idea is - for each shard's state, we maintain a version number which is incremented on each mutation; stale mutation will be rejected as the associated version number will be older (similar to MVCC).

  • Pre-Validation and Rejection - The raft leader now pre-validates proposals (shard epoch staleness, shard-id mismatch, missing fields) before appending to the log. Invalid proposals are rejected immediately, keeping the log clean. When a proposal is rejected, the leader sends a REJECT <entry> message back to the proposing follower. The follower matches this against its pending proposals and fires the client callback with an error, so the client receives -ERR proposal rejected by raft leader immediately instead of blocking until timeout i.e two layer callback chain is handled. Two stats counters (cluster_stats_proposals_rejected_prevalidation and cluster_stats_proposals_rejected_apply) provide observability into rejections at each stage.

Testing

  • New tests added - Verifies rejection due to stale epoch, missing epoch, wrong shard-id mismatch and also rejection wire message.
  • With epoch being introduced for REPLICATE, some of the test failed due to race condition (stale epoch), added retry to the tests.

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 155854fa-288c-47c3-b8b6-804811be95d7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 56.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title clearly and concisely describes the main change: introduction of shard-level epochs and proposal pre-validation in the cluster bus v2, which aligns directly with the primary objectives of this PR.
Description check ✅ Passed The pull request description clearly explains the introduction of shard epoch as a fencing token and pre-validation mechanism, directly mapping to the changeset across design docs and implementation files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sushilpaneru1 sushilpaneru1 marked this pull request as draft June 2, 2026 20:38

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
src/cluster_state.h (1)

132-134: ⚡ Quick win

Document the shard-epoch helper contracts in the header.

These new public APIs have non-obvious behavior that callers need to rely on correctly, especially around missing shard IDs and lifecycle ownership of epoch entries. Please add function comments here rather than forcing every caller to infer the contract from the .c file. As per coding guidelines, "Document why code exists, not just what it does; document all functions in C code".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cluster_state.h` around lines 132 - 134, Add descriptive function
comments above clusterGetShardEpoch, clusterSetShardEpoch, and
clusterRemoveShardEpoch that explain their contracts: state what
clusterGetShardEpoch returns for existing and missing shard_id (e.g., special
value or error), whether the caller must distinguish a zero epoch vs "not
found", and whether callers should check for missing entries; document that
clusterSetShardEpoch creates or updates an epoch entry and specify
ownership/lifecycle guarantees (who may remove it and whether it is persisted),
and document clusterRemoveShardEpoch's effect (idempotent, no-op if missing) and
any concurrency/thread-safety expectations. Make sure each comment also states
parameter semantics (shard_id lifetime/NULL handling) and the return/side-effect
guarantees so callers don’t need to read the .c to use these APIs.
tests/unit/cluster/cluster-raft-proto.tcl (3)

409-430: ⚡ Quick win

Consider using a more robust synchronization or longer timeout.

The after 500 wait (line 419) is timing-dependent. While 500ms is probably sufficient to verify that pre-validation rejected the proposal immediately, it might be flaky on slow CI environments. According to coding guidelines, tests should avoid timing-dependent waits and use proper synchronization.

Suggested alternatives

Option 1: Use a longer timeout for safety

-            after 500
+            after 2000

Option 2: Poll the commit index to ensure it remains stable

-            after 500
-
-            # Verify log index unchanged (rejected at pre-validation).
-            set commit_after [get_cluster_info_field $r0 cluster_raft_commit_index]
-            assert_equal $commit_before $commit_after
+            # Poll to ensure commit index remains unchanged over time.
+            set stable 1
+            for {set i 0} {$i < 5} {incr i} {
+                after 200
+                set commit_now [get_cluster_info_field $r0 cluster_raft_commit_index]
+                if {$commit_now != $commit_before} {
+                    set stable 0
+                    break
+                }
+            }
+            assert_equal 1 $stable "commit index should remain unchanged"

As per coding guidelines, Valkey Tcl tests should avoid timing-dependent tests and use proper synchronization.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cluster/cluster-raft-proto.tcl` around lines 409 - 430, The test
uses a fixed sleep (after 500) after sending the fake PROPOSE which is
timing-dependent; replace this with a synchronization loop that polls the Raft
commit index (using get_cluster_info_field $r0 cluster_raft_commit_index) or an
existing wait helper until either the index remains equal to the pre-send value
for a short stable period or a timeout elapses, then proceed to evaluate
$state_check; update the test inside the foreach over $stale_proposals (around
the raft_send/connect_fake_node block) to remove the after 500 and use the
polling/wait approach so the assertion that commit_before == commit_after is
robust on slow CI.

444-460: ⚡ Quick win

Consider using a more robust synchronization or longer timeout.

The after 500 wait (line 452) has the same timing-dependency concern as the stale proposal tests. See previous comment for suggested alternatives.

As per coding guidelines, Valkey Tcl tests should avoid timing-dependent tests and use proper synchronization.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cluster/cluster-raft-proto.tcl` around lines 444 - 460, The test
relies on a brittle fixed sleep (after 500); replace it with deterministic
synchronization by polling the cluster state or using an existing wait helper
instead of a hard sleep: after calling connect_fake_node/raft_send, loop (with a
short sleep and a total timeout) calling get_cluster_info_field $r0
cluster_raft_commit_index and assert it remains equal to the initial
$commit_before, or use a provided helper like
wait_for_condition/wait_for_cluster_raft_commit_index_equal if available; remove
the after 500 and close the fd only after the poll confirms the commit index is
unchanged (or the wait helper times out and fails the test).

437-460: 💤 Low value

Consider adding state verification assertions.

The missing epoch tests verify that the commit index doesn't change but lack the cluster state assertions that the stale proposal tests include (lines 399-406, evaluated at line 426). Adding similar state checks would make these tests more thorough and consistent with the stale proposal tests.

For example, the NODE_FORGET test could verify that $node1_id is still present in CLUSTER NODES, and the SLOT_CHANGE test could verify that slot assignments remain unchanged.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cluster/cluster-raft-proto.tcl` around lines 437 - 460, Add
assertions that the cluster state remains unchanged after each missing-epoch
proposal: after sending propose_msg and verifying commit index unchanged, query
the cluster state (e.g., run "CLUSTER NODES" and the slot assignment query used
in the stale-proposal tests) and assert that for the "NODE_FORGET (missing
epoch)" case $node1_id is still present, and for "SLOT_CHANGE (missing epoch)"
that slot ownership/assignments remain identical to their pre-proposal values;
implement these checks inside the same test block that references
missing_epoch_proposals using the existing helpers (commit_before/commit_after,
propose_msg, and the cluster info retrieval used elsewhere) so the tests mirror
the state-verification done in the stale proposal tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@design-docs/cluster-raft.md`:
- Around line 629-642: The PROPOSE and Leader Validation section is now
inconsistent with the Validation section; update the earlier "PROPOSE and Leader
Validation" paragraph to state that leaders perform a pre-validation check and
may reject obviously stale proposals before appending (a best-effort, read-only
check that does not bump epoch), and clarify that authoritative validation still
happens at apply time (referencing the "Validation" section's apply-time
validation rules and epoch 0 behavior). Ensure the updated text mentions the
leader pre-validation behavior, that it is best-effort/read-only, and that
apply-time validation remains authoritative.

In `@src/cluster_state.c`:
- Around line 419-432: clusterSetShardEpoch currently inserts missing epoch
entries into server.cluster->shard_epochs, breaking the invariant that epoch
entries mirror server.cluster->shards; update clusterSetShardEpoch to first
check that the shard exists in server.cluster->shards (using dictFind on
server.cluster->shards with the same sds key) and if the shard is not present
simply free any temporary sds and return without adding to shard_epochs; only
update or add an epoch entry when the shard key is known to exist so
clusterAddNodeToShard won't reuse orphaned epochs.

---

Nitpick comments:
In `@src/cluster_state.h`:
- Around line 132-134: Add descriptive function comments above
clusterGetShardEpoch, clusterSetShardEpoch, and clusterRemoveShardEpoch that
explain their contracts: state what clusterGetShardEpoch returns for existing
and missing shard_id (e.g., special value or error), whether the caller must
distinguish a zero epoch vs "not found", and whether callers should check for
missing entries; document that clusterSetShardEpoch creates or updates an epoch
entry and specify ownership/lifecycle guarantees (who may remove it and whether
it is persisted), and document clusterRemoveShardEpoch's effect (idempotent,
no-op if missing) and any concurrency/thread-safety expectations. Make sure each
comment also states parameter semantics (shard_id lifetime/NULL handling) and
the return/side-effect guarantees so callers don’t need to read the .c to use
these APIs.

In `@tests/unit/cluster/cluster-raft-proto.tcl`:
- Around line 409-430: The test uses a fixed sleep (after 500) after sending the
fake PROPOSE which is timing-dependent; replace this with a synchronization loop
that polls the Raft commit index (using get_cluster_info_field $r0
cluster_raft_commit_index) or an existing wait helper until either the index
remains equal to the pre-send value for a short stable period or a timeout
elapses, then proceed to evaluate $state_check; update the test inside the
foreach over $stale_proposals (around the raft_send/connect_fake_node block) to
remove the after 500 and use the polling/wait approach so the assertion that
commit_before == commit_after is robust on slow CI.
- Around line 444-460: The test relies on a brittle fixed sleep (after 500);
replace it with deterministic synchronization by polling the cluster state or
using an existing wait helper instead of a hard sleep: after calling
connect_fake_node/raft_send, loop (with a short sleep and a total timeout)
calling get_cluster_info_field $r0 cluster_raft_commit_index and assert it
remains equal to the initial $commit_before, or use a provided helper like
wait_for_condition/wait_for_cluster_raft_commit_index_equal if available; remove
the after 500 and close the fd only after the poll confirms the commit index is
unchanged (or the wait helper times out and fails the test).
- Around line 437-460: Add assertions that the cluster state remains unchanged
after each missing-epoch proposal: after sending propose_msg and verifying
commit index unchanged, query the cluster state (e.g., run "CLUSTER NODES" and
the slot assignment query used in the stale-proposal tests) and assert that for
the "NODE_FORGET (missing epoch)" case $node1_id is still present, and for
"SLOT_CHANGE (missing epoch)" that slot ownership/assignments remain identical
to their pre-proposal values; implement these checks inside the same test block
that references missing_epoch_proposals using the existing helpers
(commit_before/commit_after, propose_msg, and the cluster info retrieval used
elsewhere) so the tests mirror the state-verification done in the stale proposal
tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5cf4bdc6-f8cc-4247-a71a-b2aa849b9a6f

📥 Commits

Reviewing files that changed from the base of the PR and between e67cc9c and 0d72aeb.

📒 Files selected for processing (7)
  • design-docs/cluster-raft.md
  • src/cluster.c
  • src/cluster_legacy.c
  • src/cluster_raft.c
  • src/cluster_state.c
  • src/cluster_state.h
  • tests/unit/cluster/cluster-raft-proto.tcl

Comment thread design-docs/cluster-raft.md
Comment thread src/cluster_state.c Outdated
@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 78.24427% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.57%. Comparing base (2604112) to head (1469f89).
⚠️ Report is 2 commits behind head on cluster-v2.

Files with missing lines Patch % Lines
src/cluster_raft.c 78.24% 57 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##           cluster-v2    #3899      +/-   ##
==============================================
+ Coverage       76.48%   76.57%   +0.08%     
==============================================
  Files             166      166              
  Lines           82605    82806     +201     
==============================================
+ Hits            63182    63409     +227     
+ Misses          19423    19397      -26     
Files with missing lines Coverage Δ
src/cluster_raft.c 67.99% <78.24%> (+6.72%) ⬆️

... and 20 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Not a full review. I just took a quick look and noticed that the common code outside of cluster_raft.c is affected more than I was expecting. See the comment below.

Comment thread src/cluster_state.c Outdated
dictSetUnsignedIntegerVal(epoch_de, 0);
} else {
sdsfree(epoch_key);
}

@zuiderkwast zuiderkwast Jun 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shard-epoch is a Raft-cluster-specific concept, so ideally it should be implemented in cluster_raft.c and stored in the clusterRaftState instead of the common cluster state.

I see though that clusterAddNodeToShard and clusterRemoveNodeFromShard are called from various places, including from cluster_nodes.c when loading node.conf. 🤔

Have you considered any other ways to structure this?

When we persist shard-epoch in nodes.conf (combining this PR with #3887) do we need to store it for all shards in the vars line or how should we store it?

I have an idea: the nodes.conf format has three unused columns when raft is used: num pings, last pong, epoch (used for gossip, all zeroed out for raft). When nodes.conf is loaded, these three are passed to clusterCurrentBus->setNodePingPongEpoch() so the cluster bus implementation can do whatever it wants, and it's only used for the gossip protocol currently. For Raft, we could use e.g. the epoch field to store the shard-epoch for the shard the node belongs to. The shard-id is in the node's aux fields, so there is a way to match it.

This callback (clusterCurrentBus->setNodePingPongEpoch() and getNodePingPongEpoch() - or we can rename them or split them into separate callbacks for the three fields) provides a way for cluster_nodes.c to pass the shard-epoch to cluster_raft.c so it can update it in the raft state.

For calls to clusterAddNodeToShard within cluster_raft.c, we could replace them with a wrapper to add a node and also do the shard-epoch bookkeeping.

It's an idea worth exploring, but I don't know yet which design is cleaner.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusing the epoch field makes a lot of sense to me. Alternatively we can store it in the aux/extensions?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean in the ip:port@cport,hostname,aux=val* string? It identifies the node, and when it changes, the node proposes a NODE_INFO entry for itself. Shard-epoch can bump without the node being changed...

The unused epoch fields seems better. It'd be enough to store it on the primary for each shard, but we could as well store it on all nodes.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the nodes.conf format has three unused columns when raft is used: num pings, last pong, epoch (used for gossip, all zeroed out for raft)

I like the idea. Now I have context on #3887 I will see what I can do.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What did we decide on here? Store it in the epoch of each node belonging to the shard? Or at least in the primary of the shard. I do like the idea of nodes.conf roughly splitting into base and incremental:

# BASE
a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2 192.168.1.10:7001@17001 myself,master - 0 0 1 connected 0-5460 shard-id=f1e2d3c4b5a6f1e2d3c4b5a6f1e2d3c4b5a6f1e2
b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3 192.168.1.11:7002@17002 master - 0 0 2 connected 5461-10922 shard-id=a6b5c4d3e2f1a6b5c4d3e2f1a6b5c4d3e2f1a6b5
c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4 192.168.1.12:7003@17003 master - 0 0 3 connected 10923-16383 shard-id=e2f1a6b5c4d3e2f1a6b5c4d3e2f1a6b5c4d3e2f1
# INCREMENTAL
log ...

Storing shard epoch in the "base" will mean that we will replay by reconstructing shard epoch state from the base, and as we replay the incremental logs, we can revalidate the shard epoch bumps, giving us some additional log validation. We can achieve the same with vars, but I like how clear this looks.

@sushilpaneru1 sushilpaneru1 Jun 8, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the current implementation aligns with your proposal. I have plugged into setNodePingPongEpoch so all node entries will have the shard epoch (Base). On replay of the log entries, we should be able to build the state back.

Note - as per #3887, there's a full rewrite for SLOT_CHANGE, SET_REPLICA_OF, FAILOVER so node entries will have the latest shapshot unless this does not change in future.

Comment thread design-docs/cluster-raft.md Outdated
Comment thread design-docs/cluster-raft.md
Comment thread design-docs/cluster-raft.md Outdated
Comment thread design-docs/cluster-raft.md Outdated
Comment thread src/cluster_state.c Outdated
dictSetUnsignedIntegerVal(epoch_de, 0);
} else {
sdsfree(epoch_key);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reusing the epoch field makes a lot of sense to me. Alternatively we can store it in the aux/extensions?

@sushilpaneru1 sushilpaneru1 force-pushed the cluster-v2-shard-epoch branch 2 times, most recently from f7d4366 to 48a19b6 Compare June 5, 2026 18:45
Comment thread design-docs/cluster-raft.md Outdated
Comment thread design-docs/cluster-raft.md Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SET_REPLICA_OF could potentially move a replica from one shard to another, right?

In theory it could race against a failover in the previous shard. Similar to NODE_FORGET. We shouldn't allow a primary to move shards without a failover first, so I think we would want an epoch guard here.

Could we model NODE_FORGET and this into one operation: SHARD_CHANGE?

SHARD_CHANGE <node-id> <source-shard-or-dash> <source-shard-epoch> <target-shard> <target-shard-epoch>

{shard, epoch} should uniquely identify a primary node. So we wouldn't need to send the primary node. Although we could for additional validation.

AFAICT NODE_FORGET includes removal from the shard AND removal from the cluster. Splitting it like this would mean SHARD_CHANGE my_id curr_shard curr_epoch - - removes from the shard, and then we can safely follow up with NODE_FORGET my_id to remove it from the cluster.

Creating a new shard and becoming primary would be: SHARD_CHANGE my_id - 0 new_shard_id 0 -> target-shard-epoch of zero implies create a new shard and can automatically make this node primary.

@zuiderkwast wdyt?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SET_REPLICA_OF could potentially move a replica from one shard to another, right?

Yes. Even automatically, when replica migration is enabled (which it is by default).

The node is either a replica in another shard or an empty primary before the move.

I used the name SET_PRIMARY_OF to hint that the node becomes a replica in the target shard. The name SHARD_CHANGE doesn't hint that. How about MOVE_REPLICA or something?

Source shard-id and shard-epoch and target shard-id and shard-epoch are enough. The primary id becomes is redundant. We can skip it, depending on what we name the entry.

AFAICT NODE_FORGET includes removal from the shard AND removal from the cluster. Splitting it like this would mean SHARD_CHANGE my_id curr_shard curr_epoch - - removes from the shard, and then we can safely follow up with NODE_FORGET my_id to remove it from the cluster.

All nodes in the cluster belong to a shard though. In gossip cluster, it does. It creates a new random shard-id for itself when it joins the cluster as an empty primary, and when it moves to an empty primary using CLUSTER REPLICATE NO ONE (SET_REPLICA_OF -).

To avoid a new state where a node exists without a shard, I think we can keep NODE_FORGET removing a node from a shard and from the cluster, at least for now. But you're right that we may way want to bump the shard-epoch on NODE_FORGET too.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you're right that we may way want to bump the shard-epoch on NODE_FORGET too.

@sushilpaneru1 has NODE_FORGET bumping in this PR. I was just trying to lower the surface area of shard-epoch-bumping commands. Converging to a single command could be done later, or not. Semantically it won't change much.

It creates a new random shard-id for itself when it joins the cluster as an empty primary, and when it moves to an empty primary using CLUSTER REPLICATE NO ONE (SET_REPLICA_OF -).

Right - that's what I meant when I said remove from shard - really means form a new shard. Then NODE_FORGET can only be done on primaries of shards that have no slots and no replicas.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that mean CLUSTER FORGET can fail in the middle, leaving the half-forgotten node as a singleton?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as a singleton (in Raft terms) but rather a member of an empty shard. Alternatively we can keep the single NODE_FORGET. I guess doing it atomically may be more friendly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the design doc I read "It is bumped each time membership or leadership of the shard changes." - so according to this NODE_FORGET should bump it.

If we bump it, then should the node that got REJECT retry it later with the latest epoch? We should avoid breaking the CLUSTER FORGET behavior.

@murphyjacob4 murphyjacob4 Jun 12, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think two options:

  1. NODE_FORGET bumps epoch, and therefore will be sequenced against leadership changes. Primaries that own slots will be disallowed from being forgotten until they are failed over.
  2. NODE_FORGET implicitly performs failover if the node was primary.

I would prefer 1, since this will make the log more explicit. The node that wants to do the forget will have to decide whether it wants to a) give up or b) trigger failover on the shard and retry. Most likely, we would want to go with "a)" and let the user decide if they want to do "b)" manually.

@murphyjacob4 murphyjacob4 Jun 12, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid breaking the CLUSTER FORGET behavior.

The existing CLUSTER FORGET is a bad user experience when we have a global log. It makes sense in gossip world since there is no canonical global state, but I don't know we need to have 1:1 parity.

Anyways, in the old model, replicas would not forget their primaries, which meant that primaries that had replicas would not actually be forgotten (they would be re-added after the ban list expiration).

My attempt at thinking it through:

Scenario Gossip behavior Raft implications My Suggestion
Leader nodes N/A We need to elect a new leader Allow removal. Once the membership is committed that the leader is out of the quorum, a new election will start and a new leader committed as the AE timers expire. We can optimize by instigating election immediately after NODE_FORGET commit, but this is just an optimization, and the old leader would instigate this on a single node to prevent a storm.
Replica nodes Allows removal No data loss risk as long as it doesn't race against failover Allow removal, but make NODE_FORGET use a shard epoch check/bump to prevent failover races.
Primaries with no slots and no replicas Allows removal No data loss risk since there are no slots. Allow removal.
Primaries with replicas (with slots or with no slots) Denies removal (replicas cannot forget their primary) Removal would cause the shard to have no active primary Deny removal. We should force the user to do a failover first and align the behavior with gossip. We can give an explicit error message, whereas gossip would silently fail.
Primaries with slots and without replicas Allows removal Risk of data loss if those slots have content. Risk of losing full slot coverage. Deny removal. The old CLUSTER FORGET would happily accept this and the node would be lost, and the cluster would enter fail state due to lack of coverage. In a bootstrapped production cluster - this is a footgun IMO. I would prefer we just make the administrator handle it explicitly (migrate the slots off and then CLUSTER FORGET the empty primary). If the primary node to be removed is down or unavailable, the administrator should be able to force migrate the slots off the node through CLUSTER SETSLOT et al.

Let me know if it makes sense.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing CLUSTER FORGET is a bad user experience when we have a global log. It makes sense in gossip world since there is no canonical global state, but I don't know we need to have 1:1 parity.

We can add some new admin commands that are better and recommend using them instead, but for the existing commands we should try to make them work...

Anyways, in the old model, replicas would not forget their primaries

This means it's fine to reject it in this case! Good new. No existing control plane would deleted primaries that have replicas.

| Primaries with slots and without replicas | .. | Data loss | Deny removal |

Yeah, there's data loss, but in some setups, that may be fine. Raft cluster is not only for the strongly consistent setups with sync replication that we're building. It should replace gossip also for the other cluster users too.

I'm fine with it. How and why would a sensible control plane do this? To replace it with a new node and assign the slots again? I'm not really sure if anyone does this. Can we forbid it in gossip too, to keep CLUSTER FORGET consistent?

I see that valkey-cli --cluster del-node rejects this with a message like "[ERR] Node %s:%d is not empty! Reshard data away and try again." -- It's the reference control plane tool, so it's a good sign.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added retry logic (limited retries) in an attempt to make it consistent with current behavior, do have a look at it and lmk if we are good with it.

@zuiderkwast

Copy link
Copy Markdown
Contributor

@sushilpaneru1 To link the PR to the issue, you need to use one of the keywords recognized by github for this. See https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests

@sushilpaneru1 sushilpaneru1 force-pushed the cluster-v2-shard-epoch branch from 48a19b6 to 648a47c Compare June 7, 2026 05:38
@sushilpaneru1 sushilpaneru1 marked this pull request as ready for review June 8, 2026 06:20
@sushilpaneru1 sushilpaneru1 requested a review from zuiderkwast June 8, 2026 06:20
@sushilpaneru1 sushilpaneru1 changed the title Cluster bus v2 - Added shard level epoch Cluster bus v2 - Added shard level epoch and proposal pre-validation Jun 8, 2026
CaffeeLake pushed a commit to CaffeeLake/valkey that referenced this pull request Jun 11, 2026
Try to fix the failures seen for `test "PSYNC2 valkey-io#3899 regression: verify
consistency"`.

This change resets the query buffer parser state in
`replicationCachePrimary()` which is called when the connection to the
primary is lost. Before valkey-io#2092, this was done by `resetClient()`.

The solution was inspired by the discussion about the regression
mentioned (discussion from 2017) and the related commits from that time:
6bc6bd4,
469d6e2,
c180bc7.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments.

Regarding whether we need to bump epoch on FORGET, I don't really know, but I guess we can change that later as well.

For the FAILOVER entry, do we need both primary-id and shard-id+epoch? It's redundant but I don't remember what we said earlier.

@murphyjacob4 I appreciate that you take a look at the correctness details.

Comment thread design-docs/cluster-raft.md Outdated
Comment thread design-docs/cluster-raft.md Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the design doc I read "It is bumped each time membership or leadership of the shard changes." - so according to this NODE_FORGET should bump it.

If we bump it, then should the node that got REJECT retry it later with the latest epoch? We should avoid breaking the CLUSTER FORGET behavior.

Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c
sushil paneru and others added 7 commits June 16, 2026 04:23
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
@sushilpaneru1 sushilpaneru1 force-pushed the cluster-v2-shard-epoch branch from 1469f89 to 061bed3 Compare June 17, 2026 07:32
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
and the entry is applied. On mismatch, the entry is a no-op and
the error is propagated to the caller's callback.

### Retry on stale epoch

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a high level, retry for proposal rejection looks like it defeats the purpose of epoch level fencing. From a operator point of view I think it makes sense to have a internal retry if epoch was stale; it kinda says the operation is semantically correct but issued concurrently. I have added a bound to number of retries. lmk what you think about the overall strategy? @zuiderkwast @murphyjacob4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Raft Cluster: Leader-Side Proposal Pre-Validation and Early Rejection Raft Cluster: Shard-Level Epochs for Fencing and Conflict Resolution

3 participants