Cluster bus v2 - Added shard level epoch and proposal pre-validation#3899
Cluster bus v2 - Added shard level epoch and proposal pre-validation#3899sushilpaneru1 wants to merge 9 commits into
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
src/cluster_state.h (1)
132-134: ⚡ Quick winDocument the shard-epoch helper contracts in the header.
These new public APIs have non-obvious behavior that callers need to rely on correctly, especially around missing shard IDs and lifecycle ownership of epoch entries. Please add function comments here rather than forcing every caller to infer the contract from the
.cfile. As per coding guidelines, "Document why code exists, not just what it does; document all functions in C code".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cluster_state.h` around lines 132 - 134, Add descriptive function comments above clusterGetShardEpoch, clusterSetShardEpoch, and clusterRemoveShardEpoch that explain their contracts: state what clusterGetShardEpoch returns for existing and missing shard_id (e.g., special value or error), whether the caller must distinguish a zero epoch vs "not found", and whether callers should check for missing entries; document that clusterSetShardEpoch creates or updates an epoch entry and specify ownership/lifecycle guarantees (who may remove it and whether it is persisted), and document clusterRemoveShardEpoch's effect (idempotent, no-op if missing) and any concurrency/thread-safety expectations. Make sure each comment also states parameter semantics (shard_id lifetime/NULL handling) and the return/side-effect guarantees so callers don’t need to read the .c to use these APIs.tests/unit/cluster/cluster-raft-proto.tcl (3)
409-430: ⚡ Quick winConsider using a more robust synchronization or longer timeout.
The
after 500wait (line 419) is timing-dependent. While 500ms is probably sufficient to verify that pre-validation rejected the proposal immediately, it might be flaky on slow CI environments. According to coding guidelines, tests should avoid timing-dependent waits and use proper synchronization.Suggested alternatives
Option 1: Use a longer timeout for safety
- after 500 + after 2000Option 2: Poll the commit index to ensure it remains stable
- after 500 - - # Verify log index unchanged (rejected at pre-validation). - set commit_after [get_cluster_info_field $r0 cluster_raft_commit_index] - assert_equal $commit_before $commit_after + # Poll to ensure commit index remains unchanged over time. + set stable 1 + for {set i 0} {$i < 5} {incr i} { + after 200 + set commit_now [get_cluster_info_field $r0 cluster_raft_commit_index] + if {$commit_now != $commit_before} { + set stable 0 + break + } + } + assert_equal 1 $stable "commit index should remain unchanged"As per coding guidelines, Valkey Tcl tests should avoid timing-dependent tests and use proper synchronization.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cluster/cluster-raft-proto.tcl` around lines 409 - 430, The test uses a fixed sleep (after 500) after sending the fake PROPOSE which is timing-dependent; replace this with a synchronization loop that polls the Raft commit index (using get_cluster_info_field $r0 cluster_raft_commit_index) or an existing wait helper until either the index remains equal to the pre-send value for a short stable period or a timeout elapses, then proceed to evaluate $state_check; update the test inside the foreach over $stale_proposals (around the raft_send/connect_fake_node block) to remove the after 500 and use the polling/wait approach so the assertion that commit_before == commit_after is robust on slow CI.
444-460: ⚡ Quick winConsider using a more robust synchronization or longer timeout.
The
after 500wait (line 452) has the same timing-dependency concern as the stale proposal tests. See previous comment for suggested alternatives.As per coding guidelines, Valkey Tcl tests should avoid timing-dependent tests and use proper synchronization.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cluster/cluster-raft-proto.tcl` around lines 444 - 460, The test relies on a brittle fixed sleep (after 500); replace it with deterministic synchronization by polling the cluster state or using an existing wait helper instead of a hard sleep: after calling connect_fake_node/raft_send, loop (with a short sleep and a total timeout) calling get_cluster_info_field $r0 cluster_raft_commit_index and assert it remains equal to the initial $commit_before, or use a provided helper like wait_for_condition/wait_for_cluster_raft_commit_index_equal if available; remove the after 500 and close the fd only after the poll confirms the commit index is unchanged (or the wait helper times out and fails the test).
437-460: 💤 Low valueConsider adding state verification assertions.
The missing epoch tests verify that the commit index doesn't change but lack the cluster state assertions that the stale proposal tests include (lines 399-406, evaluated at line 426). Adding similar state checks would make these tests more thorough and consistent with the stale proposal tests.
For example, the NODE_FORGET test could verify that
$node1_idis still present inCLUSTER NODES, and the SLOT_CHANGE test could verify that slot assignments remain unchanged.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cluster/cluster-raft-proto.tcl` around lines 437 - 460, Add assertions that the cluster state remains unchanged after each missing-epoch proposal: after sending propose_msg and verifying commit index unchanged, query the cluster state (e.g., run "CLUSTER NODES" and the slot assignment query used in the stale-proposal tests) and assert that for the "NODE_FORGET (missing epoch)" case $node1_id is still present, and for "SLOT_CHANGE (missing epoch)" that slot ownership/assignments remain identical to their pre-proposal values; implement these checks inside the same test block that references missing_epoch_proposals using the existing helpers (commit_before/commit_after, propose_msg, and the cluster info retrieval used elsewhere) so the tests mirror the state-verification done in the stale proposal tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@design-docs/cluster-raft.md`:
- Around line 629-642: The PROPOSE and Leader Validation section is now
inconsistent with the Validation section; update the earlier "PROPOSE and Leader
Validation" paragraph to state that leaders perform a pre-validation check and
may reject obviously stale proposals before appending (a best-effort, read-only
check that does not bump epoch), and clarify that authoritative validation still
happens at apply time (referencing the "Validation" section's apply-time
validation rules and epoch 0 behavior). Ensure the updated text mentions the
leader pre-validation behavior, that it is best-effort/read-only, and that
apply-time validation remains authoritative.
In `@src/cluster_state.c`:
- Around line 419-432: clusterSetShardEpoch currently inserts missing epoch
entries into server.cluster->shard_epochs, breaking the invariant that epoch
entries mirror server.cluster->shards; update clusterSetShardEpoch to first
check that the shard exists in server.cluster->shards (using dictFind on
server.cluster->shards with the same sds key) and if the shard is not present
simply free any temporary sds and return without adding to shard_epochs; only
update or add an epoch entry when the shard key is known to exist so
clusterAddNodeToShard won't reuse orphaned epochs.
---
Nitpick comments:
In `@src/cluster_state.h`:
- Around line 132-134: Add descriptive function comments above
clusterGetShardEpoch, clusterSetShardEpoch, and clusterRemoveShardEpoch that
explain their contracts: state what clusterGetShardEpoch returns for existing
and missing shard_id (e.g., special value or error), whether the caller must
distinguish a zero epoch vs "not found", and whether callers should check for
missing entries; document that clusterSetShardEpoch creates or updates an epoch
entry and specify ownership/lifecycle guarantees (who may remove it and whether
it is persisted), and document clusterRemoveShardEpoch's effect (idempotent,
no-op if missing) and any concurrency/thread-safety expectations. Make sure each
comment also states parameter semantics (shard_id lifetime/NULL handling) and
the return/side-effect guarantees so callers don’t need to read the .c to use
these APIs.
In `@tests/unit/cluster/cluster-raft-proto.tcl`:
- Around line 409-430: The test uses a fixed sleep (after 500) after sending the
fake PROPOSE which is timing-dependent; replace this with a synchronization loop
that polls the Raft commit index (using get_cluster_info_field $r0
cluster_raft_commit_index) or an existing wait helper until either the index
remains equal to the pre-send value for a short stable period or a timeout
elapses, then proceed to evaluate $state_check; update the test inside the
foreach over $stale_proposals (around the raft_send/connect_fake_node block) to
remove the after 500 and use the polling/wait approach so the assertion that
commit_before == commit_after is robust on slow CI.
- Around line 444-460: The test relies on a brittle fixed sleep (after 500);
replace it with deterministic synchronization by polling the cluster state or
using an existing wait helper instead of a hard sleep: after calling
connect_fake_node/raft_send, loop (with a short sleep and a total timeout)
calling get_cluster_info_field $r0 cluster_raft_commit_index and assert it
remains equal to the initial $commit_before, or use a provided helper like
wait_for_condition/wait_for_cluster_raft_commit_index_equal if available; remove
the after 500 and close the fd only after the poll confirms the commit index is
unchanged (or the wait helper times out and fails the test).
- Around line 437-460: Add assertions that the cluster state remains unchanged
after each missing-epoch proposal: after sending propose_msg and verifying
commit index unchanged, query the cluster state (e.g., run "CLUSTER NODES" and
the slot assignment query used in the stale-proposal tests) and assert that for
the "NODE_FORGET (missing epoch)" case $node1_id is still present, and for
"SLOT_CHANGE (missing epoch)" that slot ownership/assignments remain identical
to their pre-proposal values; implement these checks inside the same test block
that references missing_epoch_proposals using the existing helpers
(commit_before/commit_after, propose_msg, and the cluster info retrieval used
elsewhere) so the tests mirror the state-verification done in the stale proposal
tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5cf4bdc6-f8cc-4247-a71a-b2aa849b9a6f
📒 Files selected for processing (7)
design-docs/cluster-raft.mdsrc/cluster.csrc/cluster_legacy.csrc/cluster_raft.csrc/cluster_state.csrc/cluster_state.htests/unit/cluster/cluster-raft-proto.tcl
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## cluster-v2 #3899 +/- ##
==============================================
+ Coverage 76.48% 76.57% +0.08%
==============================================
Files 166 166
Lines 82605 82806 +201
==============================================
+ Hits 63182 63409 +227
+ Misses 19423 19397 -26
🚀 New features to boost your workflow:
|
zuiderkwast
left a comment
There was a problem hiding this comment.
Nice! Not a full review. I just took a quick look and noticed that the common code outside of cluster_raft.c is affected more than I was expecting. See the comment below.
| dictSetUnsignedIntegerVal(epoch_de, 0); | ||
| } else { | ||
| sdsfree(epoch_key); | ||
| } |
There was a problem hiding this comment.
Shard-epoch is a Raft-cluster-specific concept, so ideally it should be implemented in cluster_raft.c and stored in the clusterRaftState instead of the common cluster state.
I see though that clusterAddNodeToShard and clusterRemoveNodeFromShard are called from various places, including from cluster_nodes.c when loading node.conf. 🤔
Have you considered any other ways to structure this?
When we persist shard-epoch in nodes.conf (combining this PR with #3887) do we need to store it for all shards in the vars line or how should we store it?
I have an idea: the nodes.conf format has three unused columns when raft is used: num pings, last pong, epoch (used for gossip, all zeroed out for raft). When nodes.conf is loaded, these three are passed to clusterCurrentBus->setNodePingPongEpoch() so the cluster bus implementation can do whatever it wants, and it's only used for the gossip protocol currently. For Raft, we could use e.g. the epoch field to store the shard-epoch for the shard the node belongs to. The shard-id is in the node's aux fields, so there is a way to match it.
This callback (clusterCurrentBus->setNodePingPongEpoch() and getNodePingPongEpoch() - or we can rename them or split them into separate callbacks for the three fields) provides a way for cluster_nodes.c to pass the shard-epoch to cluster_raft.c so it can update it in the raft state.
For calls to clusterAddNodeToShard within cluster_raft.c, we could replace them with a wrapper to add a node and also do the shard-epoch bookkeeping.
It's an idea worth exploring, but I don't know yet which design is cleaner.
There was a problem hiding this comment.
Reusing the epoch field makes a lot of sense to me. Alternatively we can store it in the aux/extensions?
There was a problem hiding this comment.
You mean in the ip:port@cport,hostname,aux=val* string? It identifies the node, and when it changes, the node proposes a NODE_INFO entry for itself. Shard-epoch can bump without the node being changed...
The unused epoch fields seems better. It'd be enough to store it on the primary for each shard, but we could as well store it on all nodes.
There was a problem hiding this comment.
the nodes.conf format has three unused columns when raft is used: num pings, last pong, epoch (used for gossip, all zeroed out for raft)
I like the idea. Now I have context on #3887 I will see what I can do.
There was a problem hiding this comment.
What did we decide on here? Store it in the epoch of each node belonging to the shard? Or at least in the primary of the shard. I do like the idea of nodes.conf roughly splitting into base and incremental:
# BASE
a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2 192.168.1.10:7001@17001 myself,master - 0 0 1 connected 0-5460 shard-id=f1e2d3c4b5a6f1e2d3c4b5a6f1e2d3c4b5a6f1e2
b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3 192.168.1.11:7002@17002 master - 0 0 2 connected 5461-10922 shard-id=a6b5c4d3e2f1a6b5c4d3e2f1a6b5c4d3e2f1a6b5
c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4 192.168.1.12:7003@17003 master - 0 0 3 connected 10923-16383 shard-id=e2f1a6b5c4d3e2f1a6b5c4d3e2f1a6b5c4d3e2f1
# INCREMENTAL
log ...
Storing shard epoch in the "base" will mean that we will replay by reconstructing shard epoch state from the base, and as we replay the incremental logs, we can revalidate the shard epoch bumps, giving us some additional log validation. We can achieve the same with vars, but I like how clear this looks.
There was a problem hiding this comment.
Yes, I think the current implementation aligns with your proposal. I have plugged into setNodePingPongEpoch so all node entries will have the shard epoch (Base). On replay of the log entries, we should be able to build the state back.
Note - as per #3887, there's a full rewrite for SLOT_CHANGE, SET_REPLICA_OF, FAILOVER so node entries will have the latest shapshot unless this does not change in future.
| dictSetUnsignedIntegerVal(epoch_de, 0); | ||
| } else { | ||
| sdsfree(epoch_key); | ||
| } |
There was a problem hiding this comment.
Reusing the epoch field makes a lot of sense to me. Alternatively we can store it in the aux/extensions?
f7d4366 to
48a19b6
Compare
There was a problem hiding this comment.
SET_REPLICA_OF could potentially move a replica from one shard to another, right?
In theory it could race against a failover in the previous shard. Similar to NODE_FORGET. We shouldn't allow a primary to move shards without a failover first, so I think we would want an epoch guard here.
Could we model NODE_FORGET and this into one operation: SHARD_CHANGE?
SHARD_CHANGE <node-id> <source-shard-or-dash> <source-shard-epoch> <target-shard> <target-shard-epoch>
{shard, epoch} should uniquely identify a primary node. So we wouldn't need to send the primary node. Although we could for additional validation.
AFAICT NODE_FORGET includes removal from the shard AND removal from the cluster. Splitting it like this would mean SHARD_CHANGE my_id curr_shard curr_epoch - - removes from the shard, and then we can safely follow up with NODE_FORGET my_id to remove it from the cluster.
Creating a new shard and becoming primary would be: SHARD_CHANGE my_id - 0 new_shard_id 0 -> target-shard-epoch of zero implies create a new shard and can automatically make this node primary.
@zuiderkwast wdyt?
There was a problem hiding this comment.
SET_REPLICA_OF could potentially move a replica from one shard to another, right?
Yes. Even automatically, when replica migration is enabled (which it is by default).
The node is either a replica in another shard or an empty primary before the move.
I used the name SET_PRIMARY_OF to hint that the node becomes a replica in the target shard. The name SHARD_CHANGE doesn't hint that. How about MOVE_REPLICA or something?
Source shard-id and shard-epoch and target shard-id and shard-epoch are enough. The primary id becomes is redundant. We can skip it, depending on what we name the entry.
AFAICT NODE_FORGET includes removal from the shard AND removal from the cluster. Splitting it like this would mean SHARD_CHANGE my_id curr_shard curr_epoch - - removes from the shard, and then we can safely follow up with NODE_FORGET my_id to remove it from the cluster.
All nodes in the cluster belong to a shard though. In gossip cluster, it does. It creates a new random shard-id for itself when it joins the cluster as an empty primary, and when it moves to an empty primary using CLUSTER REPLICATE NO ONE (SET_REPLICA_OF -).
To avoid a new state where a node exists without a shard, I think we can keep NODE_FORGET removing a node from a shard and from the cluster, at least for now. But you're right that we may way want to bump the shard-epoch on NODE_FORGET too.
There was a problem hiding this comment.
But you're right that we may way want to bump the shard-epoch on NODE_FORGET too.
@sushilpaneru1 has NODE_FORGET bumping in this PR. I was just trying to lower the surface area of shard-epoch-bumping commands. Converging to a single command could be done later, or not. Semantically it won't change much.
It creates a new random shard-id for itself when it joins the cluster as an empty primary, and when it moves to an empty primary using CLUSTER REPLICATE NO ONE (SET_REPLICA_OF -).
Right - that's what I meant when I said remove from shard - really means form a new shard. Then NODE_FORGET can only be done on primaries of shards that have no slots and no replicas.
There was a problem hiding this comment.
Would that mean CLUSTER FORGET can fail in the middle, leaving the half-forgotten node as a singleton?
There was a problem hiding this comment.
Not as a singleton (in Raft terms) but rather a member of an empty shard. Alternatively we can keep the single NODE_FORGET. I guess doing it atomically may be more friendly.
There was a problem hiding this comment.
In the design doc I read "It is bumped each time membership or leadership of the shard changes." - so according to this NODE_FORGET should bump it.
If we bump it, then should the node that got REJECT retry it later with the latest epoch? We should avoid breaking the CLUSTER FORGET behavior.
There was a problem hiding this comment.
I think two options:
- NODE_FORGET bumps epoch, and therefore will be sequenced against leadership changes. Primaries that own slots will be disallowed from being forgotten until they are failed over.
- NODE_FORGET implicitly performs failover if the node was primary.
I would prefer 1, since this will make the log more explicit. The node that wants to do the forget will have to decide whether it wants to a) give up or b) trigger failover on the shard and retry. Most likely, we would want to go with "a)" and let the user decide if they want to do "b)" manually.
There was a problem hiding this comment.
We should avoid breaking the CLUSTER FORGET behavior.
The existing CLUSTER FORGET is a bad user experience when we have a global log. It makes sense in gossip world since there is no canonical global state, but I don't know we need to have 1:1 parity.
Anyways, in the old model, replicas would not forget their primaries, which meant that primaries that had replicas would not actually be forgotten (they would be re-added after the ban list expiration).
My attempt at thinking it through:
| Scenario | Gossip behavior | Raft implications | My Suggestion |
|---|---|---|---|
| Leader nodes | N/A | We need to elect a new leader | Allow removal. Once the membership is committed that the leader is out of the quorum, a new election will start and a new leader committed as the AE timers expire. We can optimize by instigating election immediately after NODE_FORGET commit, but this is just an optimization, and the old leader would instigate this on a single node to prevent a storm. |
| Replica nodes | Allows removal | No data loss risk as long as it doesn't race against failover | Allow removal, but make NODE_FORGET use a shard epoch check/bump to prevent failover races. |
| Primaries with no slots and no replicas | Allows removal | No data loss risk since there are no slots. | Allow removal. |
| Primaries with replicas (with slots or with no slots) | Denies removal (replicas cannot forget their primary) | Removal would cause the shard to have no active primary | Deny removal. We should force the user to do a failover first and align the behavior with gossip. We can give an explicit error message, whereas gossip would silently fail. |
| Primaries with slots and without replicas | Allows removal | Risk of data loss if those slots have content. Risk of losing full slot coverage. | Deny removal. The old CLUSTER FORGET would happily accept this and the node would be lost, and the cluster would enter fail state due to lack of coverage. In a bootstrapped production cluster - this is a footgun IMO. I would prefer we just make the administrator handle it explicitly (migrate the slots off and then CLUSTER FORGET the empty primary). If the primary node to be removed is down or unavailable, the administrator should be able to force migrate the slots off the node through CLUSTER SETSLOT et al. |
Let me know if it makes sense.
There was a problem hiding this comment.
The existing CLUSTER FORGET is a bad user experience when we have a global log. It makes sense in gossip world since there is no canonical global state, but I don't know we need to have 1:1 parity.
We can add some new admin commands that are better and recommend using them instead, but for the existing commands we should try to make them work...
Anyways, in the old model, replicas would not forget their primaries
This means it's fine to reject it in this case! Good new. No existing control plane would deleted primaries that have replicas.
| Primaries with slots and without replicas | .. | Data loss | Deny removal |
Yeah, there's data loss, but in some setups, that may be fine. Raft cluster is not only for the strongly consistent setups with sync replication that we're building. It should replace gossip also for the other cluster users too.
I'm fine with it. How and why would a sensible control plane do this? To replace it with a new node and assign the slots again? I'm not really sure if anyone does this. Can we forbid it in gossip too, to keep CLUSTER FORGET consistent?
I see that valkey-cli --cluster del-node rejects this with a message like "[ERR] Node %s:%d is not empty! Reshard data away and try again." -- It's the reference control plane tool, so it's a good sign.
There was a problem hiding this comment.
I have added retry logic (limited retries) in an attempt to make it consistent with current behavior, do have a look at it and lmk if we are good with it.
|
@sushilpaneru1 To link the PR to the issue, you need to use one of the keywords recognized by github for this. See https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests |
48a19b6 to
648a47c
Compare
Try to fix the failures seen for `test "PSYNC2 valkey-io#3899 regression: verify consistency"`. This change resets the query buffer parser state in `replicationCachePrimary()` which is called when the connection to the primary is lost. Before valkey-io#2092, this was done by `resetClient()`. The solution was inspired by the discussion about the regression mentioned (discussion from 2017) and the related commits from that time: 6bc6bd4, 469d6e2, c180bc7. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast
left a comment
There was a problem hiding this comment.
A few comments.
Regarding whether we need to bump epoch on FORGET, I don't really know, but I guess we can change that later as well.
For the FAILOVER entry, do we need both primary-id and shard-id+epoch? It's redundant but I don't remember what we said earlier.
@murphyjacob4 I appreciate that you take a look at the correctness details.
There was a problem hiding this comment.
In the design doc I read "It is bumped each time membership or leadership of the shard changes." - so according to this NODE_FORGET should bump it.
If we bump it, then should the node that got REJECT retry it later with the latest epoch? We should avoid breaking the CLUSTER FORGET behavior.
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
1469f89 to
061bed3
Compare
Signed-off-by: Sushil Paneru <sushil.paneru1@gmail.com>
061bed3 to
18a1c8b
Compare
| and the entry is applied. On mismatch, the entry is a no-op and | ||
| the error is propagated to the caller's callback. | ||
|
|
||
| ### Retry on stale epoch |
There was a problem hiding this comment.
On a high level, retry for proposal rejection looks like it defeats the purpose of epoch level fencing. From a operator point of view I think it makes sense to have a internal retry if epoch was stale; it kinda says the operation is semantically correct but issued concurrently. I have added a bound to number of retries. lmk what you think about the overall strategy? @zuiderkwast @murphyjacob4
closes #3863, closes #3874
Shard Epoch- Introduced fencing token as additional state to guard against stale or out of order mutation of a shard's state. Overall, the idea is - for each shard's state, we maintain a version number which is incremented on each mutation; stale mutation will be rejected as the associated version number will be older (similar to MVCC).Pre-Validation and Rejection- The raft leader now pre-validates proposals (shard epoch staleness, shard-id mismatch, missing fields) before appending to the log. Invalid proposals are rejected immediately, keeping the log clean. When a proposal is rejected, the leader sends aREJECT <entry>message back to the proposing follower. The follower matches this against its pending proposals and fires the client callback with an error, so the client receives-ERR proposal rejected by raft leaderimmediately instead of blocking until timeout i.e two layer callback chain is handled. Two stats counters (cluster_stats_proposals_rejected_prevalidationandcluster_stats_proposals_rejected_apply) provide observability into rejections at each stage.Testing
New tests added- Verifies rejection due to stale epoch, missing epoch, wrong shard-id mismatch and also rejection wire message.REPLICATE, some of the test failed due to race condition (stale epoch), added retry to the tests.