Cluster Raft: Persist state and WAL in nodes.conf#3887
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## cluster-v2 #3887 +/- ##
==============================================
- Coverage 76.48% 76.46% -0.03%
==============================================
Files 166 166
Lines 82605 82735 +130
==============================================
+ Hits 63182 63260 +78
- Misses 19423 19475 +52
🚀 New features to boost your workflow:
|
| sds buf = sdsempty(); | ||
| for (uint64_t i = 0; i < rs->log_count; i++) { | ||
| raftLogEntry *e = rs->log[i]; | ||
| if (e->index < from) continue; |
There was a problem hiding this comment.
some form of indexing would be helpful to skip faster?
There was a problem hiding this comment.
I hope with log trimming in the future (#3858), the log should never be very long, so hopefully we don't need to optimize this. We can do it later if we see that it's needed.
| Raft state is persisted in `nodes.conf`. The file has three sections: | ||
|
|
||
| 1. **Node lines** — the cluster state snapshot (nodes, slots, replication | ||
| topology) as of `lastApplied`. |
There was a problem hiding this comment.
I assume this will include the shard level epoch as well ?
There was a problem hiding this comment.
Yes, we'll need to persist it in some way, either in the node lines or in the vars line.
I have an idea. I posted it on your PR. #3899 (comment)
Persist currentTerm, votedFor, and lastApplied in the vars line. Uncommitted log entries are appended to the end of nodes.conf as 'log <index> <term> <type> <data>' lines. Full rewrite (atomic write+rename) is triggered on term/vote changes and when applying log entries that affect myself: SLOT_CHANGE (when target or source is myself), SET_REPLICA_OF (replica is myself), FAILOVER (promoted or demoted is myself), and NODE_JOIN (promoted from learner). This ensures the snapshot in nodes.conf is up to date for state that matters on restart. Other log entries are appended with a single write+fsync, batched per event loop cycle. On load, the snapshot (node lines) represents state at lastApplied. Log entries in the tail are replayed into the in-memory log. The leader will update commit index via AppendEntries after reconnection. Also start replication on load if the node is a replica, and stop reading nodes.conf if a line without trailing newline is encountered (indicates a crash during append). Unskip tests that were blocked on persistence. Adapt failover stress test for raft: use role check instead of epoch Replace config epoch comparison with role check to detect failover. Skip the epoch post-condition assertion for cluster-raft. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
After loading nodes.conf, server.cluster->size was stuck at 0 because NODE_JOIN apply (which increments size) only runs for log tail entries, not for nodes already in the snapshot. Restore size in postLoad by counting loaded nodes without CLUSTER_NODE_MEET flag. Guard initLast to only set MEET flag and become singleton leader on fresh start (size == 0), not on restart. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
auxShardIdSetter() called clusterAddNodeToShard() without first removing the node from its previous shard. During MEET handshake, a node's shard-id can be updated multiple times (HI on the outbound link, then HELLO on the inbound link), causing it to accumulate in multiple shard dict entries. CLUSTER SHARDS then returns the node in a stale empty-slot shard instead of the correct one. Fix by early-returning when the shard-id is unchanged, and calling clusterRemoveNodeFromShard() before updating when it does change. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Shard-id is managed by dedicated raft log entries (NODE_JOIN, SET_REPLICA_OF), not NODE_INFO. Including it in the NODE_INFO address string caused unnecessary log entries whenever SET_REPLICA_OF changed a node's shard-id, as the periodic divergence check would detect the shard-id difference and re-propose NODE_INFO. Add clusterNodeAppendAddressStringNoShardId() which omits the shard-id aux field, and use it when building and comparing NODE_INFO entries. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
A node must only grant its vote if the candidate's log is at least as up-to-date as its own (Raft §5.4.1). Without this check, a node with a stale log (e.g. after CLUSTER RESET HARD) could win an election and overwrite other nodes' logs via AppendEntries truncation. Compare the candidate's last log term and index against our own: grant the vote only if the candidate's last term is higher, or if terms are equal and the candidate's last index is >= ours. Skip 'CLUSTER MYSHARDID reports same shard id after cluster restart' under raft: when R0-R7 restart while R8 stays running, R8 inflates its term with repeated failed elections, disrupting leader election. This needs pre-vote (Raft §9.6) to fix properly. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The cluster state is initialized to CLUSTER_FAIL. Without setting todo_update_slot_coverage during raft init, clusterRaftCheckSlotCoverage never runs after a restart, leaving cluster_state permanently at FAIL even though all slots are properly assigned from the loaded nodes.conf. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Every raftLogAppend call was followed by raftLogMarkDirty. Merge the persistence marking into raftLogAppend itself. On startup, postLoad clears todo_persist_log since loaded entries are already on disk. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Without periodic rewrites, nodes.conf grows unboundedly with appended log lines. Trigger a full rewrite (which removes applied entries from the tail) when 100 entries have been applied since the last rewrite. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
7c9f76a to
415c420
Compare
Send the success AE_ACK from beforeSleep after entries are persisted, rather than immediately in the AE handler. This makes the persist- before-ACK safety invariant explicit in the code instead of relying on event loop ordering (beforeSleep running before write handlers). The ACK reports current state (term, last_log_index) at send time, so it remains correct even if a leader change occurs between receiving AE and sending the deferred ACK. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Raft nodes must remain active during RDB loading (full sync): - Followers must persist entries and send AE_ACK, otherwise the leader may fail them over. - The leader must continue sending AE heartbeats, otherwise followers will start elections. Call clusterBeforeSleep() from the ProcessingEventsWhileBlocked path in beforeSleep so that raft persistence, deferred ACKs, and heartbeat broadcasting are handled during loading. Gossip's beforeSleep returns early during loading to preserve existing behavior. The slot migration cron in clusterBeforeSleep is also skipped during loading, preserving legacy behavior. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
murphyjacob4
left a comment
There was a problem hiding this comment.
Thanks @zuiderkwast for a super clean implementation (appreciate the use of todo bits to defer it all to beforeSleep, making it easy to rationalize about).
A couple correctness related things
my_last_committed_info starts as an empty string. The periodic NODE_INFO divergence check compares against it after 10 seconds, always finding a mismatch and proposing a redundant NODE_INFO entry. Initialize it when our own NODE_JOIN is applied, since at that point our address and flags are known. Extract the NODE_INFO data string construction into clusterRaftBuildMyNodeInfo() to avoid duplication across the three call sites. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The Raft paper requires votedFor to be on stable storage before responding to a vote request. Otherwise a crash after responding but before persisting could allow voting for a different candidate in the same term after restart, breaking the single-vote-per-term invariant. Defer the granted vote response to beforeSleep, after the full config rewrite (triggered by todo_save_config) persists votedFor. Denial responses are sent immediately since they don't change persisted state. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When starting an election, the candidate increments its term and votes for itself. These must be on stable storage before sending RequestVote, otherwise a crash could allow the node to vote for a different candidate in the same term after restart (double-voting). Defer broadcasting RequestVote to beforeSleep, after todo_save_config persists the new term and votedFor. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When a log conflict is detected and entries are truncated, those entries may already be persisted on disk. A full config rewrite is needed to remove them, otherwise they would be replayed on restart. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Previously, a singleton leader (quorum=1) committed and applied entries inline in clusterRaftPropose, before they were persisted. This meant side effects (CLUSTER MEET unblock, broadcast AE to new peer) could fire before the entry was on stable storage. Move singleton commit to beforeSleep, after the persist step. This ensures all entries are durable before being applied, and all apply side effects (unblock meets, broadcast AE) happen at the right time without needing separate deferral flags for each. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Write a compacted nodes.conf at shutdown. Without this, the node restarts with committed entries in the log tail that it cannot apply until the leader sends AE with the updated commit index. With the rewrite, lastApplied is up to date and the log tail only contains truly uncommitted entries (if any). Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The trailing newline check could break gossip users whose editors strip final newlines. Since it only protects raft log lines from crash-during-append, and we plan to add per-line checksums as the proper solution, remove the check for now. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Persist Raft state to disk so nodes can recover after restart without losing cluster membership or committed state.
More details on persistence below.
Additional changes, found when testing restarts (which need persistence but also other things):
What's persisted
Write strategy
FAILOVER, NODE_JOIN promotion). This keeps the node-lines snapshot current for state that matters on restart.
When 100 entries have been appended without a full rewrite, we trigger a full rewrite.
Startup
Incomplete lines (no trailing newline) are discarded as crash artifacts.
Tests
Closes #3857