Cluster Raft: Persist state and WAL in nodes.conf by zuiderkwast · Pull Request #3887 · valkey-io/valkey

zuiderkwast · 2026-06-01T15:04:07Z

Persist Raft state to disk so nodes can recover after restart without losing cluster membership or committed state.

More details on persistence below.

Additional changes, found when testing restarts (which need persistence but also other things):

Fix node appearing in multiple shards — shard-id changes left stale entries
Exclude shard-id from NODE_INFO — shard-id changes should go through SET_REPLICA_OF, not NODE_INFO
Log completeness check in RequestVote — Raft safety: don't vote for candidates with less complete logs
Evaluate slot coverage on startup — so cluster_state reaches "ok" after restart without waiting for a new commit
Run the Raft cluster's beforeSleep during RDB loading (ProcessingEventsWhileBlocked): followers need to persist and acknowledge entries, and the leader needs to keep sending heartbeats.

What's persisted

Node lines: each node with its name, slots, etc., in the same format as legacy nodes.conf and CLUSTER NODES output. This is the snapshot.
Vars line: currentTerm, votedFor, lastApplied (added to the existing vars section)
Log lines: uncommitted entries, appended as log lines at the end of nodes.conf

Write strategy

Full rewrite (atomic write → temp file → rename → fsync): triggered on term/vote changes and when an applied entry affects myself (SLOT_CHANGE, SET_REPLICA_OF, FAILOVER, NODE_JOIN promotion). This keeps the node-lines snapshot current for state that matters on restart.
Append-only (write + fsync, batched per event loop): for new log entries from AE or local proposals. The fsync completes in beforeSleep before AE_ACK is added to the send buffer, satisfying Raft's persist-before-acknowledge requirement.

When 100 entries have been appended without a full rewrite, we trigger a full rewrite.
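
As a rough illustration of the two write paths described above, here is a minimal POSIX sketch. The names `toyWriteAll`, `toyConfAppend` and `toyConfRewrite` are hypothetical and not taken from this PR. The sketch fsyncs the temp file before renaming it, which is an assumption about ordering, and it leaves out the directory fsync and the per-event-loop batching.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer, looping over short writes. */
static int toyWriteAll(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Append path: add new log lines to the end of the file and fsync
 * before the caller is allowed to acknowledge them. */
int toyConfAppend(const char *path, const char *lines) {
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0) return -1;
    int rc = toyWriteAll(fd, lines, strlen(lines));
    if (rc == 0 && fsync(fd) != 0) rc = -1;
    if (close(fd) != 0) rc = -1;
    return rc;
}

/* Full rewrite path: write the complete content to a temp file, make it
 * durable, then atomically replace the old file with rename(). */
int toyConfRewrite(const char *path, const char *content) {
    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp)) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    int rc = toyWriteAll(fd, content, strlen(content));
    if (rc == 0 && fsync(fd) != 0) rc = -1;
    if (close(fd) != 0) rc = -1;
    if (rc == 0 && rename(tmp, path) != 0) rc = -1;
    if (rc != 0) unlink(tmp); /* do not leave a stale temp file behind */
    return rc;
}
```

A reader of the file sees either the old content or the new content after a rewrite, never a mix, because `rename()` replaces the target atomically on POSIX file systems.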

Startup

Node lines → cluster state snapshot restored
Vars → currentTerm, votedFor, lastApplied restored; commitIndex set to lastApplied
Log tail → entries replayed into in-memory log
Leader sends AE after reconnection → commitIndex advances → entries applied

Incomplete lines (no trailing newline) are discarded as crash artifacts.
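
To make the log-tail handling concrete, here is a small sketch of parsing one log line and rejecting a line that has no trailing newline. It assumes the `log <index> <term> <type> <data>` layout from the commit message and treats `<data>` as an opaque string; `toyLogLine` and `toyParseLogLine` are hypothetical names, and the buffer sizes are arbitrary.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of one 'log <index> <term> <type> <data>' line. */
typedef struct {
    uint64_t index;
    uint64_t term;
    char type[32];
    char data[256];
} toyLogLine;

/* Returns 1 if `line` is a complete, well-formed log line, 0 otherwise.
 * A line without a trailing newline is treated as a crash artifact. */
int toyParseLogLine(const char *line, toyLogLine *out) {
    size_t len = strlen(line);
    if (len == 0 || line[len - 1] != '\n') return 0;
    if (strncmp(line, "log ", 4) != 0) return 0;

    unsigned long long index, term;
    int consumed = 0;
    if (sscanf(line, "log %llu %llu %31s %n",
               &index, &term, out->type, &consumed) != 3) {
        return 0;
    }

    /* Everything after the type, minus the newline, is the payload. */
    size_t start = (size_t)consumed;
    size_t data_len = start < len ? len - start - 1 : 0;
    if (data_len >= sizeof(out->data)) return 0;
    memcpy(out->data, line + start, data_len);
    out->data[data_len] = '\0';
    out->index = index;
    out->term = term;
    return 1;
}
```

A loader built on this would stop reading at the first line for which the function returns 0 and keep everything parsed before it.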

Tests

Unskipped persistence-dependent tests: availability-zone, cluster-shards, failover, hostnames, resharding
Adapted failover stress test for raft (role check instead of epoch comparison)

Closes #3857

coderabbitai · 2026-06-01T15:04:18Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


codecov · 2026-06-01T15:32:24Z

Codecov Report

❌ Patch coverage is 72.82609% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.46%. Comparing base (2604112) to head (e702a5e).

Files with missing lines	Patch %	Lines
src/cluster_raft.c	71.42%	46 Missing ⚠️
src/cluster_nodes.c	75.00%	4 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff               @@
##           cluster-v2    #3887      +/-   ##
==============================================
- Coverage       76.48%   76.46%   -0.03%     
==============================================
  Files             166      166              
  Lines           82605    82735     +130     
==============================================
+ Hits            63182    63260      +78     
- Misses          19423    19475      +52

Files with missing lines	Coverage Δ
src/cluster.c	`91.05% <100.00%> (ø)`
src/cluster_legacy.c	`91.61% <100.00%> (-0.30%)`	⬇️
src/server.c	`89.56% <100.00%> (+0.05%)`	⬆️
src/cluster_nodes.c	`75.85% <75.00%> (-0.76%)`	⬇️
src/cluster_raft.c	`63.03% <71.42%> (+1.76%)`	⬆️

... and 15 files with indirect coverage changes


hpatro · 2026-06-02T10:58:40Z

+    sds buf = sdsempty();
+    for (uint64_t i = 0; i < rs->log_count; i++) {
+        raftLogEntry *e = rs->log[i];
+        if (e->index < from) continue;


some form of indexing would be helpful to skip faster?

I hope with log trimming in the future (#3858), the log should never be very long, so hopefully we don't need to optimize this. We can do it later if we see that it's needed.
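
For reference, if the in-memory log holds contiguous, ascending indexes (an assumption, not something this thread confirms), the skip could be done in constant time by computing the starting position from the first entry's index. `toyLogEntry` and `toyFirstPosFrom` are hypothetical names used only for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a log entry; only the index matters here. */
typedef struct {
    uint64_t index;
} toyLogEntry;

/* Returns the array position of the first entry whose index is >= from,
 * or `count` if there is none. Assumes log[i]->index == log[0]->index + i. */
size_t toyFirstPosFrom(toyLogEntry *const *log, size_t count, uint64_t from) {
    if (count == 0 || from <= log[0]->index) return 0;
    uint64_t offset = from - log[0]->index;
    return offset >= count ? count : (size_t)offset;
}
```

The serialization loop would then start at the returned position instead of testing every entry against `from`.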

sushilpaneru1 · 2026-06-02T20:46:55Z

+Raft state is persisted in `nodes.conf`. The file has three sections:
+
+1. **Node lines** — the cluster state snapshot (nodes, slots, replication
+   topology) as of `lastApplied`.


I assume this will include the shard level epoch as well ?

Yes, we'll need to persist it in some way, either in the node lines or in the vars line.

I have an idea. I posted it on your PR. #3899 (comment)

Persist currentTerm, votedFor, and lastApplied in the vars line. Uncommitted log entries are appended to the end of nodes.conf as 'log <index> <term> <type> <data>' lines.

Full rewrite (atomic write+rename) is triggered on term/vote changes and when applying log entries that affect myself: SLOT_CHANGE (when target or source is myself), SET_REPLICA_OF (replica is myself), FAILOVER (promoted or demoted is myself), and NODE_JOIN (promoted from learner). This ensures the snapshot in nodes.conf is up to date for state that matters on restart. Other log entries are appended with a single write+fsync, batched per event loop cycle.

On load, the snapshot (node lines) represents state at lastApplied. Log entries in the tail are replayed into the in-memory log. The leader will update commit index via AppendEntries after reconnection.

Also start replication on load if the node is a replica, and stop reading nodes.conf if a line without trailing newline is encountered (indicates a crash during append).

Unskip tests that were blocked on persistence.

Adapt failover stress test for raft: use role check instead of epoch

Replace config epoch comparison with role check to detect failover. Skip the epoch post-condition assertion for cluster-raft.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

After loading nodes.conf, server.cluster->size was stuck at 0 because NODE_JOIN apply (which increments size) only runs for log tail entries, not for nodes already in the snapshot. Restore size in postLoad by counting loaded nodes without CLUSTER_NODE_MEET flag. Guard initLast to only set MEET flag and become singleton leader on fresh start (size == 0), not on restart. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
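
A minimal sketch of the size restoration described in this commit: count the loaded nodes that are not still in the MEET handshake. The flag value, `toyNode` and `toyCountJoinedNodes` are hypothetical stand-ins, not the definitions used in the source tree.

```c
#include <stddef.h>

/* Hypothetical flag standing in for CLUSTER_NODE_MEET. */
#define TOY_NODE_MEET (1 << 0)

typedef struct {
    int flags;
} toyNode;

/* Count nodes that have fully joined, i.e. do not carry the MEET flag.
 * This is the value the cluster size would be restored to after load. */
size_t toyCountJoinedNodes(const toyNode *nodes, size_t n) {
    size_t joined = 0;
    for (size_t i = 0; i < n; i++) {
        if (!(nodes[i].flags & TOY_NODE_MEET)) joined++;
    }
    return joined;
}
```

With this count available after load, a result of zero can be used to tell a fresh start from a restart.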

auxShardIdSetter() called clusterAddNodeToShard() without first removing the node from its previous shard. During MEET handshake, a node's shard-id can be updated multiple times (HI on the outbound link, then HELLO on the inbound link), causing it to accumulate in multiple shard dict entries. CLUSTER SHARDS then returns the node in a stale empty-slot shard instead of the correct one. Fix by early-returning when the shard-id is unchanged, and calling clusterRemoveNodeFromShard() before updating when it does change. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
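
The shape of this fix can be shown with a toy membership index. This is only a sketch under simplified assumptions: a flat array replaces the shard dict, node identity is an int, and all `toy*` names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define TOY_SHARD_LEN 8
#define TOY_MAX_MEMBERSHIPS 16

/* One (node, shard) pair; the array stands in for the shard dict. */
typedef struct {
    int node;
    char shard[TOY_SHARD_LEN];
} toyMembership;

typedef struct {
    toyMembership m[TOY_MAX_MEMBERSHIPS];
    int count;
} toyShardIndex;

static void toyRemove(toyShardIndex *ix, int node, const char *shard) {
    for (int i = 0; i < ix->count; i++) {
        if (ix->m[i].node == node && strcmp(ix->m[i].shard, shard) == 0) {
            ix->count--;
            ix->m[i] = ix->m[ix->count]; /* swap-remove */
            return;
        }
    }
}

static void toyAdd(toyShardIndex *ix, int node, const char *shard) {
    if (ix->count == TOY_MAX_MEMBERSHIPS) return;
    ix->m[ix->count].node = node;
    snprintf(ix->m[ix->count].shard, TOY_SHARD_LEN, "%s", shard);
    ix->count++;
}

/* Update a node's shard id: return early if unchanged, otherwise remove
 * the old membership before adding the new one. `node_shard` must point
 * to a buffer of TOY_SHARD_LEN bytes holding the current shard id. */
void toySetShardId(toyShardIndex *ix, int node, char *node_shard,
                   const char *new_shard) {
    if (strcmp(node_shard, new_shard) == 0) return;
    toyRemove(ix, node, node_shard);
    snprintf(node_shard, TOY_SHARD_LEN, "%s", new_shard);
    toyAdd(ix, node, new_shard);
}

/* Number of shards that currently list the node; should never exceed 1. */
int toyMembershipCount(const toyShardIndex *ix, int node) {
    int n = 0;
    for (int i = 0; i < ix->count; i++) {
        if (ix->m[i].node == node) n++;
    }
    return n;
}
```

Without the remove step, two consecutive shard-id updates would leave the node listed under both ids, which matches the symptom described in the commit message.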

Shard-id is managed by dedicated raft log entries (NODE_JOIN, SET_REPLICA_OF), not NODE_INFO. Including it in the NODE_INFO address string caused unnecessary log entries whenever SET_REPLICA_OF changed a node's shard-id, as the periodic divergence check would detect the shard-id difference and re-propose NODE_INFO. Add clusterNodeAppendAddressStringNoShardId() which omits the shard-id aux field, and use it when building and comparing NODE_INFO entries. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

A node must only grant its vote if the candidate's log is at least as up-to-date as its own (Raft §5.4.1). Without this check, a node with a stale log (e.g. after CLUSTER RESET HARD) could win an election and overwrite other nodes' logs via AppendEntries truncation. Compare the candidate's last log term and index against our own: grant the vote only if the candidate's last term is higher, or if terms are equal and the candidate's last index is >= ours.

Skip 'CLUSTER MYSHARDID reports same shard id after cluster restart' under raft: when R0-R7 restart while R8 stays running, R8 inflates its term with repeated failed elections, disrupting leader election. This needs pre-vote (Raft §9.6) to fix properly.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
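
The comparison rule in this commit message can be written as a small predicate. This is a sketch with a hypothetical name and signature, not the function added in the PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the candidate's log is at least as up-to-date as ours:
 * a higher last term wins; with equal last terms, the candidate's last
 * index must be >= ours. */
bool toyCandidateLogIsUpToDate(uint64_t cand_last_term, uint64_t cand_last_index,
                               uint64_t my_last_term, uint64_t my_last_index) {
    if (cand_last_term != my_last_term) return cand_last_term > my_last_term;
    return cand_last_index >= my_last_index;
}
```

For example, a node that was hard-reset has an empty log (last term 0, last index 0), so any peer with at least one entry refuses to vote for it.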

The cluster state is initialized to CLUSTER_FAIL. Without setting todo_update_slot_coverage during raft init, clusterRaftCheckSlotCoverage never runs after a restart, leaving cluster_state permanently at FAIL even though all slots are properly assigned from the loaded nodes.conf. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Every raftLogAppend call was followed by raftLogMarkDirty. Merge the persistence marking into raftLogAppend itself. On startup, postLoad clears todo_persist_log since loaded entries are already on disk. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Without periodic rewrites, nodes.conf grows unboundedly with appended log lines. Trigger a full rewrite (which removes applied entries from the tail) when 100 entries have been applied since the last rewrite. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
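
A sketch of the counter logic, assuming the threshold is the constant 100 mentioned above. `RAFT_LOG_REWRITE_THRESHOLD` and `todo_save_config` are names that appear in this PR; the struct and function names are hypothetical, and the real code may count and reset differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Name taken from a later commit in this PR; the value 100 is assumed
 * from the description. */
#define RAFT_LOG_REWRITE_THRESHOLD 100

typedef struct {
    uint64_t applied_since_rewrite;
    bool todo_save_config; /* request a full rewrite in beforeSleep */
} toyPersistState;

/* Called once per applied entry: request a full rewrite when the
 * threshold is reached. */
void toyOnEntryApplied(toyPersistState *s) {
    s->applied_since_rewrite++;
    if (s->applied_since_rewrite >= RAFT_LOG_REWRITE_THRESHOLD) {
        s->todo_save_config = true;
    }
}

/* Called after a full rewrite has completed: the tail is compacted,
 * so the counter starts over. */
void toyOnFullRewrite(toyPersistState *s) {
    s->applied_since_rewrite = 0;
    s->todo_save_config = false;
}
```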

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Send the success AE_ACK from beforeSleep after entries are persisted, rather than immediately in the AE handler. This makes the persist-before-ACK safety invariant explicit in the code instead of relying on event loop ordering (beforeSleep running before write handlers). The ACK reports current state (term, last_log_index) at send time, so it remains correct even if a leader change occurs between receiving AE and sending the deferred ACK. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
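
The ordering can be sketched with two todo bits that the AE handler sets and beforeSleep consumes in a fixed order. The flag and function names are hypothetical, and the persist and send steps are replaced by a trace so the order is observable.

```c
#include <stddef.h>

/* Hypothetical todo bits, in the spirit of the flags used in the PR. */
enum {
    TOY_TODO_PERSIST_LOG = 1 << 0,
    TOY_TODO_SEND_AE_ACK = 1 << 1
};

typedef struct {
    unsigned todo;
    char trace[16]; /* 'P' = log persisted, 'A' = AE_ACK sent */
    size_t trace_len;
} toyRaftState;

static void toyTrace(toyRaftState *rs, char ev) {
    if (rs->trace_len + 1 < sizeof(rs->trace)) {
        rs->trace[rs->trace_len++] = ev;
        rs->trace[rs->trace_len] = '\0';
    }
}

/* AE handler: append entries in memory and only mark work to be done. */
void toyOnAppendEntries(toyRaftState *rs) {
    rs->todo |= TOY_TODO_PERSIST_LOG | TOY_TODO_SEND_AE_ACK;
}

/* beforeSleep: persist first, then acknowledge. The ACK is never sent
 * ahead of the write because both steps run here in this order. */
void toyBeforeSleep(toyRaftState *rs) {
    if (rs->todo & TOY_TODO_PERSIST_LOG) {
        toyTrace(rs, 'P'); /* write + fsync would happen here */
        rs->todo &= ~(unsigned)TOY_TODO_PERSIST_LOG;
    }
    if (rs->todo & TOY_TODO_SEND_AE_ACK) {
        toyTrace(rs, 'A'); /* build the ACK from current state and send */
        rs->todo &= ~(unsigned)TOY_TODO_SEND_AE_ACK;
    }
}
```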

Raft nodes must remain active during RDB loading (full sync):
- Followers must persist entries and send AE_ACK, otherwise the leader may fail them over.
- The leader must continue sending AE heartbeats, otherwise followers will start elections.

Call clusterBeforeSleep() from the ProcessingEventsWhileBlocked path in beforeSleep so that raft persistence, deferred ACKs, and heartbeat broadcasting are handled during loading. Gossip's beforeSleep returns early during loading to preserve existing behavior. The slot migration cron in clusterBeforeSleep is also skipped during loading, preserving legacy behavior.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

murphyjacob4

Thanks @zuiderkwast for a super clean implementation (appreciate the use of todo bits to defer it all to beforeSleep, making it easy to reason about).

A couple of correctness-related things

my_last_committed_info starts as an empty string. The periodic NODE_INFO divergence check compares against it after 10 seconds, always finding a mismatch and proposing a redundant NODE_INFO entry. Initialize it when our own NODE_JOIN is applied, since at that point our address and flags are known. Extract the NODE_INFO data string construction into clusterRaftBuildMyNodeInfo() to avoid duplication across the three call sites. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

The Raft paper requires votedFor to be on stable storage before responding to a vote request. Otherwise a crash after responding but before persisting could allow voting for a different candidate in the same term after restart, breaking the single-vote-per-term invariant. Defer the granted vote response to beforeSleep, after the full config rewrite (triggered by todo_save_config) persists votedFor. Denial responses are sent immediately since they don't change persisted state. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

When starting an election, the candidate increments its term and votes for itself. These must be on stable storage before sending RequestVote, otherwise a crash could allow the node to vote for a different candidate in the same term after restart (double-voting). Defer broadcasting RequestVote to beforeSleep, after todo_save_config persists the new term and votedFor. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
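
The vote-side rule from the two commits above (persist votedFor and the term before any message that depends on them leaves the node) can be sketched as follows. All `toy*` names are hypothetical, the log up-to-date check is left out for brevity, and counters stand in for the actual network sends.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NO_VOTE (-1)

typedef struct {
    uint64_t current_term;
    int voted_for;
    bool todo_save_config; /* full rewrite pending */
    int pending_grant_to;  /* candidate waiting for a deferred grant */
    int grants_sent;
    int denials_sent;
} toyVoter;

/* RequestVote handler: denials go out immediately, grants are deferred
 * until the new votedFor has been written to disk. */
void toyOnRequestVote(toyVoter *v, uint64_t term, int candidate) {
    if (term > v->current_term) {
        v->current_term = term;
        v->voted_for = TOY_NO_VOTE;
        v->todo_save_config = true;
    }
    bool grant = term == v->current_term &&
                 (v->voted_for == TOY_NO_VOTE || v->voted_for == candidate);
    if (!grant) {
        v->denials_sent++;
        return;
    }
    v->voted_for = candidate;
    v->todo_save_config = true;
    v->pending_grant_to = candidate;
}

/* beforeSleep: persist term and votedFor, then send the deferred grant. */
void toyVoterBeforeSleep(toyVoter *v) {
    if (v->todo_save_config) {
        /* the full config rewrite would happen here */
        v->todo_save_config = false;
    }
    if (v->pending_grant_to != TOY_NO_VOTE) {
        v->grants_sent++;
        v->pending_grant_to = TOY_NO_VOTE;
    }
}
```

If the node crashes between the handler and beforeSleep, no grant has left the node, so voting for a different candidate after restart does not break the one-vote-per-term rule.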

When a log conflict is detected and entries are truncated, those entries may already be persisted on disk. A full config rewrite is needed to remove them, otherwise they would be replayed on restart. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Previously, a singleton leader (quorum=1) committed and applied entries inline in clusterRaftPropose, before they were persisted. This meant side effects (CLUSTER MEET unblock, broadcast AE to new peer) could fire before the entry was on stable storage. Move singleton commit to beforeSleep, after the persist step. This ensures all entries are durable before being applied, and all apply side effects (unblock meets, broadcast AE) happen at the right time without needing separate deferral flags for each. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Write a compacted nodes.conf at shutdown. Without this, the node restarts with committed entries in the log tail that it cannot apply until the leader sends AE with the updated commit index. With the rewrite, lastApplied is up to date and the log tail only contains truly uncommitted entries (if any). Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

The trailing newline check could break gossip users whose editors strip final newlines. Since it only protects raft log lines from crash-during-append, and we plan to add per-line checksums as the proper solution, remove the check for now. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

github-actions Bot assigned zuiderkwast Jun 1, 2026

hpatro reviewed Jun 2, 2026

sushilpaneru1 reviewed Jun 2, 2026

This was referenced Jun 2, 2026

Cluster bus v2 - Added shard level epoch and proposal pre-validation #3899

Open

Raft Cluster: fix CLUSTER SHARDS visibility and replica health reporting #3910

Merged

zuiderkwast added 10 commits June 4, 2026 10:32

Update design doc: Raft log and state persistence in nodes.conf

bc35258

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Design doc: add checksum future work item for log persistence

415c420

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast force-pushed the raft-persist-log branch from 7c9f76a to 415c420 Compare June 6, 2026 23:19

zuiderkwast added 2 commits June 7, 2026 02:09

zuiderkwast marked this pull request as ready for review June 8, 2026 08:47

zuiderkwast requested a review from murphyjacob4 June 8, 2026 08:58

murphyjacob4 reviewed Jun 8, 2026

Comment thread design-docs/cluster-raft.md

Comment thread src/cluster_nodes.c Outdated

Comment thread src/cluster_nodes.c

Comment thread src/cluster_raft.c Outdated

Comment thread src/cluster_raft.c

Comment thread src/cluster_raft.c

murphyjacob4 reviewed Jun 8, 2026

Comment thread src/cluster_raft.c Outdated

Comment thread src/cluster_raft.c Outdated

Comment thread src/cluster_raft.c

Comment thread src/cluster_raft.c

zuiderkwast added 6 commits June 9, 2026 12:08

Raft: define RAFT_LOG_REWRITE_THRESHOLD for periodic rewrite

c64fbc4

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast added 2 commits June 9, 2026 17:20

This was referenced Jun 9, 2026

Raft Cluster: Crash Recovery (Restart and Rejoin) #3859

Open

Raft Cluster: Checksums for WAL integrity and torn write detection #3951

Closed

murphyjacob4 approved these changes Jun 9, 2026

zuiderkwast merged commit ccd1505 into valkey-io:cluster-v2 Jun 10, 2026
27 of 28 checks passed

zuiderkwast linked an issue Jun 10, 2026 that may be closed by this pull request

Raft Cluster: Persistence (State, Current Term, VotedFor, Log) #3857

Closed

zuiderkwast deleted the raft-persist-log branch June 10, 2026 09:01

This was referenced Jun 10, 2026

Raft Cluster: Checksum for log lines in nodes.conf #3960

Closed

Raft Cluster: Implement stale leader step-down after quorum loss #3916

Merged
