Skip to content

Cluster Raft: Persist state and WAL in nodes.conf#3887

Merged
zuiderkwast merged 20 commits into
valkey-io:cluster-v2from
zuiderkwast:raft-persist-log
Jun 10, 2026
Merged

Cluster Raft: Persist state and WAL in nodes.conf#3887
zuiderkwast merged 20 commits into
valkey-io:cluster-v2from
zuiderkwast:raft-persist-log

Conversation

@zuiderkwast

@zuiderkwast zuiderkwast commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Persist Raft state to disk so nodes can recover after restart without losing cluster membership or committed state.

More details on persistence below.

Additional changes, found when testing restarts (which need persistence but also other things):

  1. Fix node appearing in multiple shards — shard-id changes left stale entries
  2. Exclude shard-id from NODE_INFO — shard-id changes should go through SET_REPLICA_OF, not NODE_INFO
  3. Log completeness check in RequestVote — Raft safety: don't vote for candidates with less complete logs
  4. Evaluate slot coverage on startup — so cluster_state reaches "ok" after restart without waiting for a new commit
  5. Run Raft cluster's beforeSleep during RDB loading (ProcessingEventsWhileBlocked) - followers need to persist and ack entries and heartbeats.

What's persisted

  • Node lines: each node and its name, slots, etc. just like in legacy and CLUSTER NODES - this is the snapshot
  • Vars line: currentTerm, votedFor, lastApplied (added to existing vars section)
  • Log lines: Uncommitted entries appended as log at the end of nodes.conf

Write strategy

  • Full rewrite (atomic write → temp file → rename → fsync): triggered on term/vote changes and when an applied entry affects myself (SLOT_CHANGE, SET_REPLICA_OF,
    FAILOVER, NODE_JOIN promotion). This keeps the node-lines snapshot current for state that matters on restart.
  • Append-only (write + fsync, batched per event loop): for new log entries from AE or local proposals. The fsync completes in beforeSleep before AE_ACK is added to the send buffer, satisfying Raft's persist-before-acknowledge requirement.

When 100 entries have been appended without a full rewrite, we trigger a full rewrite.

Startup

  1. Node lines → cluster state snapshot restored
  2. Vars → currentTerm, votedFor, lastApplied restored; commitIndex set to lastApplied
  3. Log tail → entries replayed into in-memory log
  4. Leader sends AE after reconnection → commitIndex advances → entries applied

Incomplete lines (no trailing newline) are discarded as crash artifacts.

Tests

  • Unskipped persistence-dependent tests: availability-zone, cluster-shards, failover, hostnames, resharding
  • Adapted failover stress test for raft (role check instead of epoch comparison)

Closes #3857

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d2054f2e-4735-4467-b9e5-000a856e91e2

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 72.82609% with 50 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.46%. Comparing base (2604112) to head (e702a5e).

Files with missing lines Patch % Lines
src/cluster_raft.c 71.42% 46 Missing ⚠️
src/cluster_nodes.c 75.00% 4 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff               @@
##           cluster-v2    #3887      +/-   ##
==============================================
- Coverage       76.48%   76.46%   -0.03%     
==============================================
  Files             166      166              
  Lines           82605    82735     +130     
==============================================
+ Hits            63182    63260      +78     
- Misses          19423    19475      +52     
Files with missing lines Coverage Δ
src/cluster.c 91.05% <100.00%> (ø)
src/cluster_legacy.c 91.61% <100.00%> (-0.30%) ⬇️
src/server.c 89.56% <100.00%> (+0.05%) ⬆️
src/cluster_nodes.c 75.85% <75.00%> (-0.76%) ⬇️
src/cluster_raft.c 63.03% <71.42%> (+1.76%) ⬆️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c
sds buf = sdsempty();
for (uint64_t i = 0; i < rs->log_count; i++) {
raftLogEntry *e = rs->log[i];
if (e->index < from) continue;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some form of indexing would be helpful to skip faster?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope with log trimming in the future (#3858), the log should never be very long, so hopefully we don't need to optimize this. We can do it later if we see that it's needed.

Comment thread src/cluster_raft.c
Raft state is persisted in `nodes.conf`. The file has three sections:

1. **Node lines** — the cluster state snapshot (nodes, slots, replication
topology) as of `lastApplied`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this will include the shard level epoch as well ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we'll need to persist it in some way, either in the node lines or in the vars line.

I have an idea. I posted it on your PR. #3899 (comment)

Comment thread design-docs/cluster-raft.md Outdated
Comment thread design-docs/cluster-raft.md
Comment thread design-docs/cluster-raft.md Outdated
Comment thread design-docs/cluster-raft.md
Persist currentTerm, votedFor, and lastApplied in the vars line.
Uncommitted log entries are appended to the end of nodes.conf as
'log <index> <term> <type> <data>' lines.

Full rewrite (atomic write+rename) is triggered on term/vote changes
and when applying log entries that affect myself: SLOT_CHANGE (when
target or source is myself), SET_REPLICA_OF (replica is myself),
FAILOVER (promoted or demoted is myself), and NODE_JOIN (promoted from
learner). This ensures the snapshot in nodes.conf is up to date for
state that matters on restart.

Other log entries are appended with a single write+fsync, batched per
event loop cycle.

On load, the snapshot (node lines) represents state at lastApplied.
Log entries in the tail are replayed into the in-memory log. The
leader will update commit index via AppendEntries after reconnection.

Also start replication on load if the node is a replica, and stop
reading nodes.conf if a line without trailing newline is encountered
(indicates a crash during append).

Unskip tests that were blocked on persistence.

Adapt failover stress test for raft: use role check instead of epoch
Replace config epoch comparison with role check to detect failover.
Skip the epoch post-condition assertion for cluster-raft.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
After loading nodes.conf, server.cluster->size was stuck at 0 because
NODE_JOIN apply (which increments size) only runs for log tail entries,
not for nodes already in the snapshot.

Restore size in postLoad by counting loaded nodes without CLUSTER_NODE_MEET
flag. Guard initLast to only set MEET flag and become singleton leader on
fresh start (size == 0), not on restart.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
auxShardIdSetter() called clusterAddNodeToShard() without first
removing the node from its previous shard. During MEET handshake,
a node's shard-id can be updated multiple times (HI on the outbound
link, then HELLO on the inbound link), causing it to accumulate in
multiple shard dict entries. CLUSTER SHARDS then returns the node
in a stale empty-slot shard instead of the correct one.

Fix by early-returning when the shard-id is unchanged, and calling
clusterRemoveNodeFromShard() before updating when it does change.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Shard-id is managed by dedicated raft log entries (NODE_JOIN,
SET_REPLICA_OF), not NODE_INFO. Including it in the NODE_INFO
address string caused unnecessary log entries whenever SET_REPLICA_OF
changed a node's shard-id, as the periodic divergence check would
detect the shard-id difference and re-propose NODE_INFO.

Add clusterNodeAppendAddressStringNoShardId() which omits the
shard-id aux field, and use it when building and comparing NODE_INFO
entries.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
A node must only grant its vote if the candidate's log is at least
as up-to-date as its own (Raft §5.4.1). Without this check, a node
with a stale log (e.g. after CLUSTER RESET HARD) could win an
election and overwrite other nodes' logs via AppendEntries truncation.

Compare the candidate's last log term and index against our own:
grant the vote only if the candidate's last term is higher, or if
terms are equal and the candidate's last index is >= ours.

Skip 'CLUSTER MYSHARDID reports same shard id after cluster restart'
under raft: when R0-R7 restart while R8 stays running, R8 inflates
its term with repeated failed elections, disrupting leader election.
This needs pre-vote (Raft §9.6) to fix properly.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The cluster state is initialized to CLUSTER_FAIL. Without setting
todo_update_slot_coverage during raft init, clusterRaftCheckSlotCoverage
never runs after a restart, leaving cluster_state permanently at FAIL
even though all slots are properly assigned from the loaded nodes.conf.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Every raftLogAppend call was followed by raftLogMarkDirty. Merge the
persistence marking into raftLogAppend itself. On startup, postLoad
clears todo_persist_log since loaded entries are already on disk.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Without periodic rewrites, nodes.conf grows unboundedly with appended
log lines. Trigger a full rewrite (which removes applied entries from
the tail) when 100 entries have been applied since the last rewrite.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Send the success AE_ACK from beforeSleep after entries are persisted,
rather than immediately in the AE handler. This makes the persist-
before-ACK safety invariant explicit in the code instead of relying
on event loop ordering (beforeSleep running before write handlers).

The ACK reports current state (term, last_log_index) at send time,
so it remains correct even if a leader change occurs between receiving
AE and sending the deferred ACK.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Raft nodes must remain active during RDB loading (full sync):
- Followers must persist entries and send AE_ACK, otherwise the leader
  may fail them over.
- The leader must continue sending AE heartbeats, otherwise followers
  will start elections.

Call clusterBeforeSleep() from the ProcessingEventsWhileBlocked path
in beforeSleep so that raft persistence, deferred ACKs, and heartbeat
broadcasting are handled during loading.

Gossip's beforeSleep returns early during loading to preserve existing
behavior. The slot migration cron in clusterBeforeSleep is also skipped
during loading, preserving legacy behavior.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast marked this pull request as ready for review June 8, 2026 08:47
@zuiderkwast zuiderkwast requested a review from murphyjacob4 June 8, 2026 08:58
Comment thread design-docs/cluster-raft.md
Comment thread src/cluster_nodes.c Outdated
Comment thread src/cluster_nodes.c
Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c
Comment thread src/cluster_raft.c

@murphyjacob4 murphyjacob4 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zuiderkwast for a super clean implementation (appreciate the use of todo bits to defer it all to beforeSleep, making it easy to rationalize about).

A couple correctness related things

Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c Outdated
Comment thread src/cluster_raft.c
Comment thread src/cluster_raft.c
my_last_committed_info starts as an empty string. The periodic
NODE_INFO divergence check compares against it after 10 seconds,
always finding a mismatch and proposing a redundant NODE_INFO entry.

Initialize it when our own NODE_JOIN is applied, since at that point
our address and flags are known. Extract the NODE_INFO data string
construction into clusterRaftBuildMyNodeInfo() to avoid duplication
across the three call sites.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The Raft paper requires votedFor to be on stable storage before
responding to a vote request. Otherwise a crash after responding but
before persisting could allow voting for a different candidate in the
same term after restart, breaking the single-vote-per-term invariant.

Defer the granted vote response to beforeSleep, after the full config
rewrite (triggered by todo_save_config) persists votedFor. Denial
responses are sent immediately since they don't change persisted state.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When starting an election, the candidate increments its term and votes
for itself. These must be on stable storage before sending RequestVote,
otherwise a crash could allow the node to vote for a different candidate
in the same term after restart (double-voting).

Defer broadcasting RequestVote to beforeSleep, after todo_save_config
persists the new term and votedFor.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
When a log conflict is detected and entries are truncated, those
entries may already be persisted on disk. A full config rewrite is
needed to remove them, otherwise they would be replayed on restart.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Previously, a singleton leader (quorum=1) committed and applied entries
inline in clusterRaftPropose, before they were persisted. This meant
side effects (CLUSTER MEET unblock, broadcast AE to new peer) could
fire before the entry was on stable storage.

Move singleton commit to beforeSleep, after the persist step. This
ensures all entries are durable before being applied, and all apply
side effects (unblock meets, broadcast AE) happen at the right time
without needing separate deferral flags for each.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Write a compacted nodes.conf at shutdown. Without this, the node
restarts with committed entries in the log tail that it cannot apply
until the leader sends AE with the updated commit index. With the
rewrite, lastApplied is up to date and the log tail only contains
truly uncommitted entries (if any).

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
The trailing newline check could break gossip users whose editors
strip final newlines. Since it only protects raft log lines from
crash-during-append, and we plan to add per-line checksums as the
proper solution, remove the check for now.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast merged commit ccd1505 into valkey-io:cluster-v2 Jun 10, 2026
27 of 28 checks passed
@zuiderkwast zuiderkwast linked an issue Jun 10, 2026 that may be closed by this pull request
@zuiderkwast zuiderkwast deleted the raft-persist-log branch June 10, 2026 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Raft Cluster: Persistence (State, Current Term, VotedFor, Log)

4 participants