[Draft] Durability RFC by hpatro · Pull Request #33 · valkey-io/valkey-rfc

hpatro · 2025-10-27T21:53:13Z

Here is a high level design document for Valkey. It's still in draft phase. So, we should expect changes to the file over the course of time. It takes a subset of requirements ffrom #29 and outlines the structure.

Here's the markdown rendered view of the document: https://github.com/hpatro/valkey-rfc/blob/durability_hld/Durability.md

While folks are reviewing and adding comments. I'm planning to look at the low level details for.

Finalize the API contract and data format for logging control path and data path.
Finalize the additional metadata for snapshots.
Performance impact (latency)

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

jjuleslasarte · 2025-10-28T16:16:15Z

+
+## Problem Statement
+
+Valkey is an in-memory data store designed for high performance. Valkey currently uses an asynchronous replication model for high throughput and low latency, which means writes acknowledged to clients may be lost if replicas lag, crash, or are promoted while out-of-sync. Because all data resides in memory by default, process restarts or crashes risk losing recent writes. Although periodic snapshots (RDB) and append-only logs (AOF) provide persistence, they entail trade-offs: snapshots may omit recent writes and AOF only supports single-node durability. In cluster mode, data loss remains possible when a replica is promoted prematurely. Shifting durability responsibility to clients (via WAIT) is not intuitive and places burden on the application layer.


it also i think makes durability hard to reason about and debug - which data is supposed to be durable? which one is ok to get lost?

WAIT only guarantees durability if you wait for all replicas.

Even if you have exactly two replicas and use WAIT 2 in the applications, during a rolling upgrade you may add one or more extra replicas and if the primary fails during this time, there's a chance one of the replicas which didn't ack the latest writes can win the failover.

So in practice, WAIT doesn't provide any durability guarantees. This is the main problem with WAIT, IMO.

jjuleslasarte · 2025-10-28T16:17:37Z

+
+## Problem Statement
+
+Valkey is an in-memory data store designed for high performance. Valkey currently uses an asynchronous replication model for high throughput and low latency, which means writes acknowledged to clients may be lost if replicas lag, crash, or are promoted while out-of-sync. Because all data resides in memory by default, process restarts or crashes risk losing recent writes. Although periodic snapshots (RDB) and append-only logs (AOF) provide persistence, they entail trade-offs: snapshots may omit recent writes and AOF only supports single-node durability. In cluster mode, data loss remains possible when a replica is promoted prematurely. Shifting durability responsibility to clients (via WAIT) is not intuitive and places burden on the application layer.


nit: shifting of durability responsibility seems more of a motivation for this solution vs. part of the problem statement (AOF doesn't provide durability guarantees as it does not prevent data loss seems to be the key one)

AOF only provides the guarantee if the append-fsync is set to always.

Even then, if you're running in sentinel a replica with an older dataset could be promoted, leading to data loss, right?

You cover this. Ignore me.

jjuleslasarte · 2025-10-28T16:22:45Z

+
+Upon receiving a write:
+
+1. The leader applies the command to its local in-memory state immediately, ensuring fast execution.


nit: since we don't ack to the client until we achieve quorum, i am not sure "fast execution" matters as a distinction here, is there a reason this is important?

It should ensure high throughput, which is an important consideration.

jjuleslasarte · 2025-10-28T16:24:14Z

+Blocking occurs only at the primary. Replica nodes apply entries passively as they are received and do not block client operations directly.
+Read operations and other non-durable commands are not affected by blocking behavior.
+
+> Note:


i find this a bit confusing, the summary above talks of both modes but the detail says only one mode will be supported on v1?

jjuleslasarte · 2025-10-28T16:26:21Z

+
+Open Questions:
+
+* Should a mechanism be introduced to temporarily block all new write traffic once the replication buffer limit is reached, effectively stalling writes until the backlog is cleared? This could prevent unbounded memory growth and avoid triggering failover operations caused by memory exhaustion. 


the failover operations would have the side effect of possibly causing data loss, right? in this case, i do think this mechanism would be a good opt in (if clients care about durability, by enabling they are already taking a hit on performance, so this seems like a tradeoff many would take). Not sure of the complexity of this though. Naively it doesn't seem that big of a lift.

jjuleslasarte · 2025-10-28T16:27:30Z

+1. Opt to blocking of read operation on modified key(s).
+2. If the durable operation fails, there is no dirty data to cleanup on the primary and the operation can be retried.
+
+#### Con(s)


some others from the other writeup we are chatting about are:

max memory enforcement

eviction

ttl

jjuleslasarte · 2025-10-28T16:28:41Z

+#### Con(s)
+
+1. Commands with random operation like `SPOP` needs to be made deterministic across all the nodes in a shard.
+2. Lua scripts cannot be safely supported under write-ahead logging because their read and write sets are only known during execution. Since WAL requires the command to be logged before execution, it cannot capture the dynamic side effects or conditional logic inside a script.


an alternative is to attempt to make them deterministic but redis tried (and "failed") to do that in versions prior to 5 (i think? 5 ish)

jjuleslasarte · 2025-10-28T16:30:30Z

+
+#### Log Compaction ([UML Code](./assets/UML.md#uml-code-for-log-compaction))
+
+Log compaction is the process of truncating old, committed log entries that are no longer required for recovery, replacing them with a compact snapshot of the current state. This ensures that the replicated log remains bounded in size, preventing unbounded memory growth and reducing catch-up time for lagging replicas.


is the plan to offer a configuratble threshold and handle this in a timed event or to let clients snapshot and always use that to compact?

jjuleslasarte · 2025-10-28T16:32:06Z

+
+>  Note:
+>
+> * We should start with same thread for end-to-end operation for P0 and then move to further improvements.


agree - it will be easier to reason about and test , and i dont think it has many drawbacks implementation wise at least?

jjuleslasarte · 2025-10-28T16:33:11Z

+
+![Log Compaction](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/LogCompaction.svg)
+
+### Logging embedded within Valkey


is the alternative to offer this as a module?

sarthakaggarwal97 · 2025-10-28T17:28:34Z

+
+The durability specification introduces a **RAFT-based state machine replication** model at the shard level, ensuring deterministic leadership, consistent log replication, and predictable recovery across Valkey cluster. Each shard operates as an independent group consisting of one leader (primary) and multiple followers (replicas).
+All write operations flow through the primary and are durably replicated before being acknowledged.
+Replication and election are handled at shard granularity, ensuring fault isolation and predictable recovery. A write is considered durable once it has been replicated to a quorum of nodes and safely committed in the leader’s log. Acknowledged writes are guaranteed to survive any single-node or minority failure.


This looks to be an overlap with the current strategy where the quorum of all the primary elects a primary of the shard. Would both of the strategies co exist? Or do we want to eventually phase out the quorum strategy? I am guessing we will still end up gossiping about the cluster state across nodes.

You would have to use one or the other. Rolling upgrade / mixed cluster is a challenge that we might attempt to solve later. It would be nice to have but it's not strictly required.

Btw, under Requirements above, this is only for standalone mode to start with. Cluster mode is under Planned for future.

sarthakaggarwal97 · 2025-10-28T17:32:04Z

+
+1. The leader applies the command to its local in-memory state immediately, ensuring fast execution.
+2. The corresponding log entry is appended to the RAFT log and replicated asynchronously to follower nodes.
+3. The client response is blocked until quorum acknowledgment confirms that the entry is durably replicated and committed.


We want to wait for all the replication and operations to complete, do we foresee any issues on the client side with this kind of waiting? Did we consider an option to return an ack and use notifications to inform the client when the data is sufficiently replicated?

Okay this is answered in the following blocking section

sarthakaggarwal97 · 2025-10-28T17:36:04Z

+Blocking determines when a client receives acknowledgment for an operation relative to its durability state.
+While the leader applies a command immediately to its in-memory state, the client response is held until the entry is confirmed by the configured durability mode. This ensures that acknowledgment semantics always reflect the intended durability guarantees. The client remains blocked until a quorum of replicas (**synchronous mode**) acknowledges the log entry, guaranteeing durability against minority or single-node failures.
+
+Blocking occurs only at the primary. Replica nodes apply entries passively as they are received and do not block client operations directly.


I might be wrong, but I guess this is where the problem of dirty read can arise since we are not blocking the replicas to respond to GET type commands while the SET type command is being processed.

I have seen in other databases that the values are sometimes associated with the version number, which helps give the client information if the read is of the latest version or not.

We might implement it in this way as an incremental step, but in the long term, I don't think read-from-replicas should be read-uncommitted either. I believe it can be a pitfall for users. The replica too should either use blocking on the replica too or a WAL on the incoming replication stream, which is much easier when all writes are coming from the primary. No need for MVCC or anything like that. The replica can just delay the incoming replication stream and apply them to the data set only when they're committed by the primary.

Am I understanding the write-behind logic correctly in this scenario, specifically the need for the rollback in point 6?

Initial conditions, 5 nodes, node 0 wins first leader election

Node 0 receives a write request and executes it immediately before logging (N0')

There is a network partition such that Node 0 (leader) and Node 1 (follower) are in a minority partition

Node 0 log is replicated to node 1 (N1')

Timeout happens and leader election occurs, the majority partition elects a new leader (here node 2)

Node 2 starts handling writes and replicating its log

When the network recovers, not only will node 0 have to step down (per raft algorithm) but both node 0 and node 1 must be able to roll back their state

In such a scenario, I believe N0 and N1 will need to do a full resynchronization against N2. This is actually already a capability of Valkey since we track offsets and replication IDs:

N0 has replication ID "repl_id_1" (for sake of example) and offset 100

N0 is partitioned, gets a write, and now is at offset 101. N1 also gets the write and is at offset 101. But the write is not acked or committed (not a majority ack yet)

N2 is promoted. It sets secondary replication ID to "repl_id_1" with offset up to 100 (last acked write) and creates a new replication ID "repl_id_2". A majority of followers agree this is the last acknowledged write and allow the promotion.

N0 and N1 recover from partition. Before serving, they must learn of the new leader. When learning of the leader, they recognize they have diverged (offset 101 vs maximum offset 100 on "repl_id_1") and would full resync with N2. During this time those nodes would not serve reads (-LOADING)

Rollback support is more of an optimization, where step 4 could "partial sync" to repl_id_2 by undoing the write at offset 101 and psyncing from offset 100.

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

zuiderkwast · 2025-10-30T14:05:17Z

+While the leader applies a command immediately to its in-memory state, the client response is held until the entry is confirmed by the configured durability mode. This ensures that acknowledgment semantics always reflect the intended durability guarantees. The client remains blocked until a quorum of replicas (**synchronous mode**) acknowledges the log entry, guaranteeing durability against minority or single-node failures.
+
+Blocking occurs only at the primary. Replica nodes apply entries passively as they are received and do not block client operations directly.
+Read operations and other non-durable commands are not affected by blocking behavior.


Read operations need to be blocked too, if the key has been modified by another client and not yet been durably replicated.

zuiderkwast · 2025-10-30T14:05:33Z

+> We could make the blocking mode configurable in the future and introduce two additional modes:
+>
+> * **Asynchronous mode:** The client receives acknowledgment immediately after the leader applies the command locally. This mode prioritizes latency over durability.
+> * **Semi-synchronous mode:** The client is unblocked once at least one replica acknowledges the log entry, providing a balance between durability and performance.


Interesting idea, but it seems not very useful to me. You don't get the durability guarantee and also not the low latency. (And you can already achieve this with WAIT 1.)

What's more important here are the guarantees. If you don't need the durability guarantee, you'd use the async mode AKA read uncommitted. I think we should name this after the isolation levels and use commands like CLIENT READ UNCOMMITTED vs CLIENT READ COMMITTED for the client to choose the mode.

zuiderkwast · 2025-10-30T14:21:08Z

+#### Cons(s):
+
+1. Requires building synchronous replication or quorum based replication system.
+2. More overlap of asynchronous replication and synchronous replication system.


I think the overlap between the async and sync replication is a huge benefit. We have 90% of the sync replication already implemented, including that we have log compaction ( = backlog trimming and full sync using RDB snapshot).

Notice: Sync replication is just async replication with acks from the replicas.

With a shared replication stream, we can support sync replication to followers that are part of the quorum AND async replication to replicas that aren't part of the quorum, for example replicas that have enabled the cluster-replica-no-failover config, with a single shared replication stream.

zuiderkwast · 2025-10-30T14:48:38Z

+>
+> * How to discover peer nodes? 
+>   * We can't reuse the `REPLICAOF` mechanism as we intend to piggyback on RAFT's consensus model for leader election.
+>   * Hence, new nodes can discover peers via the `shard-nodes` config which is a list of ip:port address.


If we add it as a config shard-nodes, this config is modified when more nodes join the raft cluster and users need to use CONFIG REWRITE to update the config on disk? What happens if the nodes restart and the quorum has changed? Multiple nodes can have been added and others removed.

This information is similar to what we store in nodes.conf in cluster mode. If we store it on disk at all, then we should update it whenever the nodes and the quorum changes.

Another possibility (my preference) is that we don't start all nodes as followers. We
can start each node as an empty primary, just as we do today. We can use REPLICAOF on the other nodes to add them as replicas. This doesn't necessarily make them part of the quorum though, but it is a way to let the primary discover them.

Normally, the primary adds a node to the quorum by replicating a change and this is acked by the existing quorum. The problem is that before there is any quorum at all, there is no way to commit any changes, but we can step away from this just once in the boostrapping phase. The primary can just decide to add them to the quorum and replicate the topology change as committed.

zuiderkwast · 2025-10-30T14:53:18Z

+>   * Hence, new nodes can discover peers via the `shard-nodes` config which is a list of ip:port address.
+> * When to start the initialization phase? - To be added
+
+### Node addition([UML Code](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/UML.md#uml-code-for-node-addition))


CLUSTER ADDNODE, CLUSTER REMOVENODE etc.

We already have CLUSTER REPLICAOF and CLUSTER FORGET to add and remove nodes in cluster mode. I can't see why we don't reuse those.

Btw, this RFC seems to be about standalone mode still. (Cluster mode is future.)

hpatro · 2025-10-30T15:39:54Z

I need to add few sections (offline feedback from Yair):

Impact on modules.
Rebuilding in-memory view after failover due to dirty view.
Compound data structure problem

hpatro · 2025-10-30T15:44:45Z

Thanks @jjuleslasarte / @zuiderkwast / @sarthakaggarwal97. Taking a look at them.

zuiderkwast · 2025-10-30T17:00:51Z

+
+![Node Addition](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/NodeAddition.png)
+
+### Primary removal ([UML Code](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/UML.md#uml-code-for-primary-removal))


Regarding primary removal, please review section 3.10 Leadership transfer extension in Ongaro's PhD: https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf

It is essentially the same coordinated failover algorithm that we already have in Valkey. It would be a pity to ignore it, only to switch to something inferior.

The leader pauses writes.

The leader waits for the target server to catch up the log index (or replication offset).

The leader sends a TimeoutNow request to the target server. This request has the same
effect as the target server’s election timer firing: the target server starts a new election (incrementing its term and becoming a candidate).

murphyjacob4 · 2025-11-06T19:37:36Z

+
+![Lifecycle of a write command](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/WriteCommand.svg)
+
+### Lifecycle of a read command ([UML Code](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/UML.md#uml-code-for-lifecycle-of-a-read-command))


Is preventing stale reads on the primary a goal? If we want read-after-write consistency, then this is needed.

E.g:

Node A is primary, Node B and C are replicas

Node A Is partitioned from Node B and C

Node B detects this and promotes itself with C's permission, but Node A hasn't discovered it

Node B accepts some writes, and gets quorum from C

Node A gets a read request, and it still thinks it's primary, so it serves it

If 4 and 5 were the same application, this is a failure of read-after-write consistency. To solve this, I think there are two options in Raft:

Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected).
Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader
could rely on the heartbeat mechanism to provide a form
of lease [9], but this would rely on timing for safety (it
assumes bounded clock skew).

https://tikv.org/blog/lease-read/

Add initial draft

eb0c56c

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

jjuleslasarte reviewed Oct 28, 2025

View reviewed changes

sarthakaggarwal97 reviewed Oct 28, 2025

View reviewed changes

hpatro added 2 commits October 28, 2025 15:11

Update lifecycle of a read command

c15319d

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

Fix image link

9f7fa74

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

zuiderkwast reviewed Oct 30, 2025

View reviewed changes

murphyjacob4 reviewed Nov 6, 2025

View reviewed changes

jjuleslasarte mentioned this pull request Nov 24, 2025

[POC] Key blocking for dirty writes valkey-io/valkey#2865

Closed

hpatro mentioned this pull request Jan 6, 2026

Raft based replication valkey-io/valkey#3019

Open

12 tasks

jjuleslasarte mentioned this pull request Jan 9, 2026

[Durability] Implement Synchronous replication valkey-io/valkey#3037

Open


		## Problem Statement

		Valkey is an in-memory data store designed for high performance. Valkey currently uses an asynchronous replication model for high throughput and low latency, which means writes acknowledged to clients may be lost if replicas lag, crash, or are promoted while out-of-sync. Because all data resides in memory by default, process restarts or crashes risk losing recent writes. Although periodic snapshots (RDB) and append-only logs (AOF) provide persistence, they entail trade-offs: snapshots may omit recent writes and AOF only supports single-node durability. In cluster mode, data loss remains possible when a replica is promoted prematurely. Shifting durability responsibility to clients (via WAIT) is not intuitive and places burden on the application layer.


		Upon receiving a write:

		1. The leader applies the command to its local in-memory state immediately, ensuring fast execution.


		Open Questions:

		* Should a mechanism be introduced to temporarily block all new write traffic once the replication buffer limit is reached, effectively stalling writes until the backlog is cleared? This could prevent unbounded memory growth and avoid triggering failover operations caused by memory exhaustion.


		#### Log Compaction ([UML Code](./assets/UML.md#uml-code-for-log-compaction))

		Log compaction is the process of truncating old, committed log entries that are no longer required for recovery, replacing them with a compact snapshot of the current state. This ensures that the replicated log remains bounded in size, preventing unbounded memory growth and reducing catch-up time for lagging replicas.


		![Log Compaction](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/LogCompaction.svg)

		### Logging embedded within Valkey


		![Node Addition](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/NodeAddition.png)

		### Primary removal ([UML Code](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/UML.md#uml-code-for-primary-removal))


		![Lifecycle of a write command](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/WriteCommand.svg)

		### Lifecycle of a read command ([UML Code](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/UML.md#uml-code-for-lifecycle-of-a-read-command))

Conversation

hpatro commented Oct 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zuiderkwast Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zuiderkwast Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hpatro commented Oct 30, 2025

Uh oh!

hpatro commented Oct 30, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

hpatro commented Oct 27, 2025 •

edited

Loading

zuiderkwast Oct 30, 2025 •

edited

Loading

zuiderkwast Oct 30, 2025 •

edited

Loading