Skip to content

[Draft] Durability RFC#33

Open
hpatro wants to merge 3 commits into
valkey-io:mainfrom
hpatro:durability_hld
Open

[Draft] Durability RFC#33
hpatro wants to merge 3 commits into
valkey-io:mainfrom
hpatro:durability_hld

Conversation

@hpatro

@hpatro hpatro commented Oct 27, 2025

Copy link
Copy Markdown
Contributor

Here is a high level design document for Valkey. It's still in draft phase. So, we should expect changes to the file over the course of time. It takes a subset of requirements ffrom #29 and outlines the structure.

Here's the markdown rendered view of the document: https://github.com/hpatro/valkey-rfc/blob/durability_hld/Durability.md

While folks are reviewing and adding comments. I'm planning to look at the low level details for.

  1. Finalize the API contract and data format for logging control path and data path.
  2. Finalize the additional metadata for snapshots.
  3. Performance impact (latency)

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Comment thread Durability.md

## Problem Statement

Valkey is an in-memory data store designed for high performance. Valkey currently uses an asynchronous replication model for high throughput and low latency, which means writes acknowledged to clients may be lost if replicas lag, crash, or are promoted while out-of-sync. Because all data resides in memory by default, process restarts or crashes risk losing recent writes. Although periodic snapshots (RDB) and append-only logs (AOF) provide persistence, they entail trade-offs: snapshots may omit recent writes and AOF only supports single-node durability. In cluster mode, data loss remains possible when a replica is promoted prematurely. Shifting durability responsibility to clients (via WAIT) is not intuitive and places burden on the application layer.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also i think makes durability hard to reason about and debug - which data is supposed to be durable? which one is ok to get lost?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAIT only guarantees durability if you wait for all replicas.

Even if you have exactly two replicas and use WAIT 2 in the applications, during a rolling upgrade you may add one or more extra replicas and if the primary fails during this time, there's a chance one of the replicas which didn't ack the latest writes can win the failover.

So in practice, WAIT doesn't provide any durability guarantees. This is the main problem with WAIT, IMO.

Comment thread Durability.md

## Problem Statement

Valkey is an in-memory data store designed for high performance. Valkey currently uses an asynchronous replication model for high throughput and low latency, which means writes acknowledged to clients may be lost if replicas lag, crash, or are promoted while out-of-sync. Because all data resides in memory by default, process restarts or crashes risk losing recent writes. Although periodic snapshots (RDB) and append-only logs (AOF) provide persistence, they entail trade-offs: snapshots may omit recent writes and AOF only supports single-node durability. In cluster mode, data loss remains possible when a replica is promoted prematurely. Shifting durability responsibility to clients (via WAIT) is not intuitive and places burden on the application layer.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shifting of durability responsibility seems more of a motivation for this solution vs. part of the problem statement (AOF doesn't provide durability guarantees as it does not prevent data loss seems to be the key one)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AOF only provides the guarantee if the append-fsync is set to always.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even then, if you're running in sentinel a replica with an older dataset could be promoted, leading to data loss, right?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cover this. Ignore me.

Comment thread Durability.md
Comment thread Durability.md

Upon receiving a write:

1. The leader applies the command to its local in-memory state immediately, ensuring fast execution.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since we don't ack to the client until we achieve quorum, i am not sure "fast execution" matters as a distinction here, is there a reason this is important?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should ensure high throughput, which is an important consideration.

Comment thread Durability.md
Blocking occurs only at the primary. Replica nodes apply entries passively as they are received and do not block client operations directly.
Read operations and other non-durable commands are not affected by blocking behavior.

> Note:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i find this a bit confusing, the summary above talks of both modes but the detail says only one mode will be supported on v1?

Comment thread Durability.md

Open Questions:

* Should a mechanism be introduced to temporarily block all new write traffic once the replication buffer limit is reached, effectively stalling writes until the backlog is cleared? This could prevent unbounded memory growth and avoid triggering failover operations caused by memory exhaustion.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the failover operations would have the side effect of possibly causing data loss, right? in this case, i do think this mechanism would be a good opt in (if clients care about durability, by enabling they are already taking a hit on performance, so this seems like a tradeoff many would take). Not sure of the complexity of this though. Naively it doesn't seem that big of a lift.

Comment thread Durability.md
1. Opt to blocking of read operation on modified key(s).
2. If the durable operation fails, there is no dirty data to cleanup on the primary and the operation can be retried.

#### Con(s)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some others from the other writeup we are chatting about are:

  • max memory enforcement
  • eviction
  • ttl

Comment thread Durability.md
#### Con(s)

1. Commands with random operation like `SPOP` needs to be made deterministic across all the nodes in a shard.
2. Lua scripts cannot be safely supported under write-ahead logging because their read and write sets are only known during execution. Since WAL requires the command to be logged before execution, it cannot capture the dynamic side effects or conditional logic inside a script.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alternative is to attempt to make them deterministic but redis tried (and "failed") to do that in versions prior to 5 (i think? 5 ish)

Comment thread Durability.md

#### Log Compaction ([UML Code](./assets/UML.md#uml-code-for-log-compaction))

Log compaction is the process of truncating old, committed log entries that are no longer required for recovery, replacing them with a compact snapshot of the current state. This ensures that the replicated log remains bounded in size, preventing unbounded memory growth and reducing catch-up time for lagging replicas.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the plan to offer a configuratble threshold and handle this in a timed event or to let clients snapshot and always use that to compact?

Comment thread Durability.md

> Note:
>
> * We should start with same thread for end-to-end operation for P0 and then move to further improvements.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree - it will be easier to reason about and test , and i dont think it has many drawbacks implementation wise at least?

Comment thread Durability.md

![Log Compaction](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/LogCompaction.svg)

### Logging embedded within Valkey

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the alternative to offer this as a module?

Comment thread Durability.md

The durability specification introduces a **RAFT-based state machine replication** model at the shard level, ensuring deterministic leadership, consistent log replication, and predictable recovery across Valkey cluster. Each shard operates as an independent group consisting of one leader (primary) and multiple followers (replicas).
All write operations flow through the primary and are durably replicated before being acknowledged.
Replication and election are handled at shard granularity, ensuring fault isolation and predictable recovery. A write is considered durable once it has been replicated to a quorum of nodes and safely committed in the leader’s log. Acknowledged writes are guaranteed to survive any single-node or minority failure.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be an overlap with the current strategy where the quorum of all the primary elects a primary of the shard. Would both of the strategies co exist? Or do we want to eventually phase out the quorum strategy? I am guessing we will still end up gossiping about the cluster state across nodes.

@zuiderkwast zuiderkwast Oct 30, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would have to use one or the other. Rolling upgrade / mixed cluster is a challenge that we might attempt to solve later. It would be nice to have but it's not strictly required.

Btw, under Requirements above, this is only for standalone mode to start with. Cluster mode is under Planned for future.

Comment thread Durability.md

1. The leader applies the command to its local in-memory state immediately, ensuring fast execution.
2. The corresponding log entry is appended to the RAFT log and replicated asynchronously to follower nodes.
3. The client response is blocked until quorum acknowledgment confirms that the entry is durably replicated and committed.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to wait for all the replication and operations to complete, do we foresee any issues on the client side with this kind of waiting? Did we consider an option to return an ack and use notifications to inform the client when the data is sufficiently replicated?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay this is answered in the following blocking section

Comment thread Durability.md
Blocking determines when a client receives acknowledgment for an operation relative to its durability state.
While the leader applies a command immediately to its in-memory state, the client response is held until the entry is confirmed by the configured durability mode. This ensures that acknowledgment semantics always reflect the intended durability guarantees. The client remains blocked until a quorum of replicas (**synchronous mode**) acknowledges the log entry, guaranteeing durability against minority or single-node failures.

Blocking occurs only at the primary. Replica nodes apply entries passively as they are received and do not block client operations directly.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but I guess this is where the problem of dirty read can arise since we are not blocking the replicas to respond to GET type commands while the SET type command is being processed.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have seen in other databases that the values are sometimes associated with the version number, which helps give the client information if the read is of the latest version or not.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might implement it in this way as an incremental step, but in the long term, I don't think read-from-replicas should be read-uncommitted either. I believe it can be a pitfall for users. The replica too should either use blocking on the replica too or a WAL on the incoming replication stream, which is much easier when all writes are coming from the primary. No need for MVCC or anything like that. The replica can just delay the incoming replication stream and apply them to the data set only when they're committed by the primary.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I understanding the write-behind logic correctly in this scenario, specifically the need for the rollback in point 6?

Initial conditions, 5 nodes, node 0 wins first leader election

  1. Node 0 receives a write request and executes it immediately before logging (N0')
  2. There is a network partition such that Node 0 (leader) and Node 1 (follower) are in a minority partition
  3. Node 0 log is replicated to node 1 (N1')
  4. Timeout happens and leader election occurs, the majority partition elects a new leader (here node 2)
  5. Node 2 starts handling writes and replicating its log
  6. When the network recovers, not only will node 0 have to step down (per raft algorithm) but both node 0 and node 1 must be able to roll back their state

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In such a scenario, I believe N0 and N1 will need to do a full resynchronization against N2. This is actually already a capability of Valkey since we track offsets and replication IDs:

  1. N0 has replication ID "repl_id_1" (for sake of example) and offset 100
  2. N0 is partitioned, gets a write, and now is at offset 101. N1 also gets the write and is at offset 101. But the write is not acked or committed (not a majority ack yet)
  3. N2 is promoted. It sets secondary replication ID to "repl_id_1" with offset up to 100 (last acked write) and creates a new replication ID "repl_id_2". A majority of followers agree this is the last acknowledged write and allow the promotion.
  4. N0 and N1 recover from partition. Before serving, they must learn of the new leader. When learning of the leader, they recognize they have diverged (offset 101 vs maximum offset 100 on "repl_id_1") and would full resync with N2. During this time those nodes would not serve reads (-LOADING)

Rollback support is more of an optimization, where step 4 could "partial sync" to repl_id_2 by undoing the write at offset 101 and psyncing from offset 100.

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Comment thread Durability.md
While the leader applies a command immediately to its in-memory state, the client response is held until the entry is confirmed by the configured durability mode. This ensures that acknowledgment semantics always reflect the intended durability guarantees. The client remains blocked until a quorum of replicas (**synchronous mode**) acknowledges the log entry, guaranteeing durability against minority or single-node failures.

Blocking occurs only at the primary. Replica nodes apply entries passively as they are received and do not block client operations directly.
Read operations and other non-durable commands are not affected by blocking behavior.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read operations need to be blocked too, if the key has been modified by another client and not yet been durably replicated.

Comment thread Durability.md
> We could make the blocking mode configurable in the future and introduce two additional modes:
>
> * **Asynchronous mode:** The client receives acknowledgment immediately after the leader applies the command locally. This mode prioritizes latency over durability.
> * **Semi-synchronous mode:** The client is unblocked once at least one replica acknowledges the log entry, providing a balance between durability and performance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea, but it seems not very useful to me. You don't get the durability guarantee and also not the low latency. (And you can already achieve this with WAIT 1.)

What's more important here are the guarantees. If you don't need the durability guarantee, you'd use the async mode AKA read uncommitted. I think we should name this after the isolation levels and use commands like CLIENT READ UNCOMMITTED vs CLIENT READ COMMITTED for the client to choose the mode.

Comment thread Durability.md
Comment on lines +164 to +167
#### Cons(s):

1. Requires building synchronous replication or quorum based replication system.
2. More overlap of asynchronous replication and synchronous replication system.

@zuiderkwast zuiderkwast Oct 30, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the overlap between the async and sync replication is a huge benefit. We have 90% of the sync replication already implemented, including that we have log compaction ( = backlog trimming and full sync using RDB snapshot).

Notice: Sync replication is just async replication with acks from the replicas.

With a shared replication stream, we can support sync replication to followers that are part of the quorum AND async replication to replicas that aren't part of the quorum, for example replicas that have enabled the cluster-replica-no-failover config, with a single shared replication stream.

Comment thread Durability.md
>
> * How to discover peer nodes?
> * We can't reuse the `REPLICAOF` mechanism as we intend to piggyback on RAFT's consensus model for leader election.
> * Hence, new nodes can discover peers via the `shard-nodes` config which is a list of ip:port address.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add it as a config shard-nodes, this config is modified when more nodes join the raft cluster and users need to use CONFIG REWRITE to update the config on disk? What happens if the nodes restart and the quorum has changed? Multiple nodes can have been added and others removed.

This information is similar to what we store in nodes.conf in cluster mode. If we store it on disk at all, then we should update it whenever the nodes and the quorum changes.

Another possibility (my preference) is that we don't start all nodes as followers. We
can start each node as an empty primary, just as we do today. We can use REPLICAOF on the other nodes to add them as replicas. This doesn't necessarily make them part of the quorum though, but it is a way to let the primary discover them.

Normally, the primary adds a node to the quorum by replicating a change and this is acked by the existing quorum. The problem is that before there is any quorum at all, there is no way to commit any changes, but we can step away from this just once in the boostrapping phase. The primary can just decide to add them to the quorum and replicate the topology change as committed.

Comment thread Durability.md
> * Hence, new nodes can discover peers via the `shard-nodes` config which is a list of ip:port address.
> * When to start the initialization phase? - To be added

### Node addition([UML Code](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/UML.md#uml-code-for-node-addition))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLUSTER ADDNODE, CLUSTER REMOVENODE etc.

We already have CLUSTER REPLICAOF and CLUSTER FORGET to add and remove nodes in cluster mode. I can't see why we don't reuse those.

Btw, this RFC seems to be about standalone mode still. (Cluster mode is future.)

@hpatro

hpatro commented Oct 30, 2025

Copy link
Copy Markdown
Contributor Author

I need to add few sections (offline feedback from Yair):

  1. Impact on modules.
  2. Rebuilding in-memory view after failover due to dirty view.
  3. Compound data structure problem

@hpatro

hpatro commented Oct 30, 2025

Copy link
Copy Markdown
Contributor Author

Thanks @jjuleslasarte / @zuiderkwast / @sarthakaggarwal97. Taking a look at them.

Comment thread Durability.md

![Node Addition](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/NodeAddition.png)

### Primary removal ([UML Code](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/UML.md#uml-code-for-primary-removal))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding primary removal, please review section 3.10 Leadership transfer extension in Ongaro's PhD: https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf

It is essentially the same coordinated failover algorithm that we already have in Valkey. It would be a pity to ignore it, only to switch to something inferior.

  1. The leader pauses writes.
  2. The leader waits for the target server to catch up the log index (or replication offset).
  3. The leader sends a TimeoutNow request to the target server. This request has the same
    effect as the target server’s election timer firing: the target server starts a new election (incrementing its term and becoming a candidate).

Comment thread Durability.md

![Lifecycle of a write command](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/WriteCommand.svg)

### Lifecycle of a read command ([UML Code](https://github.com/hpatro/valkey-rfc/blob/durability_hld/assets/UML.md#uml-code-for-lifecycle-of-a-read-command))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is preventing stale reads on the primary a goal? If we want read-after-write consistency, then this is needed.

E.g:

  1. Node A is primary, Node B and C are replicas
  2. Node A Is partitioned from Node B and C
  3. Node B detects this and promotes itself with C's permission, but Node A hasn't discovered it
  4. Node B accepts some writes, and gets quorum from C
  5. Node A gets a read request, and it still thinks it's primary, so it serves it

If 4 and 5 were the same application, this is a failure of read-after-write consistency. To solve this, I think there are two options in Raft:

Second, a leader must check whether it has been deposed before processing a read-only request (its information may be stale if a more recent leader has been elected).
Raft handles this by having the leader exchange heartbeat messages with a majority of the cluster before responding to read-only requests. Alternatively, the leader
could rely on the heartbeat mechanism to provide a form
of lease [9], but this would rely on timing for safety (it
assumes bounded clock skew).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants