[POC] Key blocking for dirty writes by jjuleslasarte · Pull Request #2865 · valkey-io/valkey

jjuleslasarte · 2025-11-24T00:16:51Z

This draft PR is a POC for a dirty key blocking and tracking mechanism to support durability implemented with WBL approach, as outlined in the durablily HLD. It implements 'blocking' of:

acknowledgement of a write until there's consensus
dirty reads in the primary node.

Dirty key tracking implementation

Blocking of dirty reads and delaying ack of writes is centered around the ability to determine the set of keys that are dirty or uncommitted by the "quorum" of nodes in the cluster.

This can be realized by keeping track of the list of items that are dirty in the some data-structure (a radix tree in this draft) where each key is corresponded with the byte offset it is currently waiting on acknowledgements from replicas. An incoming read operation can quickly determine whether the keys it tries to access are dirty or not, and if dirty, what offset is required to unblock its responses. This mechanism can also serve if we want to allow a configurable "tail data loss" (i.e allow up to x seconds / x amount of data to be acked before consensus)

The downside of this approach is that the response message gets buffered in memory on the valkey server. This creates certain amount of memory pressure on the server for 2 categories of scenarios:

The incoming reader client tries to read a very large item (i.e. a composite item that contains many fields) thus generates a large response buffer in memory
Many incoming clients try to access dirty items (i.e. hot items) and accumulates a large amount of responses buffered in memory.

For those cases, we could implement a throttling mechanism in the valkey server.

Hooks/touchpoints

This POC hooks into the processing of the command in the following places. The existing flow for most commands is

processCommand -> call() -> c->cmnd->proc()

This PR adds the following hooks:

processCommand
- preCommandExec: track the pre-execution positions in the reply cob of the client (and, eventually, connected monitors)
- call
  - preCall: record the starting replication offset of the command about to be executed.
  - cmd->cmnd->proc()
  - postCall : here, we detect whether we have a command that has made a key dirty (by comparing the replication offset from preCall with the current one), or if the command has accessed a dirty key.
    - In the write path, the blocking offset is the current offset.
    - In the read path, we get the offset by searching for the key in the rax tree.
- postCommandExec
  - block the client for the given offset

The separation of preCall/postCall and preCommandExect and postCommandExect is so that we can handle multi and lua.

In multi/exec, the individual commands in a transaction

MULTI 
SET a b
SET b c
EXEC

don't go trough processCommand - and get directly called() in the exec command, but form part of the same command, so client needs to block at the position pre-execution of multi. I had included the support for multi (you can see it here) in this PR to better show this, but it was a bit too long for one draft, so I decided to leave that for the next PR.

example

client sends SET FOO BAR

primary

27121:M 23 Nov 2025 15:59:42.820 b postCommandExec hook entered for command 'set'
27121:M 23 Nov 2025 15:59:42.820 b client should be blocked at offset 5509,
// ...
// replica one acks
27121:M 23 Nov 2025 15:59:44.404 b preCall hook: pre_call_replication_offset=5523, pre_call_num_ops_pending_propagation=0

// replica two acks
27121:M 23 Nov 2025 15:59:44.802 b preCall hook: pre_call_replication_offset=5523, pre_call_num_ops_pending_propagation=0

// ...client unblocked
27121:M 23 Nov 2025 15:59:44.802 b unblocking clients for consensus offset 5523,

TODO

testing
handling of multi/exec (see here)
handling of lua
handling of db level commands
performance testing
configs and handling of turning on / off
telemetry

Looking for feedback on the general approach, the touchpoints with the major codepaths and naming. I will edit this PR

zuiderkwast

Very nice to see this PoC!

I'm posting some random comments. I know it's not really ready for review yet so there's no need to reply to them at this point.

There are many TODOs throughout the code which is nice to see. We obviously don't need everything in the PoC, but still good to have them written down. We can convert them to issues or something later.

When we do raft-based replication later, each replica will need to be added to the quorum in some explicit way and only these replicas in the quorum actually count when we count acks. We can keep this in mind and modify how we count what's "committed" and not, later.

hpatro

Few things @jjuleslasarte and I had discussed and would need to revisit at a later point:

Move from offset tracking mechanism to raft's term/index/commit for consensus (modify getConsensusOffset).
Figure out throttling mechanism on accumulation of buffer. Dropping COB on overrun isn't feasible.
RAX to be replaced with hashtable (if deemed not efficient). Note: This is only ephemeral data though i.e. during the blocking phase.

Hook points to discuss:

server.c:3794: preCall(); - Capture replication offset (pre execution)
server.c:3994: postCall(); - Determine if write command got executed and handles special blocking (scripts)
server.c:4528: preCommandExec(); - Prepares individual client for blocking
server.c:4533: postCommandExec(); - Blocks individual client
networking.c:1680: isClientReplyBufferLimited(c) - Response buffering and limited
replication.c:1426: postReplicaAck(); Processes acknowledgment and allows clients to progress to certain offset.

hpatro · 2025-11-25T16:12:48Z

@ranshid Do we remove the read handler callback while the client is blocked?

hpatro · 2025-11-25T16:20:10Z

@allenss-amazon @yairgott @madolson @rjd15372 Could you folks take a look and provide your feedback?

hpatro · 2025-11-25T16:45:26Z

+// Describes a pre-execution COB offset for a client
+typedef struct preExecutionOffsetPosition {
+    // True if the pre execution offset/reply block are initialized
+    bool recorded;


When would it be not recorded?

hpatro · 2025-11-25T16:56:35Z

+long long getConsensusOffset(const unsigned long numAcksNeeded) {
+    const unsigned long numReplicas = listLength(server.replicas);
+    if (numAcksNeeded == 0) {
+        // If no ack is needed, then the consensus offset is the one primary is at.
+        return server.primary_repl_offset;
+    }
+
+    // If the number of connected replicas is less than the number of required replicas,
+    // return -1 because we don't have enough number of replicas for the ACK. 
+    if (numReplicas < numAcksNeeded) {
+        return -1;
+    }


This might as well be a callback

size_t getConsensusOffset();

sarthakaggarwal97 · 2026-01-06T06:49:47Z

        run_with_period(100) modulesCron();
    }

+    run_with_period(1000) clearUncommittedKeysAcknowledged();


can this be very expensive for write heavy workloads?

maybe we can find ways to optimize it later on.

sarthakaggarwal97 · 2026-01-06T06:50:34Z

+    return false;
+}
+
+int isPrimaryDurabilityEnabled(void) {


Suggested change

int isPrimaryDurabilityEnabled(void) {

int durabilityEnabledOnPrimary(void) {

sarthakaggarwal97

I am still in the process of absorbing this change. Leaving some minor thoughts. I will keep on taking a look the next few days.

sarthakaggarwal97 · 2026-01-06T06:53:39Z

+ * Utility function to determine whether the durability flag has been enabled.
+ * return       1 if durability is enabled, 0 otherwise.
+ */
+int isDurabilityEnabled(void) {


probably a server config?

sarthakaggarwal97 · 2026-01-06T06:57:54Z

+        return;
+    }
+    if (c->clientDurabilityInfo.blocked_responses == NULL) {
+        c->clientDurabilityInfo.blocked_responses = listCreate();


should this be created lazily for short lived clients which may not need it?

sarthakaggarwal97 · 2026-01-06T06:58:29Z

        run_with_period(100) modulesCron();
    }

+    run_with_period(1000) clearUncommittedKeysAcknowledged();


maybe we can find ways to optimize it later on.

sarthakaggarwal97 · 2026-01-06T07:00:30Z

+
+    // don't bother sorting if there is only one replica. 
+    if (numReplicas > 1) { 
+        qsort(replica_offsets, numReplicas, sizeof(long long), offsetSorterDesc);


maybe a min heap type structure might help to avoid sorting always?

We need to verify total duration was at least 2 seconds, elapsed time can be quite variable to check upper-bound Fixes valkey-io#2843 Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

…o#2508) By default, when the number of elements in a zset exceeds 128, the underlying data structure adopts a skiplist. We can reduce memory usage by embedding elements into the skiplist nodes. Change the `zskiplistNode` memory layout as follows: ``` Before +-------------+ +-----> | element-sds | | +-------------+ | +------------------+-------+------------------+---------+-----+---------+ | element--pointer | score | backward-pointer | level-0 | ... | level-N | +------------------+-------+------------------+---------+-----+---------+ After +-------+------------------+---------+-----+---------+-------------+ + score | backward-pointer | level-0 | ... | level-N | element-sds | +-------+------------------+---------+-----+---------+-------------+ ``` Before the embedded SDS representation, we include one byte representing the size of the SDS header, i.e. the offset into the SDS representation where that actual string starts. The memory saving is therefore one pointer minus one byte = 7 bytes per element, regardless of other factors such as element size or number of elements. ### Benchmark step I generated the test data using the following lua script && cli command. And check memory usage using the `info` command. **lua script** ``` local start_idx = tonumber(ARGV[1]) local end_idx = tonumber(ARGV[2]) local elem_count = tonumber(ARGV[3]) for i = start_idx, end_idx do local key = "zset:" .. string.format("%012d", i) local members = {} for j = 0, elem_count - 1 do table.insert(members, j) table.insert(members, "member:" .. j) end redis.call("ZADD", key, unpack(members)) end return "OK: Created " .. (end_idx - start_idx + 1) .. " zsets" ``` **valkey-cli command** `valkey-cli EVAL "$(catcreate_zsets.lua)" 0 0 100000 ${ZSET_ELEMENT_NUM}` ### Benchmark result |number of elements in a zset | memory usage before optimization | memory usage after optimization | change | |-------|-------|-------|-------| | 129 | 1047MB | 943MB | -9.9% | | 256 | 2010MB| 1803MB| -10.3%| | 512 | 3904MB|3483MB| -10.8%| --------- Signed-off-by: chzhoo <czawyx@163.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

) Test the SCAN consistency by alternating SCAN calls to primary and replica. We cannot rely on the exact order of the elements and the returned cursor number. --------- Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

…validation (valkey-io#2600) Addresses: valkey-io#2588 ## Overview Previously we call `emptyData()` during a fullSync before validating the RDB version is compatible. This change adds an rdb flag that allows us to flush the database from within `rdbLoadRioWithLoadingCtx`. THhis provides the option to only flush the data if the rdb has a valid version and signature. In the case where we do have an invalid version and signature, we don't emptyData, so if a full sync fails for that reason a replica can still serve stale data instead of clients experiencing cache misses. ## Changes - Added a new flag `RDBFLAGS_EMPTY_DATA` that signals to flush the database after rdb validation - Added logic to call `emptyData` in `rdbLoadRioWithLoadingCtx` in `rdb.c` - Added logic to not clear data if the RDB validation fails in `replication.c` using new return type `RDB_INCOMPATIBLE` - Modified the signature of `rdbLoadRioWithLoadingCtx` to return RDB success codes and updated all calling sites. ## Testing Added a tcl test that uses the debug command `reload nosave` to load from an RDB that has a future version number. This triggers the same code path that full sync's will use, and verifies that we don't flush the data until after the validation is complete. A test already exists that checks that the data is flushed: https://github.com/valkey-io/valkey/blob/unstable/tests/integration/replication.tcl#L1504 --------- Signed-off-by: Venkat Pamulapati <pamuvenk@amazon.com> Signed-off-by: Venkat Pamulapati <33398322+ChiliPaneer@users.noreply.github.com> Co-authored-by: Venkat Pamulapati <pamuvenk@amazon.com> Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

…ey-io#2856) Fix a little miss in "Hash field TTL and active expiry propagates correctly through chain replication" test in `hashexpire.tcl`. The test did not wait for the initial sync of the chained replica and thus made the test flakey Signed-off-by: Arad Zilberstein <aradz@amazon.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

The original test code only checks: The original test code only checks: 1. wait_for_cluster_size 4, which calls cluster_size_consistent for every node. Inside that function, for each node, cluster_size_consistent queries cluster_known_nodes, which is calculated as (unsigned long long)dictSize(server.cluster->nodes). However, when a new node is added to the cluster, it is first created in the HANDSHAKE state, and clusterAddNode adds it to the nodes hash table. Therefore, it is possible for the new node to still be in HANDSHAKE status (processed asynchronously) even though it appears that all nodes “know” there are 4 nodes in the cluster. 2. cluster_state for every node, but when a new node is added, server.cluster->state remains FAIL. Some handshake processes may not have completed yet, which likely causes the flakiness. To address this, added a --cluster check to ensure that the config state is consistent. Fixes valkey-io#2693. Signed-off-by: Hanxi Zhang <hanxizh@amazon.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

This commit adds script function flags to the module API, which allows function scripts to specify the function flags programmatically. When the scripting engine compiles the script code can extract the flags from the code and set the flags on the compiled function objects. --------- Signed-off-by: Ricardo Dias <ricardo.dias@percona.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

valkey-io#2833) In dual channel replication, when the rdb channel client finish the RDB transfer, it will enter REPLICA_STATE_RDB_TRANSMITTED state. During this time, there will be a brief window that we are not able to see the connection in the INFO REPLICATION. In the worst case, we might not see the connection for the DEFAULT_WAIT_BEFORE_RDB_CLIENT_FREE seconds. I guess there is no harm to list this state, showing connected_slaves but not showing the connection is bad when troubleshooting. Note that this also affects the `valkey-cli --rdb` and `--functions-rdb` options. Before the client is in the `rdb_transmitted` state and is released, we will now see it in the info (see the example later). Before, not showing the replica info ``` role:master connected_slaves:1 ``` After, for dual channel replication: ``` role:master connected_slaves:1 slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=rdb-channel ``` After, for valkey-cli --rdb-only and --functions-rdb: ``` role:master connected_slaves:1 slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=replica ``` Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

…mited by COB (valkey-io#2824) After introducing the dual channel replication in valkey-io#60, we decided in valkey-io#915 not to add a new configuration item to limit the replica's local replication buffer, just use "client-output-buffer-limit replica hard" to limit it. We need to document this behavior and mention that once the limit is reached, all future data will accumulate in the primary side. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

jjuleslasarte · 2026-01-09T17:23:39Z

closing this DRAFT - I'm incorporating @zuiderkwast @hpatro and @sarthakaggarwal97 's feedback into the work in progress in the durability branch. Tracking issue will be #3037 - I'll open up an in progress draft PR against durability branch.

github-actions Bot assigned jjuleslasarte Nov 24, 2025

zuiderkwast reviewed Nov 24, 2025

View reviewed changes

hpatro reviewed Nov 24, 2025

View reviewed changes

hpatro reviewed Nov 25, 2025

View reviewed changes

hpatro reviewed Nov 26, 2025

View reviewed changes

hpatro requested a review from sarthakaggarwal97 December 16, 2025 16:07

sarthakaggarwal97 reviewed Jan 6, 2026

View reviewed changes

jjuleslasarte force-pushed the consistent-writes branch 3 times, most recently from 150e83f to fe1e40c Compare January 6, 2026 22:08

roshkhatri and others added 17 commits January 6, 2026 15:08

[poc] consistent writes w/ wbl

8cd9867

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

remove multi handling for next pr

1f700be

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

adding some more logging for poc

448410a

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

addressing feedback on PR

df3992c

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Initial commit for key blocking

3163a12

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

remove unused files and correct merge

bb0f95a

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

fix merge conflicts

f4b19de

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

fix merge conflicts

89b78b0

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

jjuleslasarte and others added 8 commits January 6, 2026 15:08

fix merge conflicts

1104f42

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

fix merge conflicts

23ff193

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Initial commit for key blocking

f9bd46e

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

fix merge

bbc92a1

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

server fixes

7ea275b

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

removing file

56c81b7

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

removing file

968fabe

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

fixing once more

83e2080

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

jjuleslasarte force-pushed the consistent-writes branch from 7e91d03 to 83e2080 Compare January 6, 2026 23:15

jjuleslasarte added 2 commits January 6, 2026 15:16

fixing makefile

9b08437

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

requesting ack from replicas

3426625

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

jjuleslasarte mentioned this pull request Jan 9, 2026

[Durability] Implement Synchronous replication #3037

Open

jjuleslasarte closed this Jan 9, 2026

jjuleslasarte mentioned this pull request Jan 13, 2026

[Sync replication] replication: rax -> hashtable and handling of failovers #3057

Closed

	int isPrimaryDurabilityEnabled(void) {
	int durabilityEnabledOnPrimary(void) {

Conversation

jjuleslasarte commented Nov 24, 2025

Dirty key tracking implementation

Hooks/touchpoints

example

primary

TODO

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hpatro left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hpatro Nov 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hpatro commented Nov 25, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sarthakaggarwal97 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jjuleslasarte commented Jan 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

hpatro left a comment •

edited

Loading

hpatro Nov 25, 2025 •

edited

Loading