
[POC] Key blocking for dirty writes #2865

Closed
jjuleslasarte wants to merge 27 commits into valkey-io:unstable from jjuleslasarte:consistent-writes

@jjuleslasarte

Contributor

This draft PR is a POC of a dirty-key blocking and tracking mechanism to support durability implemented with the WBL approach, as outlined in the durability HLD. It implements blocking of:

  • acknowledgement of a write until there's consensus
  • dirty reads in the primary node.

Dirty key tracking implementation

Blocking dirty reads and delaying write acknowledgements both depend on being able to determine the set of keys that are dirty, i.e. not yet committed by a quorum of nodes in the cluster.

This can be realized by tracking the dirty keys in a data structure (a radix tree in this draft) in which each key is associated with the byte offset that replicas must still acknowledge. An incoming read operation can quickly determine whether the keys it accesses are dirty and, if so, which offset must be acknowledged before its response is unblocked. This mechanism could also be used to allow a configurable "tail data loss" (i.e. allow up to x seconds or x amount of data to be acked before consensus).
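To make the key-to-offset bookkeeping concrete, here is a minimal sketch. It is not the code in this PR: it swaps the radix tree for a small fixed-size array, and every name in it (`dirtyKeyTable`, `markKeyDirty`, `blockingOffsetForKey`, `clearAcknowledgedKeys`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size table standing in for the PoC's radix tree. */
#define MAX_DIRTY_KEYS 64
#define MAX_KEY_LEN 64

typedef struct dirtyKeyEntry {
    char key[MAX_KEY_LEN];
    long long offset; /* offset that replicas must still acknowledge */
    int in_use;
} dirtyKeyEntry;

typedef struct dirtyKeyTable {
    dirtyKeyEntry entries[MAX_DIRTY_KEYS];
} dirtyKeyTable;

static dirtyKeyEntry *findEntry(dirtyKeyTable *t, const char *key) {
    for (size_t i = 0; i < MAX_DIRTY_KEYS; i++) {
        if (t->entries[i].in_use && strcmp(t->entries[i].key, key) == 0) {
            return &t->entries[i];
        }
    }
    return NULL;
}

/* Record that `key` stays dirty until `offset` is acknowledged.
 * Returns 0 on success, -1 if the table is full or the key is too long. */
int markKeyDirty(dirtyKeyTable *t, const char *key, long long offset) {
    assert(t != NULL && key != NULL);
    if (strlen(key) >= MAX_KEY_LEN) return -1;
    dirtyKeyEntry *e = findEntry(t, key);
    if (e != NULL) {
        /* A later write to the same key raises the offset to wait for. */
        if (offset > e->offset) e->offset = offset;
        return 0;
    }
    for (size_t i = 0; i < MAX_DIRTY_KEYS; i++) {
        if (!t->entries[i].in_use) {
            strcpy(t->entries[i].key, key);
            t->entries[i].offset = offset;
            t->entries[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Offset a reader of `key` must wait for, or -1 if the key is clean. */
long long blockingOffsetForKey(dirtyKeyTable *t, const char *key) {
    assert(t != NULL && key != NULL);
    dirtyKeyEntry *e = findEntry(t, key);
    return e != NULL ? e->offset : -1;
}

/* Drop every key whose offset is covered by the consensus offset.
 * Returns the number of keys removed. */
size_t clearAcknowledgedKeys(dirtyKeyTable *t, long long consensus_offset) {
    assert(t != NULL);
    size_t removed = 0;
    for (size_t i = 0; i < MAX_DIRTY_KEYS; i++) {
        if (t->entries[i].in_use && t->entries[i].offset <= consensus_offset) {
            t->entries[i].in_use = 0;
            removed++;
        }
    }
    return removed;
}
```

A zero-initialized `dirtyKeyTable` is empty. A read path would call `blockingOffsetForKey` for each key it touches and block on the largest offset returned; the linear scan here is only for brevity and is exactly what a radix tree or hashtable would avoid.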

The downside of this approach is that the response is buffered in memory on the Valkey server. This creates memory pressure on the server in two categories of scenarios:

  • The incoming reader client reads a very large item (i.e. a composite item that contains many fields) and thus generates a large response buffer in memory.
  • Many incoming clients access dirty items (i.e. hot items) and accumulate a large amount of buffered responses in memory.

For those cases, we could implement a throttling mechanism in the valkey server.

Hooks/touchpoints

This POC hooks into command processing in the following places. The existing flow for most commands is

processCommand -> call() -> c->cmd->proc()

This PR adds the following hooks:

  • processCommand
    • preCommandExec: track the pre-execution positions in the reply COB (client output buffer) of the client (and, eventually, of connected monitors)
    • call
      • preCall: record the starting replication offset of the command about to be executed.
      • c->cmd->proc()
      • postCall: detect whether the command has made a key dirty (by comparing the replication offset from preCall with the current one) or has accessed a dirty key.
        • In the write path, the blocking offset is the current offset.
        • In the read path, we get the offset by searching for the key in the rax tree.
    • postCommandExec
      • block the client for the given offset

preCall/postCall are kept separate from preCommandExec/postCommandExec so that we can handle MULTI and Lua.

In MULTI/EXEC, the individual commands in a transaction

MULTI
SET a b
SET b c
EXEC

don't go through processCommand; they are called directly via call() inside the EXEC command. They still form part of the same command, so the client needs to block at the pre-execution position of the MULTI. I had included support for MULTI (you can see it here) in this PR to show this better, but it was a bit too long for one draft, so I decided to leave that for the next PR.

Example

client sends SET FOO BAR

primary

27121:M 23 Nov 2025 15:59:42.820 b postCommandExec hook entered for command 'set'
27121:M 23 Nov 2025 15:59:42.820 b client should be blocked at offset 5509,
// ...
// replica one acks
27121:M 23 Nov 2025 15:59:44.404 b preCall hook: pre_call_replication_offset=5523, pre_call_num_ops_pending_propagation=0

// replica two acks
27121:M 23 Nov 2025 15:59:44.802 b preCall hook: pre_call_replication_offset=5523, pre_call_num_ops_pending_propagation=0

// ...client unblocked
27121:M 23 Nov 2025 15:59:44.802 b unblocking clients for consensus offset 5523,
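The unblocking step at the end of this log can be pictured with a small sketch. It assumes each blocked client remembers the offset it is waiting on and is released once the consensus offset reaches it; `blockedClientSketch` and `unblockClientsUpTo` are hypothetical names, not functions from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of a client waiting on replica acknowledgements. */
typedef struct blockedClientSketch {
    long long blocked_offset; /* offset that must reach consensus */
    bool blocked;
} blockedClientSketch;

/* Release every client whose offset is covered by `consensus_offset`.
 * Returns the number of clients unblocked by this call. */
size_t unblockClientsUpTo(blockedClientSketch *clients, size_t n,
                          long long consensus_offset) {
    assert(clients != NULL || n == 0);
    size_t unblocked = 0;
    for (size_t i = 0; i < n; i++) {
        if (clients[i].blocked && clients[i].blocked_offset <= consensus_offset) {
            clients[i].blocked = false;
            unblocked++;
        }
    }
    return unblocked;
}
```

Under this assumption, a client blocked at offset 5509 is released by a consensus offset of 5523, while a client waiting on a later offset stays blocked.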

TODO

  • testing
  • handling of multi/exec (see here)
  • handling of lua
  • handling of db level commands
  • performance testing
  • configs and handling of turning on / off
  • telemetry

Looking for feedback on the general approach, the touchpoints with the major codepaths and naming. I will edit this PR

@zuiderkwast zuiderkwast left a comment

Contributor

Very nice to see this PoC!

I'm posting some random comments. I know it's not really ready for review yet so there's no need to reply to them at this point.

There are many TODOs throughout the code which is nice to see. We obviously don't need everything in the PoC, but still good to have them written down. We can convert them to issues or something later.

When we do raft-based replication later, each replica will need to be added to the quorum in some explicit way and only these replicas in the quorum actually count when we count acks. We can keep this in mind and modify how we count what's "committed" and not, later.

Comment thread src/durable_write.c Outdated
Comment thread src/reply_blocking.c
Comment thread src/durable_write.c Outdated
Comment thread src/server.h Outdated
Comment thread src/server.h
Comment thread src/server.h Outdated
Comment thread src/server.c Outdated
Comment thread src/server.c Outdated
Comment thread src/reply_blocking.c

@hpatro hpatro left a comment

Contributor

A few things @jjuleslasarte and I discussed that we would need to revisit at a later point:

  1. Move from the offset tracking mechanism to Raft's term/index/commit for consensus (modify getConsensusOffset).
  2. Figure out a throttling mechanism for buffer accumulation. Dropping the COB on overrun isn't feasible.
  3. Replace the RAX with a hashtable (if it is deemed inefficient). Note: this is only ephemeral data, i.e. it exists only during the blocking phase.

Hook points to discuss:

  • server.c:3794: preCall(); - Captures the replication offset (pre-execution)
  • server.c:3994: postCall(); - Determines whether a write command was executed and handles special blocking (scripts)
  • server.c:4528: preCommandExec(); - Prepares the individual client for blocking
  • server.c:4533: postCommandExec(); - Blocks the individual client
  • networking.c:1680: isClientReplyBufferLimited(c) - Response buffering and limiting
  • replication.c:1426: postReplicaAck(); - Processes the acknowledgment and allows clients to progress up to a certain offset

Comment thread src/networking.c

@hpatro hpatro Nov 25, 2025

Contributor

@ranshid Do we remove the read handler callback while the client is blocked?

@hpatro

hpatro commented Nov 25, 2025

Contributor

@allenss-amazon @yairgott @madolson @rjd15372 Could you folks take a look and provide your feedback?

Comment thread src/reply_blocking.h
// Describes a pre-execution COB offset for a client
typedef struct preExecutionOffsetPosition {
    // True if the pre execution offset/reply block are initialized
    bool recorded;

Contributor

When would it be not recorded?

Comment thread src/durable_write.c Outdated
Comment thread src/reply_blocking.c
Comment on lines +160 to +171
long long getConsensusOffset(const unsigned long numAcksNeeded) {
    const unsigned long numReplicas = listLength(server.replicas);
    if (numAcksNeeded == 0) {
        // If no ack is needed, then the consensus offset is the one primary is at.
        return server.primary_repl_offset;
    }

    // If the number of connected replicas is less than the number of required replicas,
    // return -1 because we don't have enough number of replicas for the ACK.
    if (numReplicas < numAcksNeeded) {
        return -1;
    }

Contributor

This might as well be a callback

size_t getConsensusOffset();
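For readers following the thread, here is a self-contained sketch of how such a consensus offset could be derived from replica acknowledgements. The first two branches mirror the excerpt above; the final step (sort the acknowledged offsets in descending order and take the `num_acks_needed`-th largest) is an assumption based on the descending `qsort` visible elsewhere in this PR, and `consensusOffsetSketch` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Orders offsets from largest to smallest for qsort. */
static int compareOffsetsDesc(const void *a, const void *b) {
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x < y) - (x > y);
}

/* Returns the highest offset acknowledged by at least `num_acks_needed`
 * replicas, the primary's own offset if no ack is needed, or -1 if there
 * are too few replicas (or the scratch allocation fails). */
long long consensusOffsetSketch(long long primary_offset, const long long *acks,
                                size_t num_replicas, size_t num_acks_needed) {
    if (num_acks_needed == 0) return primary_offset;
    if (num_replicas < num_acks_needed) return -1;
    assert(acks != NULL);

    /* Sort a private copy so the caller's array is left untouched. */
    long long *sorted = malloc(num_replicas * sizeof(*sorted));
    if (sorted == NULL) return -1;
    memcpy(sorted, acks, num_replicas * sizeof(*sorted));
    if (num_replicas > 1) {
        qsort(sorted, num_replicas, sizeof(*sorted), compareOffsetsDesc);
    }

    /* Assumption: the k-th largest ack is covered by k replicas. */
    long long result = sorted[num_acks_needed - 1];
    free(sorted);
    return result;
}
```

With acknowledged offsets 5523, 5509 and 5400 and two acks required, this returns 5509, since only one replica has reached 5523. Turning the computation into a callback, as suggested here, would let a Raft-based commit index replace it later without touching the callers.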

Comment thread src/server.c
run_with_period(100) modulesCron();
}

run_with_period(1000) clearUncommittedKeysAcknowledged();

Contributor

can this be very expensive for write heavy workloads?

Contributor

maybe we can find ways to optimize it later on.

Comment thread src/reply_blocking.c
return false;
}

int isPrimaryDurabilityEnabled(void) {

Contributor

Suggested change
int isPrimaryDurabilityEnabled(void) {
int durabilityEnabledOnPrimary(void) {

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Contributor

I am still in the process of absorbing this change. Leaving some minor thoughts. I will keep taking a look over the next few days.

Comment thread src/reply_blocking.c
* Utility function to determine whether the durability flag has been enabled.
* return 1 if durability is enabled, 0 otherwise.
*/
int isDurabilityEnabled(void) {

Contributor

probably a server config?

Comment thread src/reply_blocking.c
return;
}
if (c->clientDurabilityInfo.blocked_responses == NULL) {
c->clientDurabilityInfo.blocked_responses = listCreate();

Contributor

should this be created lazily for short lived clients which may not need it?

Comment thread src/reply_blocking.c

// don't bother sorting if there is only one replica.
if (numReplicas > 1) {
qsort(replica_offsets, numReplicas, sizeof(long long), offsetSorterDesc);

Contributor

maybe a min heap type structure might help to avoid sorting always?
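One possible reading of this suggestion, sketched under assumptions: instead of sorting all replica offsets, keep a min-heap of the k largest offsets seen so far, where k is the number of acks needed. The heap root is then the k-th largest offset. `kthLargestOffset` and the caller-provided scratch buffer are hypothetical, and this version rebuilds the heap on every call rather than maintaining it across acks.

```c
#include <assert.h>
#include <stddef.h>

/* Restore the min-heap property for the subtree rooted at index i. */
static void siftDown(long long *heap, size_t n, size_t i) {
    for (;;) {
        size_t left = 2 * i + 1, right = left + 1, smallest = i;
        if (left < n && heap[left] < heap[smallest]) smallest = left;
        if (right < n && heap[right] < heap[smallest]) smallest = right;
        if (smallest == i) return;
        long long tmp = heap[i];
        heap[i] = heap[smallest];
        heap[smallest] = tmp;
        i = smallest;
    }
}

/* Returns the k-th largest value in `offsets` using a min-heap of size k
 * built in `heap` (caller-provided scratch space for k elements).
 * Returns -1 if k is 0 or larger than n. */
long long kthLargestOffset(const long long *offsets, size_t n,
                           long long *heap, size_t k) {
    if (k == 0 || k > n) return -1;
    assert(offsets != NULL && heap != NULL);

    /* Seed the heap with the first k offsets and heapify bottom-up. */
    for (size_t i = 0; i < k; i++) heap[i] = offsets[i];
    for (size_t i = k / 2; i-- > 0;) siftDown(heap, k, i);

    /* Any offset larger than the root displaces the current k-th largest. */
    for (size_t i = k; i < n; i++) {
        if (offsets[i] > heap[0]) {
            heap[0] = offsets[i];
            siftDown(heap, k, 0);
        }
    }
    return heap[0];
}
```

This costs O(n log k) per call instead of O(n log n) for a full sort. With only a handful of replicas the difference is likely negligible, so whether it is worth the extra code is a judgment call.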

@jjuleslasarte jjuleslasarte force-pushed the consistent-writes branch 3 times, most recently from 150e83f to fe1e40c Compare January 6, 2026 22:08
roshkhatri and others added 17 commits January 6, 2026 15:08
We need to verify total duration was at least 2 seconds, elapsed time
can be quite variable to check upper-bound

Fixes valkey-io#2843

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…o#2508)

By default, when the number of elements in a zset exceeds 128, the
underlying data structure adopts a skiplist. We can reduce memory usage
by embedding elements into the skiplist nodes. Change the `zskiplistNode`
memory layout as follows:

```
Before
                 +-------------+
         +-----> | element-sds |
         |       +-------------+
         |
 +------------------+-------+------------------+---------+-----+---------+
 | element--pointer | score | backward-pointer | level-0 | ... | level-N |
 +------------------+-------+------------------+---------+-----+---------+

After
 +-------+------------------+---------+-----+---------+-------------+
 + score | backward-pointer | level-0 | ... | level-N | element-sds |
 +-------+------------------+---------+-----+---------+-------------+
```

Before the embedded SDS representation, we include one byte representing
the size of the SDS header, i.e. the offset into the SDS representation
where that actual string starts.

The memory saving is therefore one pointer minus one byte = 7 bytes per
element, regardless of other factors such as element size or number of
elements.

### Benchmark step

I generated the test data using the following Lua script and CLI command,
and checked memory usage using the `info` command.

**lua script**
```
local start_idx = tonumber(ARGV[1])
local end_idx = tonumber(ARGV[2])
local elem_count = tonumber(ARGV[3])

for i = start_idx, end_idx do
    local key = "zset:" .. string.format("%012d", i)
    local members = {}

    for j = 0, elem_count - 1 do
        table.insert(members, j)
        table.insert(members, "member:" .. j)
    end

    redis.call("ZADD", key, unpack(members))
end

return "OK: Created " .. (end_idx - start_idx + 1) .. " zsets"
```

**valkey-cli command**
`valkey-cli EVAL "$(cat create_zsets.lua)" 0 0 100000 ${ZSET_ELEMENT_NUM}`

### Benchmark result
| number of elements in a zset | memory usage before optimization | memory usage after optimization | change |
|-------|-------|-------|-------|
| 129 | 1047MB | 943MB | -9.9% |
| 256 | 2010MB | 1803MB | -10.3% |
| 512 | 3904MB | 3483MB | -10.8% |

---------

Signed-off-by: chzhoo <czawyx@163.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
)

Test the SCAN consistency by alternating SCAN
calls to primary and replica.
We cannot rely on the exact order of the elements and the returned
cursor number.

---------

Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…validation (valkey-io#2600)

Addresses: valkey-io#2588

## Overview
Previously we called `emptyData()` during a full sync before validating
that the RDB version is compatible.

This change adds an rdb flag that allows us to flush the database from
within `rdbLoadRioWithLoadingCtx`. This provides the option to only
flush the data if the rdb has a valid version and signature. In the case
where we do have an invalid version and signature, we don't emptyData,
so if a full sync fails for that reason a replica can still serve stale
data instead of clients experiencing cache misses.

## Changes
- Added a new flag `RDBFLAGS_EMPTY_DATA` that signals to flush the
database after rdb validation
- Added logic to call `emptyData` in `rdbLoadRioWithLoadingCtx` in
`rdb.c`
- Added logic to not clear data if the RDB validation fails in
`replication.c` using new return type `RDB_INCOMPATIBLE`
- Modified the signature of `rdbLoadRioWithLoadingCtx` to return RDB
success codes and updated all calling sites.

## Testing
Added a tcl test that uses the debug command `reload nosave` to load
from an RDB that has a future version number. This triggers the same
code path that full syncs will use, and verifies that we don't flush
the data until after the validation is complete.

A test already exists that checks that the data is flushed:
https://github.com/valkey-io/valkey/blob/unstable/tests/integration/replication.tcl#L1504

---------

Signed-off-by: Venkat Pamulapati <pamuvenk@amazon.com>
Signed-off-by: Venkat Pamulapati <33398322+ChiliPaneer@users.noreply.github.com>
Co-authored-by: Venkat Pamulapati <pamuvenk@amazon.com>
Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…ey-io#2856)

Fix a small oversight in the "Hash field TTL and active expiry propagates
correctly through chain replication" test in `hashexpire.tcl`.
The test did not wait for the initial sync of the chained replica, which made the test flaky.

Signed-off-by: Arad Zilberstein <aradz@amazon.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
The original test code only checks:

1. wait_for_cluster_size 4, which calls cluster_size_consistent for every node.
Inside that function, for each node, cluster_size_consistent queries cluster_known_nodes,
which is calculated as (unsigned long long)dictSize(server.cluster->nodes). However, when
a new node is added to the cluster, it is first created in the HANDSHAKE state, and
clusterAddNode adds it to the nodes hash table. Therefore, it is possible for the new
node to still be in HANDSHAKE status (processed asynchronously) even though it appears
that all nodes “know” there are 4 nodes in the cluster.

2. cluster_state for every node, but when a new node is added, server.cluster->state remains FAIL.

Some handshake processes may not have completed yet, which likely causes the flakiness.
To address this, added a --cluster check to ensure that the config state is consistent.

Fixes valkey-io#2693.

Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
This commit adds script function flags to the module API, which allows
function scripts to specify the function flags programmatically.

When the scripting engine compiles the script code, it can extract the flags
from the code and set the flags on the compiled function objects.

---------

Signed-off-by: Ricardo Dias <ricardo.dias@percona.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
valkey-io#2833)

In dual channel replication, when the rdb channel client finishes
the RDB transfer, it will enter REPLICA_STATE_RDB_TRANSMITTED
state. During this time, there will be a brief window in which we are
not able to see the connection in INFO REPLICATION.

In the worst case, we might not see the connection for
DEFAULT_WAIT_BEFORE_RDB_CLIENT_FREE seconds. I guess there is no
harm in listing this state; showing connected_slaves but not showing
the connection is bad when troubleshooting.

Note that this also affects the `valkey-cli --rdb` and `--functions-rdb`
options. Before the client is in the `rdb_transmitted` state and is
released, we will now see it in the info (see the example later).

Before, not showing the replica info
```
role:master
connected_slaves:1
```

After, for dual channel replication:
```
role:master
connected_slaves:1
slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=rdb-channel
```

After, for valkey-cli --rdb-only and --functions-rdb:
```
role:master
connected_slaves:1
slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=replica
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…mited by COB (valkey-io#2824)

After introducing the dual channel replication in valkey-io#60, we decided in valkey-io#915
not to add a new configuration item to limit the replica's local replication
buffer, just use "client-output-buffer-limit replica hard" to limit it.

We need to document this behavior and mention that once the limit is reached,
all future data will accumulate in the primary side.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
jjuleslasarte and others added 8 commits January 6, 2026 15:08
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
@jjuleslasarte

Contributor Author

Closing this draft. I'm incorporating the feedback from @zuiderkwast, @hpatro and @sarthakaggarwal97 into the work in progress on the durability branch. The tracking issue will be #3037; I'll open an in-progress draft PR against the durability branch.
