[POC] Key blocking for dirty writes #2865
zuiderkwast
left a comment
Very nice to see this PoC!
I'm posting some random comments. I know it's not really ready for review yet so there's no need to reply to them at this point.
There are many TODOs throughout the code which is nice to see. We obviously don't need everything in the PoC, but still good to have them written down. We can convert them to issues or something later.
When we do raft-based replication later, each replica will need to be added to the quorum in some explicit way and only these replicas in the quorum actually count when we count acks. We can keep this in mind and modify how we count what's "committed" and not, later.
There was a problem hiding this comment.
Few things @jjuleslasarte and I had discussed and would need to revisit at a later point:
- Move from offset tracking mechanism to raft's term/index/commit for consensus (modify getConsensusOffset).
- Figure out throttling mechanism on accumulation of buffer. Dropping COB on overrun isn't feasible.
- RAX to be replaced with hashtable (if deemed not efficient). Note: This is only ephemeral data though i.e. during the blocking phase.
Hook points to discuss:
- server.c:3794: preCall(); - Captures the replication offset (pre-execution)
- server.c:3994: postCall(); - Determines whether a write command got executed and handles special blocking (scripts)
- server.c:4528: preCommandExec(); - Prepares the individual client for blocking
- server.c:4533: postCommandExec(); - Blocks the individual client
- networking.c:1680: isClientReplyBufferLimited(c) - Response buffering and limiting
- replication.c:1426: postReplicaAck(); - Processes the acknowledgment and allows clients to progress up to a certain offset.
There was a problem hiding this comment.
@ranshid Do we remove the read handler callback while the client is blocked?
@allenss-amazon @yairgott @madolson @rjd15372 Could you folks take a look and provide your feedback?
// Describes a pre-execution COB offset for a client
typedef struct preExecutionOffsetPosition {
    // True if the pre execution offset/reply block are initialized
    bool recorded;
When would it be not recorded?
long long getConsensusOffset(const unsigned long numAcksNeeded) {
    const unsigned long numReplicas = listLength(server.replicas);
    if (numAcksNeeded == 0) {
        // If no ack is needed, then the consensus offset is the one primary is at.
        return server.primary_repl_offset;
    }

    // If the number of connected replicas is less than the number of required replicas,
    // return -1 because we don't have enough number of replicas for the ACK.
    if (numReplicas < numAcksNeeded) {
        return -1;
    }
This might as well be a callback
size_t getConsensusOffset();
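For reference while discussing the callback idea, here is a self-contained sketch of the offset-based rule that the quoted snippet appears to implement. The diff is cut off before the selection step, so picking the numAcksNeeded-th largest acknowledged offset is an assumption inferred from the descending sort elsewhere in the PR; `consensus_offset_sketch` and `offset_cmp_desc` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: descending order of long long offsets. */
static int offset_cmp_desc(const void *a, const void *b) {
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x < y) - (x > y);
}

/* Hypothetical helper. Returns:
 *   primary_offset  if no ack is needed,
 *   -1              if fewer than acks_needed replicas are connected
 *                   (or the scratch allocation fails),
 *   otherwise the highest offset acknowledged by at least acks_needed replicas. */
long long consensus_offset_sketch(const long long *acked, size_t n,
                                  size_t acks_needed, long long primary_offset) {
    if (acks_needed == 0) return primary_offset;
    if (n < acks_needed) return -1;
    assert(acked != NULL);

    /* Sort a copy so the caller's array is left untouched. */
    long long *sorted = malloc(n * sizeof(*sorted));
    if (sorted == NULL) return -1;
    memcpy(sorted, acked, n * sizeof(*sorted));
    if (n > 1) qsort(sorted, n, sizeof(*sorted), offset_cmp_desc);

    /* The k-th largest ack is the furthest point that k replicas have reached. */
    long long result = sorted[acks_needed - 1];
    free(sorted);
    return result;
}
```

A raft-based variant could keep the same signature and swap the body for a term/index/commit lookup, which is roughly what turning this into a callback would allow.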
    run_with_period(100) modulesCron();
}

run_with_period(1000) clearUncommittedKeysAcknowledged();
can this be very expensive for write heavy workloads?
There was a problem hiding this comment.
maybe we can find ways to optimize it later on.
    return false;
}

int isPrimaryDurabilityEnabled(void) {
Suggested change:
- int isPrimaryDurabilityEnabled(void) {
+ int durabilityEnabledOnPrimary(void) {
sarthakaggarwal97
left a comment
I am still in the process of absorbing this change. Leaving some minor thoughts. I will keep looking at it over the next few days.
 * Utility function to determine whether the durability flag has been enabled.
 * return 1 if durability is enabled, 0 otherwise.
 */
int isDurabilityEnabled(void) {
probably a server config?
    return;
}
if (c->clientDurabilityInfo.blocked_responses == NULL) {
    c->clientDurabilityInfo.blocked_responses = listCreate();
should this be created lazily for short lived clients which may not need it?
// don't bother sorting if there is only one replica.
if (numReplicas > 1) {
    qsort(replica_offsets, numReplicas, sizeof(long long), offsetSorterDesc);
maybe a min heap type structure might help to avoid sorting always?
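One way to read the min-heap suggestion is as a bounded heap of size k (k being the number of acks needed): a single pass over the replica offsets keeps the k largest seen so far, and the heap root is then the k-th largest. The sketch below assumes that reading; `kth_largest_offset` and `sift_down` are hypothetical names, and the caller supplies the `scratch` buffer of k slots so no allocation is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Restore the min-heap property downward from index i. */
static void sift_down(long long *heap, size_t len, size_t i) {
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < len && heap[l] < heap[m]) m = l;
        if (r < len && heap[r] < heap[m]) m = r;
        if (m == i) return;
        long long tmp = heap[i];
        heap[i] = heap[m];
        heap[m] = tmp;
        i = m;
    }
}

/* Returns the k-th largest value of offsets[0..n-1] in O(n log k) time,
 * or -1 if k == 0 or k > n. scratch must have room for k elements. */
long long kth_largest_offset(const long long *offsets, size_t n, size_t k,
                             long long *scratch) {
    if (k == 0 || k > n) return -1;
    assert(offsets != NULL && scratch != NULL);

    /* Build a min-heap from the first k offsets. */
    for (size_t i = 0; i < k; i++) scratch[i] = offsets[i];
    for (size_t i = k / 2; i-- > 0;) sift_down(scratch, k, i);

    /* Each larger offset evicts the current smallest of the top k. */
    for (size_t i = k; i < n; i++) {
        if (offsets[i] > scratch[0]) {
            scratch[0] = offsets[i];
            sift_down(scratch, k, 0);
        }
    }
    return scratch[0];
}
```

For the handful of replicas a primary typically has, the difference from qsort is likely negligible, so this mainly matters if the computation ends up on a hot path.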
We only need to verify that the total duration was at least 2 seconds; elapsed time is too variable to check an upper bound. Fixes valkey-io#2843 Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…o#2508) By default, when the number of elements in a zset exceeds 128, the underlying data structure adopts a skiplist. We can reduce memory usage by embedding elements into the skiplist nodes. Change the `zskiplistNode` memory layout as follows:

```
Before
                 +-------------+
         +-----> | element-sds |
         |       +-------------+
         |
 +------------------+-------+------------------+---------+-----+---------+
 | element--pointer | score | backward-pointer | level-0 | ... | level-N |
 +------------------+-------+------------------+---------+-----+---------+

After
 +-------+------------------+---------+-----+---------+-------------+
 + score | backward-pointer | level-0 | ... | level-N | element-sds |
 +-------+------------------+---------+-----+---------+-------------+
```

Before the embedded SDS representation, we include one byte representing the size of the SDS header, i.e. the offset into the SDS representation where that actual string starts. The memory saving is therefore one pointer minus one byte = 7 bytes per element, regardless of other factors such as element size or number of elements.

### Benchmark step
I generated the test data using the following lua script && cli command. And check memory usage using the `info` command.

**lua script**
```
local start_idx = tonumber(ARGV[1])
local end_idx = tonumber(ARGV[2])
local elem_count = tonumber(ARGV[3])

for i = start_idx, end_idx do
    local key = "zset:" .. string.format("%012d", i)
    local members = {}

    for j = 0, elem_count - 1 do
        table.insert(members, j)
        table.insert(members, "member:" .. j)
    end

    redis.call("ZADD", key, unpack(members))
end

return "OK: Created " .. (end_idx - start_idx + 1) .. " zsets"
```

**valkey-cli command**
`valkey-cli EVAL "$(cat create_zsets.lua)" 0 0 100000 ${ZSET_ELEMENT_NUM}`

### Benchmark result
| number of elements in a zset | memory usage before optimization | memory usage after optimization | change |
|-------|-------|-------|-------|
| 129 | 1047MB | 943MB | -9.9% |
| 256 | 2010MB | 1803MB | -10.3% |
| 512 | 3904MB | 3483MB | -10.8% |

---------
Signed-off-by: chzhoo <czawyx@163.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
) Test the SCAN consistency by alternating SCAN calls to primary and replica. We cannot rely on the exact order of the elements and the returned cursor number. --------- Signed-off-by: yzc-yzc <96833212+yzc-yzc@users.noreply.github.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…validation (valkey-io#2600) Addresses: valkey-io#2588

## Overview
Previously we called `emptyData()` during a full sync before validating that the RDB version is compatible. This change adds an rdb flag that allows us to flush the database from within `rdbLoadRioWithLoadingCtx`. This provides the option to only flush the data if the rdb has a valid version and signature. In the case where we do have an invalid version and signature, we don't call `emptyData`, so if a full sync fails for that reason a replica can still serve stale data instead of clients experiencing cache misses.

## Changes
- Added a new flag `RDBFLAGS_EMPTY_DATA` that signals to flush the database after rdb validation
- Added logic to call `emptyData` in `rdbLoadRioWithLoadingCtx` in `rdb.c`
- Added logic to not clear data if the RDB validation fails in `replication.c` using new return type `RDB_INCOMPATIBLE`
- Modified the signature of `rdbLoadRioWithLoadingCtx` to return RDB success codes and updated all calling sites.

## Testing
Added a tcl test that uses the debug command `reload nosave` to load from an RDB that has a future version number. This triggers the same code path that full syncs will use, and verifies that we don't flush the data until after the validation is complete. A test already exists that checks that the data is flushed: https://github.com/valkey-io/valkey/blob/unstable/tests/integration/replication.tcl#L1504

---------
Signed-off-by: Venkat Pamulapati <pamuvenk@amazon.com> Signed-off-by: Venkat Pamulapati <33398322+ChiliPaneer@users.noreply.github.com> Co-authored-by: Venkat Pamulapati <pamuvenk@amazon.com> Co-authored-by: Harkrishn Patro <bunty.hari@gmail.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…ey-io#2856) Fix a small oversight in the "Hash field TTL and active expiry propagates correctly through chain replication" test in `hashexpire.tcl`. The test did not wait for the initial sync of the chained replica, which made the test flaky. Signed-off-by: Arad Zilberstein <aradz@amazon.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
The original test code only checks:
1. wait_for_cluster_size 4, which calls cluster_size_consistent for every node. Inside that function, for each node, cluster_size_consistent queries cluster_known_nodes, which is calculated as (unsigned long long)dictSize(server.cluster->nodes). However, when a new node is added to the cluster, it is first created in the HANDSHAKE state, and clusterAddNode adds it to the nodes hash table. Therefore, it is possible for the new node to still be in HANDSHAKE status (processed asynchronously) even though it appears that all nodes “know” there are 4 nodes in the cluster.
2. cluster_state for every node, but when a new node is added, server.cluster->state remains FAIL.
Some handshake processes may not have completed yet, which likely causes the flakiness. To address this, added a --cluster check to ensure that the config state is consistent. Fixes valkey-io#2693. Signed-off-by: Hanxi Zhang <hanxizh@amazon.com> Co-authored-by: Binbin <binloveplay1314@qq.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
This commit adds script function flags to the module API, which allows function scripts to specify the function flags programmatically. When the scripting engine compiles the script code, it can extract the flags from the code and set them on the compiled function objects. --------- Signed-off-by: Ricardo Dias <ricardo.dias@percona.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
valkey-io#2833) In dual channel replication, when the rdb channel client finishes the RDB transfer, it enters the REPLICA_STATE_RDB_TRANSMITTED state. During this time, there is a brief window in which we are not able to see the connection in INFO REPLICATION. In the worst case, we might not see the connection for DEFAULT_WAIT_BEFORE_RDB_CLIENT_FREE seconds. I guess there is no harm in listing this state; showing connected_slaves but not showing the connection is bad when troubleshooting. Note that this also affects the `valkey-cli --rdb` and `--functions-rdb` options. Before the client is in the `rdb_transmitted` state and is released, we will now see it in the info (see the example later).

Before, not showing the replica info
```
role:master
connected_slaves:1
```
After, for dual channel replication:
```
role:master
connected_slaves:1
slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=rdb-channel
```
After, for valkey-cli --rdb-only and --functions-rdb:
```
role:master
connected_slaves:1
slave0:ip=xxx,port=xxx,state=rdb_transmitted,offset=0,lag=0,type=replica
```
Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…mited by COB (valkey-io#2824) After introducing the dual channel replication in valkey-io#60, we decided in valkey-io#915 not to add a new configuration item to limit the replica's local replication buffer, just use "client-output-buffer-limit replica hard" to limit it. We need to document this behavior and mention that once the limit is reached, all future data will accumulate in the primary side. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Closing this DRAFT. I'm incorporating @zuiderkwast's, @hpatro's and @sarthakaggarwal97's feedback into the work in progress on the durability branch. The tracking issue will be #3037, and I'll open an in-progress draft PR against the durability branch.
This draft PR is a POC for a dirty key blocking and tracking mechanism to support durability implemented with the WBL approach, as outlined in the durability HLD. It implements 'blocking' of:
Dirty key tracking implementation
Blocking of dirty reads and delaying ack of writes is centered around the ability to determine the set of keys that are dirty or uncommitted by the "quorum" of nodes in the cluster.
This can be realized by keeping track of the dirty items in some data structure (a radix tree in this draft), where each key is associated with the byte offset for which it is currently awaiting acknowledgements from replicas. An incoming read operation can quickly determine whether the keys it tries to access are dirty and, if so, what offset is required to unblock its responses. This mechanism can also serve if we want to allow a configurable "tail data loss" (i.e. allow up to x seconds / x amount of data to be acked before consensus).
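A rough sketch of that bookkeeping, under stated assumptions: the draft uses a radix tree, whereas this illustration uses a small fixed-size array to stay self-contained; `dirty_table` and the `dirty_*` functions are hypothetical names; and "committed" is taken to mean that the consensus offset has reached the offset recorded for the key.

```c
#include <assert.h>
#include <string.h>

#define DIRTY_MAX_KEYS 64
#define DIRTY_MAX_KEY_LEN 64

/* One tracked key and the replication offset it is waiting on. */
typedef struct {
    char key[DIRTY_MAX_KEY_LEN];
    long long offset;
    int used;
} dirty_entry;

typedef struct {
    dirty_entry slots[DIRTY_MAX_KEYS];
} dirty_table;

void dirty_init(dirty_table *t) {
    assert(t != NULL);
    memset(t, 0, sizeof(*t));
}

/* Marks key as dirty at offset; a later write raises the recorded offset.
 * Returns 0 on success, -1 if the table is full or the key is too long. */
int dirty_mark(dirty_table *t, const char *key, long long offset) {
    assert(t != NULL && key != NULL);
    if (strlen(key) >= DIRTY_MAX_KEY_LEN) return -1;
    dirty_entry *free_slot = NULL;
    for (int i = 0; i < DIRTY_MAX_KEYS; i++) {
        dirty_entry *e = &t->slots[i];
        if (e->used && strcmp(e->key, key) == 0) {
            if (offset > e->offset) e->offset = offset;
            return 0;
        }
        if (!e->used && free_slot == NULL) free_slot = e;
    }
    if (free_slot == NULL) return -1;
    strcpy(free_slot->key, key);
    free_slot->offset = offset;
    free_slot->used = 1;
    return 0;
}

/* Returns the offset a read of key must wait for, or -1 if the key is clean. */
long long dirty_blocking_offset(const dirty_table *t, const char *key) {
    assert(t != NULL && key != NULL);
    for (int i = 0; i < DIRTY_MAX_KEYS; i++) {
        const dirty_entry *e = &t->slots[i];
        if (e->used && strcmp(e->key, key) == 0) return e->offset;
    }
    return -1;
}

/* Drops every key whose offset is covered by consensus_offset.
 * Returns the number of keys removed. */
int dirty_clear_committed(dirty_table *t, long long consensus_offset) {
    assert(t != NULL);
    int removed = 0;
    for (int i = 0; i < DIRTY_MAX_KEYS; i++) {
        dirty_entry *e = &t->slots[i];
        if (e->used && e->offset <= consensus_offset) {
            e->used = 0;
            removed++;
        }
    }
    return removed;
}
```

The linear scans are only there to keep the sketch short; the point is the three operations (mark on write, look up on read, clear on ack), which any ordered or hashed structure could provide.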
The downside of this approach is that the response message gets buffered in memory on the valkey server. This creates a certain amount of memory pressure on the server in two categories of scenarios:
For those cases, we could implement a throttling mechanism in the valkey server.
Hooks/touchpoints
This POC hooks into the processing of the command in the following places. The existing flow for most commands is
processCommand -> call() -> c->cmd->proc()
This PR adds the following hooks:
The separation of preCall/postCall from preCommandExec/postCommandExec is so that we can handle MULTI and Lua.
In MULTI/EXEC, the individual commands in a transaction don't go through processCommand; they are invoked directly via call() from the EXEC command, but they form part of the same command, so the client needs to block at the position pre-execution of MULTI. I had included the support for MULTI (you can see it here) in this PR to better show this, but it was a bit too long for one draft, so I decided to leave that for the next PR.
Example
client sends SET FOO BAR
primary
TODO
Looking for feedback on the general approach, the touchpoints with the major codepaths and naming. I will edit this PR