Backport Unstable to 9.1 for RC2 by sarthakaggarwal97 · Pull Request #3519 · valkey-io/valkey

sarthakaggarwal97 · 2026-04-16T14:37:20Z

This PR backports commits from unstable to the 9.1 branch for the RC2 release.

Cherry-picks 60 commits from unstable into 9.1, covering all PRs marked "To be backported" in the Valkey 9.1 project board plus bug fixes, test fixes, and CI improvements.

Included PRs (60)

…#3359) The multiStateMemOverhead() function was incorrectly calculating the memory overhead for watched keys. It used sizeof(c->mstate->watched_keys) which is the size of the list structure itself, instead of sizeof(watchedKey) which is the actual per-key overhead. This was introduced in valkey-io#1405. Signed-off-by: Binbin <binloveplay1314@qq.com>

I'm developing a module to provide luajit as lua execution engine. See valkey-io#1229 for details. Unfortunately said module doesn't support debugging yet. So in order to test it with valkey tests scripting debug tests need to be skipped. This patch makes an anonymous test skippable by name. Signed-off-by: secwall <secwall@yandex-team.ru>

@zuiderkwast

These are forked from the RFC instructions created by @zuiderkwast and @hpatro in https://github.com/valkey-io/valkey-rfc/pulls/1 and https://github.com/valkey-io/valkey-rfc/pulls/6. It also includes the Atomic Slot Migration design to bootstrap the folder. --------- Signed-off-by: Jacob Murphy <jkmurphy@google.com>

) Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>

Probably added by mistake during some merge of valkey-io#1566 Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

…olation (valkey-io#3375) The daily workflow was directly invoking the `valkey-unit-gtests` executable. The intended invocation is to use `gtest-parallel` to ensure that the tests are executed in isolation. Signed-off-by: harrylin98 <harrylin980107@gmail.com>

…ey-io#3380) When a MATCH pattern maps to a specific slot, `CLUSTERSCAN` can skip directly to that slot instead of walking through all slots one by one. - On `cursor 0`, starts directly at the matching slot - If cursor is behind the matching slot, jumps forward - If cursor is ahead of the matching slot, we conclude the scan as we cannot match keys - If both SLOT and MATCH are provided but target different slots, returns 0 immediately Signed-off-by: nmvk <r@nmvk.com>
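The four rules above can be sketched as a small decision function. This is a simplified model in Python, not the server's C code: the name `next_scan_slot`, the `SCAN_DONE` sentinel, and the plain integer cursor are all assumptions made for illustration (the real cursor encodes more than a slot number).

```python
from typing import Optional

SCAN_DONE = None  # hypothetical sentinel: the scan can end right away


def next_scan_slot(cursor_slot: Optional[int],
                   match_slot: int,
                   requested_slot: Optional[int] = None) -> Optional[int]:
    """Pick the slot to scan next when MATCH maps to exactly one slot.

    cursor_slot is None for cursor 0 (a fresh scan); requested_slot
    models an explicit SLOT argument. Returns SCAN_DONE when no key
    can match any more.
    """
    # SLOT and MATCH disagree: nothing can ever match.
    if requested_slot is not None and requested_slot != match_slot:
        return SCAN_DONE
    # Fresh scan: start directly at the matching slot.
    if cursor_slot is None:
        return match_slot
    # Cursor is behind the matching slot: jump forward.
    if cursor_slot < match_slot:
        return match_slot
    # Cursor is already past the matching slot: conclude the scan.
    if cursor_slot > match_slot:
        return SCAN_DONE
    # Cursor is inside the matching slot: keep scanning it.
    return cursor_slot
```

For example, `next_scan_slot(50, 100)` skips ahead to slot 100, while `next_scan_slot(150, 100)` ends the scan.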

Previously, our workflow used a global concurrency group, which effectively limited execution to one running job and one pending job. Any additional requests were automatically canceled, preventing a true queue from forming.

We are now removing the concurrency restriction and allowing jobs to queue directly on the self-hosted runner. This lets multiple workflow runs be accepted and queued instead of being dropped.

While GitHub can accept workflow triggers at a high rate (e.g., hundreds per minute), actual execution is still constrained by runner capacity: in our case, a single runner processing one job at a time. Queued jobs are also subject to GitHub's 24-hour timeout policy, so any job that waits in the queue for more than 24 hours before starting is automatically canceled (timed out).

In practical terms, this approach improves reliability by eliminating premature cancellations, but the effective queue size is still bounded by how many jobs the runner can process within a 24-hour window. We could increase the number of runners to run these jobs in parallel.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

…o#3276) Pin package manager dependencies in CI workflows to improve the Pinned-Dependencies score in OpenSSF Scorecard.

Changes:
- benchmark-on-label.yml, benchmark-release.yml: add `--require-hashes` to `pip install` (related change in the valkey-perf-benchmark repo: valkey-io/valkey-perf-benchmark#44)
- ci.yml: pin `yamlfmt` to `v0.21.0` instead of `@latest`
- reply-schemas-linter.yml: use `npm ci` with `package-lock.json` instead of unpinned `npm install`; package files in `utils/reply-schema-linter/`

Signed-off-by: Roshaan Khatri <rvkhatri@amazon.com> Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

Upload the entire results directory instead of only metrics JSON files. This includes server logs which are useful for debugging benchmark failures. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

…3209) The `valkey-cli --cluster del-node` command fails when attempting to delete unreachable or failed nodes, reporting `No such node ID` even though the node exists in the cluster topology. The root cause is the command only loads information about reachable nodes, causing the lookup to fail. This PR added a new function for loading all nodes information to solve this. ### Implementation 1. Loading all nodes from gossip: - Added `clusterManagerLoadAllInfoFromNode()` that loads both reachable and unreachable nodes from cluster gossip - Extracts common logic into `clusterManagerLoadInfoCommon()` with an `include_unreachable` flag - Keeps the original `clusterManagerLoadInfoFromNode()` unchanged to avoid affecting existing callers 2. Added success message to be consistent with other cluster commands: `[OK] Node <id> removed from the cluster.` 3. Added test coverage for `del-node` which previously had none. 4. Load slot information from gossip for unreachable nodes in `clusterManagerNodeLoadInfo()` 5. Skip unreachable primaries in `clusterManagerNodeWithLeastReplicas()` ### Testing ``` ./runtest --single unit/cluster/cli [ok]: del-node: Cannot delete node with slots (9 ms) [ok]: del-node: Delete reachable node without slots (23 ms) [ok]: del-node: Delete unreachable node without slots (1333 ms) [ok]: del-node: Cannot delete unreachable primary with slots (3368 ms) ``` ``` valkey-cli --cluster del-node 127.0.0.1:7000 eb837ea7c48908e5304eafd8b1b3ced57147c448 Could not connect to Valkey at 127.0.0.1:7002: Connection refused >>> Removing node eb837ea7c48908e5304eafd8b1b3ced57147c448 from cluster 127.0.0.1:7000 >>> Sending CLUSTER FORGET messages to the cluster... >>> WARNING: Could not connect to node 127.0.0.1:7002, unable to send CLUSTER RESET. [OK] Node eb837ea7c48908e5304eafd8b1b3ced57147c448 removed from the cluster. 
``` ### Behavior change Before ``` $ valkey-cli --cluster del-node <entry-node-ip>:<entry-node-port> <failed-node-id> Could not connect to Valkey at <target-node-ip>:<target-node-port>: Connection refused >>> Removing node <id> from cluster <entry-node-ip>:<entry-node-port> [ERR] No such node ID <id> ``` After ``` $ valkey-cli --cluster del-node <entry-node-ip>:<entry-node-port> <failed-node-id> Could not connect to Valkey at <target-node-ip>:<target-node-port>: Connection refused >>> Removing node <id> from cluster <entry-node-ip>:<entry-node-port> >>> Sending CLUSTER FORGET messages to the cluster... >>> WARNING: Could not connect to node <target-node-ip>:<target-node-port>, unable to send CLUSTER RESET. [OK] Node <id> removed from the cluster. ``` ### Related Issue Fixes valkey-io#3208 --------- Signed-off-by: Yang Zhao <zymy701@gmail.com>

Signed-off-by: harrylin98 <harrylin980107@gmail.com>

…es (valkey-io#3398) ## Problem The `EntryTest.entryUpdate` unit test fails on macOS with (mentioned in valkey-io#3200): Expected: (entryMemUsage(e10)) < (current_embedded_allocation_size * 3 / 4) actual: 48 vs 48 ## Root Cause `entryMemUsage` for embedded entries reflects the actual zmalloc allocation size, which depends on the platform allocator's bucket sizes. Valkey's bundled jemalloc is configured with `LG_QUANTUM=3` (8-byte granularity), giving size classes: 8, 16, 24, 32, 40, 48, 56, 64, ... However, macOS libc uses 16-byte aligned buckets: 16, 32, 48, 64, 80, ... The test's value10 (21 chars) produces an entryReqSize of 40 bytes. Jemalloc has a 40-byte size class, so entryMemUsage returns 40. macOS rounds up to 48, which equals 3/4 of e9's 64-byte allocation, causing the strict less-than assertion to fail. ## Fix Shrink value10 from 21 to 13 characters, reducing entryReqSize from 40 to 32 bytes. Both allocators have a 32-byte bucket, and 32 < 48 holds on both platforms. ## Test All tests pass on macOS. Signed-off-by: Alina Liu <liusalisa6363@gmail.com>
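The arithmetic behind the failure and the fix can be checked with a short sketch. It assumes, as the size-class lists above suggest, that small allocations are simply rounded up to a multiple of the allocator's granularity (8 bytes for the bundled jemalloc, 16 bytes for macOS libc); the helper names are hypothetical and real allocators have more size classes than this model.

```python
def round_up(size: int, granularity: int) -> int:
    """Round size up to the next multiple of granularity."""
    return -(-size // granularity) * granularity


def assertion_holds(req_size: int, granularity: int, e9_alloc: int = 64) -> bool:
    """Model the test's check: entryMemUsage(e10) < e9_alloc * 3 / 4."""
    usage = round_up(req_size, granularity)
    return usage < e9_alloc * 3 // 4


# Old value10 (entryReqSize 40): passes with 8-byte classes, fails with 16-byte ones.
print(assertion_holds(40, 8), assertion_holds(40, 16))
# New value10 (entryReqSize 32): passes with both granularities.
print(assertion_holds(32, 8), assertion_holds(32, 16))
```

With a 40-byte request, the 16-byte allocator reports 48 bytes, and `48 < 48` is false; with a 32-byte request both allocators report 32 bytes.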

@ranshid

**Title:** ARM NEON SIMD optimization for pvFind() in vset.c **Description:** This PR resolves valkey-io#2806. Thanks to @ranshid for guidance on testing methodology and workload design. ### Summary This PR adds ARM64 NEON SIMD optimization to the `pvFind()` function in vset.c, which performs linear pointer search in pVector. The pVector is used internally by vset to track expired fields in hash objects (HFE). The optimization processes 4 pointers per iteration using 128-bit NEON vector instructions. ### Implementation Details - Added `pvFindSIMD_NEON64()` static inline helper function using NEON intrinsics - Modified `pvFind()` to use SIMD path when `len >= 8` on ARM64 - Added `#include <arm_neon.h>` guarded by `HAVE_ARM_NEON` - No changes to function signatures or external behavior - Scalar fallback remains for non-ARM64 platforms and small vectors ### Benchmark Results #### Micro benchmark (Apple M4 Pro, 50M iterations) Scalar version: ``` len16_mid | len= 16 pos= 8 | 2.8 ns len16_last | len= 16 pos= 15 | 5.0 ns len32_mid | len= 32 pos= 16 | 5.1 ns len64_last | len= 64 pos= 63 | 15.7 ns len127_last | len=127 pos=126 | 34.0 ns len127_notfound | len=127 pos= -1 | 33.8 ns ``` NEON version: ``` len16_mid | len= 16 pos= 8 | 2.0 ns | 1.42x len16_last | len= 16 pos= 15 | 2.7 ns | 1.85x len32_mid | len= 32 pos= 16 | 2.8 ns | 1.79x len64_last | len= 64 pos= 63 | 9.2 ns | 1.71x len127_last | len=127 pos=126 | 18.1 ns | 1.88x len127_notfound | len=127 pos= -1 | 17.8 ns | 1.90x ``` #### Production HFE benchmark (stress profile, 120s duration) Workload profile: - 5000 hashes with 64-127 fields each - Short TTLs (2-10 seconds) to trigger expiration - 30% deletes, 60% updates (both trigger pvFind) - 4 concurrent threads ``` Platform Mode Avg Time Throughput Speedup ----------- ------ -------- ---------- ------- Apple M4 scalar 94.3 ns 10.61 M/s baseline Apple M4 NEON 74.4 ns 13.44 M/s 1.27x (21% faster) ``` SIMD utilization (M4 NEON): ``` - 100% of calls used SIMD path - 99.3% 
of elements scanned via SIMD - 74.7% of matches found in SIMD section - 25.3% found in scalar tail (last 0-3 elements) ``` ### Platform Support - **ARM64 (aarch64)**: SIMD enabled via `HAVE_ARM_NEON` - **x86_64 / other**: Falls back to scalar implementation, no behavior change --------- Signed-off-by: Ahmad Belbeisi <ahmadbelb@gmail.com> Signed-off-by: Ahmad Belbeisi <ahmad.belbeisi@tum.de> Co-authored-by: Ran Shidlansik <ranshid@amazon.com>

…valkey-io#2227)

In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends a failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank.

We want to minimize this delay while ensuring safety. It is useful, for example, in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay if it is sure that it has the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that a replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag.
2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` flag. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is how the failover is initiated, with new steps marked with **(new)**:

1. A majority of primaries have marked another primary as PFAIL.
2. Some node counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster.
3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover:
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.
   c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set.
4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is, if
   a. a PONG is received from every other replica in the same shard (broadcast within the shard) and
   b. all replicas have marked their primary as FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.
5. When the delay has passed and no other replica has initiated failover, then initiate failover.

Notes:

* With the FORCE ACK flag in step 4, we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race, so our auth request may arrive at some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broadcasting another FAIL just before broadcasting the auth request.)
* 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that its primary is failing, it also means that the replication from the primary to that replica has stopped.
* 4(d) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and has no data (offset = 0), and if it wins the election, there is a risk of data loss (discussed and partially mitigated for the replica migration case in valkey-io#885). To avoid this case, skip this fast failover path if the replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
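The four conditions in step 4 can be summarized as a single predicate. The following is a minimal sketch in Python, not the cluster bus code: `ReplicaView` and `can_skip_failover_delay` are hypothetical names, and the two booleans stand in for state the real implementation derives from received messages.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ReplicaView:
    """What this replica knows about another replica in the same shard."""
    pong_received: bool    # a PONG arrived since the failover was scheduled
    my_primary_fail: bool  # CLUSTER_NODE_MY_PRIMARY_FAIL set in its last message


def can_skip_failover_delay(other_replicas: Iterable[ReplicaView],
                            my_rank: int,
                            my_repl_offset: int) -> bool:
    """Return True when the remaining failover delay may be skipped."""
    others = list(other_replicas)
    heard_from_all = all(r.pong_received for r in others)       # 4(a)
    all_agree_on_fail = all(r.my_primary_fail for r in others)  # 4(b)
    is_best_replica = my_rank == 0                              # 4(c)
    has_data = my_repl_offset != 0                              # 4(d)
    return heard_from_all and all_agree_on_fail and is_best_replica and has_data
```

With no other replicas in the shard, both `all(...)` checks are trivially true, which matches the single-replica case described earlier: the only replica can start the failover immediately as long as its offset is non-zero.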

… tests (valkey-io#3404) These failures seem to be attributed to a race condition in the Aborted test case. `rdb-key-save-delay` 10000 was being set after `$master exec` triggered the full resync. Since repl-diskless-sync-delay 0 was set, the master would immediately start streaming the RDB to the replica once it reconnected. On the ARM runner, when it's fast enough, the entire RDB generation and transfer could complete in ~78ms, before the delay was ever applied. This meant the replica would complete the swap and have 1010 keys instead of the expected 200 and there would be no async_loading window to observe or abort which led to the failures. We saw this in the daily test failure logs ``` 92948:S * RDB memory usage when created 110.85 Mb 92948:S * Done loading RDB, keys loaded: 1010, keys expired: 0 ``` The fix moves `rdb-key-save-delay 10000` to before `$master exec` to guarantee the delay is in effect on the master before the RDB generation begins. Closes valkey-io#3394, closes valkey-io#3395. Signed-off-by: Nikhil Manglore <nmanglor@amazon.com>

The RXE project should stay at the same version as the CI machine. Show `uname` in the RDMA CI job to help find the reason for the kmod installation failure. Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>

Fixes valkey-io#3299 Add brief guidance in valkey.conf explaining what the listpack thresholds control and the memory/CPU tradeoff when tuning them. --------- Signed-off-by: Tarte <emprimula@gmail.com> Signed-off-by: KimHuiSu <101166683+Tarte12@users.noreply.github.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…io#3416) fixes: valkey-io#3200 --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

Avoid freeing the cluster link when EAGAIN occurs, so that we can try again and keep the pending send messages. Signed-off-by: Binbin <binloveplay1314@qq.com>

…htable (valkey-io#3360)

Previously, watchForKey() checked for duplicate watched keys by iterating through the client's entire watched_keys list with O(N) complexity, where N is the total number of keys watched by the client. The time complexity of the WATCH command could therefore be quite poor, making it a slow command.

This commit introduces a per-db hashtable (watched_keys_by_db) in the client's multiState structure to enable O(1) duplicate key detection. The hashtable is lazily allocated only when the client starts watching keys, minimizing memory overhead for clients that don't use WATCH.

The per-db hashtable stores watchedKey* directly as the hashtable entry since it already contains the key, so no custom destructors are needed. Memory management remains centralized in the watched_keys list.

This optimization is especially beneficial when a client watches many keys across different databases, as the check no longer scales with the total watched key count. This might be a minor scenario, but there's no harm in optimizing it.

The following test in multi.tcl took 15s before this patch and only 50ms after it.

```
set elements {}
for {set i 0} {$i < 50000} {incr i} {
    lappend elements key-$i
}
r watch {*}$elements
r watch {*}$elements
```

Signed-off-by: Binbin <binloveplay1314@qq.com>
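The idea can be illustrated with a small Python model, where a `set` per database plays the role of the per-db hashtable. This is only a sketch under stated assumptions: the `WatchState` class and its method are hypothetical, and the real code stores `watchedKey*` entries in a C hashtable rather than strings in a set.

```python
class WatchState:
    """Toy model of a client's WATCH bookkeeping."""

    def __init__(self):
        self.watched_keys = []          # the list that owns every entry
        self.watched_keys_by_db = None  # lazily created: db id -> set of keys

    def watch(self, db: int, key: str) -> bool:
        """Watch key in db; return False if it was already watched."""
        if self.watched_keys_by_db is None:
            # Allocate only once the client actually uses WATCH.
            self.watched_keys_by_db = {}
        keys = self.watched_keys_by_db.setdefault(db, set())
        if key in keys:                 # average O(1) duplicate check
            return False
        keys.add(key)
        self.watched_keys.append((db, key))
        return True


state = WatchState()
elements = [f"key-{i}" for i in range(50000)]
for _ in range(2):                      # mirrors the two WATCH calls in the test
    for key in elements:
        state.watch(0, key)
print(len(state.watched_keys))
```

The second pass finds every key through the per-db set instead of walking the whole list, which is the behavior the Tcl test above exercises.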

In Daily test runs with the `--accurate` flag, the corrupt-dump-fuzzer test runs for 10 minutes (600 seconds) with "sanitize_dump: no" and then another 10 minutes with "sanitize_dump: yes". This causes the runner to time out and the whole test job to be aborted with a SIGTERM from the runner. Example: 13697:signal-handler (1774917673) Received SIGTERM scheduling shutdown... This change reduces the hard-coded fuzzer run from 10 minutes to 1 minute. We have many test jobs, so the fuzzer gets plenty of time to run anyway. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…#3424) The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners. Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds. Introduced in valkey-io#2227. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…ty primary (valkey-io#2811)

## Summary

This PR handles a network race condition that would cause a replica to read a stale cluster packet and then incorrectly promote itself to an empty primary within an existing shard.

## Issue

The test case "Migrated replica reports zero repl offset and rank, and fails to win election - sigstop" in `replica-migration.tcl` is a known example where the network race condition can occur. Here's the timeline:

- T1: Slot migration. R7 and R3 both try to replicate from R0. Only R3 succeeds, triggering a BGSAVE on R0, while R7 blocks in `receiveSynchronousResponse()`.
- T2: While R7 is blocked, R0 is SIGSTOP'd by the test. R4 wins the election and becomes the new primary. R4 sends PINGs to R7. These PINGs land in R7's kernel TCP receive buffer but are not read by R7 yet.
- T3 (5s after T1, due to receiveSynchronousResponse): R7 wakes up:
  - It reads from inbound links and finds out that the remote nodes have already closed their end (they detected R7 as FAIL), getting "I/O error: connection closed" on each. R7 calls `accept()` for the new connections that were established during the block. These connections carry data that was sent seconds ago while R7 was blocked.
  - R7's outbound links are still valid, so it sends PING to R4.
- T4: R7 receives and processes a fresh PONG packet from R4 on the outbound link, reconfiguring itself to follow R4.
- T5: R7 reads a stale PING packet from R4 via the inbound link. This packet was generated when R4 was still following R0, so R7 incorrectly believes R4 is still following R0 and reconfigures itself from a replica of R4 to a replica of R0.
- T6: R7 finds that R0 is FAIL, so it starts an election, wins, and becomes an empty primary.

## Analysis

In T4, R4 is the new primary, and R7 is reconfiguring itself as a replica. So in R7's view, R4 is the primary, and myself (R7) is a replica. In T5, there is a stale packet from the sender (R4), which says: the sender (R4) is a replica and R0 is the primary.

We already had logic for stale packets, meaning we would try to ignore stale packets that would cause exceptions. In T5:

- sender_claims_to_be_primary is false since R4 is saying it is a replica.
- sender_last_reported_as_primary is true since in R7's view, R4 is a primary.
- sender_claimed_primary is R0, and the sender (R4) and sender_claimed_primary (R0) are in the same shard.
- nodeEpoch(sender_claimed_primary) is R0's epoch. R0 is an old primary that is dead (but not yet marked as such).
- sender_claimed_config_epoch is R0's epoch since R4 is a replica, and R0 is R4's primary.
- nodeEpoch(sender_claimed_primary) == sender_claimed_config_epoch, so the check does not trigger and we process a stale packet. At this point, the packet is not stale from R4's and R0's point of view.
- But it is a stale packet in my (R7's) view. In R7's local view, R4 is the new primary and it should have a bigger epoch, that is, nodeEpoch(sender) should be > sender_claimed_config_epoch.

## Fix

The PR fixes the issue by enhancing the existing guardrail logic against stale packets. Previously, that logic only detected `nodeEpoch(sender_claimed_primary) > sender_claimed_config_epoch` as a stale packet. Now it also checks `nodeEpoch(sender) > sender_claimed_config_epoch` to make sure we have an up-to-date primary-replica chain.

Signed-off-by: Zhijun <dszhijun@gmail.com> Co-authored-by: Binbin <binloveplay1314@qq.com>
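The difference between the old and the new guard can be seen by plugging the T5 situation into a small model. This is a sketch only: the function name, the `check_sender_epoch` switch, and the epoch numbers are made up for illustration, and it omits the surrounding preconditions (sender claims to be a replica, was last seen as a primary, same shard).

```python
def is_stale_packet(sender_epoch: int,
                    claimed_primary_epoch: int,
                    sender_claimed_config_epoch: int,
                    check_sender_epoch: bool = True) -> bool:
    """Return True if the packet should be ignored as stale.

    sender_epoch models nodeEpoch(sender) in the receiver's local view,
    claimed_primary_epoch models nodeEpoch(sender_claimed_primary).
    """
    # Original guard: the claimed primary is already known at a newer epoch.
    if claimed_primary_epoch > sender_claimed_config_epoch:
        return True
    # Added guard: the sender itself is already known at a newer epoch.
    if check_sender_epoch and sender_epoch > sender_claimed_config_epoch:
        return True
    return False


# Hypothetical epochs for T5: R0 (old primary) has epoch 1, R4 won the
# election with epoch 5, and the stale PING still carries R0's epoch 1.
print(is_stale_packet(5, 1, 1, check_sender_epoch=False))  # old logic
print(is_stale_packet(5, 1, 1))                            # new logic
```

With only the original guard, the comparison is `1 > 1` and the packet is processed; the added comparison `5 > 1` rejects it.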

Fixing multiple flaky tests.

- slave buffer are counted correctly in tests/unit/maxmemory.tcl
- Memory efficiency with values in range * in tests/unit/memefficiency.tcl

These tests send large numbers of pipelined commands using deferring clients without reading replies, causing the server's client output buffer to grow. On slow CI runners, this leads to TCP backpressure and I/O errors that crash the test runner. Fix: Use CLIENT REPLY OFF to suppress reply generation, matching the pattern from commit 87d2330.

---

- Sub-replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl
- New non-empty replica reports zero repl offset and rank, and fails to win election in tests/unit/cluster/replica-migration.tcl

In the replica-migration tests, a MOVED error results in a Tcl exception. After failover, wait_for_condition blocks issue GET commands to cluster nodes that may not have fully updated their slot routing. An unhandled MOVED exception crashes the test runner. Fix: Wrap the condition in catch so MOVED errors are retried. Also wrap debug prints in the else clause. Fixes the following tests:

---

- Replica can update the config epoch when trigger the failover - automatic in tests/unit/cluster/failover2.tcl

Increase wait timeout for failover expiry. The test waits 10 seconds for "Failover attempt expired", but the default cluster-node-timeout in start_cluster is 3000ms, making auth_timeout 6 seconds plus ~3 seconds for failure detection. This barely fits in 10 seconds and fails on slow CI runners. Fix: Increase wait from 1000×10ms to 1200×50ms (60 seconds).

---

- dual-channel-replication lazyfree test

The test looks up the replica's main-channel connection id after writing 50MB of data. On slow CI runners, the replica connection may have been disconnected by the output buffer soft limit (64MB/60s) before the lookup, causing get_client_id_by_last_cmd to return empty. Two changes:

1. Move the connection id lookup before the write loop, while the sync is known to be in progress.
2. Reduce writes from 50 x 1MB to 10 x 1MB. The test only needs enough data to exceed the lazyfree threshold (64 blocks ~= 1MB). 10MB is sufficient and avoids approaching the output buffer limit.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Found while implementing `clusterNodeGetSlotRangeEnd` for `CLUSTERSCAN` range-bounded scanning, which uses a similar approach. When tested on s390x via Docker and QEMU, the slot boundary came out as 5440 instead of 5461.

```
127.0.0.1:6001> clusterscan 0-{06S}-0
1) "0-{4HD}-0"
2) 1) "{06S}key1"
127.0.0.1:6001> keyslot 0-{4HD}-0
(error) ERR unknown command 'keyslot', with args beginning with: '0-{4HD}-0'
127.0.0.1:6001> cluster keyslot 0-{4HD}-0
(integer) 5440
127.0.0.1:6001> cluster nodes
08e28d7e8dcfc731ac537d0518bfa32577da6ec7 127.0.0.1:6002@16002 master - 0 1774415944347 2 connected 5461-10922
53f6d4e61eee13ea98441eec05b3c4c95c6c83a4 127.0.0.1:6003@16003 master - 0 1774415943314 3 connected 10923-16383
46664ac624f001cc0708e8ddcfcc9e45e8f444a0 127.0.0.1:6001@16001 myself,master - 0 0 1 connected 0-5460
```

Updating it here as well, since it uses the same approach and does not apply `memrev64ifbe` followed by `memcpy`.

Signed-off-by: nmvk <r@nmvk.com> Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Binbin <binloveplay1314@qq.com>

Replace 'const int seqBufferMaxLength' with a #define to avoid a variable-length array warning (-Wgnu-folding-constant) in C. Also add -Werror to the linenoise Makefile so future warnings are caught at build time. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>

…3420) valkey-cli and valkey-benchmark link only a small subset of .o files and do not include cluster_migrateslots.o. At -O3 this works because the compiler either inlines or discards the unused static function, so the symbol reference never reaches the linker. At -O0 (no inlining, no dead-code elimination), the compiler emits the full body of getClientType into every .o that includes server.h, producing an unresolved reference to _isImportSlotMigrationJob at link time. The fix is to avoid including server.h as a compilation dependency of the external applications. Fixes: valkey-io#3415 --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

…r Valkey commands. (valkey-io#3309) I was fixing grammar issues in the valkey-swift client (valkey-io/valkey-swift#357), and would like to apply the same fixes - missing words, full sentence punctuation, and imperative verb forms - to the breadth of the commands. --------- Signed-off-by: Joe Heck <j_heck@apple.com> Signed-off-by: Joseph Heck <j_heck@apple.com> Co-authored-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com> Co-authored-by: Lucas Yang <lucasyonge@gmail.com>

codecov · 2026-04-16T15:17:10Z

Codecov Report

❌ Patch coverage is 80.11118% with 322 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.32%. Comparing base (6b85ca4) to head (96eca79).
⚠️ Report is 67 commits behind head on 9.1.

Files with missing lines	Patch %	Lines
src/module.c	38.99%	194 Missing ⚠️
src/io_threads.c	72.66%	79 Missing ⚠️
src/networking.c	88.37%	15 Missing ⚠️
src/valkey-cli.c	83.87%	10 Missing ⚠️
src/server.c	90.36%	8 Missing ⚠️
src/call_reply.c	97.16%	4 Missing ⚠️
src/t_string.c	92.15%	4 Missing ⚠️
src/rdb.c	66.66%	2 Missing ⚠️
src/zmalloc.c	80.00%	2 Missing ⚠️
src/cluster_legacy.c	97.14%	1 Missing ⚠️
... and 3 more

Additional details and impacted files

@@            Coverage Diff             @@
##              9.1    #3519      +/-   ##
==========================================
+ Coverage   74.57%   76.32%   +1.75%     
==========================================
  Files         130      161      +31     
  Lines       72731    80705    +7974     
==========================================
+ Hits        54239    61602    +7363     
- Misses      18492    19103     +611

Files with missing lines	Coverage Δ
src/aof.c	`80.30% <100.00%> (-0.06%)`	⬇️
src/cluster.c	`92.16% <100.00%> (+0.16%)`	⬆️
src/cluster_migrateslots.c	`92.27% <100.00%> (+<0.01%)`	⬆️
src/commandlog.c	`95.79% <100.00%> (-0.79%)`	⬇️
src/commands.def	`100.00% <ø> (ø)`
src/config.c	`78.09% <100.00%> (+0.38%)`	⬆️
src/crc16.c	`100.00% <100.00%> (ø)`
src/eval.c	`91.71% <100.00%> (+0.28%)`	⬆️
src/fuzzer_command_generator.c	`76.75% <ø> (-0.21%)`	⬇️
src/hashtable.c	`97.42% <100.00%> (+4.54%)`	⬆️
... and 27 more

... and 53 files with indirect coverage changes


dvkashapov · 2026-04-20T15:36:46Z

@sarthakaggarwal97 can you also backport #3306? It is targeted for 9.0 and 9.1, just got merged, would be awesome to include it here. Also do we need to mention it in release notes? Some users may see more client evictions that they have not seen previously because of improved tracking

…key-io#3464) ## Problem The test `Test module aof save on server start from empty` in `tests/unit/moduleapi/hooks.tcl` sporadically crashes with `I/O error reading reply`. **Frequency:** 2 out of 15 days (March 26 on `centosstream9-tls-module-no-tls`, April 8 on `fedorarawhide-jemalloc`). **Example failing run:** https://github.com/valkey-io/valkey/actions/runs/24110987718/job/70345236353 ## Root Cause The crash is a **use-after-unload** in the auth test module's blocking authentication thread, NOT a timing issue in the AOF test. The crash log from April 8 shows: ``` 71112:M 00:42:59.710 * Module testacl unloaded 71112:M 00:42:59.711 # crashed by signal: 11, si_code: 1 71112:M 00:42:59.711 # Crashed running the instruction at: 0x7f9dc717384b ``` The sequence: 1. `blocking_auth_cb` spawns a background thread (`AuthBlock_ThreadMain`) that sleeps 500ms 2. Thread wakes, calls `ValkeyModule_UnblockClient()` → main thread processes unblock, decrements `module->blocked_clients` 3. Auth command completes, test calls `r module unload testacl` 4. `moduleUnloadInternal` checks `blocked_clients == 0` if true, proceeds with `dlclose()` 5. **But the background thread is still executing cleanup code** (freeing strings, returning from function) 6. Thread returns into unmapped memory → **SIGSEGV** The `invalidFunctionWasCalled` in the stack trace is the crash handler's safety stub, and the crashing address `0x7f9dc717384b` is in the unmapped auth.so address space. ## Fix Track the background thread ID and `pthread_join()` it in `ValkeyModule_OnUnload` before the module is dlclose'd. This ensures the thread has fully exited before the code is unmapped. The key insight is that `ValkeyModule_UnblockClient()` signals "auth is done" but not "thread is done" — the thread still has cleanup code to execute after that call. `pthread_join()` is the correct synchronization point because it only returns after the thread has fully exited. 
No mutex is needed since both `blocking_auth_cb` (which creates the thread) and `OnUnload` (which joins it) run on the main event loop thread. Changes to `tests/modules/auth.c`: - Add global `blocking_auth_tid` and `blocking_auth_tid_valid` flag - Set `blocking_auth_tid_valid = 1` after successful `pthread_create` - In `OnUnload`, `pthread_join` the thread if one was created ## Testing Ran `unit/moduleapi/hooks` 100 loops on rpm-distros and ubuntu runners — **all passed**: - **Workflow run:** https://github.com/roshkhatri/valkey/actions/runs/24164276124 - **Config:** `--loops 100 --single unit/moduleapi/hooks` on `almalinux8`, `almalinux9`, `fedoralatest`, `fedorarawhide`, `centosstream9`, `ubuntu-jemalloc`, `ubuntu-arm` - **Result:** 7/7 jobs ✅, zero failures across 700 total test iterations Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
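The difference between "auth is done" and "thread is done" can be modeled with Python's `threading` module, where `Thread.join()` plays the role of `pthread_join()`. This is an analogy under stated assumptions, not the test module's C code: the class and attribute names are hypothetical, and the short sleep stands in for the cleanup code that runs after the unblock call.

```python
import threading
import time


class BlockingAuthModule:
    """Toy model of a module that spawns a background auth thread."""

    def __init__(self):
        self.worker = None                  # models blocking_auth_tid
        self.unblocked = threading.Event()  # models "client was unblocked"
        self.worker_finished = False

    def blocking_auth(self):
        # Remember the thread so that unload can wait for it.
        self.worker = threading.Thread(target=self._thread_main)
        self.worker.start()

    def _thread_main(self):
        self.unblocked.set()   # analogous to ValkeyModule_UnblockClient()
        time.sleep(0.05)       # cleanup still running after the unblock
        self.worker_finished = True

    def on_unload(self):
        # Wait for the thread to exit before the code would be unmapped.
        if self.worker is not None:
            self.worker.join()
            self.worker = None


module = BlockingAuthModule()
module.blocking_auth()
module.unblocked.wait()  # the auth command has completed at this point
module.on_unload()       # returns only after the worker has fully exited
print(module.worker_finished)
```

Without the `join()` in `on_unload`, the caller could proceed as soon as `unblocked` is set while the worker is still running, which is the window in which the real module was unloaded.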

The test was accidentally waking the IO threads while trying to check that they had gone idle. After the recent IO-thread refactor in valkey-io#3324, the [test](https://github.com/valkey-io/valkey/pull/3324/changes#diff-21314ec3a338f739eab1536f91f528d1efe7c6a93935a71b9c02f77a3858f121R112) started forcing `io-threads-always-active`, and its repeated `INFO` polling counted as fresh activity. So instead of just observing the worker threads, the test kept reactivating them and then flaked. --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com> Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

`hpersistCommand` calls `addReplyArrayLen` before `lookupKeyWrite` + `checkType`. When HPERSIST targets a non-hash key, the server writes a RESP array header followed by a WRONGTYPE error — a malformed response that permanently desynchronizes the client connection. This moves `lookupKeyWrite` + `checkType` before `addReplyArrayLen`, matching the pattern used by every other HFE command (e.g. `hgetdelCommand`, `hexpireGenericCommand`). Added a test for HPERSIST on a wrong-type key. Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
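The ordering issue can be shown with a toy reply buffer. This sketch assumes nothing about the real server internals: `Reply`, `reply_append` and `hpersist_like` are hypothetical names, and the per-field `:-1` replies are placeholders. The point is only that the type check runs before any part of the array reply is emitted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum { TYPE_STRING, TYPE_HASH };

/* Hypothetical reply buffer standing in for a client's output buffer. */
typedef struct {
    char buf[128];
    size_t len;
} Reply;

static void reply_append(Reply *r, const char *s) {
    assert(r != NULL);
    size_t n = strlen(s);
    if (r->len + n >= sizeof(r->buf)) return; /* sketch: drop on overflow */
    memcpy(r->buf + r->len, s, n);
    r->len += n;
    r->buf[r->len] = '\0';
}

/* Validate first; write the array header only when the command can no
 * longer fail with a type error. */
void hpersist_like(Reply *r, int key_type, int numfields) {
    char header[32];
    if (key_type != TYPE_HASH) {
        reply_append(r, "-WRONGTYPE Operation against a key holding the wrong kind of value\r\n");
        return;
    }
    snprintf(header, sizeof(header), "*%d\r\n", numfields);
    reply_append(r, header);
    for (int i = 0; i < numfields; i++) reply_append(r, ":-1\r\n"); /* placeholder results */
}
```

With the checks in the opposite order, the wrong-type case would produce `*2\r\n` followed by an error line, which a RESP client cannot parse as a two-element array.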

This improves COB memory tracking when using copy avoidance for bulk string replies. This fix addresses underestimation of client memory usage that occurred when reply buffers stored pointers to shared `robj` instead of copying data. IO threads calculate actual reply sizes by calling `sdslen()` on strings before writing; this requires an atomic `tracked_for_cob` flag in payload headers to prevent race conditions and double accounting. See valkey-io#2396 --------- Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

Add a build option to compile the Lua scripting engine as a static module and wire the server to load it directly at startup when enabled. The module load path now resolves on-load and on-unload entry points from the main binary, and the module lifecycle keeps those callbacks so unload works without a shared library handle.

The Lua module build was updated to support both static and shared variants, with the static path exporting visible wrapper symbols and linking the server with the module archive. While touching the Lua code, a few internal symbols were renamed for consistency and the monotonic time helper was clarified.

Note that this PR addresses the Lua module, but the same approach can be applied to other "core" modules (such as Bloom, Json, Search and others). With this change, it will be easier to ship a Valkey bundle with modules.

Areas touched:

* CMake
* Makefile
* Lua scripting module
* Core module loading

**Generated by CodeLite**

---------

Signed-off-by: Eran Ifrah <eifrah@amazon.com>

## Add Command Result Event Notifications for Modules

### Summary

1. Adds new server events `ValkeyModuleEvent_CommandResultSuccess` and `ValkeyModuleEvent_CommandResultFailure` that can notify subscribed modules after command execution. This enables modules to implement audit logging, error monitoring, performance tracking, and observability without modifying core server code.
2. Adds new server event `ValkeyModuleEvent_CommandResultACLDenied` for commands rejected by ACL. Together with PR valkey-io#2237 this covers auditing of authentication and authorisation.

### Motivation

There is currently no module API to observe command outcomes after execution or to capture ACL denied commands. Modules that need audit logging or error monitoring have no mechanism to be notified when commands succeed or fail, what arguments were used, how long they took, or how many keys were modified. This feature fills that gap using the existing `ValkeyModule_SubscribeToServerEvent()` infrastructure.

### API

#### Events

| Event | Description |
|---|---|
| `ValkeyModuleEvent_CommandResultSuccess` | Fired after a command completes successfully |
| `ValkeyModuleEvent_CommandResultFailure` | Fired after a command returns an error |
| `ValkeyModuleEvent_CommandACLDenied` | Fired after a command is rejected by ACL |

These are separate events (not sub-events), so modules can, for example, subscribe only to failures without incurring any callback overhead for successful commands.

#### Event Data: `ValkeyModuleCommandResultInfo`

The `data` pointer passed to the callback can be cast to `ValkeyModuleCommandResultInfo`:

```c
typedef struct ValkeyModuleCommandResultInfo {
    uint64_t version;          /* Version of this structure for ABI compat. */
    const char *command_name;  /* Full command name (e.g., "SET", "CLIENT|LIST"). */
    long long duration_us;     /* Execution duration in microseconds. */
    long long dirty;           /* Number of keys modified. */
    uint64_t client_id;        /* Client ID that executed the command. */
    int is_module_client;      /* 1 if command was from RM_Call, 0 otherwise. */
    int argc;                  /* Number of command arguments. */
    ValkeyModuleString **argv; /* Command arguments array (zero-copy, read-only). */
    int acl_deny_reason;       /* ACL_DENIED_CMD/KEY/CHANNEL/AUTH; 0 for non-ACL events */
    const char *acl_object;    /* Denied resource name (key/channel); NULL for CMD/AUTH */
} ValkeyModuleCommandResultInfoV1;
```

The struct is versioned (`VALKEYMODULE_COMMANDRESULTINFO_VERSION`) for forward-compatible API evolution.

### Usage Example

```c
/* Callback receives events for whichever event(s) you subscribed to */
void OnCommandResult(ValkeyModuleCtx *ctx, ValkeyModuleEvent eid, uint64_t subevent, void *data) {
    VALKEYMODULE_NOT_USED(ctx);
    VALKEYMODULE_NOT_USED(subevent);

    ValkeyModuleCommandResultInfo *info = (ValkeyModuleCommandResultInfo *)data;
    if (info->version != VALKEYMODULE_COMMANDRESULTINFO_VERSION) return;

    int failed = (eid.id == VALKEYMODULE_EVENT_COMMAND_RESULT_FAILURE);

    /* Access fields directly (client_id is cast to match the %llu specifier) */
    printf("command=%s status=%s duration=%lldus dirty=%lld client=%llu\n",
           info->command_name, failed ? "FAIL" : "OK", info->duration_us,
           info->dirty, (unsigned long long)info->client_id);

    /* Access argv (read-only, zero-copy) */
    for (int i = 0; i < info->argc; i++) {
        size_t len;
        const char *arg = ValkeyModule_StringPtrLen(info->argv[i], &len);
        printf(" argv[%d] = %.*s\n", i, (int)len, arg);
    }
}

/* Subscribe in ValkeyModule_OnLoad or at runtime */

/* Option A: command failures only (recommended for audit logging) */
ValkeyModule_SubscribeToServerEvent(ctx, ValkeyModuleEvent_CommandResultFailure, OnCommandResult);

/* Option B: command successes only */
ValkeyModule_SubscribeToServerEvent(ctx, ValkeyModuleEvent_CommandResultSuccess, OnCommandResult);

/* Option C: both command outcomes */
ValkeyModule_SubscribeToServerEvent(ctx, ValkeyModuleEvent_CommandResultSuccess, OnCommandResult);
ValkeyModule_SubscribeToServerEvent(ctx, ValkeyModuleEvent_CommandResultFailure, OnCommandResult);

/* Subscribe to ACL Denied */
ValkeyModule_SubscribeToServerEvent(ctx, ValkeyModuleEvent_CommandResultACLDenied, OnCommandResult);

/* Unsubscribe: pass a NULL callback */
ValkeyModule_SubscribeToServerEvent(ctx, ValkeyModuleEvent_CommandResultFailure, NULL);
```

### Design Decisions

- **Separate events instead of sub-events**: Modules subscribing only to failures have zero overhead for successful commands (~2ns listener-list check vs ~30ns callback invocation per command). This is critical since success events fire on the hot path of every command.
- **Stack-allocated info struct**: The `ValkeyModuleCommandResultInfoV1` is built on the stack, with no heap allocation per event.
- **Zero-copy argv**: Arguments are passed directly from the client's argv array. Any integer-encoded arguments (from `tryObjectEncoding()` during command execution) are decoded to string-encoded objects before being passed to the callback, ensuring compatibility with `ValkeyModule_StringPtrLen()`.
- **Early exit**: If no modules are subscribed to any server events, the event firing function returns immediately before building the info struct.
- **Uses existing server event infrastructure**: Follows the `ValkeyModule_SubscribeToServerEvent()` pattern used by all other server events, rather than introducing a new callback mechanism.

### Files Changed

| File | Change |
|---|---|
| `src/valkeymodule.h` | Event IDs, event constants, `ValkeyModuleCommandResultInfoV1` struct |
| `src/module.c` | `moduleFireCommandResultEvent()`, event documentation, event version entries |
| `src/module.h` | Function declaration |
| `src/server.c` | Call `moduleFireCommandResultEvent()` from `call()` after command execution |
| `src/server.c` | Call to `moduleFireCommandACLDeniedEvent` in `processCommand` after ACL rejection |
| `tests/modules/commandresult.c` | Test module exercising the full API |
| `tests/unit/moduleapi/commandresult.tcl` | Integration tests |

---------

Signed-off-by: martinrvisser <mvisser@hotmail.com>
Signed-off-by: martinrvisser <martinrvisser@users.noreply.github.com>
Co-authored-by: Ricardo Dias <rjd15372@gmail.com>

valkey-io#3324 introduced `BATCH_SIZE` as a `const int` local variable and used it as an array bound. Clang 17 rejects this with:

```
io_threads.c:305:22: error: variable length array folded to constant array as an extension [-Werror,-Wgnu-folding-constant]
  305 |     void *batch_jobs[BATCH_SIZE];
      |                      ^~~~~~~~~~
1 error generated.
make[1]: *** [io_threads.o] Error 1
make: *** [all] Error 2
```

Older Clang versions do not emit this warning, which may be why CI passed. Fix by promoting `BATCH_SIZE` to a file-scope `#define`.

Signed-off-by: Yang Zhao <zymy701@gmail.com>
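For context, a `const int` object is not an integer constant expression in C, so an array declared with it as a bound is formally a variable length array. A small sketch of the macro form follows; the value 16 and the helper `batch_capacity` are assumptions for illustration, not the values used in `io_threads.c`.

```c
#include <assert.h>
#include <stddef.h>

/* With `const int BATCH_SIZE = 16;` inside a function, the declaration
 * `void *batch_jobs[BATCH_SIZE];` is a VLA that Clang folds to a constant
 * array only as an extension. A macro expands to a literal, so the bound
 * is a true constant expression. */
#define BATCH_SIZE 16 /* value assumed for illustration */

static_assert(BATCH_SIZE > 0, "array bound must be positive");

/* Returns the number of slots in a fixed-size batch array. */
size_t batch_capacity(void) {
    void *batch_jobs[BATCH_SIZE]; /* ordinary fixed-size array */
    return sizeof(batch_jobs) / sizeof(batch_jobs[0]);
}
```

An `enum { BATCH_SIZE = 16 };` would also give a constant expression; the commit message only states that a file-scope `#define` was chosen.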

This deflakes all variants of `diskless replicas drop during rdb pipe`. The main issue turned out to be that the test was too sensitive to timing and log ordering under TLS, not that the core behavior was wrong. This keeps the same five subcases (no, slow, fast, all, timeout) but makes them much less CI-fragile. CI passes 200 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/24547258515 --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> Signed-off-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com> Co-authored-by: Sarthak Aggarwal <25262500+sarthakaggarwal97@users.noreply.github.com>

…led (valkey-io#3458)

When the close_asap flag is set, set bytes read to 0.

In `readToQueryBuf`, `c->nread` represents the number of bytes read. When the close_asap flag is set, `c->nread` is not reset to 0, which breaks that invariant. IO threads then incorrectly think there is data to read, which results in a crash. This change fixes the bug.

To elaborate on the possible race:

1. An IO thread job for reading a query from a client is enqueued as part of an epoll event - https://github.com/valkey-io/valkey/blob/unstable/src/io_threads.c#L417.
2. Later the client is freed asynchronously and marked as close_asap - https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L2175
3. While processing the IO thread job for the client, `iothreadReadQueryFromClient` is invoked. Here, [`readToQueryBuf`](https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L6497) returns as a no-op since the client is marked close_asap. `c->nread` is not reset to 0 and still contains the value from a previous read.
4. Later `parseInputBuffer` [gets invoked](https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L6514).
5. `parseInputBuffer` then [accesses the query_buf](https://github.com/valkey-io/valkey/blob/unstable/src/networking.c#L3864). The query_buf here would be null, having been reset in `resetSharedQueryBuf` as part of `beforeNextClient`.

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
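The invariant can be modelled with a toy client struct. This is a simplified sketch, not the code in `networking.c`: `Client`, `read_to_query_buf` and `parse_input` are hypothetical, and `incoming` stands in for the byte count a socket read would return.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int close_asap;       /* client is scheduled to be freed */
    long nread;           /* bytes produced by the most recent read */
    const char *querybuf; /* may be NULL once the shared buffer is reset */
} Client;

/* Simplified read step. */
void read_to_query_buf(Client *c, long incoming) {
    if (c->close_asap) {
        c->nread = 0; /* the fix: never leave a stale count behind */
        return;
    }
    c->nread = incoming;
}

/* Returns 1 if the parser would touch querybuf, 0 if it is skipped. */
int parse_input(const Client *c) {
    if (c->nread <= 0) return 0;
    assert(c->querybuf != NULL); /* would be violated by a stale nread */
    return 1;
}
```

Without the reset in the `close_asap` branch, a second pass over a client whose buffer is already gone would reach the parser with the previous read's count.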

## Problem

`Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots, which is too much work for valgrind-instrumented builds.

The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything).

## Fix

Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>

…alkey-io#3498) There is a double free issue in the code. The error handling path called both decrRefCount(o) and streamFreeNACK(nack), but the nack was obtained from cgroup->pel via raxFind and is still referenced there. decrRefCount(o) frees it through freeStream -> streamFreeCG -> raxFreeWithCallback(cg->pel, zfree), so the explicit streamFreeNACK(nack) causes a double free. Remove the redundant streamFreeNACK(nack) call and add a regression test with a crafted corrupt payload that triggers the duplicate consumer PEL entry path. This was introduced in 492d8d0. Signed-off-by: Binbin <binloveplay1314@qq.com>

## Summary

Fix a file descriptor leak in `connSocketBlockingConnect()` when `aeWait()` times out.

## Bug

When `anetTcpNonBlockConnect()` succeeds but `aeWait()` times out (e.g., MIGRATE to an unreachable host), the fd is leaked because it was never assigned to `conn->fd`. The caller's `connClose()` checks `conn->fd != -1` and skips cleanup.

## Fix

Assign `conn->fd = fd` immediately after `anetTcpNonBlockConnect()` succeeds, before `aeWait()`. This way the caller's normal `connClose()` cleanup path handles the fd on any error, which is consistent with how the rest of the connection lifecycle works.

TLS connections also benefit since `connTLSBlockingConnect` delegates to this function for the TCP layer.

## Reproducer

```
valkey-cli SET key hello
# Repeat against unreachable host:
for i in $(seq 1 30); do valkey-cli MIGRATE 192.0.2.1 6379 key 0 500; done
# Check: /proc/<pid>/fd shows 30 leaked socket fds
```

*This issue was generated by AI but verified, with love, by a human.*

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
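The ownership rule behind the fix can be sketched with a mock connection. Everything here is hypothetical (`Conn`, `fake_connect`, `fake_close`, `blocking_connect`, `conn_close`, and the `will_time_out` flag that stands in for `aeWait()` timing out); it only illustrates why storing the descriptor before the wait lets the normal close path reclaim it.

```c
#include <assert.h>

typedef struct {
    int fd; /* -1 means "no descriptor owned" */
} Conn;

static int open_fds = 0; /* descriptors this sketch has "opened" */

static int fake_connect(void) {
    open_fds++;
    return 42; /* arbitrary descriptor number */
}

static void fake_close(int fd) {
    (void)fd;
    assert(open_fds > 0);
    open_fds--;
}

/* Returns 0 on success, -1 on timeout. */
int blocking_connect(Conn *conn, int will_time_out) {
    int fd = fake_connect();
    conn->fd = fd; /* take ownership before anything can fail */
    if (will_time_out) return -1;
    return 0;
}

/* Caller-side cleanup: only closes what the connection owns. */
void conn_close(Conn *conn) {
    if (conn->fd != -1) {
        fake_close(conn->fd);
        conn->fd = -1;
    }
}

int leaked_fds(void) {
    return open_fds;
}
```

If the assignment to `conn->fd` were moved after the timeout check, `conn_close()` would see `-1` on the error path and the descriptor would stay open, which is the leak described above.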

…alkey-io#3535) Match HGETDEL with the existing batch-delete pattern used by HDEL. HDEL already pauses hashtable auto-shrink while deleting multiple fields so shrink evaluation is deferred until the batch completes. HGETDEL was missing the same optimization even though it also deletes fields in a loop. Pause auto-shrink for hashtable-encoded hashes before the HGETDEL delete loop and resume it once afterwards. This preserves observable behavior and reduces redundant shrink work for multi-field deletes. Same as valkey-io#3144. Signed-off-by: DaeMyung Kang <charsyam@gmail.com>

…3504) The SPMC queue from valkey-io#3324 needs each `spmcCell` to be cache-line aligned, but plain `zmalloc()` does not guarantee that in all build configurations. This change introduces `zmalloc_cache_aligned()` and uses it for the SPMC queue buffer allocation in `spmcInit()`. Failing CI: https://github.com/valkey-io/valkey/actions/runs/24374139344 --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
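One portable way to obtain cache-line-aligned memory is C11 `aligned_alloc`. The sketch below is an assumption about the general technique, not the implementation of `zmalloc_cache_aligned()`; `alloc_cache_aligned` is a hypothetical helper and the 64-byte line size is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64 /* assumed; the real value is platform-dependent */

/* Allocates at least `size` bytes starting on a cache-line boundary.
 * The request is rounded up to a multiple of the alignment, which C11
 * originally required of aligned_alloc. Release the result with free(). */
void *alloc_cache_aligned(size_t size) {
    if (size == 0) size = 1;
    size_t rounded = (size + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE * CACHE_LINE_SIZE;
    void *p = aligned_alloc(CACHE_LINE_SIZE, rounded);
    assert(p == NULL || ((uintptr_t)p % CACHE_LINE_SIZE) == 0);
    return p;
}
```

With each cell sized to a multiple of the cache line, an aligned base address keeps every cell on its own line, which is the property the queue relies on.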

…unload (valkey-io#3545) This follows up on the commandresult API work and fixes cleanup around unsubscribe and module unload. The main issue was that command-result event listeners could leave stale state behind. On unload, we removed the listeners themselves but didn’t fully update the fast-path listener counters. Separately, unsubscribing with a NULL callback could behave badly if the listener wasn’t present anymore. In practice, that meant later commands could still walk into command-result event handling after the module was supposed to be cleaned up. Failed in Daily as well yesterday: https://github.com/valkey-io/valkey/actions/runs/24753491944/job/72421581610#step:10:852 Related Failures: valkey-io#2936 (comment) --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

### Summary

The daily CI sanitizer jobs with clang are failing during the build step. When the static Lua module is built with `-flto`, the `.o` files contain LLVM bitcode that gets archived into `libvalkeylua.a`. The system linker cannot read this bitcode, causing build failures:

`/usr/bin/ld: /home/runner/work/valkey/valkey/src/modules/lua/libvalkeylua.a: member /home/runner/work/valkey/valkey/src/modules/lua/libvalkeylua.a(debug_lua.o) in archive is not an object`

The previous fix (valkey-io#3546) pinned clang to version 17, but this was insufficient: the issue is not just a version mismatch but that the system linker fundamentally cannot read LTO bitcode from `.a` archives.

Example failure: https://github.com/valkey-io/valkey/actions/runs/24865821147/job/72801509768

### Fix

Strip LTO flags from OPTIMIZATION in the Lua module Makefile using `override`.

Tested: https://github.com/hanxizh9910/valkey/actions/runs/24913834442

---------

Signed-off-by: Hanxi Zhang <hanxizh@amazon.com>

The bug was in dismissHashtable(), which computes the size passed to zmadvise_dontneed() for the top-level hashtable tables. ht->tables[i] points to a contiguous array of bucket objects, but the code used sizeof(bucket *) instead of sizeof(bucket) when calculating the length. That means it treated the allocation like an array of pointers rather than an array of buckets. As a result, the advised range was much smaller than the actual table allocation. On 64-bit builds, bucket is 64 bytes while bucket * is 8 bytes, so only about one eighth of the table was covered. This does not usually break correctness, but it defeats the purpose of the function: after a fork, we want to tell the kernel that the hashtable pages are no longer needed so we reduce copy-on-write overhead. With the wrong size, most of the table memory was never included in that hint. The fix is to use sizeof(bucket) so the full top-level bucket array is passed to zmadvise_dontneed(). Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
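The size difference is easy to check in isolation. In this sketch the 64-byte `bucket` is a hypothetical stand-in for the real hashtable bucket, and the two helpers only reproduce the length arithmetic, not the call to `zmadvise_dontneed()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 64-byte bucket; the real layout lives in the hashtable code. */
typedef struct {
    unsigned char bytes[64];
} bucket;

static_assert(sizeof(bucket) == 64, "sketch assumes a 64-byte bucket");

/* Length computed as if the table were an array of pointers (the bug). */
size_t advised_len_buggy(size_t num_buckets) {
    return num_buckets * sizeof(bucket *);
}

/* Length of the actual contiguous bucket array (the fix). */
size_t advised_len_fixed(size_t num_buckets) {
    return num_buckets * sizeof(bucket);
}
```

For a table of 1024 buckets the fixed length is 65536 bytes, while the buggy length is `1024 * sizeof(void *)`, which is 8192 bytes on a typical 64-bit build, one eighth of the allocation.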

Here we should jump to the `error` label so that the resources are freed:

```
error:
    if (listen_cmid) rdma_destroy_id(listen_cmid);
    if (listen_channel) rdma_destroy_event_channel(listen_channel);
    ret = ANET_ERR;
end:
    freeaddrinfo(servinfo);
    return ret;
}
```

Signed-off-by: Binbin <binloveplay1314@qq.com>

…alkey-io#3548)

The default value of lua-enable-insecure-api cannot be safely changed from no to yes due to two issues:

1. In createEngineContext(), lua_enable_insecure_api was hardcoded to 0 before initializing Lua states, so deprecated APIs (newproxy, setfenv, getfenv) were never registered in the global table regardless of the actual config value. Once the global table is locked, the config change has no effect.
2. lua_insecure_api_current was initialized to 0 (struct zero-init) and never synced with the final config value. If the default was changed to yes(1), a subsequent CONFIG SET no would see both values as 0 and skip the evalReset() call in updateLuaEnableInsecureApi().

Fix by reading the real config via isLuaInsecureAPIEnabled() in createEngineContext() before Lua state initialization, and syncing lua_insecure_api_current after all config sources (default, config file, command-line args) are applied.

Signed-off-by: Binbin <binloveplay1314@qq.com>

…io#3554) If not cleared, the job may no longer be valid by the time the client goes to cleanup. This dangling reference could cause a crash if you set slot-migration-log-max-len to 0 and are very unlucky. Signed-off-by: Jacob Murphy <jkmurphy@google.com>

Remove eval script cache entries that belong to a scripting engine when that engine is unregistered. This prevents the eval cache from retaining dangling engine pointers and keeps the tracked script memory in sync after engine shutdown.

The scripting engine unregister path now invokes a new eval cleanup helper, which scans the cached scripts, drops matching entries from the LRU list and dictionary, and adjusts cache memory accounting accordingly.

* scripting engine
* eval cache

Signed-off-by: Eran Ifrah <eifrah@amazon.com>
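A cleanup pass of this shape can be sketched with a plain linked list. This is a simplified model, not the server's eval cache: `Script`, `EvalCache`, `cache_add` and `cache_remove_engine` are hypothetical, and the real cache also maintains a dictionary alongside the LRU list.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Script {
    const void *engine; /* owning engine (opaque in this sketch) */
    size_t mem;         /* bytes accounted for this script */
    struct Script *next;
} Script;

typedef struct {
    Script *head;
    size_t total_mem; /* tracked script memory */
} EvalCache;

void cache_add(EvalCache *c, const void *engine, size_t mem) {
    Script *s = malloc(sizeof(*s));
    assert(s != NULL);
    s->engine = engine;
    s->mem = mem;
    s->next = c->head;
    c->head = s;
    c->total_mem += mem;
}

/* Drops every entry owned by `engine`, keeping the memory accounting in
 * sync. Returns the number of entries removed. */
size_t cache_remove_engine(EvalCache *c, const void *engine) {
    size_t removed = 0;
    Script **link = &c->head;
    while (*link) {
        Script *s = *link;
        if (s->engine == engine) {
            *link = s->next; /* unlink without disturbing other entries */
            c->total_mem -= s->mem;
            free(s);
            removed++;
        } else {
            link = &s->next;
        }
    }
    return removed;
}
```

After the pass, no remaining entry refers to the unregistered engine and `total_mem` reflects only the surviving scripts.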

Bump `VALKEY_RELEASE_STAGE` from `rc1` to `rc2` and add release notes for changes backported from `unstable` in #3519 --------- Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com> Signed-off-by: Sarthak Aggarwal <sarthakaggarwal97@gmail.com> Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>

enjoy-binbin and others added 30 commits April 16, 2026 14:20

Make macOS leaks check skippable (valkey-io#3370)

164ce49

Add AGENTS.md file for agentic coding assistant steering (valkey-io#3371

8c32b38

) Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>

remove duplicated lline (valkey-io#3379)

afb3b8b

Probably added by mistake during some merge of valkey-io#1566 Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

Upload all benchmark artifacts including server logs (valkey-io#3388)

5607fd6

Upload the entire results directory instead of only metrics JSON files. This includes server logs which are useful for debugging benchmark failures. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

ci: include gtests in code coverage report

70dc6f8

Signed-off-by: harrylin98 <harrylin980107@gmail.com>

Show uname -a in RDMA CI job (valkey-io#3418)

3301857

The RXE project should match the kernel version of the CI machine. Show `uname -a` in the RDMA CI job to help find the reason for the kmod install failure. Signed-off-by: zhenwei pi <zhenwei.pi@linux.dev>

fix test_entry to consider diffrerent allocator size classes (valkey-…

d1db275

…io#3416) fixes: valkey-io#3200 --------- Signed-off-by: Ran Shidlansik <ranshid@amazon.com>

Handle EAGAIN in clusterWriteHandler (valkey-io#3421)

b20ffa2

Avoid freeing the cluster link when EAGAIN occurs, so that we can retry and keep the pending send messages. Signed-off-by: Binbin <binloveplay1314@qq.com>

sarthakaggarwal97 requested a review from madolson April 16, 2026 14:37

sarthakaggarwal97 changed the title ~~Backports to 9.1 for RC2~~ Backport Unstable to 9.1 for RC2 Apr 16, 2026

sarthakaggarwal97 force-pushed the cherry-pick-unstable-to-9.1 branch from eeb44a9 to 0c8f75f Compare April 16, 2026 14:55

sarthakaggarwal97 mentioned this pull request Apr 16, 2026

Update version to 9.1.0-rc2 and add release notes #3521

Merged

roshkhatri and others added 21 commits April 23, 2026 19:53

madolson approved these changes Apr 27, 2026

madolson merged commit 2006a9d into valkey-io:9.1 Apr 27, 2026
76 checks passed

madolson merged 67 commits into valkey-io:9.1 from sarthakaggarwal97:cherry-pick-unstable-to-9.1

20 participants

sarthakaggarwal97 commented Apr 16, 2026

Included PRs (60)

Bug Fixes (15)

New Features (Valkey 9.1 project: "To be backported")

Performance Improvements

Test Fixes (13)

CI & Infrastructure (10)

Build Fixes (3)

Documentation & Maintenance (5)

Cleanup (2)

codecov Bot commented Apr 16, 2026

Codecov Report

dvkashapov commented Apr 20, 2026
