Skip to content

8.2 commits backport#463

Closed
sundb wants to merge 15 commits into
8.2-basefrom
8.2-backport
Closed

8.2 commits backport#463
sundb wants to merge 15 commits into
8.2-basefrom
8.2-backport

Conversation

@sundb

@sundb sundb commented Jun 4, 2026

Copy link
Copy Markdown
Owner

No description provided.

YaacovHazan and others added 2 commits June 4, 2026 09:15
…itectures (redis#15212)

Ensure backward compatibility and consistent behavior across different
architectures by explicitly setting the default value.

Fixes redis#15175

Co-authored-by: ofiryanai <ofiryanai1@gmail.com>
(cherry picked from commit 6c3a8ec)
Fixes [redis#15183](redis#15183).

## Motivation
Commit
[cf668f2](redis@cf668f2)
tightened cluster-announce-ip validation to require a valid IPv4 or IPv6
address, which is a regression for users that legitimately announce a
hostname.

## Changes
* isValidClusterAnnounceIp() now accepts either:
   * A valid IPv4/IPv6 address
   * A valid hostname — same character rules as cluster-announce-hostname,
length-bounded by NET_IP_STR_LEN to match the storage buffer.

(cherry picked from commit 21f2569)
(cherry picked from commit 54ea50c)
@sundb sundb changed the title Test tcp deadlock fixes (redis#14667) 8.2 commits backport Jun 4, 2026
@codecov

codecov Bot commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 82.22222% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.30%. Comparing base (ee7464f) to head (7692e39).

Files with missing lines Patch % Lines
modules/vector-sets/vset.c 0.00% 3 Missing ⚠️
src/module.c 0.00% 3 Missing ⚠️
src/cluster_legacy.c 85.71% 1 Missing ⚠️
src/config.c 90.90% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           8.2-base     #463      +/-   ##
============================================
+ Coverage     69.24%   69.30%   +0.05%     
============================================
  Files           128      128              
  Lines         72353    72369      +16     
============================================
+ Hits          50104    50152      +48     
+ Misses        22249    22217      -32     
Files with missing lines Coverage Δ
src/blocked.c 92.35% <100.00%> (+0.02%) ⬆️
src/db.c 89.63% <100.00%> (+0.02%) ⬆️
src/networking.c 91.37% <100.00%> (+0.04%) ⬆️
src/t_stream.c 93.62% <100.00%> (+0.22%) ⬆️
src/cluster_legacy.c 74.13% <85.71%> (-0.19%) ⬇️
src/config.c 73.17% <90.90%> (-0.03%) ⬇️
modules/vector-sets/vset.c 25.31% <0.00%> (+0.85%) ⬆️
src/module.c 22.29% <0.00%> (+0.26%) ⬆️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…4373)

This PR aims to avoid the situation of a potential crash when efSearch
is too large (and therefore the memory allocated could lead to a server
crash or an integer overflow (where less memory is allocated than
expected).

- Limit the accepted EF in the request o 100_000 as in VADD
- Limit the ef search to the number of nodes in the HNSW graph

(cherry picked from commit 97df5b5)
moticless and others added 3 commits June 4, 2026 10:40
redis#14417)

Fix a heap-buffer-overflow vulnerability in the `CLUSTER FORGET` command
when provided with a node ID shorter than the expected 40 bytes.
When `CLUSTER FORGET` is called with a node ID that has a length smaller
than `CLUSTER_NAMELEN` (40 bytes), the `clusterBlacklistExists()`
function would read beyond the allocated string buffer. This occurs
because the function always attempted to read exactly 40 bytes via
`sdsnewlen(nodeid, CLUSTER_NAMELEN)`.

Changes:
- Added a `size_t len` parameter to `clusterBlacklistExists()` to use
the actual string length
- Updated all call sites to pass the appropriate length
- Added a test case to verify the fix
- Test added to verify that `CLUSTER FORGET` with an invalid short node
ID returns an appropriate error.

This PR is based on: valkey-io/valkey#2108
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>

(cherry picked from commit 48d0aa9)
This PR is based on:
valkey-io/valkey#2347

This was introduced in redis#13512

The server crashes with a null pointer dereference when lookupKey() is
called from handleClientsBlockedOnKey(). The crash occurs because
server.executing_client is NULL, but the code attempts to access
server.executing_client->cmd->proc without checking.

**Crash scenario:**
Client 1 enables CLIENT NO-TOUCH
Client 2 blocks on BRPOP mylist 0
Client 1 executes RPUSH mylist elem
When unblocking Client 2, lookupKey() dereferences NULL
server.executing_client → crash

**Solution**
Added proper null checks before dereferencing server.executing_client:
Check if LOOKUP_NOTOUCH flag is already set before attempting to modify
it
Verify both server.current_client and server.executing_client are not
NULL before accessing their members
Maintain the TOUCH command exception for scripts

**Testing**
Added regression test in tests/unit/type/list.tcl that reproduces and
verifies the fix for this crash scenario.

This fix is based on valkey-io/valkey#2347

Co-authored-by: Uri Yagelnik <uriy@amazon.com>
Co-authored-by: Ran Shidlansik <ranshid@amazon.com>
(cherry picked from commit 5b49119)
…ancellation (redis#14420)

This issue was introduced by redis#10440
In that PR, we avoided resetting the current user during processCommand,
but overlooked the fact that this client might not be the current one,
it could be a client that was previously blocked due to shutdown.
If we don’t reset these clients, and the shutdown is canceled, then when
these clients continue executing other commands, they will trigger an
assertion.

This PR delays the operation of resetting the client to
processUnblockedClients and no longer skips SHUTDOWN_BLOCKED clients.

(cherry picked from commit 6d89370)
vitahlin and others added 9 commits June 4, 2026 12:40
This PR is based on: valkey-io/valkey#3511
Close redis#14983

During diskless replication, if **any single replica** cannot accept a
write (TCP send buffer full / `EAGAIN`), the master stops reading the
RDB pipe entirely, stalling data delivery to **all** replicas —
including fast ones that are ready to receive data.

The failure reason is similar to
redis#14946, the socket buffer is more
easy to fill.

In `rdbPipeReadHandler`, the master reads from the child's RDB pipe and
writes to all replica sockets in a loop. When `connWrite` to any replica
returns a partial write (socket send buffer full), the handler:

1. Installs a per-replica `rdbPipeWriteHandler` and increments
`rdb_pipe_numconns_writing`
2. **Removes the pipe read event** via `aeDeleteFileEvent(server.el,
server.rdb_pipe_read, AE_READABLE)`, stopping all pipe reads

The pipe read event is only re-enabled when **all** pending write
handlers complete (`rdb_pipe_numconns_writing == 0`), meaning the
**slowest replica dictates the throughput for all replicas**.

With one slow replica (consuming at ~290 KB/s due to `key-load-delay`):

- Master bursts ~1.3 MB of RDB data until the slow replica's socket send
buffer fills
- `rdbPipeReadHandler` disables the pipe read event
- **All replicas starve for 4–5 seconds** while the slow replica drains
its buffer
- Cycle repeats: burst → stall → burst → stall

Ultimately, it leads to a very slow synchronization process of the
entire master and replica.

1. Skip the entire `diskless replicas drop during rdb pipe` test under
Valgrind to avoid timing flakiness on slow env.

2. Move `start_server` inside the `foreach all_drop` loop so each
subcase gets a fresh master instead of sharing state across subcases.

3. For `no / slow / fast / all` subcases, replica 0 runs with
`key-load-delay 500`, which combined with the blocked-writer TCP
back-pressure can stall the RDB-saving child indefinitely; shrink the
dataset to ~40 MB so the transfer still exercises the blocked-writer
path but completes in reasonable time instead of hanging on the TCP
deadlock.
For the timeout subcase, replica 0 does not run with `key-load-delay
500`, so to avoid the TCP deadlock we still reduce the dataset somewhat,
but keep it larger than the other subcases. Otherwise the kernel TCP
send buffer can absorb the whole RDB, and we'd miss the
repl_last_partial_write != 0 "(full sync)" timeout path and only hit the
"(streaming sync)" path instead.

5. For the `all` subcase, set `rdb-key-save-delay 1000` on the master so
the RDB child keeps generating data while both replicas are killed,
ensuring the last-replica-drop path is exercised rather than racing with
normal completion.

6. Move the slow-replica `pause_process()` so it happens only in the
timeout subcase, not after killing replicas, so Redis observes the
disconnect promptly in non-timeout flows.

7. In the timeout subcase, set `repl-timeout` 2, wait inline for
`*Disconnecting timedout replica (full sync)*`, then restore
`repl-timeout` 60 so the remaining replica can finish the streamed RDB.

---------

Co-authored-by: Sarthak Aggarwal <sarthagg@amazon.com>
Co-authored-by: debing.sun <debing.sun@redis.com>
(cherry picked from commit 3189614)
`RedisModule_GetClusterNodeInfo` is
[documented](https://redis.io/docs/latest/develop/reference/modules/modules-api-ref/#redismodule_getclusternodeslist)
to be used with `RedisModule_GetClusterNodesList`, which states (and
does) that the returned strings are not null-terminated.

Therefore, it is unsafe to call `strlen` on the `const char *id` input
of `RedisModule_GetClusterNodeInfo`, and the API should assume the
string is of the correct length

(cherry picked from commit b043ac5)
Fixes data race where main thread modifies pauserehash in dictNext while
IO thread reads `useStoredKeyApi()` in `lookupCommand()->dictFind()`
path.
ASAN detected overlapping memory access at same byte offset
causing race condition. When bit fields are adjacent in a struct,
modifying one bit field requires a read-modify-write operation on the
entire memory unit, which can cause race conditions with concurrent
access to other bit fields in the same unit.
This was introduced by redis#13696,
although it was changed by redis#14440,
but this issue still exist.

The fix moves `useStoredKeyApi` to share a memory word with
`pauseAutoResize` instead. Since `pauseAutoResize` is never modified for
`server.commands`, this eliminates the race condition while maintaining
memory efficiency through bit field packing.

Reproduce step:
```
make SANITIZER=thread
./runtest --tsan --config io-threads 4 --accurate --verbose --dump-logs --single unit/obuf-limits --loop --stop
```

failed CI:
https://github.com/redis/redis/actions/runs/18803219305/job/53653572710

---------

Co-authored-by: Moti Cohen <moti.cohen@redis.com>
(cherry picked from commit 6ea4e2c)
- We will generate a random shard id when creating a cluster node, so
`auxFieldHandlers[af_shard_id].isPresent(n) == 0` never meet, so it
means we never add master nodes into `cluster.shards` when loading, this
bug is introduced in redis#10536 that
supports `shard-id` concept. BTW, redis#13468 can make replicas add into
`cluster.shards`

- Replica shard_id may be different with master, so before we add again,
we should remove it, otherwise, `cluster.shards` will have dirty shards,
introduced in redis#13468

These bugs causes the output of the `cluster shards` is corrupt, the
temporary `slot_info` may not be cleaned up, we may see the duplicated
slot ranges, and a shard info without master.

(cherry picked from commit bf0ba1d)
…up (redis#14552)

When destroying a consumer group with `XGROUP DESTROY`, the cached
`min_cgroup_last_id` was not being invalidated. This caused incorrect
behavior when using `XDELEX` with the `ACKED` option, as the cache still
referenced the destroyed group's `last_id`.

Invalidate the `min_cgroup_last_id` cache when the destroyed group's
`last_id` equals the cached minimum. The cache will be recalculated on
the next call to `streamEntryIsReferenced()`.

---------

Co-authored-by: guybe7 <guy.benoish@redislabs.com>
(cherry picked from commit bb6389e)
…egies (redis#14623)

This bug was introduced by redis#14130 and found by guybe7

When using XTRIM/XADD with approx mode (~) and DELREF/ACKED delete
strategies, if a node was eligible for removal but couldn't be removed
directly (because consumer group references need to be checked), the
code would incorrectly break out of the loop instead of continuing to
process entries within the node. This fix allows the per-entry deletion
logic to execute for eligible nodes when using non-KEEPREF strategies.

(cherry picked from commit 9ca860b)
…#14816)

`setModuleEnumConfig()` was passing the prefixed config name to module
callbacks instead of the unprefixed name, inconsistent with other
config types. Fixed by using getRegisteredConfigName() like bool,
numeric, and string configs do.

Added assertions to all test module config callbacks to validate
correct unprefixed names are received.

Issue was introduced by redis#13656

(cherry picked from commit 8a65b65)
In redis#14121, the SCAN filters order was changed, before redis#14121the order
was - pattern, expiration and type, after redis#14121pattern became last,
this break change broke the original behavior, which will cause scan
with pattern also to remove the expired keys.
This PR reorders the filters to be consistent with the original behavior
and extends a test to cover this scenario.

(cherry picked from commit 3920059)
## Summary

This bus was introduced by redis#14623

Before PR redis#14623, when a stream node was going to be fully removed, we
would just delete the whole node directly instead of iterating through
and deleting each entry.

Now, with the XTRIM/XADD flags, we have to iterate and delete entries
one by one. However, the implementation in issue redis#8169 didn’t consider
the case where all entries are removed, so `p` can end up being NULL.

Fixes an UndefinedBehaviorSanitizer error in `streamTrim()` when marking
the last entry in a listpack as deleted. The issue occurs when
performing pointer arithmetic on a NULL pointer after `lpNext()` reaches
the end of the listpack.

## Solution
If p is NULL, we skip the delta calculation and the calculation of
new `p`.

(cherry picked from commit 85ab4ca)
@sundb sundb closed this Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants