[backport] Backport sweep for 9.1 #3970

Closed
valkeyrie-ops[bot] wants to merge 8 commits into 9.1 from agent/backport/sweep/9.1

@valkeyrie-ops valkeyrie-ops Bot commented Jun 11, 2026

Backport sweep for 9.1

Automated cherry-picks from PRs marked "To be backported".

Applied

| Source PR | Title | Detail |
|-----------|-------|--------|
| #3938 | Fix IO-Threads redesign cleanup perf regression from #3544 | |
| #3964 | Omit alldbs rule in ACL SAVE/LIST and CONFIG REWRITE for compatibility | conflicts resolved by Claude Code |
| #3544 | Revert "IO-Threads redesign cleanup work (#3544)" | cherry-picked in a prior sweep |
| #3920 | Reject integer overflow of length fields in zipmapValidateIntegrity | |
| #3921 | Reject NAN scores in listpack/ziplist-encoded sorted sets on RDB load | |
| #3959 | Stabilize CLUSTERSCAN unassigned-slot test by retrying DELSLOTS | |
| #3939 | Fix RESP3 type violation in addReplyCommandSubCommands | |
| #3811 | Fix off_t to int truncation in bio repl transfer size reporting | |

Generated by valkey-ci-agent using Claude Code.

roshkhatri and others added 2 commits June 11, 2026 10:39
This regression is still present in 9.1 GA as the cherrypick of revert
commit was missed during release.

Re-applies #3544 (reverted in #3756 due to ~20% SET regression) with the
performance fix from #3760.

**Root Cause:** The original #3544 changed
`tryOffloadFreeObjToIOThreads` to offload only the SDS buffer free to IO
threads and to free the `robj` shell on the main thread. Profiling
showed that freeing the `robj` shell on the main thread became the top
main-thread hotspot (~10% CPU), while IO threads shifted from doing real
`jemalloc` work to spinning idle on `spmcDequeue`.

**Fix**: Keep `tryOffloadFreeObjToIOThreads` offloading the entire robj
(`decrRefCount`) to the IO thread. Cross-thread `zfree` is safe with
`jemalloc`.

This PR includes all the cleanup work from #3544:
- `trySendWriteToIOThreads`: defer clearing `last_header` until after
successful enqueue
- `evictClients`: simplified bookkeeping
- Queue sizes as runtime parameters instead of compile-time macros
- IO ignition policy using `stat_active_time` instead of `getrusage`
- Function renames (`IOThreadFreeArgv` --> `ioThreadFreeArgv`, etc.) and
doc comments

**Benchmark** (Graviton4 c8gb.metal-48xl):
Config: SET, 128B values, 9 IO threads, pipeline=10, 1600 clients (same
as the official Valkey benchmark method)

| Version | Throughput |
|---------|-----------|
| Unstable + original #3544 | ~1,554K rps |
| Unstable + this PR | ~2,116K rps |

<details>
<summary>Diff vs original #3544 (perf fix)</summary>

```diff
diff --git a/src/io_threads.c b/src/io_threads.c
--- a/src/io_threads.c
+++ b/src/io_threads.c
@@ // IO thread handler
 case JOB_REQ_FREE_OBJ:
-    zfree(data);
+    decrRefCount(data);
     break;

@@ // tryOffloadFreeObjToIOThreads
-    /* We offload only the free of the ptr that may be allocated by the I/O thread.
-     * The object itself was allocated by the main thread and will be freed by the main thread. */
-    void *job = tagJob(sdsAllocPtr(objectGetVal(obj)), JOB_REQ_FREE_OBJ);
+    void *job = tagJob(obj, JOB_REQ_FREE_OBJ);
     if (unlikely(spmcEnqueue(&io_shared_inbox, job) == false)) return C_ERR;
-    objectSetVal(obj, NULL);
-    decrRefCount(obj);
     io_jobs_submitted++;
```
</details>

---------

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
#3964)

Database-level ACL (#2309) introduced the `alldbs` rule, which was
emitted explicitly for every user. As a result, previous versions could
no longer parse ACL strings produced by later versions.
Omit `alldbs` in `ACLDescribeSelector()`, which is used in the `ACL
SAVE/LIST` and `CONFIG REWRITE` command paths, so that downgrades remain
possible when the new feature (`db=` and `resetdbs` rules) is not used.
Keep the `ACL GETUSER` output as is and return `alldbs` in the
`databases` field, because of the command's field-value format.
Add a test that checks that `ACL LIST` omits the implicit `alldbs`, and
add checks to the existing `ACL SAVE` and `CONFIG REWRITE` tests.
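
The idea of omitting an implicit default when serializing can be sketched as follows. This is a simplified illustration, not the code in `ACLDescribeSelector()`: the `selectorSketch` struct, `describeSelectorSketch`, and the example rule strings (including the exact `db=0` value syntax) are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified selector; not Valkey's real ACL structures. */
typedef struct {
    const char *base_rules; /* e.g. "~* +@all" */
    bool all_dbs;           /* true when no database restriction is set */
    const char *db_rules;   /* assumed syntax, used only when !all_dbs */
} selectorSketch;

/* Writes the description used for persistence. The implicit default
 * (all databases) is left out, so older parsers never see the new rule.
 * Returns the number of characters written, as snprintf does. */
static int describeSelectorSketch(char *buf, size_t buflen, const selectorSketch *s) {
    if (s->all_dbs) return snprintf(buf, buflen, "%s", s->base_rules);
    return snprintf(buf, buflen, "%s %s", s->base_rules, s->db_rules);
}
```

With this shape, a user who never touched the database rules serializes exactly as before the feature existed, while a restricted user still gets the explicit rule written out.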

Fixes #3915

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 96.77419% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.59%. Comparing base (71545e0) to head (9c36ed1).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/server.c | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##              9.1    #3970      +/-   ##
==========================================
- Coverage   76.73%   76.59%   -0.14%     
==========================================
  Files         163      163              
  Lines       81098    81121      +23     
==========================================
- Hits        62228    62136      -92     
- Misses      18870    18985     +115     

| Files with missing lines | Coverage Δ | |
|--------------------------|------------|---|
| src/acl.c | 92.55% <ø> (-0.13%) | ⬇️ |
| src/rdb.c | 77.62% <100.00%> (+0.69%) | ⬆️ |
| src/replication.c | 86.41% <100.00%> (+0.48%) | ⬆️ |
| src/server.h | 100.00% <ø> (ø) | |
| src/t_zset.c | 97.04% <100.00%> (+0.09%) | ⬆️ |
| src/zipmap.c | 100.00% <100.00%> (ø) | |
| src/server.c | 89.43% <80.00%> (-0.03%) | ⬇️ |

... and 20 files with indirect coverage changes


akashkgit and others added 2 commits June 12, 2026 10:35
This reverts the [commit](fdd9039) that was merged as part of PR #3544,
due to a performance regression observed [here](#3750).

Signed-off-by: akash kumar <akumdev@amazon.com>
…3920)

## Problem

A crafted zipmap entry can set the value length to a value near
`UINT32_MAX` so that the `l + e` sum (value length + one-byte free
space) wraps in `unsigned int` arithmetic. The wrapped sum advances the
validation cursor by a tiny amount, leaving `p` inside the buffer, so
the `OUT_OF_RANGE` check passes and `zipmapValidateIntegrity` wrongly
returns success. The field-length path has the same shape — advancing
`p` by a ~4GB length wraps the pointer on 32-bit builds.

`zipmapValidateIntegrity` is always called with `deep=1` from `rdb.c`
when loading `RDB_TYPE_HASH_ZIPMAP`, **including via `RESTORE`**, so any
client with `RESTORE` access can submit a payload that passes
validation. On 32-bit platforms this leads to out-of-bounds access
during the subsequent zipmap→listpack conversion. On 64-bit the
downstream `lpSafeToAdd` cap happens to reject it (the raw ~4GB length
exceeds `LISTPACK_MAX_SAFETY_SIZE`), but the validator should not accept
a malformed payload in the first place — this is the function whose sole
job is to reject it.

## Fix

Bounds-check the attacker-controlled length against the bytes remaining
in the zipmap, in 64-bit space, **before** any pointer arithmetic, for
both the field-length and value-length paths.
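
A minimal sketch of that check, next to the wrapping shape it replaces. The helper names `lengthFitsInBuffer` and `lengthFitsWrapping` are hypothetical and the real validator is structured differently; the point is only that the sum is formed in 64 bits and compared against the remaining bytes before the cursor moves.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the actual Valkey code: returns true only if
 * `len` payload bytes plus `extra` trailing bytes fit between `p` and
 * `end`. The sum is formed in 64 bits, so a length near UINT32_MAX
 * cannot wrap around to a small value. */
static bool lengthFitsInBuffer(const unsigned char *p, const unsigned char *end,
                               uint32_t len, uint32_t extra) {
    if (p > end) return false;
    uint64_t remaining = (uint64_t)(end - p);
    uint64_t needed = (uint64_t)len + (uint64_t)extra;
    return needed <= remaining;
}

/* The unsafe shape, for comparison: the 32-bit sum wraps modulo 2^32,
 * so UINT32_MAX + 1 becomes 0 and the range check passes. */
static bool lengthFitsWrapping(const unsigned char *p, const unsigned char *end,
                               uint32_t len, uint32_t extra) {
    uint32_t needed = len + extra;
    return needed <= (uint64_t)(end - p);
}
```

For a 16-byte buffer, a length of `UINT32_MAX` with one extra byte is accepted by the wrapping variant and rejected by the 64-bit variant.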

## Testing

- `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test exercising
the full attack surface; asserts rejection and that the server stays up.
- Verified the test **fails on the pre-fix code** (validator accepts the
value-length payload) and **passes after the fix**, confirmed by
stashing the fix during the integration run.
- Full `integration/corrupt-dump` suite: 76 passed, 0 failed.

> [!NOTE]
> Found via structure-aware fuzzing of the RESTORE path. This issue was
generated by AI but verified, with love, by a human.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
@github-actions

❌ Provenance Check Alert

Potential code similarities detected with upstream repository.

  • 2026-06-12 10:42:32 [INFO] - matches redis/redis PR #9633 (similarity: 0.933, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
  • 2026-06-12 10:42:32 [INFO] - matches redis/redis PR #15251 (similarity: 0.978, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl

This check was performed automatically by the Provenance Guard Action.

madolson and others added 2 commits June 13, 2026 10:08
…#3921)

## Problem

A crafted `RESTORE` payload can store a `NAN` score in a
listpack-encoded sorted set. The integrity validation
(`lpValidateIntegrityAndDups`) only checks the listpack *structure* and
member uniqueness — it does not check score validity — so the payload is
accepted on load.

When the sorted set is later converted to a skiplist (e.g. when it grows
past `zset-max-listpack-entries`, or via any operation that triggers
conversion), `zslInsertNode()` asserts the score is not `NAN`
(`t_zset.c:260`) and the server aborts. **Any client with `RESTORE`
access can remotely crash the server.**

The skiplist RDB format (`RDB_TYPE_ZSET` / `RDB_TYPE_ZSET_2`) already
rejects `NAN` scores at load time (`rdb.c`, "Zset with NAN score
detected"). The listpack format (`RDB_TYPE_ZSET_LISTPACK`) had no
equivalent check.

## Reproduction

```
RESTORE k 0 "\x11\x19\x19\x00\x00\x00\x04\x00\x82m1\x03\x83nan\x04\x82m2\x03\x832.5\x04\xFF\x50\x00...."
# loads OK, then:
ZADD k 9 x      # forces listpack->skiplist conversion -> serverAssert(!isnan(node->score)) -> SIGABRT
```

## Fix

Add `zzlValidateScores()`, which scans the scores of a listpack zset
after structural validation and rejects the payload if any score is
`NAN`. Mirrors the existing skiplist-format check. `inf`/`-inf` and
large finite scores remain accepted (only `NAN` is rejected), matching
normal `ZADD` semantics.
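
The score scan can be sketched roughly as below. This is an illustration under simplifying assumptions, not `zzlValidateScores()` itself: the scores are assumed to be already extracted into C strings, and `scoresAreLoadable` is a hypothetical name. It relies on the standard `strtod` and `isnan`.

```c
#include <math.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a listpack zset whose scores were already
 * extracted as C strings. Returns false if any score fails to parse as
 * a number or parses to NaN. */
static bool scoresAreLoadable(const char *const *scores, size_t count) {
    for (size_t i = 0; i < count; i++) {
        char *end = NULL;
        double score = strtod(scores[i], &end);
        if (end == scores[i] || *end != '\0') return false; /* not a number */
        if (isnan(score)) return false;                     /* NaN rejected */
    }
    return true; /* inf, -inf and large finite values pass */
}
```

A payload with scores `"2.5"`, `"inf"`, `"-inf"` passes, while one containing `"nan"` is rejected before any conversion to a skiplist can hit the assertion.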

## Testing

- `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test asserting
rejection and that the server stays up.
- Verified the test **fails on the pre-fix code** (server crashes on
conversion) and **passes after the fix**, by stashing the fix during the
run.
- Confirmed valid zsets, including `inf`/`-inf`/large finite scores,
still load and convert correctly.
- Full `integration/corrupt-dump` suite: 74 passed, 0 failed.

> [!NOTE]
> Found via structure-aware fuzzing of the RESTORE path. This issue was
generated by AI but verified, with love, by a human.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
The Case 3 portion of the test was flaky: after a single round of
`CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state
and `wait_for_cluster_state fail` would time out with
`Cluster node 1 cluster_state:ok`.

The race is between R0's local DELSLOTS and the gossip already in
flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS
packet from R0 (whose myslots still claims slot 0) hits the
isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds
slot 0 back to R0 on R1. See:
```
    if (isSlotUnclaimed(j) ||
        server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
        clusterSlotFailoverGranted(j)) {
        ...
        clusterDelSlot(j);
        clusterAddSlot(sender, j);
        ...
    }
```

R0's subsequent "no longer claiming" PINGs cannot undo this, because
that path only sets owner_not_claiming_slot and never clears slots[j]:
```
    if (server.cluster->slots[j] == sender) {
        /* The slot is currently bound to the sender but the sender is no longer
         * claiming it. We don't want to unbind the slot yet as it can cause the cluster
         * to move to FAIL state and also throw client error. Keeping the slot bound to
         * the previous owner will cause a few client side redirects, but won't throw
         * any errors. We will keep track of the uncertainty in ownership to avoid
         * propagating misinformation about this slot's ownership using UPDATE
         * messages. */
        bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
    }
```

Combined with clusterUpdateState's full-coverage check looking only
at slots[j] == NULL, R1 stays at cluster OK forever.
```
    if (server.cluster->slots[j] == NULL || ...) {
        new_state = CLUSTER_FAIL;
        ...
    }
```

Rather than fighting the protocol's intentional asymmetry around
"soft delete" via gossip, just retry the DELSLOTS pass until all
three nodes converge to FAIL. This keeps the test focused on the
CLUSTERSCAN error semantics it actually wants to verify.

This closes #3891. The test was added in #3674.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@github-actions

❌ Provenance Check Alert

Potential code similarities detected with upstream repository.

  • 2026-06-13 10:10:39 [INFO] - matches redis/redis PR #9633 (similarity: 0.868, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
  • 2026-06-13 10:10:39 [INFO] - matches redis/redis PR #15251 (similarity: 0.906, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
  • 2026-06-13 10:10:39 [INFO] - matches redis/redis PR #15214 (similarity: 0.868, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl

This check was performed automatically by the Provenance Guard Action.

@github-actions

❌ Provenance Check Alert

Potential code similarities detected with upstream repository.

  • 2026-06-14 03:27:50 [INFO] - matches redis/redis PR #9633 (similarity: 0.868, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
  • 2026-06-14 03:27:50 [INFO] - matches redis/redis PR #15251 (similarity: 0.906, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
  • 2026-06-14 03:27:50 [INFO] - matches redis/redis PR #15214 (similarity: 0.868, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl

This check was performed automatically by the Provenance Guard Action.

rickrams and others added 2 commits June 14, 2026 10:24
## Summary

`addReplyCommandSubCommands` unconditionally called `addReplySetLen(c, 0)`
when a command has no subcommands, emitting a RESP3 Set type prefix (`~0`)
regardless of the `use_map` parameter. The non-empty path (below it) already
branches correctly on `use_map` — the empty early-return was simply missing
the same logic.

In RESP3, `COMMAND INFO <cmd>` returns the subcommands field as a Set (`~0`)
instead of an Array (`*0`) for any command without subcommands (e.g. PING).
Strict RESP3 client libraries that dispatch on collection type will
misinterpret the response. Not visible in RESP2 since both Set and Array use
the `*` prefix there.

## Fix

Apply the same `use_map` branch to the empty case:
- `addReplyMapLen(c, 0)` when `use_map=1`
- `addReplyArrayLen(c, 0)` otherwise
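
As a wire-level illustration of the branch above: in RESP3 a map header starts with `%`, an array header with `*`, and a set header with `~`. The helper below is hypothetical (it is not part of the Valkey reply API) and only formats the header that an empty subcommands reply would carry in each case.

```c
#include <stdio.h>

/* Hypothetical helper: writes the RESP3 header for an empty subcommands
 * reply into buf. '%' is the map prefix and '*' the array prefix; the
 * buggy early return always produced '~' (set). Returns the number of
 * characters written, as snprintf does. */
static int formatEmptySubcommands(char *buf, size_t buflen, int use_map) {
    char prefix = use_map ? '%' : '*';
    return snprintf(buf, buflen, "%c0\r\n", prefix);
}
```

With `use_map` set to 0 the buffer holds `*0\r\n`, which is the value the new `readraw` test asserts for `COMMAND INFO ping`.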

## Test

Added a `readraw` integration test in `tests/unit/introspection-2.tcl` that
inspects the raw wire-level type byte for the subcommands field of `COMMAND
INFO ping` in RESP3 mode, asserting `*0` (Array) rather than `~0` (Set).

Signed-off-by: Rick Ramsay <49293857+rickrams@users.noreply.github.com>
Signed-off-by: rickrams <rickrams@amazon.com>
`off_t` (64-bit), but were read into `int` (32-bit) locals in
`genValkeyInfoString()` and `handleBioThreadFinishedRDBDownload()`.

This causes INFO replication to report negative
`master_sync_total_bytes` during bio disk-based sync when RDB exceeds
2GB.

Fix: change the local variable types from `int` to `off_t`.
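
A small sketch of why the value turns negative. `int64_t` stands in for a 64-bit `off_t` so the example does not depend on the platform's `off_t` width, and both helper names are hypothetical. The second helper reproduces what a 32-bit two's-complement `int` would hold, without relying on an implementation-defined narrowing conversion.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a 64-bit size can be stored in an int without loss. */
static bool fitsInInt(int64_t size) {
    return size >= INT_MIN && size <= INT_MAX;
}

/* The value a 32-bit two's-complement int would hold after truncation.
 * Converting to uint32_t keeps the low 32 bits and is well defined. */
static int64_t asTruncatedInt32(int64_t size) {
    uint32_t low = (uint32_t)size;
    return low <= INT32_MAX ? (int64_t)low : (int64_t)low - 4294967296LL;
}
```

For a 3 GiB transfer (3221225472 bytes), the truncated value is -1073741824, which matches the negative `master_sync_total_bytes` symptom; keeping the local as a 64-bit type avoids the loss.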

Signed-off-by: chx9 <lovelypiska@outlook.com>

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Contributor


OK, this has a problem: #3938 was applied first and then reverted via #3544.

@sarthakaggarwal97

Contributor

I think #3544 was added to backporting list after #3938 was already applied. I will close this PR for now, and let it rebuild since the number of commits is low and it's not straightforward to rearrange the commits.

@valkeyrie-ops valkeyrie-ops Bot deleted the agent/backport/sweep/9.1 branch June 15, 2026 16:01