[backport] Backport sweep for 9.1 by valkeyrie-ops[bot] · Pull Request #3970 · valkey-io/valkey

valkeyrie-ops · 2026-06-11T10:43:40Z

Backport sweep for 9.1

Automated cherry-picks from PRs marked "To be backported".

Applied

| Source PR | Title | Detail |
|-----------|-------|--------|
| #3938 | Fix IO-Threads redesign cleanup perf regression from #3544 | |
| #3964 | Omit alldbs rule in ACL SAVE/LIST and CONFIG REWRITE for compatibility | conflicts resolved by Claude Code |
| #3544 | Revert "IO-Threads redesign cleanup work (#3544)" | cherry-picked in a prior sweep |
| #3920 | Reject integer overflow of length fields in zipmapValidateIntegrity | |
| #3921 | Reject NAN scores in listpack/ziplist-encoded sorted sets on RDB load | |
| #3959 | Stabilize CLUSTERSCAN unassigned-slot test by retrying DELSLOTS | |
| #3939 | Fix RESP3 type violation in addReplyCommandSubCommands | |
| #3811 | Fix off_t to int truncation in bio repl transfer size reporting | |

Generated by valkey-ci-agent using Claude Code.

This regression is still present in 9.1 GA because the cherry-pick of the revert commit was missed during the release.

Re-applies #3544 (reverted in #3756 due to a ~20% SET regression) with the performance fix from #3760.

**Root Cause:** The original #3544 changed `tryOffloadFreeObjToIOThreads` to offload only the SDS buffer free to IO threads, freeing the `robj` shell on the main thread. Profiling the change showed that freeing the `robj` shell on the main thread became the prime main-thread hotspot (~10% CPU), while IO threads shifted from doing real `jemalloc` work to spinning idle on `spmcDequeue`.

**Fix:** Keep `tryOffloadFreeObjToIOThreads` offloading the entire robj (`decrRefCount`) to the IO thread. Cross-thread `zfree` is safe with `jemalloc`.

This PR includes all cleanup work from #3544:

- `trySendWriteToIOThreads`: defer clearing `last_header` until after successful enqueue
- `evictClients`: simplified bookkeeping
- Queue sizes as runtime parameters instead of compile-time macros
- IO ignition policy using `stat_active_time` instead of `getrusage`
- Function renames (`IOThreadFreeArgv` --> `ioThreadFreeArgv`, etc.) and doc comments

**Benchmark** on Graviton4 (c8gb.metal-48xl):

Config: SET, 128B values, 9 IO threads, pipeline=10, 1600 clients (same as the official Valkey method)

| Version | Throughput |
|---------|-----------|
| Unstable + original #3544 | ~1,554K rps |
| Unstable + this PR | ~2,116K rps |

<details>
<summary>Diff vs original #3544 (perf fix)</summary>

```diff
diff --git a/src/io_threads.c b/src/io_threads.c
--- a/src/io_threads.c
+++ b/src/io_threads.c
@@ // IO thread handler
     case JOB_REQ_FREE_OBJ:
-        zfree(data);
+        decrRefCount(data);
         break;
@@ // tryOffloadFreeObjToIOThreads
-    /* We offload only the free of the ptr that may be allocated by the I/O thread.
-     * The object itself was allocated by the main thread and will be freed by the main thread. */
-    void *job = tagJob(sdsAllocPtr(objectGetVal(obj)), JOB_REQ_FREE_OBJ);
+    void *job = tagJob(obj, JOB_REQ_FREE_OBJ);
     if (unlikely(spmcEnqueue(&io_shared_inbox, job) == false)) return C_ERR;
-    objectSetVal(obj, NULL);
-    decrRefCount(obj);
     io_jobs_submitted++;
```

</details>

---------

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
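The difference between the two offload strategies can be illustrated with a single-threaded toy model. This is only a sketch under stated assumptions: `toy_obj`, `toy_queue`, and `toy_simulate` are hypothetical names, the "worker" is a plain function call rather than a real IO thread, and the counters only tally which side performs each `free`; they do not measure CPU time.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object shell that owns a separately allocated buffer. */
typedef struct {
    char *buf;
} toy_obj;

enum { TOY_QUEUE_CAP = 16 };

/* Hypothetical job queue; is_obj records whether a job is a whole object or a bare buffer. */
typedef struct {
    void *jobs[TOY_QUEUE_CAP];
    int is_obj[TOY_QUEUE_CAP];
    int len;
} toy_queue;

static int toy_frees_on_main;   /* frees performed by the simulated main thread */
static int toy_frees_on_worker; /* frees performed by the simulated IO thread */

static toy_obj *toy_obj_new(void) {
    toy_obj *o = malloc(sizeof(*o));
    assert(o != NULL);
    o->buf = malloc(128);
    assert(o->buf != NULL);
    return o;
}

/* Shape of the original change: queue only the buffer, free the shell locally. */
static void toy_offload_buffer_only(toy_queue *q, toy_obj *o) {
    q->jobs[q->len] = o->buf;
    q->is_obj[q->len] = 0;
    q->len++;
    free(o);
    toy_frees_on_main++;
}

/* Shape of the fix: queue the whole object; the worker frees shell and buffer. */
static void toy_offload_whole_object(toy_queue *q, toy_obj *o) {
    q->jobs[q->len] = o;
    q->is_obj[q->len] = 1;
    q->len++;
}

static void toy_worker_drain(toy_queue *q) {
    for (int i = 0; i < q->len; i++) {
        if (q->is_obj[i]) {
            toy_obj *o = q->jobs[i];
            free(o->buf);
            free(o);
            toy_frees_on_worker += 2;
        } else {
            free(q->jobs[i]);
            toy_frees_on_worker++;
        }
    }
    q->len = 0;
}

/* Pushes n objects through one strategy; returns how many frees stayed on the main thread. */
static int toy_simulate(int whole_object, int n) {
    toy_queue q = {0};
    assert(n >= 0 && n <= TOY_QUEUE_CAP);
    toy_frees_on_main = 0;
    toy_frees_on_worker = 0;
    for (int i = 0; i < n; i++) {
        toy_obj *o = toy_obj_new();
        if (whole_object) toy_offload_whole_object(&q, o);
        else toy_offload_buffer_only(&q, o);
    }
    toy_worker_drain(&q);
    return toy_frees_on_main;
}
```

In this model, `toy_simulate(0, 4)` leaves four frees on the main thread, while `toy_simulate(1, 4)` leaves none and moves all eight frees to the worker.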

#3964)

Database-level ACL #2309 introduced an `alldbs` rule that was emitted explicitly for all users, so previous versions could no longer parse ACL strings produced by later versions.

Omit `alldbs` in `ACLDescribeSelector()`, which is used in the `ACL SAVE/LIST` and `CONFIG REWRITE` command paths, so that downgrades remain possible when the new feature (`db=` and `resetdbs` rules) has not been used. Keep the `ACL GETUSER` output as is and return `alldbs` in the `databases` field, because of that command's field-value format.

Add a test to check that `ACL LIST` omits the implicit `alldbs`, and add checks to the existing `ACL SAVE` and `CONFIG REWRITE` tests.

Fixes #3915

Signed-off-by: Daniil Kashapov <daniil.kashapov.ykt@gmail.com>

codecov · 2026-06-11T11:06:20Z

Codecov Report

❌ Patch coverage is 96.77419% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 76.59%. Comparing base (71545e0) to head (9c36ed1).

Files with missing lines	Patch %	Lines
src/server.c	80.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              9.1    #3970      +/-   ##
==========================================
- Coverage   76.73%   76.59%   -0.14%     
==========================================
  Files         163      163              
  Lines       81098    81121      +23     
==========================================
- Hits        62228    62136      -92     
- Misses      18870    18985     +115

Files with missing lines	Coverage Δ
src/acl.c	`92.55% <ø> (-0.13%)`	⬇️
src/rdb.c	`77.62% <100.00%> (+0.69%)`	⬆️
src/replication.c	`86.41% <100.00%> (+0.48%)`	⬆️
src/server.h	`100.00% <ø> (ø)`
src/t_zset.c	`97.04% <100.00%> (+0.09%)`	⬆️
src/zipmap.c	`100.00% <100.00%> (ø)`
src/server.c	`89.43% <80.00%> (-0.03%)`	⬇️

... and 20 files with indirect coverage changes


This reverts the [commit](fdd9039) that was merged as part of the PR #3544 due to a performance regression observed [here](#3750) Signed-off-by: akash kumar <akumdev@amazon.com>

…3920)

## Problem

A crafted zipmap entry can set the value length to a value near `UINT32_MAX` so that the `l + e` sum (value length + one-byte free space) wraps in `unsigned int` arithmetic. The wrapped sum advances the validation cursor by a tiny amount, leaving `p` inside the buffer, so the `OUT_OF_RANGE` check passes and `zipmapValidateIntegrity` wrongly returns success. The field-length path has the same shape: advancing `p` by a ~4GB length wraps the pointer on 32-bit builds.

`zipmapValidateIntegrity` is always called with `deep=1` from `rdb.c` when loading `RDB_TYPE_HASH_ZIPMAP`, **including via `RESTORE`**, so any client with `RESTORE` access can submit a payload that passes validation. On 32-bit platforms this leads to out-of-bounds access during the subsequent zipmap→listpack conversion. On 64-bit the downstream `lpSafeToAdd` cap happens to reject it (the raw ~4GB length exceeds `LISTPACK_MAX_SAFETY_SIZE`), but the validator should not accept a malformed payload in the first place, since rejecting it is the function's sole job.

## Fix

Bounds-check the attacker-controlled length against the bytes remaining in the zipmap, in 64-bit space, **before** any pointer arithmetic, for both the field-length and value-length paths.

## Testing

- `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test exercising the full attack surface; asserts rejection and that the server stays up.
- Verified the test **fails on the pre-fix code** (validator accepts the value-length payload) and **passes after the fix**, confirmed by stashing the fix during the integration run.
- Full `integration/corrupt-dump` suite: 76 passed, 0 failed.

> [!NOTE]
> Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
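The wraparound described above can be reproduced in isolation. The following is a minimal sketch, not the code in `zipmapValidateIntegrity`: the function names and the `remaining` parameter are hypothetical, and the sketch assumes a platform where `unsigned int` is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy shape: the sum is computed in 32 bits and can wrap before the range check. */
static int toy_step_fits_wrapping(uint32_t remaining, uint32_t len, uint8_t free_bytes) {
    uint32_t step = len + free_bytes; /* reduced modulo 2^32 */
    return step <= remaining;
}

/* Fixed shape: widen to 64 bits first, so the comparison sees the true size. */
static int toy_step_fits_checked(uint32_t remaining, uint32_t len, uint8_t free_bytes) {
    uint64_t step = (uint64_t)len + free_bytes;
    return step <= (uint64_t)remaining;
}

static void toy_zipmap_demo(void) {
    /* A length near UINT32_MAX plus a small free-space byte wraps to a tiny step. */
    assert(toy_step_fits_wrapping(16, UINT32_MAX - 1, 5) == 1); /* wrongly accepted */
    assert(toy_step_fits_checked(16, UINT32_MAX - 1, 5) == 0);  /* rejected */
    assert(toy_step_fits_checked(16, 10, 5) == 1);              /* honest entry still fits */
}
```

Doing the comparison in 64-bit space before touching any pointer is what keeps the check meaningful on both 32-bit and 64-bit builds.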

github-actions · 2026-06-12T10:42:33Z

❌ Provenance Check Alert

Potential code similarities detected with upstream repository.

2026-06-12 10:42:32 [INFO] - matches redis/redis PR #9633 (similarity: 0.933, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
2026-06-12 10:42:32 [INFO] - matches redis/redis PR #15251 (similarity: 0.978, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl

This check was performed automatically by the Provenance Guard Action.

…#3921)

## Problem

A crafted `RESTORE` payload can store a `NAN` score in a listpack-encoded sorted set. The integrity validation (`lpValidateIntegrityAndDups`) only checks the listpack *structure* and member uniqueness, not score validity, so the payload is accepted on load.

When the sorted set is later converted to a skiplist (e.g. when it grows past `zset-max-listpack-entries`, or via any operation that triggers conversion), `zslInsertNode()` asserts the score is not `NAN` (`t_zset.c:260`) and the server aborts. **Any client with `RESTORE` access can remotely crash the server.**

The skiplist RDB format (`RDB_TYPE_ZSET` / `RDB_TYPE_ZSET_2`) already rejects `NAN` scores at load time (`rdb.c`, "Zset with NAN score detected"). The listpack format (`RDB_TYPE_ZSET_LISTPACK`) had no equivalent check.

## Reproduction

```
RESTORE k 0 "\x11\x19\x19\x00\x00\x00\x04\x00\x82m1\x03\x83nan\x04\x82m2\x03\x832.5\x04\xFF\x50\x00...."
# loads OK, then:
ZADD k 9 x
# forces listpack->skiplist conversion -> serverAssert(!isnan(node->score)) -> SIGABRT
```

## Fix

Add `zzlValidateScores()`, which scans the scores of a listpack zset after structural validation and rejects the payload if any score is `NAN`. Mirrors the existing skiplist-format check. `inf`/`-inf` and large finite scores remain accepted (only `NAN` is rejected), matching normal `ZADD` semantics.

## Testing

- `tests/integration/corrupt-dump.tcl`: a `RESTORE`-path test asserting rejection and that the server stays up.
- Verified the test **fails on the pre-fix code** (server crashes on conversion) and **passes after the fix**, by stashing the fix during the run.
- Confirmed valid zsets, including `inf`/`-inf`/large finite scores, still load and convert correctly.
- Full `integration/corrupt-dump` suite: 74 passed, 0 failed.

> [!NOTE]
> Found via structure-aware fuzzing of the RESTORE path. This issue was generated by AI but verified, with love, by a human.

Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
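The acceptance rule described in the fix (reject `NAN`, keep `inf`, `-inf`, and large finite values) can be sketched with standard C alone. This is an illustration rather than the body of `zzlValidateScores()`: the `toy_` functions are hypothetical, and scores are assumed to arrive as NUL-terminated strings instead of listpack entries.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns 1 if text is a complete number that is not NAN, 0 otherwise. */
static int toy_score_is_valid(const char *text) {
    char *end = NULL;
    double score = strtod(text, &end);
    if (end == text || *end != '\0') return 0; /* not a complete number */
    return !isnan(score);
}

/* Returns 1 only if every score in the array passes the check above. */
static int toy_scores_are_valid(const char *const *scores, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (!toy_score_is_valid(scores[i])) return 0;
    }
    return 1;
}

static void toy_zset_demo(void) {
    /* The two score strings visible in the reproduction payload above. */
    const char *const crafted[] = {"nan", "2.5"};
    const char *const honest[] = {"inf", "-inf", "1e300", "2.5"};
    assert(toy_scores_are_valid(crafted, 2) == 0); /* rejected because of "nan" */
    assert(toy_scores_are_valid(honest, 4) == 1);  /* infinities remain acceptable */
}
```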

The Case 3 portion of the test was flaky: after a single round of `CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state and `wait_for_cluster_state fail` would time out with `Cluster node 1 cluster_state:ok`.

The race is between R0's local DELSLOTS and the gossip already in flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS packet from R0 (whose myslots still claims slot 0) hits the isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds slot 0 back to R0 on R1. See:

```
if (isSlotUnclaimed(j) || server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
    clusterSlotFailoverGranted(j)) {
    ...
    clusterDelSlot(j);
    clusterAddSlot(sender, j);
    ...
}
```

R0's subsequent "no longer claiming" PINGs cannot undo this, because that path only sets owner_not_claiming_slot and never clears slots[j]:

```
if (server.cluster->slots[j] == sender) {
    /* The slot is currently bound to the sender but the sender is no longer
     * claiming it. We don't want to unbind the slot yet as it can cause the cluster
     * to move to FAIL state and also throw client error. Keeping the slot bound to
     * the previous owner will cause a few client side redirects, but won't throw
     * any errors. We will keep track of the uncertainty in ownership to avoid
     * propagating misinformation about this slot's ownership using UPDATE
     * messages. */
    bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
}
```

Combined with clusterUpdateState's full-coverage check looking only at slots[j] == NULL, R1 stays at cluster OK forever.

```
if (server.cluster->slots[j] == NULL || ...) {
    new_state = CLUSTER_FAIL;
    ...
}
```

Rather than fighting the protocol's intentional asymmetry around "soft delete" via gossip, just retry the DELSLOTS pass until all three nodes converge to FAIL. This keeps the test focused on the CLUSTERSCAN error semantics it actually wants to verify.

This closes #3891. The test was added in #3674.

Signed-off-by: Binbin <binloveplay1314@qq.com>
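The race and the effect of retrying can be modeled with a one-slot toy state machine. This is a heavily simplified sketch, not cluster code: `toy_slot` and the `toy_` functions are hypothetical, config epochs and failover grants are ignored, and the stale packet is assumed to arrive only around the first DELSLOTS round.

```c
#include <assert.h>

typedef struct {
    int owner;              /* node id, or -1 when the slot is unbound */
    int owner_not_claiming; /* set when the bound owner stops claiming the slot */
} toy_slot;

/* Local DELSLOTS on the receiving node: unbind the slot. */
static void toy_delslot(toy_slot *s) { s->owner = -1; }

/* Gossip from `sender`, which either still claims the slot or no longer does. */
static void toy_on_gossip(toy_slot *s, int sender, int sender_claims_slot) {
    if (sender_claims_slot) {
        if (s->owner == -1) s->owner = sender; /* unclaimed fast path: rebind */
    } else if (s->owner == sender) {
        s->owner_not_claiming = 1; /* soft delete: the slot stays bound */
    }
}

/* Full-coverage check: only an unbound slot moves the node to FAIL. */
static int toy_state_is_fail(const toy_slot *s) { return s->owner == -1; }

/* Runs the given number of DELSLOTS rounds on a slot initially owned by node 0.
 * Returns 1 if the receiving node ends in FAIL state. */
static int toy_converges_to_fail(int delslots_rounds) {
    toy_slot s = {0, 0};
    for (int round = 0; round < delslots_rounds; round++) {
        toy_delslot(&s);
        if (round == 0) toy_on_gossip(&s, 0, 1); /* stale pre-DELSLOTS packet */
        toy_on_gossip(&s, 0, 0);                 /* later "not claiming" PING */
    }
    return toy_state_is_fail(&s);
}

static void toy_cluster_demo(void) {
    assert(toy_converges_to_fail(1) == 0); /* single round: stuck at OK */
    assert(toy_converges_to_fail(2) == 1); /* retry: converges to FAIL */
}
```

In this model a single round leaves the slot rebound to node 0 with only the soft-delete flag set, while a second round unbinds it after the stale packet has already been consumed.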

github-actions · 2026-06-13T10:10:40Z

❌ Provenance Check Alert

Potential code similarities detected with upstream repository.

2026-06-13 10:10:39 [INFO] - matches redis/redis PR #9633 (similarity: 0.868, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
2026-06-13 10:10:39 [INFO] - matches redis/redis PR #15251 (similarity: 0.906, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
2026-06-13 10:10:39 [INFO] - matches redis/redis PR #15214 (similarity: 0.868, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl

This check was performed automatically by the Provenance Guard Action.

github-actions · 2026-06-14T03:27:51Z

❌ Provenance Check Alert

Potential code similarities detected with upstream repository.

2026-06-14 03:27:50 [INFO] - matches redis/redis PR #9633 (similarity: 0.868, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
2026-06-14 03:27:50 [INFO] - matches redis/redis PR #15251 (similarity: 0.906, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl
2026-06-14 03:27:50 [INFO] - matches redis/redis PR #15214 (similarity: 0.868, method: file_simhash+deep); file pairs: tests/integration/corrupt-dump.tcl <- tests/integration/corrupt-dump.tcl

This check was performed automatically by the Provenance Guard Action.

## Summary

`addReplyCommandSubCommands` unconditionally called `addReplySetLen(c, 0)` when a command has no subcommands, emitting a RESP3 Set type prefix (`~0`) regardless of the `use_map` parameter. The non-empty path (below it) already branches correctly on `use_map`; the empty early-return was simply missing the same logic.

In RESP3, `COMMAND INFO <cmd>` returns the subcommands field as a Set (`~0`) instead of an Array (`*0`) for any command without subcommands (e.g. PING). Strict RESP3 client libraries that dispatch on collection type will misinterpret the response. Not visible in RESP2 since both Set and Array use the `*` prefix there.

## Fix

Apply the same `use_map` branch to the empty case:

- `addReplyMapLen(c, 0)` when `use_map=1`
- `addReplyArrayLen(c, 0)` otherwise

## Test

Added a `readraw` integration test in `tests/unit/introspection-2.tcl` that inspects the raw wire-level type byte for the subcommands field of `COMMAND INFO ping` in RESP3 mode, asserting `*0` (Array) rather than `~0` (Set).

Signed-off-by: Rick Ramsay <49293857+rickrams@users.noreply.github.com>
Signed-off-by: rickrams <rickrams@amazon.com>
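The wire-level difference comes down to the first byte of the reply. The sketch below is illustrative only: it does not use the server's reply functions, the `toy_` names are hypothetical, and it assumes the standard RESP type prefixes (`*` for Array, `%` for Map, `~` for Set), with RESP2 falling back to `*` for every collection.

```c
#include <assert.h>

/* Pre-fix shape: always a Set in RESP3, whatever the caller asked for. */
static const char *toy_buggy_empty_reply(int resp) {
    return resp >= 3 ? "~0\r\n" : "*0\r\n";
}

/* Post-fix shape: the empty case follows the same use_map branch as the non-empty case. */
static const char *toy_fixed_empty_reply(int resp, int use_map) {
    if (use_map && resp >= 3) return "%0\r\n";
    return "*0\r\n";
}

static void toy_resp_demo(void) {
    assert(toy_buggy_empty_reply(3)[0] == '~');    /* Set prefix leaks out in RESP3 */
    assert(toy_buggy_empty_reply(2)[0] == '*');    /* invisible in RESP2 */
    assert(toy_fixed_empty_reply(3, 0)[0] == '*'); /* Array when use_map is 0 */
    assert(toy_fixed_empty_reply(3, 1)[0] == '%'); /* Map when use_map is 1 */
}
```

A client that dispatches on that first byte sees a different collection type before and after the fix, which is why the test inspects the raw reply.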

`off_t` (64-bit), but were read into `int` (32-bit) locals in `genValkeyInfoString()` and `handleBioThreadFinishedRDBDownload()`. This causes INFO replication to report a negative `master_sync_total_bytes` during bio disk-based sync when the RDB exceeds 2GB.

Fix: change the local variable types from `int` to `off_t`.

Signed-off-by: chx9 <lovelypiska@outlook.com>
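The negative value can be reproduced with plain integer arithmetic. The sketch below is an assumption-laden illustration: `int64_t` stands in for `off_t`, the function name is hypothetical, and the wrap is computed explicitly to show what a two's-complement 32-bit `int` would hold, rather than relying on an implementation-defined narrowing conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the value a 32-bit two's-complement int would hold after storing `bytes`. */
static int64_t toy_seen_through_int32(int64_t bytes) {
    uint32_t low = (uint32_t)bytes; /* keep the low 32 bits, well-defined modulo 2^32 */
    return low >= 0x80000000u ? (int64_t)low - 4294967296LL : (int64_t)low;
}

static void toy_truncation_demo(void) {
    const int64_t three_gib = 3 * 1024LL * 1024 * 1024;
    assert(toy_seen_through_int32(1000) == 1000);    /* small sizes are unaffected */
    assert(toy_seen_through_int32(three_gib) < 0);   /* a 3 GiB transfer reads as negative */
}
```

Any size of 2 GiB or more sets the top bit of the low 32 bits, which is why the reported total turns negative once the RDB crosses that threshold.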

sarthakaggarwal97

ok this has a problem because #3938 was applied first then reverted via #3544

sarthakaggarwal97 · 2026-06-15T15:54:31Z

I think #3544 was added to backporting list after #3938 was already applied. I will close this PR for now, and let it rebuild since the number of commits is low and it's not straightforward to rearrange the commits.

roshkhatri and others added 2 commits June 11, 2026 10:39

dvkashapov approved these changes Jun 11, 2026

Nikhil-Manglore approved these changes Jun 11, 2026

akashkgit and others added 2 commits June 12, 2026 10:35

Revert "IO-Threads redesign cleanup work (#3544)" (#3756)

d9433a4


madolson and others added 2 commits June 13, 2026 10:08

rickrams and others added 2 commits June 14, 2026 10:24

sarthakaggarwal97 approved these changes Jun 15, 2026

sarthakaggarwal97 reviewed Jun 15, 2026

sarthakaggarwal97 requested changes Jun 15, 2026

sarthakaggarwal97 closed this Jun 15, 2026

valkeyrie-ops Bot deleted the agent/backport/sweep/9.1 branch June 15, 2026 16:01

valkeyrie-ops[bot] wants to merge 8 commits into 9.1 from agent/backport/sweep/9.1


9 participants
