DRAFT: Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread by rainsupreme · Pull Request #3990 · valkey-io/valkey

rainsupreme · 2026-06-15T22:57:03Z

Problem

PR #2652 introduced a 14–17% throughput regression on ARM (Graviton 3) and 8% on Intel (Sapphire Rapids) for GET workloads using copy avoidance with io-threads.

Root cause: sdslen(obj->ptr) in addReplyBulk() dereferences a random heap pointer on every reply for net_output_bytes_curr_cmd tracking. With copy avoidance, this value hasn't been touched since the client stored it — guaranteed L2/L3 cache miss at ~80ns each, consuming 15% of main-thread cycles at 2M rps.

perf diff confirms: addReplyBulk went from 1% → 14.8% (ARM) / 8.95% (Intel) of main-thread cycles.

Fix

The IO thread already computes sdslen when writing to the socket (trackBufReferences) — the cache miss is unavoidable there. This PR eliminates the redundant main-thread access with a two-phase deferred commandlog check:

Remove sdslen/digits10 from main thread — set track_bytes=0 for BULK_STR_REF payloads, delegating byte counting to the IO thread.
Stash argv refs at command time — when copy avoidance is active and commandlog-reply-larger-than >= 0, incrRefCount the argv into a per-client inline array (CMDLOG_INLINE_ARGV_MAX=4, zero allocation for GET/GETRANGE). Skip the large-reply check since reply size is unknown.
Accumulate bytes in IO thread — new io_reply_len_cmdlog atomic counter alongside existing io_tracked_reply_len.
Check threshold at write completion — postWriteToClient calls commandlogCheckDeferredLargeReply() once all replies are flushed, logging with the stashed argv for full command attribution, then releases the refs.

Results

ARM Graviton 3, 128B GET, io-threads=7, P=10, 3M random keys, 5 reps:

Build	Throughput	IPC	Backend Stalls	Δ
Pre-regression (`29d3244`)	2,179,643 rps	3.93	43%	—
Current HEAD (`d0ffbabb0`)	1,874,527 rps	1.56	60%	−14.0%
This PR (`1e21e024`)	2,194,660 rps	2.42	41%	+17.1%

Full recovery (+0.7% above pre-regression). CV improved from 0.73% → 0.21%.

Design

The fundamental timing problem: with copy avoidance, argv is available at command time but reply size isn't known until the IO thread writes. By write completion time, argv has been freed by resetClient.

Solution: bridge the gap with lightweight ref-counted stash. Cost per copy-avoidance command on main thread:

2× incrRefCount (plain integer increment, L1 hit — robj was just accessed during parsing)
2× pointer store into inline array (no heap allocation)
2× decrRefCount in postWriteToClient

When commandlog-reply-larger-than = -1 (disabled), the stash path is never entered — zero overhead.

Trade-offs

40 bytes per client struct (4 inline robj* slots + argc + atomic counter) — always allocated regardless of copy-avoidance usage. Negligible at any realistic connection count.
Commandlog timestamp is at write-completion rather than command-completion (typically one event loop iteration later).
Pipelining: if multiple copy-avoidance commands pipeline before a write completion, only the last command's argv is logged with the combined byte count.

Fixes the regression from #2652. Supersedes #3646 (commandlog can stay enabled by default).

coderabbitai · 2026-06-15T22:57:42Z

Important

Review skipped

Ignore keyword(s) in the title.

⛔ Ignored keywords (1)

draft

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 76ec379c-e390-4d34-b6bf-8741e3efdb84

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

rainsupreme · 2026-06-15T23:08:26Z

not really ready for review, but the perf makes it still interesting if extra allocations don't kill perf

to do: check mget perf (more args), check on multiple commands per write cycle - might overwrite the stashed size/args.

edit: on further investigation, MGET didn't have the original regression - seems that the data is still hot in memory. for multiple commands per write cycle - this is actually an issue, contrary to my previous understanding. The current implementation will fail to log some commands, and/or possibly sum the size of multiple commands together in this scenario. This seems difficult to optimize out of.

codecov · 2026-06-15T23:24:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.71%. Comparing base (d0ffbab) to head (bb75961).
⚠️ Report is 17 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3990      +/-   ##
============================================
+ Coverage     76.66%   76.71%   +0.04%     
============================================
  Files           162      162              
  Lines         80733    80820      +87     
============================================
+ Hits          61897    61998     +101     
+ Misses        18836    18822      -14

Files with missing lines	Coverage Δ
src/commandlog.c	`96.42% <100.00%> (+0.63%)`	⬆️
src/networking.c	`92.25% <100.00%> (+<0.01%)`	⬆️
src/server.h	`100.00% <ø> (ø)`

... and 27 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

PR valkey-io#2652 added net_output_bytes_curr_cmd tracking in addReplyBulk's copy- avoidance path. This calls sdslen(obj->ptr) on every reply, dereferencing a random heap pointer that causes an L2/L3 cache miss at ~2M rps — adding 15% overhead on ARM (Graviton 3) and 8% on Intel (Sapphire Rapids). The root cause is a timing mismatch: copy avoidance defers value access to the IO thread, but the commandlog check needs reply size while argv is still available. By the time postWriteToClient fires (size known), argv is freed. Fix with two-phase deferred commandlog: 1. In addReplyBulk: never call sdslen() for BULK_STR_REF payloads. Set track_bytes=0, delegating byte counting to the IO thread which already computes this in trackBufReferences(). 2. In commandlogPushCurrentCommand: when copy avoidance is active (buf_encoded) and commandlog-reply-larger-than >= 0, stash argv refs via incrRefCount into a per-client inline array (CMDLOG_INLINE_ARGV_MAX=4, zero allocation for GET/ GETRANGE). Skip the large-reply check since exact size is unknown. 3. In trackBufReferences (IO thread): accumulate reply bytes into a new io_reply_len_cmdlog atomic counter alongside io_tracked_reply_len. 4. In postWriteToClient (main thread): once all replies are flushed, call commandlogCheckDeferredLargeReply() which checks io_reply_len_cmdlog against the threshold, logs with the stashed argv for full command attribution, then releases the refs. Cost per copy-avoidance command: 2 refcount increments + 2 pointer stores + 2 refcount decrements (all L1 cache hits since robj was just accessed during command parsing). Zero heap allocation for argc <= 4. Results (ARM Graviton 3, 128B GET, io-threads=7, P=10, 5 reps): Before fix (HEAD): 1,874,527 rps (IPC 1.56, 60% backend stalls) After fix: 2,194,660 rps (IPC 2.42, 41% backend stalls) +17.1% Pre-regression: 2,179,643 rps (IPC 3.93, 43% backend stalls) +0.7% Full commandlog functionality preserved: exact byte count, full command+args attribution, correct threshold gating, proper cleanup on client disconnect. Signed-off-by: Rain Valentine <rsg000@gmail.com>

…st-wins When multiple copy-avoidance commands are pipelined, the previous approach overwrote the argv stash for each subsequent command, only logging the last one at write completion. This lost commandlog entries for commands 1..N-1. New approach for pipelining: - First command in a batch: stash argv (defer check to postWriteToClient) - Subsequent commands: detect pipelining (cmdlog_argc > 0), flush the previous stash immediately, then do a synchronous sdslen check for the current command. The sdslen is cheap here because pipelined values are cache-hot from sequential addReplyBulk calls. addReplyBulk now conditionally computes reply size when pipelining is detected (cmdlog_argc > 0 && threshold >= 0), populating net_output_bytes_curr_cmd for the immediate check. Result: all pipelined commands get individual commandlog entries while preserving the zero-sdslen fast path for single/first commands. Signed-off-by: Rain Valentine <rsg000@gmail.com>

dvkashapov · 2026-06-16T09:22:22Z

Good stuff, I like the idea, please ping me when it's ready for review!
My only concern here is under-reporting on the first of pipelined commands, do we break any guarantees here?

 * Log it now with whatever bytes have been tracked so far (may be 0
 * if the IO thread hasn't run yet — accepted as under-report for the
 * first command).

github-actions Bot assigned rainsupreme Jun 15, 2026

rainsupreme changed the title ~~Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread~~ DRAFT: Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread Jun 15, 2026

rainsupreme force-pushed the fix/commandlog-copy-avoid-perf branch from 1e21e02 to e40767c Compare June 16, 2026 00:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DRAFT: Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread#3990

DRAFT: Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread#3990
rainsupreme wants to merge 2 commits into
valkey-io:unstablefrom
valkey-rainfall:fix/commandlog-copy-avoid-perf

rainsupreme commented Jun 15, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

Review skipped

Uh oh!

rainsupreme commented Jun 15, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

dvkashapov commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rainsupreme commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Results

Design

Trade-offs

Uh oh!

coderabbitai Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

rainsupreme commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dvkashapov commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rainsupreme commented Jun 15, 2026 •

edited

Loading

coderabbitai Bot commented Jun 15, 2026 •

edited

Loading

rainsupreme commented Jun 15, 2026 •

edited

Loading

codecov Bot commented Jun 15, 2026 •

edited

Loading