DRAFT: Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread #3990

Draft
rainsupreme wants to merge 2 commits into valkey-io:unstable from valkey-rainfall:fix/commandlog-copy-avoid-perf

Conversation

@rainsupreme rainsupreme commented Jun 15, 2026

Contributor

Problem

PR #2652 introduced a 14–17% throughput regression on ARM (Graviton 3) and 8% on Intel (Sapphire Rapids) for GET workloads using copy avoidance with io-threads.

Root cause: sdslen(obj->ptr) in addReplyBulk() dereferences a random heap pointer on every reply for net_output_bytes_curr_cmd tracking. With copy avoidance, this value hasn't been touched since the client stored it — guaranteed L2/L3 cache miss at ~80ns each, consuming 15% of main-thread cycles at 2M rps.

perf diff confirms: addReplyBulk went from 1% → 14.8% (ARM) / 8.95% (Intel) of main-thread cycles.
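To make the root cause concrete, here is a minimal sketch, not the real sds implementation. It assumes a simplified string type whose length sits in a header right before the character data, so reading the length means loading from the value's own heap allocation. The names `toy_sds_hdr`, `toy_sds_new`, `toy_sds_len`, `toy_sds_free`, and `stall_fraction` are hypothetical. The last helper does the back-of-envelope arithmetic behind the figures above: one ~80 ns miss per reply at 2M rps is about 0.16 of a core, in line with the ~15% measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for an sds-style string: the length lives in a header
 * stored immediately before the character data. This is NOT the real sds
 * layout, only an illustration of why reading the length touches the value. */
typedef struct {
    uint32_t len;
    char buf[];
} toy_sds_hdr;

/* Allocate header and data in one block; return a pointer to the data. */
char *toy_sds_new(const char *init) {
    size_t n = strlen(init);
    toy_sds_hdr *h = malloc(sizeof(*h) + n + 1);
    if (!h) return NULL;
    h->len = (uint32_t)n;
    memcpy(h->buf, init, n + 1);
    return h->buf;
}

/* Reading the length is a load from the value's allocation. If that cache
 * line is cold, this single load is where the miss is paid. */
size_t toy_sds_len(const char *s) {
    assert(s != NULL);
    const toy_sds_hdr *h = (const toy_sds_hdr *)(s - offsetof(toy_sds_hdr, buf));
    return h->len;
}

void toy_sds_free(char *s) {
    if (s) free(s - offsetof(toy_sds_hdr, buf));
}

/* Fraction of one core spent stalled if every request pays one miss. */
double stall_fraction(double miss_ns, double requests_per_sec) {
    return miss_ns * 1e-9 * requests_per_sec;
}
```

With the figures from the description, `stall_fraction(80.0, 2e6)` evaluates to roughly 0.16.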

Fix

The IO thread already computes sdslen when writing to the socket (trackBufReferences) — the cache miss is unavoidable there. This PR eliminates the redundant main-thread access with a two-phase deferred commandlog check:

  1. Remove sdslen/digits10 from main thread — set track_bytes=0 for BULK_STR_REF payloads, delegating byte counting to the IO thread.

  2. Stash argv refs at command time — when copy avoidance is active and commandlog-reply-larger-than >= 0, incrRefCount the argv into a per-client inline array (CMDLOG_INLINE_ARGV_MAX=4, zero allocation for GET/GETRANGE). Skip the large-reply check since reply size is unknown.

  3. Accumulate bytes in IO thread — new io_reply_len_cmdlog atomic counter alongside existing io_tracked_reply_len.

  4. Check threshold at write completion: postWriteToClient calls commandlogCheckDeferredLargeReply() once all replies are flushed, logging with the stashed argv for full command attribution, then releases the refs.
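The four steps above can be sketched as a small self-contained model. This is an illustration under stated assumptions, not the code in this PR: `toy_obj`, `toy_client`, `toy_stash_argv`, `toy_io_account`, and `toy_check_deferred` are hypothetical names, the refcount is a plain int, and commands with more than four arguments are skipped here because the description does not say how they are handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define STASH_MAX 4 /* mirrors CMDLOG_INLINE_ARGV_MAX=4 from the description */

typedef struct { int refcount; } toy_obj; /* hypothetical stand-in for robj */

typedef struct {
    toy_obj *stash[STASH_MAX];  /* inline argv refs, no heap allocation */
    int stash_argc;             /* 0 means nothing is stashed */
    atomic_size_t io_reply_len; /* bytes counted by the IO thread */
} toy_client;

/* Main thread, command time: take a reference on each argv entry so it
 * survives the per-command reset. Returns 1 if a stash was made. */
int toy_stash_argv(toy_client *c, toy_obj **argv, int argc, long long threshold) {
    if (threshold < 0 || argc > STASH_MAX) return 0; /* disabled, or not modelled */
    for (int i = 0; i < argc; i++) {
        argv[i]->refcount++;
        c->stash[i] = argv[i];
    }
    c->stash_argc = argc;
    return 1;
}

/* IO thread, socket write: add the bytes it had to measure anyway. */
void toy_io_account(toy_client *c, size_t bytes) {
    atomic_fetch_add_explicit(&c->io_reply_len, bytes, memory_order_relaxed);
}

/* Main thread, write completion: read and clear the counter, compare it to
 * the threshold, release the refs. Returns 1 if the reply counts as large. */
int toy_check_deferred(toy_client *c, long long threshold) {
    if (c->stash_argc == 0) return 0;
    size_t total = atomic_exchange_explicit(&c->io_reply_len, 0, memory_order_acquire);
    int log_it = threshold >= 0 && total > (size_t)threshold;
    for (int i = 0; i < c->stash_argc; i++) c->stash[i]->refcount--;
    c->stash_argc = 0;
    return log_it;
}
```

The main thread never reads the value's length in this model; it only bumps refcounts on objects it has just parsed, and the size comparison happens after the IO thread has done the measuring.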

Results

ARM Graviton 3, 128B GET, io-threads=7, P=10, 3M random keys, 5 reps:

| Build | Throughput | IPC | Backend Stalls | Δ |
|---|---|---|---|---|
| Pre-regression (29d3244) | 2,179,643 rps | 3.93 | 43% | |
| Current HEAD (d0ffbabb0) | 1,874,527 rps | 1.56 | 60% | −14.0% |
| This PR (1e21e024) | 2,194,660 rps | 2.42 | 41% | +17.1% |

Full recovery (+0.7% above pre-regression). CV improved from 0.73% → 0.21%.
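The percentages use different baselines, which is easy to miss: −14.0% compares HEAD to pre-regression, +17.1% compares this PR to HEAD, and +0.7% compares this PR to pre-regression. A small helper (the name `pct_change` is only illustrative) reproduces all three from the throughput figures above:

```c
#include <assert.h>

/* Relative change from `from` to `to`, in percent. */
double pct_change(double from, double to) {
    assert(from != 0.0);
    return (to - from) / from * 100.0;
}

/* Throughput figures from the results above, in requests per second. */
static const double PRE_REGRESSION_RPS = 2179643.0;
static const double HEAD_RPS = 1874527.0;
static const double THIS_PR_RPS = 2194660.0;
```

`pct_change(PRE_REGRESSION_RPS, HEAD_RPS)` is about −14.0, `pct_change(HEAD_RPS, THIS_PR_RPS)` about +17.1, and `pct_change(PRE_REGRESSION_RPS, THIS_PR_RPS)` about +0.7.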

Design

The fundamental timing problem: with copy avoidance, argv is available at command time but reply size isn't known until the IO thread writes. By write completion time, argv has been freed by resetClient.

Solution: bridge the gap with a lightweight ref-counted stash. Cost per copy-avoidance command on main thread:

  • incrRefCount (plain integer increment, L1 hit — robj was just accessed during parsing)
  • 2× pointer store into inline array (no heap allocation)
  • decrRefCount in postWriteToClient

When commandlog-reply-larger-than = -1 (disabled), the stash path is never entered — zero overhead.

Trade-offs

  • 40 bytes per client struct (4 inline robj* slots + argc + atomic counter) — always allocated regardless of copy-avoidance usage. Negligible at any realistic connection count.
  • Commandlog timestamp is at write-completion rather than command-completion (typically one event loop iteration later).
  • Pipelining: if multiple copy-avoidance commands pipeline before a write completion, only the last command's argv is logged with the combined byte count.
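The pipelining trade-off can be shown with a toy model of the single-stash behaviour. This is a sketch of the described effect, not the PR's code; `toy_conn`, `toy_on_command`, `toy_on_write`, and `toy_on_flush` are hypothetical, and a string stands in for the stashed argv.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *stashed_cmd; /* stand-in for the stashed argv */
    size_t pending_bytes;    /* stand-in for the IO-thread byte counter */
} toy_conn;

/* Command time: a later command overwrites the earlier stash. */
void toy_on_command(toy_conn *c, const char *cmd) {
    c->stashed_cmd = cmd;
}

/* IO thread: bytes from every pipelined reply land in the same counter. */
void toy_on_write(toy_conn *c, size_t bytes) {
    c->pending_bytes += bytes;
}

/* Write completion: one result, attributed to whatever is stashed now.
 * Returns the attributed command and writes the combined byte count. */
const char *toy_on_flush(toy_conn *c, size_t *bytes_out) {
    assert(bytes_out != NULL);
    const char *cmd = c->stashed_cmd;
    *bytes_out = c->pending_bytes;
    c->stashed_cmd = NULL;
    c->pending_bytes = 0;
    return cmd;
}
```

Two pipelined commands with 100-byte and 200-byte replies come out as a single result in this model: the second command, with 300 bytes attributed to it.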

Fixes the regression from #2652. Supersedes #3646 (commandlog can stay enabled by default).

coderabbitai Bot commented Jun 15, 2026

Important

Review skipped

Ignore keyword(s) in the title.

⛔ Ignored keywords (1)
  • draft

rainsupreme commented Jun 15, 2026

Contributor Author

not really ready for review, but the perf makes it still interesting if extra allocations don't kill perf

to do: check mget perf (more args), check on multiple commands per write cycle - might overwrite the stashed size/args.

edit: on further investigation, MGET didn't have the original regression; it seems the data is still hot in memory. Multiple commands per write cycle, however, is a real issue, contrary to my previous understanding: the current implementation will fail to log some commands and/or sum the sizes of multiple commands together in this scenario. This seems difficult to optimize around.

codecov Bot commented Jun 15, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.71%. Comparing base (d0ffbab) to head (bb75961).
⚠️ Report is 17 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3990      +/-   ##
============================================
+ Coverage     76.66%   76.71%   +0.04%     
============================================
  Files           162      162              
  Lines         80733    80820      +87     
============================================
+ Hits          61897    61998     +101     
+ Misses        18836    18822      -14     
| Files with missing lines | Coverage Δ |
|---|---|
| src/commandlog.c | 96.42% <100.00%> (+0.63%) ⬆️ |
| src/networking.c | 92.25% <100.00%> (+<0.01%) ⬆️ |
| src/server.h | 100.00% <ø> (ø) |

... and 27 files with indirect coverage changes

@rainsupreme rainsupreme changed the title Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread DRAFT: Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread Jun 15, 2026
PR valkey-io#2652 added net_output_bytes_curr_cmd tracking in addReplyBulk's
copy-avoidance path. This calls sdslen(obj->ptr) on every reply, dereferencing
a random heap pointer that misses in L2/L3 each time. At ~2M rps this adds 15%
overhead on ARM (Graviton 3) and 8% on Intel (Sapphire Rapids).

The root cause is a timing mismatch: copy avoidance defers value access to
the IO thread, but the commandlog check needs reply size while argv is still
available. By the time postWriteToClient fires (size known), argv is freed.

Fix with two-phase deferred commandlog:

1. In addReplyBulk: never call sdslen() for BULK_STR_REF payloads. Set
   track_bytes=0, delegating byte counting to the IO thread which already
   computes this in trackBufReferences().

2. In commandlogPushCurrentCommand: when copy avoidance is active (buf_encoded)
   and commandlog-reply-larger-than >= 0, stash argv refs via incrRefCount into
   a per-client inline array (CMDLOG_INLINE_ARGV_MAX=4, zero allocation for GET/
   GETRANGE). Skip the large-reply check since exact size is unknown.

3. In trackBufReferences (IO thread): accumulate reply bytes into a new
   io_reply_len_cmdlog atomic counter alongside io_tracked_reply_len.

4. In postWriteToClient (main thread): once all replies are flushed, call
   commandlogCheckDeferredLargeReply() which checks io_reply_len_cmdlog against
   the threshold, logs with the stashed argv for full command attribution, then
   releases the refs.

Cost per copy-avoidance command: 2 refcount increments + 2 pointer stores +
2 refcount decrements (all L1 cache hits since robj was just accessed during
command parsing). Zero heap allocation for argc <= 4.

Results (ARM Graviton 3, 128B GET, io-threads=7, P=10, 5 reps):
  Before fix (HEAD):   1,874,527 rps (IPC 1.56, 60% backend stalls)
  After fix:           2,194,660 rps (IPC 2.42, 41% backend stalls)  +17.1% vs HEAD
  Pre-regression:      2,179,643 rps (IPC 3.93, 43% backend stalls)  fix is +0.7% above this

Full commandlog functionality preserved: exact byte count, full command+args
attribution, correct threshold gating, proper cleanup on client disconnect.

Signed-off-by: Rain Valentine <rsg000@gmail.com>
@rainsupreme rainsupreme force-pushed the fix/commandlog-copy-avoid-perf branch from 1e21e02 to e40767c Compare June 16, 2026 00:15
…st-wins

When multiple copy-avoidance commands are pipelined, the previous approach
overwrote the argv stash for each subsequent command, only logging the last
one at write completion. This lost commandlog entries for commands 1..N-1.

New approach for pipelining:
- First command in a batch: stash argv (defer check to postWriteToClient)
- Subsequent commands: detect pipelining (cmdlog_argc > 0), flush the
  previous stash immediately, then do a synchronous sdslen check for the
  current command. The sdslen is cheap here because pipelined values are
  cache-hot from sequential addReplyBulk calls.

addReplyBulk now conditionally computes reply size when pipelining is
detected (cmdlog_argc > 0 && threshold >= 0), populating
net_output_bytes_curr_cmd for the immediate check.

Result: all pipelined commands get individual commandlog entries while
preserving the zero-sdslen fast path for single/first commands.

Signed-off-by: Rain Valentine <rsg000@gmail.com>
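The pipelining rule in the second commit message can be modelled as a small decision function. This is a sketch under assumptions, not the committed code: `toy_pclient`, `toy_flush_stash`, and `toy_command_done` are hypothetical, an integer id stands in for argv, and the caller passes `reply_len` where the real code would compute sdslen. The commit message does not say what a third pipelined command does, so this model simply treats it as a new first command.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LOG_MAX 8

typedef struct {
    int stashed_cmd;         /* 0 = no stash; stands in for cmdlog_argc > 0 */
    size_t io_bytes;         /* bytes the IO thread has counted so far */
    int logged[TOY_LOG_MAX]; /* ids of commands that produced a log entry */
    int nlogged;
} toy_pclient;

static void toy_log(toy_pclient *c, int cmd_id) {
    if (c->nlogged < TOY_LOG_MAX) c->logged[c->nlogged++] = cmd_id;
}

/* Settle the stashed command using whatever bytes have been counted. */
void toy_flush_stash(toy_pclient *c, size_t threshold) {
    if (c->stashed_cmd == 0) return;
    if (c->io_bytes > threshold) toy_log(c, c->stashed_cmd);
    c->stashed_cmd = 0;
    c->io_bytes = 0;
}

/* Command time. reply_len is only consulted on the pipelined path. */
void toy_command_done(toy_pclient *c, int cmd_id, size_t reply_len, size_t threshold) {
    if (c->stashed_cmd == 0) {
        c->stashed_cmd = cmd_id; /* first in batch: defer to write completion */
        return;
    }
    /* Pipelining detected: flush the earlier stash now. If the IO thread has
     * not run yet, io_bytes is still 0 and the first command is under-reported. */
    toy_flush_stash(c, threshold);
    if (reply_len > threshold) toy_log(c, cmd_id); /* synchronous size check */
}
```

If the IO thread has already counted bytes when the second command arrives, both commands get their own entry. If it has not, only the second one is logged.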
@dvkashapov

Member

Good stuff, I like the idea, please ping me when it's ready for review!
My only concern here is under-reporting on the first of pipelined commands, do we break any guarantees here?

 * Log it now with whatever bytes have been tracked so far (may be 0
 * if the IO thread hasn't run yet — accepted as under-report for the
 * first command).
