DRAFT: Fix copy-avoidance throughput regression by deferring reply size tracking to IO thread#3990
Conversation
|
Important Review skippedIgnore keyword(s) in the title. ⛔ Ignored keywords (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
not really ready for review, but the perf makes it still interesting if extra allocations don't kill perf to do: check mget perf (more args), check on multiple commands per write cycle - might overwrite the stashed size/args. edit: on further investigation, MGET didn't have the original regression - seems that the data is still hot in memory. for multiple commands per write cycle - this is actually an issue, contrary to my previous understanding. The current implementation will fail to log some commands, and/or possibly sum the size of multiple commands together in this scenario. This seems difficult to optimize out of. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3990 +/- ##
============================================
+ Coverage 76.66% 76.71% +0.04%
============================================
Files 162 162
Lines 80733 80820 +87
============================================
+ Hits 61897 61998 +101
+ Misses 18836 18822 -14
🚀 New features to boost your workflow:
|
PR valkey-io#2652 added net_output_bytes_curr_cmd tracking in addReplyBulk's copy- avoidance path. This calls sdslen(obj->ptr) on every reply, dereferencing a random heap pointer that causes an L2/L3 cache miss at ~2M rps — adding 15% overhead on ARM (Graviton 3) and 8% on Intel (Sapphire Rapids). The root cause is a timing mismatch: copy avoidance defers value access to the IO thread, but the commandlog check needs reply size while argv is still available. By the time postWriteToClient fires (size known), argv is freed. Fix with two-phase deferred commandlog: 1. In addReplyBulk: never call sdslen() for BULK_STR_REF payloads. Set track_bytes=0, delegating byte counting to the IO thread which already computes this in trackBufReferences(). 2. In commandlogPushCurrentCommand: when copy avoidance is active (buf_encoded) and commandlog-reply-larger-than >= 0, stash argv refs via incrRefCount into a per-client inline array (CMDLOG_INLINE_ARGV_MAX=4, zero allocation for GET/ GETRANGE). Skip the large-reply check since exact size is unknown. 3. In trackBufReferences (IO thread): accumulate reply bytes into a new io_reply_len_cmdlog atomic counter alongside io_tracked_reply_len. 4. In postWriteToClient (main thread): once all replies are flushed, call commandlogCheckDeferredLargeReply() which checks io_reply_len_cmdlog against the threshold, logs with the stashed argv for full command attribution, then releases the refs. Cost per copy-avoidance command: 2 refcount increments + 2 pointer stores + 2 refcount decrements (all L1 cache hits since robj was just accessed during command parsing). Zero heap allocation for argc <= 4. Results (ARM Graviton 3, 128B GET, io-threads=7, P=10, 5 reps): Before fix (HEAD): 1,874,527 rps (IPC 1.56, 60% backend stalls) After fix: 2,194,660 rps (IPC 2.42, 41% backend stalls) +17.1% Pre-regression: 2,179,643 rps (IPC 3.93, 43% backend stalls) +0.7% Full commandlog functionality preserved: exact byte count, full command+args attribution, correct threshold gating, proper cleanup on client disconnect. Signed-off-by: Rain Valentine <rsg000@gmail.com>
1e21e02 to
e40767c
Compare
…st-wins When multiple copy-avoidance commands are pipelined, the previous approach overwrote the argv stash for each subsequent command, only logging the last one at write completion. This lost commandlog entries for commands 1..N-1. New approach for pipelining: - First command in a batch: stash argv (defer check to postWriteToClient) - Subsequent commands: detect pipelining (cmdlog_argc > 0), flush the previous stash immediately, then do a synchronous sdslen check for the current command. The sdslen is cheap here because pipelined values are cache-hot from sequential addReplyBulk calls. addReplyBulk now conditionally computes reply size when pipelining is detected (cmdlog_argc > 0 && threshold >= 0), populating net_output_bytes_curr_cmd for the immediate check. Result: all pipelined commands get individual commandlog entries while preserving the zero-sdslen fast path for single/first commands. Signed-off-by: Rain Valentine <rsg000@gmail.com>
|
Good stuff, I like the idea, please ping me when it's ready for review! |
Problem
PR #2652 introduced a 14–17% throughput regression on ARM (Graviton 3) and 8% on Intel (Sapphire Rapids) for GET workloads using copy avoidance with io-threads.
Root cause:
sdslen(obj->ptr)inaddReplyBulk()dereferences a random heap pointer on every reply fornet_output_bytes_curr_cmdtracking. With copy avoidance, this value hasn't been touched since the client stored it — guaranteed L2/L3 cache miss at ~80ns each, consuming 15% of main-thread cycles at 2M rps.perf diffconfirms:addReplyBulkwent from 1% → 14.8% (ARM) / 8.95% (Intel) of main-thread cycles.Fix
The IO thread already computes
sdslenwhen writing to the socket (trackBufReferences) — the cache miss is unavoidable there. This PR eliminates the redundant main-thread access with a two-phase deferred commandlog check:Remove
sdslen/digits10from main thread — settrack_bytes=0forBULK_STR_REFpayloads, delegating byte counting to the IO thread.Stash argv refs at command time — when copy avoidance is active and
commandlog-reply-larger-than >= 0,incrRefCountthe argv into a per-client inline array (CMDLOG_INLINE_ARGV_MAX=4, zero allocation for GET/GETRANGE). Skip the large-reply check since reply size is unknown.Accumulate bytes in IO thread — new
io_reply_len_cmdlogatomic counter alongside existingio_tracked_reply_len.Check threshold at write completion —
postWriteToClientcallscommandlogCheckDeferredLargeReply()once all replies are flushed, logging with the stashed argv for full command attribution, then releases the refs.Results
ARM Graviton 3, 128B GET, io-threads=7, P=10, 3M random keys, 5 reps:
29d3244)d0ffbabb0)1e21e024)Full recovery (+0.7% above pre-regression). CV improved from 0.73% → 0.21%.
Design
The fundamental timing problem: with copy avoidance, argv is available at command time but reply size isn't known until the IO thread writes. By write completion time, argv has been freed by
resetClient.Solution: bridge the gap with lightweight ref-counted stash. Cost per copy-avoidance command on main thread:
incrRefCount(plain integer increment, L1 hit — robj was just accessed during parsing)decrRefCountinpostWriteToClientWhen
commandlog-reply-larger-than = -1(disabled), the stash path is never entered — zero overhead.Trade-offs
Fixes the regression from #2652. Supersedes #3646 (commandlog can stay enabled by default).