Add per-identifier time-consumed rate limit to BlockingCommandQueue #2627
stitesExpensify wants to merge 4 commits into
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3d167581a8
        if (command->request.test("ClearBlocks")) {
            _blockingCommandQueue.clearRateLimits();
        }
    } else if (SIEquals(command->request.methodLine, "SetBlockingQueueTimeRateLimit")) {
Register the time rate-limit command
When an operator sends SetBlockingQueueTimeRateLimit, this new branch is never reached because _handleIfStatusOrControlCommand() only calls _control() for names accepted by _isControlCommand(), and that whitelist still includes SetBlockingQueueRateLimit but not SetBlockingQueueTimeRateLimit (checked BedrockServer.cpp:1935-1957). In that normal control-command path the threshold cannot be enabled or cleared, leaving the new metric disabled in production.
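A minimal sketch of the fix this comment asks for, assuming the whitelist is a simple name lookup. `isControlCommandSketch` is a hypothetical stand-in, not the real `_isControlCommand()`, which may be structured differently and may compare names case-insensitively (the snippet above uses `SIEquals`); this sketch compares exact strings.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the whitelist check described in the review comment.
// Only the two command names come from the PR; everything else is illustrative.
bool isControlCommandSketch(const std::string& methodLine) {
    static const std::unordered_set<std::string> names = {
        "SetBlockingQueueRateLimit",     // already accepted, per the review
        "SetBlockingQueueTimeRateLimit", // the entry the review reports as missing
    };
    // Exact-match lookup; a real implementation may be case-insensitive.
    return names.count(methodLine) > 0;
}
```

With the new name registered, a request whose method line is `SetBlockingQueueTimeRateLimit` would reach the control handler instead of falling through.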
            return;
        }
        lock_guard<decltype(_rateLimitMutex)> lock(_rateLimitMutex);
        _identifierTimes[identifier] += elapsedMicros;
Reset the empty timer when recording execution time
For a blocking command that runs longer than 30 seconds while the blocking queue has no backlog, _dequeue() marks the queue empty before execution starts, then this line records the elapsed time after completion while leaving the old _emptyTime in place. The next Status or push() immediately sees an empty queue older than 30 seconds and clears _identifierTimes, so exactly the long-running commands this metric is meant to expose can disappear before they are reported or used for time-based limiting.
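One possible shape of the suggested fix, as a simplified single-threaded sketch. Apart from `recordExecutionTime`, `_emptyTime`, and `_identifierTimes`, every name here (`EmptyTimerSketch`, `markEmpty`, `resetIfIdle`, `timeFor`, `kResetAfterMicros`) is an assumption made for illustration, and timestamps are passed in explicitly so the behavior is easy to follow.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Simplified model of the empty-queue auto-reset; not the real queue class.
class EmptyTimerSketch {
  public:
    // 30-second idle window, in microseconds.
    static constexpr uint64_t kResetAfterMicros = 30ULL * 1000 * 1000;

    // Called when the last queued command is dequeued (before it runs).
    void markEmpty(uint64_t nowMicros) {
        _isEmpty = true;
        _emptyTime = nowMicros;
    }

    // Called after a command finishes. Restarting the idle window here keeps
    // a long-running command's time from being cleared immediately.
    void recordExecutionTime(const std::string& identifier, uint64_t elapsedMicros, uint64_t nowMicros) {
        _identifierTimes[identifier] += elapsedMicros;
        if (_isEmpty) {
            _emptyTime = nowMicros;
        }
    }

    // Called on Status or push() in this sketch: clear once idle long enough.
    void resetIfIdle(uint64_t nowMicros) {
        if (_isEmpty && nowMicros - _emptyTime >= kResetAfterMicros) {
            _identifierTimes.clear();
        }
    }

    // Accumulated microseconds for an identifier, or 0 if none recorded.
    uint64_t timeFor(const std::string& identifier) const {
        auto it = _identifierTimes.find(identifier);
        return it == _identifierTimes.end() ? 0 : it->second;
    }

  private:
    bool _isEmpty = false;
    uint64_t _emptyTime = 0;
    std::map<std::string, uint64_t> _identifierTimes;
};
```

In this model a 40-second command recorded against an empty queue survives the next `resetIfIdle()` call and is only cleared after a further 30 idle seconds.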
Details
Adds a parallel per-identifier time-consumed rate limit alongside the existing count-based limit on the blocking command queue. The count metric runs unchanged; this PR adds a sibling metric tracking accumulated worker-0 (`blockingCommit` thread) execution time per identifier.

Motivation: production data over the last two weeks shows the count metric misfires in both directions, suggested by flodnv's comment on the original issue:
- 18629004 on 2026-06-03: by count, this looks like a moderate spike; by time consumed (~230s of worker-0 monopolization), it was by far the worst event in our window. Count under-states the threat by ~5x here.

A time-consumed metric cleanly separates "many fast commands" (not a threat) from "few slow commands" (real sync-thread monopolization). This PR ships the metric in log-only mode alongside the count metric; neither enforces yet. Once we have time data in production, we can choose which to enforce (or both).
Changes:
- `_identifierTimes` map (microseconds) and `_maxTimePerIdentifier` threshold in `BedrockBlockingCommandQueue`, gated independently of the count metric; either or both can be set
- `runCommand` is timed with `STimeNow()` deltas and feeds the accumulator via `recordExecutionTime()`, only when `threadId == 0` and the identifier is non-empty
- `SetBlockingQueueTimeRateLimit` control command accepting `MaxTimePerIdentifierMs` (mirrors `SetBlockingQueueRateLimit` structure, including `ClearBlocks` semantics)
- Status fields: `blockingTimeRateLimitThresholdMs`, `blockedTimeIdentifiers`, `blockingQueueIdentifierTimesMs`
- Threshold value `0` (disabled); 30-second empty-queue auto-reset clears both `_identifierCounts` and `_identifierTimes` together
- Log line: `"Blocking queue rate limit (time): rejecting '<methodLine>' for identifier '<id>' (timeMs=X, thresholdMs=Y)"`

Split across 4 commits for review:
Fixed Issues
Follow-up to #2555 — related to https://github.com/Expensify/Expensify/issues/568969
Tests
- (`make` in Vagrant)
- `libbedrock.a` (per internal reminder; verified via `./make.sh` in Vagrant, no errors)
- `testTimeRateLimiting` cluster test added in `BlockingQueueRateLimitTest.cpp` (class stays disabled to match log-only enforcement pattern, matching existing convention)
- Run `SetBlockingQueueTimeRateLimit MaxTimePerIdentifierMs=1` against a local cluster, send conflict-inducing commands from one identifier, verify SINFO `Blocking queue rate limit (time): rejecting ...` appears in logs
- Run `Status`, verify `blockingTimeRateLimitThresholdMs`, `blockedTimeIdentifiers`, and `blockingQueueIdentifierTimesMs` appear with correct values
- `blockingQueueIdentifierTimesMs` clears on next push (same auto-reset trigger as counts)
- `blockingRateLimitThreshold`, `blockedIdentifiers`, `blockingQueueIdentifierCounts` Status fields unaffected
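
As a rough illustration of what the manual test with `MaxTimePerIdentifierMs=1` exercises, here is a self-contained sketch of the accumulator and threshold check described under Changes. It is not the PR's code: the class name `TimeRateLimitSketch`, the setter `setMaxTimePerIdentifierMs`, and the query `exceedsTimeLimit` are hypothetical, and the `>=` comparison and the placement of the `threadId` gate inside `recordExecutionTime` are assumptions.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

// Hypothetical model of the per-identifier time accumulator.
class TimeRateLimitSketch {
  public:
    // Threshold arrives in milliseconds; times are stored in microseconds.
    // A value of 0 leaves the time limit disabled.
    void setMaxTimePerIdentifierMs(uint64_t ms) {
        std::lock_guard<std::mutex> lock(_rateLimitMutex);
        _maxTimePerIdentifier = ms * 1000;
    }

    // Accumulate only for worker 0 and non-empty identifiers.
    void recordExecutionTime(int threadId, const std::string& identifier, uint64_t elapsedMicros) {
        if (threadId != 0 || identifier.empty()) {
            return;
        }
        std::lock_guard<std::mutex> lock(_rateLimitMutex);
        _identifierTimes[identifier] += elapsedMicros;
    }

    // True when the limit is enabled and the identifier has reached it.
    bool exceedsTimeLimit(const std::string& identifier) const {
        std::lock_guard<std::mutex> lock(_rateLimitMutex);
        if (_maxTimePerIdentifier == 0) {
            return false;
        }
        auto it = _identifierTimes.find(identifier);
        return it != _identifierTimes.end() && it->second >= _maxTimePerIdentifier;
    }

  private:
    mutable std::mutex _rateLimitMutex;
    uint64_t _maxTimePerIdentifier = 0;
    std::map<std::string, uint64_t> _identifierTimes;
};
```

In this sketch, an identifier that has accumulated 1500 microseconds on worker 0 is not flagged while the threshold is `0`, and is flagged once the threshold is set to 1 ms; time reported from other threads or with an empty identifier is never accumulated.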