Skip to content

Add per-identifier time-consumed rate limit to BlockingCommandQueue#2627

Open
stitesExpensify wants to merge 4 commits into
mainfrom
stites-blockingQueueTimeRateLimit
Open

Add per-identifier time-consumed rate limit to BlockingCommandQueue#2627
stitesExpensify wants to merge 4 commits into
mainfrom
stites-blockingQueueTimeRateLimit

Conversation

@stitesExpensify

@stitesExpensify stitesExpensify commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Details

Adds a parallel per-identifier time consumed rate limit alongside the existing count-based limit on the blocking command queue. The count metric runs unchanged — this PR adds a sibling metric tracking accumulated worker-0 (blockingCommit thread) execution time per identifier.

Motivation: production data over the last two weeks shows the count metric misfires in both directions, suggested by flodnv's comment on the original issue:

  • Parallelism flashes: peak queue depth of 19-30 in 0.1-0.7s with ~10-22ms per command — count fires, total worker time is trivial (sub-second).
  • Sustained heavy commands: peak 40 over 231s with ~2.7s per command on accountID 18629004 on 2026-06-03 — by count, this looks like a moderate spike; by time consumed (~230s of worker-0 monopolization), it was by far the worst event in our window. Count under-states the threat by ~5x here.

A time-consumed metric cleanly separates "many fast commands" (not a threat) from "few slow commands" (real sync-thread monopolization). This PR ships the metric in log-only mode alongside the count metric — neither enforces yet. Once we have time data in production, we can choose which to enforce (or both).

Changes:

  • New _identifierTimes map (microseconds) and _maxTimePerIdentifier threshold in BedrockBlockingCommandQueue, gated independently of the count metric — either or both can be set
  • Worker 0 wraps runCommand with STimeNow() deltas and feeds the accumulator via recordExecutionTime() — only when threadId == 0 and the identifier is non-empty
  • New SetBlockingQueueTimeRateLimit control command accepting MaxTimePerIdentifierMs (mirrors SetBlockingQueueRateLimit structure, including ClearBlocks semantics)
  • New Status fields: blockingTimeRateLimitThresholdMs, blockedTimeIdentifiers, blockingQueueIdentifierTimesMs
  • Default threshold is 0 (disabled); 30-second empty-queue auto-reset clears both _identifierCounts and _identifierTimes together
  • New SINFO format: "Blocking queue rate limit (time): rejecting '<methodLine>' for identifier '<id>' (timeMs=X, thresholdMs=Y)"
  • Enforcement (STHROW) is commented out, mirroring the existing count log-only state. Time rollback is asymmetric to count (the rejected command's execution time is unknown at push) — documented inline with a TODO

Split across 4 commits for review:

  1. Infrastructure in the queue class (header + cpp)
  2. Worker-loop hook to feed the accumulator
  3. New control command
  4. Cluster test

Fixed Issues

Follow-up to #2555 — related to https://github.com/Expensify/Expensify/issues/568969

Tests

  • Bedrock compiled cleanly against the new code (verified via make in Vagrant)
  • Auth compiled cleanly against the new libbedrock.a (per internal reminder; verified via ./make.sh in Vagrant — no errors)
  • New testTimeRateLimiting cluster test added in BlockingQueueRateLimitTest.cpp (class stays disabled to match log-only enforcement pattern, matching existing convention)
  • Reviewer: send SetBlockingQueueTimeRateLimit MaxTimePerIdentifierMs=1 against a local cluster, send conflict-inducing commands from one identifier, verify SINFO Blocking queue rate limit (time): rejecting ... appears in logs
  • Reviewer: hit Status, verify blockingTimeRateLimitThresholdMs, blockedTimeIdentifiers, and blockingQueueIdentifierTimesMs appear with correct values
  • Reviewer: leave queue empty for >30s, verify blockingQueueIdentifierTimesMs clears on next push (same auto-reset trigger as counts)
  • Reviewer: confirm count metric still works unchanged — existing blockingRateLimitThreshold, blockedIdentifiers, blockingQueueIdentifierCounts Status fields unaffected

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3d167581a8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread BedrockServer.cpp
if (command->request.test("ClearBlocks")) {
_blockingCommandQueue.clearRateLimits();
}
} else if (SIEquals(command->request.methodLine, "SetBlockingQueueTimeRateLimit")) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Register the time rate-limit command

When an operator sends SetBlockingQueueTimeRateLimit, this new branch is never reached because _handleIfStatusOrControlCommand() only calls _control() for names accepted by _isControlCommand(), and that whitelist still includes SetBlockingQueueRateLimit but not SetBlockingQueueTimeRateLimit (checked BedrockServer.cpp:1935-1957). In that normal control-command path the threshold cannot be enabled or cleared, leaving the new metric disabled in production.

Useful? React with 👍 / 👎.

return;
}
lock_guard<decltype(_rateLimitMutex)> lock(_rateLimitMutex);
_identifierTimes[identifier] += elapsedMicros;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reset the empty timer when recording execution time

For a blocking command that runs longer than 30 seconds while the blocking queue has no backlog, _dequeue() marks the queue empty before execution starts, then this line records the elapsed time after completion while leaving the old _emptyTime in place. The next Status or push() immediately sees an empty queue older than 30 seconds and clears _identifierTimes, so exactly the long-running commands this metric is meant to expose can disappear before they are reported or used for time-based limiting.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant