fix(cloudflare): protect FLAG_LOG from async request interleaving#408
Open
vahidlazio wants to merge 2 commits into
Open
fix(cloudflare): protect FLAG_LOG from async request interleaving#408vahidlazio wants to merge 2 commits into
vahidlazio wants to merge 2 commits into
Conversation
CF Workers handle concurrent async requests on a single thread, interleaving at await points. The thread_local FLAG_LOG was set before `req.bytes().await` and read after `scheduler.wait(0).await`, so a concurrent request could overwrite another's telemetry data. Fix: set FLAG_LOG immediately before the synchronous resolve_flags call and take it into a local variable immediately after. The entire set→resolve→take window has no await points, so no interleaving. The scheduler.wait and telemetry code then operate on the local. Data is put back into FLAG_LOG only at the end with no await before handler return. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restores the old atomic ResolveLogger/AssignLogger statics that safely accumulate data across concurrent async requests, replacing the thread_local RefCell<FLAG_LOG> that was not safe across await points. Key changes: - Restore static RESOLVE_LOGGER (ArcSwap) and ASSIGN_LOGGER (SegQueue) - Add static TELEMETRY_LOG (Mutex) to accumulate per-request latency and resolve-rate deltas across requests - Restore checkpoint() that atomically drains all three accumulators into a single WriteFlagLogsRequest per queue message - Keep telemetry collection: timer, scheduler.wait(0), latency histogram - Keep /metrics endpoint and KV-backed Prometheus exposition - Remove thread_local FLAG_LOG and RefCell This is both more correct (no data races between concurrent async requests) and more efficient (one batched queue message per checkpoint vs one per request). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes telemetry data loss and mixing under high request concurrency in the Cloudflare resolver.
CF Workers are single-threaded but handle concurrent async requests within the same isolate, interleaving at
awaitpoints. Thethread_localFLAG_LOGwas vulnerable:FLAG_LOG, callsreq.bytes().await→ yieldsFLAG_LOGwith a fresh defaultFLAG_LOGscheduler.wait(0).await→ yieldsFLAG_LOG— gets mixed A+B dataFLAG_LOGis nowNone, telemetry lostFix
FLAG_LOGinitialization to afterreq.bytes().awaitand right before the synchronousresolve_flagscallFLAG_LOGinto a local variable immediately afterresolve_flagsreturnsawaitpoints), so no interleaving is possiblescheduler.wait(0)and telemetry building operate on the local variableFLAG_LOGat the end with noawaitbefore handler returnSame treatment for the
flags:applyhandler.Context
The old code (pre-#400) used
LazyLock<ResolveLogger>with atomics — safe for concurrent async access and batched into periodic checkpoints. PR #400 replaced this with a per-requestthread_local RefCellpattern that is not safe acrossawaitpoints.Test plan
/metricsand the Confidence backend🤖 Generated with Claude Code