fix(cloudflare): protect FLAG_LOG from async request interleaving #408

Open
vahidlazio wants to merge 2 commits into main from fix/cloudflare-flag-log-interleaving

@vahidlazio
Contributor

Summary

Fixes telemetry data loss and mixing under high request concurrency in the Cloudflare resolver.

CF Workers are single-threaded but handle concurrent async requests within the same isolate, interleaving at await points. The thread_local FLAG_LOG was vulnerable:

  1. Request A sets FLAG_LOG, calls req.bytes().await, and yields
  2. Request B starts, overwrites FLAG_LOG with a fresh default
  3. Request A resumes, resolves flags — writes to B's FLAG_LOG
  4. Request A hits scheduler.wait(0).await and yields
  5. Request B takes FLAG_LOG — gets mixed A+B data
  6. Request A resumes — FLAG_LOG is now None, telemetry lost
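
The six steps above can be replayed deterministically without an async runtime. The sketch below is an illustration only: it assumes FLAG_LOG holds an `Option<Vec<String>>` (the real log type is not shown in this PR), and the helper names `begin_request`, `take_log`, and `simulate_interleaving` are hypothetical. Each call stands for the synchronous code a request runs between two await points.

```rust
use std::cell::RefCell;

thread_local! {
    // Assumed shape: one shared slot per isolate thread (the real type differs).
    static FLAG_LOG: RefCell<Option<Vec<String>>> = RefCell::new(None);
}

/// A request installs a fresh, empty log (steps 1 and 2).
fn begin_request() {
    FLAG_LOG.with(|log| *log.borrow_mut() = Some(Vec::new()));
}

/// Flag resolution appends to whatever log is currently installed (step 3).
fn resolve_flags(request: &str) {
    FLAG_LOG.with(|log| {
        let mut slot = log.borrow_mut();
        if let Some(entries) = slot.as_mut() {
            entries.push(format!("resolved by {request}"));
        }
    });
}

/// A request takes the log to build its telemetry (steps 5 and 6).
fn take_log() -> Option<Vec<String>> {
    FLAG_LOG.with(|log| log.borrow_mut().take())
}

/// Replays the interleaving; returns (what A sees, what B sees).
fn simulate_interleaving() -> (Option<Vec<String>>, Option<Vec<String>>) {
    begin_request(); // 1. A sets FLAG_LOG, then yields at req.bytes().await
    begin_request(); // 2. B overwrites FLAG_LOG with a fresh default
    resolve_flags("A"); // 3. A resumes and writes into B's log
    // 4. A yields at scheduler.wait(0).await; B resolves its own flags
    resolve_flags("B");
    let seen_by_b = take_log(); // 5. B takes mixed A+B data
    let seen_by_a = take_log(); // 6. A resumes and finds None
    (seen_by_a, seen_by_b)
}
```

Calling `simulate_interleaving()` yields `None` for request A and a two-entry log containing both requests' data for request B, which matches the loss and mixing described in the list.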

Fix

  • Move FLAG_LOG initialization to after req.bytes().await and right before the synchronous resolve_flags call
  • Take FLAG_LOG into a local variable immediately after resolve_flags returns
  • The set→resolve→take window is fully synchronous (no await points), so no interleaving is possible
  • scheduler.wait(0) and telemetry building operate on the local variable
  • Data is put back into FLAG_LOG at the end with no await before handler return

Same treatment for the flags:apply handler.
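
A minimal sketch of the set→resolve→take window, under the same assumption that FLAG_LOG holds an `Option<Vec<String>>`. The names `resolve_and_capture` and `put_back` are hypothetical and do not come from the diff. In a real handler, `resolve_and_capture` would be called after `req.bytes().await` and before `scheduler.wait(0).await`, so nothing in it can be interleaved with another request.

```rust
use std::cell::RefCell;

thread_local! {
    // Assumed shape of the shared slot; the real log type is not shown in the PR.
    static FLAG_LOG: RefCell<Option<Vec<String>>> = RefCell::new(None);
}

/// Stand-in for the synchronous resolver, which logs into FLAG_LOG.
fn resolve_flags(request: &str) {
    FLAG_LOG.with(|log| {
        let mut slot = log.borrow_mut();
        if let Some(entries) = slot.as_mut() {
            entries.push(format!("resolved by {request}"));
        }
    });
}

/// Set, resolve, take: fully synchronous, so the returned log belongs
/// to exactly one request.
fn resolve_and_capture(request: &str) -> Vec<String> {
    // Set: install a fresh log right before resolving.
    FLAG_LOG.with(|log| *log.borrow_mut() = Some(Vec::new()));
    // Resolve: the resolver writes into the log installed above.
    resolve_flags(request);
    // Take: move the log into a local before any await can run.
    FLAG_LOG.with(|log| log.borrow_mut().take()).unwrap_or_default()
}

/// Put the data back at the very end, with no await before the handler returns.
fn put_back(entries: Vec<String>) {
    FLAG_LOG.with(|log| *log.borrow_mut() = Some(entries));
}
```

Because each request owns its log as a local value once `resolve_and_capture` returns, later await points operate on data that no other request can overwrite or take.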

Context

The old code (pre-#400) used LazyLock<ResolveLogger> with atomics — safe for concurrent async access and batched into periodic checkpoints. PR #400 replaced this with a per-request thread_local RefCell pattern that is not safe across await points.
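
For comparison, the pre-#400 pattern can be sketched with a `LazyLock` static holding atomic counters. This is a simplified illustration, not the actual `ResolveLogger`: the struct `ResolveCounters`, its fields, and the functions below are assumptions made for the example. It requires Rust 1.80 or later for `std::sync::LazyLock`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::LazyLock;

/// Hypothetical, much-reduced stand-in for ResolveLogger: two counters.
struct ResolveCounters {
    resolves: AtomicU64,
    errors: AtomicU64,
}

// One shared accumulator for all requests; there is no per-request slot
// that a concurrent request could overwrite.
static RESOLVE_COUNTERS: LazyLock<ResolveCounters> = LazyLock::new(|| ResolveCounters {
    resolves: AtomicU64::new(0),
    errors: AtomicU64::new(0),
});

/// Called from any request; each increment is a single atomic operation.
fn record_resolve(ok: bool) {
    RESOLVE_COUNTERS.resolves.fetch_add(1, Ordering::Relaxed);
    if !ok {
        RESOLVE_COUNTERS.errors.fetch_add(1, Ordering::Relaxed);
    }
}

/// Periodic checkpoint: read and reset each counter with an atomic swap.
/// Returns (resolves, errors) accumulated since the previous checkpoint.
fn checkpoint() -> (u64, u64) {
    (
        RESOLVE_COUNTERS.resolves.swap(0, Ordering::Relaxed),
        RESOLVE_COUNTERS.errors.swap(0, Ordering::Relaxed),
    )
}
```

Since requests only ever add to the shared counters, the order in which they interleave at await points does not affect the totals reported at the next checkpoint.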

Test plan

  • Deploy to a test Worker and verify telemetry still flows to /metrics and the Confidence backend
  • Load test with concurrent requests and verify no telemetry data loss
  • Verify resolve responses are unaffected (this only changes the telemetry path)

🤖 Generated with Claude Code

vahidlazio and others added 2 commits May 19, 2026 12:09
CF Workers handle concurrent async requests on a single thread,
interleaving at await points. The thread_local FLAG_LOG was set before
`req.bytes().await` and read after `scheduler.wait(0).await`, so a
concurrent request could overwrite another's telemetry data.

Fix: set FLAG_LOG immediately before the synchronous resolve_flags call
and take it into a local variable immediately after. The entire
set→resolve→take window has no await points, so no interleaving. The
scheduler.wait and telemetry code then operate on the local. Data is put
back into FLAG_LOG only at the end with no await before handler return.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Restores the old atomic ResolveLogger/AssignLogger statics that safely
accumulate data across concurrent async requests, replacing the
thread_local RefCell<FLAG_LOG> that was not safe across await points.

Key changes:
- Restore static RESOLVE_LOGGER (ArcSwap) and ASSIGN_LOGGER (SegQueue)
- Add static TELEMETRY_LOG (Mutex) to accumulate per-request latency and
  resolve-rate deltas across requests
- Restore checkpoint() that atomically drains all three accumulators
  into a single WriteFlagLogsRequest per queue message
- Keep telemetry collection: timer, scheduler.wait(0), latency histogram
- Keep /metrics endpoint and KV-backed Prometheus exposition
- Remove thread_local FLAG_LOG and RefCell

This is both more correct (no data races between concurrent async
requests) and more efficient (one batched queue message per checkpoint
vs one per request).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
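
The batching idea in the second commit can be sketched as follows. This is an approximation under stated assumptions: the commit uses ArcSwap and SegQueue for two of its accumulators, while this example uses only `std::sync::Mutex` so that it has no external dependencies. `ASSIGN_LOG`, `Telemetry`, `Batch`, and `record_request` are hypothetical names, and `Batch` merely stands in for `WriteFlagLogsRequest`.

```rust
use std::sync::Mutex;

/// Hypothetical per-isolate telemetry accumulator.
#[derive(Debug, Default, PartialEq)]
struct Telemetry {
    latencies_us: Vec<u64>,
    resolve_count: u64,
}

/// Hypothetical stand-in for WriteFlagLogsRequest: one message per checkpoint.
#[derive(Debug, Default, PartialEq)]
struct Batch {
    assigned: Vec<String>,
    telemetry: Telemetry,
}

// Shared accumulators that every request appends to.
static ASSIGN_LOG: Mutex<Vec<String>> = Mutex::new(Vec::new());
static TELEMETRY_LOG: Mutex<Telemetry> = Mutex::new(Telemetry {
    latencies_us: Vec::new(),
    resolve_count: 0,
});

/// Called once per request; only appends, never replaces shared state.
fn record_request(flag: &str, latency_us: u64) {
    ASSIGN_LOG.lock().unwrap().push(flag.to_string());
    let mut telemetry = TELEMETRY_LOG.lock().unwrap();
    telemetry.latencies_us.push(latency_us);
    telemetry.resolve_count += 1;
}

/// Drains every accumulator into a single batch and leaves them empty.
fn checkpoint() -> Batch {
    Batch {
        assigned: std::mem::take(&mut *ASSIGN_LOG.lock().unwrap()),
        telemetry: std::mem::take(&mut *TELEMETRY_LOG.lock().unwrap()),
    }
}
```

With this shape, any number of requests can record data between checkpoints, and each checkpoint produces one batched message instead of one message per request.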