Skip to content

Add dormant os_sync_wait futex scaffolding#19

Merged
jserv merged 1 commit intomainfrom
address-wait-futex
May 9, 2026
Merged

Add dormant os_sync_wait futex scaffolding#19
jserv merged 1 commit intomainfrom
address-wait-futex

Conversation

@jserv
Copy link
Copy Markdown
Contributor

@jserv jserv commented May 9, 2026

Land helpers for the macOS 14.4+ address-wait futex fast path but keep plain FUTEX_WAIT on the bucket+condvar path. Darwin's address-wait API maps the Linux EAGAIN pre-block race onto a successful wait, and that semantic gap is observable to user space, so os_sync_wait_enabled stays false until the gap can be bridged.

futex_os_sync_wait routes plain FUTEX_WAIT through os_sync_wait_on_address_with_timeout when activated, with an explicit __atomic_load_n pre-check, a 100 ms internal poll cap to observe proc_exit_group_requested / futex_interrupt_pending and the 1-second EINTR simulation, and an 8-strike EFAULT bail so a transient kernel copyin failure cannot burn host CPU. futex_remaining_ns operates on (sec, nsec) pairs to avoid overflowing tv_sec * 1e9 against the INT64_MAX/4 cap that linux_timespec_is_valid accepts.

futex_os_sync_wake_n calls os_sync_wake_by_address_any in a loop with a uint64_t budget clamped to UINT32_MAX, breaking on ENOENT so the syscall return is exact. _all was rejected because it returns 0, not a count, and would overshoot the wake budget.

The wake-side call sites (FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_REQUEUE, FUTEX_CMP_REQUEUE, FUTEX_WAKE_OP, futex_wake_one used by CLONE_CHILD_CLEARTID and robust-list cleanup) walk the bucket as before and then call futex_os_sync_wake_n. While dormant the helper returns 0 immediately, leaving the bucket-only wake count unchanged. FUTEX_WAKE_OP now tracks per-address counters (woken1, woken2) so the eventual kernel-side drain can be billed to the correct address; the syscall return woken1 + woken2 matches the pre-refactor total.

FUTEX_REQUEUE cannot migrate kernel-side waiters between addresses. The remaining wake + requeue budget after the bucket walk is widened to uint64_t before the add (so val + val2 never overflows) and used as an os_sync wake budget at uaddr only; once activated, kernel-side waiters wake spuriously and userspace re-acquires whatever it really wanted.

futex_wake() now takes (const guest_t *g, ...) so it can translate uaddr to a host pointer for os_sync. futex_wake_one and the two robust_list_walk call sites pass g through; the public API in src/runtime/futex.h is unchanged.

SDK gate: __has_include(<os/os_sync_wait_on_address.h>) sets ELFUSE_HAVE_OS_SYNC_WAIT_ON_ADDRESS so older SDKs build clean. Runtime gate: __builtin_available(macOS 14.4, *) cached in futex_init().


Summary by cubic

Add dormant scaffolding for the macOS 14.4+ address-wait futex fast path (os_sync_wait_on_address_with_timeout / os_sync_wake_by_address_any). Plain FUTEX_WAIT stays on the bucket path to preserve Linux -EAGAIN semantics; no behavior change yet.

  • New Features
    • Added futex_os_sync_wait for plain FUTEX_WAIT: pre-check maps -EAGAIN, 100 ms poll cap, 1 s EINTR simulation, 8-strike EFAULT bail.
    • Added futex_remaining_ns to compute (sec,nsec) deltas without int64 overflow.
    • Added futex_os_sync_wake_n to drain kernel waiters with exact counts; clamps to UINT32_MAX; called after bucket walks in WAKE, WAKE_BITSET, REQUEUE, WAKE_OP, futex_wake_one, and robust-list cleanup; dormant returns 0.
    • REQUEUE: widened remaining wake+requeue budget to uint64_t and used to wake at uaddr (kernel cannot requeue).
    • futex_wake now takes (const guest_t *g, ...); call sites updated; public header unchanged.
    • Gated via __has_include(<os/os_sync_wait_on_address.h>) and cached __builtin_available(macOS 14.4, *); os_sync_wait_enabled remains false.

Written for commit f1098a4. Summary will update on new commits.

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/runtime/futex.c">

<violation number="1" location="src/runtime/futex.c:292">
P2: When `delta_sec == 1` and `delta_nsec < 0`, the true remaining time is `(1e9 + delta_nsec)` ns which can be far below `cap_ns`. Returning `cap_ns` here causes up to 100 ms of extra latency past the guest's requested deadline before `ETIMEDOUT` fires.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread src/runtime/futex.c
Comment on lines +292 to +293
if (delta_sec >= 1)
return cap_ns;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: When delta_sec == 1 and delta_nsec < 0, the true remaining time is (1e9 + delta_nsec) ns which can be far below cap_ns. Returning cap_ns here causes up to 100 ms of extra latency past the guest's requested deadline before ETIMEDOUT fires.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/runtime/futex.c, line 292:

<comment>When `delta_sec == 1` and `delta_nsec < 0`, the true remaining time is `(1e9 + delta_nsec)` ns which can be far below `cap_ns`. Returning `cap_ns` here causes up to 100 ms of extra latency past the guest's requested deadline before `ETIMEDOUT` fires.</comment>

<file context>
@@ -223,6 +266,191 @@ static int futex_make_deadline(guest_t *g,
+    long delta_nsec = deadline->tv_nsec - now.tv_nsec;
+    if (delta_sec < 0 || (delta_sec == 0 && delta_nsec <= 0))
+        return 0;
+    if (delta_sec >= 1)
+        return cap_ns;
+    /* delta_sec == 0 and delta_nsec > 0; tv_nsec is normalized to
</file context>
Suggested change
if (delta_sec >= 1)
return cap_ns;
if (delta_sec >= 2)
return cap_ns;
if (delta_sec == 1) {
uint64_t rem = (uint64_t)(1000000000L + delta_nsec);
return rem < cap_ns ? rem : cap_ns;
}

Land helpers for the macOS 14.4+ address-wait futex fast path but keep
plain FUTEX_WAIT on the bucket+condvar path. Darwin's address-wait API
maps the Linux EAGAIN pre-block race onto a successful wait, and that
semantic gap is observable to user space, so os_sync_wait_enabled stays
false until the gap can be bridged.

futex_os_sync_wait routes plain FUTEX_WAIT through
os_sync_wait_on_address_with_timeout when activated, with an explicit
__atomic_load_n pre-check, a 100 ms internal poll cap to observe
proc_exit_group_requested / futex_interrupt_pending and the 1-second
EINTR simulation, and an 8-strike EFAULT bail so a transient kernel
copyin failure cannot burn host CPU. futex_remaining_ns operates on
(sec, nsec) pairs to avoid overflowing tv_sec * 1e9 against the
INT64_MAX/4 cap that linux_timespec_is_valid accepts.

futex_os_sync_wake_n calls os_sync_wake_by_address_any in a loop with
a uint64_t budget clamped to UINT32_MAX, breaking on ENOENT so the
syscall return is exact. _all was rejected because it returns 0, not
a count, and would overshoot the wake budget.

The wake-side call sites (FUTEX_WAKE, FUTEX_WAKE_BITSET, FUTEX_REQUEUE,
FUTEX_CMP_REQUEUE, FUTEX_WAKE_OP, futex_wake_one used by
CLONE_CHILD_CLEARTID and robust-list cleanup) walk the bucket as before
and then call futex_os_sync_wake_n. While dormant the helper returns 0
immediately, leaving the bucket-only wake count unchanged. FUTEX_WAKE_OP
now tracks per-address counters (woken1, woken2) so the eventual
kernel-side drain can be billed to the correct address; the syscall
return woken1 + woken2 matches the pre-refactor total.

FUTEX_REQUEUE cannot migrate kernel-side waiters between addresses.
The remaining wake + requeue budget after the bucket walk is widened
to uint64_t before the add (so val + val2 never overflows) and used as
an os_sync wake budget at uaddr only; once activated, kernel-side
waiters wake spuriously and userspace re-acquires whatever it really
wanted.

futex_wake() now takes (const guest_t *g, ...) so it can translate
uaddr to a host pointer for os_sync. futex_wake_one and the two
robust_list_walk call sites pass g through; the public API in
src/runtime/futex.h is unchanged.

SDK gate: __has_include(<os/os_sync_wait_on_address.h>) sets
ELFUSE_HAVE_OS_SYNC_WAIT_ON_ADDRESS so older SDKs build clean. Runtime
gate: __builtin_available(macOS 14.4, *) cached in futex_init().
@jserv jserv force-pushed the address-wait-futex branch from 4bdf0b9 to f1098a4 Compare May 9, 2026 04:17
@jserv jserv merged commit 1e00808 into main May 9, 2026
4 checks passed
@jserv jserv deleted the address-wait-futex branch May 9, 2026 04:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant