Harden clone3, vfork, exec, and mmap edges #18

Merged
jserv merged 1 commit into main from box64 on May 8, 2026

@jserv jserv commented May 8, 2026

Several independent fixes surfaced while exercising real multi-threaded fork/exec patterns. Each one was a real-world regression rather than defensive cleanup.

clone3 stack lifetime
Worker threads spawned via clone3(stack, stack_size) now record their
guest stack range. sys_munmap() walks active threads under thread_lock
in two passes (validate, then commit), defers the portion of the
request that overlaps any live stack into a per-thread queue, and
unmaps the remaining gaps immediately. The
collect/commit/finish/rollback transaction marks each affected thread
busy; a concurrent collect on the same thread blocks in cond_wait until
the in-flight transaction releases it. On thread exit,
mem_cleanup_deferred_stack_unmaps() waits for busy=0, clears the live
stack range, snapshots the queue, and drains the entries one by one,
dropping those that unmap successfully. Failed unmaps stay in the queue
and are logged at error level rather than leaking silently. The drain
runs before the
CLONE_CHILD_CLEARTID futex wake so a joiner cannot reuse the freed VA
before the host page tables release it. Legacy clone() recovers the
range from the containing region via guest_region_find().
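
The split between the deferred portion and the immediately unmapped gaps can be sketched as plain interval arithmetic. This is only an illustration: `range_t`, `split_t`, and `split_against_stack()` are hypothetical names, not the project's API, and the real code also handles locking, multiple threads, and rollback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open guest address range [start, end). Hypothetical type. */
typedef struct {
    uint64_t start;
    uint64_t end;
} range_t;

/* Result of splitting one munmap request against one live stack. */
typedef struct {
    bool has_deferred; /* true if the request overlaps the stack */
    range_t deferred;  /* overlapping part, to be queued on the thread */
    int ngaps;         /* 0..2 pieces outside the stack */
    range_t gaps[2];   /* pieces that can be unmapped right away */
} split_t;

static split_t split_against_stack(range_t req, range_t stack)
{
    assert(req.start <= req.end && stack.start <= stack.end);

    split_t out = {0};
    uint64_t lo = req.start > stack.start ? req.start : stack.start;
    uint64_t hi = req.end < stack.end ? req.end : stack.end;

    if (lo >= hi) {
        /* No overlap: the whole (non-empty) request is one gap. */
        if (req.start < req.end)
            out.gaps[out.ngaps++] = req;
        return out;
    }

    out.has_deferred = true;
    out.deferred = (range_t){lo, hi};
    if (req.start < lo)
        out.gaps[out.ngaps++] = (range_t){req.start, lo};
    if (hi < req.end)
        out.gaps[out.ngaps++] = (range_t){hi, req.end};
    return out;
}
```

For a request that fully covers a stack, this yields one deferred range and two gaps, one on each side of the stack.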

CLONE_VFORK with CLONE_VM
CLONE_VM|CLONE_VFORK previously took the in-process VM-clone path,
which would have reset the parent's guest_t when the child called
execve. It now falls through to the posix_spawn helper-process path,
which spawns a child elfuse process and suspends the parent on a notify
pipe (--vfork-notify-fd) until the child execve()s or exits. This
matches Linux vfork semantics rather than blocking until the host child
exits.
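
The parent-side wait on a notify pipe might look roughly like the sketch below. The wire protocol behind --vfork-notify-fd is not described here, so this assumes the child signals either by writing one byte or by closing its write end; `wait_vfork_notify()` is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Block until the notify pipe delivers a byte or reaches EOF.
 * Returns 1 if a byte arrived, 0 on EOF (all write ends closed,
 * e.g. the child exited), and -1 on a read error.
 * Assumed protocol; the real helper may differ. */
static int wait_vfork_notify(int read_fd)
{
    assert(read_fd >= 0);

    char byte;
    for (;;) {
        ssize_t n = read(read_fd, &byte, 1);
        if (n > 0)
            return 1;
        if (n == 0)
            return 0;
        if (errno != EINTR)
            return -1; /* real error; EINTR just retries */
    }
}
```

Either outcome lets the parent resume without waiting for the host child process to terminate.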

execve sysroot resolution
The open and ELF-load steps go through path_resolve_sysroot_path() into
a separate path_host buffer; proc_set_elf_path() still publishes the
guest-visible path so /proc/self/exe stays stable across re-exec under
--sysroot. PT_INTERP is resolved the same way. test-sysroot-procfs-exec
exercises the full path.
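
The idea of keeping the host path in its own buffer while leaving the guest-visible path untouched can be sketched as follows. `resolve_under_sysroot()` is a hypothetical stand-in and does not reproduce the signature or the edge-case handling of path_resolve_sysroot_path().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the host-side path for guest_path into host[0..host_len).
 * Absolute guest paths are prefixed with sysroot; anything else is
 * copied unchanged. Returns 0 on success, -1 if the result would not
 * fit. The guest_path string itself is never modified, so it can
 * still be published as the guest-visible executable path. */
static int resolve_under_sysroot(const char *sysroot, const char *guest_path,
                                 char *host, size_t host_len)
{
    assert(guest_path != NULL && host != NULL && host_len > 0);

    const char *prefix = "";
    if (sysroot != NULL && sysroot[0] != '\0' && guest_path[0] == '/')
        prefix = sysroot;

    if (strlen(prefix) + strlen(guest_path) + 1 > host_len)
        return -1;

    snprintf(host, host_len, "%s%s", prefix, guest_path);
    return 0;
}
```

With a sysroot of `/opt/root`, the guest path `/bin/sh` resolves to `/opt/root/bin/sh` for the host-side open, while `/bin/sh` remains what the guest sees.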

Low-address mmap hints
A non-fixed mmap with a hint in [ELF_DEFAULT_BASE, MMAP_BASE) now probes
the low arena directly via find_free_gap_inner before falling back to
the high RW arena. box64 and other static-x86 toolchains reserve their
ET_EXEC image window at 0x400000 with a non-fixed hint and dereference
the address afterwards; forcing the mapping into the high arena silently
broke them. The cached gap_hint is intentionally bypassed for the low
probe so unrelated allocations stay sequential up high.
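
The arena choice reduces to a half-open range check on the hint. The sketch below uses hypothetical names (`arena_bounds_t`, `pick_first_arena()`), and the bounds passed in the usage note are placeholders rather than the project's actual ELF_DEFAULT_BASE and MMAP_BASE values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bounds of the low arena: [low_start, low_end). Hypothetical type. */
typedef struct {
    uint64_t low_start; /* plays the role of ELF_DEFAULT_BASE */
    uint64_t low_end;   /* plays the role of MMAP_BASE */
} arena_bounds_t;

typedef enum { ARENA_LOW, ARENA_HIGH } arena_t;

/* Decide which arena to probe first for an mmap request.
 * Only non-fixed requests whose hint lies inside the low arena
 * start there; everything else goes straight to the high arena. */
static arena_t pick_first_arena(arena_bounds_t b, uint64_t hint, bool map_fixed)
{
    assert(b.low_start < b.low_end);

    if (!map_fixed && hint >= b.low_start && hint < b.low_end)
        return ARENA_LOW;
    return ARENA_HIGH;
}
```

With placeholder bounds of `{0x400000, 0x200000000}`, a non-fixed hint of 0x400000 selects the low arena, while a hint of 0 (no preference) selects the high arena.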

brk page granularity
sys_brk now extends the heap and updates page-table perms at
GUEST_PAGE_SIZE granularity rather than in 2MiB blocks. Because
finalize_block_perms() leaves non-covered pages in a split block
invalid, brk-driven growth must call guest_update_perms() on the
materialized range so heap pages inside an already-split block become
accessible.

fork IPC SCM_RIGHTS chunking
sendmsg/recvmsg fd transfers are now chunked at FORK_IPC_FD_CHUNK=120
to stay under the macOS per-cmsg fd limit; the receiver allocates its
own scratch buffer per chunk instead of borrowing CMSG_DATA. The
backing-fd send goes through the same helper and detects stale fds via
fcntl(F_GETFD) before attempting the transfer.
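
A sender that chunks SCM_RIGHTS transfers could be sketched like this. It is not the project's helper: `send_fds_chunked()` is a hypothetical name, the one-byte payload per message is an assumption, and aborting on the first stale fd is just one possible policy.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define DEMO_FD_CHUNK 120 /* mirrors FORK_IPC_FD_CHUNK from the description */

/* Send nfds descriptors over a Unix socket, at most DEMO_FD_CHUNK per
 * message. Returns the number of messages sent, or -1 on a stale fd
 * or a send failure. */
static int send_fds_chunked(int sock, const int *fds, size_t nfds)
{
    assert(fds != NULL || nfds == 0);

    int chunks = 0;
    for (size_t off = 0; off < nfds; off += DEMO_FD_CHUNK) {
        size_t n = nfds - off < DEMO_FD_CHUNK ? nfds - off : DEMO_FD_CHUNK;

        /* Reject stale descriptors before building the message. */
        for (size_t i = 0; i < n; i++)
            if (fcntl(fds[off + i], F_GETFD) < 0)
                return -1;

        /* Union guarantees cmsghdr alignment for the control buffer. */
        union {
            char buf[CMSG_SPACE(sizeof(int) * DEMO_FD_CHUNK)];
            struct cmsghdr align;
        } ctl;
        memset(&ctl, 0, sizeof ctl);

        char marker = 'F'; /* at least one data byte must accompany the fds */
        struct iovec iov = { .iov_base = &marker, .iov_len = 1 };

        struct msghdr msg;
        memset(&msg, 0, sizeof msg);
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctl.buf;
        msg.msg_controllen = CMSG_SPACE(sizeof(int) * n);

        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int) * n);
        memcpy(CMSG_DATA(c), fds + off, sizeof(int) * n);

        if (sendmsg(sock, &msg, 0) != 1)
            return -1;
        chunks++;
    }
    return chunks;
}
```

A matching receiver would call recvmsg once per chunk and copy the descriptors out of CMSG_DATA into its own buffer before reusing the control buffer for the next chunk.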

Stable synthetic procfs identity
/proc/* and /dev/shm stat fills now report a constant PROC_SYNTH_DEV
and a 64-bit FNV-1a hash of the path as st_ino, plus st_blksize=4096.
Without this, directory walkers collapsed multiple synthetic paths
onto the same (dev, ino) pair and reported false filesystem loops.

getcpu(168)
Returns a synthetic CPU=0, node=0 and ignores the obsolete cache
pointer. Required for glibc and file(1) to start on workloads that
probe topology.

--timeout 0 disables the vCPU watchdog
parse_int_arg lower bound dropped from 1 to 0; timeout=0 lets the
vCPU run loop iterate without alarm() preemption, which CPU-bound
guests need.
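
The effect of lowering the bound can be illustrated with a small range-checked parser. `parse_int_in_range()` is a hypothetical stand-in and does not reproduce the signature of parse_int_arg; the upper bound in the usage note is also just an example.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a decimal integer and accept it only if lo <= value <= hi.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
static int parse_int_in_range(const char *s, long lo, long hi, int *out)
{
    assert(s != NULL && out != NULL);
    assert(lo <= hi && lo >= INT_MIN && hi <= INT_MAX);

    char *end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);

    if (errno != 0 || end == s || *end != '\0')
        return -1; /* overflow, empty input, or trailing garbage */
    if (v < lo || v > hi)
        return -1; /* outside the accepted range */

    *out = (int)v;
    return 0;
}
```

With a lower bound of 1, the string "0" is rejected; with a lower bound of 0 it parses to 0, which the caller can then treat as "watchdog disabled".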


Summary by cubic

Hardens clone3/vfork/exec/mmap paths to fix real-world regressions and align behavior with Linux. Adds getcpu support and lets --timeout 0 disable the vCPU watchdog.

  • Process and Threading

    • CLONE_VM|CLONE_VFORK now uses a helper child with a notify pipe so the parent resumes when the child execs or exits; the child’s provided stack is respected. Matches Linux vfork semantics and avoids resetting the parent on child exec.
    • clone3 threads record their stack range; sys_munmap defers unmaps that overlap live stacks and completes them on thread exit before the CLONE_CHILD_CLEARTID wake to prevent reuse races.
    • execve resolves absolute paths under --sysroot for both the main ELF and PT_INTERP while publishing the guest-visible path so /proc/self/exe remains stable across re-exec.
  • Memory and Platform

    • Non-fixed mmap hints below MMAP_BASE are honored by probing the low arena first; fixes ET_EXEC windows (e.g., box64 at 0x400000) without disturbing high-arena gap hints.
    • brk grows at page granularity and updates page-table perms so heap pages in split blocks become accessible.
    • Fork IPC SCM_RIGHTS transfers are chunked (120 fds) with stale-fd checks to avoid macOS cmsg limits.
    • Synthetic /proc/* and /dev/shm now return a stable device and FNV-1a inode, and getcpu(168) returns cpu=0/node=0. --timeout 0 now disables the per-iteration vCPU watchdog.

Written for commit 81c03bc. Summary will update on new commits.

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 22 files

@jserv jserv merged commit 6bd9244 into main May 8, 2026
5 checks passed
@jserv jserv deleted the box64 branch May 8, 2026 08:39