Several independent fixes surfaced while exercising real multi-threaded fork/exec patterns. Each one was a real-world regression rather than defensive cleanup.
clone3 stack lifetime
Worker threads spawned via clone3(stack, stack_size) now record their
guest stack range. sys_munmap() walks active threads under thread_lock
in two passes (validate, then commit). It defers the portion of the
request that overlaps any live stack into a per-thread queue and unmaps
the remaining gaps immediately. The collect/commit/finish/rollback
transaction marks each affected thread busy; a concurrent collect on
the same thread blocks in cond_wait until the in-flight transaction
releases it. On thread exit, mem_cleanup_deferred_stack_unmaps() waits
for busy=0, clears the live stack range, snapshots the queue, and
drains the entries one by one. Successfully unmapped entries are
dropped; failed unmaps stay in the queue and are logged at error level
rather than silently leaking. The drain runs before the
CLONE_CHILD_CLEARTID futex wake so a joiner cannot reuse the freed VA
before the host page tables release it. Legacy clone() recovers the
range from the containing region via guest_region_find().
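
The defer-versus-unmap-now split described above can be sketched as a plain interval computation. This is a minimal illustration, not the code in this PR: range_t, split_t and split_against_stack() are hypothetical names, and the real sys_munmap() additionally handles locking, multiple threads and the busy transaction.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open address range [start, end). */
typedef struct { uint64_t start, end; } range_t;

/* Result of splitting one munmap request against one live stack. */
typedef struct {
    range_t deferred;   /* part overlapping the live stack (empty if none) */
    range_t gaps[2];    /* parts that are safe to unmap immediately */
    int     ngaps;
} split_t;

/* Hypothetical helper: decide what to defer and what to unmap now. */
static split_t split_against_stack(range_t req, range_t stack)
{
    split_t out = { {0, 0}, { {0, 0}, {0, 0} }, 0 };
    assert(req.start <= req.end && stack.start <= stack.end);

    uint64_t lo = req.start > stack.start ? req.start : stack.start;
    uint64_t hi = req.end < stack.end ? req.end : stack.end;

    if (lo >= hi) {                       /* no overlap: whole request is a gap */
        if (req.start < req.end)
            out.gaps[out.ngaps++] = req;
        return out;
    }
    out.deferred = (range_t){ lo, hi };   /* queued until the thread exits */
    if (req.start < lo)
        out.gaps[out.ngaps++] = (range_t){ req.start, lo };
    if (hi < req.end)
        out.gaps[out.ngaps++] = (range_t){ hi, req.end };
    return out;
}
```

For a request of [0x1000, 0x5000) against a live stack at [0x2000, 0x3000), the sketch defers the stack portion and returns the two surrounding gaps.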
CLONE_VFORK with CLONE_VM
CLONE_VM|CLONE_VFORK previously took the in-process VM-clone path,
which would have reset the parent's guest_t on child execve. It now
falls through to the posix_spawn helper-process path, which spawns a
child elfuse process and suspends the parent on a notify pipe
(--vfork-notify-fd) until the child execve()s or exits. This matches
Linux vfork semantics rather than blocking on host child exit.
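
The parent-side wait on a notify pipe might look roughly like the sketch below. The description does not say whether the helper writes a byte or simply closes its end, so this version treats both as a release; vfork_wait_release() is a hypothetical name.

```c
#include <errno.h>
#include <unistd.h>

/* Block until the child side of the notify pipe writes a byte or closes it.
 * Returns 1 if a byte arrived, 0 on EOF (write end closed), -1 on error.
 * Assumption: either event means the parent may resume. */
static int vfork_wait_release(int notify_rd)
{
    char byte;
    for (;;) {
        ssize_t n = read(notify_rd, &byte, 1);
        if (n >= 0)
            return (int)n;
        if (errno != EINTR)      /* retry only when interrupted by a signal */
            return -1;
    }
}
```

Waiting on the pipe rather than on waitpid() is what lets the parent resume at execve() time instead of at child exit.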
execve sysroot resolution
Open and ELF-load go through path_resolve_sysroot_path() into a
separate path_host buffer; proc_set_elf_path() still publishes the
guest-visible path so /proc/self/exe stays stable across re-exec under
--sysroot. PT_INTERP is resolved the same way. test-sysroot-procfs-exec
exercises the full path.
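
A minimal sketch of keeping the host path and the guest-visible path separate is shown below. demo_sysroot_join() is a hypothetical stand-in; the real path_resolve_sysroot_path() is not shown in this description and presumably also deals with symlinks and ".." components.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the host path for an absolute guest path.
 * The guest path itself is left untouched so it can still be published
 * as the /proc/self/exe target. Returns 0 on success, -1 on error. */
static int demo_sysroot_join(const char *sysroot, const char *guest_path,
                             char *path_host, size_t cap)
{
    if (guest_path[0] != '/')
        return -1;                              /* only absolute guest paths */
    size_t rlen = strlen(sysroot);
    while (rlen > 0 && sysroot[rlen - 1] == '/')
        rlen--;                                 /* trim trailing slashes */
    int n = snprintf(path_host, cap, "%.*s%s", (int)rlen, sysroot, guest_path);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0; /* reject truncation */
}
```

The caller would open and load from path_host while passing the original guest path to whatever publishes /proc/self/exe.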
Low-address mmap hints
Non-fixed mmap with hint in [ELF_DEFAULT_BASE, MMAP_BASE) probes the
low arena directly via find_free_gap_inner before falling back to the
high RW arena. box64 and other static-x86 toolchains reserve their
ET_EXEC image window at 0x400000 with a non-fixed hint and dereference
the address afterwards; forcing it into the high arena silently broke
them. Cached gap_hint is intentionally bypassed for the low probe so
unrelated allocations stay sequential up high.
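
The low-arena probe can be illustrated with a first-fit search that starts at the hint and ignores any cached position. The constants and names below are illustrative assumptions only; the actual values of ELF_DEFAULT_BASE and MMAP_BASE and the signature of find_free_gap_inner are not given here.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only, not the project's real constants. */
#define DEMO_ELF_DEFAULT_BASE 0x400000ULL
#define DEMO_MMAP_BASE        0x200000000ULL

typedef struct { uint64_t start, end; } region_t;  /* half-open, sorted by start */

/* Does a non-fixed hint fall in the low arena? */
static int hint_is_low(uint64_t hint)
{
    return hint >= DEMO_ELF_DEFAULT_BASE && hint < DEMO_MMAP_BASE;
}

/* First-fit search upward from `hint`, bounded by `hi`.
 * Returns the chosen address, or 0 when no gap of `len` bytes exists. */
static uint64_t probe_gap(const region_t *r, size_t n, uint64_t hint,
                          uint64_t hi, uint64_t len)
{
    uint64_t cand = hint;
    for (size_t i = 0; i < n; i++) {
        if (r[i].end <= cand)
            continue;                       /* region lies below the candidate */
        if (r[i].start >= cand && r[i].start - cand >= len)
            break;                          /* gap before this region fits */
        cand = r[i].end;                    /* bump past the region */
    }
    return (cand < hi && hi - cand >= len) ? cand : 0;
}
```

A return value of 0 would be the signal to fall back to the high arena, where the cached hint keeps unrelated allocations sequential.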
brk page granularity
sys_brk now extends and updates page-table perms at GUEST_PAGE_SIZE
rather than 2MiB blocks. After finalize_block_perms() leaves
non-covered pages in a split block invalid, brk-driven growth must
call guest_update_perms on the materialized range so heap pages
inside an already-split block become accessible.
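
The difference between the two granularities shows up in simple rounding arithmetic. The sketch below assumes GUEST_PAGE_SIZE is 4096, which this description does not state; brk_growth_span() is a hypothetical helper, not the sys_brk implementation.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE  4096ULL        /* assumed value of GUEST_PAGE_SIZE */
#define DEMO_BLOCK_SIZE (2ULL << 20)   /* 2 MiB block */

typedef struct { uint64_t start, end; } span_t;   /* half-open */

/* Round x up to a power-of-two alignment a. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Range that becomes newly materialized when brk grows from old_brk to
 * new_brk, given the granularity at which the heap is extended. */
static span_t brk_growth_span(uint64_t old_brk, uint64_t new_brk, uint64_t gran)
{
    span_t s = { align_up(old_brk, gran), align_up(new_brk, gran) };
    if (s.end < s.start)
        s.end = s.start;               /* shrink: nothing to materialize */
    return s;
}
```

Growing from 0x601000 to 0x603800 yields [0x601000, 0x604000) at page granularity but an empty span at 2 MiB granularity, so block-level bookkeeping never revisits the pages inside an already-split block.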
fork IPC SCM_RIGHTS chunking
sendmsg/recvmsg fd transfers are now chunked at FORK_IPC_FD_CHUNK=120
to avoid the macOS per-cmsg fd limit; the receiver allocates its own
scratch buffer per chunk instead of borrowing CMSG_DATA. The backing-fd
send goes through the same helper and detects stale fds via
fcntl(F_GETFD) before attempting transfer.
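
A chunked SCM_RIGHTS transfer could be sketched as follows. Only the chunk size of 120 comes from this description; the function names, the one-byte payload and the error handling are assumptions, and the stale-fd check via fcntl(F_GETFD) is left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define DEMO_FD_CHUNK 120   /* mirrors FORK_IPC_FD_CHUNK from the description */

typedef union {
    struct cmsghdr align;   /* forces correct alignment of the buffer */
    char buf[CMSG_SPACE(sizeof(int) * DEMO_FD_CHUNK)];
} demo_ctl_t;

/* Send at most DEMO_FD_CHUNK fds as a single SCM_RIGHTS control message. */
static int send_fd_chunk(int sock, const int *fds, size_t n)
{
    demo_ctl_t ctl;
    char payload = 'F';
    struct iovec iov = { &payload, 1 };
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = CMSG_SPACE(sizeof(int) * n);

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int) * n);
    memcpy(CMSG_DATA(c), fds, sizeof(int) * n);
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Split an arbitrary fd list into chunks of DEMO_FD_CHUNK. */
static int send_fds_chunked(int sock, const int *fds, size_t count)
{
    for (size_t off = 0; off < count; off += DEMO_FD_CHUNK) {
        size_t n = count - off < DEMO_FD_CHUNK ? count - off : DEMO_FD_CHUNK;
        if (send_fd_chunk(sock, fds + off, n) != 0)
            return -1;
    }
    return 0;
}

/* Receive one chunk and copy the fds into the caller's own buffer rather
 * than keeping pointers into the control message. Returns count or -1. */
static int recv_fd_chunk(int sock, int *out, size_t cap)
{
    demo_ctl_t ctl;
    char payload;
    struct iovec iov = { &payload, 1 };
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
        return -1;
    size_t n = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
    if (n > cap)
        return -1;
    memcpy(out, CMSG_DATA(c), n * sizeof(int));
    return (int)n;
}
```

The receiver would call recv_fd_chunk() once per chunk the sender emits, so both sides need to agree on the total fd count out of band.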
Stable synthetic procfs identity
/proc/* and /dev/shm stat fills now report a constant PROC_SYNTH_DEV
and a 64-bit FNV-1a hash of the path as st_ino, plus st_blksize=4096.
Without this, directory walkers collapsed multiple synthetic paths
onto the same (dev, ino) pair and reported false filesystem loops.
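
For reference, 64-bit FNV-1a over a path string is only a few lines. This sketch uses the standard FNV offset basis and prime; how the PR maps the hash into st_ino (for example whether it reserves any values) is not stated here, and fnv1a64() is a hypothetical name.

```c
#include <stdint.h>

/* 64-bit FNV-1a over a NUL-terminated path. */
static uint64_t fnv1a64(const char *path)
{
    uint64_t h = 0xcbf29ce484222325ULL;          /* FNV-1a 64-bit offset basis */
    for (const unsigned char *p = (const unsigned char *)path; *p; p++) {
        h ^= *p;                                 /* xor the byte in first... */
        h *= 0x100000001b3ULL;                   /* ...then multiply by the prime */
    }
    return h;
}
```

Because the hash depends only on the path, repeated stat calls on the same synthetic entry report the same inode, while distinct paths get distinct inodes with overwhelming probability.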
getcpu(168)
Returns synthetic CPU=0, node=0; the obsolete cache pointer is ignored. Required for
glibc and file(1) to start on workloads that probe topology.
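
A synthetic getcpu handler reduces to the sketch below. It uses host pointers for brevity; the real handler presumably takes guest addresses and translates them before writing, and demo_sys_getcpu() is a hypothetical name.

```c
#include <stddef.h>

/* Report CPU 0 / node 0 and ignore the obsolete third argument.
 * Either output pointer may be NULL, as with the Linux syscall. */
static long demo_sys_getcpu(unsigned *cpu, unsigned *node, void *unused_cache)
{
    (void)unused_cache;
    if (cpu)
        *cpu = 0;
    if (node)
        *node = 0;
    return 0;
}
```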
--timeout 0 disables the vCPU watchdog
parse_int_arg lower bound dropped from 1 to 0; timeout=0 lets the
vCPU run loop iterate without alarm() preemption, which CPU-bound
guests need.
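
The effect of lowering the bound can be shown with a small bounded parser. demo_parse_int() is a hypothetical stand-in; the real parse_int_arg signature is not shown in this description.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a decimal integer and accept it only if it lies in [min, max].
 * Returns 0 and stores the value on success, -1 otherwise. */
static int demo_parse_int(const char *s, int min, int max, int *out)
{
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return -1;                     /* overflow, empty, or trailing junk */
    if (v < min || v > max)
        return -1;                     /* outside the accepted range */
    *out = (int)v;
    return 0;
}
```

With a lower bound of 1 the string "0" is rejected; with a lower bound of 0 it parses, and the caller can treat 0 as "do not arm the watchdog".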
Summary by cubic
Hardens clone3/vfork/exec/mmap paths to fix real-world regressions and align behavior with Linux. Adds getcpu support and lets --timeout 0 disable the vCPU watchdog.
Process and Threading
... --sysroot for both the main ELF and PT_INTERP while publishing the guest-visible path so /proc/self/exe remains stable across re-exec.
Memory and Platform
... MMAP_BASE are honored by probing the low arena first; fixes ET_EXEC windows (e.g., box64 at 0x400000) without disturbing high-arena gap hints.
/proc/* and /dev/shm now return a stable device and FNV-1a inode, and getcpu(168) returns cpu=0/node=0.
--timeout 0 now disables the per-iteration vCPU watchdog.
Written for commit 81c03bc.