This document is the canonical technical reference for the runtime. It maps the source tree, then walks each subsystem from guest entry to syscall return: memory layout, the EL1 shim and HVC protocol, page-table management, syscall translation, threads, fork/clone, signals, ptrace, dynamic linking, and the GDB stub.
It is aimed at contributors. For a high-level overview see ../README.md; for command-line use see usage.md; for the validation flow see testing.md.
elfuse runs one Linux guest process inside one Hypervisor.framework VM owned
by one macOS host process. Guest code executes at EL0. A small EL1 shim
(src/core/shim.S) handles exceptions, provides the syscall trap path, and
cooperates with the host through a compact HVC protocol.
Boot sequence:
- `src/main.c` parses options and prepares guest bootstrap state.
- `src/core/elf.c` parses the ELF image and any `PT_INTERP` interpreter.
- `src/core/stack.c` builds a Linux-style initial stack (argc, argv, envp, auxv).
- `src/core/guest.c` reserves guest memory, page tables, and semantic regions.
- `src/core/shim.S` enters guest EL0 and forwards traps through HVC exits.
- `src/syscall/syscall.c` dispatches Linux syscalls into domain handlers.
- `src/syscall/proc.c` owns the main vCPU run loop and stop/exit integration.
Top-level areas:
- `src/core/`: guest memory, ELF loading, bootstrap, initial stack, vDSO, EL1 shim assembly
- `src/syscall/`: Linux syscall handlers and translation boundaries
- `src/runtime/`: thread table, futexes, procfs helpers, fork/clone IPC
- `src/debug/`: crash reporting and built-in GDB RSP stub
Key files:
| File | Role |
|---|---|
| `src/main.c` | CLI, bootstrap, first vCPU creation, debugger startup |
| `src/core/guest.c` | guest memory reservation, region tracking, page tables |
| `src/core/elf.c` | ELF parsing, PT_LOAD, PT_INTERP, loader decisions |
| `src/core/stack.c` | Linux initial stack and auxv construction |
| `src/core/shim.S` | EL1 shim, exception vectors, HVC protocol |
| `src/syscall/syscall.c` | syscall dispatch and shared wrapper helpers |
| `src/syscall/mem.c` | brk, mmap, mprotect, mremap, madvise, msync |
| `src/syscall/fs.c, fs-stat.c, fs_xattr.c` | filesystem syscalls |
| `src/syscall/io.c, poll.c, fd.c, fdtable.c` | I/O, polling, FD lifecycle and table |
| `src/syscall/signal.c` | signal delivery, rt_sigframe, rt_sigaction |
| `src/syscall/time.c` | clocks, timers, setitimer, clock-ID translation |
| `src/syscall/sys.c` | uname, sysinfo, getrandom, prlimit64 |
| `src/syscall/net.c, net-msg.c, net-sockopt.c, netlink.c` | sockets |
| `src/syscall/translate.c` | errno and shared AT_* flag translation |
| `src/syscall/proc.c` | vCPU run loop, wait4, ptrace coordination |
| `src/syscall/exec.c` | execve: ELF reload and vCPU restart |
| `src/runtime/forkipc.c, fork-state.c` | fork/clone state transfer |
| `src/runtime/thread.c, futex.c` | guest thread table, futex wait queues |
| `src/runtime/procemu.c` | /proc, /dev, and selected pseudo-files |
| `src/debug/gdbstub.c, gdbstub-rsp.c, gdbstub-reg.c` | GDB RSP stub |
Apple HVF imposes a handful of constraints that shape the rest of the design:
- W^X is enforced even with `SCTLR.WXN=0`. A page-table entry cannot be both writable and executable. Use RW for data and RX for code.
- HVF returns `SCTLR=0x0` by default; `RES1` bits must be set explicitly. The shim uses `SCTLR_RES1` (`0x30D00980`) plus the desired bits.
- The MMU must be enabled from inside the vCPU (via HVC #4 from the shim), not by the host before `hv_vcpu_run()`. Setting `SCTLR.M=1` from the host before vCPU entry causes permission faults on the first instruction fetch.
- `GUEST_IPA_BASE` must be `0`. ELF binaries use absolute addresses from their link address (e.g., `0x400000`); a non-zero IPA base produces translation faults.
- System registers cannot be set via `MSR` from the guest because `HCR_EL2.TSC=1` traps all `MSR` writes. Boot-time sysreg installation (RES1 bits, MMU enable, TTBR0, etc.) goes through HVC #4 from the EL1 shim. Runtime EL0 sysreg traps — `MSR TPIDR_EL0` and similar — are handled by the HVC #12 system-instruction trap path.
- Only `HV_SYS_REG_*` constants from Hypervisor.framework may be used for register IDs.
- Cross-thread vCPU register access is unreliable; all register access must happen on the owning thread. This drives the snapshot protocol used by both ptrace and the GDB stub.
- HVF allows only one VM per host process. This is the reason `fork` is implemented through `posix_spawn` plus IPC state transfer.
Guest memory is identity-mapped (guest VA == guest IPA). Large areas use 2 MiB block descriptors by default; mappings that need mixed permissions are split to 4 KiB L3 pages (see Page Table Splitting).
```
0x000010000 - 0x0000FFFFF: Page table pool (960 KiB)
0x000100000 - 0x0001FFFFF: Shim code (2 MiB block, RX)
0x000200000 - 0x0003FFFFF: Shim data/stack (2 MiB block, RW)
0x000400000 - varies:      ELF LOAD segments (PIE_LOAD_BASE for ET_DYN)
0x001000000:               brk base (16 MiB)
0x007800000 - 0x007800FFF: Stack guard page (PROT_NONE, dynamic position)
0x007801000 - 0x007FFFFFF: Stack (8 MiB, 4×2 MiB blocks, RW, grows down)
0x010000000 - 0x0101FFFFF: mmap RX region (initial 2 MiB, pre-mapped RX)
0x010200000 - mmap_limit:  mmap RX growth area
0x200000000 - 0x2001FFFFF: mmap RW region (initial 2 MiB at 8 GiB, RW)
0x200200000 - mmap_limit:  mmap RW growth area
interp_base - varies:      Dynamic linker (g->interp_base, --sysroot only)
```
The guest size is determined by the VM's configured IPA width (capped at 40-bit / 1 TiB):
- 36-bit IPA (64 GiB) — native AArch64 on Apple M2: `mmap_limit ≈ 56 GiB`, `interp_base ≈ 60 GiB`
- 40-bit IPA (1 TiB) — native AArch64 on Apple M3 and later: `mmap_limit ≈ 1016 GiB`, `interp_base ≈ 1020 GiB`
Both mmap_limit and interp_base are computed at runtime from guest_size
and stored in guest_t. macOS demand-pages physical memory on first touch,
so the reservation costs no RAM until the guest writes to a page.
The mmap RW region starts at 8 GiB to match real Linux kernel address-space
layout, where mmap allocations sit well above text/data/brk.
For address spaces larger than 512 GiB, the L0 page table needs multiple entries (each covers 512 GiB). The page-table builders compute the L0 index from the actual IPA and allocate L1 tables on demand per L0 slot.
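The index math can be sketched as follows. This is an illustrative sketch, not the elfuse source; it assumes the standard 4 KiB-granule scheme in which each L0 slot spans `1 << 39` bytes (512 GiB) and tables hold 512 entries.

```c
#include <stdint.h>

/* Each L0 entry covers 512 GiB (1 << 39 bytes) with a 4 KiB granule,
 * so the L0 index is bits [47:39] of the IPA. */
#define L0_SHIFT      39
#define TABLE_ENTRIES 512

static inline unsigned l0_index(uint64_t ipa) {
    return (unsigned)((ipa >> L0_SHIFT) & (TABLE_ENTRIES - 1));
}
```

With this layout, every address below 512 GiB (the whole 36-bit configuration) lands in L0 slot 0, while the 40-bit configuration's `interp_base ≈ 1020 GiB` lands in slot 1, which is why the builders must allocate L1 tables per L0 slot.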
A 2 MiB L2 block descriptor cannot mix RW and RX permissions, but real shared
libraries combine .text (RX) and .data (RW) within a single 2 MiB range.
guest_split_block() in src/core/guest.c converts an L2 block descriptor
into a table descriptor pointing to an L3 table of 512 × 4 KiB page entries,
each with independent permissions.
Splitting is triggered by:
- `sys_mmap` with `MAP_FIXED`, when the fixed address lands in a block whose permissions differ from the request (typical case: the dynamic linker overlaying `.data` RW onto a library `.text` RX block).
- `sys_mprotect`, when changing permissions for a sub-block range, e.g. RELRO finalization.
guest_update_perms() orchestrates the full workflow: it checks whether a
block needs splitting, splits it if so, then updates the affected L3 page
entries. Whole-block permission changes are done in place without splitting.
mmap itself uses a gap-finding allocator that walks the sorted region array
to find free address space. PROT_EXEC requests go to the RX region
(MMAP_RX_BASE = 0x10000000); other requests go to the RW region
(MMAP_BASE = 0x200000000). Address hints are honored when possible. This
arrangement makes .text and .data land in different 2 MiB blocks where
practical, and L3 splitting handles the residual cases where they share a
block.
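The block-to-table rewrite at the heart of the split can be sketched as follows. This is a simplified illustration assuming the standard VMSAv8-64 4 KiB-granule descriptor encodings (block descriptors end in `0b01`, table and page descriptors in `0b11`); the helper name and the blanket attribute copy are illustrative, not the `guest_split_block()` source, which also manages the page-table pool.

```c
#include <stdint.h>

#define PAGE_SIZE  4096ULL
#define L3_ENTRIES 512
#define DESC_TABLE 0x3ULL                    /* L2 table descriptor: bits[1:0] = 11 */
#define DESC_PAGE  0x3ULL                    /* L3 page descriptor:  bits[1:0] = 11 */
#define ADDR_MASK  0x0000FFFFFFFFF000ULL     /* output-address field */

/* Rewrite *l2_entry (a 2 MiB block) as a table descriptor pointing at the
 * L3 table at physical address l3_pa, replicating the block's attribute
 * bits into 512 independent 4 KiB page entries. */
static void split_block(uint64_t *l2_entry, uint64_t l3[L3_ENTRIES],
                        uint64_t l3_pa) {
    uint64_t block = *l2_entry;
    uint64_t base  = block & ADDR_MASK;          /* 2 MiB-aligned base */
    uint64_t attrs = block & ~ADDR_MASK & ~3ULL; /* keep attribute bits */
    for (int i = 0; i < L3_ENTRIES; i++)
        l3[i] = ((base + i * PAGE_SIZE) & ADDR_MASK) | attrs | DESC_PAGE;
    *l2_entry = (l3_pa & ADDR_MASK) | DESC_TABLE;
}
```

After the split, individual entries in `l3[]` can be given RX or RW permissions independently, which is exactly what the RELRO and `MAP_FIXED` overlay cases need.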
When sys_mmap or sys_brk needs memory beyond the currently mapped page
tables, the host calls guest_extend_page_tables() to add new L2 entries.
This is safe because the vCPU is paused while servicing HVC #5. After the
host modifies the tables, it sets X8 = 1 and clears g->need_tlbi. The
shim inspects X8 on return from HVC #5 and selects one of three paths:
```
X8 == 0: no flush; restore GPRs (keep X0); ERET
X8 == 1: TLBI VMALLE1IS; DSB ISH; IC IALLU; DSB ISH; ISB
         → restore GPRs (keep X0); ERET
X8 == 2: TLBI VMALLE1IS; DSB ISH; ISB; IC IALLU; DSB ISH; ISB
         → discard the saved GPR frame (add sp, sp, #256) and ERET
           on the rebuilt EL0 register state
```
X8 == 1 is taken when page-table permissions or layout changed but the
syscall return values still flow through the normal SVC ABI: the shim
restores X1–X30 from the saved frame, and only X8, which the Linux
syscall ABI already treats as clobbered, carries the flag.
X8 == 2 is the execve / rt_sigreturn path: the host wrote a fresh
register state directly into the vCPU and the old saved frame on the EL1
stack is stale. The shim drops the frame and ERETs without restoring
GPRs. IC IALLU is required because the new program text may live in the
same physical pages that previously held the old text.
X8 (the syscall-number register) is already considered clobbered by the
Linux syscall ABI, so callers never expect it to be preserved across SVC.
Important: sys_execve and sys_rt_sigreturn return SYSCALL_EXEC_HAPPENED,
which bypasses the normal syscall dispatch epilogue. They therefore write
X8 = 2 directly via hv_vcpu_set_reg(). Any future code path that
returns SYSCALL_EXEC_HAPPENED and rebuilds the EL0 register state must do
the same.
Vectors enter EL1 from EL0 traps and forward them to the host through HVC.
DC ZVA is emulated inside the shim (it zeroes 64 bytes at the cache-line-aligned address from the Rt register); HVF traps DC ZVA via HCR_EL2.TDZ=1.
| HVC # | Purpose | Registers |
|---|---|---|
| #0 | Normal exit | X0 = exit code |
| #2 | Bad exception | X0=ESR, X1=FAR, X2=ELR, X3=SPSR, X5=vector |
| #4 | Set boot system register | X0 = reg ID (0–8), X1 = value (used by the shim during boot to install RES1 bits and enable the MMU) |
| #5 | Syscall forward | X0–X5 = args, X8 = syscall number on entry; on return X8 is the post-syscall flag (0 = none, 1 = TLBI required, 2 = execve/rt_sigreturn rebuilt the register frame) |
| #7 | MRS trap (read sysreg) | host reads register from ESR ISS; returns value in X0 |
| #9 | W^X toggle | X0 = FAR, X1 = type (0 = exec→RX, 1 = write→RW) |
| #10 | BRK from EL0 | SIGTRAP delivery / ptrace-stop; GPRs in frame |
| #11 | EL0 fault | SIGSEGV/SIGILL delivery; GPRs in frame |
| #12 | EL0 system-instruction trap | cache maintenance logging (DC CVAU, IC IVAU, …) and MSR TPIDR_EL0 emulation |
Vector entry stubs that lead to svc_handler MUST NOT clobber any GPR. The
Linux syscall ABI preserves all registers except X0 across SVC #0, and
musl/glibc rely on this for scratch registers (X9–X15). If a vector entry
writes to any GPR (e.g., mov x5, #offset) before svc_handler saves
registers, the saved value is wrong and the EL0 caller's register state is
corrupted after ERET.
Only bad_exception vectors may clobber X5 (they halt, so preservation is
unnecessary).
Linux user-space compatibility comes from explicit translation at the syscall
boundary. src/syscall/syscall.c routes syscall numbers into focused domain
files. The translation layer is responsible for:
- errno translation
- flag translation
- Linux/macOS structure layout adaptation
- guest memory copying for pointer arguments
- Linux-compatible descriptor and signal semantics
macOS and Linux errno values diverge starting around 35. linux_errno() in
src/syscall/translate.c maps via switch. Notable mappings:
| Linux | macOS |
|---|---|
| `EAGAIN` (11) | `EAGAIN` (35) |
| `ENOSYS` (38) | `ENOSYS` (78) |
| `ENAMETOOLONG` (36) | `ENAMETOOLONG` (63) |
| `ELOOP` (40) | `ELOOP` (62) |
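The shape of that switch can be sketched as follows. This is illustrative only, with the macOS values written as literals so the sketch is host-independent; the real table in `translate.c` covers many more values.

```c
/* Map a macOS errno value to its Linux equivalent. Values below the
 * divergence point (~35) mostly agree and pass through unchanged. */
static int linux_errno_sketch(int mac_errno) {
    switch (mac_errno) {
    case 35: return 11;  /* EAGAIN */
    case 62: return 40;  /* ELOOP */
    case 63: return 36;  /* ENAMETOOLONG */
    case 78: return 38;  /* ENOSYS */
    default: return mac_errno;  /* low values (e.g. ENOENT = 2) agree */
    }
}
```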
AT_* flag bits differ between Linux and macOS:
| Flag | Linux | macOS |
|---|---|---|
| `AT_SYMLINK_NOFOLLOW` | `0x100` | `0x20` |
| `AT_SYMLINK_FOLLOW` | `0x400` | `0x40` |
| `AT_REMOVEDIR` | `0x200` | `0x80` |
Most AT_* flags go through translate_at_flags() in
src/syscall/translate.c before issuing macOS calls. faccessat is a
special case: Linux defines AT_EACCESS at the same bit (0x200) as
AT_REMOVEDIR, so faccessat paths instead use
translate_faccessat_flags() (also in src/syscall/translate.c), which
interprets that bit as AT_EACCESS.
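Both helpers can be sketched under the values from the table above. The helper names and the macOS `AT_EACCESS` value (`0x10`) are assumptions for illustration, not the `translate.c` source.

```c
/* Linux-side bits (from the table above). */
#define LX_AT_SYMLINK_NOFOLLOW 0x100
#define LX_AT_SYMLINK_FOLLOW   0x400
#define LX_AT_REMOVEDIR        0x200  /* same bit as Linux AT_EACCESS */
/* macOS-side bits (AT_EACCESS = 0x10 is an assumption). */
#define MAC_AT_EACCESS          0x10
#define MAC_AT_SYMLINK_NOFOLLOW 0x20
#define MAC_AT_SYMLINK_FOLLOW   0x40
#define MAC_AT_REMOVEDIR        0x80

static int translate_at_flags_sketch(int lx) {
    int mac = 0;
    if (lx & LX_AT_SYMLINK_NOFOLLOW) mac |= MAC_AT_SYMLINK_NOFOLLOW;
    if (lx & LX_AT_SYMLINK_FOLLOW)   mac |= MAC_AT_SYMLINK_FOLLOW;
    if (lx & LX_AT_REMOVEDIR)        mac |= MAC_AT_REMOVEDIR;
    return mac;
}

/* faccessat variant: bit 0x200 means AT_EACCESS here, never AT_REMOVEDIR. */
static int translate_faccessat_flags_sketch(int lx) {
    int mac = 0;
    if (lx & 0x200)                  mac |= MAC_AT_EACCESS;
    if (lx & LX_AT_SYMLINK_NOFOLLOW) mac |= MAC_AT_SYMLINK_NOFOLLOW;
    return mac;
}
```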
AArch64 Linux open flags differ from x86_64. From asm-generic/fcntl.h:

| Flag | Value (octal) | Value (hex) |
|---|---|---|
| `O_DIRECTORY` | `040000` | `0x4000` |
| `O_NOFOLLOW` | `0100000` | `0x8000` |
| `O_DIRECT` | `0200000` | `0x10000` |
| `O_LARGEFILE` | `0400000` | `0x20000` (no-op on LP64) |
| `O_CLOEXEC` | `02000000` | `0x80000` |
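The remapping can be sketched as follows. The macOS-side values (`O_NOFOLLOW = 0x100`, `O_DIRECTORY = 0x100000`, `O_CLOEXEC = 0x1000000`) are assumptions from the macOS `<fcntl.h>`, and the helper name is illustrative; `O_LARGEFILE` is simply dropped since it is a no-op on LP64.

```c
/* AArch64-Linux bits (from the table above). */
#define LX_O_DIRECTORY 0x4000
#define LX_O_NOFOLLOW  0x8000
#define LX_O_CLOEXEC   0x80000
/* macOS bits (assumed values from <fcntl.h>). */
#define MAC_O_NOFOLLOW  0x100
#define MAC_O_DIRECTORY 0x100000
#define MAC_O_CLOEXEC   0x1000000

static int translate_open_flags_sketch(int lx) {
    int mac = lx & 3;  /* O_RDONLY/O_WRONLY/O_RDWR agree on both sides */
    if (lx & LX_O_DIRECTORY) mac |= MAC_O_DIRECTORY;
    if (lx & LX_O_NOFOLLOW)  mac |= MAC_O_NOFOLLOW;
    if (lx & LX_O_CLOEXEC)   mac |= MAC_O_CLOEXEC;
    return mac;  /* O_LARGEFILE (0x20000) intentionally dropped */
}
```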
Linux CLOCK_MONOTONIC = 1, macOS CLOCK_MONOTONIC = 6. See
translate_clockid() in src/syscall/time.c. Other clock IDs are
translated similarly.
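A minimal sketch of the mapping, using only the two IDs named above (the real `translate_clockid()` covers more clocks):

```c
/* Linux clock IDs remapped to macOS: CLOCK_REALTIME is 0 on both,
 * CLOCK_MONOTONIC is 1 on Linux and 6 on macOS. */
static int translate_clockid_sketch(int linux_id) {
    switch (linux_id) {
    case 0:  return 0;   /* CLOCK_REALTIME */
    case 1:  return 6;   /* CLOCK_MONOTONIC */
    default: return -1;  /* unhandled in this sketch */
    }
}
```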
Socket syscalls are translated in src/syscall/net.c and friends:
- `AF_INET6` differs: Linux `10`, macOS `30`.
- `sockaddr` has no `sa_len` byte on Linux but does on macOS. All conversions go through `linux_to_mac_sockaddr()` and `mac_to_linux_sockaddr()`.
- Linux ORs `SOCK_NONBLOCK` (`0x800`) and `SOCK_CLOEXEC` (`0x80000`) into the type argument; both bits must be extracted before calling `socket()`.
- `SOL_SOCKET` option numbers (`SO_TYPE`, `SO_SNDBUF`, `SO_RCVBUF`, …) differ between platforms and are remapped per option.
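The `sa_len`/family fix-up can be sketched as follows. The struct and helper names are hypothetical, and only the two-byte header is shown: Linux `sockaddr` begins with a 16-bit `sa_family`, while macOS splits that space into an 8-bit `sa_len` and an 8-bit `sa_family`.

```c
#include <stdint.h>
#include <string.h>

struct linux_sockaddr_hdr { uint16_t sa_family; };
struct mac_sockaddr_hdr   { uint8_t sa_len; uint8_t sa_family; };

/* Convert the header of a Linux sockaddr to macOS form: fill in sa_len
 * and remap AF_INET6 (Linux 10 -> macOS 30). AF_INET (2) matches. */
static void linux_to_mac_hdr(const void *lx, void *mac, uint8_t total_len) {
    struct linux_sockaddr_hdr lh;
    struct mac_sockaddr_hdr mh;
    memcpy(&lh, lx, sizeof lh);
    mh.sa_len    = total_len;
    mh.sa_family = (lh.sa_family == 10) ? 30 : (uint8_t)lh.sa_family;
    memcpy(mac, &mh, sizeof mh);
}
```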
The Linux initial stack must have SP 16-byte aligned and pointing directly
at argc. The total number of 8-byte words in the structured area is
`35 + extra + argc + envc`, where:

- `35` covers the always-present pieces: 15 auxv entries, including the always-present `AT_EXECFN` and `AT_BASE`, pack into `15 × 2 = 30` words; the auxv `AT_NULL` terminator adds 2, the `envp` and `argv` NULL terminators add 2, and `argc` itself adds 1, for a total of `35`.
- `extra` adds `2` for `AT_SYSINFO_EHDR` if a vDSO is present and another `2` for `AT_EXECFD` when `binfmt_misc` passes one.
- `argc` and `envc` are the user-provided argument and environment counts.
If that total is odd, one padding word is pushed before auxv. Padding
goes above the structured area, never below. Post-push masking
(sp &= ~15) breaks because it inserts a gap between SP and argc. See
build_linux_stack() in src/core/stack.c for the full layout.
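The word-count rule can be sketched as follows. This is an illustrative helper, not the `build_linux_stack()` source: with 8-byte words, SP is 16-byte aligned exactly when the word count is even, so an odd total gets one pad word above the structured area.

```c
/* Number of 8-byte words in the structured initial-stack area,
 * including the single pad word needed to keep SP 16-byte aligned
 * while still pointing directly at argc. */
static unsigned long stack_words(unsigned long argc, unsigned long envc,
                                 int has_vdso, int has_execfd) {
    unsigned long extra = (has_vdso ? 2UL : 0UL) + (has_execfd ? 2UL : 0UL);
    unsigned long total = 35UL + extra + argc + envc;
    if (total & 1)   /* odd: push one pad word above the area */
        total += 1;
    return total;
}
```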
MAP_SHARED is treated as MAP_PRIVATE (copy-on-write). The guest is a
single process, so shared and private semantics are equivalent here. This
also enables tools like sort that expect file-backed shared mappings on
large inputs.
File-backed mmap must pread() file contents into guest memory; both the
MAP_FIXED and non-fixed paths need this. The MAP_FIXED path zeros the
region first (to cover pages beyond EOF), then overlays the file content.
Guest threads map 1:1 to host pthreads. Each guest thread owns one vCPU and
shares the same guest address space. HVF supports multiple vCPUs per VM,
each bound to the host thread that created it; multiple vCPUs share guest
physical memory via hv_vm_map(). Up to MAX_THREADS = 64 guest threads
are supported per VM.
src/runtime/thread.c and thread.h:
- `thread_entry_t` per thread holds the vCPU handle, host pthread, per-thread signal mask, `clear_child_tid` (for `CLONE_CHILD_CLEARTID`), and the thread's `SP_EL1` exception stack.
- `_Thread_local current_thread` gives O(1) access from syscall handlers.
- Each thread receives a 4 KiB EL1 exception stack carved out of the shim data region.
src/runtime/futex.c implements:
- The classic ops: `FUTEX_WAIT`, `FUTEX_WAKE`, `FUTEX_WAIT_BITSET`, `FUTEX_WAKE_BITSET`, `FUTEX_REQUEUE`, `FUTEX_CMP_REQUEUE`, `FUTEX_WAKE_OP`.
- A subset of priority-inheritance ops: `FUTEX_LOCK_PI`, `FUTEX_UNLOCK_PI`, `FUTEX_TRYLOCK_PI`. Priority semantics are not actually inherited — the ops behave as ordinary mutex acquire/release, which is enough for glibc and musl to make forward progress.
- `futex_waitv` (syscall 449) for batch waits across up to 128 futex addresses.
- Robust-list cleanup: `set_robust_list`/`get_robust_list` are wired up in `src/syscall/syscall.c`, and `robust_list_walk()` releases owned futexes on thread exit, marking them `FUTEX_OWNER_DIED`.
Wait queues live in a hash table keyed by guest virtual address, with a
per-bucket mutex. Each waiter has its own condition variable for precise
wakeup. Thread exit calls futex_wake_one() on clear_child_tid, which is
how pthread_join() waits via the TID address.
Atomicity: the bucket mutex is held across the futex word read and the waiter enqueue, so the compare-and-wait is a single critical section.
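The compare-and-wait critical section can be sketched as follows. Names are hypothetical; the real code hashes the guest VA to pick a bucket and gives each waiter its own condition variable, where this sketch uses one per bucket.

```c
#include <pthread.h>
#include <stdint.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} futex_bucket_t;

/* FUTEX_WAIT core: the bucket mutex covers both the futex-word compare
 * and the wait, so a waker holding the same mutex cannot slip between
 * them. Returns 0 after a wakeup, or -11 (Linux EAGAIN) on mismatch. */
static int futex_wait_sketch(futex_bucket_t *b, volatile uint32_t *uaddr,
                             uint32_t expected) {
    pthread_mutex_lock(&b->lock);
    if (*uaddr != expected) {                 /* recheck under the lock */
        pthread_mutex_unlock(&b->lock);
        return -11;
    }
    pthread_cond_wait(&b->cond, &b->lock);    /* atomically drops the lock */
    pthread_mutex_unlock(&b->lock);
    return 0;
}

/* FUTEX_WAKE core: taking the bucket mutex before broadcasting ensures
 * no waiter is between its value check and its cond wait. */
static void futex_wake_sketch(futex_bucket_t *b) {
    pthread_mutex_lock(&b->lock);
    pthread_cond_broadcast(&b->cond);
    pthread_mutex_unlock(&b->lock);
}
```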
| Resource | Lock | File |
|---|---|---|
| mmap/brk allocators + page tables | `mmap_lock` | `src/syscall/mem.c` |
| FD table | `fd_lock` | `src/syscall/fdtable.c` |
| Thread table | pthread_mutex | `src/runtime/thread.c` |
| Futex wait queues | pthread_mutex (per bucket) | `src/runtime/futex.c` |
Lock ordering is documented inline in those files (mmap_lock is order 1,
fd_lock is order 3) so callers that need multiple locks acquire them in
the right sequence.
Page-table consistency is preserved by the mmap_lock plus TLB broadcasts
via TLBI VMALLE1IS from any vCPU; hardware coherency is verified by
tests/test-multi-vcpu.
exit_group sets a global exit_group_requested flag, calls
thread_for_each(thread_force_exit_cb) which invokes hv_vcpus_exit() on
all worker vCPUs to break them out of hv_vcpu_run(), and joins worker
threads with a timeout so CLEARTID cleanup can complete.
- True priority inheritance for the PI futex ops. The ops are wired up but behave as ordinary mutexes; this matches what glibc and musl need for forward progress, not real RT scheduling.
- CPU affinity (`sched_setaffinity`) returns the all-CPUs mask.
macOS HVF allows only one VM per process, so process-style fork cannot
clone the live VM. elfuse implements it through posix_spawn plus IPC
state transfer:
1. Parent creates a `socketpair(AF_UNIX, SOCK_STREAM)`.
2. Parent `posix_spawn`s a new `elfuse --fork-child <fd>` process.
3. Parent serializes VM state over IPC (see paths below).
4. Child receives state, creates its own VM, restores registers directly into EL0 (bypassing the shim `_start` so callee-saved GPRs survive), and enters the vCPU loop with `X0 = 0` (the child return from `clone`).
5. Parent records the child in the process table and returns the child PID.
When g->shm_fd >= 0 the guest memory is file-backed (mkstemp + unlink,
MAP_SHARED). Fork sends the backing fd over SCM_RIGHTS:
- Parent stays on `MAP_SHARED` and does NOT remap — HVF caches the host VA→PA mapping from `hv_vm_map`, and a `MAP_FIXED` remap does not update Stage-2, so a remapping parent would observe stale pages.
- Child maps the fd `MAP_PRIVATE`, producing an instant CoW clone with zero data copy.
- The IPC header sets `has_shm = 1` and `num_regions = 0`, skipping memory serialization entirely.
- Child calls `guest_init_from_shm()` instead of `guest_init()`, and must restore `g->ttbr0` from the IPC header — `guest_init_from_shm` zeroes the struct, and without `ttbr0` page-table walks fail for all high VAs.
This path is roughly 50× faster than the legacy IPC copy path on large guest memories.
macOS rejects MAP_PRIVATE on shm_open fds (EINVAL), so the backing
file is created via mkstemp + unlink.
When g->shm_fd < 0, the parent serializes used memory regions in 1 MiB
chunks over the socket pair, the child calls guest_init() and receives the
data into fresh guest memory. CLOEXEC semantics follow POSIX: all FDs
(including those marked CLOEXEC) are inherited across fork. CLOEXEC takes
effect at execve (see src/syscall/exec.c step 4).
sys_clone_vm() in src/runtime/forkipc.c handles CLONE_VM without
CLONE_THREAD. Unlike sys_clone_thread(), VM-clone children are waitable
via wait4 and have exit semantics (exit_signal, vm_exit_status).
Unlike the posix_spawn fork path, they share the same guest_t*, the
same guest memory, and the same page tables.
In src/runtime/forkipc.c:
- `hv_vcpu_create()` and per-thread `SP_EL1` allocation.
- Set the child SP, `TPIDR_EL0` (TLS), and copy the parent's signal mask.
- `pthread_create()` running `vcpu_run_loop()` for the child vCPU.
- Return the child TID to the parent; the child runs with `X0 = 0`.
sys_execve in src/syscall/exec.c reloads the ELF, loads the dynamic
interpreter for dynamically-linked targets via the shared
elf_resolve_interp() helper (also used at startup), rebuilds page tables,
and restarts the vCPU. Signal handlers are reset to SIG_DFL per POSIX:
SIG_IGN stays SIG_IGN, and pending and blocked masks are preserved.
This happens in signal_reset_for_exec() after guest_reset.
Signals are fully implemented in src/syscall/signal.c. The signal frame
matches Linux arch/arm64/kernel/signal.c:setup_rt_frame() so the C
library's __restore_rt → rt_sigreturn (syscall 139) restores state
correctly. Both musl and glibc use this mechanism.
Key points:
- `signal_deliver()` builds `linux_rt_sigframe_t` on the guest stack, redirects the vCPU PC to the handler, and sets `X0 = signum`, `X30 = sa_restorer`.
- `signal_rt_sigreturn()` restores all 31 GPRs, `SP`, `PC`, and `PSTATE` from the frame.
- SIGPIPE is queued automatically when `write`, `writev`, or `pwrite64` returns `EPIPE`.
- Guest `ITIMER_REAL` is emulated internally rather than being forwarded to host `setitimer`, because macOS shares `alarm()` and `setitimer(ITIMER_REAL)` as a single timer, and `elfuse` already needs `alarm()` for its per-iteration vCPU watchdog. `signal_check_timer()` is called from the vCPU loop after each syscall.
- After `SYSCALL_EXEC_HAPPENED`, the vCPU loop verifies that `ELR_EL1` is non-zero — a defensive check against HVF register-sync bugs.
- Each `thread_entry_t` carries its own `blocked` mask. `rt_sigprocmask` operates on `current_thread->blocked`, and child threads inherit the parent's mask at clone time.
clone(CLONE_VM) produces an inferior that shares guest memory; the tracer
attaches via PTRACE_SEIZE. BRK instructions trigger ptrace-stops, after
which the tracer reads and writes registers through the snapshot protocol.
In src/syscall/proc.c:
- `PTRACE_SEIZE` — attach without stopping; sets `ptraced = 1`.
- `PTRACE_CONT` — resume the stopped tracee, optionally injecting a signal.
- `PTRACE_INTERRUPT` — force the tracee into ptrace-stop via `hv_vcpus_exit()`.
- `PTRACE_GETREGSET`/`PTRACE_SETREGSET` (`NT_PRSTATUS`) — read or write the tracee's register snapshot. Writes are applied on resume.
Cross-thread HVF register access is unreliable, so every actor uses a snapshot:
1. Tracee snapshots its own vCPU registers into `ptrace_regs` before entering ptrace-stop.
2. `ptrace_stopped = 1` and the tracer is signaled via `ptrace_cond`.
3. Tracer reads and writes the snapshot via `GETREGSET`/`SETREGSET`.
4. Tracer issues `PTRACE_CONT`; the tracee clears `ptrace_stopped` and is signaled via `resume_cond`.
5. Tracee applies any dirty register changes back to the vCPU and resumes.
1. Guest executes `BRK` → shim forwards via HVC #10.
2. If `current_thread->ptraced`, the tracee calls `thread_ptrace_stop(SIGTRAP)`.
3. Snapshot, broadcast, wait, resume as above.
4. `wait4` returns `WIFSTOPPED(status)` with `WSTOPSIG == SIGTRAP` to the tracer.
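The stop/resume handshake above can be sketched with a flag and two condition variables. Field names follow the protocol description; the surrounding structure is illustrative, not the `thread_entry_t` layout.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ptrace_cond;  /* tracee -> tracer: "I'm stopped"    */
    pthread_cond_t  resume_cond;  /* tracer -> tracee: "continue"       */
    int ptrace_stopped;
} tracee_t;

/* Tracee side: announce the stop (snapshot already taken), then block
 * until the tracer clears ptrace_stopped via PTRACE_CONT. */
static void tracee_stop_sketch(tracee_t *t) {
    pthread_mutex_lock(&t->lock);
    t->ptrace_stopped = 1;
    pthread_cond_broadcast(&t->ptrace_cond);
    while (t->ptrace_stopped)
        pthread_cond_wait(&t->resume_cond, &t->lock);
    pthread_mutex_unlock(&t->lock);
}

/* Tracer side: after editing the snapshot, release the tracee. */
static void tracer_cont_sketch(tracee_t *t) {
    pthread_mutex_lock(&t->lock);
    t->ptrace_stopped = 0;
    pthread_cond_broadcast(&t->resume_cond);
    pthread_mutex_unlock(&t->lock);
}
```

Because the flag is only read and written under the mutex, the tracer can never observe a half-written snapshot, and a `PTRACE_CONT` issued before the tracee reaches its wait is not lost (the `while` loop rechecks the flag).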
sys_wait4() first checks for ptraced or VM-clone children via
thread_ptrace_wait() before falling through to the process table for
regular fork children, so ptrace-stops and VM-exit states are reported
without racing the regular wait path.
src/runtime/procemu.c intercepts a focused set of guest-visible paths
under /proc, /dev, and a few Linux-expected compatibility files:
- Many procfs files are synthesized from host-side runtime state.
- `/proc/self/*` is backed by internal region, FD-table, and process data.
- Synthetic proc directories support common traversal patterns used by BusyBox `ps`, `uptime`, and `top`.
- Guest cwd handling preserves a virtual `/proc` working directory even though the host operates on synthetic backing directories.
Related implementation: src/runtime/procemu.c, src/syscall/path.c,
src/syscall/fs.c, src/syscall/proc_state.c.
elfuse supports dynamically linked aarch64-linux ELF binaries via
--sysroot:
```
elfuse --sysroot /path/to/sysroot ./my-dynamic-program
```

How it works:
1. `elf_load()` parses `PT_INTERP` to find the interpreter path (`/lib/ld-linux-aarch64.so.1` for glibc, `/lib/ld-musl-aarch64.so.1` for musl).
2. The interpreter is loaded as `ET_DYN` at `g->interp_base` (computed dynamically: 60 GiB for 36-bit IPA, 1020 GiB for 40-bit IPA).
3. `build_linux_stack()` passes `AT_BASE` (interpreter load address) and `AT_EXECFN` (argv[0]) in the auxiliary vector.
4. The entry point becomes `interp_entry + load_base`; the dynamic linker takes over from there.
5. `sys_openat()` redirects guest absolute paths through the sysroot: if `--sysroot` is set, it tries `<sysroot>/<path>` first.
The sysroot is inherited by fork children via IPC state transfer.
sys_execve also loads the interpreter for dynamically linked targets, so
tools that execve dynamic children (env, nice, nohup) work
correctly. elf_resolve_interp() in src/core/elf.c is shared between
src/main.c and src/syscall/exec.c.
- `timeout` fails: it uses `fork`/`clone` to create a child process with a timer, and the forked child inherits the dynamic-linker state but the fork+exec path has issues in interpreter space.
src/debug/ is split by role:
- `gdbstub.c` — session lifecycle, stop/resume flow, packet dispatch
- `gdbstub-rsp.c` — RSP packet transport and hex helpers
- `gdbstub-reg.c` — register snapshot layout, restore flow, `target.xml`
The stub runs in all-stop mode. Because Hypervisor.framework register access must happen on the owning thread, the stopped vCPU snapshots its own state; the GDB-handler thread reads and updates the snapshot, and the owning thread restores the modified state on resume — the same pattern used by ptrace.
The split mirrors the architectural boundary: transport and encoding are independent of guest execution; register layout is independent of socket I/O; stop/resume sequencing remains tightly coupled to process and thread state.
elfuse uses several layers of validation:
- `make check` — fast guest tests plus the BusyBox applet smoke suite.
- `make test-busybox` — applet coverage in isolation.
- `make test-gdbstub` — debugger integration.
- `make test-matrix` — cross-checks elfuse against QEMU on the same corpus.
The rule for contributors is simple: match the validation depth to the
subsystem you changed. Procfs, process state, dynamic linking, and debugging
typically warrant more than make check. See testing.md for
the full target list and the validation-by-change-type table.