From 026fbe079d5eb3e0d07ff812bdc82757809a0eff Mon Sep 17 00:00:00 2001
From: Simon Davies
Date: Tue, 3 Feb 2026 14:19:25 +0000
Subject: [PATCH] Adds a HIP for efficiently mapping files into sandboxes

This covers both the guest binary for the sandbox and arbitrary user mapped
files. It proposes a capability that uses minimal amounts of memory whilst
retaining low overhead, integrates with sandbox snapshot and restore, and
paves the way for future snapshot persistence.

Signed-off-by: Simon Davies
---
 .../README.md | 1192 +++++++++++++++++
 1 file changed, 1192 insertions(+)
 create mode 100644 proposals/0002-hip-memory-efficient-file-mapping/README.md

diff --git a/proposals/0002-hip-memory-efficient-file-mapping/README.md b/proposals/0002-hip-memory-efficient-file-mapping/README.md
new file mode 100644
index 000000000..634c40a9c
--- /dev/null
+++ b/proposals/0002-hip-memory-efficient-file-mapping/README.md
@@ -0,0 +1,1192 @@
# HIP 0002 - Memory-Efficient File Mapping

- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1: High-Density Serverless Deployment](#story-1-high-density-serverless-deployment)
    - [Story 2: Memory-Mapped Data Files with COW](#story-2-memory-mapped-data-files-with-cow)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Comparison with Current Implementation](#comparison-with-current-implementation)
  - [Binary Caching](#binary-caching)
  - [User Files](#user-files)
  - [Mapping Modes](#mapping-modes)
  - [Lazy Page Table Creation](#lazy-page-table-creation)
  - [Guest Page Fault Handler](#guest-page-fault-handler)
  - [Snapshot Integration](#snapshot-integration)
- [Test Plan](#test-plan)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)

## Summary

This HIP proposes a memory-efficient file mapping infrastructure for Hyperlight that enables
sharing the guest binary across multiple sandboxes while maintaining isolation through guest-side
copy-on-write (COW). The design extends Hyperlight's existing `map_file_cow` capability (which
currently only supports host-side COW via `MAP_PRIVATE` and does not integrate with snapshots)
to provide:

1. **Shared binary mapping** - The relocated binary is cached and mmap'd once, then shared across
   all sandboxes via hypervisor memory slots
2. **Unified file mapping** - Guest binaries and user files use the same mapping infrastructure,
   with configurable read-only or read-write (COW) modes
3. **Guest-side COW for user files** - User-mapped files can now be written to, with writes
   triggering guest-side COW that properly integrates with snapshot/restore
4. **Lazy page table creation** - Guest page tables are populated on-demand via page faults,
   avoiding the cost of creating PTEs for pages that are never accessed
5. **Bootstrap code refactoring** - The existing guest initialization code is refactored into
   an eagerly-mapped stub, ensuring the page fault handler is installed before any lazy
   page access occurs
6. **Snapshot integration** - The mapping infrastructure integrates with snapshot/restore,
   designed to support future snapshot persistence to disk

The design supports both Linux (KVM, MSHV) and Windows (WHP) platforms.

## Motivation

Hyperlight currently loads guest binaries inefficiently. When creating a sandbox:

1. The guest ELF is loaded and relocated
2. The entire binary is copied into a `Vec<u8>` snapshot
3. Each sandbox gets its own copy of this snapshot
4. The snapshot is used for both initialization and restore operations

This approach has several problems:

- **Memory waste**: Read-only segments (executable code, read-only data) are identical across all sandboxes,
  yet each maintains its own copy. Worse, each sandbox also keeps a `Vec<u8>` snapshot of its
  memory, effectively doubling the overhead. For example, with 1000 sandboxes on a host:
  - `hyperlight-js` (~1.7MB) × 1000 × 2 = **3.4GB** for identical binaries (sandbox + snapshot)
  - `wasm_runtime` (~1.25MB) × 1000 × 2 = **2.5GB** for identical binaries (sandbox + snapshot)
- **Inefficient creation**: Copying the binary for each sandbox wastes CPU cycles. This is
  particularly noticeable in debug builds where binaries are much larger (~43MB for `wasm_runtime`,
  ~20MB for `hyperlight-js`).
- **Density limits**: Memory scales linearly with sandbox count, capping how many sandboxes
  can run on a single host.
- **Inefficient snapshots**: Every snapshot contains a full `Vec<u8>` copy of the sandbox's
  memory image, duplicating the binary yet again. This wastes memory and would make future
  snapshot persistence larger than necessary.

### Goals

1. **Share read-only content**: Map the binary once, share across sandboxes via page tables
2. **Maintain isolation**: Each sandbox has isolated writable state via guest-side COW
3. **Unify file handling**: Guest binary and user files use the same infrastructure
4. **Lazy PTE creation**: Only create page table entries for pages actually accessed
5. **Snapshot integration**: Track mapped files and dirty pages for correct restore
6. **Future-proof for persistence**: Design supports eventual snapshot serialization to disk
7. **Cross-platform**: Support Linux (KVM, MSHV) and Windows (WHP)

### Non-Goals

1. **Snapshot persistence to disk**: This HIP focuses on in-process snapshot/restore, though
   the design prepares for future disk persistence (content hashing, serializable metadata).
2. **Cross-sandbox restore**: Snapshots restore to the same sandbox instance that created them.
   Cross-sandbox restore is a future enhancement.
3. **OCI image format**: Packaging and distribution are out of scope for this HIP.
4. **Live migration**: Moving running sandboxes between hosts is out of scope.

## Proposal

We propose extending Hyperlight's memory management to support shared, memory-mapped files with
guest-side copy-on-write. The guest binary becomes a special case of a mapped file - it's cached
after relocation and shared across all sandboxes using it.

Hypervisor memory slots (KVM) or regions (MSHV/WHP) can map the same host memory into multiple
guest address spaces. Combined with guest-side page fault handling, this enables efficient
sharing with isolation.

### User Stories

#### Story 1: High-Density Serverless Deployment

As a serverless platform operator, I want to run hundreds of sandboxes executing the same
WebAssembly runtime binary so that I can maximize concurrent function invocations per host
while minimizing memory overhead.

**Current behavior**: Each sandbox copies the full binary, plus keeps a snapshot copy.
1000 sandboxes × `wasm_runtime` (1.25MB) × 2 = 2.5GB for identical binaries.

**Proposed behavior**: One shared 1.25MB mapping + bootstrap/page tables (on the order of
64KB per sandbox) + per-sandbox dirty pages. Memory scales with actual writes, not sandbox count.
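To make the story's arithmetic concrete, the sketch below models binary-related memory only;
the 64KB bootstrap figure and the dirty-page count are assumptions for illustration, not
measurements.

```rust
/// Back-of-envelope model of binary-related memory for N sandboxes.
/// current:  every sandbox carries its own copy of the binary plus a snapshot copy of it.
/// proposed: one shared mapping, plus a small per-sandbox bootstrap/page-table cost and
///           whatever pages each sandbox actually dirties.
fn binary_overhead(n: u64, binary: u64, bootstrap: u64, dirty: u64) -> (u64, u64) {
    let current = n * 2 * binary;
    let proposed = binary + n * (bootstrap + dirty);
    (current, proposed)
}

// 1000 sandboxes × wasm_runtime (1.25 MB), assuming ~64 KB of bootstrap/page tables and
// ~128 KB of dirty pages per sandbox:
//   current  ≈ 1000 × 2 × 1.25 MB             ≈ 2.5 GB
//   proposed ≈ 1.25 MB + 1000 × (64 + 128) KB ≈ 0.2 GB
```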
+ +#### Story 2: Memory-Mapped Data Files with COW + +As an application developer, I want to map a configuration or data file into my sandbox +and be able to modify it, with changes isolated per-sandbox and properly preserved across +snapshot/restore cycles. + +**Current behavior**: `map_file_cow` maps files as read-only (no WRITE permission). The file +content is copied into snapshots, negating the memory-sharing benefit (see `guest_page()` and +`snapshot_memory.extend(contents)` in `snapshot.rs`). + +**Proposed behavior**: Unified file mapping with guest-side COW that allows writes while +maintaining sharing. Unmodified pages remain shared; only written pages are copied into snapshots. +Files are locked (shared/read lock) while mapped to prevent external modification or deletion. +Since snapshots are in-process only, the mmap'd file remains valid for the snapshot's lifetime. +Future disk persistence would copy user files into the cache at persistence time - these cached +copies would be shared across all persisted snapshots referencing the same content (validated via +content hash). + +### Risks and Mitigations + +**Risk**: Guest page fault handler bugs could cause security vulnerabilities or crashes. + +**Mitigation**: The handler is minimal and well-defined. It validates addresses against a +region table built during initialization. Invalid accesses terminate the sandbox, not the host. +Testing validates correctness. + +**Risk**: Lazy PTE creation adds latency to first access of each page. + +**Mitigation**: The page fault cost is amortized across the page lifetime. Importantly, PTEs +are preserved across snapshot/restore cycles - once a page is faulted in, it remains mapped +after restore. If a sandbox reaches steady state (all needed pages accessed) before taking a +snapshot, subsequent restores incur no page fault overhead. For most workloads, the memory +savings outweigh the fault overhead. Benchmarking will validate this assumption. If needed, +we could add an option for eager PTE creation for latency-sensitive workloads. + +**Risk**: Cache files deleted or corrupted. + +**Mitigation**: On Linux, mmap'd files remain accessible even if deleted from the filesystem +(data persists until unmapped). On Windows, open file handles prevent deletion. Files are +locked while mapped to prevent external modification. Cache integrity is operator responsibility. + +**Risk**: Cache management burden on users. + +**Mitigation**: Cache location is well-documented. Cache entries are keyed by content hash, +so stale entries are naturally replaced. A cache cleanup utility can be provided in the future. + +**Risk**: Extending guest-side page fault handler. + +**Mitigation**: The existing page fault handler already implements COW for stack expansion +(PR #1205) and snapshot region writes. File-backed COW uses the same pattern: fault → +allocate private page → copy → update PTE → resume. The extension adds region table lookup +which is O(n) with small n (typically <10 regions). 
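Two of the mitigations above rely on holding mapped files open and shared-locked for the
lifetime of the mapping. A minimal host-side sketch of that idea follows, using the `fs2`
crate for advisory locks and `memmap2` for the mapping; the type and field names are
illustrative, not part of the proposed API:

```rust
use std::fs::File;
use std::path::Path;

use fs2::FileExt; // advisory shared/exclusive locks on std::fs::File
use memmap2::Mmap;

/// Holds a mapped file open and shared-locked for as long as the mapping exists.
pub struct LockedMapping {
    file: File, // keeping the handle open also prevents deletion on Windows
    pub mmap: Mmap,
}

impl LockedMapping {
    pub fn open(path: &Path) -> std::io::Result<Self> {
        let file = File::open(path)?;
        // Shared (read) lock: other sandboxes may map the same file concurrently,
        // while cooperating writers that take an exclusive lock are held off.
        file.lock_shared()?;
        let mmap = unsafe { Mmap::map(&file)? };
        Ok(Self { file, mmap })
    }
}

impl Drop for LockedMapping {
    fn drop(&mut self) {
        // Released when the sandbox is dropped or the file is unmapped.
        let _ = self.file.unlock();
    }
}
```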
+ +## Design Details + +### Comparison with Current Implementation + +**Current approach:** + +``` +┌───────────────┐ ┌───────────────┐ ┌───────────────┐ +│ Sandbox 1 │ │ Sandbox 2 │ │ Sandbox 3 │ +├───────────────┤ ├───────────────┤ ├───────────────┤ +│ Binary Copy │ │ Binary Copy │ │ Binary Copy │ +│ (1.25 MB) │ │ (1.25 MB) │ │ (1.25 MB) │ +├───────────────┤ ├───────────────┤ ├───────────────┤ +│ Snapshot │ │ Snapshot │ │ Snapshot │ +│ (1.25 MB) │ │ (1.25 MB) │ │ (1.25 MB) │ +├───────────────┤ ├───────────────┤ ├───────────────┤ +│ Heap/Stack/etc│ │ Heap/Stack/etc│ │ Heap/Stack/etc│ +└───────────────┘ └───────────────┘ └───────────────┘ + +Total memory: N × (2 × binary_size) + N × heap/stack/etc size +``` + +**Proposed approach:** + +``` + ┌─────────────────────────────────┐ + │ Shared Binary (mmap'd) │ + │ (1.25 MB) │ + └───────────┬─────────────────────┘ + │ + ┌───────────────────────┼───────────────────────┐ + │ │ │ + ▼ ▼ ▼ +┌───────────────┐ ┌───────────────┐ ┌───────────────┐ +│ Sandbox 1 │ │ Sandbox 2 │ │ Sandbox 3 │ +├───────────────┤ ├───────────────┤ ├───────────────┤ +│ Bootstrap + │ │ Bootstrap + │ │ Bootstrap + │ +│ Page Tables │ │ Page Tables │ │ Page Tables │ +│ (~64 KB) │ │ (~64 KB) │ │ (~64 KB) │ +├───────────────┤ ├───────────────┤ ├───────────────┤ +│ Dirty Pages │ │ Dirty Pages │ │ Dirty Pages │ +├───────────────┤ ├───────────────┤ ├───────────────┤ +│ Heap/Stack/etc│ │ Heap/Stack/etc│ │ Heap/Stack/etc│ +└───────────────┘ └───────────────┘ └───────────────┘ + +Total memory: 1 × binary_size + N × (bootstrap + dirty_pages + heap/stack/etc size) +``` + +| Aspect | Current | Proposed | +|--------|---------|----------| +| Memory for binary (N sandboxes) | N × 2B (sandbox + snapshot) | B + (N × D) | +| Sandbox creation time | O(B) copy | O(1) slot setup | +| Page table creation | Eager (all PTEs) | Lazy (on-demand) | +| Snapshot size | Full memory image | File refs + dirty pages | +| Binary isolation | Private copy per sandbox | Shared + COW on write | + +Where B = binary size, D = dirty pages per sandbox. + +**Bootstrap code**: The "Bootstrap" shown in the diagram is a small stub injected by +the host at sandbox creation time (not linked into the guest binary). This is a +refactoring of the existing guest initialization code in `hyperlight_guest`. The +bootstrap performs minimal initialization (set stack pointer, install IDT with page +fault handler) then calls `hyperlight_main` in the guest binary. The bootstrap must +be eagerly mapped by the host because it executes before the guest can handle page +faults. It's mapped at a fixed GPA known to both host and guest. + +### Binary Caching + +Relocated binaries are cached on disk so they can be mmap'd and shared across sandboxes. +The file must exist on disk (not just in memory) because hypervisor memory slots require +a host virtual address backed by a file mapping: + +``` +~/.cache/hyperlight/binaries/ +├── -.bin # Relocated binary (still valid ELF) +└── ... +``` + +The cache key combines the source binary hash and load address, since relocation is +address-dependent. + +**Note on future ASLR**: This design does not address base address randomization. +Per-sandbox ASLR would negate the memory sharing benefits. If ASLR is required in the +future, options such as a pool of pre-relocated variants or guest-side relocation should +be considered. + +**Future extension for snapshot persistence**: When snapshot persistence to disk is +implemented, this cache will be extended to store user data files as well. 
At persistence
time, user-mapped files would be copied into the cache (keyed by content hash), allowing
persisted snapshots to reference stable, immutable file content rather than paths or content that
may change. Multiple snapshots referencing the same file content would share the cached
copy.

```rust
pub struct BinaryCache {
    cache_dir: PathBuf,
    /// In-memory cache of path → content hash, so we only hash and relocate each file
    /// once per process. Benefits:
    /// - Avoids re-hashing and re-relocating the same binary
    /// - Allows the original binary to be deleted after first access
    /// - Subsequent sandbox creations use the cached relocated copy
    ///
    /// Trade-off: If the binary changes on disk after first access, the process will
    /// continue using the original version. This is typically desirable for consistency
    /// within a process lifetime.
    path_to_hash: Mutex<HashMap<PathBuf, ContentHash>>,
}

impl BinaryCache {
    /// Get or create a cached relocated binary.
    /// Returns an mmap'd read-only view of the binary.
    ///
    /// The cache key is `<content_hash>-<load_address>.bin`. The content hash is
    /// computed from the source binary on first access and cached in memory,
    /// so subsequent calls don't need to read or hash the original file.
    pub fn get_or_create(
        &self,
        source_path: &Path,
        load_address: u64,
    ) -> Result<(MappedBinary, ContentHash)> {
        // Check in-memory path→hash cache first
        let source_hash = {
            let cache = self.path_to_hash.lock().unwrap();
            cache.get(source_path).cloned()
        };

        let source_hash = match source_hash {
            Some(hash) => hash,
            None => {
                // First time seeing this path - hash the file and cache it
                let hash = hash_file(source_path)?;
                let mut cache = self.path_to_hash.lock().unwrap();
                cache.insert(source_path.to_owned(), hash.clone());
                hash
            }
        };

        // Try to open existing cached binary
        if let Ok(mapped) = self.open_by_hash(&source_hash, load_address) {
            return Ok((mapped, source_hash));
        }

        // Cache miss - relocate and cache (requires original file to still exist)
        let relocated = relocate_binary(source_path, load_address)?;
        let cache_path = self.cache_path(&source_hash, load_address);
        std::fs::write(&cache_path, &relocated)?;

        Ok((self.open_by_hash(&source_hash, load_address)?, source_hash))
    }

    /// Open a previously cached binary by its content hash.
    /// Used when restoring from a snapshot that references the binary by hash,
    /// or when creating additional sandboxes after the original file was deleted.
    pub fn open_by_hash(
        &self,
        content_hash: &ContentHash,
        load_address: u64,
    ) -> Result<MappedBinary> {
        let cache_path = self.cache_path(content_hash, load_address);
        MappedBinary::open(&cache_path)
    }

    fn cache_path(&self, content_hash: &ContentHash, load_address: u64) -> PathBuf {
        let cache_key = format!("{}-{:x}.bin", content_hash, load_address);
        self.cache_dir.join(&cache_key)
    }
}

pub struct MappedBinary {
    mmap: memmap2::Mmap,
}
```

**Trade-offs:**

- (+) Avoids repeated relocation work across sandbox creations
- (+) Enables sharing via mmap (kernel page cache deduplicates across processes)
- (+) Original binary can be deleted after first sandbox creation
- (+) Binary changes on disk don't affect running process (consistent behavior)
- (-) Disk space usage for cached binaries
- (-) Stale cache entry cleanup is user's responsibility
- (-) Binary updates require process restart to take effect

**Alternatives considered:**

1. 
**Content-addressable storage (CAS)**: Stores deduplicated chunks/sections, allowing + sharing across different binaries with common content. Rejected because Hyperlight + guests are monolithic images unlikely to share content across different binaries - + simple hash-based cache is sufficient. + +### User Files + +Unlike guest binaries, **user-provided files are not cached** - they are mmap'd directly +from the path specified by the user via `map_file()`. This is because: + +1. User files don't require relocation (no address-dependent processing) +2. User files may be large data files that shouldn't be copied to cache - but will need to be in future to support snapshot persistence +3. The same file may be mapped by multiple sandboxes (shared via kernel page cache) + +**File locking**: Files are locked (shared/read lock) while mapped to prevent external +modification. Multiple processes can map the same file simultaneously since all mappings +are read-only at the host level (writes go through COW). The lock is released when the +sandbox is dropped or the file is unmapped. + +**Snapshot considerations**: For in-process snapshot/restore, user files remain mmap'd - +the file handles are still valid. On restore, the content hash is validated to detect +if the file was modified (which would indicate a bug, since files should be locked). +Future disk persistence will need to cache user files with the +snapshot - this is out of scope for this HIP. + +### Mapping Modes + +Three mapping modes are supported, with the guest binary having special handling: + +```rust +pub enum FileMappingMode { + /// Read-only mapping. Writes from guest cause fault and sandbox termination. + /// Used for read-only data files. + ReadOnly, + + /// Read-write with copy-on-write. First write to each page creates private copy. + /// Entire mapping has uniform RW permissions. Used for user data files. + ReadWriteCow, + + /// Guest binary with per-segment permissions and COW for writable segments. + /// Segment metadata is read directly from the cached ELF file's PT_LOAD headers. + /// Typical layout: RX segment (code) → R-X, RW segment (data) → RW- with COW. + GuestBinary, +} + +/// Serializable metadata about a file mapping, stored in snapshots. +/// For in-process restore, the sandbox still has the mmap handles open, so +/// we only need the hash (for validation) and GPA range (for `is_file_backed_gpa`). +/// +/// For future disk persistence, this struct would need to include either: +/// - The source path (to re-open the file), or +/// - A cache key (if user files are copied into the cache at persistence time) +pub struct FileMappingInfo { + /// Content hash for validation on restore (detects unexpected file changes) + pub content_hash: Hash, + + /// Guest address where file is mapped (GVA == GPA in Hyperlight's identity-mapped layout) + pub guest_address: u64, + + /// Size of the mapping + pub size: u64, + + /// Mapping mode + pub mode: FileMappingMode, +} +``` + +For user file mappings at runtime, we also need to track the source path (for error messages +and potential future disk persistence): + +```rust +/// Runtime state for a user file mapping. The `MappedFile` (holding the mmap handle) +/// is stored in `SandboxMemoryManager`. This struct tracks the metadata needed for +/// the region table and snapshots. 
pub struct UserFileMapping {
    /// Metadata stored in snapshots
    pub info: FileMappingInfo,

    /// Source path - kept for error messages and future disk persistence
    pub source_path: PathBuf,

    /// The mmap'd file (stored here or in SandboxMemoryManager)
    pub mapped_file: Arc<MappedFile>,
}
```

The guest binary loading changes in `Snapshot::from_env()` to use the cache:

```rust
// In Snapshot::from_env() - simplified
pub(crate) fn from_env<'a, 'b>(
    env: impl Into<GuestEnvironment<'a, 'b>>,
    cfg: SandboxConfiguration,
    cache: &BinaryCache,
) -> Result<Self> {
    let env = env.into();
    let bin = env.guest_binary.canonicalize()?;

    // Get cached, relocated, mmap'd binary (layout is derived from cfg - elided here)
    let load_addr = layout.get_guest_code_address() as u64;
    let (mapped_binary, content_hash) = cache.get_or_create(&bin, load_addr)?;

    // Record file mapping info (for snapshot metadata, not copied into memory)
    let file_mapping = FileMappingInfo {
        content_hash,
        guest_address: load_addr,
        size: mapped_binary.size() as u64,
        mode: FileMappingMode::GuestBinary,
    };

    // Memory buffer now contains only heap/stack/bootstrap - NOT the binary
    let memory = vec![0; layout.get_non_binary_memory_size()?];

    // ... set up page tables, etc.

    Ok(Self {
        memory,
        file_mappings: vec![file_mapping], // Metadata only, for restore validation
        // ...
    })
}
```

The actual `MappedBinary` (holding the mmap file handle) is stored in `SandboxMemoryManager`,
not in the `Snapshot`. The snapshot only contains serializable metadata.

User files (not the guest binary) can be mapped via `map_file()`, which replaces the
existing `map_file_cow()` API. The new API adds:

- Explicit mapping modes (`ReadOnly` vs `ReadWriteCow`)
- Snapshot integration (file mappings are tracked and validated on restore)

```rust
// User file mapping (read-only) - entire file is R--
sandbox.map_file(&data_path, guest_addr, FileMappingMode::ReadOnly)?;

// User file mapping (read-write with COW) - entire file is RW- with COW
sandbox.map_file(&config_path, guest_addr, FileMappingMode::ReadWriteCow)?;
```

**Trade-offs:**

- (+) Unified infrastructure for binary and user files
- (+) `GuestBinary` mode captures per-segment permissions cleanly
- (+) `ReadOnly` and `ReadWriteCow` are simple for user files (no segment complexity)
- (-) No "ephemeral" mode (not tracked in snapshot) - can be added later if needed

### Lazy Page Table Creation

Page tables are created on-demand via page faults, not eagerly during sandbox initialization.
This avoids the cost of creating PTEs for pages that are never accessed.

**Initialization sequence:**

```
Host                                     Guest
────                                     ─────
1. mmap binary (PROT_READ)
2. Create hypervisor memory slot
   mapping host pages to guest GPA
3. Set up minimal bootstrap page tables
   (only bootstrap code is mapped)
4. VM Enter at bootstrap entry point
                                         5. Bootstrap initializes (stack, IDT)
                                         6. Bootstrap installs page fault handler
                                         7. Bootstrap jumps to guest entry point
                                         8. Guest accesses unmapped page → PAGE FAULT
                                         9. Handler creates PTE, resumes execution
                                         10. Guest continues...
```

This approach means the guest binary and user files are mapped into hypervisor memory
slots (making them accessible at their GPAs), but no PTEs exist initially. When the
guest accesses a page, the fault handler creates a PTE pointing to the shared file
mapping (read-only) or triggers COW for writable pages.
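For concreteness, this is roughly what registering the shared mapping as a hypervisor memory
slot could look like on KVM, using the `kvm-ioctls` and `memmap2` crates. The helper name,
slot number, and GPA are illustrative only; MSHV and WHP have analogous region-mapping calls.

```rust
use kvm_bindings::{kvm_userspace_memory_region, KVM_MEM_READONLY};
use kvm_ioctls::VmFd;
use memmap2::Mmap;

/// Register an mmap'd, read-only file as a guest-physical memory slot. Every sandbox
/// that maps the same cached file points its slot at the same host pages, so the kernel
/// page cache backs all of them with a single copy.
fn map_shared_file_slot(
    vm: &VmFd,
    slot: u32,
    guest_phys_addr: u64,
    mmap: &Mmap,
) -> Result<(), Box<dyn std::error::Error>> {
    let region = kvm_userspace_memory_region {
        slot,
        guest_phys_addr,
        memory_size: mmap.len() as u64,
        userspace_addr: mmap.as_ptr() as u64,
        flags: KVM_MEM_READONLY, // guest writes exit to the host instead of mutating the file
    };
    // Safety: the mmap must outlive the memory slot; the sandbox owns both.
    unsafe { vm.set_user_memory_region(region)? };
    Ok(())
}
```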
+ +**Bootstrap requirements:** + +The bootstrap code solves a chicken-and-egg problem: lazy PTE creation requires a page +fault handler, but the handler itself needs to be mapped before it can run. This is a +refactoring of the existing guest initialization code in `hyperlight_guest` - the same +initialization that currently happens is restructured so that: + +1. A small stub is eagerly mapped by the host (PTEs pre-created) +2. The stub executes immediately on VM entry and installs the page fault handler +3. Once installed, all subsequent page accesses (including the guest binary) are handled lazily + +The bootstrap stub contains: + +- Minimal startup (set stack pointer) +- IDT setup (install page fault handler) +- Call to `hyperlight_main` in the guest binary + +The stub is injected by the host at sandbox creation time, not linked into the guest binary. +It's mapped at a fixed GPA known to both host and guest. The host creates initial page tables +that map only the bootstrap region, scratch region, and page tables themselves - using the +existing `GuestPageTableBuffer` infrastructure. + +```rust +// Host-side: create initial page tables with minimal mappings +fn setup_initial_page_tables( + pt_buf: &GuestPageTableBuffer, + scratch_size: usize, +) { + // Map bootstrap region (small, eagerly mapped) + let bootstrap_mapping = Mapping { + phys_base: BOOTSTRAP_GPA, + virt_base: BOOTSTRAP_GVA, + len: BOOTSTRAP_SIZE as u64, + kind: MappingKind::BasicMapping(BasicMapping { + readable: true, + writable: false, + executable: true, + }), + }; + unsafe { vmem::map(pt_buf, bootstrap_mapping) }; + + // Map scratch region (for COW private pages) + let scratch_mapping = Mapping { + phys_base: scratch_base_gpa(scratch_size), + virt_base: scratch_base_gva(scratch_size), + len: scratch_size as u64, + kind: MappingKind::BasicMapping(BasicMapping { + readable: true, + writable: true, + executable: true, + }), + }; + unsafe { vmem::map(pt_buf, scratch_mapping) }; + + // Map page tables themselves (so guest can update them) + // ... (grows as more PTEs are added) +} +``` + +**Trade-offs:** + +- (+) No wasted PTEs for unaccessed pages +- (+) Faster sandbox creation (no PTE setup time) +- (+) Memory savings (fewer page table pages) +- (-) First access to each page incurs fault overhead +- (-) Requires bootstrap code and page fault handler in guest + +**Alternatives considered:** + +1. **Eager PTE creation by host**: Simpler but wasteful. The host would parse ELF segments + and create all PTEs before guest entry. Rejected because most pages may never be accessed + in short-lived sandboxes. + +2. **Eager code PTEs, lazy data PTEs**: Compromise that might reduce fault overhead for + hot code paths. Could be added as optimization if benchmarks show it's needed. + +### Guest Page Fault Handler + +Hyperlight already has a guest-side exception handler (`hl_exception_handler` in +`hyperlight_guest_bin`) that handles page faults for: + +1. **Stack expansion**: Faults in the stack GVA range allocate new pages on demand + (implemented in PR #1205) +2. **Snapshot region COW**: Faults in the snapshot data region (addresses below the + snapshot page table base) trigger copy-on-write - the snapshot is mapped read-only, + so writes cause faults that allocate private pages in the scratch region + +This HIP extends the existing handler to also handle file-backed pages: + +3. 
**File mapping access**: Faults in file-mapped regions create PTEs pointing to the + shared file mapping, with permissions based on the region table + +The extended handler logic (additions shown): + +```rust +// In hl_exception_handler - extended for file mappings +pub extern "C" fn hl_exception_handler( + stack_pointer: u64, + exception_number: u64, + page_fault_address: u64, +) { + // Existing setup: extract error_code from exception info on stack + let exn_info = (stack_pointer + size_of::() as u64) as *mut ExceptionInfo; + let error_code = unsafe { (&raw const (*exn_info).error_code).read_volatile() }; + + // Existing: Handle stack expansion + if exception_number == 14 && + page_fault_address >= MAIN_STACK_LIMIT_GVA && + page_fault_address <= MAIN_STACK_TOP_GVA { + // ... existing stack expansion logic ... + return; + } + + // Existing: Handle snapshot region COW + if exception_number == 14 && + page_fault_address <= snapshot_pt_gpa_base_gva() { + // ... existing COW logic (alloc page, copy, remap) ... + return; + } + + // NEW: Handle file-backed regions (guest binary and user files) + if exception_number == 14 { + if let Some(region) = REGION_TABLE.lookup(page_fault_address) { + let page_addr = page_fault_address & !0xfff; + let is_write = (error_code & 0x2) != 0; // CAUSED_BY_WRITE + let is_present = (error_code & 0x1) != 0; // PROTECTION_VIOLATION + + let is_cow_region = matches!( + region.region_type, + RegionType::ElfRW | RegionType::UserFileRW + ); + + if !is_present { + // First access - create PTE pointing to shared file mapping + // For COW regions, map read-only initially (COW triggers on write) + let file_page_gpa = region.file_gpa_base + + region.file_offset + + (page_addr - region.start); + paging::map_page( + file_page_gpa, + page_addr as *mut u8, + /*writable=*/ false, // Always map read-only initially + /*executable=*/ matches!(region.region_type, RegionType::ElfRX), + ); + return; + } + + if is_write && is_present && is_cow_region { + // Write to read-only mapped COW page - make private copy + let new_page = alloc_phys_pages(1); + ptr::copy( + page_addr as *const u8, + ptov(new_page).unwrap(), + PAGE_SIZE, + ); + paging::map_page(new_page, page_addr as *mut u8, /*writable=*/ true, /*executable=*/ false); + asm!("invlpg [{}]", in(reg) page_addr); + return; + } + + if is_write && !is_cow_region { + // Write to genuinely read-only region (ElfRX, ElfRO, UserFileRO) + abort!(ErrorCode::GuestError, "Write to read-only region"); + } + } + } + + // ... existing fallthrough to abort ... +} +``` + +**Region table:** + +Built by the **host** during sandbox initialization and placed in the eagerly-mapped bootstrap +region at a fixed address (e.g., `REGION_TABLE_GVA = BOOTSTRAP_GVA + BOOTSTRAP_CODE_SIZE`). +This ensures the table is accessible before the page fault handler runs. +The guest bootstrap code simply reads this pre-built table - no ELF parsing required in the guest. + +For `GuestBinary` mappings, the host parses PT_LOAD segments from the cached ELF and creates +a region entry for each (typically one RX region for code, one RW region for data). +For `ReadOnly` and `ReadWriteCow` user file mappings, a single entry covers the whole file. 
+ +```rust +// Host-side: build region table during sandbox initialization +// This is placed in guest memory at a known address (e.g., part of bootstrap data) + +/// Type of memory region - used by page fault handler to determine behavior +#[repr(u32)] // Stable layout for host-guest ABI +enum RegionType { + // ELF loadable segments (from PT_LOAD headers) + ElfRX = 0, // Executable code segment (read-execute) + ElfRO = 1, // Read-only data segment (read-only, no execute) + ElfRW = 2, // Writable data segment (read-write, COW) + + // User-mapped files + UserFileRO = 3, // Read-only user file + UserFileRW = 4, // Read-write user file (COW) +} + +#[repr(C)] // Stable layout for host-guest ABI +struct RegionEntry { + start: u64, + end: u64, + region_type: RegionType, + // GPA of the backing file data (for calculating page GPA from GVA) + file_gpa_base: u64, + file_offset: u64, // Offset within file where this region starts +} + +#[repr(C)] +struct RegionTable { + count: u32, + entries: [RegionEntry; MAX_REGIONS], +} + +// Host builds this table: +fn build_region_table( + mapped_binary: &MappedBinary, + user_file_mappings: &[UserFileMapping], +) -> Result { + let mut table = RegionTable::default(); + + // Parse PT_LOAD headers from the cached ELF (host-side only) + let elf = elf::ElfBytes::::minimal_parse( + mapped_binary.as_slice() + )?; + + let segments = elf.segments() + .ok_or_else(|| Error::InvalidElf("no program headers"))?; + + for phdr in segments.iter().filter(|p| p.p_type == elf::abi::PT_LOAD) { + let is_write = (phdr.p_flags & 0x2) != 0; // PF_W + let is_exec = (phdr.p_flags & 0x1) != 0; // PF_X + + let region_type = match (is_write, is_exec) { + (true, _) => RegionType::ElfRW, // Writable (COW) + (false, true) => RegionType::ElfRX, // Executable code + (false, false) => RegionType::ElfRO, // Read-only data + }; + + table.push(RegionEntry { + start: phdr.p_vaddr, + end: phdr.p_vaddr + phdr.p_memsz, + region_type, + file_gpa_base: mapped_binary.gpa_base(), + file_offset: phdr.p_offset, + })?; + } + + // Add user file mappings + for mapping in user_file_mappings { + let region_type = match mapping.info.mode { + FileMappingMode::ReadOnly => RegionType::UserFileRO, + FileMappingMode::ReadWriteCow => RegionType::UserFileRW, + FileMappingMode::GuestBinary => { + return Err(Error::InvalidArgument("GuestBinary not expected here")); + } + }; + + table.push(RegionEntry { + start: mapping.info.guest_address, + end: mapping.info.guest_address + mapping.info.size, + region_type, + file_gpa_base: mapping.mapped_file.gpa_base(), + file_offset: 0, + })?; + } + + Ok(table) +} +``` + +The guest-side page fault handler simply reads from the region table at the known address: + +```rust +// Guest-side: read pre-built region table (no ELF parsing needed) +impl RegionTable { + fn lookup(&self, addr: u64) -> Option<&RegionEntry> { + self.entries[..self.count as usize] + .iter() + .find(|e| addr >= e.start && addr < e.end) + } +} + +// In page fault handler: +static REGION_TABLE: &RegionTable = unsafe { &*(REGION_TABLE_GVA as *const RegionTable) }; +``` + +**Trade-offs:** + +- (+) Region table lookup is fast (small number of regions) +- (+) Handles both PTE creation and COW in one handler +- (-) Every first access to a page incurs fault overhead + +**COW behavior by region type:** + +- `ElfRX`, `ElfRO`, `UserFileRO`: Read-only, writes abort the sandbox +- `ElfRW`, `UserFileRW`: COW - first write triggers private copy + +``` +Initial state (page not yet accessed): +┌─────────────────┐ +│ Guest PT Entry │ → 
Not present
└─────────────────┘

After first read:
┌─────────────────┐      ┌─────────────────┐
│ Guest PT Entry  │ ──►  │ Shared Page     │
│ (Read-only)     │      │ (in mmap'd file)│
└─────────────────┘      └─────────────────┘

After first write (COW triggered):
┌─────────────────┐      ┌─────────────────┐
│ Guest PT Entry  │ ──►  │ Private Page    │
│ (Read-Write)    │      │ (in scratch rgn)│
└─────────────────┘      └─────────────────┘
                         (copy of original)
```

**Dirty page identification:**

Hyperlight already uses GPA-range-based dirty detection: pages pointing to the scratch region
are dirty (private COW'd pages), while pages pointing to the snapshot region are clean.
The existing `filtered_mappings` function in `snapshot.rs` walks page tables and excludes
scratch region pages. This HIP extends this mechanism to also exclude file-backed pages
(which point to the shared mmap'd file GPA range).

### Snapshot Integration

**Current snapshot mechanism (already implemented):**

Hyperlight already has a page-table-walking snapshot mechanism in `Snapshot::new()`:

1. `filtered_mappings()` walks page tables using `virt_to_phys` to enumerate all mapped pages
2. Pages in scratch region or snapshot page table region are excluded
3. All other mapped pages are copied into the snapshot `Vec<u8>`
4. New page tables are built pointing to the compacted snapshot memory
5. On restore, snapshot memory is copied back and page tables rebuilt

```rust
// Current Snapshot struct (after PR #1205)
pub struct Snapshot {
    sandbox_id: u64,
    layout: SandboxMemoryLayout,
    memory: Vec<u8>,             // All live pages, compacted
    regions: Vec<MemoryRegion>,  // Extra host->guest mappings (replaced by file_mappings)
    load_info: LoadInfo,
    hash: [u8; 32],
    root_pt_gpa: u64,
    stack_top_gva: u64,          // Added in PR #1205
}
```

**Changes required for file mapping:**

The existing snapshot mechanism needs these modifications:

1. **Skip content copy for clean file-backed pages**: When walking page tables at snapshot time,
   check if GPA points to a file-backed region. If so, skip copying content (it's shared).
   If GPA points to scratch (COW'd page), copy the content as usual.

2. **Replace `regions` with `file_mappings`**: The existing `regions: Vec<MemoryRegion>`
   field is replaced by `file_mappings` which tracks which files are mapped and their GPA ranges.
   This is used to identify file-backed GPAs during snapshot/restore.

The existing PTE rebuild logic on restore remains unchanged - it already walks the snapshot's
page tables and recreates all PTEs. The GPAs stored in the snapshot are already correct:
file-backed pages point to file GPAs, dirty pages point to snapshot memory GPAs.

```rust
// Proposed changes to Snapshot struct
pub struct Snapshot {
    sandbox_id: u64,
    layout: SandboxMemoryLayout,
    memory: Vec<u8>,             // Dirty pages + page tables (no clean file content)
    load_info: LoadInfo,
    hash: [u8; 32],
    root_pt_gpa: u64,
    stack_top_gva: u64,          // Added in PR #1205

    // REMOVED: regions: Vec<MemoryRegion>

    // ADDED: file_mappings tracks which files are mapped (GPA ranges + validation hash)
    file_mappings: Vec<FileMappingInfo>,
}
```

**Removed functions/logic:**

```rust
// REMOVED: guest_page() no longer checks regions
// Before:
unsafe fn guest_page(..., regions: &[MemoryRegion], ...) 
-> Option<&[u8]> { + // Check if GPA falls within a MemoryRegion (host memory) + for rgn in regions { + if gpa >= rgn.guest_region.start && gpa + PAGE_SIZE <= rgn.guest_region.end { + // Return slice from host memory + return Some(std::slice::from_raw_parts(...)); + } + } + // Fall back to snapshot/scratch memory + ... +} + +// After: regions parameter removed, file-backed GPAs are excluded earlier +unsafe fn guest_page(snap: ..., scratch: ..., scratch_size: usize, gpa: u64) -> Option<&[u8]> { + // Only check snapshot and scratch memory + let (mem, off) = access_gpa(snap, scratch, scratch_size, gpa)?; + ... +} + +// REMOVED: hash() no longer includes regions +// Before: +fn hash(memory: &[u8], regions: &[MemoryRegion]) -> Result<[u8; 32]> { + let mut hasher = blake3::Hasher::new(); + hasher.update(memory); + for rgn in regions { ... } // Include region metadata + Ok(hasher.finalize().into()) +} + +// After: regions parameter removed, file_mappings included instead +fn hash(memory: &[u8], file_mappings: &[FileMappingInfo]) -> Result<[u8; 32]> { + let mut hasher = blake3::Hasher::new(); + hasher.update(memory); + for fm in file_mappings { + hasher.update(&fm.content_hash); // Include file content hash + hasher.update(&fm.guest_address.to_le_bytes()); + } + Ok(hasher.finalize().into()) +} +``` + +**Modified `filtered_mappings` logic:** + +Pages are categorized by where their GPA points: + +1. **File-backed GPA (clean)**: GPA points to the mmap'd file region → skip content copy +2. **Scratch GPA (dirty file page)**: GPA points to scratch (COW'd file page) → copy content +3. **Non-file pages**: heap, stack, etc. → copy content (existing behavior) + +```rust +// In filtered_mappings - skip content copy for clean file-backed pages +fn filtered_mappings(...) -> Vec<(u64, u64, BasicMapping, Option<&[u8]>)> { + virt_to_phys(...) + .filter_map(|(gva, gpa, bm)| { + // Exclude scratch region GVAs (internal allocations) + if gva >= scratch_base_gva(scratch_size) { + return None; + } + // Exclude snapshot page tables + if gva >= SNAPSHOT_PT_GVA_MIN && gva <= SNAPSHOT_PT_GVA_MAX { + return None; + } + + // Check if GPA points to a file-backed region + if is_file_backed_gpa(gpa, file_mappings) { + // Clean file-backed page - no content to copy + return Some((gva, gpa, bm, None)); + } + + // Dirty page - copy content from scratch or snapshot memory + let contents = guest_page(snap, scratch, scratch_size, gpa)?; + Some((gva, gpa, bm, Some(contents))) + }) + .collect() +} +``` + +**How COW'd file pages are captured:** + +When a write occurs to a file-backed COW page (e.g., `.data` section): +1. Page fault handler allocates new page in scratch region via `alloc_phys_pages(1)` +2. Copies content from file page to new scratch page +3. Remaps PTE: same GVA, but GPA now points to scratch region (with RW permissions) + +At snapshot time: +- The GVA is still in the file-mapped range (e.g., `.data` section address) +- But the GPA points to scratch, not to the file +- `is_file_backed_gpa(gpa, ...)` returns false +- Content is copied from scratch region + +On restore, `rebuild_page_tables` walks the snapshot's page tables and recreates all PTEs. +The GPAs are already correct in the snapshot - file-backed pages point to file GPAs, +dirty pages point to snapshot memory GPAs: + +```rust +fn rebuild_page_tables(&mut self, snapshot: &Snapshot) -> Result<()> { + let pt_buf = GuestPageTableBuffer::new(pt_base_gpa); + + // Walk the snapshot's page tables and recreate all PTEs + for (gva, gpa, bm) in virt_to_phys(snapshot, ...) 
{ + let mapping = Mapping { + phys_base: gpa, + virt_base: gva, + len: PAGE_SIZE as u64, + kind: MappingKind::BasicMapping(bm), + }; + unsafe { vmem::map(&pt_buf, mapping) }; + } + + // Map special regions + map_specials(&pt_buf, self.layout.get_scratch_size()); + + Ok(()) +} +``` + +**Key points:** +- **Clean file-backed pages**: RO PTEs preserved exactly (same GVA → same GPA, same permissions) +- **Dirty file-backed pages (COW'd)**: Content captured in snapshot, restored with original permissions +- No content duplication for clean file-backed pages +- Pages never accessed before snapshot remain unmapped (lazy creation on first access) + +**Restore process:** + +For in-process restore (restoring to the same sandbox instance that created the snapshot), +the file mappings are already in place - the sandbox keeps the `MappedBinary` and +`MappedFile` handles alive, and the hypervisor memory slots remain configured. The restore +process only needs to: + +1. Validate the sandbox ID matches +2. Verify file content hashes (defense-in-depth - should never fail since files are locked) +3. Copy snapshot memory (dirty pages) +4. Rebuild page tables + +```rust +fn restore_snapshot(sandbox: &mut Sandbox, snapshot: &Snapshot) -> Result<()> { + // Existing: validate sandbox ID + if sandbox.id() != snapshot.sandbox_id { + return Err(Error::SnapshotMismatch); + } + + // Verify file mappings match what the snapshot expects. + // `file_mappings` includes both the guest binary (FileMappingMode::GuestBinary) + // and any user-mapped files (ReadOnly or ReadWriteCow). + // + // For in-process restore, the sandbox already has these files mapped into + // hypervisor memory slots at the correct GPAs. We verify the content hashes + // as defense-in-depth (should never fail since files are locked while mapped). + for file_info in &snapshot.file_mappings { + // Lookup the file mapping by GPA - searches both MappedBinary and MappedFiles + let mapped_file = sandbox.get_file_mapping_at_gpa(file_info.guest_address) + .ok_or(Error::MissingFileMapping)?; + if mapped_file.content_hash() != file_info.content_hash { + return Err(Error::ContentMismatch); + } + } + + // Existing: copy snapshot memory (dirty pages only - file content is shared) + sandbox.shared_mem.restore_from_slice(&snapshot.memory)?; + + // Existing: rebuild page tables (PTEs point to file GPAs or snapshot memory GPAs) + sandbox.rebuild_page_tables(&snapshot)?; + + Ok(()) +} +``` + +For future disk persistence (restoring a snapshot in a new process), the restore would +need to first re-establish the file mappings: + +```rust +fn restore_persisted_snapshot(snapshot: &PersistedSnapshot, cache: &BinaryCache) -> Result { + // Re-establish file mappings from cache. + // `file_mappings` includes both the guest binary and user files - all are + // stored in the cache (keyed by content hash) when a snapshot is persisted. + let mut file_mappings = Vec::new(); + for file_info in &snapshot.file_mappings { + // Lookup by content hash - requires file to be in cache + let mapped = cache.open_by_hash(&file_info.content_hash, file_info.guest_address)?; + file_mappings.push(mapped); + } + + // Create sandbox with the mapped files + let sandbox = Sandbox::new_with_file_mappings(file_mappings, ...)?; + + // Then proceed with normal restore... 
    sandbox.shared_mem.restore_from_slice(&snapshot.memory)?;
    sandbox.rebuild_page_tables(&snapshot)?;

    Ok(sandbox)
}
```

**Result:** Snapshots remain compact (file content is not duplicated), while restore
is fast and preserves exact page permissions.

**Design for future persistence:**

While this HIP focuses on in-process restore (same sandbox instance), the design supports
future disk persistence:

- File mappings reference content by hash, not path (enables validation on restore)
- No pointers or handles in `FileMappingInfo` (serializable to disk)

When persistence is implemented, the `Snapshot` struct can be serialized to disk and
restored in a new process by:

1. Re-establishing file mappings via cache (lookup by hash)
2. Applying dirty pages
3. Recreating page tables
4. Restoring CPU state

**Trade-offs:**

- (+) Snapshots are smaller (file refs + dirty pages, not full memory)
- (+) Design supports future persistence without restructuring
- (+) Content hashing enables validation
- (-) More complex than a simple `Vec<u8>` snapshot. The complexity comes from managing the
  relationship between snapshots and their associated file caches - a snapshot is no longer
  self-contained but depends on external cached files being present and unchanged. This could
  be mitigated with tooling, e.g.:
  - `hyperlight snapshot package` - bundle a snapshot with its required cached files into a
    single distributable artifact. These artifacts could also be published and shared via
    OCI-compatible registries for easy distribution.
  - `hyperlight snapshot validate` - verify all referenced files exist and content hashes match
  - `hyperlight snapshot info` - query snapshot metadata (file mappings, sizes, content hashes)
- (-) Restore must re-validate file content (content hash verification adds latency)

## Test Plan

**Unit Tests:**

- Binary cache: `get_or_create`, concurrent access, cache key correctness
- Region table: lookup correctness, boundary cases, empty table
- `build_region_table`: PT_LOAD parsing, user file mapping inclusion

**Integration Tests:**

- Single sandbox with mmap'd binary executes correctly
- Multiple sandboxes share binary (verify via `/proc/*/smaps` Pss on Linux)
- COW triggers correctly: first read creates read-only PTE, first write triggers COW
- Write to executable segment (RX) terminates sandbox (not COW-eligible)
- Snapshot restore rebuilds page tables pointing to shared file mapping
- Restore validates file content hash (fails if cached file changed)
- `map_file` API works for user files (ReadOnly and ReadWriteCow modes)

**Memory Tests:**

- 100 sandboxes with shared binary use less memory than 100 copies
- Dirty pages grow proportionally to writes, not total mapped size

**Performance Tests:**

- Sandbox creation time (expect faster due to no binary copy)
- First page access latency (measure fault overhead)
- Memory usage at scale (100, 1000 sandboxes)

## Implementation History

- 2026-02-03: Initial HIP draft

## Drawbacks

1. **Increased complexity**: Region table and extended page fault handler logic add
   guest-side complexity on top of the existing exception handling infrastructure.

2. **First-access latency**: Each page's first access incurs a page fault. For
   latency-critical code paths, this could be noticeable.

3. **Cache management**: Users must manage disk cache. Stale entries accumulate until
   manually cleaned.