diff --git a/crates/composefs-oci/src/canonical_tar_spec.rs b/crates/composefs-oci/src/canonical_tar_spec.rs new file mode 100644 index 00000000..d56746b3 --- /dev/null +++ b/crates/composefs-oci/src/canonical_tar_spec.rs @@ -0,0 +1,179 @@ +//! # Canonical Tar Format +//! +//! This document defines a canonical, reproducible tar serialization for composefs filesystem trees. This is a prerequisite for pushing images after an [incremental pull](crate::incremental_pulls_spec) and complements the standardized EROFS metadata work. +//! +//! ## Motivation +//! +//! In the [incremental pull](crate::incremental_pulls_spec) model, a composefs-aware client fetches only the content objects it doesn't already have, using the EROFS metadata as a table of contents. The client does not download or store the original tar layer bytes. To push this image to another registry, or to verify the OCI `diff_id` if needed, the client must be able to regenerate a byte-identical tar stream from the EROFS metadata and local object store. +//! +//! Without a canonical tar format, the regenerated tar will almost certainly differ from the original (different header encoding, different entry ordering, different padding), producing different digests. +//! +//! ## Conceptual Model +//! +//! The canonical tar format is defined as a mapping from composefs dumpfile to tar. The dumpfile is a human-readable textual format that represents a complete filesystem tree and can be converted to/from EROFS. By defining dumpfile-to-tar, we complete a triangle of deterministic conversions: +//! +//! ``` +//! dumpfile ──→ canonical tar +//! ↑ │ +//! │ │ +//! └── EROFS (v1) ←─┘ +//! +//! ``` +//! +//! A client that has an EROFS can convert to dumpfile, then to canonical tar. A builder that has a tar can convert to dumpfile, then to EROFS. +//! +//! ## Specification +//! +//! ### Header Format: pax (POSIX.1-2001) +//! +//! The canonical format uses pax extended headers exclusively. 
pax supports long filenames, large file sizes, nanosecond timestamps, arbitrary xattrs, and large uid/gid values without the ambiguities of GNU extensions. +//! +//! Each entry consists of: +//! 1. *(If pax records are needed)* A pax extended header entry (type `x`) followed by its data blocks +//! 2. The ustar header entry followed by any content data blocks +//! +//! The pax extended header entry's name is `PaxHeaders.0/{name}` where `{name}` is the entry's filename component (truncated to 100 bytes if necessary). +//! +//! ### Global Header +//! +//! The archive begins with a single pax global extended header (typeflag `g`) containing one record: +//! +//! ``` +//! canonical-tar=1 +//! ``` +//! +//! This allows any client to detect canonical tar format by reading the first entry. No other global extended headers are permitted in the archive. +//! +//! ### Entry Ordering +//! +//! Entries appear in depth-first pre-order with children sorted by filename using byte-wise comparison. This matches the ordering produced by iterating a `BTreeMap`, which is the in-memory representation used by composefs. +//! +//! Example: +//! ``` +//! ./ +//! ./a/ +//! ./a/x +//! ./a/y +//! ./b/ +//! ./b/z +//! ./c +//! ``` +//! +//! The root directory entry comes first. Directories are emitted before their children. +//! +//! ### Path Encoding +//! +//! All paths are relative to the archive root, prefixed with `./`. Directories have a trailing `/`. For example, the dumpfile path `/usr/bin/sh` becomes `./usr/bin/sh` in the tar stream; the dumpfile path `/usr/lib/` becomes `./usr/lib/`. +//! +//! Paths that fit within 100 bytes are stored entirely in the ustar `name` field. Paths longer than 100 bytes use a pax `path` record; the ustar `name` field is filled with a truncated form and the ustar `prefix` field is left empty. The ustar prefix/name split is never used, as different implementations split at different `/` boundaries, making it a source of non-reproducibility. +//! +//! 
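The ordering and path rules above can be sketched in a few lines of Rust. This is an illustrative sketch only: `Node`, `walk`, and `canonical_paths` are hypothetical names, not the composefs tree types, and the sketch assumes the root is a directory.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory tree node (not the composefs type).
/// `BTreeMap<Vec<u8>, _>` iterates keys in byte-wise order.
enum Node {
    File,
    Dir(BTreeMap<Vec<u8>, Node>),
}

/// Append the canonical tar path of `node` and its descendants to `out`
/// in depth-first pre-order. `path` holds the path of `node` itself.
fn walk(path: &mut Vec<u8>, node: &Node, out: &mut Vec<Vec<u8>>) {
    // Pre-order: the entry itself is emitted before any of its children.
    out.push(path.clone());
    if let Node::Dir(children) = node {
        for (name, child) in children {
            let saved = path.len();
            path.extend_from_slice(name);
            // Directories carry a trailing '/'.
            if matches!(child, Node::Dir(_)) {
                path.push(b'/');
            }
            walk(path, child, out);
            path.truncate(saved);
        }
    }
}

/// Return all entry paths of the tree rooted at `root`, `./`-prefixed.
fn canonical_paths(root: &Node) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut path = b"./".to_vec();
    walk(&mut path, root, &mut out);
    out
}
```

Sorting children per directory is not the same as sorting full path strings: with siblings `a/` and `a-b`, the per-directory order visits `./a/x` before `./a-b`, while a byte-wise sort of full paths would place `./a-b` first because `-` sorts before `/`.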
### Ustar Header Fields +//! +//! All header fields use the ustar format (magic `ustar\0`, version `00`). +//! +//! | Field | Size | Encoding | Notes | +//! |-------|------|----------|-------| +//! | name | 100 | Bytes, null-terminated | See path encoding above | +//! | mode | 8 | Octal, zero-padded, null-terminated | Permission bits only (no file-type bits). E.g. `0000755\0` | +//! | uid | 8 | Octal, zero-padded, null-terminated | Values > 2,097,151 overflow to pax | +//! | gid | 8 | Octal, zero-padded, null-terminated | Values > 2,097,151 overflow to pax | +//! | size | 12 | Octal, zero-padded, null-terminated | File content size. 0 for directories, symlinks, devices, fifos. Values > 8,589,934,591 (8 GiB or larger) overflow to pax | +//! | mtime | 12 | Octal, zero-padded, null-terminated | Seconds since epoch. Values > 8,589,934,591 overflow to pax | +//! | chksum | 8 | Octal, zero-padded, null-terminated + space | Unsigned sum of all header bytes with chksum field treated as spaces | +//! | typeflag | 1 | ASCII | See entry types below | +//! | linkname | 100 | Bytes, null-terminated | Symlink/hardlink target; longer targets use pax `linkpath` | +//! | magic | 6 | `ustar\0` | | +//! | version | 2 | `00` | | +//! | uname | 32 | Empty (null-filled) | Not stored in EROFS; omitted | +//! | gname | 32 | Empty (null-filled) | Not stored in EROFS; omitted | +//! | devmajor | 8 | Octal, zero-padded, null-terminated | For block/char devices only; 0 otherwise | +//! | devminor | 8 | Octal, zero-padded, null-terminated | For block/char devices only; 0 otherwise | +//! | prefix | 155 | Empty (null-filled) | Never used; long paths use pax `path` instead | +//! +//! Unused header bytes are zero-filled. +//! +//! ### Entry Types +//! +//! | Dumpfile entry | typeflag | Notes | +//! |----------------|----------|-------| +//! | Regular file | `0` | Content follows header | +//! | Directory | `5` | Size 0, path has trailing `/` | +//! | Symlink | `2` | Target in linkname (or pax `linkpath`) | +//! 
| Hardlink | `1` | Target in linkname as relative `./`-prefixed path | +//! | Block device | `4` | devmajor/devminor set | +//! | Char device | `3` | devmajor/devminor set | +//! | FIFO | `6` | | +//! +//! ### Pax Extended Headers +//! +//! Pax records are used only when a value cannot be represented in the ustar header: it overflows the field, needs sub-second precision, or (for xattrs) has no ustar field at all. The canonical format does not unconditionally emit pax headers for values that fit in ustar fields. +//! +//! Pax records are emitted in the following order when present: +//! +//! 1. `path` (if the path exceeds 100 bytes) +//! 2. `linkpath` (if linkname exceeds 100 bytes) +//! 3. `size` (if > 8,589,934,591, i.e. 8 GiB or larger) +//! 4. `uid` (if > 2,097,151) +//! 5. `gid` (if > 2,097,151) +//! 6. `mtime` (if > 8,589,934,591, or if sub-second precision is needed) +//! 7. `SCHILY.xattr.*` records, sorted by full key name (byte-wise) +//! +//! Each pax record is formatted as `{length} {key}={value}\n` per POSIX.1-2001. The length field is the total byte count of the record including itself. +//! +//! #### Xattr Encoding +//! +//! Extended attributes are encoded as `SCHILY.xattr.{name}` pax records. Values are binary-safe (the pax record length field handles arbitrary bytes). Xattr records are sorted by the full key string (`SCHILY.xattr.security.selinux` before `SCHILY.xattr.user.foo`), using byte-wise comparison. +//! +//! #### Timestamp Precision +//! +//! If the dumpfile timestamp has a non-zero nanosecond component, the `mtime` pax record is emitted as `{seconds}.{nanoseconds}` (nanoseconds without trailing zeros). If the timestamp is integer seconds and fits in the ustar mtime field, no pax record is emitted. +//! +//! ### Content and Padding +//! +//! File content is the raw bytes from the object store (for external files, identified by fsverity digest) or the inline bytes (for files ≤ 64 bytes). +//! +//! Content is followed by zero-padding to the next 512-byte block boundary. The padding bytes are all zero. +//! +//! ### End of Archive +//! +//! 
The archive ends with two consecutive 512-byte blocks of zeros, per POSIX. +//! +//! ### Hardlink Handling +//! +//! When the dumpfile contains hardlinks (multiple paths sharing the same leaf ID), the first path encountered in depth-first sorted order is emitted as a regular entry with full content. Subsequent paths referencing the same leaf are emitted as hardlink entries (typeflag `1`) with the first path as the linkname target. +//! +//! The hardlink target path uses the same `./`-prefixed encoding as all other paths. +//! +//! ### Whiteout Representation +//! +//! For per-layer (non-merged) tars, OCI whiteouts are represented as standard whiteout entries: +//! +//! - **File deletion**: a zero-length regular file named `.wh.{name}` in the parent directory, where `{name}` is the deleted entry's filename +//! - **Opaque directory**: a zero-length regular file named `.wh..wh..opq` in the directory +//! +//! Whiteout entries appear in sorted order alongside regular entries. Their mode is `0000644`, uid/gid are 0, mtime is 0. +//! +//! For merged/flattened tars, whiteouts do not appear (they have already been processed). +//! +//! ## Compression +//! +//! This specification defines the uncompressed tar byte stream only. Compression (gzip, zstd, composefs-chunked framing) is a separate concern. The composefs-chunked format described in [`incremental_pulls_spec`](crate::incremental_pulls_spec) applies zstd frame boundaries on top of this canonical ordering without changing the entry order or content. +//! +//! ## Implementation Notes +//! +//! The [tar-core](https://github.com/composefs/tar-core) crate provides the building blocks for producing canonical tar output. It supports both pax and GNU extension modes, deterministic numeric encoding, and pax record construction. The canonical tar generator would use tar-core's `EntryBuilder` in pax mode (`ExtensionMode::Pax`), calling `build_pax_data()` to emit extended headers only when ustar fields overflow. +//! +//! 
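Independently of tar-core, the byte-level rules from the Pax Extended Headers, Timestamp Precision, and Content and Padding sections can be sketched with the standard library alone. This is a minimal illustration, not tar-core's API: `pax_record`, `pax_mtime`, `padding_len`, and `USTAR_MAX_11_DIGITS` are hypothetical names, and timestamps are assumed to be non-negative with `nanos < 1_000_000_000`.

```rust
/// Largest value that fits in an 11-digit octal ustar field (size, mtime).
const USTAR_MAX_11_DIGITS: u64 = 0o77_777_777_777;

/// Encode one pax record as `{length} {key}={value}\n`, where the length
/// counts every byte of the record, including the length digits themselves.
fn pax_record(key: &str, value: &[u8]) -> Vec<u8> {
    // Everything except the length digits: ' ' + key + '=' + value + '\n'.
    let rest = 1 + key.len() + 1 + value.len() + 1;
    // The number of digits depends on the total, so iterate to a fixed point.
    let mut total = rest + 1;
    loop {
        let candidate = rest + total.to_string().len();
        if candidate == total {
            break;
        }
        total = candidate;
    }
    let mut record = format!("{total} {key}=").into_bytes();
    record.extend_from_slice(value);
    record.push(b'\n');
    record
}

/// Value of the pax `mtime` record, or `None` when the ustar field suffices.
fn pax_mtime(secs: u64, nanos: u32) -> Option<String> {
    if nanos == 0 {
        // Integer seconds: a pax record is needed only on ustar overflow.
        return (secs > USTAR_MAX_11_DIGITS).then(|| secs.to_string());
    }
    // Nine fractional digits with trailing zeros removed.
    let frac = format!("{nanos:09}");
    Some(format!("{secs}.{}", frac.trim_end_matches('0')))
}

/// Number of zero bytes that follow `content_len` bytes of file content.
fn padding_len(content_len: u64) -> u64 {
    (512 - content_len % 512) % 512
}
```

The fixed-point loop in `pax_record` matters at digit boundaries: a record whose non-length part is 98 bytes needs a three-digit length (101), not the two digits a single-pass estimate would suggest.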
tar-core does not impose entry ordering; the caller (composefs) controls the order by walking the dumpfile/EROFS tree in sorted depth-first order. +//! +//! ## Relationship to Other Specs +//! +//! The dumpfile is the canonical filesystem representation that bridges tar and EROFS. This spec defines dumpfile to tar; a future standardized EROFS metadata spec will define dumpfile to EROFS. Together they enable round-trip conversion. +//! +//! The OCI layer format (`application/vnd.oci.image.layer.v1.tar`) requires a standards-compliant tar stream. A canonical tar produced by this specification is a valid OCI layer. The `diff_id` is the SHA-256 of the uncompressed canonical tar stream. +//! +//! ## References +//! +//! - [Incremental pulls](crate::incremental_pulls_spec): the primary consumer of canonical tar +//! - [tar-core](https://github.com/composefs/tar-core): sans-IO tar library used by composefs +//! - [OCI image layer spec](https://github.com/opencontainers/image-spec/blob/main/layer.md): OCI tar layer requirements +//! - [POSIX.1-2001 pax format](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html): pax extended header specification diff --git a/crates/composefs-oci/src/incremental_pulls_spec.rs b/crates/composefs-oci/src/incremental_pulls_spec.rs new file mode 100644 index 00000000..9095f2ba --- /dev/null +++ b/crates/composefs-oci/src/incremental_pulls_spec.rs @@ -0,0 +1,135 @@ +//! # Incremental Pulls +//! +//! Status: Provisional +//! +//! There are two large things missing from OCI: +//! +//! - dm-verity-like integrity +//! - standard incremental fetching and deltas +//! +//! The composefs artifact model fixes the first. This proposal builds on top of the composefs artifact, giving a model for incremental fetches. +//! +//! ## Core proposal +//! +//! Existing approaches to incremental container image pulls (zstd:chunked and eStargz) embed a JSON table of contents (TOC) inside the compressed layer blob. 
The client reads the TOC, determines which file chunks it already has locally, and fetches missing chunks via HTTP range requests. +//! +//! The two formats handle diff_id verification differently. zstd:chunked also embeds tar-split reconstruction data in the blob, allowing the client to reassemble the exact original uncompressed tar stream and verify its SHA-256 digest against the OCI `diff_id`. eStargz does *not* include tar-split, which means it cannot verify the diff_id at all; clients must set `insecure_allow_unpredictable_image_contents` to use it. This is a significant practical limitation of eStargz. +//! +//! Composefs changes this picture fundamentally. The locally-generated EROFS metadata contains the complete filesystem tree with fsverity digests for every content object. A composefs-aware client knows which objects it already has, and can compute which ones are missing. +//! +//! All that is needed then is a mapping between the fsverity digests and the location in the tar stream. +//! +//! When the EROFS is trusted (via kernel fsverity signature or the OCI manifest signature chain covering the composefs digest), the `diff_id` verification becomes redundant: the composefs digest already cryptographically covers the complete filesystem tree. This eliminates the need for tar-split metadata entirely and simplifies the pull, verification, and push paths. +//! +//! ### Comparison with existing approaches +//! +//! | Aspect | zstd:chunked | eStargz | composefs incremental | +//! |--------|-------------|---------|----------------------| +//! | TOC format | JSON in zstd skippable frame | JSON in gzip member | Offset map (separate OCI artifact) | +//! | TOC reuse | Discarded after pull | Discarded after pull | EROFS generated locally; mounted by the kernel | +//! | Tar-split | Embedded in blob | Not available | Not needed | +//! 
| diff_id verification | Yes (via tar-split) | No (`insecure_allow_unpredictable_image_contents`) | Redundant (composefs digest covers the tree) | +//! | Content digests | SHA-256 | SHA-256 | fsverity (SHA-256 or SHA-512 Merkle tree) | +//! | Dedup granularity | Sub-file chunks (~64 KiB, rolling checksum) | Per-file | Whole files (by fsverity digest) | +//! | Kernel integration | None (userspace only) | None (userspace only) | EROFS + overlayfs + fsverity | +//! | Push after incremental pull | Reconstruct via tar-split | Cannot reconstruct original tar | Canonical tar generation (see below) | +//! +//! ## Design +//! +//! ### Layer Format: composefs-chunked +//! +//! A composefs-chunked layer is a valid `tar+zstd` blob that any OCI client can pull and decompress normally. The difference is in how the zstd compression is structured internally: large files are compressed as independent zstd frames, making them individually addressable via byte offset. +//! +//! Tar entries are in **canonical order**, the same deterministic ordering defined by the [canonical tar format](crate::canonical_tar_spec). This is essential: a client that does an incremental pull must be able to regenerate byte-identical tar for push, so the entry ordering cannot be compression-driven. +//! +//! The zstd frame boundaries are an overlay on top of the canonical ordering. For files above a size threshold (e.g. 4 KiB), the compressor closes and restarts the zstd frame around the file's payload, making it independently decompressible. Files below the threshold are simply compressed together with their neighbors in whatever order they naturally appear. The threshold aligns with the filesystem block size. +//! +//! Files ≤ 64 bytes are already inline in the EROFS metadata (`INLINE_CONTENT_MAX`) and are never fetched from the tar layer during an incremental pull, regardless of framing. +//! +//! Unlike zstd:chunked, there are no trailing skippable frames (no embedded JSON TOC, no tar-split data). 
The offset map in the composefs metadata artifact serves as the TOC; the EROFS itself is generated locally by the client. +//! +//! Unlike zstd:chunked, there is no sub-file content-defined chunking. Composefs deduplicates at the whole-file level (by fsverity digest), so rolling-checksum chunk boundaries provide no dedup benefit. This simplifies the format and the offset map. +//! +//! ### Offset Map +//! +//! The offset map tells the client where each individually-framed file lives within the compressed layer blob. It is stored as an additional layer in the composefs OCI artifact, with media type `application/vnd.composefs.v1.offset-map`. +//! +//! For each individually-compressed file, the map contains: +//! +//! ``` +//! { fsverity_digest, layer_index, byte_offset, compressed_size } +//! ``` +//! +//! - `fsverity_digest`: the fsverity digest of the file content (matches the EROFS inode's content reference) +//! - `layer_index`: position in the image manifest's `layers` array (0-indexed) +//! - `byte_offset`: byte offset of the payload zstd frame within the compressed blob +//! - `compressed_size`: size of the compressed zstd frame in bytes +//! +//! Only files above the individually-framed threshold have entries in the offset map. Files below the threshold that a client needs must be fetched by downloading the surrounding range or falling back to a full layer fetch (acceptable since these files are small by definition). +//! +//! The format should be compact. A sorted array of fixed-size records (digest + u32 layer index + u64 offset + u64 size) works well and enables binary search by digest. For a layer with 10,000 individually-framed files using SHA-512 fsverity digests, the offset map is roughly 10,000 × (64 + 4 + 8 + 8) = ~820 KiB uncompressed, which compresses well. +//! +//! ### Pull Protocol +//! +//! **Full pull (non-composefs client).** The layer is a valid tar+zstd blob. Pull, decompress, extract. 
Standard OCI behavior, no awareness of composefs needed. +//! +//! **Incremental pull (composefs-aware client):** +//! +//! 1. Fetch the composefs metadata artifact (offset map + optional signatures) +//! 2. Read the composefs digest annotations to learn expected fsverity digests for all non-inline content objects +//! 3. Query the local object store: which of these digests do we already have? +//! 4. For missing digests, look up byte ranges in the offset map +//! 5. Merge adjacent/nearby ranges to reduce HTTP requests (same optimization as zstd:chunked) +//! 6. Issue HTTP range requests against the layer blob(s) to fetch missing objects +//! 7. Decompress each frame independently, write to the object store, enable fsverity +//! 8. Verify each object: the computed fsverity digest must match what the EROFS references +//! +//! No tar reassembly, no diff_id verification, no tar-split. Trust is rooted in the composefs digest (signed or verified via the manifest chain), and each content object is independently verified by its fsverity digest. After fetching all missing objects, the client generates the EROFS locally and verifies its fsverity digest matches the expected value. +//! +//! ### Push After Incremental Pull +//! +//! An incrementally-pulled image does not have the original tar layer bytes stored locally. To push the image to another registry, the client must regenerate the tar layer. For the pushed image to be identical to the original (same layer digests, same manifest), this regeneration must be deterministic. +//! +//! This requires a **canonical tar format**: a well-defined, reproducible mapping from filesystem metadata (EROFS or dumpfile) + content objects to a tar byte stream. See [`canonical_tar_spec`](crate::canonical_tar_spec) for this specification. +//! +//! With a canonical tar: +//! - The original image builder produces the tar using the canonical format +//! 
- An incrementally-pulling client can regenerate byte-identical tar from EROFS + object store +//! - The pushed image has the same layer digests and diff_id as the original +//! - The canonical tar can also be used to lazily verify the diff_id if needed, without storing tar-split +//! +//! ### Composefs Artifact Integration +//! +//! The offset map is an additional layer in the composefs metadata artifact (`application/vnd.composefs.metadata.v1`). With incremental pull support, the artifact layers are ordered: +//! +//! 1. N offset map layers (one per image layer, `application/vnd.composefs.v1.offset-map`) +//! 2. *(Optional)* Signature layers (`application/vnd.composefs.signature.v1+pkcs7`) +//! +//! Each offset map layer carries a `composefs.layer.offset-map-index` annotation identifying which manifest layer it corresponds to (0-indexed). +//! +//! Layers that are not composefs-chunked (e.g. standard tar+gzip layers in a mixed image) simply have no offset map entry. A missing offset map for a layer means the client must fall back to a full fetch for that layer. +//! +//! ## Security Considerations +//! +//! **Trust model.** The trusted composefs digest (verified against the manifest chain or via kernel fsverity signatures) is the root of trust for the filesystem tree. Each content object fetched via range request is verified independently by computing its fsverity digest. An attacker who controls the registry cannot serve incorrect content without detection, since the fsverity digest is a Merkle tree hash that the kernel enforces on every read after `FS_IOC_ENABLE_VERITY`. +//! +//! **No tar-split, no diff_id.** By not verifying the diff_id, we are explicitly trusting the composefs digest chain rather than the OCI config's `rootfs.diff_ids`. This is a stronger verification (fsverity Merkle tree of the complete filesystem vs. 
flat SHA-256 of an opaque tar stream) but it does mean that a composefs-aware client and a non-composefs client may disagree if the tar and EROFS are inconsistent. Since the EROFS is generated locally from the tar layers using canonical generation, any divergence surfaces as a digest mismatch against the trusted composefs digest. +//! +//! **Offset map integrity.** The offset map is part of the composefs artifact, which is covered by the artifact's manifest digest and optionally by signatures. A tampered offset map could point to wrong byte ranges, but the client verifies each fetched object's fsverity digest, so tampered offsets result in verification failure, not incorrect data. +//! +//! ## Future Directions +//! +//! **Registry-level compression.** The [OCI distribution-spec proposal for registry-level compression](https://github.com/opencontainers/distribution-spec/issues/235) would allow registries to handle compression/decompression, serving uncompressed byte ranges from compressed blobs. This would eliminate the need for independent zstd framing entirely; the client could request raw byte ranges of uncompressed file content. The offset map would then contain offsets into the *uncompressed* tar stream, which are easier to compute (they fall out of tar generation directly). +//! +//! **Sub-file chunking.** The current design operates at whole-file granularity. For images with very large files that change incrementally between versions (e.g. RPM databases, locale archives), sub-file content-defined chunking could reduce transfer sizes. The offset map format is extensible to support multiple entries per file. This is deferred as a non-goal for the initial design. +//! +//! **Cross-layer dedup.** The composefs object store already deduplicates across layers (objects are stored by fsverity digest). 
The incremental pull protocol naturally benefits from this: if layer A and layer B share a file, pulling layer A populates the object store, and layer B's pull skips that file. No additional mechanism is needed. +//! +//! ## References +//! +//! - [OCI sealing specification](crate::sealing_spec): composefs metadata artifacts +//! - [Canonical tar format](crate::canonical_tar_spec): reproducible tar generation for push after incremental pull +//! - Standardized EROFS metadata (future): canonical EROFS generation (separate concern) +//! - [composefs/composefs#294](https://github.com/composefs/composefs/issues/294): original design discussion +//! - [zstd:chunked implementation](https://github.com/containers/storage/tree/main/pkg/chunked): reference for partial pull mechanics +//! - [OCI distribution-spec #235](https://github.com/opencontainers/distribution-spec/issues/235): registry-level compression proposal diff --git a/crates/composefs-oci/src/lib.rs b/crates/composefs-oci/src/lib.rs index 0aefd575..b4a935b3 100644 --- a/crates/composefs-oci/src/lib.rs +++ b/crates/composefs-oci/src/lib.rs @@ -35,8 +35,14 @@ pub mod tar; #[doc(hidden)] pub mod test_util; +#[cfg(doc)] +pub mod canonical_tar_spec; #[cfg(doc)] pub mod design; +#[cfg(doc)] +pub mod incremental_pulls_spec; +#[cfg(doc)] +pub mod sealing_spec; // Re-export the composefs crate for consumers who only need composefs-oci pub use composefs; diff --git a/crates/composefs-oci/src/sealing_spec.rs b/crates/composefs-oci/src/sealing_spec.rs new file mode 100644 index 00000000..ea97e1bd --- /dev/null +++ b/crates/composefs-oci/src/sealing_spec.rs @@ -0,0 +1,428 @@ +//! # OCI Sealing Specification for Composefs +//! +//! This document defines how composefs integrates with OCI container images to provide cryptographic verification of complete filesystem trees. The specification is based on original design discussion in [composefs/composefs#294](https://github.com/composefs/composefs/issues/294). +//! +//! 
## Problem Statement +//! +//! The threat model we want to address is one where the filesystem (or block device) may have been mutated by malicious or accidental activity. Such changes should be detected immediately and efficiently, even while a container is running. +//! +//! To address this, container images need cryptographic verification that efficiently covers all components (manifest, config and filesystem tree). +//! +//! Current OCI signature mechanisms (cosign, GPG) can sign manifests, which then cover the compressed and uncompressed tar archive streams. But verifying the correspondence between the tar archive and the unpacked filesystem representation is very expensive. +//! +//! An obvious mechanism to address the threat model would be to store everything in memory: first verify the manifest, then the config, then unpack the tar archives into memory. But this would mean a slow and expensive "first start", and would also be problematic for large container images that have unused portions. +//! +//! ## Related projects +//! +//! - **[containerd EROFS snapshotter](https://github.com/containerd/containerd/blob/main/docs/snapshotters/erofs.md)**: Converts OCI layers to EROFS blobs with optional fsverity protection. Supports `enable_fsverity = true` to enable fs-verity on layer blobs. Uses reproducible builds with erofs-utils 1.8+ (`-T0 --mkfs-time`). dm-verity integration is planned but not yet implemented. +//! +//! ## Efficient sealing with composefs +//! +//! The core primitive of composefs is fsverity, which allows incremental online verification of individual files. The complete filesystem tree metadata is itself stored as a file which can be verified in the same way. The critical design question is how to embed the composefs digest within OCI image metadata such that external signatures can efficiently cover the entire filesystem tree. +//! +//! "composefs digest" here means the fsverity digest of the EROFS metadata file. 
fsverity is configurable based on digest algorithm (SHA-256 or SHA-512 currently) and block size (4k or 64k). +//! +//! As a standardized short form of the combination, a string of the form `fsverity-${DIGEST}-${BLOCKSIZEBITS}` is used. The `fsverity-` prefix makes clear this is an fsverity Merkle tree digest, not a simple hash: +//! +//! - `fsverity-sha256-12` (SHA-256, 4k block size, 2^12) +//! - `fsverity-sha512-12` (SHA-512, 4k block size) +//! - `fsverity-sha256-16` (SHA-256, 64k block size, 2^16) +//! - `fsverity-sha512-16` (SHA-512, 64k block size) +//! +//! Digests are encoded as lowercase hexadecimal. +//! +//! (Note: at the current time, only 4k blocks are supported by the composefs-rs implementation.) +//! +//! ### Key components: fsverity digest and signature +//! +//! An OCI image has 3 key components, and we want to provide integrity for all of them: +//! +//! - manifest +//! - config +//! - layers (at least when manifested as a merged filesystem tree) +//! +//! ### Possible approach: Manifest to fsverity digest verification in userspace +//! +//! There is widespread use of tools like [cosign](https://github.com/sigstore/cosign) to verify integrity of the manifest. It is possible to achieve our goal by just verifying the manifest on start (ensuring that e.g. the cosign trusted roots are first verified - a well understood problem). +//! +//! Once we verify the manifest, we can cheaply verify the config by checking its digest (it's just small JSON). +//! +//! Then if we embed a digest for the composefs filesystem tree in the manifest (or config), we have efficiently established trust. +//! +//! This is strongly related to the model effectively used by "sealed UKIs" today - the kernel command line is covered by Secure Boot, which includes the fsverity digest, and the initramfs mounting code checks that digest. +//! +//! ### Linux Kernel-based approach: Include fsverity signatures +//! +//! 
A different but more powerful alternative is to use a signature scheme supported by the Linux kernel to sign the fsverity digest, and include a signature for all three objects of the manifest, config and the EROFS. +//! +//! Each of these three things is a file, and when an image is unpacked, the signature can be applied to the file backing it. +//! +//! ### Composefs integrity metadata modes +//! +//! There are two modes for how trust can be established for an OCI image. +//! +//! - **composefs-meta-artifact**: An OCI artifact that only includes metadata: cryptographic checksums and signatures +//! - **composefs-meta-included**: Instead of a separate artifact, metadata is included inline in the manifest as annotations. +//! +//! #### Annotation key scheme +//! +//! Both modes use the same role-prefixed annotation keys. The role appears as the second component of the key, making each annotation self-describing regardless of where it appears. +//! +//! | Object | Digest annotation | Inline location | +//! |---|---|---| +//! | Per-layer EROFS | `composefs.layer.erofs.v1.fsverity-{alg}-{bs}` | On the layer descriptor | +//! | Merged EROFS | `composefs.merged.erofs.v1.fsverity-{alg}-{bs}` | Manifest top-level `annotations` | +//! | Merged boot variant | `composefs.merged.bootable.erofs.v1.fsverity-{alg}-{bs}` | Manifest top-level `annotations` | +//! | Config | `composefs.config.fsverity-{alg}-{bs}` | On the config descriptor | +//! | Manifest | `composefs.manifest.fsverity-{alg}-{bs}` | *(artifact mode only)* | +//! +//! Annotations live on the descriptor of the object they describe when one exists (layers, config). The merged EROFS has no descriptor in the image manifest, so its digest goes in the manifest's top-level `annotations`. +//! +//! The signature annotation key is simply the digest key with `.sig` appended (e.g. `composefs.merged.erofs.v1.fsverity-sha512-12.sig`). 
Signature values are base64-encoded PKCS#7 DER blobs — the exact format consumed by `FS_IOC_ENABLE_VERITY` after decoding. In artifact mode the signature travels as a raw layer blob rather than a base64 annotation, but the digest annotation keys are identical across both modes. +//! +//! The `erofs.v1` segment in EROFS annotation keys denotes version 1 of the composefs EROFS metadata format. It appears only on annotations whose digest covers a locally-generated EROFS object. Config and manifest annotations omit it because their digest is taken over the raw JSON bytes as stored in the registry — there is no composefs-specific format to version. This gives two annotation key shapes: +//! +//! - `composefs.{role}.erofs.v{N}.fsverity-{alg}-{bs}` — for EROFS objects (layer, merged, merged.bootable) +//! - `composefs.{role}.fsverity-{alg}-{bs}` — for plain JSON files (config, manifest) +//! +//! The `manifest` row applies to artifact mode only. Inline mode cannot represent a manifest digest or signature because adding the annotation would change the manifest bytes being signed — the document would be self-referential. Inline mode instead relies on out-of-band manifest trust (cosign, pinned digest, etc.). +//! +//! #### Choosing a mode +//! +//! The two modes reflect a tradeoff between logistical simplicity and capability. +//! +//! Artifact mode works with unmodified existing images: compute the composefs digests, optionally sign them, and push the result as a referrer. The original image is never touched. It also supports signing the manifest itself, providing the strongest possible chain of trust. The tradeoff is that the artifact must be copied alongside the image; tools that are unaware of the OCI referrers API will not propagate it automatically. +//! +//! Inline mode embeds everything directly in the image manifest, so a plain `skopeo copy` or any other OCI-aware tool will carry the composefs metadata along automatically. 
The cost is that the manifest itself cannot be signed (the annotation would change the bytes), and there is a tighter coupling between image generation and the signing step. +//! +//! | | Artifact mode | Inline mode | +//! |---|---|---| +//! | Works with unmodified images | Yes | No | +//! | Survives naive `skopeo copy` | No | Yes | +//! | Can sign manifest | Yes | No | +//! | Alters image manifest digest | No | Yes | +//! | Separate artifact required | Yes | No | +//! +//! #### OCI artifact based composefs metadata +//! +//! In this mode, additional digests (and optionally signatures) are shipped as an OCI artifact that acts as a "referrer" to the main OCI image. This is very similar to a [cosign](https://github.com/sigstore/cosign) signature. +//! +//! The OCI artifact includes: +//! +//! - At least one fsverity digest (+ optional signature) for a composefs-EROFS +//! - A fsverity digest+signature for the config +//! - A fsverity digest+signature for the manifest +//! +//! Like bootc sealed UKIs, it is required for EROFS generation to be exactly bit-for-bit reproducible across implementations. +//! +//! An `erofs.v1` composefs digest MUST be included, using either `fsverity-sha256-12` or `fsverity-sha512-12`. The `erofs.v1` in the key identifies the EROFS metadata format version, while `fsverity-{alg}-{bs}` identifies the digest algorithm and block size. The artifact MAY include alternate digests — this could mean both `sha256` and `sha512` for example. It is also possible to use `erofs.v2` or other block sizes in a future version. +//! +//! ##### Artifact Manifest +//! +//! The composefs artifact is an OCI image manifest following the [artifacts guidance](https://github.com/opencontainers/image-spec/blob/main/artifacts-guidance.md) pattern (empty config, content in layers), with `artifactType` set to `application/vnd.composefs.metadata.v1`. +//! +//! The artifact carries fsverity digests and optional signatures. 
Each layer has a role-prefixed annotation identifying the fsverity digest of the object it covers — using `composefs.{role}.erofs.v1.fsverity-{alg}-{bs}` for EROFS objects and `composefs.{role}.fsverity-{alg}-{bs}` for plain JSON files. The client always generates the EROFS locally using canonical generation and verifies it against the expected digest. +//! +//! ```json +//! { +//! "schemaVersion": 2, +//! "mediaType": "application/vnd.oci.image.manifest.v1+json", +//! "artifactType": "application/vnd.composefs.metadata.v1", +//! "config": { +//! "mediaType": "application/vnd.oci.empty.v1+json", +//! "digest": "sha256:44136fa355b3678a1146ad16f7e8649e94fb4fc21fe77e8310c060f61caaff8a", +//! "size": 2 +//! }, +//! "layers": [ +//! { +//! "mediaType": "application/vnd.composefs.signature.v1+pkcs7", +//! "digest": "sha256:aaa...", +//! "size": 456, +//! "annotations": { +//! "composefs.manifest.fsverity-sha512-12": "ab12...manifest-fsverity-digest..." +//! } +//! }, +//! { +//! "mediaType": "application/vnd.composefs.signature.v1+pkcs7", +//! "digest": "sha256:bbb...", +//! "size": 789, +//! "annotations": { +//! "composefs.config.fsverity-sha512-12": "cd34...config-fsverity-digest..." +//! } +//! }, +//! { +//! "mediaType": "application/vnd.composefs.signature.v1+pkcs7", +//! "digest": "sha256:ccc...", +//! "size": 1234, +//! "annotations": { +//! "composefs.merged.erofs.v1.fsverity-sha512-12": "d015f70f8bee6c...merged-composefs-digest..." +//! } +//! }, +//! { +//! "mediaType": "application/vnd.composefs.signature.v1+pkcs7", +//! "digest": "sha256:ddd...", +//! "size": 1234, +//! "annotations": { +//! "composefs.merged.bootable.erofs.v1.fsverity-sha512-12": "e826a91b3c...boot-composefs-digest..." +//! } +//! } +//! ], +//! "subject": { +//! "mediaType": "application/vnd.oci.image.manifest.v1+json", +//! "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270", +//! "size": 7682 +//! } +//! } +//! ``` +//! +//! 
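The role-prefixed key scheme used in the manifest above can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the crate's API: the `Role` enum and the `digest_key` and `signature_key` helpers are hypothetical names introduced here. The numeric suffix is treated as the log2 of the block size (12 for 4096-byte blocks).

```rust
/// Roles from the annotation key table (hypothetical type, for illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Role {
    Layer,
    Merged,
    MergedBootable,
    Config,
    Manifest,
}

impl Role {
    /// Role component exactly as it appears in the annotation key.
    fn as_str(self) -> &'static str {
        match self {
            Role::Layer => "layer",
            Role::Merged => "merged",
            Role::MergedBootable => "merged.bootable",
            Role::Config => "config",
            Role::Manifest => "manifest",
        }
    }

    /// EROFS roles carry the `erofs.v1` format segment; plain JSON roles do not.
    fn is_erofs(self) -> bool {
        matches!(self, Role::Layer | Role::Merged | Role::MergedBootable)
    }
}

/// Build a digest annotation key such as
/// `composefs.merged.erofs.v1.fsverity-sha512-12`.
/// `alg` is `sha256` or `sha512`; `log_block_size` is 12 or 16.
pub fn digest_key(role: Role, alg: &str, log_block_size: u8) -> String {
    let segment = if role.is_erofs() { ".erofs.v1" } else { "" };
    format!(
        "composefs.{}{}.fsverity-{}-{}",
        role.as_str(),
        segment,
        alg,
        log_block_size
    )
}

/// The inline-mode signature key is the digest key with `.sig` appended.
pub fn signature_key(role: Role, alg: &str, log_block_size: u8) -> String {
    format!("{}.sig", digest_key(role, alg, log_block_size))
}
```

Calling `digest_key(Role::Config, "sha512", 12)` yields the key shown on the config signature layer above. `signature_key` only matters for inline mode, since artifact mode carries signatures as layer blobs rather than annotations.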
The `merged` role refers to the complete flattened filesystem of all layers. The `merged.bootable` role refers to the boot variant — a modified EROFS that excludes `/boot` and applies other boot-specific transformations, as described in [Relationship to Booting with composefs](#relationship-to-booting-with-composefs) below. +//! +//! ##### Layer Ordering +//! +//! Each layer carries a role-prefixed annotation that identifies both the role and the fsverity digest of the covered object. This makes the artifact self-contained — a consumer can verify composefs digests using only the artifact and the image layers, without requiring composefs annotations on the original image manifest. +//! +//! The layers MUST appear in this order: +//! +//! 1. **(Optional)** One signature with `composefs.manifest.fsverity-*` annotation — signature for the sealed image manifest +//! 2. **(Optional)** One signature with `composefs.config.fsverity-*` annotation — signature for the image config +//! 3. One signature with `composefs.merged.erofs.v1.fsverity-*` annotation — signature for the merged EROFS representing the complete flattened filesystem +//! 4. **(Optional)** One signature with `composefs.merged.bootable.erofs.v1.fsverity-*` annotation — signature for the boot variant of the merged EROFS (with `/boot` excluded, etc.) +//! +//! This design enables signing existing unmodified OCI images: compute composefs digests, sign them, and push the composefs artifact as a referrer. The original image is never touched. +//! +//! ##### Signature Format +//! +//! Each signature layer blob is a raw PKCS#7 signature encoded using [DER](https://en.wikipedia.org/wiki/X.690#DER_encoding) (Distinguished Encoding Rules, ITU-T X.690) over the kernel's `fsverity_formatted_digest`: +//! +//! ```c +//! struct fsverity_formatted_digest { +//! char magic[8]; /* "FSVerity" */ +//! __le16 digest_algorithm; +//! __le16 digest_size; +//! __u8 digest[]; +//! }; +//! ``` +//! +//! 
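As a rough sketch of what is actually signed, the structure above can be serialized by hand. The function name `formatted_digest` is hypothetical; the two algorithm constants mirror the values in the kernel's `<linux/fsverity.h>` UAPI header. The sketch only builds the byte string and does not produce a PKCS#7 signature.

```rust
/// Hash algorithm identifiers as defined in the kernel UAPI header <linux/fsverity.h>.
pub const FS_VERITY_HASH_ALG_SHA256: u16 = 1;
pub const FS_VERITY_HASH_ALG_SHA512: u16 = 2;

/// Serialize `struct fsverity_formatted_digest` for the given algorithm and
/// raw (binary, not hex) fsverity digest. Hypothetical helper for illustration.
pub fn formatted_digest(algorithm: u16, digest: &[u8]) -> Result<Vec<u8>, String> {
    // The digest size is implied by the algorithm: 32 bytes for SHA-256,
    // 64 bytes for SHA-512.
    let expected: usize = match algorithm {
        FS_VERITY_HASH_ALG_SHA256 => 32,
        FS_VERITY_HASH_ALG_SHA512 => 64,
        other => return Err(format!("unsupported fsverity hash algorithm {}", other)),
    };
    if digest.len() != expected {
        return Err(format!(
            "digest is {} bytes, expected {}",
            digest.len(),
            expected
        ));
    }

    let mut out = Vec::with_capacity(8 + 2 + 2 + digest.len());
    out.extend_from_slice(b"FSVerity"); // magic[8]
    out.extend_from_slice(&algorithm.to_le_bytes()); // __le16 digest_algorithm
    out.extend_from_slice(&(expected as u16).to_le_bytes()); // __le16 digest_size
    out.extend_from_slice(digest); // digest[]
    Ok(out)
}
```

The resulting buffer is 12 header bytes followed by the digest, so 44 bytes for SHA-256 and 76 bytes for SHA-512. A signer would pass these bytes to its PKCS#7 tooling; that step is outside the scope of this sketch.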
Composefs algorithm identifiers map to kernel constants with no salt: +//! - `fsverity-sha512-12` → `FS_VERITY_HASH_ALG_SHA512`, 4096-byte blocks +//! - `fsverity-sha256-12` → `FS_VERITY_HASH_ALG_SHA256`, 4096-byte blocks +//! - `fsverity-sha512-16` → `FS_VERITY_HASH_ALG_SHA512`, 65536-byte blocks +//! - `fsverity-sha256-16` → `FS_VERITY_HASH_ALG_SHA256`, 65536-byte blocks +//! +//! All entries in a single composefs artifact MUST use the same algorithm, which is encoded in the annotation key (e.g. `composefs.merged.erofs.v1.fsverity-sha512-12`). +//! +//! For manifest and config signatures, the fsverity digest is computed over the exact JSON bytes as stored in the registry. These files are stored locally with fsverity enabled so that reads are kernel-verified. +//! +//! ##### Discovery and Verification +//! +//! Discovery uses the standard [OCI Distribution Spec referrers API](https://github.com/opencontainers/distribution-spec/blob/main/spec.md#listing-referrers): +//! ``` +//! GET /v2/<name>/referrers/<digest>?artifactType=application/vnd.composefs.metadata.v1 +//! ``` +//! +//! Verification: +//! +//! 1. Check that `subject` matches the sealed image manifest digest +//! 2. Read the role-prefixed annotations from the artifact layers to learn the expected fsverity digests for the manifest, config, merged EROFS, and (if present) the boot variant, using `composefs.{role}.erofs.v1.fsverity-*` for EROFS objects and `composefs.{role}.fsverity-*` for plain JSON files +//! 3. Generate the EROFS locally from the tar layers using canonical generation +//! 4. Compute the fsverity digest of the locally generated EROFS and verify it matches the expected digest +//! 5. If signature layers are present, apply them via `FS_IOC_ENABLE_VERITY` to the EROFS files +//! +//! The kernel handles PKCS#7 validation when signatures are used; failed verification prevents reading the file. +//! +//! ``` +//! External CA/Keystore +//! ↓ issues certificate for .fs-verity keyring +//! 
PKCS#7 signatures (from artifact layers) +//! ↓ applied via FS_IOC_ENABLE_VERITY to each file +//! Manifest JSON, Config JSON, EROFS blobs +//! ↓ kernel fsverity enforcement on every read +//! Runtime file access +//! ``` +//! +//! ##### Implementation Considerations +//! +//! Kernel-level signature verification depends on Linux kernel fsverity (CONFIG_FS_VERITY, CONFIG_FS_VERITY_BUILTIN_SIGNATURES). Signature validation and file access enforcement are handled by the Linux kernel. +//! +//! When signatures are present, the manifest and config signature entries MUST also be present — there is no reason to sign the merged EROFS without also signing the manifest and config that reference it. The `merged.bootable` entry is optional and only relevant for bootable images. +//! +//! The composefs artifact carries digests and optional signatures. If an implementation uses digest-only verification (trusting the composefs digests via the manifest chain), it does not need a composefs artifact at all — the inline annotations on the image manifest (layer descriptors, config descriptor, and top-level annotations) are sufficient, and at minimum `merged` (plus `config`) must be present for that verification path. +//! +//! Clients that pull images with composefs artifacts are expected to also store the artifact locally alongside the image (it's just a small amount of metadata), and to attach the signatures to the corresponding files at the Linux kernel level. This enables offline verification and allows fsverity signatures to be applied when files are later accessed. However, local storage of the artifact is not strictly required — a client could re-fetch the artifact from the registry when needed, or operate in digest-only mode where the composefs digests themselves are trusted without kernel signature verification. +//! +//! ##### Media Types +//! +//! - `application/vnd.composefs.metadata.v1`: Artifact type for composefs metadata artifacts (digests + optional signatures) +//! 
- `application/vnd.composefs.signature.v1+pkcs7`: Layer media type for PKCS#7 DER signature blobs +//! +//! #### Inline composefs metadata +//! +//! In this mode, digests and optional signatures are embedded as annotations directly in the OCI image manifest. The main advantage is logistics: any standard OCI tool that copies the image will automatically carry the composefs metadata along, with no awareness of referrers or separate artifacts needed. The main disadvantage is that the manifest itself cannot be covered by a composefs digest or signature — adding the annotation would change the manifest bytes being signed, making the document self-referential. Trust in the manifest must therefore be established through other means, such as cosign signatures or referencing the image by a pinned digest that is itself verified out-of-band. +//! +//! When fsverity signatures are added in inline mode there is also a tighter coupling between signing and image generation: injecting the annotations changes the manifest digest, which is the most common identifier for an image. The underlying image can still be uniquely identified by its configuration digest, but tooling needs to be aware of this. +//! +//! ##### Digest-only example +//! +//! Each annotation lives on the descriptor of the object it describes. Per-layer EROFS digests go on the layer descriptor, the config fsverity digest goes on the config descriptor, and the merged EROFS digest goes in the manifest's top-level `annotations` (since there is no descriptor for the merged filesystem). +//! +//! ```json +//! { +//! "schemaVersion": 2, +//! "mediaType": "application/vnd.oci.image.manifest.v1+json", +//! "config": { +//! "mediaType": "application/vnd.oci.image.config.v1+json", +//! "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7", +//! "size": 7023, +//! "annotations": { +//! 
"composefs.config.fsverity-sha512-12": "cd34f91a2b3e5678901234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcd" +//! } +//! }, +//! "layers": [ +//! { +//! "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", +//! "digest": "sha256:9834876dcfb05cb167a5c24953eba58c4ac89b1adf57f28f2f9d09af107ee8f0", +//! "size": 32654, +//! "annotations": { +//! "composefs.layer.erofs.v1.fsverity-sha512-12": "3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686" +//! } +//! }, +//! { +//! "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", +//! "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b", +//! "size": 16724, +//! "annotations": { +//! "composefs.layer.erofs.v1.fsverity-sha512-12": "7f2b8a4e6c1d3f5a9b0e2d4c6a8f1e3b5d7c9a0b2e4f6d8a1c3e5b7d9f0a2c4e6b8d0f2a4c6e8b0d2f4a6c8e0b2d4f6a8c0e2b4d6f8a0c2e4b6d8f0a2c" +//! } +//! } +//! ], +//! "annotations": { +//! "composefs.merged.erofs.v1.fsverity-sha512-12": "d015f70f8bee6cf6453dd5b771eec18994b861c646cec18e2a9dfdec93f631fbb9030e60cfc82b552d33b9a134312a876ef4e519bffe3ef872aefbd84e6198b3" +//! } +//! } +//! ``` +//! +//! Each layer's `composefs.layer.erofs.v1.fsverity-sha512-12` covers the EROFS generated from that individual layer's tar content. The `composefs.config.fsverity-sha512-12` on the config descriptor covers the image config JSON as stored in the registry. The `composefs.merged.erofs.v1.fsverity-sha512-12` at the manifest level represents the complete flattened filesystem of all layers merged together. +//! +//! ##### Inline signatures example +//! +//! Signatures are added by appending `.sig` to the corresponding digest key. The value is a base64-encoded PKCS#7 DER blob — the same bytes that would appear raw in an artifact mode signature layer, just wrapped in base64 for transport as a JSON string. +//! +//! ```json +//! { +//! 
"schemaVersion": 2, +//! "mediaType": "application/vnd.oci.image.manifest.v1+json", +//! "config": { +//! "mediaType": "application/vnd.oci.image.config.v1+json", +//! "digest": "sha256:b5b2b2c507a0944348e0303114d8d93aaaa081732b86451d9bce1f432a537bc7", +//! "size": 7023, +//! "annotations": { +//! "composefs.config.fsverity-sha512-12": "cd34f91a2b3e5678901234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcd", +//! "composefs.config.fsverity-sha512-12.sig": "MIIEpAIBAAKCAQEA7y2W9nMmQ4rPbSTf8xHuKzJeXdCwOqVvBjPfHl2qA6uZm0tD5nSc1iEkFbGhWxLP8UoVYdNa7RjMf3pOeQ9wCqIlZvBm4tYxKnGuBcWb..." +//! } +//! }, +//! "layers": [ +//! { +//! "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", +//! "digest": "sha256:9834876dcfb05cb167a5c24953eba58c4ac89b1adf57f28f2f9d09af107ee8f0", +//! "size": 32654, +//! "annotations": { +//! "composefs.layer.erofs.v1.fsverity-sha512-12": "3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686" +//! } +//! }, +//! { +//! "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", +//! "digest": "sha256:3c3a4604a545cdc127456d94e421cd355bca5b528f4a9c1905b15da2eb4a4c6b", +//! "size": 16724, +//! "annotations": { +//! "composefs.layer.erofs.v1.fsverity-sha512-12": "7f2b8a4e6c1d3f5a9b0e2d4c6a8f1e3b5d7c9a0b2e4f6d8a1c3e5b7d9f0a2c4e6b8d0f2a4c6e8b0d2f4a6c8e0b2d4f6a8c0e2b4d6f8a0c2e4b6d8f0a2c" +//! } +//! } +//! ], +//! "annotations": { +//! "composefs.merged.erofs.v1.fsverity-sha512-12": "d015f70f8bee6cf6453dd5b771eec18994b861c646cec18e2a9dfdec93f631fbb9030e60cfc82b552d33b9a134312a876ef4e519bffe3ef872aefbd84e6198b3", +//! "composefs.merged.erofs.v1.fsverity-sha512-12.sig": "MIIEpAIBAAKCAQEA3x7V8mLkP2nQoRfT6wYsHzJdXcBvNqWuAiOeGk1pZ5tYl9sC4mRb0hDjEaFgVwKP7TnUXcMz6QiLe2oNdR8vBpHkYuAl3sXwJmFtOcZa..." +//! } +//! } +//! ``` +//! +//! A few things worth noting about inline signatures: +//! +//! 
- The `.sig` values are base64-encoded PKCS#7 DER, identical in content to the raw blobs stored in artifact mode signature layers. To apply them locally, base64-decode the value and pass the result to `FS_IOC_ENABLE_VERITY`. +//! - There is no manifest signature in inline mode. Adding a `composefs.manifest.fsverity-sha512-12` annotation would alter the manifest bytes, invalidating any digest computed before the annotation was added. +//! - For large certificate chains, the base64-encoded signature annotation may be several kilobytes. If annotation size is a concern — for example, registries or tooling that imposes limits — use artifact mode instead, where signatures travel as separate layer blobs. +//! +//! #### Whiteout Handling in Merged Filesystem +//! +//! The merged EROFS represents a fully flattened filesystem and is designed to be mounted directly, not stacked with other EROFS layers via overlayfs. During the merge process, OCI whiteouts (`.wh.*` files and opaque directory markers) are fully processed: files and directories marked for deletion in upper layers are removed from the merged result. The final merged EROFS contains no whiteout entries — it is a clean, whiteout-free snapshot of the complete filesystem tree as it would appear after all layers are applied. +//! +//! ### Runtime verification +//! +//! #### Linux kernel fsverity signatures +//! +//! The primary signature mechanism is Linux kernel [fsverity built-in signature verification](https://docs.kernel.org/filesystems/fsverity.html#built-in-signature-verification). The kernel's `FS_IOC_ENABLE_VERITY` ioctl accepts a PKCS#7 signature that is verified against the `.fs-verity` keyring. This provides a clear chain of trust: the same component that controls data access (the Linux kernel) also validates the signature. The Linux kernel has subsystems that can build on top of fsverity signatures, such as [IPE](https://docs.kernel.org/admin-guide/LSM/ipe.html) (Integrity Policy Enforcement). +//! +//! 
#### Digest-only verification +//! +//! Verifying only the digest via userspace comparison with an expected digest (e.g. chaining from trust in the manifest to trust of the included digest) still allows efficient verification of the content, and the Linux kernel based fsverity enforcement of digests of individual objects ensures that malicious or accidental modifications are detected efficiently. +//! +//! However, because the Linux kernel did not itself establish trust in the digest, kernel based security systems such as IPE above are unaware of it. +//! +//! The userspace tooling performing this verification must itself be trusted. An operating system typically establishes this trust by running from a verified base — for example a bootc container configured as a "sealed UKI", or a root filesystem protected by dm-verity. +//! +//! A key benefit of composefs is that verification of large data is on-demand and continuous via the kernel's fsverity — the composefs digest covers the complete filesystem tree, so verifying it is cheap even though the underlying data may be large. +//! +//! #### Replacing diff_id validation +//! +//! The OCI image specification requires a `diff_id` in the [image config](https://github.com/opencontainers/image-spec/blob/main/config.md) for each layer, which is the digest of the uncompressed tar stream. This is expensive to validate after extraction and provides no path to continual kernel-enforced verification. With composefs, validating `diff_id` becomes redundant: the composefs digest already cryptographically covers the complete filesystem tree derived from the layer for the purposes of a runtime mount. +//! +//! It is however still useful for clients to verify `diff_id` when pushing a tar stream to a registry, etc. +//! +//! ## Storage model +//! +//! The composefs model is to store the manifest, config and the metadata EROFS all as files with fsverity enabled. 
For OCI containers, the layer tarballs are unpacked into the object store as well, with fsverity enabled on non-inline files. +//! +//! ## Relationship to Booting with composefs +//! +//! OCI sealing is independent from but complementary to the mechanism of "sealed UKIs" that embed a `composefs=` kernel command line digest. +//! +//! It is expected that boot-sealed images would *also* be OCI sealed, although this is not strictly required. +//! +//! One possible future direction for composefs/bootc UKIs would be to instead load signing keys into the kernel fsverity chain from the initramfs (which may be the same as or different from the keys used for application images), and use the composefs artifact signature scheme for mounting the root filesystem from the initramfs. However, a mechanism to determine which filesystem root to use would also be required. +//! +//! ## Future Directions +//! +//! ### Incremental Pulls +//! +//! The composefs digest for an EROFS includes fsverity digests of all content objects, so a client can determine which objects it already has locally and only fetch the missing ones from the tar layer. A minimal object-id to tar-stream offset map (shipped in the composefs metadata artifact) would serve as a table of contents for range-based fetching. +//! +//! A key advantage over existing approaches (zstd:chunked, eStargz) is that the composefs digest eliminates the need to verify the OCI `diff_id`, which in turn eliminates the need for tar-split metadata. The tar layer becomes purely a content delivery mechanism: each fetched object is verified independently by its fsverity digest. +//! +//! To push an incrementally pulled image, the client must regenerate the tar layer deterministically. This requires a canonical tar format; see [`canonical_tar_spec`](crate::canonical_tar_spec). +//! +//! 
See [`incremental_pulls_spec`](crate::incremental_pulls_spec) for more design detail, including the composefs-chunked layer format, offset map structure, and pull protocol. +//! +//! ### Integration with zstd:chunked +//! +//! Both zstd:chunked and composefs add new digests to OCI images. The zstd:chunked table-of-contents (TOC) has high overlap with the composefs dumpfile format, as both are metadata about filesystem structure that identify files and their content. The TOC currently uses SHA256 while composefs requires fsverity. +//! +//! Adding fsverity to zstd:chunked TOC entries would allow using the TOC digest as a canonical composefs identifier. This would support a direct TOC → dumpfile → composefs pipeline, with a single metadata format serving both zstd:chunked and composefs use cases. +//! +//! ## References +//! +//! **Design discussion**: [composefs/composefs#294](https://github.com/composefs/composefs/issues/294) +//! +//! **Experimental implementations**: +//! - [composefs_experiments](https://github.com/allisonkarlitskaya/composefs_experiments) +//! - [composefs-oci-experimental](https://github.com/cgwalters/composefs-oci-experimental) +//! +//! **Related issues**: +//! - [containers/container-libs#108](https://github.com/containers/container-libs/issues/108) - fsverity in zstd:chunked TOC +//! - [containers/container-libs#112](https://github.com/containers/container-libs/issues/112) - per-layer vs flattened +//! - [composefs/composefs#409](https://github.com/composefs/composefs/issues/409) - non-root mounting +//! +//! **Standards**: +//! - [OCI Image Specification](https://github.com/opencontainers/image-spec) +//! +//! ## Contributors +//! +//! 
This specification synthesizes ideas from Colin Walters (original design proposals and iteration), Allison Karlitskaya (implementation and practical refinements), Alexander Larsson (security model and non-root mounting insights), and Giuseppe Scrivano (across the board) with assistance from Claude Sonnet 4.5 and Claude Opus 4. diff --git a/doc/plans/oci-sealing-spec.md b/doc/plans/oci-sealing-spec.md deleted file mode 100644 index 98d000bf..00000000 --- a/doc/plans/oci-sealing-spec.md +++ /dev/null @@ -1,199 +0,0 @@ -# OCI Sealing Specification for Composefs - -This document defines how composefs integrates with OCI container images to provide cryptographic verification of complete filesystem trees. The specification is based on original design discussion in [composefs/composefs#294](https://github.com/composefs/composefs/issues/294). - -## Problem Statement - -Container images need cryptographic verification that efficiently covers the entire filesystem tree without requiring re-hashing of all content. Current OCI signature mechanisms (cosign, GPG) can sign manifests, but verifying the complete filesystem tree at runtime is extremely expensive because the only known digests are those of the tar layers. - -Hence verifying the integrity of an individual file would require re-synthesizing the entire tarball (using tar-split or equivalent) and computing its digest. - -## Solution - -The core primitive of composefs is fsverity, which allows incremental online verification of individual files. The complete filesystem tree metadata is itself stored as a file which can be verified in the same way. The critical design question is how to embed the composefs digest within OCI image metadata such that external signatures can efficiently cover the entire filesystem tree. - -## Design Goals - -The OCI sealing specification aims to provide efficient verification where a signature on an OCI manifest cryptographically covers the entire filesystem tree without re-hashing content. 
The specification defines standardized metadata locations for composefs digests and supports future format evolution without breaking existing images. - -Incremental verification must be supported, enabling verification of individual layers or the complete flattened filesystem. The design accommodates both registry-provided sealed images and client-side sealing workflows while maintaining backward compatibility with existing OCI tooling and registries. - -## Core Design - -### Composefs Digest Storage - -The composefs fsverity digest is stored as a label in the OCI image config: - -```json -{ - "config": { - "Labels": { - "containers.composefs.fsverity": "sha256:a3b2c1d4e5f6..." - } - } -} -``` - -The config represents the container's identity rather than transport metadata. Manifests are transport artifacts that can vary across different distribution mechanisms. Adding the composefs label creates a new config and thus a new manifest, establishing the sealed image as a distinct artifact. This means sealing an image produces a new image with a different config digest, where the original unsealed image and sealed image coexist as separate artifacts that registries treat as distinct versions. - -### Digest Type - -The primary digest is the fs-verity digest of the EROFS image containing the merged, flattened filesystem. This digest provides fast verification at mount time through kernel fs-verity checks and is deterministic: the same input layers always produce the same EROFS digest. The digest covers the complete filesystem tree including all metadata such as permissions, timestamps, and extended attributes. - -### Merged Filesystem Representation - -The config label contains the digest of the merged, flattened filesystem. This represents the final filesystem state after extracting all layers in order, applying whiteouts (`.wh.` files), merging directories where the most-derived layer wins for metadata, and building the final composefs EROFS image. 
- -### Per-Layer Digests (Future Extension) - -Per-layer composefs digests may be added as manifest annotations: - -```json -{ - "manifests": [ - { - "layers": [ - { - "digest": "sha256:...", - "annotations": { - "containers.composefs.layer.fsverity": "sha256:..." - } - } - ] - } - ] -} -``` - -Per-layer digests enable incremental verification during pull, create caching opportunities where shared layers have known composefs digests, and enable runtime choice between flattened versus layered mounting strategies. - -### Trust Chain - -The trust chain for composefs-verified OCI images flows from external signatures through the manifest to the complete filesystem: - -``` -External signature (cosign/sigstore/GPG) - ↓ signs -OCI Manifest (includes config descriptor) - ↓ digest reference -OCI Config (includes containers.composefs.fsverity label) - ↓ fsverity digest -Composefs EROFS image - ↓ contains -Complete merged filesystem tree -``` - -## Verification Process - -Verification begins by fetching the manifest from the registry and verifying the external signature on the manifest. The config descriptor is extracted from the manifest, and the config is fetched and verified to match the descriptor digest. The `containers.composefs.fsverity` label is extracted from the config, and the composefs image is mounted with fsverity verification. The kernel verifies the EROFS matches the expected fsverity digest. - -The security property is that signature verification happens once, while filesystem verification is delegated to kernel fs-verity with lazy or eager verification depending on mount options. - -## Metadata Schema - -### Config Labels - -The image config contains the following labels: - -The `containers.composefs.fsverity` label (string) contains the fsverity digest of the merged composefs EROFS in the format `:` where algorithm is `sha256` or `sha512`. - -The `containers.composefs.version` label (string, optional) contains the seal format version such as `1.0`. 
-
-### Descriptor Annotations
-
-A descriptor may have the following annotation:
-
-The `containers.composefs.layer.fsverity` annotation (string, optional) contains the fsverity digest of that individual layer.
-
-### Label versus Annotation Semantics
-
-Config labels store the authoritative digest because the config represents container identity while the manifest is a transport artifact. Labels are part of the container specification and create a new artifact (sealed image) rather than mutating metadata. Manifest annotations are retained for discovery purposes, allowing registries to identify sealed images without parsing configs and enabling clients to optimize pull strategies.
-
-## Verification Modes
-
-### Eager Verification
-
-Eager verification occurs during image pull. The composefs image is immediately created and its digest is verified against the config label. This makes the container ready to mount immediately after pull and is suitable for boot scenarios where operations should be read-only.
-
-### Lazy Verification
-
-Lazy verification defers composefs creation until first mount. The pull operation stores layers and config but doesn't build the composefs image. On mount, the composefs image is built and verified against the label. This mode is suitable for application containers where many images may be pulled but only some are actually used.
-
-## Security Model
-
-### Registry-Provided Sealed Images
-
-For images sealed by the registry or vendor, the seal is computed during the build process and the seal label is embedded in the published config. An external signature covers the manifest. Clients verify the chain: signature → manifest → config → composefs. Trust is placed in the image producer and the signature key.
-
-### Client-Sealed Images
-
-For images sealed locally by the client, the client pulls an image that may be unsigned and computes the seal locally. The client stores the sealed config in its local repository. On boot or mount, the client can re-fetch the manifest from the network to verify freshness. Trust is placed in the network fetch (TLS) and local verification.
-
-## Attack Mitigation
-
-### Digest Mismatch
-
-If a config label doesn't match the actual EROFS, the mount operation fails the fsverity check. Verification APIs can detect this condition before mounting.
-
-### Signature Bypass
-
-Any attempt to modify the config label without updating the signature fails because the signature covers the manifest, which covers the config digest. Any config change produces a new digest, breaking the signature chain.
-
-### Rollback Attack
-
-For application containers, re-fetching the manifest on boot checks for freshness. For host systems, embedding the manifest in the boot artifact prevents rollback.
-
-### Layer Confusion
-
-Per-layer fsverity annotations allow verification before merging. Implementations that maintain digest maps can link layer SHA256 digests to fsverity digests.
-
-## Relationship to Booting with composefs
-
-OCI sealing is independent from but complementary to composefs boot verification (UKI, BLS, etc.). These are separate mechanisms operating at different stages of the system lifecycle with different trust models.
-
-OCI sealing provides runtime verification of container images distributed through registries. The trust chain typically flows from external signatures (cosign, GPG) through OCI manifests to composefs digests.
-
-Boot verification is designed to be rooted in extant hardware mechanisms such as Secure Boot. The composefs digest is embedded directly in boot artifacts (UKI `.cmdline` section, BLS entry `options` field) and verified during early boot by the initramfs.
-
-These mechanisms work together in a complete workflow where a sealed OCI image can be pulled from a registry, verified through OCI sealing, and then used to build a boot artifact with the composefs digest embedded for boot verification. However, each mechanism operates independently with its own trust anchor and threat model.
-
-## Future Directions
-
-### Dumpfile Digest as Canonical Identifier
-
-The fsverity digest ties implementations to a specific EROFS format. A dumpfile digest (SHA256 of the composefs dumpfile format) would enable format evolution. This would be stored as an additional label `containers.composefs.dumpfile.sha256` alongside the fsverity digest.
-
-The dumpfile format is format-agnostic, meaning the same dumpfile can generate different EROFS versions. This simplifies standardization since the dumpfile format is simpler than EROFS and provides future-proofing to migrate to composefs-over-squashfs or other formats.
-
-The challenge is that verification becomes slower as it requires parsing a saved EROFS from disk to dumpfile format. Caching the dumpfile digest to fsverity digest mapping introduces complexity and security implications. A use case split might apply dumpfile digests to application containers (for format flexibility) while using fsverity digests for host boot (for speed with minimal skew).
-
-### Integration with zstd:chunked
-
-Both zstd:chunked and composefs add new digests to OCI images. The zstd:chunked table-of-contents (TOC) has high overlap with the composefs dumpfile format, as both are metadata about filesystem structure that identify files and their content. The TOC currently uses SHA256 while composefs requires fsverity.
-
-Adding fsverity to zstd:chunked TOC entries would allow using the TOC digest as a canonical composefs identifier. This would support a direct TOC → dumpfile → composefs pipeline, with a single metadata format serving both zstd:chunked and composefs use cases.
-
-### Three-Digest Model
-
-To support both flattened and layered mounting strategies, three digests could be stored per image: a base image digest, a derived layers digest, and a flattened digest. This would enable mounting a single flattened composefs for speed, mounting base and derived separately to avoid metadata amplification, or verifying the base from upstream while only rebuilding derived layers. This aligns with the existing `org.opencontainers.image.base.digest` standard.
-
-## References
-
-**Design discussion**: [composefs/composefs#294](https://github.com/composefs/composefs/issues/294)
-
-**Experimental implementations**:
-- [composefs_experiments](https://github.com/allisonkarlitskaya/composefs_experiments)
-- [composefs-oci-experimental](https://github.com/cgwalters/composefs-oci-experimental)
-
-**Related issues**:
-- [containers/container-libs#108](https://github.com/containers/container-libs/issues/108) - fsverity in zstd:chunked TOC
-- [containers/container-libs#112](https://github.com/containers/container-libs/issues/112) - per-layer vs flattened
-- [composefs/composefs#409](https://github.com/composefs/composefs/issues/409) - non-root mounting
-
-**Standards**:
-- [OCI Image Specification](https://github.com/opencontainers/image-spec)
-- [Canonical JSON](https://wiki.laptop.org/go/Canonical_JSON)
-
-## Contributors
-
-This specification synthesizes ideas from Colin Walters (original design proposals and iteration), Allison Karlitskaya (implementation and practical refinements), and Alexander Larsson (security model and non-root mounting insights). Significant assistance from Claude Sonnet 4.5 was used in synthesis.