diff --git a/README.md b/README.md index 6121f789..55ed5112 100644 --- a/README.md +++ b/README.md @@ -34,76 +34,32 @@ and Linux kernel integration, but with the *flexibility* of files for content — avoiding doubled disk usage, partition table management, and similar headaches. -### Separation between metadata and data - -A key aspect of composefs is its separation of "data" (non-empty regular -files) from "metadata" (everything else: directories, symlinks, permissions, -ownership, etc.). - -composefs produces an [EROFS](https://erofs.docs.kernel.org) filesystem -image that contains only metadata. The non-empty data files live in a -separate "backing store" directory. The EROFS image includes -`trusted.overlay.redirect` extended attributes that tell the overlayfs -mount how to find the real underlying files. - -### Shared backing store - -The primary use case for composefs is versioned, immutable filesystem -trees — container images and bootable host systems — where multiple -images may share parts of their storage. - -By storing files content-addressed (named by the hash of their content), -shared files need to be stored only once on disk yet can appear in -multiple mounts. Crucially, these data files are also shared in the -[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache), -allowing multiple running container images to reliably share memory. - -### Filesystem integrity - -composefs supports [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html) -validation of content files. The digest of each content file is stored -in the EROFS image via `trusted.overlay.metacopy` extended attributes, -which overlayfs validates when the file is accessed. This means backing -content cannot be changed (by mistake or by malice) without detection. - -You can also enable fs-verity on the image file itself and pass the expected -digest as a mount option. 
This provides full trust of both data and metadata, -solving a weakness of fs-verity alone (which can only verify file data, -not metadata like permissions, ownership, or directory structure). +composefs separates metadata (directories, permissions, xattrs) from data +(file content). An EROFS image carries only the metadata; data files live in +a content-addressed backing store, shared across images and in the Linux +[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache). +Optional [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html) +provides end-to-end integrity verification of both data and metadata. +For design details, see the [crate documentation](https://docs.rs/composefs). ## Use cases ### Container images -For [OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md) -container images, a common approach (used by both Docker and Podman) is -to untar each layer separately and use overlayfs to stitch them together. -composefs improves on this by storing file content in a content-addressed -fashion, allowing sharing between images even when metadata like -timestamps or ownership differs. - -Combined with approaches like -[zstd:chunked](https://github.com/containers/storage/pull/775), -this speeds up pulling container images and avoids redundantly -creating files that are already present. +composefs improves on the traditional per-layer overlayfs model for +[OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md) +container images by storing file content in a content-addressed store, +enabling sharing between images and faster pulls via +[zstd:chunked](https://github.com/containers/storage/pull/775). ### Bootable host systems -Anywhere one wants versioned immutable filesystem trees ("images"), -composefs provides compelling advantages. In particular, this project -aims to be the successor to [OSTree](https://github.com/ostreedev/ostree/). 
- -OSTree uses a content-addressed object store, but traditionally checks out -into a regular directory (using hardlinks), which is then bind-mounted as -the rootfs. While OSTree supports enabling fs-verity on files in the store, -nothing protects the checkout directories from modification. - -composefs replaces this checkout with a directly-mountable image pointing -into the object store. We can enable fs-verity on the composefs image and -embed its digest in the kernel commandline or a Unified Kernel Image (UKI). -Since composefs generation is reproducible, we can verify the generated -image is correct by comparing its digest to one in the metadata produced -at build time. For more on this, see [this tracking issue](https://github.com/ostreedev/ostree/issues/2867). +composefs aims to succeed [OSTree](https://github.com/ostreedev/ostree/) +by replacing hardlink checkouts with directly-mountable images backed by a +shared object store. Combined with fs-verity and a digest embedded in the +kernel commandline or a UKI, this provides cryptographic verification of +the entire filesystem tree. See [this tracking issue](https://github.com/ostreedev/ostree/issues/2867) +for background. ## Components @@ -147,9 +103,7 @@ helper that supports `mount -t composefs` syntax directly. ## Documentation - - [Repository format](doc/repository.md) - - [OCI integration](doc/oci.md) - - [Splitstream format](doc/splitstream.md) + - [API and design documentation](https://docs.rs/composefs) - [Examples README](examples/README.md) ## Status diff --git a/crates/composefs-boot/src/design.rs b/crates/composefs-boot/src/design.rs new file mode 100644 index 00000000..70119a47 --- /dev/null +++ b/crates/composefs-boot/src/design.rs @@ -0,0 +1,90 @@ +//! # Booting from a composefs image +//! +//! This document describes how composefs-rs sets up the root filesystem during +//! early boot. It covers the kernel command-line interface, the expected on-disk +//! 
layout, kernel requirements, and the step-by-step mount sequence performed by +//! `composefs-setup-root`. +//! +//! The target audience is system integrators and OS developers who are packaging a +//! bootable system using composefs. Familiarity with Linux mount namespaces, +//! overlayfs, and fs-verity is assumed. +//! +//! ## Kernel command-line +//! +//! The initramfs code in composefs supports multiple kernel arguments; it +//! is possible to pre-compute the digest of an image using both e.g. SHA-256 and +//! SHA-512. On an installed system, the repository only supports one digest +//! by default today, and the first found will be selected. +//! +//! Additionally, it is opt-in to enable v1 EROFS, and again the first compatible +//! version will be found. +//! +//! ```text +//! composefs.digest=v1-sha256-12: # V1 EROFS image (preferred; RHEL9-era kernels) +//! composefs.digest=v1-sha512-12: # V1 EROFS image (SHA-512 variant) +//! composefs.digest=v2-sha512-12: # V2 EROFS image (explicit form) +//! composefs= # V2 EROFS image (legacy shorthand) +//! ``` +//! +//! The value format is `--:`, where +//! `` is `v1` or `v2`, `` is `sha256` or `sha512`, and +//! `` is the log2 block size (currently always `12`, i.e. 4096 +//! bytes). This mirrors how `meta.json` encodes the algorithm as +//! `fsverity-sha256-12`. +//! +//! `composefs.digest=` is checked first. Multiple entries may appear on the cmdline +//! (one per format/algorithm combination); the initramfs tries each in order and +//! mounts the first image that actually exists in the repository. +//! +//! `composefs=` is a legacy shorthand equivalent to +//! `composefs.digest=v2--12:` -- the algorithm is inferred from the +//! digest length (64 hex chars -> SHA-256, 128 -> SHA-512). It is checked only when +//! no `composefs.digest=` token matches. +//! +//! **Insecure mode.** Placing `?` immediately after `=` (e.g. +//! `composefs.digest=?v1-sha256-12:` or `composefs=?`) makes +//! 
fs-verity verification optional. The system will boot even when the underlying +//! filesystem does not support fs-verity or the image has no verity metadata +//! attached. This mode exists for development and testing only; it must not be used +//! in production. +//! +//! ## On-disk layout +//! +//! The composefs repository must be present at `/sysroot/composefs` with the +//! standard layout described in the `composefs::repository_format` module. +//! +//! The digest must correspond to a symlink under `images/`. +//! +//! Persistent per-deployment state lives at `/sysroot/state/deploy//`, +//! where `` matches the boot karg digest exactly. The `etc/` and `var/` +//! subdirectories within that directory serve as the upper layers for the +//! corresponding overlayfs mounts. +//! +//! ## Kernel requirements +//! +//! The following kernel features must be available: +//! +//! - **EROFS** filesystem driver (`CONFIG_EROFS_FS`) +//! - **overlayfs** with `metacopy=on` and `redirect_dir=on` +//! (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`) +//! - **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`) +//! - The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`), +//! available since kernel 5.2. Kernel >= 6.15 is required for the atomic root +//! replacement path (the default build). On kernels without `fsconfig_set_fd` +//! support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created +//! automatically by `composefs::mountcompat`. +//! +//! ## Kernel argument +//! +//! The boot karg (`composefs.digest=` or `composefs=`) is the authoritative selector for which image is booted. +//! Without the `?` insecure prefix, every file access through the overlayfs is +//! verified against the object's stored digest by the kernel, combining fs-verity +//! on the data objects with overlayfs `verity=require`. +//! +//! ## Other notes +//! +//! As a workaround for a GPT auto-root issue in systemd +//! 
([systemd#35017](https://github.com/systemd/systemd/issues/35017)), +//! `composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a +//! symlink pointing to the real block device before performing any mounts. Failure +//! to do so is non-fatal and does not abort the boot sequence. diff --git a/crates/composefs-boot/src/lib.rs b/crates/composefs-boot/src/lib.rs index 11e5cc33..db57cc6c 100644 --- a/crates/composefs-boot/src/lib.rs +++ b/crates/composefs-boot/src/lib.rs @@ -15,6 +15,9 @@ pub mod selabel; pub mod uki; pub mod write_boot; +#[cfg(doc)] +pub mod design; + use std::ffi::OsStr; use anyhow::Result; diff --git a/crates/composefs-oci/src/design.rs b/crates/composefs-oci/src/design.rs new file mode 100644 index 00000000..30be8842 --- /dev/null +++ b/crates/composefs-oci/src/design.rs @@ -0,0 +1,127 @@ +//! # How to create a composefs from an OCI image +//! +//! This document is incomplete. It only serves to document some decisions we've +//! taken about how to resolve ambiguous situations. +//! +//! # Data precision +//! +//! We currently create a composefs image using the granularity of data as +//! typically appears in OCI tarballs: +//! - atime and ctime are not present (these are actually not physically present +//! in the erofs inode structure at all, either the compact or extended forms) +//! - mtime is set to the mtime in seconds; the sub-seconds value is simply +//! truncated (ie: we always round down). erofs has an nsec field, but it's not +//! normally present in OCI tarballs. That's down to the fact that the usual +//! tar header only has timestamps in seconds and extended headers are not +//! usually added for this purpose. +//! - we take great care to faithfully represent hardlinks: even though the +//! produced filesystem is read-only and we have data de-duplication via the +//! objects store, we make sure that hardlinks result in an actual shared inode +//! 
as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem. +//! +//! We apply these precision restrictions also when creating images by scanning the +//! filesystem. For example: even if we get more-accurate timestamp information, +//! we'll truncate it to the nearest second. +//! +//! # Merging directories +//! +//! This is done according to the OCI spec, with an additional clarification: in +//! case a directory entry is present in multiple layers, we use the tar metadata +//! from the most-derived layer to determine the attributes (owner, permissions, +//! mtime) for the directory. +//! +//! # The root inode +//! +//! The root inode (/) is a difficult case because OCI container layer tars often +//! don't include a root directory entry, and when they do, container runtimes +//! (Podman, Docker) ignore it and use hardcoded defaults. For example, Podman's +//! [containers/storage](https://github.com/containers/storage) uses root:root +//! ownership, mode `0555`, and epoch (0) mtime when extracting layers, but +//! Docker uses `0755`. In general, the metadata for `/` is not defined. +//! +//! Because composefs requires (has a goal of providing) precise cryptographically +//! verifiable filesystem trees, we solve this for OCI by copying the metadata from `/usr` +//! to the root directory. The rationale is that `/usr` is always present in +//! standard filesystem layouts and must be defined explicitly in the OCI layers. +//! +//! This is implemented via the `copy_root_metadata_from_usr()` method and the +//! `read_container_root()` convenience function. +//! +//! When building a filesystem from OCI layers programmatically, use +//! `Stat::uninitialized()` to create the initial `FileSystem`. This placeholder +//! has mode `0` (obviously invalid) to make it clear that the root metadata should +//! be set before computing digests - typically by calling +//! `copy_root_metadata_from_usr()` after processing all layers. +//! +//! 
# Extended attributes (xattrs) +//! +//! When reading a container filesystem from a mounted root (as opposed to +//! processing OCI layer tars directly), host-side xattrs can leak into the +//! image. This is particularly problematic for `security.selinux` labels: +//! if SELinux is enabled at build time, files will have labels like +//! `container_t` that come from the build host, not from the target system's +//! policy. +//! +//! To ensure reproducibility, `read_container_root()` filters xattrs to only +//! include those in an allowlist. Currently this is just `security.capability`, +//! which represents actual file capabilities that should be preserved. +//! +//! SELinux labels are handled separately by `transform_for_boot()`: +//! - If the target filesystem contains a SELinux policy (in `/etc/selinux`), +//! all files are relabeled according to that policy +//! - If no SELinux policy is found, all `security.selinux` xattrs are stripped +//! +//! This ensures that: +//! - Build-time SELinux labels don't leak into non-SELinux targets +//! - SELinux-enabled targets get correct labels from their own policy +//! - Other host xattrs (overlayfs internals, etc.) don't pollute the image +//! +//! See: +//! +//! # The /run directory +//! +//! When processing OCI images via `create_filesystem()`, the `/run` directory +//! is emptied if present. This is a tmpfs at runtime and should always be +//! empty in images. Its mtime is set to match `/usr` for consistency with +//! how root directory metadata is handled. +//! +//! This makes it possible to work around podman/buildah's `RUN --mount` issue where cache +//! mounts can leave incomplete directory entries in OCI tar layers (directories +//! without explicit tar entries inherit incorrect mtimes) by pointing all +//! such mounts into `/run`, and then redirecting from their final location +//! via e.g. symlinks into `/run`. +//! +//! ## Container build cache mounts +//! +//! 
A practical implication of emptying `/run` is that container authors can +//! use it for cache mounts without worrying about polluting the final image. +//! +//! Instead of: +//! ```dockerfile +//! RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ... +//! ``` +//! +//! Consider: +//! ```dockerfile +//! RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf +//! RUN --mount=type=cache,target=/run/dnfcache dnf install -y ... +//! ``` +//! +//! This avoids potential mtime inconsistencies in `/var/cache` while still +//! benefiting from build caching. +//! +//! See: +//! +//! # Emptied directories for boot +//! +//! When preparing a filesystem for boot via `transform_for_boot()`, certain +//! additional directories are emptied because their contents should not be +//! part of the final verified image: +//! +//! - `/boot`: Contains the UKI which embeds the composefs digest, so including +//! it would create a circular dependency +//! - `/sysroot`: Only has content in ostree-container cases, and traversing +//! it for SELinux labeling causes problems +//! +//! These directories are emptied and their mtime is set to match `/usr` for +//! consistency with how the root directory metadata is handled. diff --git a/crates/composefs-oci/src/lib.rs b/crates/composefs-oci/src/lib.rs index 807ae966..0aefd575 100644 --- a/crates/composefs-oci/src/lib.rs +++ b/crates/composefs-oci/src/lib.rs @@ -35,6 +35,9 @@ pub mod tar; #[doc(hidden)] pub mod test_util; +#[cfg(doc)] +pub mod design; + // Re-export the composefs crate for consumers who only need composefs-oci pub use composefs; diff --git a/crates/composefs-ostree/src/design.rs b/crates/composefs-ostree/src/design.rs new file mode 100644 index 00000000..1270e410 --- /dev/null +++ b/crates/composefs-ostree/src/design.rs @@ -0,0 +1,139 @@ +//! # OSTree +//! +//! composefs-rs has support for importing images from OSTree +//! repositories, by pulling from local or remote OSTree +//! repositories. 
These images can then be mounted as composefs images, +//! sharing disk (deduplication) with other ostree or other types of +//! images in the composefs repository. +//! +//! Native OSTree repositories are a format similar to a composefs +//! repository, but not quite the same. This means we need some +//! conversions when handling ostree commits in a composefs repository. +//! +//! OSTree images (commits) are fundamentally made up of many small sha256 +//! content-addressed objects that reference each other. Each commit is +//! the root of a DAG that defines the total image. Some of the OSTree +//! objects are metadata like directory permissions, or list of files in a +//! directory. These don't really exist in composefs where all metadata is +//! part of the erofs image. However, some objects are large file objects, +//! and these are similar to the file objects in composefs +//! images. However, even these differ, because the checksum defining the +//! object is made up of both the file content and the file metadata. +//! +//! When an OSTree commit is stored in a composefs repo it is stored as a +//! single splitstream file, named `ostree-commit-$commit_id`, which uses +//! external object references to all the file content objects that will +//! be used when creating an erofs image for it. This means OSTree objects +//! for files that would be inlined in the erofs image will not be +//! external objects. +//! +//! OSTree commit splitstream objects are created during a pull operation +//! and are used for two things: creating a composefs image by walking the +//! DAG, and serving as a source of already available OSTree objects during +//! a pull operation. Such sources are found automatically during pull +//! (e.g. parent commit, or old commit for a ref being pulled) or can be +//! manually specified. +//! +//! ## File format +//! +//! This describes the format of the `ostree-commit-$commit_id` files. +//! +//! ### Splitstream header +//! +//! 
Since the commit file is a split stream it starts with the splitstream +//! headers. Of these we use two, the named refs and the object +//! refs: +//! +//! * When an erofs image is created for the commit, it is referenced by +//! the `composefs.image` named ref. +//! +//! * Any external file content objects are in the external_refs +//! table. The index of the references in this header table is used to +//! refer to the file in the splitstream itself. +//! +//! The splitstream content type used for commits is 0xAFE138C18C463EF1. +//! +//! ### Splitstream content +//! +//! A splitstream is normally a series of internal and external chunks, +//! but the ostree commit uses only one inline chunk. This chunk is +//! basically a serialized form of the "objects" directory of an OSTree +//! repository. I.e. it has a mapping of sha256 to ostree object data. +//! All objects except file objects are stored in the standard ostree +//! object format. +//! +//! OSTree file objects are stored in the archive-z2 format, except not +//! compressed, and optionally the file content part of it may be stored +//! as referencing the index of an external object. The z2 format is, +//! first an 8-byte header that gives the size (in bytes) of a gvariant, +//! then comes the gvariant with the file meta in +//! OSTREE_ZLIB_FILE_HEADER_GVARIANT_FORMAT format, and then the +//! file/symlink inline data. If an external object is referenced for the +//! object then it is expected that there is no inline file data. +//! +//! The high level view of the file looks like this: +//! ```text +//! +---------------+ +//! | Header | +//! +---------------| +//! | Object IDs | +//! +---------------| +//! | Object Info | +//! +---------------| +//! | Content | +//! +---------------+ +//! ``` +//! +//! The Object IDs is a sorted array of sha256 digests, and you would do +//! lookups in it using a binary search. The buckets in the header can be +//! 
used to quickly limit the binary search based on the first byte of a +//! digest. +//! +//! Then, at the same index as the object found by the binary search, you can look up +//! the object info which gives you the offset/length of the object +//! content data and optionally a reference to an external object. +//! +//! The exact form of the data looks like this, packed in order from the +//! start of the splitstream content. All integers are little-endian. +//! +//! ### Header +//! ```text +//! +-----------------------------------+ +//! | u32: index of commit object | +//! | u32: flags (currently unused) | +//! | [u32; 256]: end index of bucket | +//! +-----------------------------------+ +//! ``` +//! +//! The bucket list contains the end index (in the object ids table) of +//! objects starting with that particular byte, and can be used to quickly +//! limit the search. We can also compute the total number of objects +//! (n_objects) by looking in the last bucket. +//! +//! ### Object ids +//! ```text +//! n_objects x +//! +-----------------------------------+ +//! | [u8; 32] ostree object id | +//! +-----------------------------------+ +//! ``` +//! +//! ### Object Info +//! ```text +//! n_objects x +//! +-----------------------------------+ +//! | u32: Offset to per-object data | +//! | u32: Length of per-object data | +//! | u32: Index of external object ref | +//! | or MAXUINT32 if none. | +//! +-----------------------------------+ +//! ``` +//! +//! This is an array of information for each object. Once you have found +//! the object id in the object ids table, you would look at the same +//! index in this table to find the information. Offsets to per-object +//! data are in bytes from the start of the content area, which starts at +//! the end of the Object Info table. All data chunk references are +//! aligned to 8 bytes with respect to the start of the content area. +//! This is useful because GVariants (used by ostree) naturally want +//! 8-byte alignment. 
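The bucket-limited lookup described in the ostree commit format above can be sketched roughly as follows. This is a minimal illustration, not the code in `composefs-ostree`: the `CommitIndex` type, its field names, and its methods are hypothetical. It also assumes each bucket entry is an exclusive end index, which is inferred from the statement that the last bucket yields `n_objects`.

```rust
/// Number of buckets: one per possible first byte of a digest.
const N_BUCKETS: usize = 256;

/// Hypothetical in-memory view of the header and object id table.
/// Names are illustrative and not taken from composefs-rs.
struct CommitIndex {
    /// `buckets[b]` is the (assumed exclusive) end index in `object_ids`
    /// of the ids whose first byte is `b`.
    buckets: [u32; N_BUCKETS],
    /// Sorted SHA-256 object ids.
    object_ids: Vec<[u8; 32]>,
}

impl CommitIndex {
    /// Build the bucket table from a list of ids (sorted here for convenience).
    fn new(mut object_ids: Vec<[u8; 32]>) -> Self {
        object_ids.sort();
        let mut buckets = [0u32; N_BUCKETS];
        // First count how many ids start with each byte value.
        for id in &object_ids {
            buckets[id[0] as usize] += 1;
        }
        // Then turn the per-byte counts into running end indices.
        let mut end = 0u32;
        for b in buckets.iter_mut() {
            end += *b;
            *b = end;
        }
        Self { buckets, object_ids }
    }

    /// Total number of objects, read from the last bucket.
    fn n_objects(&self) -> usize {
        self.buckets[N_BUCKETS - 1] as usize
    }

    /// Return the index of `id` in the object id table, if present.
    /// The same index would then be used in the Object Info table.
    fn lookup(&self, id: &[u8; 32]) -> Option<usize> {
        let first = id[0] as usize;
        // A bucket starts where the previous bucket ends.
        let start = if first == 0 { 0 } else { self.buckets[first - 1] as usize };
        let end = self.buckets[first] as usize;
        self.object_ids[start..end]
            .binary_search(id)
            .ok()
            .map(|i| start + i)
    }
}
```

The binary search only ever touches the slice of ids that share the first byte of the digest being looked up, so an empty bucket answers "not present" without any comparisons.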
diff --git a/crates/composefs-ostree/src/lib.rs b/crates/composefs-ostree/src/lib.rs index e292188c..9c9653c9 100644 --- a/crates/composefs-ostree/src/lib.rs +++ b/crates/composefs-ostree/src/lib.rs @@ -29,6 +29,8 @@ pub struct CommitInfo { } mod commit; +#[cfg(doc)] +pub mod design; mod ostree; mod pull; mod repo; diff --git a/crates/composefs/src/erofs_format.rs b/crates/composefs/src/erofs_format.rs new file mode 100644 index 00000000..f939bd2c --- /dev/null +++ b/crates/composefs/src/erofs_format.rs @@ -0,0 +1,82 @@ +//! # composefs EROFS image format +//! +//! composefs images are EROFS filesystem images with composefs-specific extensions. They encode +//! a directory tree where regular files are stored externally in a content-addressed object store +//! and referenced by their fs-verity digest. The EROFS image itself carries only metadata: inodes, +//! directory entries, extended attributes, and chunk index entries that point to the external files. +//! +//! composefs-rs supports two EROFS format versions. V1 is byte-for-byte compatible with the C +//! `mkcomposefs` tool. V2 is the composefs-rs native format and drops several V1 constraints +//! that exist only for C compatibility. +//! +//! `cfsctl init` defaults to V2; pass `--erofs-version 1` to select V1. Higher-level tools +//! such as bootc initialize repositories with multiple formats enabled (V1 primary) so that images +//! can be booted on RHEL9-era kernels that require the `composefs.digest=` karg. +//! +//! ## Format V1 +//! +//! V1 is selected with `cfsctl init --erofs-version 1`. The `v1_erofs` ro-compat feature flag +//! is written to `meta.json` so that tools without V1 support open the repository read-only. +//! +//! **`composefs_version` field values in V1:** +//! +//! - `0` — no user-visible whiteout files (character devices with rdev=0) in the tree +//! - `1` — at least one user-visible whiteout file is present +//! +//! 
The constant `COMPOSEFS_VERSION_V1` is 0; the field only reaches 1 when user whiteouts are +//! found. The `--min-version` flag in `mkcomposefs` (mirrored by `mkfs_erofs_v1_min_version`) +//! forces the value to 1 even when no user whiteouts exist, for forward compatibility. +//! +//! **Inode layout:** V1 uses compact inodes (32 bytes) when the file data and inode fit within +//! the constraints of the compact format, and extended inodes (64 bytes) otherwise. +//! +//! **Inode traversal order:** V1 collects inodes in breadth-first order — all entries at one +//! directory level before descending. +//! +//! **Whiteout stub table:** V1 includes 256 synthetic inode entries at the start of the inode +//! area, one per two-hex-character prefix `00`–`ff`. Each entry is a character-device stub +//! (chr 0,0) used by the overlay filesystem to resolve whiteout paths against the object store. +//! V2 omits them entirely. +//! +//! **Whiteout escaping:** User-visible whiteout files (chr 0,0) in the tree are not stored as +//! character devices on disk. Instead they receive a `trusted.overlay.opaque=x` xattr and are +//! serialized differently. The stub entries in the whiteout table are not escaped. +//! +//! **`build_time`:** The superblock `build_time` field is set to the minimum mtime across all inodes. +//! +//! **xattr sharing:** Xattr entries are deduplicated using a sort key that is the full xattr name (prefix string concatenated with the suffix). +//! +//! ## Format V2 — Created in composefs-rs +//! +//! V2 is the default for repositories created with `cfsctl init` without `--erofs-version 1`. +//! +//! **`composefs_version` field:** Always `2` (the constant `COMPOSEFS_VERSION`). +//! +//! **Inode layout:** V2 always uses extended inodes (64 bytes). +//! +//! **Inode traversal order:** V2 collects inodes in depth-first order — all descendants of a directory before moving to the next sibling. +//! +//! 
**No whiteout stub table:** V2 has no synthetic stub entries; whiteout files are stored directly without escaping. +//! +//! **`build_time`:** Always 0. +//! +//! **xattr sharing:** Xattr entries are deduplicated using a sort key of (prefix, suffix, value) +//! rather than the full name string, which can produce a smaller shared xattr area. +//! +//! ## Selecting the format +//! +//! The format is fixed at repository initialization time and cannot be changed afterward. +//! +//! ```text +//! cfsctl init # V2 (default) +//! cfsctl init --erofs-version 1 # V1 (C-tool compatible) +//! ``` +//! +//! The format is recorded in `meta.json` (see [`repository_format`][crate::repository_format]) as the `v1_erofs` ro-compat feature flag: present +//! means V1, absent means V2. Tools that do not recognize this flag open the repository +//! read-only rather than writing images in the wrong format. +//! +//! For the standalone `mkcomposefs` tool, the equivalent flag is `--erofs-version`. The +//! `--min-version` flag (`mkfs_erofs_v1_min_version` in the Rust API) controls whether the +//! `composefs_version` field starts at 0 or 1 in V1 images regardless of whether user whiteouts +//! are present. diff --git a/crates/composefs/src/lib.rs b/crates/composefs/src/lib.rs index 806f3c8f..468090e3 100644 --- a/crates/composefs/src/lib.rs +++ b/crates/composefs/src/lib.rs @@ -1,8 +1,43 @@ -//! Rust bindings and utilities for working with composefs images and repositories. +//! # composefs: The reliability of disk images, the flexibility of files //! -//! Composefs is a read-only FUSE filesystem that enables efficient sharing -//! of container filesystem layers by using content-addressable storage -//! and fs-verity for integrity verification. +//! composefs combines several Linux kernel features to provide read-only +//! mountable filesystem trees that stack on top of a conventional "lower" +//! filesystem. +//! +//! ## Interfaces +//! +//! 
composefs offers two programmatic interfaces: +//! +//! - **Rust API** — this crate and its siblings (`composefs-oci`, +//! `composefs-boot`, etc.), usable as regular Cargo dependencies. +//! - **Varlink API** — a [varlink](https://varlink.org) RPC interface +//! exposed by `cfsctl varlink` over a Unix socket, accessible from +//! any language. See the [`varlink`] module for examples. +//! +//! Neither interface is declared stable yet. Both may change across +//! releases while the project is under active development. +//! +//! ## Key technologies +//! +//! - **[overlayfs]** — the kernel mount interface that exposes the composed tree +//! - **[EROFS]** — an in-kernel read-only filesystem for the metadata tree +//! (directories, symlinks, permissions, xattrs) with no file data +//! - **[fs-verity]** (optional) — per-file integrity verification on the +//! backing store, validated by overlayfs at access time +//! +//! [overlayfs]: https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt +//! [EROFS]: https://erofs.docs.kernel.org +//! [fs-verity]: https://www.kernel.org/doc/html/next/filesystems/fsverity.html +//! +//! ## Design +//! +//! composefs produces an EROFS image containing *only* metadata. Non-empty +//! data files live in a content-addressed backing store, with +//! `trusted.overlay.redirect` xattrs telling overlayfs where to find them. +//! Identical files across images are stored once on disk and shared in the +//! Linux page cache. +//! +//! See the [`repository_format`] module for the on-disk layout. 
#![forbid(unsafe_code)] // This is a library: emit diagnostics via the `log` crate (or return them), @@ -25,9 +60,16 @@ pub mod splitstream; pub mod tree; pub mod util; +#[cfg(doc)] +pub mod erofs_format; pub mod generic_tree; +#[cfg(doc)] +pub mod repository_format; +#[cfg(doc)] +pub mod splitstream_format; #[cfg(any(test, feature = "test"))] pub mod test; +pub mod varlink; /// Files with this many bytes or fewer are stored inline in the erofs image /// (and in splitstreams). Files above this threshold are written to object diff --git a/crates/composefs/src/repository_format.rs b/crates/composefs/src/repository_format.rs new file mode 100644 index 00000000..4396dede --- /dev/null +++ b/crates/composefs/src/repository_format.rs @@ -0,0 +1,317 @@ +//! # composefs repository design +//! +//! This document describes the current on-disk layout of a composefs repository. +//! +//! At this time, the composefs-rs repository format is not declared stable. +//! +//! ## Location +//! +//! A composefs repository is a directory located anywhere. The location is chosen +//! for the `cfsctl` command as follows: +//! +//! - `--repo` can specify an arbitrary directory +//! +//! - if `--user` is specified (default if the current uid is not 0), then the +//! repository defaults to `~/.var/lib/composefs`. +//! +//! - if `--system` is specified (default if the current uid is 0), then the +//! repository defaults to `/sysroot/composefs`. +//! +//! ## Layout +//! +//! A composefs repository has a layout that looks something like +//! +//! ```text +//! composefs +//! ├── meta.json +//! ├── objects +//! │   ├── 00 +//! │   │   ├── 002183fb91[...] +//! │   │   ├── [...] +//! │   │   └── ff9d7bd692[...] +//! │   ├── 4e +//! │   │   ├── 67eaccd9fd[...] +//! │   │   └── [...] +//! │   ├── 50 +//! │   │   ├── 2b126bca0c[...] +//! │   │   └── [...] +//! │   └── [...] +//! ├── images +//! │   ├── 4e67eaccd9fd[...] -> ../objects/4e/67eaccd9fd[...] +//! │   └── refs +//! 
│   └── some/name -> ../../images/4e67eaccd9fd[...] +//! └── streams +//! ├── 502b126bca0c[...] -> ../objects/50/2b126bca0c[...] +//! └── refs +//! └── some/name.tar -> ../../streams/502b126bca0c[...] +//! ``` +//! +//! ## `meta.json` +//! +//! Added in 0.7.0. This file records repository-level metadata. When present, it is +//! created by `cfsctl init` and contains: +//! +//! - `version` — the base repository format version (currently `1`). Tools +//! must refuse to operate on a repository whose version exceeds what they +//! understand. +//! +//! - `algorithm` — the fs-verity digest algorithm identifier, in the format +//! `fsverity--`. For example `fsverity-sha512-12` +//! means SHA-512 with 4 KiB (2^12) blocks. +//! +//! - `features` (optional) — an object with three arrays of feature-flag +//! strings, following the ext4/XFS/EROFS compatibility model: +//! - `compatible` — old tools can safely ignore these. +//! - `read-only-compatible` — old tools may read but must not write. +//! - `incompatible` — old tools must refuse the repository entirely. +//! +//! The currently defined feature flags are: +//! - `v1_erofs` (read-only-compatible) — present on repositories whose +//! EROFS image format is [V1][crate::erofs_format] (C-tool compatible: +//! compact inodes, BFS ordering, whiteout table). This is the single +//! flag that encodes the EROFS format version: present → V1, absent +//! → V2. Old +//! tools that do not recognise this flag open the repository read-only +//! rather than accidentally writing images in the wrong format. +//! +//! When `meta.json` is present, `cfsctl` auto-detects the hash algorithm and +//! errors if `--hash` is explicitly passed with a conflicting value. When +//! the file is absent (for repositories created before this feature), `--hash` +//! is honored as before and defaults to `sha512`. +//! +//! ### `cfsctl init --erofs-version` +//! +//! The `--erofs-version` flag selects the EROFS format for newly committed +//! images. 
It controls the `v1_erofs` feature flag in `meta.json`: +//! +//! ```text +//! cfsctl init # default: V2 EROFS (composefs-rs native) +//! cfsctl init --erofs-version 1 # V1 EROFS (C-tool compatible) +//! ``` +//! +//! **V2** (the `cfsctl` default) uses extended inodes, DFS ordering, and +//! `composefs_version=2` in the EROFS superblock. This is the composefs-rs native +//! format and is what all repositories created before V1 support was added use. +//! Higher-level tools (e.g. bootc) may configure a repository with multiple format +//! versions (V1 primary + V2 extra) so that images are usable on both RHEL9-era and +//! newer kernels. +//! +//! **V1** uses compact inodes where possible, BFS ordering, and a whiteout stub +//! table, producing output byte-for-byte identical to the C `mkcomposefs` tool. +//! The `v1_erofs` ro-compat flag is written to `meta.json` so that tools which +//! predate V1 support open the repository read-only rather than writing images +//! in the wrong format. +//! +//! Re-initializing an existing repository with a different `--erofs-version` is +//! rejected with an error; the format version is fixed at init time. +//! +//! ## `objects/` +//! +//! This is where the content-addressed data is stored. The immediate children of +//! this directory are 256 subdirectories from `00` to `ff`. Each of those +//! directories contains a number of files with 62-character hexadecimal names. +//! Taken together with the directory in which it resides, each filename represents +//! a 256-bit hash value which equals the measured fs-verity digest of that file. +//! fs-verity must be enabled for every file. +//! +//! ## `images/` +//! +//! This is where composefs ([EROFS][crate::erofs_format]) images are accounted for. The images +//! themselves are fs-verity enabled and stored in the object store in the same way +//! as the file data, but the `images/` directory contains symlinks to the images +//! that we know about. 
Each symlink is named for the full 256-bit fs-verity digest. +//! +//! Images are tracked in a separate directory because of the security model of +//! filesystems in the Linux kernel. Although it would be feasible for "regular +//! users" to mount an erofs in their own mount namespace, the kernel currently +//! disallows it so that non-root users cannot expose the filesystem +//! code to hostile data. As such, we only mount images that we produced for +//! ourselves (with mkcomposefs), and those are the ones that are linked in this +//! directory. +//! +//! Another way to say it: we must never attempt to mount an arbitrary object; we +//! may only mount via symlinks present in this directory. +//! +//! ## `streams/` +//! +//! This is where [split streams][crate::splitstream] are stored. As with images, +//! this directory holds symlinks, each named for a 256-bit fs-verity digest, that +//! point to data in the object store. +//! +//! Note: the names in this directory are the fs-verity hashes of the +//! content of the splitstream file, not the original file. More specifically: if +//! you have a tar file with a specific sha256 digest, and you import it into the +//! repository as a splitstream, the resulting filename in this directory will have +//! no relation to the original content. You can, however, store a reference for +//! it. +//! +//! ## `{images,streams}/refs/` +//! +//! This is where we record which images and streams are currently "requested" by +//! some external user. When importing a tar file, in addition to creating the +//! file in the objects database and the toplevel symlink in the `streams/` +//! directory, we also assign it a name which is chosen by the software which is +//! performing the import. +//! +//! Each ref is a symlink to the top-level entry in `images/` or `streams/`. +//! +//! There are some rough ideas for how we might namespace this. Something like +//! this model is imagined: +//! +//! ```text +//! refs +//! ├── system +//! 
│   └── rootfs +//! │      ├── some_id -> ../../../974d04eaff[...] +//! │      └── [...] +//! ├── 1000 # uid of a user +//! │   ├── flatpak +//! │   │   ├── some_id -> ../../../f8e2bec500[...] +//! │   │   └── [...] +//! │   └── containers +//! │      ├── some_id -> ../../../96a87f8b4b[...] +//! │      └── [...] +//! └── [...] +//! ``` +//! +//! Where the toplevel directories are `system` plus a set of uids. Each `system` +//! or uid subdirectory is namespaced by the particular piece of software that's +//! responsible for storing the given image or stream. +//! +//! The per-user directories will all be owned by root and have 0700 permissions, +//! but each user will be able to access their own uid-numbered subdirectories by +//! way of an acl. The reason that we want the directories owned by root is to +//! prevent users from corrupting the layout of the repository. The reason for the +//! acl is that read-only operations on the repository should be performed +//! directly on the repository and not via some central agent. +//! +//! ## Referring to images and streams +//! +//! Operations that are performed on images or streams (mount, cat, etc.) name the +//! stream in one of two ways: +//! +//! - via the user-chosen name such as `refs/1000/flatpak/some_id` +//! - via the fs-verity digest stored in the toplevel dir +//! +//! ie: the name must either start with the string `refs/`, or must be a +//! hexadecimal string (64 characters for sha256, 128 for sha512). +//! +//! In both cases, the name is a path relative to the `images/` or `streams/` +//! directory and this path contains a symlink (either direct or indirect) to the +//! underlying file in `objects/`. +//! +//! When specified via fs-verity digest, the digest is verified before performing +//! the operation. +//! +//! For example: +//! +//! ```sh +//! cfsctl mount refs/system/rootfs/some_id /mnt # does not check fs-verity +//! cfsctl mount 974d04eaff[...] /mnt # enforces fs-verity +//! ``` +//! +//! 
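The naming rule described above (a name either starts with `refs/` or is a hexadecimal fs-verity digest of 64 or 128 characters) can be sketched roughly as follows. This is only an illustration under stated assumptions: `NameKind` and `classify_name` are hypothetical names and not part of the composefs crate, and accepting uppercase hex digits is an assumption the text does not confirm.

```rust
/// How a user-supplied image or stream name might be interpreted.
/// Hypothetical type for illustration; not part of the composefs API.
#[derive(Debug, PartialEq)]
enum NameKind {
    /// A user-chosen name such as `refs/1000/flatpak/some_id`.
    Ref,
    /// A hex fs-verity digest: 64 characters (sha256) or 128 (sha512).
    Digest,
}

/// Classify a name relative to `images/` or `streams/`.
/// Returns `None` when the name matches neither form.
fn classify_name(name: &str) -> Option<NameKind> {
    if name.starts_with("refs/") {
        return Some(NameKind::Ref);
    }
    // Assumption: both lowercase and uppercase hex digits are accepted.
    let is_hex = name.bytes().all(|b| b.is_ascii_hexdigit());
    if is_hex && (name.len() == 64 || name.len() == 128) {
        return Some(NameKind::Digest);
    }
    None
}
```

A truncated digest such as `974d04eaff` falls into neither category in this sketch, which matches the requirement that a digest be given in full.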
## OCI image storage +//! +//! OCI container images are stored using streams exclusively. Each OCI artifact +//! (manifest, config, layer) becomes a splitstream, and OCI "tags" are refs under +//! `streams/refs/oci/`. +//! +//! ### Naming conventions +//! +//! | OCI artifact | Stream name pattern | Example | +//! |---------------|------------------------------------|------------------------------------| +//! | Manifest | `oci-manifest-{manifest_digest}` | `oci-manifest-sha256:abc123...` | +//! | Config | `oci-config-{config_digest}` | `oci-config-sha256:def456...` | +//! | Layer | `oci-layer-{diff_id}` | `oci-layer-sha256:ghi789...` | +//! | Blob | `oci-blob-{blob_digest}` | `oci-blob-sha256:jkl012...` | +//! +//! Tags are stored under `streams/refs/oci/` with percent-encoding for +//! filesystem safety (`/` → `%2F`): +//! +//! ```text +//! streams/refs/oci/myimage:latest → ../../oci-manifest-sha256:abc123... +//! ``` +//! +//! ### Splitstream reference chains +//! +//! Each splitstream contains `named_refs` (semantic labels mapping to entries +//! in the `stream_refs` array) and `object_refs` (raw objects referenced by +//! the compressed stream data). For OCI images the chain is: +//! +//! **Manifest splitstream** (`oci-manifest-sha256:...`): +//! - `object_refs`: the manifest JSON blob +//! - `named_refs`: +//! - `config:{config_digest}` → config splitstream verity +//! - `{diff_id}` → layer splitstream verity (one per layer) +//! +//! **Config splitstream** (`oci-config-sha256:...`): +//! - `object_refs`: the config JSON blob +//! - `named_refs`: +//! - `{diff_id}` → layer splitstream verity (one per layer) +//! +//! **Layer splitstream** (`oci-layer-sha256:...`): +//! - `object_refs`: file content objects extracted from the tar +//! - `named_refs`: none (leaf node) +//! +//! Both the manifest and config redundantly reference the layers. The GC +//! can reach layers from either path. +//! +//! ### Garbage collection +//! +//! 
The GC walks all refs under `streams/refs/` to find root splitstreams, +//! then transitively follows `named_refs` (by resolving fs-verity IDs +//! through a stream name map) and collects `object_refs`. Any object not +//! reachable from a root is deleted. +//! +//! Concretely, for a tagged container image: +//! +//! 1. Tag `streams/refs/oci/myimage:v1` resolves to `oci-manifest-sha256:abc` +//! 2. Walk the manifest: mark its JSON blob and follow `named_refs` to +//! the config and layer streams +//! 3. Walk the config: mark its JSON blob and follow `named_refs` to layers +//! (already visited, skipped) +//! 4. Walk each layer: mark all file content objects +//! +//! When a tag is removed, the manifest and everything reachable only from it +//! becomes GC-eligible. Layers shared between images survive as long as any +//! referencing manifest remains tagged. +//! +//! ### EROFS image tracking via config splitstream refs +//! +//! When an EROFS image is generated from an OCI image (via +//! `create_filesystem` + `commit_image`), its object ID (fs-verity digest) +//! is stored as a named ref on the config splitstream with the key +//! `composefs.image`. +//! +//! GC walks from tag → manifest → config, and finds the `composefs.image` +//! named ref. The EROFS object ID is added to the live set, keeping the +//! EROFS image alive. The EROFS image still needs an entry under `images/` +//! for the kernel mount security model (see above), but `images/` is not a +//! GC root — the config ref is what keeps the object alive. +//! +//! This means a single OCI tag is sufficient to keep the entire image +//! (manifest, config, layers, and the EROFS image) alive through GC. +//! +//! ### Bootable image variant +//! +//! For bootable images, a second EROFS may be generated after +//! `transform_for_boot` (stripping `/boot`, etc.). This boot EROFS is +//! stored as a second named ref on the config, `composefs.image.boot`. +//! +//! 
Since the config splitstream content changes (new named ref), it gets a +//! new fs-verity digest. This cascades: the manifest must also be +//! rewritten (its `config:` named ref now points to the new config verity), +//! producing a new manifest verity. The tag is re-pointed to the new +//! manifest. The old config and manifest splitstreams become unreferenced +//! and are collected by GC. +//! +//! The result: one tag still keeps everything alive — layers, raw EROFS, +//! and boot EROFS. +//! +//! ### Future: sealed images +//! +//! For sealed/signed images, the EROFS comes pre-built from the registry as +//! part of a composefs OCI artifact (referrer pattern). The artifact +//! splitstream would hold references to the pre-fetched EROFS layers. This +//! is complementary to the unsealed case — both use the same GC mechanism +//! (named refs pointing to EROFS objects). diff --git a/crates/composefs/src/splitstream_format.rs b/crates/composefs/src/splitstream_format.rs new file mode 100644 index 00000000..36a1a17f --- /dev/null +++ b/crates/composefs/src/splitstream_format.rs @@ -0,0 +1,164 @@ +//! # Splitstream +//! +//! Splitstream is a trivial way of storing file formats (like tar) with the "data +//! blocks" stored in the composefs object store with the goal that it's possible +//! to bit-for-bit recreate the entire file. It's something like the idea behind +//! [tar-split](https://github.com/vbatts/tar-split), with some important +//! differences: +//! +//! - it's a binary format +//! +//! - it's based on storing external objects content-addressed in the composefs +//! object store via their fs-verity digest +//! +//! - although it's designed with `tar` files in mind, it's not specific to `tar`, +//! or even to the idea of an archive file: any file format can be stored as a +//! splitstream, and it might make sense to do so for any file format that +//! contains large chunks of embedded data +//! +//! 
- in addition to the ability to split out chunks of file content (like files +//! in a `.tar`) to separate files, it is also possible to refer to external +//! file content, or even other splitstreams, without directly embedding that +//! content in the referrer, which can be useful for cross-document references +//! (such as between OCI manifests, configs, and layers) +//! +//! - the splitstream file itself is stored in the same content-addressed object +//! store by its own fs-verity hash +//! +//! Splitstream compresses inline file content before it is stored to disk using +//! zstd. The main reason for this is that, after removing the actual file data, +//! the remaining `tar` metadata contains a very large amount of padding and empty +//! space and compresses extremely well. +//! +//! Splitstream is conceptually independent from composefs: you could use the +//! format with any content-addressed storage system. +//! +//! ## File format +//! +//! What follows is a non-normative documentation of the file format. The actual +//! definition of the format is "what composefs-rs reads and writes", but this +//! document may be useful to try to understand that format. If you'd like to +//! implement the format, please get in touch. +//! +//! The format is implemented in +//! [crate::splitstream] and +//! the structs from that file are copy-pasted here. Please try to keep things +//! roughly in sync when making changes to either side. +//! +//! All integers are little-endian. In the following `struct` definitions, `U` +//! means 'unsigned little endian' (as per the `zerocopy::little_endian` crate) so +//! `U64` is an unsigned 64bit little-endian integer. +//! +//! ### File ranges ("sections") +//! +//! The file format consists of a fixed-sized header at the start of the file plus +//! a number of sections located at arbitrary locations inside of the file. All of +//! these sections are referred to by a 64-bit `[start..end)` range expressed in +//! 
terms of overall byte offsets within the complete file. +//! +//! ```text +//! struct FileRange { +//! start: U64, +//! end: U64, +//! } +//! ``` +//! +//! ### Header +//! +//! The file starts with a simple fixed-size header. +//! +//! ```text +//! const SPLITSTREAM_MAGIC: [u8; 11] = *b"SplitStream"; +//! +//! struct SplitstreamHeader { +//! pub magic: [u8; 11], // Contains SPLITSTREAM_MAGIC +//! pub version: u8, // must always be 0 +//! pub _flags: U16, // is currently always 0 (but ignored) +//! pub algorithm: u8, // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512) +//! pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k) +//! pub info: FileRange, // can be used to expand/move the info section in the future +//! } +//! ``` +//! +//! In addition to magic values and identifiers for the fs-verity algorithm in use, +//! the header is used to find the location and size of the info section. Future +//! expansions to the file format are imagined to occur by expanding the size of +//! the info section: if the section is larger than expected, the additional bytes +//! will be ignored by the implementation. +//! +//! ### Info section +//! +//! ```text +//! struct SplitstreamInfo { +//! pub stream_refs: FileRange, // location of the stream references array +//! pub object_refs: FileRange, // location of the object references array +//! pub stream: FileRange, // location of the zstd-compressed stream within the file +//! pub named_refs: FileRange, // location of the compressed named references +//! pub content_type: U64, // user can put whatever magic identifier they want there +//! pub stream_size: U64, // total uncompressed size of inline chunks and external chunks +//! } +//! ``` +//! +//! The `content_type` is just an arbitrary identifier that can be used by users of +//! the file format to prevent casual user error when opening a file by its hash +//! value (to prevent showing `.tar` data as if it were json, for example). +//! +//! 
The `stream_size` is the total size of the original file. +//! +//! ### Stream and object refs sections +//! +//! All referred streams and objects in the file are stored as two separate flat +//! uncompressed arrays of binary fs-verity hash values. Each of these arrays is +//! referred to from the info section (via `stream_refs` and `object_refs`). +//! +//! The number of items in the array is determined by the size of the section +//! divided by the size of the fs-verity hash value (determined by the algorithm +//! identifier in the header). +//! +//! The values are not in any particular order, but implementations should produce +//! a deterministic output. For example, the objects reference array produced by +//! the current implementation has the external objects sorted by first-appearance +//! within the stream. +//! +//! The main motivation for storing the references uncompressed, in binary, and in +//! a flat array is to make determining the references contained within a +//! splitstream as simple as possible to improve the efficiency of garbage +//! collection on large repositories. +//! +//! ### The stream +//! +//! The main content of the splitstream is stored in the `stream` section +//! referenced from the info section. The entire section is zstd compressed. +//! +//! Within the compressed stream, the splitstream is formed from a number of +//! "chunks". Each chunk starts with a single 64-bit little endian value. If that +//! number is negative, it refers to an "inline" chunk, and that (absolute) number +//! of bytes of data immediately follow it. If the number is non-negative then it +//! is an index into the object refs array for an "external" chunk. +//! +//! Zero is a non-negative value, so it's an object reference. It's not possible +//! to have a zero-byte inline chunk. This also means that the high/sign bit +//! determines which case (inline vs. external) we have and there are an equal +//! number of both cases. +//! +//! 
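The chunk header rule above can be sketched as a small decoder. This is a minimal illustration, assuming the 64-bit value is read as a two's-complement signed integer; `ChunkHeader` and `decode_chunk_header` are hypothetical names, not the types used in `crate::splitstream`.

```rust
/// One decoded chunk header from the decompressed stream.
/// Hypothetical type for illustration; not the crate's actual representation.
#[derive(Debug, PartialEq)]
enum ChunkHeader {
    /// The next `len` bytes of the stream are literal data.
    Inline { len: u64 },
    /// The chunk's content is the object at `index` in the object refs array.
    External { index: u64 },
}

/// Decode the 8-byte little-endian value that starts every chunk.
fn decode_chunk_header(bytes: [u8; 8]) -> ChunkHeader {
    // Assumption: the value is a two's-complement signed 64-bit integer.
    let value = i64::from_le_bytes(bytes);
    if value < 0 {
        // Negative: the absolute value is the inline byte count.
        ChunkHeader::Inline { len: value.unsigned_abs() }
    } else {
        // Zero or positive: an index into the object refs array.
        ChunkHeader::External { index: value as u64 }
    }
}
```

Because zero decodes as an external reference, a zero-length inline chunk cannot be represented, as the text notes.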
The stream is reassembled by iterating over the chunks and concatenating the +//! result. For inline chunks, the inline data is taken directly from the +//! splitstream. For external chunks, the content of the external file is used. +//! +//! The stream is over when there are no more chunks. +//! +//! ### Named references +//! +//! It's possible to have named references to other streams. These are stored in +//! the `named_refs` section referred to from the info section. +//! +//! This section is also zstd-compressed, and is a number of nul-terminated text +//! records (including a terminator after the last record). Each record has the +//! form `n:name` where `n` is a non-negative integer index into the stream refs +//! array and `name` is an arbitrary name. The entries are currently sorted by +//! name (by the writer implementation) but the order is not important to the +//! reader. Whether or not this list is "officially" sorted or not may be pinned +//! down at some future point if a need should arise. +//! +//! An example of the decompressed content of the section might be something like +//! `"0:first\01:second\0"`. diff --git a/crates/composefs/src/varlink.rs b/crates/composefs/src/varlink.rs new file mode 100644 index 00000000..af22cfd4 --- /dev/null +++ b/crates/composefs/src/varlink.rs @@ -0,0 +1,160 @@ +//! # Varlink API +//! +//! `cfsctl varlink` exposes a [varlink] RPC service over a Unix socket +//! with two interfaces: +//! +//! - **`org.composefs.Repository`** — repository lifecycle, integrity +//! checks, garbage collection, and mounting +//! - **`org.composefs.Oci`** — OCI container image operations (listing, +//! pulling, inspecting, tagging, mounting) +//! +//! This API is language-agnostic and usable from any varlink client. +//! Like the Rust crate API, it is not yet declared stable. +//! +//! [varlink]: https://varlink.org +//! +//! ## Starting the service +//! +//! ```sh +//! cfsctl varlink --address /run/composefs/composefs.sock +//! 
``` +//! +//! Systemd socket activation is also supported — if `cfsctl varlink` is +//! started with an activated socket, the `--address` flag is not needed. +//! +//! ## Discovering the full API +//! +//! The complete interface definitions — every method, type, and error — +//! are available at runtime via the standard varlink introspection +//! protocol. Use [`varlinkctl`] to dump them: +//! +//! ```sh +//! # List available interfaces +//! varlinkctl list-interfaces /run/composefs/composefs.sock +//! +//! # Full IDL for the Repository interface +//! varlinkctl introspect /run/composefs/composefs.sock \ +//! org.composefs.Repository +//! +//! # Full IDL for the OCI interface +//! varlinkctl introspect /run/composefs/composefs.sock \ +//! org.composefs.Oci +//! ``` +//! +//! For `exec:`-style transports (no long-running socket), `varlinkctl` +//! can launch `cfsctl` as a subprocess: +//! +//! ```sh +//! varlinkctl introspect exec:cfsctl\ varlink org.composefs.Repository +//! ``` +//! +//! [`varlinkctl`]: https://www.freedesktop.org/software/systemd/man/latest/varlinkctl.html +//! +//! ## Session model +//! +//! Repositories are accessed through opaque `u64` handles. A client +//! calls `OpenRepository` to obtain a handle, passes it to every +//! subsequent method, and releases it with `CloseRepository`. No +//! repository is opened at startup. +//! +//! ## Examples +//! +//! The examples below use `varlinkctl call`. Any varlink client works — +//! the wire format is JSON over a Unix socket. +//! +//! ### Open and close a repository +//! +//! ```sh +//! # Open the system repository (/sysroot/composefs) +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.OpenRepository '{"system": true}' +//! # → {"handle": 1} +//! +//! # Open at a specific path +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.OpenRepository \ +//! '{"path": "/srv/composefs"}' +//! # → {"handle": 2} +//! +//! 
# Release a handle when done +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.CloseRepository '{"handle": 1}' +//! ``` +//! +//! ### Check repository integrity +//! +//! ```sh +//! # Full check (verifies fs-verity on every object) +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.Fsck '{"handle": 1}' +//! # → {"ok": true, "has_metadata": true, "objects_checked": 1542, ...} +//! +//! # Fast metadata-only check (skips per-object verification) +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.Fsck \ +//! '{"handle": 1, "metadata_only": true}' +//! ``` +//! +//! ### List and pull OCI images +//! +//! ```sh +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Oci.ListImages '{"handle": 1}' +//! # → {"images": [{"name": "myimage:latest", +//! # "manifest_digest": "sha256:abc...", ...}, ...]} +//! +//! # Pull with streaming progress +//! varlinkctl call --more /run/composefs/composefs.sock \ +//! org.composefs.Oci.Pull '{ +//! "handle": 1, +//! "image": "quay.io/fedora/fedora:latest", +//! "local_fetch": "decompressed", +//! "bootable": false, +//! "more": true +//! }' +//! # Streams progress, then a final "completed" frame +//! ``` +//! +//! ### Inspect, tag, and untag +//! +//! ```sh +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Oci.Inspect \ +//! '{"handle": 1, "image": "myimage:latest"}' +//! # → {"manifest": "{...}", "config": "{...}", ...} +//! +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Oci.Tag '{ +//! "handle": 1, +//! "manifest_digest": "sha256:abc123...", +//! "name": "myimage:v2" +//! }' +//! +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Oci.Untag \ +//! '{"handle": 1, "name": "myimage:old"}' +//! ``` +//! +//! ### Garbage collection +//! +//! ```sh +//! # Dry run +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.Gc \ +//! 
'{"handle": 1, "dry_run": true, "roots": []}' +//! +//! # Collect for real +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.Gc \ +//! '{"handle": 1, "dry_run": false, "roots": []}' +//! ``` +//! +//! ### Mounting +//! +//! The `Mount` and `OciMount` methods return a detached mount file +//! descriptor via `SCM_RIGHTS`. The caller attaches it with +//! `move_mount(2)`. For overlay mounts, the caller passes upperdir and +//! workdir fds in the request. +//! +//! These methods require a varlink client that supports fd passing; +//! `varlinkctl` does not currently support this. diff --git a/doc/booting.md b/doc/booting.md deleted file mode 100644 index b958e6cd..00000000 --- a/doc/booting.md +++ /dev/null @@ -1,90 +0,0 @@ -# Booting from a composefs image - -This document describes how composefs-rs sets up the root filesystem during -early boot. It covers the kernel command-line interface, the expected on-disk -layout, kernel requirements, and the step-by-step mount sequence performed by -`composefs-setup-root`. - -The target audience is system integrators and OS developers who are packaging a -bootable system using composefs. Familiarity with Linux mount namespaces, -overlayfs, and fs-verity is assumed. - -## Kernel command-line - -The initramfs code in composefs supports multiple kernel arguments; it -is possible to pre-compute the digest of an image using both e.g. SHA-256 and -SHA-512. On an installed system, the repository only supports one digest -by default today, and the first found will be selected. - -Additionally, it is opt-in to enable v1 EROFS, and again the first compatible -version will be found. 
- -``` -composefs.digest=v1-sha256-12: # V1 EROFS image (preferred; RHEL9-era kernels) -composefs.digest=v1-sha512-12: # V1 EROFS image (SHA-512 variant) -composefs.digest=v2-sha512-12: # V2 EROFS image (explicit form) -composefs= # V2 EROFS image (legacy shorthand) -``` - -The value format is `--:`, where -`` is `v1` or `v2`, `` is `sha256` or `sha512`, and -`` is the log₂ block size (currently always `12`, i.e. 4096 -bytes). This mirrors how `meta.json` encodes the algorithm as -`fsverity-sha256-12`. - -`composefs.digest=` is checked first. Multiple entries may appear on the cmdline -(one per format/algorithm combination); the initramfs tries each in order and -mounts the first image that actually exists in the repository. - -`composefs=` is a legacy shorthand equivalent to -`composefs.digest=v2--12:` — the algorithm is inferred from the -digest length (64 hex chars → SHA-256, 128 → SHA-512). It is checked only when -no `composefs.digest=` token matches. - -**Insecure mode.** Placing `?` immediately after `=` (e.g. -`composefs.digest=?v1-sha256-12:` or `composefs=?`) makes -fs-verity verification optional. The system will boot even when the underlying -filesystem does not support fs-verity or the image has no verity metadata -attached. This mode exists for development and testing only; it must not be used -in production. - -## On-disk layout - -The composefs repository must be present at `/sysroot/composefs` with the -standard layout described in `doc/repository.md`. - -The digest must correspond to a symlink under `images/`. - -Persistent per-deployment state lives at `/sysroot/state/deploy//`, -where `` matches the boot karg digest exactly. The `etc/` and `var/` -subdirectories within that directory serve as the upper layers for the -corresponding overlayfs mounts. 
- -## Kernel requirements - -The following kernel features must be available: - -- **EROFS** filesystem driver (`CONFIG_EROFS_FS`) -- **overlayfs** with `metacopy=on` and `redirect_dir=on` - (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`) -- **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`) -- The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`), - available since kernel 5.2. Kernel ≥ 6.15 is required for the atomic root - replacement path (the default build). On kernels without `fsconfig_set_fd` - support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created - automatically by `composefs::mountcompat`. - -## Kernel argument - -The boot karg (`composefs.digest=` or `composefs=`) is the authoritative selector for which image is booted. -Without the `?` insecure prefix, every file access through the overlayfs is -verified against the object's stored digest by the kernel, combining fs-verity -on the data objects with overlayfs `verity=require`. - -## Other notes - -As a workaround for a GPT auto-root issue in systemd -([systemd#35017](https://github.com/systemd/systemd/issues/35017)), -`composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a -symlink pointing to the real block device before performing any mounts. Failure -to do so is non-fatal and does not abort the boot sequence. diff --git a/doc/erofs.md b/doc/erofs.md deleted file mode 100644 index 2a49b60e..00000000 --- a/doc/erofs.md +++ /dev/null @@ -1,82 +0,0 @@ -# composefs EROFS image format - -composefs images are EROFS filesystem images with composefs-specific extensions. They encode -a directory tree where regular files are stored externally in a content-addressed object store -and referenced by their fs-verity digest. The EROFS image itself carries only metadata: inodes, -directory entries, extended attributes, and chunk index entries that point to the external files. 
- -composefs-rs supports two EROFS format versions. V1 is byte-for-byte compatible with the C -`mkcomposefs` tool. V2 is the composefs-rs native format and drops several V1 constraints -that exist only for C compatibility. - -`cfsctl init` defaults to V2; pass `--erofs-version 1` to select V1. Higher-level tools -such as bootc initialize repositories with multiple formats enabled (V1 primary) so that images -can be booted on RHEL9-era kernels that require the `composefs.digest=` karg. - -## Format V1 - -V1 is selected with `cfsctl init --erofs-version 1`. The `v1_erofs` ro-compat feature flag -is written to `meta.json` so that tools without V1 support open the repository read-only. - -**`composefs_version` field values in V1:** - -- `0` — no user-visible whiteout files (character devices with rdev=0) in the tree -- `1` — at least one user-visible whiteout file is present - -The constant `COMPOSEFS_VERSION_V1` is 0; the field only reaches 1 when user whiteouts are -found. The `--min-version` flag in `mkcomposefs` (mirrored by `mkfs_erofs_v1_min_version`) -forces the value to 1 even when no user whiteouts exist, for forward compatibility. - -**Inode layout:** V1 uses compact inodes (32 bytes) when the file data and inode fit within -the constraints of the compact format, and extended inodes (64 bytes) otherwise. - -**Inode traversal order:** V1 collects inodes in breadth-first order — all entries at one -directory level before descending. - -**Whiteout stub table:** V1 includes 256 synthetic inode entries at the start of the inode -area, one per two-hex-character prefix `00`–`ff`. Each entry is a character-device stub -(chr 0,0) used by the overlay filesystem to resolve whiteout paths against the object store. -V2 omits them entirely. - -**Whiteout escaping:** User-visible whiteout files (chr 0,0) in the tree are not stored as -character devices on disk. Instead they receive a `trusted.overlay.opaque=x` xattr and are -serialized differently. 
The stub entries in the whiteout table are not escaped. - -**`build_time`:** The superblock `build_time` field is set to the minimum mtime across all inodes. - -**xattr sharing:** Xattr entries are deduplicated using a sort key that is the full xattr name (prefix string concatenated with the suffix). - -## Format V2 — Created in composefs-rs - -V2 is the default for repositories created with `cfsctl init` without `--erofs-version 1`. - -**`composefs_version` field:** Always `2` (the constant `COMPOSEFS_VERSION`). - -**Inode layout:** V2 always uses extended inodes (64 bytes). - -**Inode traversal order:** V2 collects inodes in depth-first order — all descendants of a directory before moving to the next sibling. - -**No whiteout stub table:** V2 has no synthetic stub entries; whiteout files are stored directly without escaping. - -**`build_time`:** Always 0. - -**xattr sharing:** Xattr entries are deduplicated using a sort key of (prefix, suffix, value) -rather than the full name string, which can produce a smaller shared xattr area. - -## Selecting the format - -The format is fixed at repository initialization time and cannot be changed afterward. - -``` -cfsctl init # V2 (default) -cfsctl init --erofs-version 1 # V1 (C-tool compatible) -``` - -The format is recorded in `meta.json` as the `v1_erofs` ro-compat feature flag: present -means V1, absent means V2. Tools that do not recognize this flag open the repository -read-only rather than writing images in the wrong format. - -For the standalone `mkcomposefs` tool, the equivalent flag is `--erofs-version`. The -`--min-version` flag (`mkfs_erofs_v1_min_version` in the Rust API) controls whether the -`composefs_version` field starts at 0 or 1 in V1 images regardless of whether user whiteouts -are present. diff --git a/doc/oci.md b/doc/oci.md deleted file mode 100644 index d1f850f4..00000000 --- a/doc/oci.md +++ /dev/null @@ -1,127 +0,0 @@ -# How to create a composefs from an OCI image - -This document is incomplete. 
It only serves to document some decisions we've -taken about how to resolve ambiguous situations. - -# Data precision - -We currently create a composefs image using the granularity of data as -typically appears in OCI tarballs: - - atime and ctime are not present (these are actually not physically present - in the erofs inode structure at all, either the compact or extended forms) - - mtime is set to the mtime in seconds; the sub-seconds value is simply - truncated (ie: we always round down). erofs has an nsec field, but it's not - normally present in OCI tarballs. That's down to the fact that the usual - tar header only has timestamps in seconds and extended headers are not - usually added for this purpose. - - we take great care to faithfully represent hardlinks: even though the - produced filesystem is read-only and we have data de-duplication via the - objects store, we make sure that hardlinks result in an actual shared inode - as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem. - -We apply these precision restrictions also when creating images by scanning the -filesystem. For example: even if we get more-accurate timestamp information, -we'll truncate it to the nearest second. - -# Merging directories - -This is done according to the OCI spec, with an additional clarification: in -case a directory entry is present in multiple layers, we use the tar metadata -from the most-derived layer to determine the attributes (owner, permissions, -mtime) for the directory. - -# The root inode - -The root inode (/) is a difficult case because OCI container layer tars often -don't include a root directory entry, and when they do, container runtimes -(Podman, Docker) ignore it and use hardcoded defaults. For example, Podman's -[containers/storage](https://github.com/containers/storage) uses root:root -ownership, mode `0555`, and epoch (0) mtime when extracting layers, but -Docker uses `0755`. In general, the metadata for `/` is not defined. 
- -Because composefs requires (has a goal of providing) precise cryptographically -verifiable filesystem trees, we solve this for OCI by copying the metadata from `/usr` -to the root directory. The rationale is that `/usr` is always present in -standard filesystem layouts and must be defined explicitly in the OCI layers. - -This is implemented via the `copy_root_metadata_from_usr()` method and the -`read_container_root()` convenience function. - -When building a filesystem from OCI layers programmatically, use -`Stat::uninitialized()` to create the initial `FileSystem`. This placeholder -has mode `0` (obviously invalid) to make it clear that the root metadata should -be set before computing digests - typically by calling -`copy_root_metadata_from_usr()` after processing all layers. - -# Extended attributes (xattrs) - -When reading a container filesystem from a mounted root (as opposed to -processing OCI layer tars directly), host-side xattrs can leak into the -image. This is particularly problematic for `security.selinux` labels: -if SELinux is enabled at build time, files will have labels like -`container_t` that come from the build host, not from the target system's -policy. - -To ensure reproducibility, `read_container_root()` filters xattrs to only -include those in an allowlist. Currently this is just `security.capability`, -which represents actual file capabilities that should be preserved. - -SELinux labels are handled separately by `transform_for_boot()`: - - If the target filesystem contains a SELinux policy (in `/etc/selinux`), - all files are relabeled according to that policy - - If no SELinux policy is found, all `security.selinux` xattrs are stripped - -This ensures that: - - Build-time SELinux labels don't leak into non-SELinux targets - - SELinux-enabled targets get correct labels from their own policy - - Other host xattrs (overlayfs internals, etc.) 
don't pollute the image - -See: https://github.com/containers/storage/pull/1608#issuecomment-1600915185 - -# The /run directory - -When processing OCI images via `create_filesystem()`, the `/run` directory -is emptied if present. This is a tmpfs at runtime and should always be -empty in images. Its mtime is set to match `/usr` for consistency with -how root directory metadata is handled. - -This makes it possible to work around podman/buildah's `RUN --mount` issue where cache -mounts can leave incomplete directory entries in OCI tar layers (directories -without explicit tar entries inherit incorrect mtimes) by pointing all -such mounts into `/run`, and then redirecting from their final location -via e.g. symlinks into `/run`. - -## Container build cache mounts - -A practical implication of emptying `/run` is that container authors can -use it for cache mounts without worrying about polluting the final image. - -Instead of: -```dockerfile -RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ... -``` - -Consider: -```dockerfile -RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf -RUN --mount=type=cache,target=/run/dnfcache dnf install -y ... -``` - -This avoids potential mtime inconsistencies in `/var/cache` while still -benefiting from build caching. - -See: https://github.com/containers/composefs-rs/issues/132 - -# Emptied directories for boot - -When preparing a filesystem for boot via `transform_for_boot()`, certain -additional directories are emptied because their contents should not be -part of the final verified image: - -- `/boot`: Contains the UKI which embeds the composefs digest, so including - it would create a circular dependency -- `/sysroot`: Only has content in ostree-container cases, and traversing - it for SELinux labeling causes problems - -These directories are emptied and their mtime is set to match `/usr` for -consistency with how the root directory metadata is handled. 
diff --git a/doc/ostree.md b/doc/ostree.md deleted file mode 100644 index e96b85d7..00000000 --- a/doc/ostree.md +++ /dev/null @@ -1,139 +0,0 @@ -# OSTree - -composefs-rs has support for importing images from OSTree -repositories, by pulling from local or remote OSTree -repositories. These images can then be mounted as composefs images, -sharing disk (deduplication) with other ostree or other types of -images in the composefs repository. - -Native OSTree repositories are a format similar to a composefs -repository, but not quite the same. This means we need some -conversions when handling ostree commits in a composefs repository. - -OSTree images (commits) are fundamentally made up of many small sha256 -content-addressed objects that reference each other. Each commit is -the root of a DAG that defines the total image. Some of the OSTree -objects are metadata like directory permissions, or list of files in a -directory. These don't really exist in composefs where all metadata is -part of the erofs image. However, some objects are large file objects, -and these are similar to the file objects in composefs -images. However, even these differ, because the checksum defining the -object is made up of both the file content and the file metadata. - -When an OSTree commit is stored in a composefs repo it is stored as a -single splitstream file, named `ostree-commit-$commit_id`, which uses -external object references to all the file content objects that will -be used when creating an erofs image for it. This means OSTree objects -for files that would be inlined in the erofs image will not be -external objects. - -OSTree commit splitstream objects are created during a pull operation -and are used for two things: creating a composefs image by walking the -DAG, and serving as a source of already available OSTree objects during -a pull operation. Such sources are found automatically during pull -(e.g.
parent commit, or old commit for a ref being pulled) or can be -manually specified. - -## File format - -This describes the format of the `ostree-commit-$commit_id` files. - -### Splitstream header - -Since the commit file is a split stream it starts with the splitstream -headers. Of these we use two, the named refs and the object -refs: - - * When an erofs image is created for the commit, it is referenced by - the `composefs.image` named ref. - - * Any external file content objects are in the external_refs - table. The index of the references in this header table is used to - refer to the file in the splitstream itself. - -The splitstream content type used for commits is 0xAFE138C18C463EF1. - -### Splitstream content - -A splitstream is normally a series of internal and external chunks, -but the ostree commit uses only one inline chunk. This chunk is -basically a serialized form of the "objects" directory of an OSTree -repository. I.e. it has a mapping of sha256 to ostree object data. -All objects except file objects are stored in the standard ostree -object format. - -OSTree file objects are stored in the archive-z2 format, except not -compressed, and optionally the file content part of it may be stored -as referencing the index of an external object. The z2 format is, -first an 8-byte header that gives the size (in bytes) of a gvariant, -then comes the gvariant with the file meta in -OSTREE_ZLIB_FILE_HEADER_GVARIANT_FORMAT format, and then the -file/symlink inline data. If an external object is referenced for the -object then it is expected that there is no inline file data. - -The high level view of the file looks like this: -``` -+---------------+ -| Header | -+---------------| -| Object IDs | -+---------------| -| Object Info | -+---------------| -| Content | -+---------------+ -``` - -The Object IDs is a sorted array of sha256 digests, and you would do -lookups in it using a binary search. 
The buckets in the header can be -used to quickly limit the binary search based on the first byte of a -digest. - -Then, at the same index as the object found by the binary search, you can look up -the object info which gives you the offset/length of the object -content data and optionally a reference to an external object. - -The exact form of the data looks like this, packed in order from the -start of the splitstream content. All integers are little-endian. - -### Header -``` -+-----------------------------------+ -| u32: index of commit object | -| u32: flags (currently unused) | -| [u32; 256]: end index of bucket | -+-----------------------------------+ -``` - -The bucket list contains the end index (in the object ids table) of -objects starting with that particular byte, and can be used to quickly -limit the search. We can also compute the total number of objects -(n_objects) by looking in the last bucket. - -### Object ids -``` - n_objects x -+-----------------------------------+ -| [u8; 32] ostree object id | -+-----------------------------------+ -``` - -### Object Info -``` - n_objects x -+-----------------------------------+ -| u32: Offset to per-object data | -| u32: Length of per-object data | -| u32: Index of external object ref | -| or MAXUINT32 if none. | -+-----------------------------------+ -``` - -This is an array of information for each object. Once you have found -the object id in the object ids table, you would look at the same -index in this table to find the information. Offsets to per-object -data are in bytes from the start of the content area, which starts at -the end of the Object Info table. All data chunk references are -aligned to 8 bytes with respect to the start of the content area. -This is useful because GVariants (used by ostree) naturally want -8-byte alignment.
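
The bucket-limited lookup described above can be sketched in Rust as follows. This is only an illustration under stated assumptions: `ObjectIndex`, `from_sorted_ids`, and `lookup` are hypothetical names, not part of composefs-rs, and the tables are assumed to be already decoded into memory.

```rust
/// Hypothetical in-memory view of the header's bucket table and the
/// sorted object-id table. The names are illustrative only.
pub struct ObjectIndex {
    /// `buckets[b]` is the end index (exclusive) of the ids whose first byte is `b`.
    pub buckets: [u32; 256],
    /// Sorted sha256 object ids; the length equals `buckets[255]`.
    pub ids: Vec<[u8; 32]>,
}

impl ObjectIndex {
    /// Build the bucket table from an already sorted id list (demo helper).
    pub fn from_sorted_ids(ids: Vec<[u8; 32]>) -> Self {
        let mut buckets = [0u32; 256];
        // Count how many ids start with each byte value.
        for id in &ids {
            buckets[id[0] as usize] += 1;
        }
        // Turn the per-byte counts into running end indices.
        for b in 1..256 {
            buckets[b] += buckets[b - 1];
        }
        ObjectIndex { buckets, ids }
    }

    /// Return the index of `id` in the object-id table, if present.
    /// The same index is then used to read the Object Info entry.
    pub fn lookup(&self, id: &[u8; 32]) -> Option<usize> {
        let b = id[0] as usize;
        // The bucket for byte `b` starts where the previous bucket ends.
        let start = if b == 0 { 0 } else { self.buckets[b - 1] as usize };
        let end = self.buckets[b] as usize;
        // `get` returns None for an inconsistent (out-of-range) bucket table.
        let slice = self.ids.get(start..end)?;
        slice.binary_search(id).ok().map(|i| start + i)
    }
}
```

Because the last bucket holds the end index for byte `0xff`, `buckets[255]` doubles as `n_objects`, matching the note above.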
diff --git a/doc/repository.md b/doc/repository.md deleted file mode 100644 index 023d26cc..00000000 --- a/doc/repository.md +++ /dev/null @@ -1,316 +0,0 @@ -# composefs repository design - -This document describes the current on-disk layout of a composefs repository. - -At this time, the composefs-rs repository format is not declared stable. - -## Location - -A composefs repository is a directory located anywhere. The location is chosen -for the `cfsctl` command as follows: - - - `--repo` can specify an arbitrary directory - - - if `--user` is specified (default if the current uid is not 0), then the - repository defaults to `~/.var/lib/composefs`. - - - if `--system` is specified (default if the current uid is 0), then the - repository defaults to `/sysroot/composefs`. - -## Layout - -A composefs repository has a layout that looks something like - -``` -composefs -├── meta.json -├── objects -│   ├── 00 -│   │   ├── 002183fb91[...] -│   │   ├── [...] -│   │   └── ff9d7bd692[...] -│   ├── 4e -│   │   ├── 67eaccd9fd[...] -│   │   └── [...] -│   ├── 50 -│   │   ├── 2b126bca0c[...] -│   │   └── [...] -│   └── [...] -├── images -│   ├── 4e67eaccd9fd[...] -> ../objects/4e/67eaccd9fd[...] -│   └── refs -│   └── some/name -> ../../images/4e67eaccd9fd[...] -└── streams - ├── 502b126bca0c[...] -> ../objects/50/2b126bca0c[...] - └── refs - └── some/name.tar -> ../../streams/502b126bca0c[...] -``` - -## `meta.json` - -Added in 0.7.0. This file records repository-level metadata. When present, it is -created by `cfsctl init` and contains: - - - `version` — the base repository format version (currently `1`). Tools - must refuse to operate on a repository whose version exceeds what they - understand. - - - `algorithm` — the fs-verity digest algorithm identifier, in the format - `fsverity-{hash}-{lg_blocksize}`. For example `fsverity-sha512-12` - means SHA-512 with 4 KiB (2^12) blocks.
- - - `features` (optional) — an object with three arrays of feature-flag - strings, following the ext4/XFS/EROFS compatibility model: - - `compatible` — old tools can safely ignore these. - - `read-only-compatible` — old tools may read but must not write. - - `incompatible` — old tools must refuse the repository entirely. - - The currently defined feature flags are: - - `v1_erofs` (read-only-compatible) — present on repositories whose - EROFS image format is V1 (C-tool compatible: compact inodes, BFS - ordering, whiteout table). This is the single flag that encodes the - EROFS format version: present → V1, absent → V2. Old - tools that do not recognise this flag open the repository read-only - rather than accidentally writing images in the wrong format. - -When `meta.json` is present, `cfsctl` auto-detects the hash algorithm and -errors if `--hash` is explicitly passed with a conflicting value. When -the file is absent (for repositories created before this feature), `--hash` -is honored as before and defaults to `sha512`. - -### `cfsctl init --erofs-version` - -The `--erofs-version` flag selects the EROFS format for newly committed -images. It controls the `v1_erofs` feature flag in `meta.json`: - -``` -cfsctl init # default: V2 EROFS (composefs-rs native) -cfsctl init --erofs-version 1 # V1 EROFS (C-tool compatible) -``` - -**V2** (the `cfsctl` default) uses extended inodes, DFS ordering, and -`composefs_version=2` in the EROFS superblock. This is the composefs-rs native -format and is what all repositories created before V1 support was added use. -Higher-level tools (e.g. bootc) may configure a repository with multiple format -versions (V1 primary + V2 extra) so that images are usable on both RHEL9-era and -newer kernels. - -**V1** uses compact inodes where possible, BFS ordering, and a whiteout stub -table, producing output byte-for-byte identical to the C `mkcomposefs` tool. 
-The `v1_erofs` ro-compat flag is written to `meta.json` so that tools which -predate V1 support open the repository read-only rather than writing images -in the wrong format. - -Re-initializing an existing repository with a different `--erofs-version` is -rejected with an error; the format version is fixed at init time. - -## `objects/` - -This is where the content-addressed data is stored. The immediate children of -this directory are 256 subdirectories from `00` to `ff`. Each of those -directories contains a number of files with 62-character hexadecimal names. -Taken together with the directory in which it resides, each filename represents -a 256-bit hash value which equals the measured fs-verity digest of that file. -fs-verity must be enabled for every file. - -## `images/` - -This is where composefs (erofs) images are accounted for. The images -themselves are fs-verity enabled and stored in the object store in the same way -as the file data, but the `images/` directory contains symlinks to the images -that we know about. Each symlink is named for the full 256-bit fs-verity digest. - -Images are tracked in a separate directory because of the security model of -filesystems in the Linux kernel. Although it would be feasible for "regular -users" to mount an erofs in their own mount namespace, the kernel currently -disallows it as a way to avoid allowing non-root users to expose the filesystem -code to hostile data. As such, we only mount images that we produced for -ourselves (with mkcomposefs), and those are the ones that are linked in this -directory. - -Another way to say it: we must never attempt to mount an arbitrary object: we -may only mount via symlinks present in this directory. - -## `streams/` - -This is where [split streams](splitstream.md) are stored. As for the images, -this is a set of symlinks, each named for a 256-bit digest, that point to data in the object -storage.
- -Note: the names of the hashes in this directory are the fs-verity hashes of the -content of the splitstream file, not the original file. More specifically: if -you have a tar file with a specific sha256 digest, and you import it into the -repository as a splitstream, the resulting filename in this directory will have -no relation to the original content. You can, however, store a reference for -it. - -## `{images,streams}/refs/` - -This is where we record which images and streams are currently "requested" by -some external user. When importing a tar file, in addition to creating the -file in the objects database and the toplevel symlink in the `streams/` -directory, we also assign it a name which is chosen by the software which is -performing the import. - -Each ref is a symlink to the top-level entry in `images/` or `streams/`. - -There are some rough ideas for how we might namespace this. Something like -this model is imagined: - -``` -refs -├── system -│   └── rootfs -│      ├── some_id -> ../../../974d04eaff[...] -│      └── [...] -├── 1000 # uid of a user -│   ├── flatpak -│   │   ├── some_id -> ../../../f8e2bec500[...] -│   │   └── [...] -│   └── containers -│      ├── some_id -> ../../../96a87f8b4b[...] -│      └── [...] -└── [...] -``` - -Where the toplevel directories are `system` plus a set of uids. Each `system` -or uid subdirectory is namespaced by the particular piece of software that's -responsible for storing the given image or stream. - -The per-user directories will all be owned by root and have 0700 permissions, -but each user will be able to access their own uid-numbered subdirectories by -way of an acl. The reason that we want the directories owned by root is to -prevent users from corrupting the layout of the repository. The reason for the -acl is that read-only operations on the repository should be performed -directly on the repository and not via some central agent. 
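
As a concrete illustration of the `objects/` layout described earlier in this document, here is a minimal Rust sketch that maps a hex-encoded fs-verity digest to its path in the object store. The function name `object_path` and the restriction to lowercase hex are assumptions made for this example; they are not taken from the composefs-rs API.

```rust
/// Map a lowercase hex fs-verity digest to its location under `objects/`.
///
/// Hypothetical helper: the first two hex characters select the
/// subdirectory (`00`..`ff`) and the remaining characters are the file name.
pub fn object_path(repo: &str, hex_digest: &str) -> Option<std::path::PathBuf> {
    // Assumption for this sketch: digests are written in lowercase hex.
    let is_lower_hex = |b: u8| matches!(b, b'0'..=b'9' | b'a'..=b'f');
    if hex_digest.len() <= 2 || !hex_digest.bytes().all(is_lower_hex) {
        return None;
    }
    let (dir, file) = hex_digest.split_at(2);
    Some(
        std::path::PathBuf::from(repo)
            .join("objects")
            .join(dir)
            .join(file),
    )
}
```

The split works for any digest length, so the same sketch would apply to a longer digest than the 64-character case described in the `objects/` section.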
- -## Referring to images and streams - -Operations that are performed on images or streams (mount, cat, etc.) name the -stream in one of two ways: - - - via the user-chosen name such as `refs/1000/flatpak/some_id` - - via the fs-verity digest stored in the toplevel dir - -ie: the name must either start with the string `refs/`, or must be a -hexadecimal string (64 characters for sha256, 128 for sha512). - -In both cases, the name is a path relative to the `images/` or `streams/` -directory and this path contains a symlink (either direct or indirect) to the -underlying file in `objects/`. - -When specified via fs-verity digest, the digest is verified before performing -the operation. - -For example: - -```sh -cfsctl mount refs/system/rootfs/some_id /mnt # does not check fs-verity -cfsctl mount 974d04eaff[...] /mnt # enforces fs-verity -``` - -## OCI image storage - -OCI container images are stored using streams exclusively. Each OCI artifact -(manifest, config, layer) becomes a splitstream, and OCI "tags" are refs under -`streams/refs/oci/`. - -### Naming conventions - -| OCI artifact | Stream name pattern | Example | -|---------------|------------------------------------|------------------------------------| -| Manifest | `oci-manifest-{manifest_digest}` | `oci-manifest-sha256:abc123...` | -| Config | `oci-config-{config_digest}` | `oci-config-sha256:def456...` | -| Layer | `oci-layer-{diff_id}` | `oci-layer-sha256:ghi789...` | -| Blob | `oci-blob-{blob_digest}` | `oci-blob-sha256:jkl012...` | - -Tags are stored under `streams/refs/oci/` with percent-encoding for -filesystem safety (`/` → `%2F`): - -``` -streams/refs/oci/myimage:latest → ../../oci-manifest-sha256:abc123... -``` - -### Splitstream reference chains - -Each splitstream contains `named_refs` (semantic labels mapping to entries -in the `stream_refs` array) and `object_refs` (raw objects referenced by -the compressed stream data). 
For OCI images the chain is: - -**Manifest splitstream** (`oci-manifest-sha256:...`): - - `object_refs`: the manifest JSON blob - - `named_refs`: - - `config:{config_digest}` → config splitstream verity - - `{diff_id}` → layer splitstream verity (one per layer) - -**Config splitstream** (`oci-config-sha256:...`): - - `object_refs`: the config JSON blob - - `named_refs`: - - `{diff_id}` → layer splitstream verity (one per layer) - -**Layer splitstream** (`oci-layer-sha256:...`): - - `object_refs`: file content objects extracted from the tar - - `named_refs`: none (leaf node) - -Both the manifest and config redundantly reference the layers. The GC -can reach layers from either path. - -### Garbage collection - -The GC walks all refs under `streams/refs/` to find root splitstreams, -then transitively follows `named_refs` (by resolving fs-verity IDs -through a stream name map) and collects `object_refs`. Any object not -reachable from a root is deleted. - -Concretely, for a tagged container image: - - 1. Tag `streams/refs/oci/myimage:v1` resolves to `oci-manifest-sha256:abc` - 2. Walk the manifest: mark its JSON blob and follow `named_refs` to - the config and layer streams - 3. Walk the config: mark its JSON blob and follow `named_refs` to layers - (already visited, skipped) - 4. Walk each layer: mark all file content objects - -When a tag is removed, the manifest and everything reachable only from it -becomes GC-eligible. Layers shared between images survive as long as any -referencing manifest remains tagged. - -### EROFS image tracking via config splitstream refs - -When an EROFS image is generated from an OCI image (via -`create_filesystem` + `commit_image`), its object ID (fs-verity digest) -is stored as a named ref on the config splitstream with the key -`composefs.image`. - -GC walks from tag → manifest → config, and finds the `composefs.image` -named ref. The EROFS object ID is added to the live set, keeping the -EROFS image alive. 
The EROFS image still needs an entry under `images/` -for the kernel mount security model (see above), but `images/` is not a -GC root — the config ref is what keeps the object alive. - -This means a single OCI tag is sufficient to keep the entire image -(manifest, config, layers, and the EROFS image) alive through GC. - -### Bootable image variant - -For bootable images, a second EROFS may be generated after -`transform_for_boot` (stripping `/boot`, etc.). This boot EROFS is -stored as a second named ref on the config, `composefs.image.boot`. - -Since the config splitstream content changes (new named ref), it gets a -new fs-verity digest. This cascades: the manifest must also be -rewritten (its `config:` named ref now points to the new config verity), -producing a new manifest verity. The tag is re-pointed to the new -manifest. The old config and manifest splitstreams become unreferenced -and are collected by GC. - -The result: one tag still keeps everything alive — layers, raw EROFS, -and boot EROFS. - -### Future: sealed images - -For sealed/signed images, the EROFS comes pre-built from the registry as -part of a composefs OCI artifact (referrer pattern). The artifact -splitstream would hold references to the pre-fetched EROFS layers. This -is complementary to the unsealed case — both use the same GC mechanism -(named refs pointing to EROFS objects). diff --git a/doc/splitstream.md b/doc/splitstream.md deleted file mode 100644 index 787d1ec9..00000000 --- a/doc/splitstream.md +++ /dev/null @@ -1,164 +0,0 @@ -# Splitstream - -Splitstream is a trivial way of storing file formats (like tar) with the "data -blocks" stored in the composefs object store with the goal that it's possible -to bit-for-bit recreate the entire file. 
It's something like the idea behind -[tar-split](https://github.com/vbatts/tar-split), with some important -differences: - - - it's a binary format - - - it's based on storing external objects content-addressed in the composefs - object store via their fs-verity digest - - - although it's designed with `tar` files in mind, it's not specific to `tar`, - or even to the idea of an archive file: any file format can be stored as a - splitstream, and it might make sense to do so for any file format that - contains large chunks of embedded data - - - in addition to the ability to split out chunks of file content (like files - in a `.tar`) to separate files, it is also possible to refer to external - file content, or even other splitstreams, without directly embedding that - content in the referrer, which can be useful for cross-document references - (such as between OCI manifests, configs, and layers) - - - the splitstream file itself is stored in the same content-addressed object - store by its own fs-verity hash - -Splitstream compresses inline file content before it is stored to disk using -zstd. The main reason for this is that, after removing the actual file data, -the remaining `tar` metadata contains a very large amount of padding and empty -space and compresses extremely well. - -Splitstream is conceptually independent from composefs: you could use the -format with any content-addressed storage system. - -## File format - -What follows is a non-normative documentation of the file format. The actual -definition of the format is "what composefs-rs reads and writes", but this -document may be useful to try to understand that format. If you'd like to -implement the format, please get in touch. - -The format is implemented in -[crates/composefs/src/splitstream.rs](crates/composefs/src/splitstream.rs) and -the structs from that file are copy-pasted here. Please try to keep things -roughly in sync when making changes to either side. - -All integers are little-endian. 
In the following `struct` definitions, `U` -means 'unsigned little endian' (as per the `zerocopy::little_endian` crate) so -`U64` is an unsigned 64bit little-endian integer. - -### File ranges ("sections") - -The file format consists of a fixed-sized header at the start of the file plus -a number of sections located at arbitrary locations inside of the file. All of -these sections are referred to by a 64-bit `[start..end)` range expressed in -terms of overall byte offsets within the complete file. - -```rust -struct FileRange { - start: U64, - end: U64, -} -``` - -### Header - -The file starts with a simple fixed-size header. - -```rust -const SPLITSTREAM_MAGIC: [u8; 11] = *b"SplitStream"; - -struct SplitstreamHeader { - pub magic: [u8; 11], // Contains SPLITSTREAM_MAGIC - pub version: u8, // must always be 0 - pub _flags: U16, // is currently always 0 (but ignored) - pub algorithm: u8, // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512) - pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k) - pub info: FileRange, // can be used to expand/move the info section in the future -} -``` - -In addition to magic values and identifiers for the fs-verity algorithm in use, -the header is used to find the location and size of the info section. Future -expansions to the file format are imagined to occur by expanding the size of -the info section: if the section is larger than expected, the additional bytes -will be ignored by the implementation. 
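
To make the header layout concrete, the following Rust sketch decodes it from a byte buffer using only the standard library. It assumes the fields are stored in declaration order with no padding (32 bytes in total), which is an assumption of this example rather than something stated above; `ParsedHeader`, `parse_header`, and `read_u64_le` are hypothetical names.

```rust
const SPLITSTREAM_MAGIC: [u8; 11] = *b"SplitStream";

/// Hypothetical decoded form of `SplitstreamHeader` (illustration only).
#[derive(Debug, PartialEq)]
pub struct ParsedHeader {
    pub version: u8,
    pub flags: u16,
    pub algorithm: u8,    // 1 = sha256, 2 = sha512
    pub lg_blocksize: u8, // log2 of the fs-verity block size
    pub info_start: u64,  // start of the info section (byte offset in file)
    pub info_end: u64,    // end of the info section (exclusive)
}

/// Read a little-endian u64 at `offset`; the caller checks the bounds.
fn read_u64_le(buf: &[u8], offset: usize) -> u64 {
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&buf[offset..offset + 8]);
    u64::from_le_bytes(bytes)
}

/// Parse the fixed-size header, assuming a packed 32-byte layout.
pub fn parse_header(buf: &[u8]) -> Result<ParsedHeader, String> {
    if buf.len() < 32 {
        return Err("buffer shorter than header".to_string());
    }
    if buf[..11] != SPLITSTREAM_MAGIC[..] {
        return Err("bad magic".to_string());
    }
    let version = buf[11];
    if version != 0 {
        return Err(format!("unsupported version {}", version));
    }
    let header = ParsedHeader {
        version,
        flags: u16::from_le_bytes([buf[12], buf[13]]), // currently ignored
        algorithm: buf[14],
        lg_blocksize: buf[15],
        info_start: read_u64_le(buf, 16),
        info_end: read_u64_le(buf, 24),
    };
    if header.info_start > header.info_end {
        return Err("info range is reversed".to_string());
    }
    Ok(header)
}
```

A reader built this way would then take the bytes in `[info_start..info_end)` as the info section and ignore any trailing bytes it does not understand, in line with the expansion strategy described above.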
- -### Info section - -```rust -struct SplitstreamInfo { - pub stream_refs: FileRange, // location of the stream references array - pub object_refs: FileRange, // location of the object references array - pub stream: FileRange, // location of the zstd-compressed stream within the file - pub named_refs: FileRange, // location of the compressed named references - pub content_type: U64, // user can put whatever magic identifier they want there - pub stream_size: U64, // total uncompressed size of inline chunks and external chunks -} -``` - -The `content_type` is just an arbitrary identifier that can be used by users of -the file format to prevent casual user error when opening a file by its hash -value (to prevent showing `.tar` data as if it were json, for example). - -The `stream_size` is the total size of the original file. - -### Stream and object refs sections - -All referred streams and objects in the file are stored as two separate flat -uncompressed arrays of binary fs-verity hash values. Each of these arrays is -referred to from the info section (via `stream_refs` and `object_refs`). - -The number of items in the array is determined by the size of the section -divided by the size of the fs-verity hash value (determined by the algorithm -identifier in the header). - -The values are not in any particular order, but implementations should produce -a deterministic output. For example, the objects reference array produced by -the current implementation has the external objects sorted by first-appearance -within the stream. - -The main motivation for storing the references uncompressed, in binary, and in -a flat array is to make determining the references contained within a -splitstream as simple as possible to improve the efficiency of garbage -collection on large repositories. - -### The stream - -The main content of the splitstream is stored in the `stream` section -referenced from the info section. The entire section is zstd compressed. 
- -Within the compressed stream, the splitstream is formed from a number of -"chunks". Each chunk starts with a single signed 64-bit little-endian value. If that -number is negative, it refers to an "inline" chunk, and that (absolute) number -of bytes of data immediately follows it. If the number is non-negative then it -is an index into the object refs array for an "external" chunk. - -Zero is a non-negative value, so it's an object reference. It's not possible -to have a zero-byte inline chunk. This also means that the high/sign bit -determines which case (inline vs. external) we have and both cases have an equal -number of possible values. - -The stream is reassembled by iterating over the chunks and concatenating the -result. For inline chunks, the inline data is taken directly from the -splitstream. For external chunks, the content of the external file is used. - -The stream is over when there are no more chunks. - -### Named references - -It's possible to have named references to other streams. These are stored in -the `named_refs` section referred to from the info section. - -This section is also zstd-compressed, and is a number of nul-terminated text -records (including a terminator after the last record). Each record has the -form `n:name` where `n` is a non-negative integer index into the stream refs -array and `name` is an arbitrary name. The entries are currently sorted by -name (by the writer implementation) but the order is not important to the -reader. Whether this list is "officially" sorted may be pinned -down at some future point if a need should arise. - -An example of the decompressed content of the section might be something like -`"0:first\01:second\0"`.
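
The chunk framing and the named-references records described above can be sketched in Rust as follows. Both functions operate on already decompressed bytes (zstd decoding is left out), and the names `Chunk`, `decode_chunks`, and `parse_named_refs` are hypothetical, chosen for this illustration rather than taken from composefs-rs.

```rust
/// One element of the decompressed stream (illustrative type).
#[derive(Debug, PartialEq)]
pub enum Chunk<'a> {
    /// Bytes stored directly in the splitstream.
    Inline(&'a [u8]),
    /// Index into the object refs array.
    External(usize),
}

/// Split a decompressed stream into inline and external chunks.
pub fn decode_chunks(mut data: &[u8]) -> Result<Vec<Chunk<'_>>, String> {
    let mut chunks = Vec::new();
    while !data.is_empty() {
        if data.len() < 8 {
            return Err("truncated chunk header".to_string());
        }
        let (head, rest) = data.split_at(8);
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(head);
        let value = i64::from_le_bytes(bytes);
        if value < 0 {
            // Negative: the absolute value is the inline length in bytes.
            let len = value.unsigned_abs();
            if (rest.len() as u64) < len {
                return Err("truncated inline chunk".to_string());
            }
            let (inline, tail) = rest.split_at(len as usize);
            chunks.push(Chunk::Inline(inline));
            data = tail;
        } else {
            // Non-negative (including zero): an object refs index.
            chunks.push(Chunk::External(value as usize));
            data = rest;
        }
    }
    Ok(chunks)
}

/// Parse decompressed `named_refs` records of the form `n:name\0`.
pub fn parse_named_refs(section: &[u8]) -> Result<Vec<(usize, String)>, String> {
    let mut refs = Vec::new();
    if section.is_empty() {
        return Ok(refs);
    }
    if section.last() != Some(&0) {
        return Err("missing terminator after last record".to_string());
    }
    let body = &section[..section.len() - 1];
    for record in body.split(|&b| b == 0) {
        let text = std::str::from_utf8(record).map_err(|e| e.to_string())?;
        // Split on the first ':' only, since names may contain ':' themselves.
        let (index, name) = text
            .split_once(':')
            .ok_or_else(|| format!("missing ':' in {:?}", text))?;
        let index: usize = index
            .parse()
            .map_err(|_| format!("bad index in {:?}", text))?;
        refs.push((index, name.to_string()));
    }
    Ok(refs)
}
```

Reassembling the original file would then amount to concatenating each `Inline` slice and, for each `External` index, the content of the referenced object.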