From 3e3fe9f128787cb8aeea9744f65783e08456260c Mon Sep 17 00:00:00 2001 From: Colin Walters Date: Fri, 26 Jun 2026 10:58:31 -0400 Subject: [PATCH] Make rustdoc hold most documentation I want composefs to have a website with useful docs. After some reflection, what I think is the least effort and most clear right now is just to move things into the Rust documentation. So things like the repository format, splitstream definition now live there. The pattern is that we have `#[cfg(doc)] mod foo;` for things that only need to live in docs. Add information about varlink there too. Assisted-by: OpenCode (claude-opus-4-6) Signed-off-by: Colin Walters --- README.md | 84 ++---- crates/composefs-boot/src/design.rs | 90 ++++++ crates/composefs-boot/src/lib.rs | 3 + crates/composefs-oci/src/design.rs | 127 +++++++++ crates/composefs-oci/src/lib.rs | 3 + crates/composefs-ostree/src/design.rs | 139 +++++++++ crates/composefs-ostree/src/lib.rs | 2 + crates/composefs/src/erofs_format.rs | 82 ++++++ crates/composefs/src/lib.rs | 50 +++- crates/composefs/src/repository_format.rs | 317 +++++++++++++++++++++ crates/composefs/src/splitstream_format.rs | 164 +++++++++++ crates/composefs/src/varlink.rs | 160 +++++++++++ doc/booting.md | 90 ------ doc/erofs.md | 82 ------ doc/oci.md | 127 --------- doc/ostree.md | 139 --------- doc/repository.md | 316 -------------------- doc/splitstream.md | 164 ----------- 18 files changed, 1152 insertions(+), 987 deletions(-) create mode 100644 crates/composefs-boot/src/design.rs create mode 100644 crates/composefs-oci/src/design.rs create mode 100644 crates/composefs-ostree/src/design.rs create mode 100644 crates/composefs/src/erofs_format.rs create mode 100644 crates/composefs/src/repository_format.rs create mode 100644 crates/composefs/src/splitstream_format.rs create mode 100644 crates/composefs/src/varlink.rs delete mode 100644 doc/booting.md delete mode 100644 doc/erofs.md delete mode 100644 doc/oci.md delete mode 100644 doc/ostree.md delete mode 
100644 doc/repository.md delete mode 100644 doc/splitstream.md diff --git a/README.md b/README.md index 6121f789..55ed5112 100644 --- a/README.md +++ b/README.md @@ -34,76 +34,32 @@ and Linux kernel integration, but with the *flexibility* of files for content — avoiding doubled disk usage, partition table management, and similar headaches. -### Separation between metadata and data - -A key aspect of composefs is its separation of "data" (non-empty regular -files) from "metadata" (everything else: directories, symlinks, permissions, -ownership, etc.). - -composefs produces an [EROFS](https://erofs.docs.kernel.org) filesystem -image that contains only metadata. The non-empty data files live in a -separate "backing store" directory. The EROFS image includes -`trusted.overlay.redirect` extended attributes that tell the overlayfs -mount how to find the real underlying files. - -### Shared backing store - -The primary use case for composefs is versioned, immutable filesystem -trees — container images and bootable host systems — where multiple -images may share parts of their storage. - -By storing files content-addressed (named by the hash of their content), -shared files need to be stored only once on disk yet can appear in -multiple mounts. Crucially, these data files are also shared in the -[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache), -allowing multiple running container images to reliably share memory. - -### Filesystem integrity - -composefs supports [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html) -validation of content files. The digest of each content file is stored -in the EROFS image via `trusted.overlay.metacopy` extended attributes, -which overlayfs validates when the file is accessed. This means backing -content cannot be changed (by mistake or by malice) without detection. - -You can also enable fs-verity on the image file itself and pass the expected -digest as a mount option. 
This provides full trust of both data and metadata, -solving a weakness of fs-verity alone (which can only verify file data, -not metadata like permissions, ownership, or directory structure). +composefs separates metadata (directories, permissions, xattrs) from data +(file content). An EROFS image carries only the metadata; data files live in +a content-addressed backing store, shared across images and in the Linux +[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache). +Optional [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html) +provides end-to-end integrity verification of both data and metadata. +For design details, see the [crate documentation](https://docs.rs/composefs). ## Use cases ### Container images -For [OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md) -container images, a common approach (used by both Docker and Podman) is -to untar each layer separately and use overlayfs to stitch them together. -composefs improves on this by storing file content in a content-addressed -fashion, allowing sharing between images even when metadata like -timestamps or ownership differs. - -Combined with approaches like -[zstd:chunked](https://github.com/containers/storage/pull/775), -this speeds up pulling container images and avoids redundantly -creating files that are already present. +composefs improves on the traditional per-layer overlayfs model for +[OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md) +container images by storing file content in a content-addressed store, +enabling sharing between images and faster pulls via +[zstd:chunked](https://github.com/containers/storage/pull/775). ### Bootable host systems -Anywhere one wants versioned immutable filesystem trees ("images"), -composefs provides compelling advantages. In particular, this project -aims to be the successor to [OSTree](https://github.com/ostreedev/ostree/). 
- -OSTree uses a content-addressed object store, but traditionally checks out -into a regular directory (using hardlinks), which is then bind-mounted as -the rootfs. While OSTree supports enabling fs-verity on files in the store, -nothing protects the checkout directories from modification. - -composefs replaces this checkout with a directly-mountable image pointing -into the object store. We can enable fs-verity on the composefs image and -embed its digest in the kernel commandline or a Unified Kernel Image (UKI). -Since composefs generation is reproducible, we can verify the generated -image is correct by comparing its digest to one in the metadata produced -at build time. For more on this, see [this tracking issue](https://github.com/ostreedev/ostree/issues/2867). +composefs aims to succeed [OSTree](https://github.com/ostreedev/ostree/) +by replacing hardlink checkouts with directly-mountable images backed by a +shared object store. Combined with fs-verity and a digest embedded in the +kernel commandline or a UKI, this provides cryptographic verification of +the entire filesystem tree. See [this tracking issue](https://github.com/ostreedev/ostree/issues/2867) +for background. ## Components @@ -147,9 +103,7 @@ helper that supports `mount -t composefs` syntax directly. ## Documentation - - [Repository format](doc/repository.md) - - [OCI integration](doc/oci.md) - - [Splitstream format](doc/splitstream.md) + - [API and design documentation](https://docs.rs/composefs) - [Examples README](examples/README.md) ## Status diff --git a/crates/composefs-boot/src/design.rs b/crates/composefs-boot/src/design.rs new file mode 100644 index 00000000..70119a47 --- /dev/null +++ b/crates/composefs-boot/src/design.rs @@ -0,0 +1,90 @@ +//! # Booting from a composefs image +//! +//! This document describes how composefs-rs sets up the root filesystem during +//! early boot. It covers the kernel command-line interface, the expected on-disk +//! 
layout, kernel requirements, and the step-by-step mount sequence performed by +//! `composefs-setup-root`. +//! +//! The target audience is system integrators and OS developers who are packaging a +//! bootable system using composefs. Familiarity with Linux mount namespaces, +//! overlayfs, and fs-verity is assumed. +//! +//! ## Kernel command-line +//! +//! The initramfs code in composefs supports multiple kernel arguments; it +//! is possible to pre-compute the digest of an image using both e.g. SHA-256 and +//! SHA-512. On an installed system, the repository only supports one digest +//! by default today, and the first found will be selected. +//! +//! Additionally, it is opt-in to enable v1 EROFS, and again the first compatible +//! version will be found. +//! +//! ```text +//! composefs.digest=v1-sha256-12: # V1 EROFS image (preferred; RHEL9-era kernels) +//! composefs.digest=v1-sha512-12: # V1 EROFS image (SHA-512 variant) +//! composefs.digest=v2-sha512-12: # V2 EROFS image (explicit form) +//! composefs= # V2 EROFS image (legacy shorthand) +//! ``` +//! +//! The value format is `--:`, where +//! `` is `v1` or `v2`, `` is `sha256` or `sha512`, and +//! `` is the log2 block size (currently always `12`, i.e. 4096 +//! bytes). This mirrors how `meta.json` encodes the algorithm as +//! `fsverity-sha256-12`. +//! +//! `composefs.digest=` is checked first. Multiple entries may appear on the cmdline +//! (one per format/algorithm combination); the initramfs tries each in order and +//! mounts the first image that actually exists in the repository. +//! +//! `composefs=` is a legacy shorthand equivalent to +//! `composefs.digest=v2--12:` -- the algorithm is inferred from the +//! digest length (64 hex chars -> SHA-256, 128 -> SHA-512). It is checked only when +//! no `composefs.digest=` token matches. +//! +//! **Insecure mode.** Placing `?` immediately after `=` (e.g. +//! `composefs.digest=?v1-sha256-12:` or `composefs=?`) makes +//! 
fs-verity verification optional. The system will boot even when the underlying +//! filesystem does not support fs-verity or the image has no verity metadata +//! attached. This mode exists for development and testing only; it must not be used +//! in production. +//! +//! ## On-disk layout +//! +//! The composefs repository must be present at `/sysroot/composefs` with the +//! standard layout described in the `composefs::repository_format` module. +//! +//! The digest must correspond to a symlink under `images/`. +//! +//! Persistent per-deployment state lives at `/sysroot/state/deploy//`, +//! where `` matches the boot karg digest exactly. The `etc/` and `var/` +//! subdirectories within that directory serve as the upper layers for the +//! corresponding overlayfs mounts. +//! +//! ## Kernel requirements +//! +//! The following kernel features must be available: +//! +//! - **EROFS** filesystem driver (`CONFIG_EROFS_FS`) +//! - **overlayfs** with `metacopy=on` and `redirect_dir=on` +//! (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`) +//! - **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`) +//! - The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`), +//! available since kernel 5.2. Kernel >= 6.15 is required for the atomic root +//! replacement path (the default build). On kernels without `fsconfig_set_fd` +//! support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created +//! automatically by `composefs::mountcompat`. +//! +//! ## Kernel argument +//! +//! The boot karg (`composefs.digest=` or `composefs=`) is the authoritative selector for which image is booted. +//! Without the `?` insecure prefix, every file access through the overlayfs is +//! verified against the object's stored digest by the kernel, combining fs-verity +//! on the data objects with overlayfs `verity=require`. +//! +//! ## Other notes +//! +//! As a workaround for a GPT auto-root issue in systemd +//! 
([systemd#35017](https://github.com/systemd/systemd/issues/35017)), +//! `composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a +//! symlink pointing to the real block device before performing any mounts. Failure +//! to do so is non-fatal and does not abort the boot sequence. diff --git a/crates/composefs-boot/src/lib.rs b/crates/composefs-boot/src/lib.rs index 11e5cc33..db57cc6c 100644 --- a/crates/composefs-boot/src/lib.rs +++ b/crates/composefs-boot/src/lib.rs @@ -15,6 +15,9 @@ pub mod selabel; pub mod uki; pub mod write_boot; +#[cfg(doc)] +pub mod design; + use std::ffi::OsStr; use anyhow::Result; diff --git a/crates/composefs-oci/src/design.rs b/crates/composefs-oci/src/design.rs new file mode 100644 index 00000000..30be8842 --- /dev/null +++ b/crates/composefs-oci/src/design.rs @@ -0,0 +1,127 @@ +//! # How to create a composefs from an OCI image +//! +//! This document is incomplete. It only serves to document some decisions we've +//! taken about how to resolve ambiguous situations. +//! +//! # Data precision +//! +//! We currently create a composefs image using the granularity of data as +//! typically appears in OCI tarballs: +//! - atime and ctime are not present (these are actually not physically present +//! in the erofs inode structure at all, either the compact or extended forms) +//! - mtime is set to the mtime in seconds; the sub-seconds value is simply +//! truncated (ie: we always round down). erofs has an nsec field, but it's not +//! normally present in OCI tarballs. That's down to the fact that the usual +//! tar header only has timestamps in seconds and extended headers are not +//! usually added for this purpose. +//! - we take great care to faithfully represent hardlinks: even though the +//! produced filesystem is read-only and we have data de-duplication via the +//! objects store, we make sure that hardlinks result in an actual shared inode +//! 
as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem. +//! +//! We apply these precision restrictions also when creating images by scanning the +//! filesystem. For example: even if we get more-accurate timestamp information, +//! we'll truncate it to the nearest second. +//! +//! # Merging directories +//! +//! This is done according to the OCI spec, with an additional clarification: in +//! case a directory entry is present in multiple layers, we use the tar metadata +//! from the most-derived layer to determine the attributes (owner, permissions, +//! mtime) for the directory. +//! +//! # The root inode +//! +//! The root inode (/) is a difficult case because OCI container layer tars often +//! don't include a root directory entry, and when they do, container runtimes +//! (Podman, Docker) ignore it and use hardcoded defaults. For example, Podman's +//! [containers/storage](https://github.com/containers/storage) uses root:root +//! ownership, mode `0555`, and epoch (0) mtime when extracting layers, but +//! Docker uses `0755`. In general, the metadata for `/` is not defined. +//! +//! Because composefs requires (has a goal of providing) precise cryptographically +//! verifiable filesystem trees, we solve this for OCI by copying the metadata from `/usr` +//! to the root directory. The rationale is that `/usr` is always present in +//! standard filesystem layouts and must be defined explicitly in the OCI layers. +//! +//! This is implemented via the `copy_root_metadata_from_usr()` method and the +//! `read_container_root()` convenience function. +//! +//! When building a filesystem from OCI layers programmatically, use +//! `Stat::uninitialized()` to create the initial `FileSystem`. This placeholder +//! has mode `0` (obviously invalid) to make it clear that the root metadata should +//! be set before computing digests - typically by calling +//! `copy_root_metadata_from_usr()` after processing all layers. +//! +//! 
# Extended attributes (xattrs) +//! +//! When reading a container filesystem from a mounted root (as opposed to +//! processing OCI layer tars directly), host-side xattrs can leak into the +//! image. This is particularly problematic for `security.selinux` labels: +//! if SELinux is enabled at build time, files will have labels like +//! `container_t` that come from the build host, not from the target system's +//! policy. +//! +//! To ensure reproducibility, `read_container_root()` filters xattrs to only +//! include those in an allowlist. Currently this is just `security.capability`, +//! which represents actual file capabilities that should be preserved. +//! +//! SELinux labels are handled separately by `transform_for_boot()`: +//! - If the target filesystem contains a SELinux policy (in `/etc/selinux`), +//! all files are relabeled according to that policy +//! - If no SELinux policy is found, all `security.selinux` xattrs are stripped +//! +//! This ensures that: +//! - Build-time SELinux labels don't leak into non-SELinux targets +//! - SELinux-enabled targets get correct labels from their own policy +//! - Other host xattrs (overlayfs internals, etc.) don't pollute the image +//! +//! See: +//! +//! # The /run directory +//! +//! When processing OCI images via `create_filesystem()`, the `/run` directory +//! is emptied if present. This is a tmpfs at runtime and should always be +//! empty in images. Its mtime is set to match `/usr` for consistency with +//! how root directory metadata is handled. +//! +//! This makes it possible to work around podman/buildah's `RUN --mount` issue where cache +//! mounts can leave incomplete directory entries in OCI tar layers (directories +//! without explicit tar entries inherit incorrect mtimes) by pointing all +//! such mounts into `/run`, and then redirecting from their final location +//! via e.g. symlinks into `/run`. +//! +//! ## Container build cache mounts +//! +//! 
A practical implication of emptying `/run` is that container authors can +//! use it for cache mounts without worrying about polluting the final image. +//! +//! Instead of: +//! ```dockerfile +//! RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ... +//! ``` +//! +//! Consider: +//! ```dockerfile +//! RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf +//! RUN --mount=type=cache,target=/run/dnfcache dnf install -y ... +//! ``` +//! +//! This avoids potential mtime inconsistencies in `/var/cache` while still +//! benefiting from build caching. +//! +//! See: +//! +//! # Emptied directories for boot +//! +//! When preparing a filesystem for boot via `transform_for_boot()`, certain +//! additional directories are emptied because their contents should not be +//! part of the final verified image: +//! +//! - `/boot`: Contains the UKI which embeds the composefs digest, so including +//! it would create a circular dependency +//! - `/sysroot`: Only has content in ostree-container cases, and traversing +//! it for SELinux labeling causes problems +//! +//! These directories are emptied and their mtime is set to match `/usr` for +//! consistency with how the root directory metadata is handled. diff --git a/crates/composefs-oci/src/lib.rs b/crates/composefs-oci/src/lib.rs index 807ae966..0aefd575 100644 --- a/crates/composefs-oci/src/lib.rs +++ b/crates/composefs-oci/src/lib.rs @@ -35,6 +35,9 @@ pub mod tar; #[doc(hidden)] pub mod test_util; +#[cfg(doc)] +pub mod design; + // Re-export the composefs crate for consumers who only need composefs-oci pub use composefs; diff --git a/crates/composefs-ostree/src/design.rs b/crates/composefs-ostree/src/design.rs new file mode 100644 index 00000000..1270e410 --- /dev/null +++ b/crates/composefs-ostree/src/design.rs @@ -0,0 +1,139 @@ +//! # OSTree +//! +//! composefs-rs has support for importing images from OSTree +//! repositories, by pulling from local or remote OSTree +//! repositories. 
These images can then be mounted as composefs images, +//! sharing disk (deduplication) with other ostree or other types of +//! images in the composefs repository. +//! +//! Native OSTree repositories are a format similar to a composefs +//! repository, but not quite the same. This means we need some +//! conversions when handling ostree commits in a composefs repository. +//! +//! OSTree images (commits) are fundamentally made up of many small sha256 +//! content-addressed objects that reference each other. Each commit is +//! the root of a DAG that defines the total image. Some of the OSTree +//! objects are metadata like directory permissions, or list of files in a +//! directory. These don't really exist in composefs where all metadata is +//! part of the erofs image. However, some objects are large file objects, +//! and these are similar to the file objects in composefs +//! images. However, even these differ, because the checksum defining the +//! object is made up of both the file content and the file metadata. +//! +//! When an OSTree commit is stored in a composefs repo it is stored as a +//! single splitstream file, named `ostree-commit-$commit_id`, which uses +//! external object references to all the file content objects that will +//! be used when creating an erofs image for it. This means OSTree objects +//! for files that would be inlined in the erofs image will not be +//! external objects. +//! +//! OSTree commit splitstream objects are created during a pull operation +//! and are used for two things: creating a composefs image by walking the +//! DAG, and serving as a source of already available OSTree objects during +//! a pull operation. Such sources are found automatically during pull +//! (e.g. parent commit, or old commit for a ref being pulled) or can be +//! manually specified. +//! +//! ## File format +//! +//! This describes the format of the `ostree-commit-$commit_id` files. +//! +//! ### Splitstream header +//! +//! 
Since the commit file is a split stream it starts with the splitstream +//! headers. Of these we use two, the named refs and the object +//! refs: +//! +//! * When an erofs image is created for the commit, it is referenced by +//! the `composefs.image` named ref. +//! +//! * Any external file content objects are in the external_refs +//! table. The index of the references in this header table is used to +//! refer to the file in the splitstream itself. +//! +//! The splitstream content type used for commits is 0xAFE138C18C463EF1. +//! +//! ### Splitstream content +//! +//! A splitstream is normally a series of internal and external chunks, +//! but the ostree commit uses only one inline chunk. This chunk is +//! basically a serialized form of the "objects" directory of an OSTree +//! repository. I.e. it has a mapping of sha256 to ostree object data. +//! All objects except file objects are stored in the standard ostree +//! object format. +//! +//! OSTree file objects are stored in the archive-z2 format, except not +//! compressed, and optionally the file content part of it may be stored +//! as referencing the index of an external object. The z2 format is, +//! first an 8-byte header that gives the size (in bytes) of a gvariant, +//! then comes the gvariant with the file meta in +//! OSTREE_ZLIB_FILE_HEADER_GVARIANT_FORMAT format, and then the +//! file/symlink inline data. If an external object is referenced for the +//! object then it is expected that there is no inline file data. +//! +//! The high level view of the file looks like this: +//! ```text +//! +---------------+ +//! | Header | +//! +---------------| +//! | Object IDs | +//! +---------------| +//! | Object Info | +//! +---------------| +//! | Content | +//! +---------------+ +//! ``` +//! +//! The Object IDs is a sorted array of sha256 digests, and you would do +//! lookups in it using a binary search. The buckets in the header can be +//! 
used to quickly limit the binary search based on the first byte of a +//! digest. +//! +//! Then, at the same index as the binary searched object you can look up +//! the object info which gives you the offset/length of the object +//! content data and optionally a reference to an external object. +//! +//! The exact form of the data looks like this, packed in order from the +//! start of the splitstream content. All ints are in little endian. +//! +//! ### Header +//! ```text +//! +-----------------------------------+ +//! | u32: index of commit object | +//! | u32: flags (currently unused) | +//! | [u32; 256]: end index of bucket | +//! +-----------------------------------+ +//! ``` +//! +//! The bucket list contains the end index (in the object ids table) of +//! objects starting with that particular byte, and can be used to quickly +//! limit the search. We can also compute the total number of objects +//! (n_objects) by looking in the last bucket. +//! +//! ### Object ids +//! ```text +//! n_objects x +//! +-----------------------------------+ +//! | [u8; 32] ostree object id | +//! +-----------------------------------+ +//! ``` +//! +//! ### Object Info +//! ```text +//! n_objects x +//! +-----------------------------------+ +//! | u32: Offset to per-object data | +//! | u32: Length of per-object data | +//! | u32: Index of external object ref | +//! | or MAXUINT32 if none. | +//! +-----------------------------------+ +//! ``` +//! +//! This is an array of information for each object. Once you have found +//! the object id in the object ids table, you would look at the same +//! index in this table to find the information. Offsets to per-object +//! data are in bytes from the start of the content area, which starts at +//! the end of the Object Info table. All data chunk references are +//! aligned to 8 bytes with respect to the start of the content area. +//! This is useful because GVariants (used by ostree) naturally want +//! 8-byte alignment. 
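The lookup described for `ostree-commit-$commit_id` files (bucket table, sorted Object ids, parallel Object Info table) can be sketched roughly as follows. This is only an illustration of the layout as written above, not the composefs-ostree implementation: `ObjectInfo`, `encode_tables` and `lookup` are hypothetical names, `content` is assumed to hold the bytes of the single inline chunk, and each bucket's end index is assumed to be exclusive (so the last bucket equals `n_objects`).

```rust
use std::cmp::Ordering;

// Fixed header: commit index (u32), flags (u32), 256 bucket end indices (u32).
const HEADER_LEN: usize = 4 + 4 + 256 * 4;
const ID_LEN: usize = 32; // sha256 digest
const INFO_LEN: usize = 12; // offset, length, external ref index

/// One Object Info entry (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct ObjectInfo {
    offset: u32,
    length: u32,
    external_ref: Option<u32>, // None is stored as MAXUINT32
}

/// Read a little-endian u32 at `pos`, or None if the buffer is too short.
fn read_u32(buf: &[u8], pos: usize) -> Option<u32> {
    let b = buf.get(pos..pos + 4)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Serialize header, Object ids and Object Info for `objects`,
/// which must already be sorted by id. Content data is not written.
fn encode_tables(commit_index: u32, objects: &[([u8; 32], ObjectInfo)]) -> Vec<u8> {
    let mut bucket_ends = [0u32; 256];
    for (id, _) in objects {
        bucket_ends[id[0] as usize] += 1; // count objects per first byte
    }
    let mut running = 0u32;
    for end in bucket_ends.iter_mut() {
        running += *end; // prefix sum turns counts into end indices
        *end = running;
    }
    let mut out = Vec::new();
    out.extend_from_slice(&commit_index.to_le_bytes());
    out.extend_from_slice(&0u32.to_le_bytes()); // flags, currently unused
    for end in bucket_ends.iter() {
        out.extend_from_slice(&end.to_le_bytes());
    }
    for (id, _) in objects {
        out.extend_from_slice(id);
    }
    for (_, info) in objects {
        out.extend_from_slice(&info.offset.to_le_bytes());
        out.extend_from_slice(&info.length.to_le_bytes());
        out.extend_from_slice(&info.external_ref.unwrap_or(u32::MAX).to_le_bytes());
    }
    out
}

/// Find `id` by binary search restricted to the bucket for its first byte.
fn lookup(content: &[u8], id: &[u8; 32]) -> Option<ObjectInfo> {
    let bucket_end = |b: usize| read_u32(content, 8 + 4 * b);
    let n_objects = bucket_end(255)? as usize;
    let first = id[0] as usize;
    let lo = if first == 0 { 0 } else { bucket_end(first - 1)? as usize };
    let hi = bucket_end(first)? as usize;
    if lo > hi || hi > n_objects {
        return None; // inconsistent bucket table
    }
    let info_start = HEADER_LEN + n_objects * ID_LEN;
    let ids = content.get(HEADER_LEN + lo * ID_LEN..HEADER_LEN + hi * ID_LEN)?;
    let (mut left, mut right) = (0usize, hi - lo);
    while left < right {
        let mid = left + (right - left) / 2;
        let candidate = &ids[mid * ID_LEN..(mid + 1) * ID_LEN];
        match candidate.cmp(&id[..]) {
            Ordering::Less => left = mid + 1,
            Ordering::Greater => right = mid,
            Ordering::Equal => {
                // Same index in the Object Info table as in the Object ids table.
                let pos = info_start + (lo + mid) * INFO_LEN;
                let ext = read_u32(content, pos + 8)?;
                return Some(ObjectInfo {
                    offset: read_u32(content, pos)?,
                    length: read_u32(content, pos + 4)?,
                    external_ref: if ext == u32::MAX { None } else { Some(ext) },
                });
            }
        }
    }
    None
}
```

The returned `offset` is relative to the content area, which under this layout begins at `HEADER_LEN + n_objects * (ID_LEN + INFO_LEN)`. A caller would add that base before slicing the object data.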
diff --git a/crates/composefs-ostree/src/lib.rs b/crates/composefs-ostree/src/lib.rs index e292188c..9c9653c9 100644 --- a/crates/composefs-ostree/src/lib.rs +++ b/crates/composefs-ostree/src/lib.rs @@ -29,6 +29,8 @@ pub struct CommitInfo { } mod commit; +#[cfg(doc)] +pub mod design; mod ostree; mod pull; mod repo; diff --git a/crates/composefs/src/erofs_format.rs b/crates/composefs/src/erofs_format.rs new file mode 100644 index 00000000..f939bd2c --- /dev/null +++ b/crates/composefs/src/erofs_format.rs @@ -0,0 +1,82 @@ +//! # composefs EROFS image format +//! +//! composefs images are EROFS filesystem images with composefs-specific extensions. They encode +//! a directory tree where regular files are stored externally in a content-addressed object store +//! and referenced by their fs-verity digest. The EROFS image itself carries only metadata: inodes, +//! directory entries, extended attributes, and chunk index entries that point to the external files. +//! +//! composefs-rs supports two EROFS format versions. V1 is byte-for-byte compatible with the C +//! `mkcomposefs` tool. V2 is the composefs-rs native format and drops several V1 constraints +//! that exist only for C compatibility. +//! +//! `cfsctl init` defaults to V2; pass `--erofs-version 1` to select V1. Higher-level tools +//! such as bootc initialize repositories with multiple formats enabled (V1 primary) so that images +//! can be booted on RHEL9-era kernels that require the `composefs.digest=` karg. +//! +//! ## Format V1 +//! +//! V1 is selected with `cfsctl init --erofs-version 1`. The `v1_erofs` ro-compat feature flag +//! is written to `meta.json` so that tools without V1 support open the repository read-only. +//! +//! **`composefs_version` field values in V1:** +//! +//! - `0` — no user-visible whiteout files (character devices with rdev=0) in the tree +//! - `1` — at least one user-visible whiteout file is present +//! +//! 
The constant `COMPOSEFS_VERSION_V1` is 0; the field only reaches 1 when user whiteouts are +//! found. The `--min-version` flag in `mkcomposefs` (mirrored by `mkfs_erofs_v1_min_version`) +//! forces the value to 1 even when no user whiteouts exist, for forward compatibility. +//! +//! **Inode layout:** V1 uses compact inodes (32 bytes) when the file data and inode fit within +//! the constraints of the compact format, and extended inodes (64 bytes) otherwise. +//! +//! **Inode traversal order:** V1 collects inodes in breadth-first order — all entries at one +//! directory level before descending. +//! +//! **Whiteout stub table:** V1 includes 256 synthetic inode entries at the start of the inode +//! area, one per two-hex-character prefix `00`–`ff`. Each entry is a character-device stub +//! (chr 0,0) used by the overlay filesystem to resolve whiteout paths against the object store. +//! V2 omits them entirely. +//! +//! **Whiteout escaping:** User-visible whiteout files (chr 0,0) in the tree are not stored as +//! character devices on disk. Instead they receive a `trusted.overlay.opaque=x` xattr and are +//! serialized differently. The stub entries in the whiteout table are not escaped. +//! +//! **`build_time`:** The superblock `build_time` field is set to the minimum mtime across all inodes. +//! +//! **xattr sharing:** Xattr entries are deduplicated using a sort key that is the full xattr name (prefix string concatenated with the suffix). +//! +//! ## Format V2 — Created in composefs-rs +//! +//! V2 is the default for repositories created with `cfsctl init` without `--erofs-version 1`. +//! +//! **`composefs_version` field:** Always `2` (the constant `COMPOSEFS_VERSION`). +//! +//! **Inode layout:** V2 always uses extended inodes (64 bytes). +//! +//! **Inode traversal order:** V2 collects inodes in depth-first order — all descendants of a directory before moving to the next sibling. +//! +//! 
**No whiteout stub table:** V2 has no synthetic stub entries; whiteout files are stored directly without escaping. +//! +//! **`build_time`:** Always 0. +//! +//! **xattr sharing:** Xattr entries are deduplicated using a sort key of (prefix, suffix, value) +//! rather than the full name string, which can produce a smaller shared xattr area. +//! +//! ## Selecting the format +//! +//! The format is fixed at repository initialization time and cannot be changed afterward. +//! +//! ```text +//! cfsctl init # V2 (default) +//! cfsctl init --erofs-version 1 # V1 (C-tool compatible) +//! ``` +//! +//! The format is recorded in `meta.json` (see [`repository_format`][crate::repository_format]) as the `v1_erofs` ro-compat feature flag: present +//! means V1, absent means V2. Tools that do not recognize this flag open the repository +//! read-only rather than writing images in the wrong format. +//! +//! For the standalone `mkcomposefs` tool, the equivalent flag is `--erofs-version`. The +//! `--min-version` flag (`mkfs_erofs_v1_min_version` in the Rust API) controls whether the +//! `composefs_version` field starts at 0 or 1 in V1 images regardless of whether user whiteouts +//! are present. diff --git a/crates/composefs/src/lib.rs b/crates/composefs/src/lib.rs index 806f3c8f..468090e3 100644 --- a/crates/composefs/src/lib.rs +++ b/crates/composefs/src/lib.rs @@ -1,8 +1,43 @@ -//! Rust bindings and utilities for working with composefs images and repositories. +//! # composefs: The reliability of disk images, the flexibility of files //! -//! Composefs is a read-only FUSE filesystem that enables efficient sharing -//! of container filesystem layers by using content-addressable storage -//! and fs-verity for integrity verification. +//! composefs combines several Linux kernel features to provide read-only +//! mountable filesystem trees that stack on top of a conventional "lower" +//! filesystem. +//! +//! ## Interfaces +//! +//! 
composefs offers two programmatic interfaces: +//! +//! - **Rust API** — this crate and its siblings (`composefs-oci`, +//! `composefs-boot`, etc.), usable as regular Cargo dependencies. +//! - **Varlink API** — a [varlink](https://varlink.org) RPC interface +//! exposed by `cfsctl varlink` over a Unix socket, accessible from +//! any language. See the [`varlink`] module for examples. +//! +//! Neither interface is declared stable yet. Both may change across +//! releases while the project is under active development. +//! +//! ## Key technologies +//! +//! - **[overlayfs]** — the kernel mount interface that exposes the composed tree +//! - **[EROFS]** — an in-kernel read-only filesystem for the metadata tree +//! (directories, symlinks, permissions, xattrs) with no file data +//! - **[fs-verity]** (optional) — per-file integrity verification on the +//! backing store, validated by overlayfs at access time +//! +//! [overlayfs]: https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt +//! [EROFS]: https://erofs.docs.kernel.org +//! [fs-verity]: https://www.kernel.org/doc/html/next/filesystems/fsverity.html +//! +//! ## Design +//! +//! composefs produces an EROFS image containing *only* metadata. Non-empty +//! data files live in a content-addressed backing store, with +//! `trusted.overlay.redirect` xattrs telling overlayfs where to find them. +//! Identical files across images are stored once on disk and shared in the +//! Linux page cache. +//! +//! See the [`repository_format`] module for the on-disk layout. 
#![forbid(unsafe_code)] // This is a library: emit diagnostics via the `log` crate (or return them), @@ -25,9 +60,16 @@ pub mod splitstream; pub mod tree; pub mod util; +#[cfg(doc)] +pub mod erofs_format; pub mod generic_tree; +#[cfg(doc)] +pub mod repository_format; +#[cfg(doc)] +pub mod splitstream_format; #[cfg(any(test, feature = "test"))] pub mod test; +pub mod varlink; /// Files with this many bytes or fewer are stored inline in the erofs image /// (and in splitstreams). Files above this threshold are written to object diff --git a/crates/composefs/src/repository_format.rs b/crates/composefs/src/repository_format.rs new file mode 100644 index 00000000..4396dede --- /dev/null +++ b/crates/composefs/src/repository_format.rs @@ -0,0 +1,317 @@ +//! # composefs repository design +//! +//! This document describes the current on-disk layout of a composefs repository. +//! +//! At this time, the composefs-rs repository format is not declared stable. +//! +//! ## Location +//! +//! A composefs repository is a directory located anywhere. The location is chosen +//! for the `cfsctl` command as follows: +//! +//! - `--repo` can specify an arbitrary directory +//! +//! - if `--user` is specified (default if the current uid is not 0), then the +//! repository defaults to `~/.var/lib/composefs`. +//! +//! - if `--system` is specified (default if the current uid is 0), then the +//! repository defaults to `/sysroot/composefs`. +//! +//! ## Layout +//! +//! A composefs repository has a layout that looks something like +//! +//! ```text +//! composefs +//! ├── meta.json +//! ├── objects +//! │   ├── 00 +//! │   │   ├── 002183fb91[...] +//! │   │   ├── [...] +//! │   │   └── ff9d7bd692[...] +//! │   ├── 4e +//! │   │   ├── 67eaccd9fd[...] +//! │   │   └── [...] +//! │   ├── 50 +//! │   │   ├── 2b126bca0c[...] +//! │   │   └── [...] +//! │   └── [...] +//! ├── images +//! │   ├── 4e67eaccd9fd[...] -> ../objects/4e/67eaccd9fd[...] +//! │   └── refs +//! 
│   └── some/name -> ../../images/4e67eaccd9fd[...] +//! └── streams +//! ├── 502b126bca0c[...] -> ../objects/50/2b126bca0c[...] +//! └── refs +//! └── some/name.tar -> ../../streams/502b126bca0c[...] +//! ``` +//! +//! ## `meta.json` +//! +//! Added in 0.7.0. This file records repository-level metadata. When present, it is +//! created by `cfsctl init` and contains: +//! +//! - `version` — the base repository format version (currently `1`). Tools +//! must refuse to operate on a repository whose version exceeds what they +//! understand. +//! +//! - `algorithm` — the fs-verity digest algorithm identifier, in the format +//! `fsverity--`. For example `fsverity-sha512-12` +//! means SHA-512 with 4 KiB (2^12) blocks. +//! +//! - `features` (optional) — an object with three arrays of feature-flag +//! strings, following the ext4/XFS/EROFS compatibility model: +//! - `compatible` — old tools can safely ignore these. +//! - `read-only-compatible` — old tools may read but must not write. +//! - `incompatible` — old tools must refuse the repository entirely. +//! +//! The currently defined feature flags are: +//! - `v1_erofs` (read-only-compatible) — present on repositories whose +//! EROFS image format is [V1][crate::erofs_format] (C-tool compatible: +//! compact inodes, BFS ordering, whiteout table). This is the single +//! flag that encodes the EROFS format version: present → V1, absent +//! → V2. Old +//! tools that do not recognise this flag open the repository read-only +//! rather than accidentally writing images in the wrong format. +//! +//! When `meta.json` is present, `cfsctl` auto-detects the hash algorithm and +//! errors if `--hash` is explicitly passed with a conflicting value. When +//! the file is absent (for repositories created before this feature), `--hash` +//! is honored as before and defaults to `sha512`. +//! +//! ### `cfsctl init --erofs-version` +//! +//! The `--erofs-version` flag selects the EROFS format for newly committed +//! images. 
It controls the `v1_erofs` feature flag in `meta.json`: +//! +//! ```text +//! cfsctl init # default: V2 EROFS (composefs-rs native) +//! cfsctl init --erofs-version 1 # V1 EROFS (C-tool compatible) +//! ``` +//! +//! **V2** (the `cfsctl` default) uses extended inodes, DFS ordering, and +//! `composefs_version=2` in the EROFS superblock. This is the composefs-rs native +//! format and is what all repositories created before V1 support was added use. +//! Higher-level tools (e.g. bootc) may configure a repository with multiple format +//! versions (V1 primary + V2 extra) so that images are usable on both RHEL9-era and +//! newer kernels. +//! +//! **V1** uses compact inodes where possible, BFS ordering, and a whiteout stub +//! table, producing output byte-for-byte identical to the C `mkcomposefs` tool. +//! The `v1_erofs` ro-compat flag is written to `meta.json` so that tools which +//! predate V1 support open the repository read-only rather than writing images +//! in the wrong format. +//! +//! Re-initializing an existing repository with a different `--erofs-version` is +//! rejected with an error; the format version is fixed at init time. +//! +//! ## `objects/` +//! +//! This is where the content-addressed data is stored. The immediate children of +//! this directory are 256 subdirectories from `00` to `ff`. Each of those +//! directories contains a number of files with 62-character hexadecimal names. +//! Taken together with the directory in which it resides, each filename represents +//! a 256bit hash value which equals the measured fs-verity digest of that file. +//! fs-verity must be enabled for every file. +//! +//! ## `images/` +//! +//! This is where composefs ([EROFS][crate::erofs_format]) images are accounted for. The images +//! themselves are fs-verity enabled and stored in the object store in the same way +//! as the file data, but the `images/` directory contains symlinks to the images +//! that we know about. 
Each symlink is named for the full 256bit fs-verity digest. +//! +//! Images are tracked in a separate directory because of the security model of +//! filesystems in the Linux kernel. Although it would be feasible for "regular +//! users" to mount an erofs in their own mount namespace, the kernel currently +//! disallows it as a way to avoid allowing non-root users to expose the filesystem +//! code to hostile data. As such, we only mount images that we produced for +//! ourselves (with mkcomposefs), and those are the ones that are linked in this +//! directory. +//! +//! Another way to say it: we must never attempt to mount an arbitrary object: we +//! may only mount via symlinks present in this directory. +//! +//! ## `streams/` +//! +//! This is where [split streams][crate::splitstream] are stored. As with the images, +//! this directory holds symlinks, each named for a 256bit fs-verity digest, that +//! point to data in the object storage. +//! +//! Note: the names in this directory are the fs-verity hashes of the content of +//! the splitstream file, not of the original file. More specifically: if +//! you have a tar file with a specific sha256 digest, and you import it into the +//! repository as a splitstream, the resulting filename in this directory will have +//! no relation to the original content. You can, however, store a reference for +//! it. +//! +//! ## `{images,streams}/refs/` +//! +//! This is where we record which images and streams are currently "requested" by +//! some external user. When importing a tar file, in addition to creating the +//! file in the objects database and the toplevel symlink in the `streams/` +//! directory, we also assign it a name which is chosen by the software which is +//! performing the import. +//! +//! Each ref is a symlink to the top-level entry in `images/` or `streams/`. +//! +//! There are some rough ideas for how we might namespace this. Something like +//! this model is imagined: +//! +//! ```text +//! refs +//! ├── system +//! 
│   └── rootfs +//! │      ├── some_id -> ../../../974d04eaff[...] +//! │      └── [...] +//! ├── 1000 # uid of a user +//! │   ├── flatpak +//! │   │   ├── some_id -> ../../../f8e2bec500[...] +//! │   │   └── [...] +//! │   └── containers +//! │      ├── some_id -> ../../../96a87f8b4b[...] +//! │      └── [...] +//! └── [...] +//! ``` +//! +//! Where the toplevel directories are `system` plus a set of uids. Each `system` +//! or uid subdirectory is namespaced by the particular piece of software that's +//! responsible for storing the given image or stream. +//! +//! The per-user directories will all be owned by root and have 0700 permissions, +//! but each user will be able to access their own uid-numbered subdirectories by +//! way of an acl. The reason that we want the directories owned by root is to +//! prevent users from corrupting the layout of the repository. The reason for the +//! acl is that read-only operations on the repository should be performed +//! directly on the repository and not via some central agent. +//! +//! ## Referring to images and streams +//! +//! Operations that are performed on images or streams (mount, cat, etc.) name the +//! stream in one of two ways: +//! +//! - via the user-chosen name such as `refs/1000/flatpak/some_id` +//! - via the fs-verity digest stored in the toplevel dir +//! +//! ie: the name must either start with the string `refs/`, or must be a +//! hexadecimal string (64 characters for sha256, 128 for sha512). +//! +//! In both cases, the name is a path relative to the `images/` or `streams/` +//! directory and this path contains a symlink (either direct or indirect) to the +//! underlying file in `objects/`. +//! +//! When specified via fs-verity digest, the digest is verified before performing +//! the operation. +//! +//! For example: +//! +//! ```sh +//! cfsctl mount refs/system/rootfs/some_id /mnt # does not check fs-verity +//! cfsctl mount 974d04eaff[...] /mnt # enforces fs-verity +//! ``` +//! +//! 
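The naming rule above (a name either starts with `refs/` or is a hexadecimal digest of 64 or 128 characters) can be sketched as a small classifier. This is only an illustration under those stated rules; `NameKind` and `classify_name` are hypothetical names, not the crate's API, and the real implementation may validate names differently.

```rust
/// Hypothetical classification of a user-supplied image or stream name.
#[derive(Debug, PartialEq)]
enum NameKind {
    /// A user-chosen name under `refs/`; fs-verity is not checked.
    Ref,
    /// A hex fs-verity digest; the digest is verified before use.
    /// 64 hex characters for sha256, 128 for sha512.
    Digest { hex_len: usize },
}

/// Returns `None` when the name is neither a ref nor a well-formed digest.
fn classify_name(name: &str) -> Option<NameKind> {
    if name.starts_with("refs/") {
        return Some(NameKind::Ref);
    }
    // Assumption: any ASCII hex digit is accepted, including upper case.
    let is_hex = !name.is_empty() && name.bytes().all(|b| b.is_ascii_hexdigit());
    match (is_hex, name.len()) {
        (true, 64) | (true, 128) => Some(NameKind::Digest { hex_len: name.len() }),
        _ => None,
    }
}
```

Keeping the two cases distinct makes it explicit which lookups enforce fs-verity and which rely only on the symlink.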
## OCI image storage +//! +//! OCI container images are stored using streams exclusively. Each OCI artifact +//! (manifest, config, layer) becomes a splitstream, and OCI "tags" are refs under +//! `streams/refs/oci/`. +//! +//! ### Naming conventions +//! +//! | OCI artifact | Stream name pattern | Example | +//! |---------------|------------------------------------|------------------------------------| +//! | Manifest | `oci-manifest-{manifest_digest}` | `oci-manifest-sha256:abc123...` | +//! | Config | `oci-config-{config_digest}` | `oci-config-sha256:def456...` | +//! | Layer | `oci-layer-{diff_id}` | `oci-layer-sha256:ghi789...` | +//! | Blob | `oci-blob-{blob_digest}` | `oci-blob-sha256:jkl012...` | +//! +//! Tags are stored under `streams/refs/oci/` with percent-encoding for +//! filesystem safety (`/` → `%2F`): +//! +//! ```text +//! streams/refs/oci/myimage:latest → ../../oci-manifest-sha256:abc123... +//! ``` +//! +//! ### Splitstream reference chains +//! +//! Each splitstream contains `named_refs` (semantic labels mapping to entries +//! in the `stream_refs` array) and `object_refs` (raw objects referenced by +//! the compressed stream data). For OCI images the chain is: +//! +//! **Manifest splitstream** (`oci-manifest-sha256:...`): +//! - `object_refs`: the manifest JSON blob +//! - `named_refs`: +//! - `config:{config_digest}` → config splitstream verity +//! - `{diff_id}` → layer splitstream verity (one per layer) +//! +//! **Config splitstream** (`oci-config-sha256:...`): +//! - `object_refs`: the config JSON blob +//! - `named_refs`: +//! - `{diff_id}` → layer splitstream verity (one per layer) +//! +//! **Layer splitstream** (`oci-layer-sha256:...`): +//! - `object_refs`: file content objects extracted from the tar +//! - `named_refs`: none (leaf node) +//! +//! Both the manifest and config redundantly reference the layers. The GC +//! can reach layers from either path. +//! +//! ### Garbage collection +//! +//! 
The GC walks all refs under `streams/refs/` to find root splitstreams, +//! then transitively follows `named_refs` (by resolving fs-verity IDs +//! through a stream name map) and collects `object_refs`. Any object not +//! reachable from a root is deleted. +//! +//! Concretely, for a tagged container image: +//! +//! 1. Tag `streams/refs/oci/myimage:v1` resolves to `oci-manifest-sha256:abc` +//! 2. Walk the manifest: mark its JSON blob and follow `named_refs` to +//! the config and layer streams +//! 3. Walk the config: mark its JSON blob and follow `named_refs` to layers +//! (already visited, skipped) +//! 4. Walk each layer: mark all file content objects +//! +//! When a tag is removed, the manifest and everything reachable only from it +//! becomes GC-eligible. Layers shared between images survive as long as any +//! referencing manifest remains tagged. +//! +//! ### EROFS image tracking via config splitstream refs +//! +//! When an EROFS image is generated from an OCI image (via +//! `create_filesystem` + `commit_image`), its object ID (fs-verity digest) +//! is stored as a named ref on the config splitstream with the key +//! `composefs.image`. +//! +//! GC walks from tag → manifest → config, and finds the `composefs.image` +//! named ref. The EROFS object ID is added to the live set, keeping the +//! EROFS image alive. The EROFS image still needs an entry under `images/` +//! for the kernel mount security model (see above), but `images/` is not a +//! GC root — the config ref is what keeps the object alive. +//! +//! This means a single OCI tag is sufficient to keep the entire image +//! (manifest, config, layers, and the EROFS image) alive through GC. +//! +//! ### Bootable image variant +//! +//! For bootable images, a second EROFS may be generated after +//! `transform_for_boot` (stripping `/boot`, etc.). This boot EROFS is +//! stored as a second named ref on the config, `composefs.image.boot`. +//! +//! 
Since the config splitstream content changes (new named ref), it gets a +//! new fs-verity digest. This cascades: the manifest must also be +//! rewritten (its `config:` named ref now points to the new config verity), +//! producing a new manifest verity. The tag is re-pointed to the new +//! manifest. The old config and manifest splitstreams become unreferenced +//! and are collected by GC. +//! +//! The result: one tag still keeps everything alive — layers, raw EROFS, +//! and boot EROFS. +//! +//! ### Future: sealed images +//! +//! For sealed/signed images, the EROFS comes pre-built from the registry as +//! part of a composefs OCI artifact (referrer pattern). The artifact +//! splitstream would hold references to the pre-fetched EROFS layers. This +//! is complementary to the unsealed case — both use the same GC mechanism +//! (named refs pointing to EROFS objects). diff --git a/crates/composefs/src/splitstream_format.rs b/crates/composefs/src/splitstream_format.rs new file mode 100644 index 00000000..36a1a17f --- /dev/null +++ b/crates/composefs/src/splitstream_format.rs @@ -0,0 +1,164 @@ +//! # Splitstream +//! +//! Splitstream is a trivial way of storing file formats (like tar) with the "data +//! blocks" stored in the composefs object store with the goal that it's possible +//! to bit-for-bit recreate the entire file. It's something like the idea behind +//! [tar-split](https://github.com/vbatts/tar-split), with some important +//! differences: +//! +//! - it's a binary format +//! +//! - it's based on storing external objects content-addressed in the composefs +//! object store via their fs-verity digest +//! +//! - although it's designed with `tar` files in mind, it's not specific to `tar`, +//! or even to the idea of an archive file: any file format can be stored as a +//! splitstream, and it might make sense to do so for any file format that +//! contains large chunks of embedded data +//! +//! 
- in addition to the ability to split out chunks of file content (like files +//! in a `.tar`) to separate files, it is also possible to refer to external +//! file content, or even other splitstreams, without directly embedding that +//! content in the referrer, which can be useful for cross-document references +//! (such as between OCI manifests, configs, and layers) +//! +//! - the splitstream file itself is stored in the same content-addressed object +//! store by its own fs-verity hash +//! +//! Splitstream compresses inline file content before it is stored to disk using +//! zstd. The main reason for this is that, after removing the actual file data, +//! the remaining `tar` metadata contains a very large amount of padding and empty +//! space and compresses extremely well. +//! +//! Splitstream is conceptually independent from composefs: you could use the +//! format with any content-addressed storage system. +//! +//! ## File format +//! +//! What follows is a non-normative documentation of the file format. The actual +//! definition of the format is "what composefs-rs reads and writes", but this +//! document may be useful to try to understand that format. If you'd like to +//! implement the format, please get in touch. +//! +//! The format is implemented in +//! [crate::splitstream] and +//! the structs from that file are copy-pasted here. Please try to keep things +//! roughly in sync when making changes to either side. +//! +//! All integers are little-endian. In the following `struct` definitions, `U` +//! means 'unsigned little endian' (as per the `zerocopy::little_endian` crate) so +//! `U64` is an unsigned 64bit little-endian integer. +//! +//! ### File ranges ("sections") +//! +//! The file format consists of a fixed-sized header at the start of the file plus +//! a number of sections located at arbitrary locations inside of the file. All of +//! these sections are referred to by a 64-bit `[start..end)` range expressed in +//! 
terms of overall byte offsets within the complete file. +//! +//! ```text +//! struct FileRange { +//! start: U64, +//! end: U64, +//! } +//! ``` +//! +//! ### Header +//! +//! The file starts with a simple fixed-size header. +//! +//! ```text +//! const SPLITSTREAM_MAGIC: [u8; 11] = *b"SplitStream"; +//! +//! struct SplitstreamHeader { +//! pub magic: [u8; 11], // Contains SPLITSTREAM_MAGIC +//! pub version: u8, // must always be 0 +//! pub _flags: U16, // is currently always 0 (but ignored) +//! pub algorithm: u8, // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512) +//! pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k) +//! pub info: FileRange, // can be used to expand/move the info section in the future +//! } +//! ``` +//! +//! In addition to magic values and identifiers for the fs-verity algorithm in use, +//! the header is used to find the location and size of the info section. Future +//! expansions to the file format are imagined to occur by expanding the size of +//! the info section: if the section is larger than expected, the additional bytes +//! will be ignored by the implementation. +//! +//! ### Info section +//! +//! ```text +//! struct SplitstreamInfo { +//! pub stream_refs: FileRange, // location of the stream references array +//! pub object_refs: FileRange, // location of the object references array +//! pub stream: FileRange, // location of the zstd-compressed stream within the file +//! pub named_refs: FileRange, // location of the compressed named references +//! pub content_type: U64, // user can put whatever magic identifier they want there +//! pub stream_size: U64, // total uncompressed size of inline chunks and external chunks +//! } +//! ``` +//! +//! The `content_type` is just an arbitrary identifier that can be used by users of +//! the file format to prevent casual user error when opening a file by its hash +//! value (to prevent showing `.tar` data as if it were json, for example). +//! +//! 
The `stream_size` is the total size of the original file. +//! +//! ### Stream and object refs sections +//! +//! All referred streams and objects in the file are stored as two separate flat +//! uncompressed arrays of binary fs-verity hash values. Each of these arrays is +//! referred to from the info section (via `stream_refs` and `object_refs`). +//! +//! The number of items in the array is determined by the size of the section +//! divided by the size of the fs-verity hash value (determined by the algorithm +//! identifier in the header). +//! +//! The values are not in any particular order, but implementations should produce +//! a deterministic output. For example, the objects reference array produced by +//! the current implementation has the external objects sorted by first-appearance +//! within the stream. +//! +//! The main motivation for storing the references uncompressed, in binary, and in +//! a flat array is to make determining the references contained within a +//! splitstream as simple as possible to improve the efficiency of garbage +//! collection on large repositories. +//! +//! ### The stream +//! +//! The main content of the splitstream is stored in the `stream` section +//! referenced from the info section. The entire section is zstd compressed. +//! +//! Within the compressed stream, the splitstream is formed from a number of +//! "chunks". Each chunk starts with a single 64-bit little endian value. If that +//! number is negative, it refers to an "inline" chunk, and that (absolute) number +//! of bytes of data immediately follow it. If the number is non-negative then it +//! is an index into the object refs array for an "external" chunk. +//! +//! Zero is a non-negative value, so it's an object reference. It's not possible +//! to have a zero-byte inline chunk. This also means that the high/sign bit +//! determines which case (inline vs. external) we have and there are an equal +//! number of both cases. +//! +//! 
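The chunk header rule described above can be sketched as follows. This is a minimal illustration that assumes the 64-bit value is a two's-complement signed integer in little-endian byte order; `ChunkHeader` and `decode_chunk_header` are hypothetical names and not the types used in `crate::splitstream`.

```rust
/// Hypothetical decoded form of one chunk header in the decompressed stream.
#[derive(Debug, PartialEq)]
enum ChunkHeader {
    /// `len` bytes of inline data immediately follow the header.
    Inline { len: u64 },
    /// Index into the object refs array for an external chunk.
    External { index: u64 },
}

/// Decode the 8-byte little-endian header that starts every chunk.
fn decode_chunk_header(bytes: [u8; 8]) -> ChunkHeader {
    let value = i64::from_le_bytes(bytes);
    if value < 0 {
        // Negative: the absolute value is the inline length.
        // `unsigned_abs` also handles `i64::MIN` without overflow.
        ChunkHeader::Inline { len: value.unsigned_abs() }
    } else {
        // Non-negative (including zero): an object reference index.
        ChunkHeader::External { index: value as u64 }
    }
}
```

Since zero decodes as an external reference, an inline chunk always carries at least one byte.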
The stream is reassembled by iterating over the chunks and concatenating the +//! result. For inline chunks, the inline data is taken directly from the +//! splitstream. For external chunks, the content of the external file is used. +//! +//! The stream is over when there are no more chunks. +//! +//! ### Named references +//! +//! It's possible to have named references to other streams. These are stored in +//! the `named_refs` section referred to from the info section. +//! +//! This section is also zstd-compressed, and is a number of nul-terminated text +//! records (including a terminator after the last record). Each record has the +//! form `n:name` where `n` is a non-negative integer index into the stream refs +//! array and `name` is an arbitrary name. The entries are currently sorted by +//! name (by the writer implementation) but the order is not important to the +//! reader. Whether or not this list is "officially" sorted or not may be pinned +//! down at some future point if a need should arise. +//! +//! An example of the decompressed content of the section might be something like +//! `"0:first\01:second\0"`. diff --git a/crates/composefs/src/varlink.rs b/crates/composefs/src/varlink.rs new file mode 100644 index 00000000..af22cfd4 --- /dev/null +++ b/crates/composefs/src/varlink.rs @@ -0,0 +1,160 @@ +//! # Varlink API +//! +//! `cfsctl varlink` exposes a [varlink] RPC service over a Unix socket +//! with two interfaces: +//! +//! - **`org.composefs.Repository`** — repository lifecycle, integrity +//! checks, garbage collection, and mounting +//! - **`org.composefs.Oci`** — OCI container image operations (listing, +//! pulling, inspecting, tagging, mounting) +//! +//! This API is language-agnostic and usable from any varlink client. +//! Like the Rust crate API, it is not yet declared stable. +//! +//! [varlink]: https://varlink.org +//! +//! ## Starting the service +//! +//! ```sh +//! cfsctl varlink --address /run/composefs/composefs.sock +//! 
``` +//! +//! Systemd socket activation is also supported — if `cfsctl varlink` is +//! started with an activated socket, the `--address` flag is not needed. +//! +//! ## Discovering the full API +//! +//! The complete interface definitions — every method, type, and error — +//! are available at runtime via the standard varlink introspection +//! protocol. Use [`varlinkctl`] to dump them: +//! +//! ```sh +//! # List available interfaces +//! varlinkctl list-interfaces /run/composefs/composefs.sock +//! +//! # Full IDL for the Repository interface +//! varlinkctl introspect /run/composefs/composefs.sock \ +//! org.composefs.Repository +//! +//! # Full IDL for the OCI interface +//! varlinkctl introspect /run/composefs/composefs.sock \ +//! org.composefs.Oci +//! ``` +//! +//! For `exec:`-style transports (no long-running socket), `varlinkctl` +//! can launch `cfsctl` as a subprocess: +//! +//! ```sh +//! varlinkctl introspect exec:cfsctl\ varlink org.composefs.Repository +//! ``` +//! +//! [`varlinkctl`]: https://www.freedesktop.org/software/systemd/man/latest/varlinkctl.html +//! +//! ## Session model +//! +//! Repositories are accessed through opaque `u64` handles. A client +//! calls `OpenRepository` to obtain a handle, passes it to every +//! subsequent method, and releases it with `CloseRepository`. No +//! repository is opened at startup. +//! +//! ## Examples +//! +//! The examples below use `varlinkctl call`. Any varlink client works — +//! the wire format is JSON over a Unix socket. +//! +//! ### Open and close a repository +//! +//! ```sh +//! # Open the system repository (/sysroot/composefs) +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.OpenRepository '{"system": true}' +//! # → {"handle": 1} +//! +//! # Open at a specific path +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.OpenRepository \ +//! '{"path": "/srv/composefs"}' +//! # → {"handle": 2} +//! +//! 
# Release a handle when done +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.CloseRepository '{"handle": 1}' +//! ``` +//! +//! ### Check repository integrity +//! +//! ```sh +//! # Full check (verifies fs-verity on every object) +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.Fsck '{"handle": 1}' +//! # → {"ok": true, "has_metadata": true, "objects_checked": 1542, ...} +//! +//! # Fast metadata-only check (skips per-object verification) +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.Fsck \ +//! '{"handle": 1, "metadata_only": true}' +//! ``` +//! +//! ### List and pull OCI images +//! +//! ```sh +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Oci.ListImages '{"handle": 1}' +//! # → {"images": [{"name": "myimage:latest", +//! # "manifest_digest": "sha256:abc...", ...}, ...]} +//! +//! # Pull with streaming progress +//! varlinkctl call --more /run/composefs/composefs.sock \ +//! org.composefs.Oci.Pull '{ +//! "handle": 1, +//! "image": "quay.io/fedora/fedora:latest", +//! "local_fetch": "decompressed", +//! "bootable": false, +//! "more": true +//! }' +//! # Streams progress, then a final "completed" frame +//! ``` +//! +//! ### Inspect, tag, and untag +//! +//! ```sh +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Oci.Inspect \ +//! '{"handle": 1, "image": "myimage:latest"}' +//! # → {"manifest": "{...}", "config": "{...}", ...} +//! +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Oci.Tag '{ +//! "handle": 1, +//! "manifest_digest": "sha256:abc123...", +//! "name": "myimage:v2" +//! }' +//! +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Oci.Untag \ +//! '{"handle": 1, "name": "myimage:old"}' +//! ``` +//! +//! ### Garbage collection +//! +//! ```sh +//! # Dry run +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.Gc \ +//! 
'{"handle": 1, "dry_run": true, "roots": []}' +//! +//! # Collect for real +//! varlinkctl call /run/composefs/composefs.sock \ +//! org.composefs.Repository.Gc \ +//! '{"handle": 1, "dry_run": false, "roots": []}' +//! ``` +//! +//! ### Mounting +//! +//! The `Mount` and `OciMount` methods return a detached mount file +//! descriptor via `SCM_RIGHTS`. The caller attaches it with +//! `move_mount(2)`. For overlay mounts, the caller passes upperdir and +//! workdir fds in the request. +//! +//! These methods require a varlink client that supports fd passing; +//! `varlinkctl` does not currently support this. diff --git a/doc/booting.md b/doc/booting.md deleted file mode 100644 index b958e6cd..00000000 --- a/doc/booting.md +++ /dev/null @@ -1,90 +0,0 @@ -# Booting from a composefs image - -This document describes how composefs-rs sets up the root filesystem during -early boot. It covers the kernel command-line interface, the expected on-disk -layout, kernel requirements, and the step-by-step mount sequence performed by -`composefs-setup-root`. - -The target audience is system integrators and OS developers who are packaging a -bootable system using composefs. Familiarity with Linux mount namespaces, -overlayfs, and fs-verity is assumed. - -## Kernel command-line - -The initramfs code in composefs supports multiple kernel arguments; it -is possible to pre-compute the digest of an image using both e.g. SHA-256 and -SHA-512. On an installed system, the repository only supports one digest -by default today, and the first found will be selected. - -Additionally, it is opt-in to enable v1 EROFS, and again the first compatible -version will be found. 
- -``` -composefs.digest=v1-sha256-12: # V1 EROFS image (preferred; RHEL9-era kernels) -composefs.digest=v1-sha512-12: # V1 EROFS image (SHA-512 variant) -composefs.digest=v2-sha512-12: # V2 EROFS image (explicit form) -composefs= # V2 EROFS image (legacy shorthand) -``` - -The value format is `--:`, where -`` is `v1` or `v2`, `` is `sha256` or `sha512`, and -`` is the log₂ block size (currently always `12`, i.e. 4096 -bytes). This mirrors how `meta.json` encodes the algorithm as -`fsverity-sha256-12`. - -`composefs.digest=` is checked first. Multiple entries may appear on the cmdline -(one per format/algorithm combination); the initramfs tries each in order and -mounts the first image that actually exists in the repository. - -`composefs=` is a legacy shorthand equivalent to -`composefs.digest=v2--12:` — the algorithm is inferred from the -digest length (64 hex chars → SHA-256, 128 → SHA-512). It is checked only when -no `composefs.digest=` token matches. - -**Insecure mode.** Placing `?` immediately after `=` (e.g. -`composefs.digest=?v1-sha256-12:` or `composefs=?`) makes -fs-verity verification optional. The system will boot even when the underlying -filesystem does not support fs-verity or the image has no verity metadata -attached. This mode exists for development and testing only; it must not be used -in production. - -## On-disk layout - -The composefs repository must be present at `/sysroot/composefs` with the -standard layout described in `doc/repository.md`. - -The digest must correspond to a symlink under `images/`. - -Persistent per-deployment state lives at `/sysroot/state/deploy//`, -where `` matches the boot karg digest exactly. The `etc/` and `var/` -subdirectories within that directory serve as the upper layers for the -corresponding overlayfs mounts. 
- -## Kernel requirements - -The following kernel features must be available: - -- **EROFS** filesystem driver (`CONFIG_EROFS_FS`) -- **overlayfs** with `metacopy=on` and `redirect_dir=on` - (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`) -- **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`) -- The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`), - available since kernel 5.2. Kernel ≥ 6.15 is required for the atomic root - replacement path (the default build). On kernels without `fsconfig_set_fd` - support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created - automatically by `composefs::mountcompat`. - -## Kernel argument - -The boot karg (`composefs.digest=` or `composefs=`) is the authoritative selector for which image is booted. -Without the `?` insecure prefix, every file access through the overlayfs is -verified against the object's stored digest by the kernel, combining fs-verity -on the data objects with overlayfs `verity=require`. - -## Other notes - -As a workaround for a GPT auto-root issue in systemd -([systemd#35017](https://github.com/systemd/systemd/issues/35017)), -`composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a -symlink pointing to the real block device before performing any mounts. Failure -to do so is non-fatal and does not abort the boot sequence. diff --git a/doc/erofs.md b/doc/erofs.md deleted file mode 100644 index 2a49b60e..00000000 --- a/doc/erofs.md +++ /dev/null @@ -1,82 +0,0 @@ -# composefs EROFS image format - -composefs images are EROFS filesystem images with composefs-specific extensions. They encode -a directory tree where regular files are stored externally in a content-addressed object store -and referenced by their fs-verity digest. The EROFS image itself carries only metadata: inodes, -directory entries, extended attributes, and chunk index entries that point to the external files. 
- -composefs-rs supports two EROFS format versions. V1 is byte-for-byte compatible with the C -`mkcomposefs` tool. V2 is the composefs-rs native format and drops several V1 constraints -that exist only for C compatibility. - -`cfsctl init` defaults to V2; pass `--erofs-version 1` to select V1. Higher-level tools -such as bootc initialize repositories with multiple formats enabled (V1 primary) so that images -can be booted on RHEL9-era kernels that require the `composefs.digest=` karg. - -## Format V1 - -V1 is selected with `cfsctl init --erofs-version 1`. The `v1_erofs` ro-compat feature flag -is written to `meta.json` so that tools without V1 support open the repository read-only. - -**`composefs_version` field values in V1:** - -- `0` — no user-visible whiteout files (character devices with rdev=0) in the tree -- `1` — at least one user-visible whiteout file is present - -The constant `COMPOSEFS_VERSION_V1` is 0; the field only reaches 1 when user whiteouts are -found. The `--min-version` flag in `mkcomposefs` (mirrored by `mkfs_erofs_v1_min_version`) -forces the value to 1 even when no user whiteouts exist, for forward compatibility. - -**Inode layout:** V1 uses compact inodes (32 bytes) when the file data and inode fit within -the constraints of the compact format, and extended inodes (64 bytes) otherwise. - -**Inode traversal order:** V1 collects inodes in breadth-first order — all entries at one -directory level before descending. - -**Whiteout stub table:** V1 includes 256 synthetic inode entries at the start of the inode -area, one per two-hex-character prefix `00`–`ff`. Each entry is a character-device stub -(chr 0,0) used by the overlay filesystem to resolve whiteout paths against the object store. -V2 omits them entirely. - -**Whiteout escaping:** User-visible whiteout files (chr 0,0) in the tree are not stored as -character devices on disk. Instead they receive a `trusted.overlay.opaque=x` xattr and are -serialized differently. 
The stub entries in the whiteout table are not escaped. - -**`build_time`:** The superblock `build_time` field is set to the minimum mtime across all inodes. - -**xattr sharing:** Xattr entries are deduplicated using a sort key that is the full xattr name (prefix string concatenated with the suffix). - -## Format V2 — Created in composefs-rs - -V2 is the default for repositories created with `cfsctl init` without `--erofs-version 1`. - -**`composefs_version` field:** Always `2` (the constant `COMPOSEFS_VERSION`). - -**Inode layout:** V2 always uses extended inodes (64 bytes). - -**Inode traversal order:** V2 collects inodes in depth-first order — all descendants of a directory before moving to the next sibling. - -**No whiteout stub table:** V2 has no synthetic stub entries; whiteout files are stored directly without escaping. - -**`build_time`:** Always 0. - -**xattr sharing:** Xattr entries are deduplicated using a sort key of (prefix, suffix, value) -rather than the full name string, which can produce a smaller shared xattr area. - -## Selecting the format - -The format is fixed at repository initialization time and cannot be changed afterward. - -``` -cfsctl init # V2 (default) -cfsctl init --erofs-version 1 # V1 (C-tool compatible) -``` - -The format is recorded in `meta.json` as the `v1_erofs` ro-compat feature flag: present -means V1, absent means V2. Tools that do not recognize this flag open the repository -read-only rather than writing images in the wrong format. - -For the standalone `mkcomposefs` tool, the equivalent flag is `--erofs-version`. The -`--min-version` flag (`mkfs_erofs_v1_min_version` in the Rust API) controls whether the -`composefs_version` field starts at 0 or 1 in V1 images regardless of whether user whiteouts -are present. diff --git a/doc/oci.md b/doc/oci.md deleted file mode 100644 index d1f850f4..00000000 --- a/doc/oci.md +++ /dev/null @@ -1,127 +0,0 @@ -# How to create a composefs from an OCI image - -This document is incomplete. 
It only serves to document some decisions we've -taken about how to resolve ambiguous situations. - -# Data precision - -We currently create a composefs image using the granularity of data as -typically appears in OCI tarballs: - - atime and ctime are not present (these are actually not physically present - in the erofs inode structure at all, either the compact or extended forms) - - mtime is set to the mtime in seconds; the sub-seconds value is simply - truncated (ie: we always round down). erofs has an nsec field, but it's not - normally present in OCI tarballs. That's down to the fact that the usual - tar header only has timestamps in seconds and extended headers are not - usually added for this purpose. - - we take great care to faithfully represent hardlinks: even though the - produced filesystem is read-only and we have data de-duplication via the - objects store, we make sure that hardlinks result in an actual shared inode - as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem. - -We apply these precision restrictions also when creating images by scanning the -filesystem. For example: even if we get more-accurate timestamp information, -we'll truncate it to the nearest second. - -# Merging directories - -This is done according to the OCI spec, with an additional clarification: in -case a directory entry is present in multiple layers, we use the tar metadata -from the most-derived layer to determine the attributes (owner, permissions, -mtime) for the directory. - -# The root inode - -The root inode (/) is a difficult case because OCI container layer tars often -don't include a root directory entry, and when they do, container runtimes -(Podman, Docker) ignore it and use hardcoded defaults. For example, Podman's -[containers/storage](https://github.com/containers/storage) uses root:root -ownership, mode `0555`, and epoch (0) mtime when extracting layers, but -Docker uses `0755`. In general, the metadata for `/` is not defined. 
- -Because composefs requires (has a goal of providing) precise cryptographically -verifiable filesystem trees, we solve this for OCI by copying the metadata from `/usr` -to the root directory. The rationale is that `/usr` is always present in -standard filesystem layouts and must be defined explicitly in the OCI layers. - -This is implemented via the `copy_root_metadata_from_usr()` method and the -`read_container_root()` convenience function. - -When building a filesystem from OCI layers programmatically, use -`Stat::uninitialized()` to create the initial `FileSystem`. This placeholder -has mode `0` (obviously invalid) to make it clear that the root metadata should -be set before computing digests - typically by calling -`copy_root_metadata_from_usr()` after processing all layers. - -# Extended attributes (xattrs) - -When reading a container filesystem from a mounted root (as opposed to -processing OCI layer tars directly), host-side xattrs can leak into the -image. This is particularly problematic for `security.selinux` labels: -if SELinux is enabled at build time, files will have labels like -`container_t` that come from the build host, not from the target system's -policy. - -To ensure reproducibility, `read_container_root()` filters xattrs to only -include those in an allowlist. Currently this is just `security.capability`, -which represents actual file capabilities that should be preserved. - -SELinux labels are handled separately by `transform_for_boot()`: - - If the target filesystem contains a SELinux policy (in `/etc/selinux`), - all files are relabeled according to that policy - - If no SELinux policy is found, all `security.selinux` xattrs are stripped - -This ensures that: - - Build-time SELinux labels don't leak into non-SELinux targets - - SELinux-enabled targets get correct labels from their own policy - - Other host xattrs (overlayfs internals, etc.) 
don't pollute the image - -See: https://github.com/containers/storage/pull/1608#issuecomment-1600915185 - -# The /run directory - -When processing OCI images via `create_filesystem()`, the `/run` directory -is emptied if present. This is a tmpfs at runtime and should always be -empty in images. Its mtime is set to match `/usr` for consistency with -how root directory metadata is handled. - -This makes it possible to work around podman/buildah's `RUN --mount` issue where cache -mounts can leave incomplete directory entries in OCI tar layers (directories -without explicit tar entries inherit incorrect mtimes) by pointing all -such mounts into `/run`, and then redirecting from their final location -via e.g. symlinks into `/run`. - -## Container build cache mounts - -A practical implication of emptying `/run` is that container authors can -use it for cache mounts without worrying about polluting the final image. - -Instead of: -```dockerfile -RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ... -``` - -Consider: -```dockerfile -RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf -RUN --mount=type=cache,target=/run/dnfcache dnf install -y ... -``` - -This avoids potential mtime inconsistencies in `/var/cache` while still -benefiting from build caching. - -See: https://github.com/containers/composefs-rs/issues/132 - -# Emptied directories for boot - -When preparing a filesystem for boot via `transform_for_boot()`, certain -additional directories are emptied because their contents should not be -part of the final verified image: - -- `/boot`: Contains the UKI which embeds the composefs digest, so including - it would create a circular dependency -- `/sysroot`: Only has content in ostree-container cases, and traversing - it for SELinux labeling causes problems - -These directories are emptied and their mtime is set to match `/usr` for -consistency with how the root directory metadata is handled. 
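
The xattr allowlist described in the extended attributes section can be sketched as below. This is a minimal illustration and not the code behind `read_container_root()`: the helper name `filter_xattrs`, the constant name, and the use of a `BTreeMap` to hold xattrs are all assumptions.

```rust
use std::collections::BTreeMap;

/// Allowlist as stated in the text: currently only `security.capability`.
const XATTR_ALLOWLIST: &[&str] = &["security.capability"];

/// Hypothetical helper: drop every xattr whose name is not on the allowlist,
/// so that host-side labels such as `security.selinux` do not leak through.
fn filter_xattrs(xattrs: &mut BTreeMap<String, Vec<u8>>) {
    xattrs.retain(|name, _value| XATTR_ALLOWLIST.contains(&name.as_str()));
}
```

SELinux labels are deliberately not on the list, since the text assigns them to a separate relabeling step.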
diff --git a/doc/ostree.md b/doc/ostree.md deleted file mode 100644 index e96b85d7..00000000 --- a/doc/ostree.md +++ /dev/null @@ -1,139 +0,0 @@ -# OSTree - -composefs-rs has support for importing images from OSTree -repositories, by pulling from local or remote OSTree -repositories. These images can then be mounted as composefs images, -sharing disk (deduplication) with other ostree or other types of -images in the composefs repository. - -Native OSTree repositories are a format similar to a composefs -repository, but not quite the same. This means we need some -conversions when handling ostree commits in a composefs repository. - -OSTree images (commits) are fundamentally made up of many small sha256 -content-addressed objects that reference each other. Each commit is -the root of a DAG that defines the total image. Some of the OSTree -objects are metadata like directory permissions, or lists of files in a -directory. These don't really exist in composefs where all metadata is -part of the erofs image. However, some objects are large file objects, -and these are similar to the file objects in composefs -images. However, even these differ, because the checksum defining the -object is made up of both the file content and the file metadata. - -When an OSTree commit is stored in a composefs repo it is stored as a -single splitstream file, named `ostree-commit-$commit_id`, which uses -external object references to all the file content objects that will -be used when creating an erofs image for it. This means OSTree objects -for files that would be inlined in the erofs image will not be -external objects. - -OSTree commit splitstream objects are created during a pull operation -and are used for two things: creating a composefs image by walking the -DAG, and serving as a source of already available OSTree objects during -a pull operation. Such sources are found automatically during pull -(e.g.
parent commit, or old commit for a ref being pulled) or can be -manually specified. - -## File format - -This describes the format of the `ostree-commit-$commit_id` files. - -### Splitstream header - -Since the commit file is a split stream it starts with the splitstream -headers. Of these we use two, the named refs and the object -refs: - - * When an erofs image is created for the commit, it is referenced by - the `composefs.image` named ref. - - * Any external file content objects are in the external_refs - table. The index of the references in this header table is used to - refer to the file in the splitstream itself. - -The splitstream content type used for commits is 0xAFE138C18C463EF1. - -### Splitstream content - -A splitstream is normally a series of internal and external chunks, -but the ostree commit uses only one inline chunk. This chunk is -basically a serialized form of the "objects" directory of an OSTree -repository. I.e. it has a mapping of sha256 to ostree object data. -All objects except file objects are stored in the standard ostree -object format. - -OSTree file objects are stored in the archive-z2 format, except not -compressed, and optionally the file content part of it may be stored -as referencing the index of an external object. The z2 format is, -first an 8-byte header that gives the size (in bytes) of a gvariant, -then comes the gvariant with the file meta in -OSTREE_ZLIB_FILE_HEADER_GVARIANT_FORMAT format, and then the -file/symlink inline data. If an external object is referenced for the -object then it is expected that there is no inline file data. - -The high level view of the file looks like this: -``` -+---------------+ -| Header | -+---------------| -| Object IDs | -+---------------| -| Object Info | -+---------------| -| Content | -+---------------+ -``` - -The Object IDs is a sorted array of sha256 digests, and you would do -lookups in it using a binary search. 
The buckets in the header can be -used to quickly limit the binary search based on the first byte of a -digest. - -Then, at the same index as the binary searched object you can look up -the object info which gives you the offset/length of the object -content data and optionally a reference to an external object. - -The exact form of the data looks like this, packed in order from the -start of the splitstream content. All integers are little-endian. - -### Header -``` -+-----------------------------------+ -| u32: index of commit object | -| u32: flags (currently unused) | -| [u32; 256]: end index of bucket | -+-----------------------------------+ -``` - -The bucket list contains the end index (in the object ids table) of -objects starting with that particular byte, and can be used to quickly -limit the search. We can also compute the total number of objects -(n_objects) by looking in the last bucket. - -### Object ids -``` - n_objects x -+-----------------------------------+ -| [u8; 32] ostree object id | -+-----------------------------------+ -``` - -### Object Info -``` - n_objects x -+-----------------------------------+ -| u32: Offset to per-object data | -| u32: Length of per-object data | -| u32: Index of external object ref | -| or MAXUINT32 if none. | -+-----------------------------------+ -``` - -This is an array of information for each object. Once you have found -the object id in the object ids table, you would look at the same -index in this table to find the information. Offsets to per-object -data are in bytes from the start of the content area, which starts at -the end of the Object Info table. All data chunk references are -aligned to 8 bytes with respect to the start of the content area. -This is useful because GVariants (used by ostree) naturally want -8-byte alignment.
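
The lookup procedure described above (bucket narrowing, then a binary search, then the same index in the info table) can be sketched as follows. This is a reading of the layout in this document, not the composefs-ostree implementation: `lookup_object` and `ObjectInfo` are hypothetical names, `buf` is assumed to be the already-extracted inline chunk, and bucket entries are assumed to be exclusive end indices.

```rust
use std::cmp::Ordering;

const HEADER_LEN: usize = 4 + 4 + 256 * 4; // commit index, flags, bucket ends
const ID_LEN: usize = 32; // sha256 object id
const INFO_LEN: usize = 12; // offset, length, external ref index

/// Per-object record (hypothetical type mirroring the Object Info table).
#[derive(Debug, PartialEq)]
struct ObjectInfo {
    offset: u32, // bytes from the start of the content area
    length: u32,
    external_ref: Option<u32>, // None when the stored value is u32::MAX
}

/// Read a little-endian u32 at `off`, or None if out of bounds.
fn read_u32(buf: &[u8], off: usize) -> Option<u32> {
    let bytes = buf.get(off..off.checked_add(4)?)?;
    let mut raw = [0u8; 4];
    raw.copy_from_slice(bytes);
    Some(u32::from_le_bytes(raw))
}

/// Find `id` in the sorted object id table and return its info record.
fn lookup_object(buf: &[u8], id: &[u8; 32]) -> Option<ObjectInfo> {
    let bucket_end = |b: usize| read_u32(buf, 8 + 4 * b).map(|v| v as usize);
    // The last bucket's end index is the total number of objects.
    let n_objects = bucket_end(255)?;
    let first = id[0] as usize;
    let lo = if first == 0 { 0 } else { bucket_end(first - 1)? };
    let hi = bucket_end(first)?;
    if lo > hi || hi > n_objects {
        return None;
    }
    let ids = buf.get(HEADER_LEN..HEADER_LEN + n_objects * ID_LEN)?;
    // Binary search restricted to the bucket for the id's first byte.
    let (mut a, mut b) = (lo, hi);
    while a < b {
        let mid = a + (b - a) / 2;
        let candidate = &ids[mid * ID_LEN..(mid + 1) * ID_LEN];
        match candidate.cmp(&id[..]) {
            Ordering::Less => a = mid + 1,
            Ordering::Greater => b = mid,
            Ordering::Equal => {
                let info = HEADER_LEN + n_objects * ID_LEN + mid * INFO_LEN;
                let ext = read_u32(buf, info + 8)?;
                return Some(ObjectInfo {
                    offset: read_u32(buf, info)?,
                    length: read_u32(buf, info + 4)?,
                    external_ref: if ext == u32::MAX { None } else { Some(ext) },
                });
            }
        }
    }
    None
}
```

The returned `offset` is relative to the content area, which under this layout begins at `HEADER_LEN + n_objects * (ID_LEN + INFO_LEN)`.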
diff --git a/doc/repository.md b/doc/repository.md deleted file mode 100644 index 023d26cc..00000000 --- a/doc/repository.md +++ /dev/null @@ -1,316 +0,0 @@ -# composefs repository design - -This document describes the current on-disk layout of a composefs repository. - -At this time, the composefs-rs repository format is not declared stable. - -## Location - -A composefs repository is a directory located anywhere. The location is chosen -for the `cfsctl` command as follows: - - - `--repo` can specify an arbitrary directory - - - if `--user` is specified (default if the current uid is not 0), then the - repository defaults to `~/.var/lib/composefs`. - - - if `--system` is specified (default if the current uid is 0), then the - repository defaults to `/sysroot/composefs`. - -## Layout - -A composefs repository has a layout that looks something like - -``` -composefs -├── meta.json -├── objects -│   ├── 00 -│   │   ├── 002183fb91[...] -│   │   ├── [...] -│   │   └── ff9d7bd692[...] -│   ├── 4e -│   │   ├── 67eaccd9fd[...] -│   │   └── [...] -│   ├── 50 -│   │   ├── 2b126bca0c[...] -│   │   └── [...] -│   └── [...] -├── images -│   ├── 4e67eaccd9fd[...] -> ../objects/4e/67eaccd9fd[...] -│   └── refs -│   └── some/name -> ../../images/4e67eaccd9fd[...] -└── streams - ├── 502b126bca0c[...] -> ../objects/50/2b126bca0c[...] - └── refs - └── some/name.tar -> ../../streams/502b126bca0c[...] -``` - -## `meta.json` - -Added in 0.7.0. This file records repository-level metadata. When present, it is -created by `cfsctl init` and contains: - - - `version` — the base repository format version (currently `1`). Tools - must refuse to operate on a repository whose version exceeds what they - understand. - - - `algorithm` — the fs-verity digest algorithm identifier, in the format - `fsverity--`. For example `fsverity-sha512-12` - means SHA-512 with 4 KiB (2^12) blocks. 
- - - `features` (optional) — an object with three arrays of feature-flag - strings, following the ext4/XFS/EROFS compatibility model: - - `compatible` — old tools can safely ignore these. - - `read-only-compatible` — old tools may read but must not write. - - `incompatible` — old tools must refuse the repository entirely. - - The currently defined feature flags are: - - `v1_erofs` (read-only-compatible) — present on repositories whose - EROFS image format is V1 (C-tool compatible: compact inodes, BFS - ordering, whiteout table). This is the single flag that encodes the - EROFS format version: present → V1, absent → V2. Old - tools that do not recognise this flag open the repository read-only - rather than accidentally writing images in the wrong format. - -When `meta.json` is present, `cfsctl` auto-detects the hash algorithm and -errors if `--hash` is explicitly passed with a conflicting value. When -the file is absent (for repositories created before this feature), `--hash` -is honored as before and defaults to `sha512`. - -### `cfsctl init --erofs-version` - -The `--erofs-version` flag selects the EROFS format for newly committed -images. It controls the `v1_erofs` feature flag in `meta.json`: - -``` -cfsctl init # default: V2 EROFS (composefs-rs native) -cfsctl init --erofs-version 1 # V1 EROFS (C-tool compatible) -``` - -**V2** (the `cfsctl` default) uses extended inodes, DFS ordering, and -`composefs_version=2` in the EROFS superblock. This is the composefs-rs native -format and is what all repositories created before V1 support was added use. -Higher-level tools (e.g. bootc) may configure a repository with multiple format -versions (V1 primary + V2 extra) so that images are usable on both RHEL9-era and -newer kernels. - -**V1** uses compact inodes where possible, BFS ordering, and a whiteout stub -table, producing output byte-for-byte identical to the C `mkcomposefs` tool. 
-The `v1_erofs` ro-compat flag is written to `meta.json` so that tools which -predate V1 support open the repository read-only rather than writing images -in the wrong format. - -Re-initializing an existing repository with a different `--erofs-version` is -rejected with an error; the format version is fixed at init time. - -## `objects/` - -This is where the content-addressed data is stored. The immediate children of -this directory are 256 subdirectories from `00` to `ff`. Each of those -directories contains a number of files with 62-character hexadecimal names. -Taken together with the directory in which it resides, each filename represents -a 256-bit hash value which equals the measured fs-verity digest of that file. -fs-verity must be enabled for every file. - -## `images/` - -This is where composefs (erofs) images are accounted for. The images -themselves are fs-verity enabled and stored in the object store in the same way -as the file data, but the `images/` directory contains symlinks to the images -that we know about. Each symlink is named for the full 256-bit fs-verity digest. - -Images are tracked in a separate directory because of the security model of -filesystems in the Linux kernel. Although it would be feasible for "regular -users" to mount an erofs in their own mount namespace, the kernel currently -disallows it as a way to avoid allowing non-root users to expose the filesystem -code to hostile data. As such, we only mount images that we produced for -ourselves (with mkcomposefs), and those are the ones that are linked in this -directory. - -Another way to say it: we must never attempt to mount an arbitrary object: we -may only mount via symlinks present in this directory. - -## `streams/` - -This is where [split streams](splitstream.md) are stored. As for the images, -this directory holds symlinks, each named for a 256-bit digest, which point to data in the object -storage.
- -Note: the names of the hashes in this directory are the fs-verity hashes of the -content of the splitstream file, not the original file. More specifically: if -you have a tar file with a specific sha256 digest, and you import it into the -repository as a splitstream, the resulting filename in this directory will have -no relation to the original content. You can, however, store a reference for -it. - -## `{images,streams}/refs/` - -This is where we record which images and streams are currently "requested" by -some external user. When importing a tar file, in addition to creating the -file in the objects database and the toplevel symlink in the `streams/` -directory, we also assign it a name which is chosen by the software which is -performing the import. - -Each ref is a symlink to the top-level entry in `images/` or `streams/`. - -There are some rough ideas for how we might namespace this. Something like -this model is imagined: - -``` -refs -├── system -│   └── rootfs -│      ├── some_id -> ../../../974d04eaff[...] -│      └── [...] -├── 1000 # uid of a user -│   ├── flatpak -│   │   ├── some_id -> ../../../f8e2bec500[...] -│   │   └── [...] -│   └── containers -│      ├── some_id -> ../../../96a87f8b4b[...] -│      └── [...] -└── [...] -``` - -Where the toplevel directories are `system` plus a set of uids. Each `system` -or uid subdirectory is namespaced by the particular piece of software that's -responsible for storing the given image or stream. - -The per-user directories will all be owned by root and have 0700 permissions, -but each user will be able to access their own uid-numbered subdirectories by -way of an acl. The reason that we want the directories owned by root is to -prevent users from corrupting the layout of the repository. The reason for the -acl is that read-only operations on the repository should be performed -directly on the repository and not via some central agent. 
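
As a small illustration of the `objects/` layout described earlier, the mapping from a hex digest to its path inside the repository can be sketched like this. The helper name `object_path` is hypothetical, and accepting 128-character digests (for sha512) alongside the 64-character case is an assumption based on the digest lengths mentioned elsewhere in this document.

```rust
use std::path::PathBuf;

/// Hypothetical helper: map a hex fs-verity digest to its repository-relative
/// path, using the first two hex characters as the subdirectory name.
fn object_path(hex_digest: &str) -> Option<PathBuf> {
    // Assumption: 64 hex chars for sha256, 128 for sha512.
    let valid_len = hex_digest.len() == 64 || hex_digest.len() == 128;
    if !valid_len || !hex_digest.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let (dir, file) = hex_digest.split_at(2);
    Some(PathBuf::from("objects").join(dir).join(file))
}
```

For a sha256 digest this yields a two-character directory and a 62-character file name, matching the layout shown in the tree above.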
- -## Referring to images and streams - -Operations that are performed on images or streams (mount, cat, etc.) name the -stream in one of two ways: - - - via the user-chosen name such as `refs/1000/flatpak/some_id` - - via the fs-verity digest stored in the toplevel dir - -ie: the name must either start with the string `refs/`, or must be a -hexadecimal string (64 characters for sha256, 128 for sha512). - -In both cases, the name is a path relative to the `images/` or `streams/` -directory and this path contains a symlink (either direct or indirect) to the -underlying file in `objects/`. - -When specified via fs-verity digest, the digest is verified before performing -the operation. - -For example: - -```sh -cfsctl mount refs/system/rootfs/some_id /mnt # does not check fs-verity -cfsctl mount 974d04eaff[...] /mnt # enforces fs-verity -``` - -## OCI image storage - -OCI container images are stored using streams exclusively. Each OCI artifact -(manifest, config, layer) becomes a splitstream, and OCI "tags" are refs under -`streams/refs/oci/`. - -### Naming conventions - -| OCI artifact | Stream name pattern | Example | -|---------------|------------------------------------|------------------------------------| -| Manifest | `oci-manifest-{manifest_digest}` | `oci-manifest-sha256:abc123...` | -| Config | `oci-config-{config_digest}` | `oci-config-sha256:def456...` | -| Layer | `oci-layer-{diff_id}` | `oci-layer-sha256:ghi789...` | -| Blob | `oci-blob-{blob_digest}` | `oci-blob-sha256:jkl012...` | - -Tags are stored under `streams/refs/oci/` with percent-encoding for -filesystem safety (`/` → `%2F`): - -``` -streams/refs/oci/myimage:latest → ../../oci-manifest-sha256:abc123... -``` - -### Splitstream reference chains - -Each splitstream contains `named_refs` (semantic labels mapping to entries -in the `stream_refs` array) and `object_refs` (raw objects referenced by -the compressed stream data). 
For OCI images the chain is: - -**Manifest splitstream** (`oci-manifest-sha256:...`): - - `object_refs`: the manifest JSON blob - - `named_refs`: - - `config:{config_digest}` → config splitstream verity - - `{diff_id}` → layer splitstream verity (one per layer) - -**Config splitstream** (`oci-config-sha256:...`): - - `object_refs`: the config JSON blob - - `named_refs`: - - `{diff_id}` → layer splitstream verity (one per layer) - -**Layer splitstream** (`oci-layer-sha256:...`): - - `object_refs`: file content objects extracted from the tar - - `named_refs`: none (leaf node) - -Both the manifest and config redundantly reference the layers. The GC -can reach layers from either path. - -### Garbage collection - -The GC walks all refs under `streams/refs/` to find root splitstreams, -then transitively follows `named_refs` (by resolving fs-verity IDs -through a stream name map) and collects `object_refs`. Any object not -reachable from a root is deleted. - -Concretely, for a tagged container image: - - 1. Tag `streams/refs/oci/myimage:v1` resolves to `oci-manifest-sha256:abc` - 2. Walk the manifest: mark its JSON blob and follow `named_refs` to - the config and layer streams - 3. Walk the config: mark its JSON blob and follow `named_refs` to layers - (already visited, skipped) - 4. Walk each layer: mark all file content objects - -When a tag is removed, the manifest and everything reachable only from it -becomes GC-eligible. Layers shared between images survive as long as any -referencing manifest remains tagged. - -### EROFS image tracking via config splitstream refs - -When an EROFS image is generated from an OCI image (via -`create_filesystem` + `commit_image`), its object ID (fs-verity digest) -is stored as a named ref on the config splitstream with the key -`composefs.image`. - -GC walks from tag → manifest → config, and finds the `composefs.image` -named ref. The EROFS object ID is added to the live set, keeping the -EROFS image alive. 
The EROFS image still needs an entry under `images/` -for the kernel mount security model (see above), but `images/` is not a -GC root — the config ref is what keeps the object alive. - -This means a single OCI tag is sufficient to keep the entire image -(manifest, config, layers, and the EROFS image) alive through GC. - -### Bootable image variant - -For bootable images, a second EROFS may be generated after -`transform_for_boot` (stripping `/boot`, etc.). This boot EROFS is -stored as a second named ref on the config, `composefs.image.boot`. - -Since the config splitstream content changes (new named ref), it gets a -new fs-verity digest. This cascades: the manifest must also be -rewritten (its `config:` named ref now points to the new config verity), -producing a new manifest verity. The tag is re-pointed to the new -manifest. The old config and manifest splitstreams become unreferenced -and are collected by GC. - -The result: one tag still keeps everything alive — layers, raw EROFS, -and boot EROFS. - -### Future: sealed images - -For sealed/signed images, the EROFS comes pre-built from the registry as -part of a composefs OCI artifact (referrer pattern). The artifact -splitstream would hold references to the pre-fetched EROFS layers. This -is complementary to the unsealed case — both use the same GC mechanism -(named refs pointing to EROFS objects). diff --git a/doc/splitstream.md b/doc/splitstream.md deleted file mode 100644 index 787d1ec9..00000000 --- a/doc/splitstream.md +++ /dev/null @@ -1,164 +0,0 @@ -# Splitstream - -Splitstream is a trivial way of storing file formats (like tar) with the "data -blocks" stored in the composefs object store with the goal that it's possible -to bit-for-bit recreate the entire file. 
It's something like the idea behind -[tar-split](https://github.com/vbatts/tar-split), with some important -differences: - - - it's a binary format - - - it's based on storing external objects content-addressed in the composefs - object store via their fs-verity digest - - - although it's designed with `tar` files in mind, it's not specific to `tar`, - or even to the idea of an archive file: any file format can be stored as a - splitstream, and it might make sense to do so for any file format that - contains large chunks of embedded data - - - in addition to the ability to split out chunks of file content (like files - in a `.tar`) to separate files, it is also possible to refer to external - file content, or even other splitstreams, without directly embedding that - content in the referrer, which can be useful for cross-document references - (such as between OCI manifests, configs, and layers) - - - the splitstream file itself is stored in the same content-addressed object - store by its own fs-verity hash - -Splitstream compresses inline file content before it is stored to disk using -zstd. The main reason for this is that, after removing the actual file data, -the remaining `tar` metadata contains a very large amount of padding and empty -space and compresses extremely well. - -Splitstream is conceptually independent from composefs: you could use the -format with any content-addressed storage system. - -## File format - -What follows is a non-normative documentation of the file format. The actual -definition of the format is "what composefs-rs reads and writes", but this -document may be useful to try to understand that format. If you'd like to -implement the format, please get in touch. - -The format is implemented in -[crates/composefs/src/splitstream.rs](crates/composefs/src/splitstream.rs) and -the structs from that file are copy-pasted here. Please try to keep things -roughly in sync when making changes to either side. - -All integers are little-endian. 
In the following `struct` definitions, `U` -means 'unsigned little endian' (as per the `zerocopy::little_endian` crate) so -`U64` is an unsigned 64bit little-endian integer. - -### File ranges ("sections") - -The file format consists of a fixed-sized header at the start of the file plus -a number of sections located at arbitrary locations inside of the file. All of -these sections are referred to by a 64-bit `[start..end)` range expressed in -terms of overall byte offsets within the complete file. - -```rust -struct FileRange { - start: U64, - end: U64, -} -``` - -### Header - -The file starts with a simple fixed-size header. - -```rust -const SPLITSTREAM_MAGIC: [u8; 11] = *b"SplitStream"; - -struct SplitstreamHeader { - pub magic: [u8; 11], // Contains SPLITSTREAM_MAGIC - pub version: u8, // must always be 0 - pub _flags: U16, // is currently always 0 (but ignored) - pub algorithm: u8, // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512) - pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k) - pub info: FileRange, // can be used to expand/move the info section in the future -} -``` - -In addition to magic values and identifiers for the fs-verity algorithm in use, -the header is used to find the location and size of the info section. Future -expansions to the file format are imagined to occur by expanding the size of -the info section: if the section is larger than expected, the additional bytes -will be ignored by the implementation. 
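
Reading the fixed-size header described above can be sketched without any external crates. This is an illustration only: `Header`, `parse_header` and `digest_len` are hypothetical names, and the byte offsets assume the struct is laid out without padding (11 + 1 + 2 + 1 + 1 + 16 = 32 bytes), which is an inference from the field list rather than something this document states.

```rust
const SPLITSTREAM_MAGIC: &[u8; 11] = b"SplitStream";
const HEADER_LEN: usize = 32; // assumed packed size of SplitstreamHeader

/// The header fields a reader typically needs (hypothetical type).
#[derive(Debug, PartialEq)]
struct Header {
    algorithm: u8,    // 1 = sha256, 2 = sha512
    lg_blocksize: u8, // log2 of the fs-verity block size
    info_start: u64,  // start of the info section (byte offset in file)
    info_end: u64,    // end of the info section (exclusive)
}

/// Read a little-endian u64 at `off`; the caller guarantees the bounds.
fn u64_at(buf: &[u8], off: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&buf[off..off + 8]);
    u64::from_le_bytes(raw)
}

fn parse_header(file: &[u8]) -> Result<Header, &'static str> {
    let h = file.get(..HEADER_LEN).ok_or("file shorter than header")?;
    if h[..11] != SPLITSTREAM_MAGIC[..] {
        return Err("bad magic");
    }
    if h[11] != 0 {
        return Err("unsupported version");
    }
    // Bytes 12..14 hold `_flags`, which are currently ignored.
    let (info_start, info_end) = (u64_at(h, 16), u64_at(h, 24));
    if info_start > info_end || info_end > file.len() as u64 {
        return Err("info range out of bounds");
    }
    Ok(Header { algorithm: h[14], lg_blocksize: h[15], info_start, info_end })
}

/// Size in bytes of one fs-verity digest for the given algorithm identifier.
fn digest_len(algorithm: u8) -> Option<usize> {
    match algorithm {
        1 => Some(32), // sha256
        2 => Some(64), // sha512
        _ => None,
    }
}
```

A reader that finds the info section longer than it expects would, per the text, simply ignore the trailing bytes.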
-
-### Info section
-
-```rust
-struct SplitstreamInfo {
-    pub stream_refs: FileRange, // location of the stream references array
-    pub object_refs: FileRange, // location of the object references array
-    pub stream: FileRange,      // location of the zstd-compressed stream within the file
-    pub named_refs: FileRange,  // location of the compressed named references
-    pub content_type: U64,      // user can put whatever magic identifier they want there
-    pub stream_size: U64,       // total uncompressed size of inline chunks and external chunks
-}
-```
-
-The `content_type` is just an arbitrary identifier that can be used by users of
-the file format to prevent casual user error when opening a file by its hash
-value (to prevent showing `.tar` data as if it were json, for example).
-
-The `stream_size` is the total size of the original file.
-
-### Stream and object refs sections
-
-All referred streams and objects in the file are stored as two separate flat
-uncompressed arrays of binary fs-verity hash values. Each of these arrays is
-referred to from the info section (via `stream_refs` and `object_refs`).
-
-The number of items in the array is determined by the size of the section
-divided by the size of the fs-verity hash value (determined by the algorithm
-identifier in the header).
-
-The values are not in any particular order, but implementations should produce
-a deterministic output. For example, the objects reference array produced by
-the current implementation has the external objects sorted by first-appearance
-within the stream.
-
-The main motivation for storing the references uncompressed, in binary, and in
-a flat array is to make determining the references contained within a
-splitstream as simple as possible to improve the efficiency of garbage
-collection on large repositories.
-
-### The stream
-
-The main content of the splitstream is stored in the `stream` section
-referenced from the info section. The entire section is zstd compressed.
-
-Within the compressed stream, the splitstream is formed from a number of
-"chunks". Each chunk starts with a single 64-bit little endian value. If that
-number is negative, it refers to an "inline" chunk, and that (absolute) number
-of bytes of data immediately follow it. If the number is non-negative then it
-is an index into the object refs array for an "external" chunk.
-
-Zero is a non-negative value, so it's an object reference. It's not possible
-to have a zero-byte inline chunk. This also means that the high/sign bit
-determines which case (inline vs. external) we have and there are an equal
-number of both cases.
-
-The stream is reassembled by iterating over the chunks and concatenating the
-result. For inline chunks, the inline data is taken directly from the
-splitstream. For external chunks, the content of the external file is used.
-
-The stream is over when there are no more chunks.
-
-### Named references
-
-It's possible to have named references to other streams. These are stored in
-the `named_refs` section referred to from the info section.
-
-This section is also zstd-compressed, and is a number of nul-terminated text
-records (including a terminator after the last record). Each record has the
-form `n:name` where `n` is a non-negative integer index into the stream refs
-array and `name` is an arbitrary name. The entries are currently sorted by
-name (by the writer implementation) but the order is not important to the
-reader. Whether or not this list is "officially" sorted or not may be pinned
-down at some future point if a need should arise.
-
-An example of the decompressed content of the section might be something like
-`"0:first\01:second\0"`.
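
For readers of this patch, the chunk framing and named-reference records described in the removed `doc/splitstream.md` can be illustrated roughly as follows. This is a minimal sketch and not the composefs-rs implementation: it assumes both sections have already been zstd-decompressed, that the 64-bit chunk header is read as a signed value, and that names are UTF-8. `Chunk`, `parse_chunks`, and `parse_named_refs` are hypothetical names introduced only for this example.

```rust
/// One chunk of an already-decompressed `stream` section (hypothetical type).
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    /// Literal bytes stored directly in the stream.
    Inline(&'a [u8]),
    /// Index into the object refs array.
    External(u64),
}

/// Split a decompressed stream into chunks, following the rule that a
/// negative header means "inline, |n| bytes follow" and a non-negative
/// header is an index into the object refs array.
fn parse_chunks(mut data: &[u8]) -> Result<Vec<Chunk<'_>>, String> {
    let mut chunks = Vec::new();
    while !data.is_empty() {
        if data.len() < 8 {
            return Err("truncated chunk header".into());
        }
        let (head, rest) = data.split_at(8);
        let mut buf = [0u8; 8];
        buf.copy_from_slice(head);
        let value = i64::from_le_bytes(buf);
        if value < 0 {
            // Inline chunk: the absolute value is the byte count.
            let len = value.unsigned_abs();
            if (rest.len() as u64) < len {
                return Err("truncated inline chunk".into());
            }
            let (body, tail) = rest.split_at(len as usize);
            chunks.push(Chunk::Inline(body));
            data = tail;
        } else {
            // External chunk: zero is a valid object reference.
            chunks.push(Chunk::External(value as u64));
            data = rest;
        }
    }
    Ok(chunks)
}

/// Parse decompressed `named_refs` content: nul-terminated `n:name` records.
fn parse_named_refs(data: &[u8]) -> Result<Vec<(usize, String)>, String> {
    let mut refs = Vec::new();
    let mut rest = data;
    while !rest.is_empty() {
        let end = rest
            .iter()
            .position(|&b| b == 0)
            .ok_or("record missing nul terminator")?;
        // Assumption: names are UTF-8 text.
        let record = std::str::from_utf8(&rest[..end]).map_err(|e| e.to_string())?;
        let (index, name) = record.split_once(':').ok_or("record missing ':'")?;
        let index = index.parse::<usize>().map_err(|e| e.to_string())?;
        refs.push((index, name.to_string()));
        rest = &rest[end + 1..];
    }
    Ok(refs)
}
```

Reassembling the original stream would then mean concatenating each `Inline` payload with the contents of the object that each `External` index selects from the object refs array; looking up those objects is left out here because it depends on the repository.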