diff --git a/README.md b/README.md
index 6121f789..55ed5112 100644
--- a/README.md
+++ b/README.md
@@ -34,76 +34,32 @@ and Linux kernel integration, but with the *flexibility* of files
 for content — avoiding doubled disk usage, partition table management,
 and similar headaches.
 
-### Separation between metadata and data
-
-A key aspect of composefs is its separation of "data" (non-empty regular
-files) from "metadata" (everything else: directories, symlinks, permissions,
-ownership, etc.).
-
-composefs produces an [EROFS](https://erofs.docs.kernel.org) filesystem
-image that contains only metadata. The non-empty data files live in a
-separate "backing store" directory. The EROFS image includes
-`trusted.overlay.redirect` extended attributes that tell the overlayfs
-mount how to find the real underlying files.
-
-### Shared backing store
-
-The primary use case for composefs is versioned, immutable filesystem
-trees — container images and bootable host systems — where multiple
-images may share parts of their storage.
-
-By storing files content-addressed (named by the hash of their content),
-shared files need to be stored only once on disk yet can appear in
-multiple mounts. Crucially, these data files are also shared in the
-[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache),
-allowing multiple running container images to reliably share memory.
-
-### Filesystem integrity
-
-composefs supports [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
-validation of content files. The digest of each content file is stored
-in the EROFS image via `trusted.overlay.metacopy` extended attributes,
-which overlayfs validates when the file is accessed. This means backing
-content cannot be changed (by mistake or by malice) without detection.
-
-You can also enable fs-verity on the image file itself and pass the expected
-digest as a mount option. This provides full trust of both data and metadata,
-solving a weakness of fs-verity alone (which can only verify file data,
-not metadata like permissions, ownership, or directory structure).
+composefs separates metadata (directories, permissions, xattrs) from data
+(file content). An EROFS image carries only the metadata; data files live in
+a content-addressed backing store, shared across images and in the Linux
+[page cache](https://static.lwn.net/kerneldoc/admin-guide/mm/concepts.html#page-cache).
+Optional [fs-verity](https://www.kernel.org/doc/html/latest/filesystems/fsverity.html)
+provides end-to-end integrity verification of both data and metadata.
+For design details, see the [crate documentation](https://docs.rs/composefs).
 
 ## Use cases
 
 ### Container images
 
-For [OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
-container images, a common approach (used by both Docker and Podman) is
-to untar each layer separately and use overlayfs to stitch them together.
-composefs improves on this by storing file content in a content-addressed
-fashion, allowing sharing between images even when metadata like
-timestamps or ownership differs.
-
-Combined with approaches like
-[zstd:chunked](https://github.com/containers/storage/pull/775),
-this speeds up pulling container images and avoids redundantly
-creating files that are already present.
+composefs improves on the traditional per-layer overlayfs model for
+[OCI](https://github.com/opencontainers/image-spec/blob/main/spec.md)
+container images by storing file content in a content-addressed store,
+enabling sharing between images and faster pulls via
+[zstd:chunked](https://github.com/containers/storage/pull/775).
 
 ### Bootable host systems
 
-Anywhere one wants versioned immutable filesystem trees ("images"),
-composefs provides compelling advantages. In particular, this project
-aims to be the successor to [OSTree](https://github.com/ostreedev/ostree/).
-
-OSTree uses a content-addressed object store, but traditionally checks out
-into a regular directory (using hardlinks), which is then bind-mounted as
-the rootfs. While OSTree supports enabling fs-verity on files in the store,
-nothing protects the checkout directories from modification.
-
-composefs replaces this checkout with a directly-mountable image pointing
-into the object store. We can enable fs-verity on the composefs image and
-embed its digest in the kernel commandline or a Unified Kernel Image (UKI).
-Since composefs generation is reproducible, we can verify the generated
-image is correct by comparing its digest to one in the metadata produced
-at build time. For more on this, see [this tracking issue](https://github.com/ostreedev/ostree/issues/2867).
+composefs aims to succeed [OSTree](https://github.com/ostreedev/ostree/)
+by replacing hardlink checkouts with directly-mountable images backed by a
+shared object store. Combined with fs-verity and a digest embedded in the
+kernel command line or a UKI, this provides cryptographic verification of
+the entire filesystem tree. See [this tracking issue](https://github.com/ostreedev/ostree/issues/2867)
+for background.
 
 ## Components
 
@@ -147,9 +103,7 @@ helper that supports `mount -t composefs` syntax directly.
 
 ## Documentation
 
- - [Repository format](doc/repository.md)
- - [OCI integration](doc/oci.md)
- - [Splitstream format](doc/splitstream.md)
+ - [API and design documentation](https://docs.rs/composefs)
  - [Examples README](examples/README.md)
 
 ## Status
diff --git a/crates/composefs-boot/src/design.rs b/crates/composefs-boot/src/design.rs
new file mode 100644
index 00000000..70119a47
--- /dev/null
+++ b/crates/composefs-boot/src/design.rs
@@ -0,0 +1,90 @@
+//! # Booting from a composefs image
+//!
+//! This document describes how composefs-rs sets up the root filesystem during
+//! early boot. It covers the kernel command-line interface, the expected on-disk
+//! layout, kernel requirements, and the step-by-step mount sequence performed by
+//! `composefs-setup-root`.
+//!
+//! The target audience is system integrators and OS developers who are packaging a
+//! bootable system using composefs. Familiarity with Linux mount namespaces,
+//! overlayfs, and fs-verity is assumed.
+//!
+//! ## Kernel command-line
+//!
+//! The initramfs code in composefs supports multiple kernel arguments, so an
+//! image's digest can be pre-computed with more than one algorithm (e.g. both
+//! SHA-256 and SHA-512). On an installed system, the repository supports only one
+//! digest algorithm by default today, and the first one found is selected.
+//!
+//! Additionally, V1 EROFS support is opt-in; again, the first compatible version
+//! found is used.
+//!
+//! ```text
+//! composefs.digest=v1-sha256-12:<digest>   # V1 EROFS image (preferred; RHEL9-era kernels)
+//! composefs.digest=v1-sha512-12:<digest>   # V1 EROFS image (SHA-512 variant)
+//! composefs.digest=v2-sha512-12:<digest>   # V2 EROFS image (explicit form)
+//! composefs=<digest>                       # V2 EROFS image (legacy shorthand)
+//! ```
+//!
+//! The value format is `<version>-<hash>-<lg_blocksize>:<hex_digest>`, where
+//! `<version>` is `v1` or `v2`, `<hash>` is `sha256` or `sha512`, and
+//! `<lg_blocksize>` is the log2 block size (currently always `12`, i.e. 4096
+//! bytes). This mirrors how `meta.json` encodes the algorithm as
+//! `fsverity-sha256-12`.
+//!
+//! `composefs.digest=` is checked first. Multiple entries may appear on the
+//! command line (one per format/algorithm combination); the initramfs tries each
+//! in order and mounts the first image that actually exists in the repository.
+//!
+//! `composefs=<digest>` is a legacy shorthand equivalent to
+//! `composefs.digest=v2-<hash>-12:<digest>` -- the algorithm is inferred from the
+//! digest length (64 hex chars -> SHA-256, 128 -> SHA-512). It is checked only when
+//! no `composefs.digest=` token matches.
+//!
+//! **Insecure mode.** Placing `?` immediately after `=` (e.g.
+//! `composefs.digest=?v1-sha256-12:<digest>` or `composefs=?<digest>`) makes
+//! fs-verity verification optional. The system will boot even when the underlying
+//! filesystem does not support fs-verity or the image has no verity metadata
+//! attached. This mode exists for development and testing only; it must not be used
+//! in production.
+//!
+//! ## On-disk layout
+//!
+//! The composefs repository must be present at `/sysroot/composefs` with the
+//! standard layout described in the `composefs::repository_format` module.
+//!
+//! The digest must correspond to a symlink under `images/`.
+//!
+//! Persistent per-deployment state lives at `/sysroot/state/deploy/<digest>/`,
+//! where `<digest>` matches the boot karg digest exactly. The `etc/` and `var/`
+//! subdirectories within that directory serve as the upper layers for the
+//! corresponding overlayfs mounts.
+//!
+//! ## Kernel requirements
+//!
+//! The following kernel features must be available:
+//!
+//! - **EROFS** filesystem driver (`CONFIG_EROFS_FS`)
+//! - **overlayfs** with `metacopy=on` and `redirect_dir=on`
+//!   (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`)
+//! - **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`)
+//! - The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`),
+//!   available since kernel 5.2. Kernel >= 6.15 is required for the atomic root
+//!   replacement path (the default build). On kernels without `fsconfig_set_fd`
+//!   support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created
+//!   automatically by `composefs::mountcompat`.
+//!
+//! ## Kernel argument
+//!
+//! The boot karg (`composefs.digest=` or `composefs=`) is the authoritative
+//! selector for which image is booted. Without the `?` insecure prefix, the kernel
+//! verifies every file access through the overlayfs against the object's stored
+//! digest, combining fs-verity on the data objects with overlayfs `verity=require`.
+//!
+//! ## Other notes
+//!
+//! As a workaround for a GPT auto-root issue in systemd
+//! ([systemd#35017](https://github.com/systemd/systemd/issues/35017)),
+//! `composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a
+//! symlink pointing to the real block device before performing any mounts. Failure
+//! to do so is non-fatal and does not abort the boot sequence.
diff --git a/crates/composefs-boot/src/lib.rs b/crates/composefs-boot/src/lib.rs
index 11e5cc33..db57cc6c 100644
--- a/crates/composefs-boot/src/lib.rs
+++ b/crates/composefs-boot/src/lib.rs
@@ -15,6 +15,9 @@ pub mod selabel;
 pub mod uki;
 pub mod write_boot;
 
+#[cfg(doc)]
+pub mod design;
+
 use std::ffi::OsStr;
 
 use anyhow::Result;
diff --git a/crates/composefs-oci/src/design.rs b/crates/composefs-oci/src/design.rs
new file mode 100644
index 00000000..30be8842
--- /dev/null
+++ b/crates/composefs-oci/src/design.rs
@@ -0,0 +1,127 @@
+//! # How to create a composefs from an OCI image
+//!
+//! This document is incomplete.  It only serves to document some decisions we've
+//! taken about how to resolve ambiguous situations.
+//!
+//! # Data precision
+//!
+//! We currently create a composefs image using the granularity of data as it
+//! typically appears in OCI tarballs:
+//!  - atime and ctime are not present (these are actually not physically present
+//!    in the erofs inode structure at all, either the compact or extended forms)
+//!  - mtime is set to the mtime in seconds; the sub-seconds value is simply
+//!    truncated (i.e. we always round down).  erofs has an nsec field, but it's not
+//!    normally present in OCI tarballs.  That's down to the fact that the usual
+//!    tar header only has timestamps in seconds and extended headers are not
+//!    usually added for this purpose.
+//!  - we take great care to faithfully represent hardlinks: even though the
+//!    produced filesystem is read-only and we have data de-duplication via the
+//!    objects store, we make sure that hardlinks result in an actual shared inode
+//!    as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem.
+//!
+//! We also apply these precision restrictions when creating images by scanning the
+//! filesystem.  For example: even if we get more precise timestamp information,
+//! we'll truncate it to whole seconds.
+//!
+//! # Merging directories
+//!
+//! This is done according to the OCI spec, with an additional clarification: in
+//! case a directory entry is present in multiple layers, we use the tar metadata
+//! from the most-derived layer to determine the attributes (owner, permissions,
+//! mtime) for the directory.
+//!
+//! # The root inode
+//!
+//! The root inode (/) is a difficult case because OCI container layer tars often
+//! don't include a root directory entry, and when they do, container runtimes
+//! (Podman, Docker) ignore it and use hardcoded defaults.  For example, Podman's
+//! [containers/storage](https://github.com/containers/storage) uses root:root
+//! ownership, mode `0555`, and epoch (0) mtime when extracting layers, but
+//! Docker uses `0755`. In general, the metadata for `/` is not defined.
+//!
+//! Because composefs aims to provide precise, cryptographically verifiable
+//! filesystem trees, we solve this for OCI by copying the metadata from `/usr`
+//! to the root directory.  The rationale is that `/usr` is always present in
+//! standard filesystem layouts and must be defined explicitly in the OCI layers.
+//!
+//! This is implemented via the `copy_root_metadata_from_usr()` method and the
+//! `read_container_root()` convenience function.
+//!
+//! When building a filesystem from OCI layers programmatically, use
+//! `Stat::uninitialized()` to create the initial `FileSystem`.  This placeholder
+//! has mode `0` (obviously invalid) to make it clear that the root metadata should
+//! be set before computing digests - typically by calling
+//! `copy_root_metadata_from_usr()` after processing all layers.
+//!
+//! # Extended attributes (xattrs)
+//!
+//! When reading a container filesystem from a mounted root (as opposed to
+//! processing OCI layer tars directly), host-side xattrs can leak into the
+//! image.  This is particularly problematic for `security.selinux` labels:
+//! if SELinux is enabled at build time, files will have labels like
+//! `container_t` that come from the build host, not from the target system's
+//! policy.
+//!
+//! To ensure reproducibility, `read_container_root()` filters xattrs to only
+//! include those in an allowlist.  Currently this is just `security.capability`,
+//! which represents actual file capabilities that should be preserved.
+//!
+//! SELinux labels are handled separately by `transform_for_boot()`:
+//!  - If the target filesystem contains a SELinux policy (in `/etc/selinux`),
+//!    all files are relabeled according to that policy
+//!  - If no SELinux policy is found, all `security.selinux` xattrs are stripped
+//!
+//! This ensures that:
+//!  - Build-time SELinux labels don't leak into non-SELinux targets
+//!  - SELinux-enabled targets get correct labels from their own policy
+//!  - Other host xattrs (overlayfs internals, etc.) don't pollute the image
+//!
+//! See: <https://github.com/containers/storage/pull/1608#issuecomment-1600915185>
+//!
+//! # The /run directory
+//!
+//! When processing OCI images via `create_filesystem()`, the `/run` directory
+//! is emptied if present. This is a tmpfs at runtime and should always be
+//! empty in images. Its mtime is set to match `/usr` for consistency with
+//! how root directory metadata is handled.
+//!
+//! This makes it possible to work around a podman/buildah `RUN --mount` issue in
+//! which cache mounts can leave incomplete directory entries in OCI tar layers
+//! (directories without explicit tar entries inherit incorrect mtimes): point all
+//! such mounts into `/run`, and redirect from their final locations into `/run`,
+//! e.g. via symlinks.
+//!
+//! ## Container build cache mounts
+//!
+//! A practical implication of emptying `/run` is that container authors can
+//! use it for cache mounts without worrying about polluting the final image.
+//!
+//! Instead of:
+//! ```dockerfile
+//! RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ...
+//! ```
+//!
+//! Consider:
+//! ```dockerfile
+//! RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf
+//! RUN --mount=type=cache,target=/run/dnfcache dnf install -y ...
+//! ```
+//!
+//! This avoids potential mtime inconsistencies in `/var/cache` while still
+//! benefiting from build caching.
+//!
+//! See: <https://github.com/containers/composefs-rs/issues/132>
+//!
+//! # Emptied directories for boot
+//!
+//! When preparing a filesystem for boot via `transform_for_boot()`, certain
+//! additional directories are emptied because their contents should not be
+//! part of the final verified image:
+//!
+//! - `/boot`: Contains the UKI which embeds the composefs digest, so including
+//!   it would create a circular dependency
+//! - `/sysroot`: Only has content in ostree-container cases, and traversing
+//!   it for SELinux labeling causes problems
+//!
+//! These directories are emptied and their mtime is set to match `/usr` for
+//! consistency with how the root directory metadata is handled.
diff --git a/crates/composefs-oci/src/lib.rs b/crates/composefs-oci/src/lib.rs
index 807ae966..0aefd575 100644
--- a/crates/composefs-oci/src/lib.rs
+++ b/crates/composefs-oci/src/lib.rs
@@ -35,6 +35,9 @@ pub mod tar;
 #[doc(hidden)]
 pub mod test_util;
 
+#[cfg(doc)]
+pub mod design;
+
 // Re-export the composefs crate for consumers who only need composefs-oci
 pub use composefs;
 
diff --git a/crates/composefs-ostree/src/design.rs b/crates/composefs-ostree/src/design.rs
new file mode 100644
index 00000000..1270e410
--- /dev/null
+++ b/crates/composefs-ostree/src/design.rs
@@ -0,0 +1,139 @@
+//! # OSTree
+//!
+//! composefs-rs supports importing images by pulling from local or
+//! remote OSTree repositories. These images can then be mounted as
+//! composefs images, sharing disk space (deduplication) with other
+//! OSTree images and with other types of images in the composefs
+//! repository.
+//!
+//! Native OSTree repositories use a format similar to that of a composefs
+//! repository, but not quite the same. This means we need some
+//! conversions when handling OSTree commits in a composefs repository.
+//!
+//! OSTree images (commits) are fundamentally made up of many small sha256
+//! content-addressed objects that reference each other. Each commit is
+//! the root of a DAG that defines the whole image. Some of the OSTree
+//! objects are metadata, such as directory permissions or the list of
+//! files in a directory. These don't really exist in composefs, where all
+//! metadata is part of the erofs image. Other objects are large file
+//! objects, which are similar to the file objects in composefs images.
+//! Even these differ, though, because the checksum defining the object
+//! covers both the file content and the file metadata.
+//!
+//! When an OSTree commit is stored in a composefs repo it is stored as a
+//! single splitstream file, named `ostree-commit-$commit_id`, which uses
+//! external object references to all the file content objects that will
+//! be used when creating an erofs image for it. This means OSTree objects
+//! for files that would be inlined in the erofs image will not be
+//! external objects.
+//!
+//! OSTree commit splitstream objects are created during a pull operation
+//! and are used for two things: creating a composefs image by walking the
+//! DAG, and serving as a source of already-available OSTree objects during
+//! a pull operation. Such sources are found automatically during pull
+//! (e.g. parent commit, or old commit for a ref being pulled) or can be
+//! manually specified.
+//!
+//! ## File format
+//!
+//! This describes the format of the `ostree-commit-$commit_id` files.
+//!
+//! ### Splitstream header
+//!
+//! Since the commit file is a splitstream, it starts with the splitstream
+//! headers. Of these we use two, the named refs and the object
+//! refs:
+//!
+//!  * When an erofs image is created for the commit, it is referenced by
+//!    the `composefs.image` named ref.
+//!
+//!  * Any external file content objects are in the `external_refs`
+//!    table. The index of the references in this header table is used to
+//!    refer to the file in the splitstream itself.
+//!
+//! The splitstream content type used for commits is 0xAFE138C18C463EF1.
+//!
+//! ### Splitstream content
+//!
+//! A splitstream is normally a series of inline and external chunks,
+//! but the ostree commit uses only one inline chunk. This chunk is
+//! basically a serialized form of the "objects" directory of an OSTree
+//! repository. I.e. it has a mapping of sha256 to ostree object data.
+//! All objects except file objects are stored in the standard ostree
+//! object format.
+//!
+//! OSTree file objects are stored in the archive-z2 format, except
+//! uncompressed, and the file content part may optionally be stored as a
+//! reference to the index of an external object. The z2 format consists
+//! of an 8-byte header that gives the size (in bytes) of a gvariant,
+//! then the gvariant with the file metadata in
+//! `OSTREE_ZLIB_FILE_HEADER_GVARIANT_FORMAT` format, and then the
+//! file/symlink inline data. If an external object is referenced for the
+//! object, it is expected that there is no inline file data.
+//!
+//! The high level view of the file looks like this:
+//! ```text
+//! +---------------+
+//! | Header        |
+//! +---------------+
+//! | Object IDs    |
+//! +---------------+
+//! | Object Info   |
+//! +---------------+
+//! | Content       |
+//! +---------------+
+//! ```
+//!
+//! The Object IDs table is a sorted array of sha256 digests, in which
+//! lookups use a binary search.  The buckets in the header can be
+//! used to quickly limit the binary search based on the first byte of a
+//! digest.
+//!
+//! Then, at the same index as the binary searched object you can look up
+//! the object info which gives you the offset/length of the object
+//! content data and optionally a reference to an external object.
+//!
+//! The exact form of the data looks like this, packed in order from the
+//! start of the splitstream content. All ints are in little endian.
+//!
+//! ### Header
+//! ```text
+//! +-----------------------------------+
+//! | u32: index of commit object       |
+//! | u32: flags (currently unused)     |
+//! | [u32; 256]: end index of bucket   |
+//! +-----------------------------------+
+//! ```
+//!
+//! The bucket list contains the end index (in the object ids table) of
+//! objects starting with that particular byte, and can be used to quickly
+//! limit the search.  We can also compute the total number of objects
+//! (n_objects) by looking in the last bucket.
+//!
+//! ### Object ids
+//! ```text
+//!  n_objects x
+//! +-----------------------------------+
+//! |  [u8; 32] ostree object id        |
+//! +-----------------------------------+
+//! ```
+//!
+//! ### Object Info
+//! ```text
+//!  n_objects x
+//! +-----------------------------------+
+//! | u32: Offset to per-object data    |
+//! | u32: Length of per-object data    |
+//! | u32: Index of external object ref |
+//! |      or MAXUINT32 if none.        |
+//! +-----------------------------------+
+//! ```
+//!
+//! This is an array of information for each object. Once you have found
+//! the object id in the object ids table, you would look at the same
+//! index in this table to find the information. Offsets to per-object
+//! data are in bytes from the start of the content area, which starts at
+//! the end of the Object Info table. All data chunk references are
+//! aligned to 8 bytes with respect to the start of the content area.
+//! This is useful because GVariants (used by ostree) naturally want
+//! 8-byte alignment.
diff --git a/crates/composefs-ostree/src/lib.rs b/crates/composefs-ostree/src/lib.rs
index e292188c..9c9653c9 100644
--- a/crates/composefs-ostree/src/lib.rs
+++ b/crates/composefs-ostree/src/lib.rs
@@ -29,6 +29,8 @@ pub struct CommitInfo {
 }
 
 mod commit;
+#[cfg(doc)]
+pub mod design;
 mod ostree;
 mod pull;
 mod repo;
diff --git a/crates/composefs/src/erofs_format.rs b/crates/composefs/src/erofs_format.rs
new file mode 100644
index 00000000..f939bd2c
--- /dev/null
+++ b/crates/composefs/src/erofs_format.rs
@@ -0,0 +1,82 @@
+//! # composefs EROFS image format
+//!
+//! composefs images are EROFS filesystem images with composefs-specific extensions. They encode
+//! a directory tree where regular files are stored externally in a content-addressed object store
+//! and referenced by their fs-verity digest. The EROFS image itself carries only metadata: inodes,
+//! directory entries, extended attributes, and chunk index entries that point to the external files.
+//!
+//! composefs-rs supports two EROFS format versions. V1 is byte-for-byte compatible with the C
+//! `mkcomposefs` tool. V2 is the composefs-rs native format and drops several V1 constraints
+//! that exist only for C compatibility.
+//!
+//! `cfsctl init` defaults to V2; pass `--erofs-version 1` to select V1. Higher-level tools
+//! such as bootc initialize repositories with multiple formats enabled (V1 primary) so that images
+//! can be booted on RHEL9-era kernels that require the `composefs.digest=` karg.
+//!
+//! ## Format V1
+//!
+//! V1 is selected with `cfsctl init --erofs-version 1`. The `v1_erofs` ro-compat feature flag
+//! is written to `meta.json` so that tools without V1 support open the repository read-only.
+//!
+//! **`composefs_version` field values in V1:**
+//!
+//! - `0` — no user-visible whiteout files (character devices with rdev=0) in the tree
+//! - `1` — at least one user-visible whiteout file is present
+//!
+//! The constant `COMPOSEFS_VERSION_V1` is 0; the field only reaches 1 when user whiteouts are
+//! found. The `--min-version` flag in `mkcomposefs` (mirrored by `mkfs_erofs_v1_min_version`)
+//! forces the value to 1 even when no user whiteouts exist, for forward compatibility.
+//!
+//! **Inode layout:** V1 uses compact inodes (32 bytes) when the file data and inode fit within
+//! the constraints of the compact format, and extended inodes (64 bytes) otherwise.
+//!
+//! **Inode traversal order:** V1 collects inodes in breadth-first order — all entries at one
+//! directory level before descending.
+//!
+//! **Whiteout stub table:** V1 includes 256 synthetic inode entries at the start of the inode
+//! area, one per two-hex-character prefix `00`–`ff`. Each entry is a character-device stub
+//! (chr 0,0) used by the overlay filesystem to resolve whiteout paths against the object store.
+//! V2 omits them entirely.
+//!
+//! **Whiteout escaping:** User-visible whiteout files (chr 0,0) in the tree are not stored as
+//! character devices on disk. Instead they receive a `trusted.overlay.opaque=x` xattr and are
+//! serialized differently. The stub entries in the whiteout table are not escaped.
+//!
+//! **`build_time`:** The superblock `build_time` field is set to the minimum mtime across all inodes.
+//!
+//! **xattr sharing:** Xattr entries are deduplicated using a sort key that is the full xattr name (prefix string concatenated with the suffix).
+//!
+//! ## Format V2 — Created in composefs-rs
+//!
+//! V2 is the default for repositories created with `cfsctl init` without `--erofs-version 1`.
+//!
+//! **`composefs_version` field:** Always `2` (the constant `COMPOSEFS_VERSION`).
+//!
+//! **Inode layout:** V2 always uses extended inodes (64 bytes).
+//!
+//! **Inode traversal order:** V2 collects inodes in depth-first order — all descendants of a directory before moving to the next sibling.
+//!
+//! **No whiteout stub table:** V2 has no synthetic stub entries; whiteout files are stored directly without escaping.
+//!
+//! **`build_time`:** Always 0.
+//!
+//! **xattr sharing:** Xattr entries are deduplicated using a sort key of (prefix, suffix, value)
+//! rather than the full name string, which can produce a smaller shared xattr area.
+//!
+//! ## Selecting the format
+//!
+//! The format is fixed at repository initialization time and cannot be changed afterward.
+//!
+//! ```text
+//! cfsctl init                    # V2 (default)
+//! cfsctl init --erofs-version 1  # V1 (C-tool compatible)
+//! ```
+//!
+//! The format is recorded in `meta.json` (see [`repository_format`][crate::repository_format]) as the `v1_erofs` ro-compat feature flag: present
+//! means V1, absent means V2. Tools that do not recognize this flag open the repository
+//! read-only rather than writing images in the wrong format.
+//!
+//! For the standalone `mkcomposefs` tool, the equivalent flag is `--erofs-version`. The
+//! `--min-version` flag (`mkfs_erofs_v1_min_version` in the Rust API) controls whether the
+//! `composefs_version` field starts at 0 or 1 in V1 images regardless of whether user whiteouts
+//! are present.
diff --git a/crates/composefs/src/lib.rs b/crates/composefs/src/lib.rs
index 806f3c8f..468090e3 100644
--- a/crates/composefs/src/lib.rs
+++ b/crates/composefs/src/lib.rs
@@ -1,8 +1,43 @@
-//! Rust bindings and utilities for working with composefs images and repositories.
+//! # composefs: The reliability of disk images, the flexibility of files
 //!
-//! Composefs is a read-only FUSE filesystem that enables efficient sharing
-//! of container filesystem layers by using content-addressable storage
-//! and fs-verity for integrity verification.
+//! composefs combines several Linux kernel features to provide read-only
+//! mountable filesystem trees that stack on top of a conventional "lower"
+//! filesystem.
+//!
+//! ## Interfaces
+//!
+//! composefs offers two programmatic interfaces:
+//!
+//! - **Rust API** — this crate and its siblings (`composefs-oci`,
+//!   `composefs-boot`, etc.), usable as regular Cargo dependencies.
+//! - **Varlink API** — a [varlink](https://varlink.org) RPC interface
+//!   exposed by `cfsctl varlink` over a Unix socket, accessible from
+//!   any language.  See the [`varlink`] module for examples.
+//!
+//! Neither interface is declared stable yet.  Both may change across
+//! releases while the project is under active development.
+//!
+//! ## Key technologies
+//!
+//! - **[overlayfs]** — the kernel mount interface that exposes the composed tree
+//! - **[EROFS]** — an in-kernel read-only filesystem for the metadata tree
+//!   (directories, symlinks, permissions, xattrs) with no file data
+//! - **[fs-verity]** (optional) — per-file integrity verification on the
+//!   backing store, validated by overlayfs at access time
+//!
+//! [overlayfs]: https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt
+//! [EROFS]: https://erofs.docs.kernel.org
+//! [fs-verity]: https://www.kernel.org/doc/html/next/filesystems/fsverity.html
+//!
+//! ## Design
+//!
+//! composefs produces an EROFS image containing *only* metadata.  Non-empty
+//! data files live in a content-addressed backing store, with
+//! `trusted.overlay.redirect` xattrs telling overlayfs where to find them.
+//! Identical files across images are stored once on disk and shared in the
+//! Linux page cache.
+//!
+//! See the [`repository_format`] module for the on-disk layout.
 
 #![forbid(unsafe_code)]
 // This is a library: emit diagnostics via the `log` crate (or return them),
@@ -25,9 +60,16 @@ pub mod splitstream;
 pub mod tree;
 pub mod util;
 
+#[cfg(doc)]
+pub mod erofs_format;
 pub mod generic_tree;
+#[cfg(doc)]
+pub mod repository_format;
+#[cfg(doc)]
+pub mod splitstream_format;
 #[cfg(any(test, feature = "test"))]
 pub mod test;
+pub mod varlink;
 
 /// Files with this many bytes or fewer are stored inline in the erofs image
 /// (and in splitstreams).  Files above this threshold are written to object
diff --git a/crates/composefs/src/repository_format.rs b/crates/composefs/src/repository_format.rs
new file mode 100644
index 00000000..4396dede
--- /dev/null
+++ b/crates/composefs/src/repository_format.rs
@@ -0,0 +1,317 @@
+//! # composefs repository design
+//!
+//! This document describes the current on-disk layout of a composefs repository.
+//!
+//! At this time, the composefs-rs repository format is not declared stable.
+//!
+//! ## Location
+//!
+//! A composefs repository is a directory located anywhere. The location is chosen
+//! for the `cfsctl` command as follows:
+//!
+//!  - `--repo` can specify an arbitrary directory
+//!
+//!  - if `--user` is specified (default if the current uid is not 0), then the
+//!    repository defaults to `~/.var/lib/composefs`.
+//!
+//!  - if `--system` is specified (default if the current uid is 0), then the
+//!    repository defaults to `/sysroot/composefs`.
+//!
+//! ## Layout
+//!
+//! A composefs repository has a layout that looks something like
+//!
+//! ```text
+//! composefs
+//! ├── meta.json
+//! ├── objects
+//! │   ├── 00
+//! │   │   ├── 002183fb91[...]
+//! │   │   ├── [...]
+//! │   │   └── ff9d7bd692[...]
+//! │   ├── 4e
+//! │   │   ├── 67eaccd9fd[...]
+//! │   │   └── [...]
+//! │   ├── 50
+//! │   │   ├── 2b126bca0c[...]
+//! │   │   └── [...]
+//! │   └── [...]
+//! ├── images
+//! │   ├── 4e67eaccd9fd[...] -> ../objects/4e/67eaccd9fd[...]
+//! │   └── refs
+//! │       └── some/name -> ../../images/4e67eaccd9fd[...]
+//! └── streams
+//!     ├── 502b126bca0c[...] -> ../objects/50/2b126bca0c[...]
+//!     └── refs
+//!         └── some/name.tar -> ../../streams/502b126bca0c[...]
+//! ```
+//!
+//! ## `meta.json`
+//!
+//! Added in 0.7.0. This file records repository-level metadata. When present, it is
+//! created by `cfsctl init` and contains:
+//!
+//!  - `version` — the base repository format version (currently `1`).  Tools
+//!    must refuse to operate on a repository whose version exceeds what they
+//!    understand.
+//!
+//!  - `algorithm` — the fs-verity digest algorithm identifier, in the format
+//!    `fsverity-<hash>-<lg_blocksize>`.  For example `fsverity-sha512-12`
+//!    means SHA-512 with 4 KiB (2^12) blocks.
+//!
+//!  - `features` (optional) — an object with three arrays of feature-flag
+//!    strings, following the ext4/XFS/EROFS compatibility model:
+//!    - `compatible` — old tools can safely ignore these.
+//!    - `read-only-compatible` — old tools may read but must not write.
+//!    - `incompatible` — old tools must refuse the repository entirely.
+//!
+//!    The currently defined feature flags are:
+//!      - `v1_erofs` (read-only-compatible) — present on repositories whose
+//!        EROFS image format is [V1][crate::erofs_format] (C-tool compatible:
+//!        compact inodes, BFS ordering, whiteout table).  This is the single
+//!        flag that encodes the EROFS format version: present → V1, absent
+//!        → V2.  Old tools that do not recognise this flag open the
+//!        repository read-only rather than accidentally writing images in
+//!        the wrong format.
+//!
+//! When `meta.json` is present, `cfsctl` auto-detects the hash algorithm and
+//! errors if `--hash` is explicitly passed with a conflicting value.  When
+//! the file is absent (for repositories created before this feature), `--hash`
+//! is honored as before and defaults to `sha512`.
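+//!
+//! As a rough illustration of the `algorithm` string format described above,
+//! the following sketch splits an identifier into its hash name and block
+//! size.  The `Algorithm` struct and `parse_algorithm` function are
+//! hypothetical names for this example and are not part of the composefs API.
+//!
+//! ```rust
+//! // Hypothetical helper type: the two parts of an `algorithm` identifier.
+//! #[derive(Debug, PartialEq)]
+//! struct Algorithm {
+//!     hash: String,
+//!     lg_blocksize: u8,
+//! }
+//!
+//! // Parse `fsverity-<hash>-<lg_blocksize>`, returning `None` on any mismatch.
+//! fn parse_algorithm(s: &str) -> Option<Algorithm> {
+//!     let rest = s.strip_prefix("fsverity-")?;
+//!     // The block size is the part after the last `-`.
+//!     let (hash, lg) = rest.rsplit_once('-')?;
+//!     if hash.is_empty() {
+//!         return None;
+//!     }
+//!     Some(Algorithm {
+//!         hash: hash.to_string(),
+//!         lg_blocksize: lg.parse().ok()?,
+//!     })
+//! }
+//!
+//! fn main() {
+//!     let alg = parse_algorithm("fsverity-sha512-12").expect("well-formed identifier");
+//!     assert_eq!(alg.hash, "sha512");
+//!     // 2^12 = 4096 bytes, i.e. 4 KiB blocks.
+//!     assert_eq!(1u32 << alg.lg_blocksize, 4096);
+//!     assert!(parse_algorithm("sha512-12").is_none());
+//! }
+//! ```
+//!
+//! The sketch only checks the shape of the string; whether a given hash name
+//! or block size is actually supported is a separate question.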
+//!
+//! ### `cfsctl init --erofs-version`
+//!
+//! The `--erofs-version` flag selects the EROFS format for newly committed
+//! images.  It controls the `v1_erofs` feature flag in `meta.json`:
+//!
+//! ```text
+//! cfsctl init                          # default: V2 EROFS (composefs-rs native)
+//! cfsctl init --erofs-version 1        # V1 EROFS (C-tool compatible)
+//! ```
+//!
+//! **V2** (the `cfsctl` default) uses extended inodes, DFS ordering, and
+//! `composefs_version=2` in the EROFS superblock.  This is the composefs-rs native
+//! format and is what all repositories created before V1 support was added use.
+//! Higher-level tools (e.g. bootc) may configure a repository with multiple format
+//! versions (V1 primary + V2 extra) so that images are usable on both RHEL9-era and
+//! newer kernels.
+//!
+//! **V1** uses compact inodes where possible, BFS ordering, and a whiteout stub
+//! table, producing output byte-for-byte identical to the C `mkcomposefs` tool.
+//! The `v1_erofs` ro-compat flag is written to `meta.json` so that tools which
+//! predate V1 support open the repository read-only rather than writing images
+//! in the wrong format.
+//!
+//! Re-initializing an existing repository with a different `--erofs-version` is
+//! rejected with an error; the format version is fixed at init time.
+//!
+//! ## `objects/`
+//!
+//! This is where the content-addressed data is stored.  The immediate children of
+//! this directory are 256 subdirectories from `00` to `ff`.  Each of those
+//! directories contains a number of files with 62-character hexadecimal names.
+//! Taken together with the directory in which it resides, each filename represents
+//! a 256-bit hash value which equals the measured fs-verity digest of that file.
+//! fs-verity must be enabled for every file.
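+//!
+//! The mapping from a digest to its location under `objects/` can be sketched
+//! as follows.  `object_path` is a hypothetical helper written for this
+//! example; it assumes lowercase hex digests of 64 characters (sha256) or 128
+//! characters (sha512), matching the lengths given later in this document.
+//!
+//! ```rust
+//! // Hypothetical helper: split a hex digest into `objects/<2 chars>/<rest>`.
+//! fn object_path(hex_digest: &str) -> Option<String> {
+//!     let valid_len = matches!(hex_digest.len(), 64 | 128);
+//!     // Assumption: digests are written in lowercase hex.
+//!     let valid_hex = hex_digest
+//!         .bytes()
+//!         .all(|b| b.is_ascii_digit() || (b'a'..=b'f').contains(&b));
+//!     if !valid_len || !valid_hex {
+//!         return None;
+//!     }
+//!     // The first byte (two hex characters) selects the subdirectory.
+//!     let (dir, file) = hex_digest.split_at(2);
+//!     Some(format!("objects/{}/{}", dir, file))
+//! }
+//!
+//! fn main() {
+//!     let digest = format!("4e{}", "0".repeat(62));
+//!     let path = object_path(&digest).expect("valid sha256-sized digest");
+//!     assert!(path.starts_with("objects/4e/"));
+//!     // 62 characters remain for the file name.
+//!     assert_eq!(path.len(), "objects/4e/".len() + 62);
+//!     assert!(object_path("not-a-digest").is_none());
+//! }
+//! ```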
+//!
+//! ## `images/`
+//!
+//! This is where composefs ([EROFS][crate::erofs_format]) images are accounted for.  The images
+//! themselves are fs-verity enabled and stored in the object store in the same way
+//! as the file data, but the `images/` directory contains symlinks to the images
+//! that we know about.  Each symlink is named for the full 256-bit fs-verity digest.
+//!
+//! Images are tracked in a separate directory because of the security model of
+//! filesystems in the Linux kernel.  Although it would be feasible for "regular
+//! users" to mount an erofs in their own mount namespace, the kernel currently
+//! disallows it as a way to avoid allowing non-root users to expose the filesystem
+//! code to hostile data.  As such, we only mount images that we produced for
+//! ourselves (with mkcomposefs), and those are the ones that are linked in this
+//! directory.
+//!
+//! Another way to say it: we must never attempt to mount an arbitrary object: we
+//! may only mount via symlinks present in this directory.
+//!
+//! ## `streams/`
+//!
+//! This is where [split streams][crate::splitstream] are stored.  As with images,
+//! this directory holds symlinks, each named for a 256-bit fs-verity digest, that
+//! point to data in the object store.
+//!
+//! Note: the names of the hashes in this directory are the fs-verity hashes of the
+//! content of the splitstream file, not the original file.  More specifically: if
+//! you have a tar file with a specific sha256 digest, and you import it into the
+//! repository as a splitstream, the resulting filename in this directory will have
+//! no relation to the original content.  You can, however, store a reference for
+//! it.
+//!
+//! ## `{images,streams}/refs/`
+//!
+//! This is where we record which images and streams are currently "requested" by
+//! some external user.  When importing a tar file, in addition to creating the
+//! file in the objects database and the toplevel symlink in the `streams/`
+//! directory, we also assign it a name which is chosen by the software which is
+//! performing the import.
+//!
+//! Each ref is a symlink to the top-level entry in `images/` or `streams/`.
+//!
+//! There are some rough ideas for how we might namespace this.  Something like
+//! this model is imagined:
+//!
+//! ```text
+//! refs
+//! ├── system
+//! │   └── rootfs
+//! │       ├── some_id -> ../../../974d04eaff[...]
+//! │       └── [...]
+//! ├── 1000                      # uid of a user
+//! │   ├── flatpak
+//! │   │   ├── some_id -> ../../../f8e2bec500[...]
+//! │   │   └── [...]
+//! │   └── containers
+//! │       ├── some_id -> ../../../96a87f8b4b[...]
+//! │       └── [...]
+//! └── [...]
+//! ```
+//!
+//! Where the toplevel directories are `system` plus a set of uids.  Each `system`
+//! or uid subdirectory is namespaced by the particular piece of software that's
+//! responsible for storing the given image or stream.
+//!
+//! The per-user directories will all be owned by root and have 0700 permissions,
+//! but each user will be able to access their own uid-numbered subdirectories by
+//! way of an acl.  The reason that we want the directories owned by root is to
+//! prevent users from corrupting the layout of the repository.  The reason for the
+//! acl is that read-only operations on the repository should be performed
+//! directly on the repository and not via some central agent.
+//!
+//! ## Referring to images and streams
+//!
+//! Operations that are performed on images or streams (mount, cat, etc.) name the
+//! image or stream in one of two ways:
+//!
+//!  - via the user-chosen name such as `refs/1000/flatpak/some_id`
+//!  - via the fs-verity digest stored in the toplevel dir
+//!
+//! i.e. the name must either start with the string `refs/`, or must be a
+//! hexadecimal string (64 characters for sha256, 128 for sha512).
+//!
+//! In both cases, the name is a path relative to the `images/` or `streams/`
+//! directory and this path contains a symlink (either direct or indirect) to the
+//! underlying file in `objects/`.
+//!
+//! When specified via fs-verity digest, the digest is verified before performing
+//! the operation.
+//!
+//! For example:
+//!
+//! ```sh
+//! cfsctl mount refs/system/rootfs/some_id /mnt   # does not check fs-verity
+//! cfsctl mount 974d04eaff[...] /mnt              # enforces fs-verity
+//! ```
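+//!
+//! The naming rule above can be sketched as a small classifier.  The `Name`
+//! enum and `classify` function are hypothetical and exist only for this
+//! illustration; a real implementation would also need to validate the ref
+//! path itself (for example, rejecting `..` components), which is omitted here.
+//!
+//! ```rust
+//! // Hypothetical classification of a user-supplied image or stream name.
+//! #[derive(Debug, PartialEq)]
+//! enum Name<'a> {
+//!     // A user-chosen name below `refs/`; fs-verity is not checked.
+//!     Ref(&'a str),
+//!     // An fs-verity digest; verified before the operation.
+//!     Digest(&'a str),
+//! }
+//!
+//! fn classify(name: &str) -> Option<Name<'_>> {
+//!     if name.starts_with("refs/") {
+//!         return Some(Name::Ref(name));
+//!     }
+//!     let is_hex = name.bytes().all(|b| b.is_ascii_hexdigit());
+//!     match (is_hex, name.len()) {
+//!         // 64 hex characters for sha256, 128 for sha512.
+//!         (true, 64) | (true, 128) => Some(Name::Digest(name)),
+//!         _ => None,
+//!     }
+//! }
+//!
+//! fn main() {
+//!     let by_ref = "refs/system/rootfs/some_id";
+//!     assert_eq!(classify(by_ref), Some(Name::Ref(by_ref)));
+//!
+//!     let digest = "ab".repeat(32);
+//!     assert_eq!(classify(&digest), Some(Name::Digest(digest.as_str())));
+//!
+//!     assert_eq!(classify("some/other/name"), None);
+//! }
+//! ```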
+//!
+//! ## OCI image storage
+//!
+//! OCI container images are stored using streams exclusively.  Each OCI artifact
+//! (manifest, config, layer) becomes a splitstream, and OCI "tags" are refs under
+//! `streams/refs/oci/`.
+//!
+//! ### Naming conventions
+//!
+//! | OCI artifact  | Stream name pattern                | Example                            |
+//! |---------------|------------------------------------|------------------------------------|
+//! | Manifest      | `oci-manifest-{manifest_digest}`   | `oci-manifest-sha256:abc123...`    |
+//! | Config        | `oci-config-{config_digest}`       | `oci-config-sha256:def456...`      |
+//! | Layer         | `oci-layer-{diff_id}`              | `oci-layer-sha256:ghi789...`       |
+//! | Blob          | `oci-blob-{blob_digest}`           | `oci-blob-sha256:jkl012...`        |
+//!
+//! Tags are stored under `streams/refs/oci/` with percent-encoding for
+//! filesystem safety (`/` → `%2F`):
+//!
+//! ```text
+//! streams/refs/oci/myimage:latest → ../../oci-manifest-sha256:abc123...
+//! ```
+//!
+//! ### Splitstream reference chains
+//!
+//! Each splitstream contains `named_refs` (semantic labels mapping to entries
+//! in the `stream_refs` array) and `object_refs` (raw objects referenced by
+//! the compressed stream data).  For OCI images the chain is:
+//!
+//! **Manifest splitstream** (`oci-manifest-sha256:...`):
+//!   - `object_refs`: the manifest JSON blob
+//!   - `named_refs`:
+//!     - `config:{config_digest}` → config splitstream verity
+//!     - `{diff_id}` → layer splitstream verity (one per layer)
+//!
+//! **Config splitstream** (`oci-config-sha256:...`):
+//!   - `object_refs`: the config JSON blob
+//!   - `named_refs`:
+//!     - `{diff_id}` → layer splitstream verity (one per layer)
+//!
+//! **Layer splitstream** (`oci-layer-sha256:...`):
+//!   - `object_refs`: file content objects extracted from the tar
+//!   - `named_refs`: none (leaf node)
+//!
+//! Both the manifest and config redundantly reference the layers.  The GC
+//! can reach layers from either path.
+//!
+//! ### Garbage collection
+//!
+//! The GC walks all refs under `streams/refs/` to find root splitstreams,
+//! then transitively follows `named_refs` (by resolving fs-verity IDs
+//! through a stream name map) and collects `object_refs`.  Any object not
+//! reachable from a root is deleted.
+//!
+//! Concretely, for a tagged container image:
+//!
+//!  1. Tag `streams/refs/oci/myimage:v1` resolves to `oci-manifest-sha256:abc`
+//!  2. Walk the manifest: mark its JSON blob and follow `named_refs` to
+//!     the config and layer streams
+//!  3. Walk the config: mark its JSON blob and follow `named_refs` to layers
+//!     (already visited, skipped)
+//!  4. Walk each layer: mark all file content objects
+//!
+//! When a tag is removed, the manifest and everything reachable only from it
+//! becomes GC-eligible.  Layers shared between images survive as long as any
+//! referencing manifest remains tagged.
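+//!
+//! The mark phase of this walk can be sketched as a plain graph traversal.
+//! This is a simplified model under stated assumptions, not the actual
+//! implementation: the `Stream` struct, the `mark_live` function, and the use
+//! of string IDs are all hypothetical, and named refs are assumed to have
+//! already been resolved to the ID of the stream they point at.
+//!
+//! ```rust
+//! use std::collections::{HashMap, HashSet};
+//!
+//! // Hypothetical in-memory view of one splitstream's references.
+//! struct Stream {
+//!     object_refs: Vec<String>, // raw objects used by the stream data
+//!     named_refs: Vec<String>,  // IDs of other streams this one points at
+//! }
+//!
+//! // Return every ID reachable from `roots`: streams and their objects.
+//! fn mark_live(roots: &[&str], streams: &HashMap<String, Stream>) -> HashSet<String> {
+//!     let mut live = HashSet::new();
+//!     let mut visited: HashSet<&str> = HashSet::new();
+//!     let mut stack: Vec<&str> = roots.to_vec();
+//!     while let Some(id) = stack.pop() {
+//!         // Skip streams that were already walked via another path.
+//!         if !visited.insert(id) {
+//!             continue;
+//!         }
+//!         live.insert(id.to_string());
+//!         if let Some(stream) = streams.get(id) {
+//!             live.extend(stream.object_refs.iter().cloned());
+//!             stack.extend(stream.named_refs.iter().map(String::as_str));
+//!         }
+//!     }
+//!     live
+//! }
+//!
+//! fn stream(objects: &[&str], named: &[&str]) -> Stream {
+//!     Stream {
+//!         object_refs: objects.iter().map(|s| s.to_string()).collect(),
+//!         named_refs: named.iter().map(|s| s.to_string()).collect(),
+//!     }
+//! }
+//!
+//! fn main() {
+//!     let mut streams = HashMap::new();
+//!     streams.insert("manifest".to_string(), stream(&["manifest.json"], &["config", "layer"]));
+//!     streams.insert("config".to_string(), stream(&["config.json"], &["layer"]));
+//!     streams.insert("layer".to_string(), stream(&["file-a", "file-b"], &[]));
+//!     streams.insert("untagged".to_string(), stream(&["orphan"], &[]));
+//!
+//!     // One tag resolving to the manifest is the only root.
+//!     let live = mark_live(&["manifest"], &streams);
+//!     assert!(live.contains("file-a") && live.contains("config.json"));
+//!     assert!(!live.contains("orphan"));
+//! }
+//! ```
+//!
+//! In this model the sweep phase would delete every object whose ID is not in
+//! the returned set.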
+//!
+//! ### EROFS image tracking via config splitstream refs
+//!
+//! When an EROFS image is generated from an OCI image (via
+//! `create_filesystem` + `commit_image`), its object ID (fs-verity digest)
+//! is stored as a named ref on the config splitstream with the key
+//! `composefs.image`.
+//!
+//! GC walks from tag → manifest → config, and finds the `composefs.image`
+//! named ref.  The EROFS object ID is added to the live set, keeping the
+//! EROFS image alive.  The EROFS image still needs an entry under `images/`
+//! for the kernel mount security model (see above), but `images/` is not a
+//! GC root — the config ref is what keeps the object alive.
+//!
+//! This means a single OCI tag is sufficient to keep the entire image
+//! (manifest, config, layers, and the EROFS image) alive through GC.
+//!
+//! ### Bootable image variant
+//!
+//! For bootable images, a second EROFS may be generated after
+//! `transform_for_boot` (stripping `/boot`, etc.).  This boot EROFS is
+//! stored as a second named ref on the config, `composefs.image.boot`.
+//!
+//! Since the config splitstream content changes (new named ref), it gets a
+//! new fs-verity digest.  This cascades: the manifest must also be
+//! rewritten (its `config:` named ref now points to the new config verity),
+//! producing a new manifest verity.  The tag is re-pointed to the new
+//! manifest.  The old config and manifest splitstreams become unreferenced
+//! and are collected by GC.
+//!
+//! The result: one tag still keeps everything alive — layers, raw EROFS,
+//! and boot EROFS.
+//!
+//! ### Future: sealed images
+//!
+//! For sealed/signed images, the EROFS comes pre-built from the registry as
+//! part of a composefs OCI artifact (referrer pattern).  The artifact
+//! splitstream would hold references to the pre-fetched EROFS layers.  This
+//! is complementary to the unsealed case — both use the same GC mechanism
+//! (named refs pointing to EROFS objects).
diff --git a/crates/composefs/src/splitstream_format.rs b/crates/composefs/src/splitstream_format.rs
new file mode 100644
index 00000000..36a1a17f
--- /dev/null
+++ b/crates/composefs/src/splitstream_format.rs
@@ -0,0 +1,164 @@
+//! # Splitstream
+//!
+//! Splitstream is a trivial way of storing file formats (like tar) with the "data
+//! blocks" stored in the composefs object store with the goal that it's possible
+//! to bit-for-bit recreate the entire file.  It's something like the idea behind
+//! [tar-split](https://github.com/vbatts/tar-split), with some important
+//! differences:
+//!
+//!  - it's a binary format
+//!
+//!  - it's based on storing external objects content-addressed in the composefs
+//!    object store via their fs-verity digest
+//!
+//!  - although it's designed with `tar` files in mind, it's not specific to `tar`,
+//!    or even to the idea of an archive file: any file format can be stored as a
+//!    splitstream, and it might make sense to do so for any file format that
+//!    contains large chunks of embedded data
+//!
+//!  - in addition to the ability to split out chunks of file content (like files
+//!    in a `.tar`) to separate files, it is also possible to refer to external
+//!    file content, or even other splitstreams, without directly embedding that
+//!    content in the referrer, which can be useful for cross-document references
+//!    (such as between OCI manifests, configs, and layers)
+//!
+//!  - the splitstream file itself is stored in the same content-addressed object
+//!    store by its own fs-verity hash
+//!
+//! Splitstream compresses inline file content before it is stored to disk using
+//! zstd.  The main reason for this is that, after removing the actual file data,
+//! the remaining `tar` metadata contains a very large amount of padding and empty
+//! space and compresses extremely well.
+//!
+//! Splitstream is conceptually independent from composefs: you could use the
+//! format with any content-addressed storage system.
+//!
+//! ## File format
+//!
+//! What follows is non-normative documentation of the file format.  The actual
+//! definition of the format is "what composefs-rs reads and writes", but this
+//! document may be useful to try to understand that format.  If you'd like to
+//! implement the format, please get in touch.
+//!
+//! The format is implemented in
+//! [crate::splitstream] and
+//! the structs from that file are copy-pasted here.  Please try to keep things
+//! roughly in sync when making changes to either side.
+//!
+//! All integers are little-endian.  In the following `struct` definitions, `U`
+//! means 'unsigned little endian' (as per the `zerocopy::little_endian` module),
+//! so `U64` is an unsigned 64-bit little-endian integer.
+//!
+//! ### File ranges ("sections")
+//!
+//! The file format consists of a fixed-sized header at the start of the file plus
+//! a number of sections located at arbitrary locations inside of the file.  All of
+//! these sections are referred to by a 64-bit `[start..end)` range expressed in
+//! terms of overall byte offsets within the complete file.
+//!
+//! ```text
+//! struct FileRange {
+//!     start: U64,
+//!     end: U64,
+//! }
+//! ```
+//!
+//! ### Header
+//!
+//! The file starts with a simple fixed-size header.
+//!
+//! ```text
+//! const SPLITSTREAM_MAGIC: [u8; 11] = *b"SplitStream";
+//!
+//! struct SplitstreamHeader {
+//!     pub magic: [u8; 11],  // Contains SPLITSTREAM_MAGIC
+//!     pub version: u8,      // must always be 0
+//!     pub _flags: U16,      // is currently always 0 (but ignored)
+//!     pub algorithm: u8,    // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512)
+//!     pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k)
+//!     pub info: FileRange,  // can be used to expand/move the info section in the future
+//! }
+//! ```
+//!
+//! In addition to magic values and identifiers for the fs-verity algorithm in use,
+//! the header is used to find the location and size of the info section.  Future
+//! expansions to the file format are imagined to occur by expanding the size of
+//! the info section: if the section is larger than expected, the additional bytes
+//! will be ignored by the implementation.
+//!
+//! ### Info section
+//!
+//! ```text
+//! struct SplitstreamInfo {
+//!     pub stream_refs: FileRange, // location of the stream references array
+//!     pub object_refs: FileRange, // location of the object references array
+//!     pub stream: FileRange,      // location of the zstd-compressed stream within the file
+//!     pub named_refs: FileRange,  // location of the compressed named references
+//!     pub content_type: U64,      // user can put whatever magic identifier they want there
+//!     pub stream_size: U64,       // total uncompressed size of inline chunks and external chunks
+//! }
+//! ```
+//!
+//! The `content_type` is just an arbitrary identifier that can be used by users of
+//! the file format to prevent casual user error when opening a file by its hash
+//! value (to prevent showing `.tar` data as if it were json, for example).
+//!
+//! The `stream_size` is the total size of the original file.
+//!
+//! ### Stream and object refs sections
+//!
+//! All referred streams and objects in the file are stored as two separate flat
+//! uncompressed arrays of binary fs-verity hash values.  Each of these arrays is
+//! referred to from the info section (via `stream_refs` and `object_refs`).
+//!
+//! The number of items in the array is determined by the size of the section
+//! divided by the size of the fs-verity hash value (determined by the algorithm
+//! identifier in the header).
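+//!
+//! As a small worked example of that calculation, the sketch below derives
+//! the entry count from a section range and the header's algorithm
+//! identifier.  `digest_size` and `ref_count` are hypothetical names, and
+//! rejecting a section whose length is not an exact multiple of the digest
+//! size is an assumption of this sketch.
+//!
+//! ```rust
+//! // Digest size in bytes for the header's algorithm identifier
+//! // (1 = sha256, 2 = sha512).
+//! fn digest_size(algorithm: u8) -> Option<u64> {
+//!     match algorithm {
+//!         1 => Some(32),
+//!         2 => Some(64),
+//!         _ => None,
+//!     }
+//! }
+//!
+//! // Number of hash values in a `[start..end)` refs section.
+//! fn ref_count(start: u64, end: u64, algorithm: u8) -> Option<u64> {
+//!     let size = digest_size(algorithm)?;
+//!     let len = end.checked_sub(start)?;
+//!     // Assumption: a partial trailing entry is treated as an error.
+//!     if len % size != 0 {
+//!         return None;
+//!     }
+//!     Some(len / size)
+//! }
+//!
+//! fn main() {
+//!     // A 96-byte section holds three sha256 digests.
+//!     assert_eq!(ref_count(64, 160, 1), Some(3));
+//!     // The same range does not divide evenly into sha512 digests.
+//!     assert_eq!(ref_count(64, 160, 2), None);
+//! }
+//! ```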
+//!
+//! The values are not in any particular order, but implementations should produce
+//! a deterministic output.  For example, the objects reference array produced by
+//! the current implementation has the external objects sorted by first-appearance
+//! within the stream.
+//!
+//! The main motivation for storing the references uncompressed, in binary, and in
+//! a flat array is to make determining the references contained within a
+//! splitstream as simple as possible to improve the efficiency of garbage
+//! collection on large repositories.
+//!
+//! ### The stream
+//!
+//! The main content of the splitstream is stored in the `stream` section
+//! referenced from the info section.  The entire section is zstd compressed.
+//!
+//! Within the compressed stream, the splitstream is formed from a number of
+//! "chunks".  Each chunk starts with a signed 64-bit little-endian integer.
+//! If that number is negative, it marks an "inline" chunk, and that (absolute)
+//! number of bytes of data immediately follows it.  If the number is
+//! non-negative, it is an index into the object refs array for an "external" chunk.
+//!
+//! Zero is a non-negative value, so it's an object reference.  It's not possible
+//! to have a zero-byte inline chunk.  This also means that the high/sign bit
+//! determines which case (inline vs. external) we have, and each case has an
+//! equal number of possible values.
+//!
+//! The stream is reassembled by iterating over the chunks and concatenating the
+//! result.  For inline chunks, the inline data is taken directly from the
+//! splitstream. For external chunks, the content of the external file is used.
+//!
+//! The stream is over when there are no more chunks.
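+//!
+//! The chunk framing described above can be sketched as follows.  The input
+//! is assumed to be the already-decompressed contents of the `stream`
+//! section; zstd decompression and the lookup of external objects are left
+//! out.  `Chunk` and `parse_chunks` are hypothetical names for this example
+//! and do not mirror the real reader in [crate::splitstream].
+//!
+//! ```rust
+//! // One entry of the decompressed stream section.
+//! #[derive(Debug, PartialEq)]
+//! enum Chunk<'a> {
+//!     // Bytes stored directly in the splitstream.
+//!     Inline(&'a [u8]),
+//!     // Index into the object refs array.
+//!     External(u64),
+//! }
+//!
+//! // Walk the decompressed stream and split it into chunks.
+//! fn parse_chunks(mut data: &[u8]) -> Result<Vec<Chunk<'_>>, String> {
+//!     let mut chunks = Vec::new();
+//!     while !data.is_empty() {
+//!         if data.len() < 8 {
+//!             return Err("truncated chunk header".to_string());
+//!         }
+//!         let mut head = [0u8; 8];
+//!         head.copy_from_slice(&data[..8]);
+//!         let value = i64::from_le_bytes(head);
+//!         let rest = &data[8..];
+//!         if value < 0 {
+//!             // Negative: |value| bytes of inline data follow the header.
+//!             let len = value.unsigned_abs();
+//!             if (rest.len() as u64) < len {
+//!                 return Err("truncated inline chunk".to_string());
+//!             }
+//!             let (inline, tail) = rest.split_at(len as usize);
+//!             chunks.push(Chunk::Inline(inline));
+//!             data = tail;
+//!         } else {
+//!             // Non-negative (including zero): an object refs index.
+//!             chunks.push(Chunk::External(value as u64));
+//!             data = rest;
+//!         }
+//!     }
+//!     Ok(chunks)
+//! }
+//!
+//! fn main() {
+//!     let mut stream = Vec::new();
+//!     stream.extend_from_slice(&(-3i64).to_le_bytes());
+//!     stream.extend_from_slice(b"abc");
+//!     stream.extend_from_slice(&0i64.to_le_bytes());
+//!
+//!     let chunks = parse_chunks(&stream).expect("well-formed stream");
+//!     assert_eq!(chunks, vec![Chunk::Inline(&b"abc"[..]), Chunk::External(0)]);
+//! }
+//! ```
+//!
+//! Reassembling the original file would then mean concatenating each inline
+//! slice with the content of the object found at each external index.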
+//!
+//! ### Named references
+//!
+//! It's possible to have named references to other streams.  These are stored in
+//! the `named_refs` section referred to from the info section.
+//!
+//! This section is also zstd-compressed, and is a number of nul-terminated text
+//! records (including a terminator after the last record).  Each record has the
+//! form `n:name` where `n` is a non-negative integer index into the stream refs
+//! array and `name` is an arbitrary name.  The entries are currently sorted by
+//! name (by the writer implementation) but the order is not important to the
+//! reader.  Whether this list is "officially" sorted may be pinned down at
+//! some future point if a need should arise.
+//!
+//! An example of the decompressed content of the section might be something like
+//! `"0:first\01:second\0"`.
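+//!
+//! A reader for the decompressed section could look roughly like the sketch
+//! below.  `parse_named_refs` is a hypothetical name, and treating record
+//! text as UTF-8 is an assumption of this example.  Only the first `:`
+//! separates the index from the name, so names that themselves contain
+//! colons are preserved.
+//!
+//! ```rust
+//! // Parse nul-terminated `n:name` records into (index, name) pairs.
+//! fn parse_named_refs(data: &[u8]) -> Result<Vec<(usize, String)>, String> {
+//!     let mut refs = Vec::new();
+//!     // Every record, including the last, must end with a nul byte.
+//!     let body = match data.split_last() {
+//!         None => return Ok(refs),
+//!         Some((&0, body)) => body,
+//!         Some(_) => return Err("missing final nul terminator".to_string()),
+//!     };
+//!     for record in body.split(|&b| b == 0) {
+//!         // Assumption: records are valid UTF-8.
+//!         let text = std::str::from_utf8(record).map_err(|e| e.to_string())?;
+//!         let (index, name) = text
+//!             .split_once(':')
+//!             .ok_or_else(|| format!("no ':' in record {:?}", text))?;
+//!         let index: usize = index
+//!             .parse()
+//!             .map_err(|_| format!("bad index in record {:?}", text))?;
+//!         refs.push((index, name.to_string()));
+//!     }
+//!     Ok(refs)
+//! }
+//!
+//! fn main() {
+//!     let refs = parse_named_refs(b"0:first\x001:second\x00").expect("well-formed section");
+//!     assert_eq!(
+//!         refs,
+//!         vec![(0, "first".to_string()), (1, "second".to_string())]
+//!     );
+//! }
+//! ```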
diff --git a/crates/composefs/src/varlink.rs b/crates/composefs/src/varlink.rs
new file mode 100644
index 00000000..af22cfd4
--- /dev/null
+++ b/crates/composefs/src/varlink.rs
@@ -0,0 +1,160 @@
+//! # Varlink API
+//!
+//! `cfsctl varlink` exposes a [varlink] RPC service over a Unix socket
+//! with two interfaces:
+//!
+//! - **`org.composefs.Repository`** — repository lifecycle, integrity
+//!   checks, garbage collection, and mounting
+//! - **`org.composefs.Oci`** — OCI container image operations (listing,
+//!   pulling, inspecting, tagging, mounting)
+//!
+//! This API is language-agnostic and usable from any varlink client.
+//! Like the Rust crate API, it is not yet declared stable.
+//!
+//! [varlink]: https://varlink.org
+//!
+//! ## Starting the service
+//!
+//! ```sh
+//! cfsctl varlink --address /run/composefs/composefs.sock
+//! ```
+//!
+//! Systemd socket activation is also supported — if `cfsctl varlink` is
+//! started with an activated socket, the `--address` flag is not needed.
+//!
+//! ## Discovering the full API
+//!
+//! The complete interface definitions — every method, type, and error —
+//! are available at runtime via the standard varlink introspection
+//! protocol.  Use [`varlinkctl`] to dump them:
+//!
+//! ```sh
+//! # List available interfaces
+//! varlinkctl list-interfaces /run/composefs/composefs.sock
+//!
+//! # Full IDL for the Repository interface
+//! varlinkctl introspect /run/composefs/composefs.sock \
+//!     org.composefs.Repository
+//!
+//! # Full IDL for the OCI interface
+//! varlinkctl introspect /run/composefs/composefs.sock \
+//!     org.composefs.Oci
+//! ```
+//!
+//! For `exec:`-style transports (no long-running socket), `varlinkctl`
+//! can launch `cfsctl` as a subprocess:
+//!
+//! ```sh
+//! varlinkctl introspect exec:cfsctl\ varlink org.composefs.Repository
+//! ```
+//!
+//! [`varlinkctl`]: https://www.freedesktop.org/software/systemd/man/latest/varlinkctl.html
+//!
+//! ## Session model
+//!
+//! Repositories are accessed through opaque `u64` handles.  A client
+//! calls `OpenRepository` to obtain a handle, passes it to every
+//! subsequent method, and releases it with `CloseRepository`.  No
+//! repository is opened at startup.
+//!
+//! ## Examples
+//!
+//! The examples below use `varlinkctl call`.  Any varlink client works —
+//! the wire format is JSON over a Unix socket.
+//!
+//! ### Open and close a repository
+//!
+//! ```sh
+//! # Open the system repository (/sysroot/composefs)
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Repository.OpenRepository '{"system": true}'
+//! # → {"handle": 1}
+//!
+//! # Open at a specific path
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Repository.OpenRepository \
+//!     '{"path": "/srv/composefs"}'
+//! # → {"handle": 2}
+//!
+//! # Release a handle when done
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Repository.CloseRepository '{"handle": 1}'
+//! ```
+//!
+//! ### Check repository integrity
+//!
+//! ```sh
+//! # Full check (verifies fs-verity on every object)
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Repository.Fsck '{"handle": 1}'
+//! # → {"ok": true, "has_metadata": true, "objects_checked": 1542, ...}
+//!
+//! # Fast metadata-only check (skips per-object verification)
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Repository.Fsck \
+//!     '{"handle": 1, "metadata_only": true}'
+//! ```
+//!
+//! ### List and pull OCI images
+//!
+//! ```sh
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Oci.ListImages '{"handle": 1}'
+//! # → {"images": [{"name": "myimage:latest",
+//! #     "manifest_digest": "sha256:abc...", ...}, ...]}
+//!
+//! # Pull with streaming progress
+//! varlinkctl call --more /run/composefs/composefs.sock \
+//!     org.composefs.Oci.Pull '{
+//!       "handle": 1,
+//!       "image": "quay.io/fedora/fedora:latest",
+//!       "local_fetch": "decompressed",
+//!       "bootable": false,
+//!       "more": true
+//!     }'
+//! # Streams progress, then a final "completed" frame
+//! ```
+//!
+//! ### Inspect, tag, and untag
+//!
+//! ```sh
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Oci.Inspect \
+//!     '{"handle": 1, "image": "myimage:latest"}'
+//! # → {"manifest": "{...}", "config": "{...}", ...}
+//!
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Oci.Tag '{
+//!       "handle": 1,
+//!       "manifest_digest": "sha256:abc123...",
+//!       "name": "myimage:v2"
+//!     }'
+//!
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Oci.Untag \
+//!     '{"handle": 1, "name": "myimage:old"}'
+//! ```
+//!
+//! ### Garbage collection
+//!
+//! ```sh
+//! # Dry run
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Repository.Gc \
+//!     '{"handle": 1, "dry_run": true, "roots": []}'
+//!
+//! # Collect for real
+//! varlinkctl call /run/composefs/composefs.sock \
+//!     org.composefs.Repository.Gc \
+//!     '{"handle": 1, "dry_run": false, "roots": []}'
+//! ```
+//!
+//! ### Mounting
+//!
+//! The `Mount` and `OciMount` methods return a detached mount file
+//! descriptor via `SCM_RIGHTS`.  The caller attaches it with
+//! `move_mount(2)`.  For overlay mounts, the caller passes upperdir and
+//! workdir fds in the request.
+//!
+//! These methods require a varlink client that supports fd passing;
+//! `varlinkctl` does not currently support this.
diff --git a/doc/booting.md b/doc/booting.md
deleted file mode 100644
index b958e6cd..00000000
--- a/doc/booting.md
+++ /dev/null
@@ -1,90 +0,0 @@
-# Booting from a composefs image
-
-This document describes how composefs-rs sets up the root filesystem during
-early boot. It covers the kernel command-line interface, the expected on-disk
-layout, kernel requirements, and the step-by-step mount sequence performed by
-`composefs-setup-root`.
-
-The target audience is system integrators and OS developers who are packaging a
-bootable system using composefs. Familiarity with Linux mount namespaces,
-overlayfs, and fs-verity is assumed.
-
-## Kernel command-line
-
-The initramfs code in composefs supports multiple kernel arguments; it
-is possible to pre-compute the digest of an image using both e.g. SHA-256 and
-SHA-512. On an installed system, the repository only supports one digest
-by default today, and the first found will be selected.
-
-Additionally, it is opt-in to enable v1 EROFS, and again the first compatible
-version will be found.
-
-```
-composefs.digest=v1-sha256-12:<digest>   # V1 EROFS image (preferred; RHEL9-era kernels)
-composefs.digest=v1-sha512-12:<digest>   # V1 EROFS image (SHA-512 variant)
-composefs.digest=v2-sha512-12:<digest>   # V2 EROFS image (explicit form)
-composefs=<digest>                       # V2 EROFS image (legacy shorthand)
-```
-
-The value format is `<version>-<hash>-<lg_blocksize>:<hex_digest>`, where
-`<version>` is `v1` or `v2`, `<hash>` is `sha256` or `sha512`, and
-`<lg_blocksize>` is the log₂ block size (currently always `12`, i.e. 4096
-bytes). This mirrors how `meta.json` encodes the algorithm as
-`fsverity-sha256-12`.
-
-`composefs.digest=` is checked first. Multiple entries may appear on the cmdline
-(one per format/algorithm combination); the initramfs tries each in order and
-mounts the first image that actually exists in the repository.
-
-`composefs=<digest>` is a legacy shorthand equivalent to
-`composefs.digest=v2-<hash>-12:<digest>` — the algorithm is inferred from the
-digest length (64 hex chars → SHA-256, 128 → SHA-512). It is checked only when
-no `composefs.digest=` token matches.
-
-**Insecure mode.** Placing `?` immediately after `=` (e.g.
-`composefs.digest=?v1-sha256-12:<digest>` or `composefs=?<digest>`) makes
-fs-verity verification optional. The system will boot even when the underlying
-filesystem does not support fs-verity or the image has no verity metadata
-attached. This mode exists for development and testing only; it must not be used
-in production.
-
-## On-disk layout
-
-The composefs repository must be present at `/sysroot/composefs` with the
-standard layout described in `doc/repository.md`.
-
-The digest must correspond to a symlink under `images/`.
-
-Persistent per-deployment state lives at `/sysroot/state/deploy/<digest>/`,
-where `<digest>` matches the boot karg digest exactly. The `etc/` and `var/`
-subdirectories within that directory serve as the upper layers for the
-corresponding overlayfs mounts.
-
-## Kernel requirements
-
-The following kernel features must be available:
-
-- **EROFS** filesystem driver (`CONFIG_EROFS_FS`)
-- **overlayfs** with `metacopy=on` and `redirect_dir=on`
-  (`CONFIG_OVERLAY_FS`, `CONFIG_OVERLAY_FS_METACOPY`, `CONFIG_OVERLAY_FS_REDIRECT_DIR`)
-- **fs-verity** unless insecure mode is used (`CONFIG_FS_VERITY`)
-- The modern Linux mount API (`fsopen` / `fsconfig` / `fsmount` / `move_mount`),
-  available since kernel 5.2. Kernel ≥ 6.15 is required for the atomic root
-  replacement path (the default build). On kernels without `fsconfig_set_fd`
-  support (e.g. RHEL 9 / kernel < 5.15), a loopback device is created
-  automatically by `composefs::mountcompat`.
-
-## Kernel argument
-
-The boot karg (`composefs.digest=` or `composefs=`) is the authoritative selector for which image is booted.
-Without the `?` insecure prefix, every file access through the overlayfs is
-verified against the object's stored digest by the kernel, combining fs-verity
-on the data objects with overlayfs `verity=require`.
-
-## Other notes
-
-As a workaround for a GPT auto-root issue in systemd
-([systemd#35017](https://github.com/systemd/systemd/issues/35017)),
-`composefs-setup-root` attempts to create `/run/systemd/volatile-root` as a
-symlink pointing to the real block device before performing any mounts. Failure
-to do so is non-fatal and does not abort the boot sequence.
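The legacy `composefs=` rule in the removed text above (algorithm inferred from digest length, optional `?` prefix for insecure mode) can be sketched roughly as follows. This is only an illustration under stated assumptions: `parse_legacy_karg`, `LegacyKarg`, and `Algorithm` are hypothetical names, not the composefs-rs API, and rejecting non-hex input is an assumption of the sketch.

```rust
/// Hash algorithm inferred from the length of a hex digest.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Algorithm {
    Sha256,
    Sha512,
}

/// Parsed value of a legacy `composefs=` karg (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct LegacyKarg<'a> {
    /// True when the value started with `?` (verity optional).
    pub insecure: bool,
    pub algorithm: Algorithm,
    pub digest: &'a str,
}

/// Parse the value part of `composefs=<digest>` or `composefs=?<digest>`.
/// Hypothetical helper for illustration only.
pub fn parse_legacy_karg(value: &str) -> Option<LegacyKarg<'_>> {
    // A leading `?` selects insecure mode.
    let (insecure, digest) = match value.strip_prefix('?') {
        Some(rest) => (true, rest),
        None => (false, value),
    };
    // Assumption of this sketch: anything that is not pure hex is rejected.
    if !digest.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    // 64 hex chars -> SHA-256, 128 hex chars -> SHA-512.
    let algorithm = match digest.len() {
        64 => Algorithm::Sha256,
        128 => Algorithm::Sha512,
        _ => return None,
    };
    Some(LegacyKarg { insecure, algorithm, digest })
}
```

Per the text, a caller would consult this path only when no `composefs.digest=` token matched.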
diff --git a/doc/erofs.md b/doc/erofs.md
deleted file mode 100644
index 2a49b60e..00000000
--- a/doc/erofs.md
+++ /dev/null
@@ -1,82 +0,0 @@
-# composefs EROFS image format
-
-composefs images are EROFS filesystem images with composefs-specific extensions. They encode
-a directory tree where regular files are stored externally in a content-addressed object store
-and referenced by their fs-verity digest. The EROFS image itself carries only metadata: inodes,
-directory entries, extended attributes, and chunk index entries that point to the external files.
-
-composefs-rs supports two EROFS format versions. V1 is byte-for-byte compatible with the C
-`mkcomposefs` tool. V2 is the composefs-rs native format and drops several V1 constraints
-that exist only for C compatibility.
-
-`cfsctl init` defaults to V2; pass `--erofs-version 1` to select V1. Higher-level tools
-such as bootc initialize repositories with multiple formats enabled (V1 primary) so that images
-can be booted on RHEL9-era kernels that require the `composefs.digest=` karg.
-
-## Format V1
-
-V1 is selected with `cfsctl init --erofs-version 1`. The `v1_erofs` ro-compat feature flag
-is written to `meta.json` so that tools without V1 support open the repository read-only.
-
-**`composefs_version` field values in V1:**
-
-- `0` — no user-visible whiteout files (character devices with rdev=0) in the tree
-- `1` — at least one user-visible whiteout file is present
-
-The constant `COMPOSEFS_VERSION_V1` is 0; the field only reaches 1 when user whiteouts are
-found. The `--min-version` flag in `mkcomposefs` (mirrored by `mkfs_erofs_v1_min_version`)
-forces the value to 1 even when no user whiteouts exist, for forward compatibility.
-
-**Inode layout:** V1 uses compact inodes (32 bytes) when the file data and inode fit within
-the constraints of the compact format, and extended inodes (64 bytes) otherwise.
-
-**Inode traversal order:** V1 collects inodes in breadth-first order — all entries at one
-directory level before descending.
-
-**Whiteout stub table:** V1 includes 256 synthetic inode entries at the start of the inode
-area, one per two-hex-character prefix `00`–`ff`. Each entry is a character-device stub
-(chr 0,0) used by the overlay filesystem to resolve whiteout paths against the object store.
-V2 omits them entirely.
-
-**Whiteout escaping:** User-visible whiteout files (chr 0,0) in the tree are not stored as
-character devices on disk. Instead they receive a `trusted.overlay.opaque=x` xattr and are
-serialized differently. The stub entries in the whiteout table are not escaped.
-
-**`build_time`:** The superblock `build_time` field is set to the minimum mtime across all inodes.
-
-**xattr sharing:** Xattr entries are deduplicated using a sort key that is the full xattr name (prefix string concatenated with the suffix).
-
-## Format V2 — Created in composefs-rs
-
-V2 is the default for repositories created with `cfsctl init` without `--erofs-version 1`.
-
-**`composefs_version` field:** Always `2` (the constant `COMPOSEFS_VERSION`).
-
-**Inode layout:** V2 always uses extended inodes (64 bytes).
-
-**Inode traversal order:** V2 collects inodes in depth-first order — all descendants of a directory before moving to the next sibling.
-
-**No whiteout stub table:** V2 has no synthetic stub entries; whiteout files are stored directly without escaping.
-
-**`build_time`:** Always 0.
-
-**xattr sharing:** Xattr entries are deduplicated using a sort key of (prefix, suffix, value)
-rather than the full name string, which can produce a smaller shared xattr area.
-
-## Selecting the format
-
-The format is fixed at repository initialization time and cannot be changed afterward.
-
-```
-cfsctl init                    # V2 (default)
-cfsctl init --erofs-version 1  # V1 (C-tool compatible)
-```
-
-The format is recorded in `meta.json` as the `v1_erofs` ro-compat feature flag: present
-means V1, absent means V2. Tools that do not recognize this flag open the repository
-read-only rather than writing images in the wrong format.
-
-For the standalone `mkcomposefs` tool, the equivalent flag is `--erofs-version`. The
-`--min-version` flag (`mkfs_erofs_v1_min_version` in the Rust API) controls whether the
-`composefs_version` field starts at 0 or 1 in V1 images regardless of whether user whiteouts
-are present.
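To make the format-selection rules from the removed doc/erofs.md concrete, here is a small sketch of how a reader might derive the EROFS format from the ro-compat feature list and compute the `composefs_version` field. The names `ErofsFormat`, `format_from_features`, and `composefs_version` are hypothetical, and modelling `--min-version` as a plain integer lower bound is an assumption, not the documented implementation.

```rust
/// EROFS image format of a repository.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ErofsFormat {
    V1,
    V2,
}

/// `v1_erofs` present in the read-only-compatible list means V1, absent means V2.
pub fn format_from_features(ro_compat: &[&str]) -> ErofsFormat {
    if ro_compat.iter().any(|flag| *flag == "v1_erofs") {
        ErofsFormat::V1
    } else {
        ErofsFormat::V2
    }
}

/// Value of the `composefs_version` field for an image.
/// `min_version` models the `--min-version` flag (assumed to be 0 or 1).
pub fn composefs_version(format: ErofsFormat, has_user_whiteout: bool, min_version: u32) -> u32 {
    match format {
        // V2 always writes 2.
        ErofsFormat::V2 => 2,
        // V1 writes 0, or 1 when a user whiteout exists or a minimum is forced.
        ErofsFormat::V1 => {
            let detected = if has_user_whiteout { 1 } else { 0 };
            detected.max(min_version)
        }
    }
}
```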
diff --git a/doc/oci.md b/doc/oci.md
deleted file mode 100644
index d1f850f4..00000000
--- a/doc/oci.md
+++ /dev/null
@@ -1,127 +0,0 @@
-# How to create a composefs from an OCI image
-
-This document is incomplete.  It only serves to document some decisions we've
-taken about how to resolve ambiguous situations.
-
-# Data precision
-
-We currently create a composefs image using the granularity of data as
-typically appears in OCI tarballs:
- - atime and ctime are not present (these are actually not physically present
-   in the erofs inode structure at all, either the compact or extended forms)
- - mtime is set to the mtime in seconds; the sub-seconds value is simply
-   truncated (i.e. we always round down).  erofs has an nsec field, but it's not
-   normally present in OCI tarballs.  That's down to the fact that the usual
-   tar header only has timestamps in seconds and extended headers are not
-   usually added for this purpose.
- - we take great care to faithfully represent hardlinks: even though the
-   produced filesystem is read-only and we have data de-duplication via the
-   objects store, we make sure that hardlinks result in an actual shared inode
-   as visible via the `st_ino` and `st_nlink` fields on the mounted filesystem.
-
-We apply these precision restrictions also when creating images by scanning the
-filesystem.  For example: even if we get more-accurate timestamp information,
-we'll truncate it to whole seconds (rounding down, as above).
-
-# Merging directories
-
-This is done according to the OCI spec, with an additional clarification: in
-case a directory entry is present in multiple layers, we use the tar metadata
-from the most-derived layer to determine the attributes (owner, permissions,
-mtime) for the directory.
-
-# The root inode
-
-The root inode (/) is a difficult case because OCI container layer tars often
-don't include a root directory entry, and when they do, container runtimes
-(Podman, Docker) ignore it and use hardcoded defaults.  For example, Podman's
-[containers/storage](https://github.com/containers/storage) uses root:root
-ownership, mode `0555`, and epoch (0) mtime when extracting layers, but
-Docker uses `0755`. In general, the metadata for `/` is not defined.
-
-Because composefs aims to provide precise, cryptographically verifiable
-filesystem trees, we solve this for OCI by copying the metadata from `/usr`
-to the root directory.  The rationale is that `/usr` is always present in
-standard filesystem layouts and must be defined explicitly in the OCI layers.
-
-This is implemented via the `copy_root_metadata_from_usr()` method and the
-`read_container_root()` convenience function.
-
-When building a filesystem from OCI layers programmatically, use
-`Stat::uninitialized()` to create the initial `FileSystem`.  This placeholder
-has mode `0` (obviously invalid) to make it clear that the root metadata should
-be set before computing digests - typically by calling
-`copy_root_metadata_from_usr()` after processing all layers.
-
-# Extended attributes (xattrs)
-
-When reading a container filesystem from a mounted root (as opposed to
-processing OCI layer tars directly), host-side xattrs can leak into the
-image.  This is particularly problematic for `security.selinux` labels:
-if SELinux is enabled at build time, files will have labels like
-`container_t` that come from the build host, not from the target system's
-policy.
-
-To ensure reproducibility, `read_container_root()` filters xattrs to only
-include those in an allowlist.  Currently this is just `security.capability`,
-which represents actual file capabilities that should be preserved.
-
-SELinux labels are handled separately by `transform_for_boot()`:
- - If the target filesystem contains a SELinux policy (in `/etc/selinux`),
-   all files are relabeled according to that policy
- - If no SELinux policy is found, all `security.selinux` xattrs are stripped
-
-This ensures that:
- - Build-time SELinux labels don't leak into non-SELinux targets
- - SELinux-enabled targets get correct labels from their own policy
- - Other host xattrs (overlayfs internals, etc.) don't pollute the image
-
-See: https://github.com/containers/storage/pull/1608#issuecomment-1600915185
-
-# The /run directory
-
-When processing OCI images via `create_filesystem()`, the `/run` directory
-is emptied if present. This is a tmpfs at runtime and should always be
-empty in images. Its mtime is set to match `/usr` for consistency with
-how root directory metadata is handled.
-
-This makes it possible to work around podman/buildah's `RUN --mount` issue where cache
-mounts can leave incomplete directory entries in OCI tar layers (directories
-without explicit tar entries inherit incorrect mtimes) by pointing all
-such mounts into `/run`, and then redirecting from their final location
-via e.g. symlinks into `/run`.
-
-## Container build cache mounts
-
-A practical implication of emptying `/run` is that container authors can
-use it for cache mounts without worrying about polluting the final image.
-
-Instead of:
-```dockerfile
-RUN --mount=type=cache,target=/var/cache/dnf dnf install -y ...
-```
-
-Consider:
-```dockerfile
-RUN rm -rf /var/cache/dnf && ln -sr /run/dnfcache /var/cache/dnf
-RUN --mount=type=cache,target=/run/dnfcache dnf install -y ...
-```
-
-This avoids potential mtime inconsistencies in `/var/cache` while still
-benefiting from build caching.
-
-See: https://github.com/containers/composefs-rs/issues/132
-
-# Emptied directories for boot
-
-When preparing a filesystem for boot via `transform_for_boot()`, certain
-additional directories are emptied because their contents should not be
-part of the final verified image:
-
-- `/boot`: Contains the UKI which embeds the composefs digest, so including
-  it would create a circular dependency
-- `/sysroot`: Only has content in ostree-container cases, and traversing
-  it for SELinux labeling causes problems
-
-These directories are emptied and their mtime is set to match `/usr` for
-consistency with how the root directory metadata is handled.
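The precision and xattr rules from the removed doc/oci.md can be illustrated with a short sketch: sub-second mtime is dropped, and xattrs are filtered against an allowlist that currently holds only `security.capability`. The function names and the `BTreeMap` representation are assumptions made for this example; they are not the signatures used by `read_container_root()`.

```rust
use std::collections::BTreeMap;

/// Allowlist described in the text: only file capabilities are preserved.
const XATTR_ALLOWLIST: &[&str] = &["security.capability"];

/// Drop the sub-second part of an mtime. With a (seconds, nanoseconds)
/// pair where nanoseconds are non-negative, this always rounds down.
pub fn truncate_mtime(secs: i64, _nsec: u32) -> (i64, u32) {
    (secs, 0)
}

/// Keep only allowlisted xattrs (hypothetical helper for illustration).
pub fn filter_xattrs(xattrs: BTreeMap<String, Vec<u8>>) -> BTreeMap<String, Vec<u8>> {
    xattrs
        .into_iter()
        .filter(|(name, _)| XATTR_ALLOWLIST.iter().any(|allowed| *allowed == name.as_str()))
        .collect()
}
```

In this model a host-side `security.selinux` label is simply dropped; the relabeling done by `transform_for_boot()` is a separate step that the sketch does not cover.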
diff --git a/doc/ostree.md b/doc/ostree.md
deleted file mode 100644
index e96b85d7..00000000
--- a/doc/ostree.md
+++ /dev/null
@@ -1,139 +0,0 @@
-# OSTree
-
-composefs-rs has support for importing images from OSTree
-repositories, by pulling from local or remote OSTree
-repositories. These images can then be mounted as composefs images,
-sharing disk (deduplication) with other ostree or other types of
-images in the composefs repository.
-
-Native OSTree repositories are a format similar to a composefs
-repository, but not quite the same. This means we need some
-conversions when handling ostree commits in a composefs repository.
-
-OSTree images (commits) are fundamentally made up of many small sha256
-content-addressed objects that reference each other. Each commit is
-the root of a DAG that defines the total image. Some of the OSTree
-objects are metadata, such as directory permissions or the list of files
-in a directory. These don't really exist in composefs, where all metadata
-is part of the erofs image. Other objects are large file objects, which
-are similar to the file objects in composefs images. Even these differ,
-though, because the checksum defining the object is made up of both the
-file content and the file metadata.
-
-When an OSTree commit is stored in a composefs repo it is stored as a
-single splitstream file, named `ostree-commit-$commit_id`, which uses
-external object references to all the file content objects that will
-be used when creating an erofs image for it. This means OSTree objects
-for files that would be inlined in the erofs image will not be
-external objects.
-
-OSTree commit splitstream objects are created during a pull operation
-and are used for two things: creating a composefs image by walking the
-DAG, and serving as a source of already-available OSTree objects during
-a pull operation. Such sources are found automatically during pull
-(e.g. parent commit, or old commit for a ref being pulled) or can be
-manually specified.
-
-## File format
-
-This describes the format of the `ostree-commit-$commit_id` files.
-
-### Splitstream header
-
-Since the commit file is a splitstream, it starts with the splitstream
-headers. Of these we use two, the named refs and the object
-refs:
-
- * When an erofs image is created for the commit, it is referenced by
-   the `composefs.image` named ref.
-
- * Any external file content objects are in the external_refs
-   table. The index of the references in this header table is used to
-   refer to the file in the splitstream itself.
-
-The splitstream content type used for commits is 0xAFE138C18C463EF1.
-
-### Splitstream content
-
-A splitstream is normally a series of internal and external chunks,
-but the ostree commit uses only one inline chunk. This chunk is
-basically a serialized form of the "objects" directory of an OSTree
-repository. I.e. it has a mapping of sha256 to ostree object data.
-All objects except file objects are stored in the standard ostree
-object format.
-
-OSTree file objects are stored in the archive-z2 format, except not
-compressed, and optionally the file content part of it may be stored
-as a reference to the index of an external object. The z2 format consists
-of an 8-byte header that gives the size (in bytes) of a gvariant,
-followed by the gvariant with the file metadata in
-OSTREE_ZLIB_FILE_HEADER_GVARIANT_FORMAT format, and then the
-file/symlink inline data. If an external object is referenced for the
-object then it is expected that there is no inline file data.
-
-The high level view of the file looks like this:
-```
-+---------------+
-| Header        |
-+---------------+
-| Object IDs    |
-+---------------+
-| Object Info   |
-+---------------+
-| Content       |
-+---------------+
-```
-
-The Object IDs is a sorted array of sha256 digests, and you would do
-lookups in it using a binary search.  The buckets in the header can be
-used to quickly limit the binary search based on the first byte of a
-digest.
-
-Then, at the same index as the binary searched object you can look up
-the object info which gives you the offset/length of the object
-content data and optionally a reference to an external object.
-
-The exact form of the data looks like this, packed in order from the
-start of the splitstream content. All ints are in little endian.
-
-### Header
-```
-+-----------------------------------+
-| u32: index of commit object       |
-| u32: flags (currently unused)     |
-| [u32; 256]: end index of bucket   |
-+-----------------------------------+
-```
-
-The bucket list contains the end index (in the object ids table) of
-objects starting with that particular byte, and can be used to quickly
-limit the search.  We can also compute the total number of objects
-(n_objects) by looking in the last bucket.
-
-### Object ids
-```
- n_objects x
-+-----------------------------------+
-|  [u8; 32] ostree object id        |
-+-----------------------------------+
-```
-
-### Object Info
-```
- n_objects x
-+-----------------------------------+
-| u32: Offset to per-object data    |
-| u32: Length of per-object data    |
-| u32: Index of external object ref |
-|      or MAXUINT32 if none.        |
-+-----------------------------------+
-```
-
-This is an array of information for each object. Once you have found
-the object id in the object ids table, you would look at the same
-index in this table to find the information. Offsets to per-object
-data are in bytes from the start of the content area, which starts at
-the end of the Object Info table. All data chunk references are
-aligned to 8 bytes with respect to the start of the content area.
-This is useful because GVariants (used by ostree) naturally want
-8-byte alignment.
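The bucket-limited lookup described in the removed doc/ostree.md can be sketched as below, assuming the header's bucket table and the Object IDs table have already been decoded into memory. The function names are hypothetical, and returning `None` for an inconsistent bucket table is a choice of this sketch rather than documented behaviour.

```rust
use std::ops::Range;

/// Total number of objects: the end index stored in the last bucket.
pub fn n_objects(buckets: &[u32; 256]) -> usize {
    buckets[255] as usize
}

/// Index range in the object ids table for ids starting with `first_byte`.
/// Each bucket stores an end index, so the start is the previous bucket's end.
pub fn bucket_range(buckets: &[u32; 256], first_byte: u8) -> Range<usize> {
    let start = if first_byte == 0 {
        0
    } else {
        buckets[first_byte as usize - 1] as usize
    };
    let end = buckets[first_byte as usize] as usize;
    start..end
}

/// Find the index of `id` in the sorted ids table, searching only its bucket.
/// The same index is then used to read the Object Info entry.
pub fn find_object(buckets: &[u32; 256], ids: &[[u8; 32]], id: &[u8; 32]) -> Option<usize> {
    let range = bucket_range(buckets, id[0]);
    // `get` returns None if the bucket table is inconsistent with the ids table.
    let bucket = ids.get(range.clone())?;
    bucket.binary_search(id).ok().map(|i| range.start + i)
}
```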
diff --git a/doc/repository.md b/doc/repository.md
deleted file mode 100644
index 023d26cc..00000000
--- a/doc/repository.md
+++ /dev/null
@@ -1,316 +0,0 @@
-# composefs repository design
-
-This document describes the current on-disk layout of a composefs repository.
-
-At this time, the composefs-rs repository format is not declared stable.
-
-## Location
-
-A composefs repository is a directory located anywhere. The location is chosen
-for the `cfsctl` command as follows:
-
- - `--repo` can specify an arbitrary directory
-
- - if `--user` is specified (default if the current uid is not 0), then the
-   repository defaults to `~/.var/lib/composefs`.
-
- - if `--system` is specified (default if the current uid is 0), then the
-   repository defaults to `/sysroot/composefs`.
-
-## Layout
-
-A composefs repository has a layout that looks something like
-
-```
-composefs
-├── meta.json
-├── objects
-│   ├── 00
-│   │   ├── 002183fb91[...]
-│   │   ├── [...]
-│   │   └── ff9d7bd692[...]
-│   ├── 4e
-│   │   ├── 67eaccd9fd[...]
-│   │   └── [...]
-│   ├── 50
-│   │   ├── 2b126bca0c[...]
-│   │   └── [...]
-│   └── [...]
-├── images
-│   ├── 4e67eaccd9fd[...] -> ../objects/4e/67eaccd9fd[...]
-│   └── refs
-│       └── some/name -> ../../images/4e67eaccd9fd[...]
-└── streams
-    ├── 502b126bca0c[...] -> ../objects/50/2b126bca0c[...]
-    └── refs
-        └── some/name.tar -> ../../streams/502b126bca0c[...]
-```
-
-## `meta.json`
-
-Added in 0.7.0. This file records repository-level metadata. When present, it is
-created by `cfsctl init` and contains:
-
- - `version` — the base repository format version (currently `1`).  Tools
-   must refuse to operate on a repository whose version exceeds what they
-   understand.
-
- - `algorithm` — the fs-verity digest algorithm identifier, in the format
-   `fsverity-<hash>-<lg_blocksize>`.  For example `fsverity-sha512-12`
-   means SHA-512 with 4 KiB (2^12) blocks.
-
- - `features` (optional) — an object with three arrays of feature-flag
-   strings, following the ext4/XFS/EROFS compatibility model:
-   - `compatible` — old tools can safely ignore these.
-   - `read-only-compatible` — old tools may read but must not write.
-   - `incompatible` — old tools must refuse the repository entirely.
-
-   The currently defined feature flags are:
-     - `v1_erofs` (read-only-compatible) — present on repositories whose
-       EROFS image format is V1 (C-tool compatible: compact inodes, BFS
-       ordering, whiteout table).  This is the single flag that encodes the
-       EROFS format version: present → V1, absent → V2.  Old
-       tools that do not recognise this flag open the repository read-only
-       rather than accidentally writing images in the wrong format.
-
-When `meta.json` is present, `cfsctl` auto-detects the hash algorithm and
-errors if `--hash` is explicitly passed with a conflicting value.  When
-the file is absent (for repositories created before this feature), `--hash`
-is honored as before and defaults to `sha512`.
-
-### `cfsctl init --erofs-version`
-
-The `--erofs-version` flag selects the EROFS format for newly committed
-images.  It controls the `v1_erofs` feature flag in `meta.json`:
-
-```
-cfsctl init                          # default: V2 EROFS (composefs-rs native)
-cfsctl init --erofs-version 1        # V1 EROFS (C-tool compatible)
-```
-
-**V2** (the `cfsctl` default) uses extended inodes, DFS ordering, and
-`composefs_version=2` in the EROFS superblock.  This is the composefs-rs native
-format and is what all repositories created before V1 support was added use.
-Higher-level tools (e.g. bootc) may configure a repository with multiple format
-versions (V1 primary + V2 extra) so that images are usable on both RHEL9-era and
-newer kernels.
-
-**V1** uses compact inodes where possible, BFS ordering, and a whiteout stub
-table, producing output byte-for-byte identical to the C `mkcomposefs` tool.
-The `v1_erofs` ro-compat flag is written to `meta.json` so that tools which
-predate V1 support open the repository read-only rather than writing images
-in the wrong format.
-
-Re-initializing an existing repository with a different `--erofs-version` is
-rejected with an error; the format version is fixed at init time.
-
-## `objects/`
-
-This is where the content-addressed data is stored.  The immediate children of
-this directory are 256 subdirectories from `00` to `ff`.  Each of those
-directories contains a number of files with 62-character hexadecimal names
-(126 characters when the repository uses SHA-512).  Taken together with the
-directory in which it resides, each filename represents a 256-bit (or 512-bit)
-hash value which equals the measured fs-verity digest of that file.
-fs-verity must be enabled for every file.
-
-## `images/`
-
-This is where composefs (erofs) images are accounted for.  The images
-themselves are fs-verity enabled and stored in the object store in the same way
-as the file data, but the `images/` directory contains symlinks to the images
-that we know about.  Each symlink is named for the full 256-bit fs-verity digest.
-
-Images are tracked in a separate directory because of the security model of
-filesystems in the Linux kernel.  Although it would be feasible for "regular
-users" to mount an erofs in their own mount namespace, the kernel currently
-disallows it as a way to avoid allowing non-root users to expose the filesystem
-code to hostile data.  As such, we only mount images that we produced for
-ourselves (with mkcomposefs), and those are the ones that are linked in this
-directory.
-
-Another way to say it: we must never attempt to mount an arbitrary object: we
-may only mount via symlinks present in this directory.
-
-## `streams/`
-
-This is where [split streams](splitstream.md) are stored.  As with the images,
-this directory holds symlinks, each named for a 256-bit digest, that point to
-data in the object storage.
-
-Note: the names of the hashes in this directory are the fs-verity hashes of the
-content of the splitstream file, not the original file.  More specifically: if
-you have a tar file with a specific sha256 digest, and you import it into the
-repository as a splitstream, the resulting filename in this directory will have
-no relation to the original content.  You can, however, store a reference for
-it.
-
-## `{images,streams}/refs/`
-
-This is where we record which images and streams are currently "requested" by
-some external user.  When importing a tar file, in addition to creating the
-file in the objects database and the toplevel symlink in the `streams/`
-directory, we also assign it a name which is chosen by the software which is
-performing the import.
-
-Each ref is a symlink to the top-level entry in `images/` or `streams/`.
-
-There are some rough ideas for how we might namespace this.  Something like
-this model is imagined:
-
-```
-refs
-├── system
-│   └── rootfs
-│       ├── some_id -> ../../../974d04eaff[...]
-│       └── [...]
-├── 1000                      # uid of a user
-│   ├── flatpak
-│   │   ├── some_id -> ../../../f8e2bec500[...]
-│   │   └── [...]
-│   └── containers
-│       ├── some_id -> ../../../96a87f8b4b[...]
-│       └── [...]
-└── [...]
-```
-
-Where the toplevel directories are `system` plus a set of uids.  Each `system`
-or uid subdirectory is namespaced by the particular piece of software that's
-responsible for storing the given image or stream.
-
-The per-user directories will all be owned by root and have 0700 permissions,
-but each user will be able to access their own uid-numbered subdirectories by
-way of an acl.  The reason that we want the directories owned by root is to
-prevent users from corrupting the layout of the repository.  The reason for the
-acl is that read-only operations on the repository should be performed
-directly on the repository and not via some central agent.
-
-## Referring to images and streams
-
-Operations that are performed on images or streams (mount, cat, etc.) name the
-stream in one of two ways:
-
- - via the user-chosen name such as `refs/1000/flatpak/some_id`
- - via the fs-verity digest stored in the toplevel dir
-
-i.e. the name must either start with the string `refs/`, or must be a
-hexadecimal string (64 characters for sha256, 128 for sha512).
-
-In both cases, the name is a path relative to the `images/` or `streams/`
-directory and this path contains a symlink (either direct or indirect) to the
-underlying file in `objects/`.
-
-When specified via fs-verity digest, the digest is verified before performing
-the operation.
-
-For example:
-
-```sh
-cfsctl mount refs/system/rootfs/some_id /mnt   # does not check fs-verity
-cfsctl mount 974d04eaff[...] /mnt              # enforces fs-verity
-```
-
-## OCI image storage
-
-OCI container images are stored using streams exclusively.  Each OCI artifact
-(manifest, config, layer) becomes a splitstream, and OCI "tags" are refs under
-`streams/refs/oci/`.
-
-### Naming conventions
-
-| OCI artifact  | Stream name pattern                | Example                            |
-|---------------|------------------------------------|------------------------------------|
-| Manifest      | `oci-manifest-{manifest_digest}`   | `oci-manifest-sha256:abc123...`    |
-| Config        | `oci-config-{config_digest}`       | `oci-config-sha256:def456...`      |
-| Layer         | `oci-layer-{diff_id}`              | `oci-layer-sha256:ghi789...`       |
-| Blob          | `oci-blob-{blob_digest}`           | `oci-blob-sha256:jkl012...`        |
-
-Tags are stored under `streams/refs/oci/` with percent-encoding for
-filesystem safety (`/` → `%2F`):
-
-```
-streams/refs/oci/myimage:latest → ../../oci-manifest-sha256:abc123...
-```
-
-### Splitstream reference chains
-
-Each splitstream contains `named_refs` (semantic labels mapping to entries
-in the `stream_refs` array) and `object_refs` (raw objects referenced by
-the compressed stream data).  For OCI images the chain is:
-
-**Manifest splitstream** (`oci-manifest-sha256:...`):
-  - `object_refs`: the manifest JSON blob
-  - `named_refs`:
-    - `config:{config_digest}` → config splitstream verity
-    - `{diff_id}` → layer splitstream verity (one per layer)
-
-**Config splitstream** (`oci-config-sha256:...`):
-  - `object_refs`: the config JSON blob
-  - `named_refs`:
-    - `{diff_id}` → layer splitstream verity (one per layer)
-
-**Layer splitstream** (`oci-layer-sha256:...`):
-  - `object_refs`: file content objects extracted from the tar
-  - `named_refs`: none (leaf node)
-
-Both the manifest and config redundantly reference the layers.  The GC
-can reach layers from either path.
-
-### Garbage collection
-
-The GC walks all refs under `streams/refs/` to find root splitstreams,
-then transitively follows `named_refs` (by resolving fs-verity IDs
-through a stream name map) and collects `object_refs`.  Any object not
-reachable from a root is deleted.
-
-Concretely, for a tagged container image:
-
- 1. Tag `streams/refs/oci/myimage:v1` resolves to `oci-manifest-sha256:abc`
- 2. Walk the manifest: mark its JSON blob and follow `named_refs` to
-    the config and layer streams
- 3. Walk the config: mark its JSON blob and follow `named_refs` to layers
-    (already visited, skipped)
- 4. Walk each layer: mark all file content objects
-
-When a tag is removed, the manifest and everything reachable only from it
-becomes GC-eligible.  Layers shared between images survive as long as any
-referencing manifest remains tagged.
-
-### EROFS image tracking via config splitstream refs
-
-When an EROFS image is generated from an OCI image (via
-`create_filesystem` + `commit_image`), its object ID (fs-verity digest)
-is stored as a named ref on the config splitstream with the key
-`composefs.image`.
-
-GC walks from tag → manifest → config, and finds the `composefs.image`
-named ref.  The EROFS object ID is added to the live set, keeping the
-EROFS image alive.  The EROFS image still needs an entry under `images/`
-for the kernel mount security model (see above), but `images/` is not a
-GC root — the config ref is what keeps the object alive.
-
-This means a single OCI tag is sufficient to keep the entire image
-(manifest, config, layers, and the EROFS image) alive through GC.
-
-### Bootable image variant
-
-For bootable images, a second EROFS may be generated after
-`transform_for_boot` (stripping `/boot`, etc.).  This boot EROFS is
-stored as a second named ref on the config, `composefs.image.boot`.
-
-Since the config splitstream content changes (new named ref), it gets a
-new fs-verity digest.  This cascades: the manifest must also be
-rewritten (its `config:` named ref now points to the new config verity),
-producing a new manifest verity.  The tag is re-pointed to the new
-manifest.  The old config and manifest splitstreams become unreferenced
-and are collected by GC.
-
-The result: one tag still keeps everything alive — layers, raw EROFS,
-and boot EROFS.
-
-### Future: sealed images
-
-For sealed/signed images, the EROFS comes pre-built from the registry as
-part of a composefs OCI artifact (referrer pattern).  The artifact
-splitstream would hold references to the pre-fetched EROFS layers.  This
-is complementary to the unsealed case — both use the same GC mechanism
-(named refs pointing to EROFS objects).
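The garbage-collection walk described in the removed doc/repository.md amounts to a mark phase over a reference graph. A minimal sketch follows, assuming a simplified in-memory model in which every stream and object is identified by a string id. `Stream` and `live_objects` are hypothetical names, and the real implementation resolves fs-verity IDs through a stream name map, which this model skips.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Simplified model of one splitstream's outgoing references.
#[derive(Default)]
pub struct Stream {
    /// Label -> id of another stream, or of an object such as an EROFS image.
    pub named_refs: BTreeMap<String, String>,
    /// Ids of raw objects referenced by the stream data.
    pub object_refs: Vec<String>,
}

/// Mark phase: return every id reachable from the given root streams.
/// Anything in the object store that is not in this set would be deleted.
pub fn live_objects(streams: &BTreeMap<String, Stream>, roots: &[&str]) -> BTreeSet<String> {
    let mut live = BTreeSet::new();
    let mut pending: Vec<String> = roots.iter().map(|r| r.to_string()).collect();
    while let Some(id) = pending.pop() {
        // Skip ids that were already visited (e.g. layers reached twice).
        if !live.insert(id.clone()) {
            continue;
        }
        // Ids with no stream entry are leaf objects, such as `composefs.image`.
        if let Some(stream) = streams.get(&id) {
            live.extend(stream.object_refs.iter().cloned());
            pending.extend(stream.named_refs.values().cloned());
        }
    }
    live
}
```

Because both the manifest and the config reference the layers, the visited check is what keeps the second path from re-walking them.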
diff --git a/doc/splitstream.md b/doc/splitstream.md
deleted file mode 100644
index 787d1ec9..00000000
--- a/doc/splitstream.md
+++ /dev/null
@@ -1,164 +0,0 @@
-# Splitstream
-
-Splitstream is a trivial way of storing file formats (like tar) with the "data
-blocks" stored in the composefs object store with the goal that it's possible
-to bit-for-bit recreate the entire file.  It's something like the idea behind
-[tar-split](https://github.com/vbatts/tar-split), with some important
-differences:
-
- - it's a binary format
-
- - it's based on storing external objects content-addressed in the composefs
-   object store via their fs-verity digest
-
- - although it's designed with `tar` files in mind, it's not specific to `tar`,
-   or even to the idea of an archive file: any file format can be stored as a
-   splitstream, and it might make sense to do so for any file format that
-   contains large chunks of embedded data
-
- - in addition to the ability to split out chunks of file content (like files
-   in a `.tar`) to separate files, it is also possible to refer to external
-   file content, or even other splitstreams, without directly embedding that
-   content in the referrer, which can be useful for cross-document references
-   (such as between OCI manifests, configs, and layers)
-
- - the splitstream file itself is stored in the same content-addressed object
-   store by its own fs-verity hash
-
-Splitstream compresses inline file content before it is stored to disk using
-zstd.  The main reason for this is that, after removing the actual file data,
-the remaining `tar` metadata contains a very large amount of padding and empty
-space and compresses extremely well.
-
-Splitstream is conceptually independent from composefs: you could use the
-format with any content-addressed storage system.
-
-## File format
-
-What follows is a non-normative documentation of the file format.  The actual
-definition of the format is "what composefs-rs reads and writes", but this
-document may be useful to try to understand that format.  If you'd like to
-implement the format, please get in touch.
-
-The format is implemented in
-[crates/composefs/src/splitstream.rs](crates/composefs/src/splitstream.rs) and
-the structs from that file are copy-pasted here.  Please try to keep things
-roughly in sync when making changes to either side.
-
-All integers are little-endian.  In the following `struct` definitions, `U`
-means 'unsigned little-endian' (as per the `zerocopy::little_endian` module),
-so `U64` is an unsigned 64-bit little-endian integer.
-
-### File ranges ("sections")
-
-The file format consists of a fixed-sized header at the start of the file plus
-a number of sections located at arbitrary locations inside of the file.  All of
-these sections are referred to by a 64-bit `[start..end)` range expressed in
-terms of overall byte offsets within the complete file.
-
-```rust
-struct FileRange {
-    start: U64,
-    end: U64,
-}
-```
-
-### Header
-
-The file starts with a simple fixed-size header.
-
-```rust
-const SPLITSTREAM_MAGIC: [u8; 11] = *b"SplitStream";
-
-struct SplitstreamHeader {
-    pub magic: [u8; 11],  // Contains SPLITSTREAM_MAGIC
-    pub version: u8,      // must always be 0
-    pub _flags: U16,      // is currently always 0 (but ignored)
-    pub algorithm: u8,    // kernel fs-verity algorithm identifier (1 = sha256, 2 = sha512)
-    pub lg_blocksize: u8, // log2 of the fs-verity block size (12 = 4k, 16 = 64k)
-    pub info: FileRange,  // can be used to expand/move the info section in the future
-}
-```
-
-In addition to magic values and identifiers for the fs-verity algorithm in use,
-the header is used to find the location and size of the info section.  Future
-expansions to the file format are imagined to occur by expanding the size of
-the info section: if the section is larger than expected, the additional bytes
-will be ignored by the implementation.
-
-### Info section
-
-```rust
-struct SplitstreamInfo {
-    pub stream_refs: FileRange, // location of the stream references array
-    pub object_refs: FileRange, // location of the object references array
-    pub stream: FileRange,      // location of the zstd-compressed stream within the file
-    pub named_refs: FileRange,  // location of the compressed named references
-    pub content_type: U64,      // user can put whatever magic identifier they want there
-    pub stream_size: U64,       // total uncompressed size of inline chunks and external chunks
-}
-```
-
-The `content_type` is just an arbitrary identifier that can be used by users of
-the file format to prevent casual user error when opening a file by its hash
-value (to prevent showing `.tar` data as if it were JSON, for example).
-
-The `stream_size` is the total size of the original file.
-
-### Stream and object refs sections
-
-All referred streams and objects in the file are stored as two separate flat
-uncompressed arrays of binary fs-verity hash values.  Each of these arrays is
-referred to from the info section (via `stream_refs` and `object_refs`).
-
-The number of items in the array is determined by the size of the section
-divided by the size of the fs-verity hash value (determined by the algorithm
-identifier in the header).
-
-The values are not in any particular order, but implementations should produce
-a deterministic output.  For example, the objects reference array produced by
-the current implementation has the external objects sorted by first-appearance
-within the stream.
-
-The main motivation for storing the references uncompressed, in binary, and in
-a flat array is to make determining the references contained within a
-splitstream as simple as possible to improve the efficiency of garbage
-collection on large repositories.
-
-### The stream
-
-The main content of the splitstream is stored in the `stream` section
-referenced from the info section.  The entire section is zstd compressed.
-
-Within the compressed stream, the splitstream is formed from a number of
-"chunks".  Each chunk starts with a single signed 64-bit little-endian integer.
-If that number is negative, it refers to an "inline" chunk, and that (absolute)
-number of bytes of data immediately follows it.  If the number is non-negative,
-it is an index into the object refs array for an "external" chunk.
-
-Zero is a non-negative value, so it's an object reference.  It's not possible
-to have a zero-byte inline chunk.  This also means that the high/sign bit alone
-determines which case (inline vs. external) we have, and the 64-bit value space
-is split evenly between the two cases.
-
-The stream is reassembled by iterating over the chunks and concatenating the
-result.  For inline chunks, the inline data is taken directly from the
-splitstream.  For external chunks, the content of the external file is used.
-
-The stream is over when there are no more chunks.
-
-### Named references
-
-It's possible to have named references to other streams.  These are stored in
-the `named_refs` section referred to from the info section.
-
-This section is also zstd-compressed, and is a number of nul-terminated text
-records (including a terminator after the last record).  Each record has the
-form `n:name` where `n` is a non-negative integer index into the stream refs
-array and `name` is an arbitrary name.  The entries are currently sorted by
-name (by the writer implementation) but the order is not important to the
-reader.  Whether this list is "officially" sorted may be pinned down at some
-future point if a need should arise.
-
-An example of the decompressed content of the section might be something like
-`"0:first\01:second\0"`.