Support fuse with overlayfs #328

Open
alexlarsson wants to merge 4 commits into main from fuse-with-overlayfs

@alexlarsson

Contributor

This adds support for a FUSE mode that serves only the erofs metadata, plus helpers to mount an overlayfs on top of it using userxattr. This allows rootless mounting of a composefs image where file content (but not metadata) is accessed directly from the regular filesystem by the kernel, and it should perform similarly to the rootful overlayfs mount.

@cgwalters

Collaborator

Maybe some overlap with #306 ?

@alexlarsson

Contributor Author

@cgwalters I don't think there is necessarily an overlap, except perhaps that the passthrough support is intended to solve the same performance issue (but is not useful as it is root only). They are, however, complementary, in that the readdir+ and multi-threading will increase metadata performance.

@alexlarsson

Contributor Author

That said, there might be code conflicts between the two; I'll have a look at that.

@cgwalters cgwalters left a comment

Collaborator

OK, now that I understand what this is doing - pretty cool! This makes a ton of sense.

One thing I did in the other PR is add some more integration tests for the FUSE path, which would probably be quite useful here.

Comment thread crates/composefs-fuse/src/lib.rs Outdated
/// When true, the server follows overlay redirects and serves file
/// content from the repository. When false, it synthesizes
/// `user.overlay.*` xattrs for use as an overlayfs lower layer.
/// Defaults to false.

Collaborator

But actually...I am not sure we considered this at all but - is there any reason at all to have the direct serving? I am not sure there is a strong one...I think the FUSE implementation was kind of intended mainly for unprivileged mounts, but since nowadays overlay is allowed in user namespaces, I think we should automatically set "follow_redirects" = false if running unprivileged right?

Actually, to say it a different way: don't we always want to just use overlayfs, and only have the FUSE for unprivileged EROFS? In that case, when running privileged, we should not use user.overlay, right? Or I guess we still can, but there's no reason to?

Contributor Author

Honestly I'm not sure here. There is one way that follow_redirects mode is useful, in that it allows any non-root user to mount a composefs image in the "root" user namespace.

Like, if you're in a shell you can just mount a cfs and have some other program look into the mount. You cannot do that with an overlay mount. For that you need to be in a new user namespace (where you have CAP_SYS_ADMIN) so you can do the mount. And if you create a new user+mnt namespace, programs in the root namespace can't see your mount. So, non-follow_redirects is primarily useful for container-like tools, whereas follow_redirects is for traditional command-line work.

So, I do think it would make sense to support both of these modes. I wonder though if the current composefs fuse implementation is ideal for this kind of use. It starts by parsing the entire erofs into a filesystem tree, which adds quite some latency. An implementation that just reads the erofs file into memory and does metadata lookups directly from that would probably be faster at startup. And, for a metadata-only implementation (non-follow-redirects) that would probably be pretty easy to implement, as all you have to do is read inode info.

I guess that could be later work though.

Collaborator

+1 to make the metadata-only serving the default.

Collaborator

> There is one way that follow_redirects mode is useful, in that it allows any non-root user to mount a composefs image in the "root" user namespace.

Yeah, but I'm not sure of any kind of "production" use case for that. Debugging can equally well just use APIs to inspect things.

follow_redirects is a confusing name, how about --standalone for the non-overlayfs case?

> It starts by parsing the entire erofs into a filesystem tree, which adds quite some latency. An implementation that just reads the erofs file into memory and does metadata lookups directly from that would probably be faster at startup.

Yeah, agreed, we should do it that way.

Collaborator

> > It starts by parsing the entire erofs into a filesystem tree, which adds quite some latency. An implementation that just reads the erofs file into memory and does metadata lookups directly from that would probably be faster at startup.
>
> Yeah, agreed, we should do it that way.

Can we mmap it so that multiple containers using the same erofs will share the memory?

Contributor Author

I'd think so.

Contributor Author

At least if the image has fs-verity enabled, because then we can trust that it doesn't change behind our back.

Collaborator

Yep agreed

@alexlarsson alexlarsson force-pushed the fuse-with-overlayfs branch 2 times, most recently from 3c4c36b to 588a7f0 Compare June 26, 2026 14:28
@alexlarsson

Contributor Author

OK, I did some more work on this, picking up a bunch of changes from #306 with some modifications. This now has multi-threaded FUSE, mount APIs to support mounting directly as well as via overlayfs with and without userxattr, as well as verity=require. It also has mount CLI options for image/oci/ostree that mostly do the right thing by default (for example, it will use a non-overlay approach if we don't have CAP_SYS_ADMIN).

I didn't look at the varlink part yet. Also this is still based on serving a filesystem, rather than directly serving from an erofs image.

@alexlarsson alexlarsson force-pushed the fuse-with-overlayfs branch from 588a7f0 to 98989b4 Compare June 26, 2026 14:36

@cgwalters cgwalters left a comment

Collaborator

Cool, overall looking good

fuse: bool,
/// Force kernel mount instead of auto-detecting
#[cfg(feature = "fuse")]
#[arg(long, conflicts_with = "fuse")]

Collaborator

I think an enum is better --fuse=auto|yes|no. Could also have a #[clap(flatten)] shared struct to dedup
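
A minimal sketch of the tri-state flag suggested here, using only the standard library. The name `FuseMode` and its variants are hypothetical and not taken from the PR; with clap, a `#[derive(ValueEnum)]` enum would likely replace the hand-written `FromStr`.

```rust
use std::str::FromStr;

/// Hypothetical tri-state replacing the `--fuse` / `--no-fuse` pair.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum FuseMode {
    /// Pick a mount mode based on the available privileges.
    #[default]
    Auto,
    /// Always use FUSE.
    Yes,
    /// Always use the kernel mount.
    No,
}

impl FromStr for FuseMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "auto" => Ok(FuseMode::Auto),
            "yes" => Ok(FuseMode::Yes),
            "no" => Ok(FuseMode::No),
            other => Err(format!("invalid value {other:?}, expected auto|yes|no")),
        }
    }
}
```

Because the two flags collapse into one value, the `conflicts_with` annotation is no longer needed: contradictory combinations cannot be expressed.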

Comment on lines +862 to +863
std::fs::read_to_string("/proc/self/uid_map")
.map(|s| s.trim() == "0 0 4294967295")

Collaborator

This feels hacky.

Contributor Author

I think this is the canonical way to see if you're in the init user namespace. @giuseppe is there something better?

Collaborator

we have a similar function in containers/storage:

// hasFullUsersMappings checks whether the current user namespace has all the IDs mapped.
func hasFullUsersMappings() (bool, error) {
        content, err := os.ReadFile("/proc/self/uid_map")
        if err != nil {
                return false, err
        }
        // The kernel rejects attempts to create mappings where either starting
        // point is (u32)-1: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/user_namespace.c?id=af3e9579ecfb#n1006 .
        // So, if the uid_map contains 4294967295, the entire IDs space is available in the
        // user namespace, so it is likely the initial user namespace.
        return bytes.Contains(content, []byte("4294967295")), nil
}

it will fail if you do something like: unshare --map-users 0:0:4294967295 ... but I guess that is not a common configuration
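
A rough Rust sketch of the same check, comparing parsed fields rather than the raw string. This matters because, as far as I know, the kernel pads the `uid_map` columns with spaces, so an exact string comparison against "0 0 4294967295" may never match. The function names are illustrative; only `/proc/self/uid_map` and the identity mapping come from the discussion above.

```rust
/// Returns true if `uid_map` consists of a single line mapping the full
/// 32-bit ID range onto itself, i.e. the fields "0 0 4294967295".
pub fn is_full_identity_map(uid_map: &str) -> bool {
    let mut lines = uid_map.lines().filter(|l| !l.trim().is_empty());
    // Exactly one non-empty line is expected.
    let (Some(line), None) = (lines.next(), lines.next()) else {
        return false;
    };
    // split_whitespace() tolerates any column padding between fields.
    let fields: Vec<&str> = line.split_whitespace().collect();
    fields == ["0", "0", "4294967295"]
}

/// Reads the mapping of the current process; a read error is treated as
/// "not the initial user namespace".
pub fn in_init_user_namespace() -> bool {
    std::fs::read_to_string("/proc/self/uid_map")
        .map(|s| is_full_identity_map(&s))
        .unwrap_or(false)
}
```

The caveat raised above still applies: a user namespace created with a full identity mapping is indistinguishable from the initial one by this test.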

Ok(options)
}

#[cfg(feature = "fuse")]

Collaborator

I think it'd be nicer to have a separate mod fuse with all this stuff so we don't need a lot of #[cfg

}

#[cfg(feature = "fuse")]
fn detect_mount_mode(force_fuse: bool, no_fuse: bool, has_upper: bool) -> MountMode {

Collaborator

instead of the first two bools an enum would be way clearer
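
One way the enum-based signature could look, following the auto-detection rules described in this PR's commit message for the mount CLI. Only the names `detect_mount_mode` and `MountMode` appear in the diff; the variants, `FuseMode`, and `Privileges` are assumptions made for this sketch, and so is the reading that `--upperdir` only affects the choice between the two FUSE modes.

```rust
/// Hypothetical replacement for the `force_fuse` / `no_fuse` bools.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FuseMode {
    Auto,
    Yes,
    No,
}

/// Variant names are assumed; the PR's `MountMode` is not shown in full.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MountMode {
    Kernel,
    FuseOverlay,
    Fuse,
}

/// Privilege facts gathered by the caller (hypothetical struct).
#[derive(Debug, Clone, Copy)]
pub struct Privileges {
    pub root_in_init_userns: bool,
    pub cap_sys_admin: bool,
}

pub fn detect_mount_mode(fuse: FuseMode, privs: Privileges, has_upper: bool) -> MountMode {
    let use_fuse = match fuse {
        FuseMode::Yes => true,
        FuseMode::No => false,
        // Auto: only root in the init user namespace gets the kernel mount.
        FuseMode::Auto => !privs.root_in_init_userns,
    };
    if !use_fuse {
        MountMode::Kernel
    } else if privs.cap_sys_admin || has_upper {
        // Overlay on top of FUSE; an upper dir always implies overlay mode.
        MountMode::FuseOverlay
    } else {
        MountMode::Fuse
    }
}
```

With the enum, the "both flags set" state disappears from the function's input space, so the body no longer needs to pick an arbitrary winner.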

} else if no_fuse {
false
} else {
!(getuid().is_root() && in_init_user_namespace())

Collaborator

Hmm do we really need to check the userns? I think we could just always check for has_cap_sys_admin?

Collaborator

wouldn't that be enough in a rootless environment?
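
For reference, a capability check along these lines can be sketched with the standard library alone by parsing the `CapEff:` line of `/proc/self/status`. The name `has_cap_sys_admin` is borrowed from the comment above, but this body is an assumption, not the PR's code; a real implementation would more likely use a capabilities crate or syscall.

```rust
/// Bit index of CAP_SYS_ADMIN in the Linux capability sets.
const CAP_SYS_ADMIN: u32 = 21;

/// Parses the `CapEff:` line from `/proc/<pid>/status` content and reports
/// whether CAP_SYS_ADMIN is in the effective set. Returns `None` if the
/// line is missing or is not valid hexadecimal.
pub fn cap_eff_has_sys_admin(status: &str) -> Option<bool> {
    let hex = status
        .lines()
        .find_map(|line| line.strip_prefix("CapEff:"))?
        .trim();
    let mask = u64::from_str_radix(hex, 16).ok()?;
    Some((mask & (1u64 << CAP_SYS_ADMIN)) != 0)
}

/// Checks the current process; any read or parse failure counts as "no".
pub fn has_cap_sys_admin() -> bool {
    std::fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|s| cap_eff_has_sys_admin(&s))
        .unwrap_or(false)
}
```

Note that this reports the capability within the current user namespace, which is what matters for mounting overlayfs there; it does not say anything about the initial namespace.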

std::mem::forget(work_fd);
}

clear_cloexec(&image_fd);

Collaborator

Comment on lines +1068 to +1069
#[arg(long, value_parser = ["fuse", "fuse-overlay"])]
mode: String,

Collaborator

Clearer as an enum right?

pub fn run_internal_fuse_serve(args: InternalFuseServeArgs) -> Result<()> {
use std::os::fd::FromRawFd;

let image_fd = unsafe { std::os::fd::OwnedFd::from_raw_fd(args.image_fd) };

Collaborator

Perhaps we could use the socket activation protocol to add a little bit more safety?
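
As a sketch of what that could buy: in the systemd socket activation convention, passed descriptors start at fd 3, `LISTEN_FDS` gives their count, and `LISTEN_PID` must match the receiving process, so a stray environment variable does not cause unrelated fds to be adopted. The helper below is hypothetical and only validates the two variables; it does not touch any descriptors.

```rust
use std::os::fd::RawFd;

/// First fd number used by the socket activation convention.
const LISTEN_FDS_START: RawFd = 3;

/// Validates `LISTEN_PID` / `LISTEN_FDS` values and returns the fd numbers
/// that were passed to this process.
pub fn passed_fds(
    listen_pid: Option<&str>,
    listen_fds: Option<&str>,
    my_pid: u32,
) -> Result<Vec<RawFd>, String> {
    let pid: u32 = listen_pid
        .ok_or("LISTEN_PID is not set")?
        .parse()
        .map_err(|e| format!("bad LISTEN_PID: {e}"))?;
    if pid != my_pid {
        return Err(format!("fds were meant for pid {pid}, not {my_pid}"));
    }
    // u16 keeps the count small enough that the additions cannot overflow.
    let count: u16 = listen_fds
        .ok_or("LISTEN_FDS is not set")?
        .parse()
        .map_err(|e| format!("bad LISTEN_FDS: {e}"))?;
    Ok((0..count).map(|i| LISTEN_FDS_START + RawFd::from(i)).collect())
}
```

In a real server the inputs would come from `std::env::var("LISTEN_PID")`, `std::env::var("LISTEN_FDS")`, and `std::process::id()`; only after this validation would the raw fds be wrapped in `OwnedFd`.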

cgwalters and others added 4 commits June 26, 2026 18:02
Rather than serving a tree we serve directly from an erofs image
passed as an fd. This should allow much less latency at startup, as we
don't have to parse the entire file. It also allows us to use memmap,
which should be safe at least for fs-verity (i.e. readonly) files.

Fuse inodes are erofs nids, except the root inode which is always 1 in fuse.
Fortunately erofs nids can't be 1, so we just map 1 <-> root nid.
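
The mapping described above can be sketched as a pair of small functions. The function names and the sample root nid used when trying them out are illustrative only.

```rust
/// Inode number that FUSE reserves for the filesystem root.
const FUSE_ROOT_ID: u64 = 1;

/// Translates an erofs nid into the inode number reported to FUSE.
pub fn nid_to_ino(nid: u64, root_nid: u64) -> u64 {
    if nid == root_nid {
        FUSE_ROOT_ID
    } else {
        nid
    }
}

/// Translates an inode number received from FUSE back into an erofs nid.
pub fn ino_to_nid(ino: u64, root_nid: u64) -> u64 {
    if ino == FUSE_ROOT_ID {
        root_nid
    } else {
        ino
    }
}
```

The round trip is lossless for every valid nid precisely because, as stated above, no erofs nid equals 1.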

fuse requires the memory to be Send and 'static, which is problematic
for the self-referencing that happens if we store both Image and the
owning buffer in ComposefsFuse. For now we just leak the erofs data
chunk to make it 'static, as we expect the fuse process to keep it
around until exit anyway.
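
A small sketch of the leak-to-'static approach described above. The `Image` type here is a stand-in that merely borrows its bytes; the real type is not shown in this thread, so treat every name in the block as hypothetical.

```rust
/// Stand-in for a parsed image that borrows its backing bytes.
pub struct Image<'a> {
    data: &'a [u8],
}

impl<'a> Image<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Image { data }
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }
}

/// Leaks the buffer so that the returned image borrows for `'static`.
/// The memory is never freed, which is acceptable only because the server
/// is expected to hold it until the process exits.
pub fn leak_image(buf: Vec<u8>) -> Image<'static> {
    let data: &'static [u8] = Box::leak(buf.into_boxed_slice());
    Image::new(data)
}

/// Compile-time check mirroring the `Send + 'static` bound.
pub fn assert_send_static<T: Send + 'static>(_: &T) {}
```

The alternative would be a self-referential struct owning both the buffer and the borrowing view, which safe Rust cannot express directly; leaking sidesteps that at the cost of one allocation that lives for the whole process.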

Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Alexander Larsson <alexl@redhat.com>

Add mount_fuse_overlay() which creates an overlayfs on top of a FUSE
mount, using userxattr mode and data-only lower layers for file content.
The FUSE server must already be running before calling this, since
overlayfs probes the lower layer during setup.

OverlayMountOptions controls the overlay configuration: upper/work dirs
for writable mounts, read-write mode, and fs-verity enforcement.
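
To make the layering concrete, here is a sketch of how such an overlay option string might be assembled. The `lowerdir=<lower>::<data>` syntax for data-only layers and the `userxattr`, `upperdir`, `workdir`, and `verity=require` options are overlayfs mount options; the struct, the function, and the specific combination chosen are assumptions, not the PR's `OverlayMountOptions`. Whether a given kernel accepts this combination is not checked, and path escaping is ignored.

```rust
use std::path::Path;

/// Hypothetical subset of overlay settings used for this sketch.
pub struct OverlaySketch<'a> {
    /// FUSE mount serving the metadata-only tree.
    pub fuse_lower: &'a Path,
    /// Repository objects directory, used as a data-only lower layer.
    pub objects: &'a Path,
    /// Optional (upperdir, workdir) pair for writable mounts.
    pub upper_work: Option<(&'a Path, &'a Path)>,
    /// Whether to require fs-verity on data files.
    pub require_verity: bool,
}

pub fn overlay_options(o: &OverlaySketch) -> String {
    // "::" separates regular lower layers from data-only lower layers.
    let mut opts = format!(
        "lowerdir={}::{}",
        o.fuse_lower.display(),
        o.objects.display()
    );
    // Read user.overlay.* xattrs instead of trusted.overlay.*.
    opts.push_str(",userxattr");
    if let Some((upper, work)) = o.upper_work {
        opts.push_str(&format!(
            ",upperdir={},workdir={}",
            upper.display(),
            work.display()
        ));
    }
    if o.require_verity {
        opts.push_str(",verity=require");
    }
    opts
}
```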

This is needed if using mount_fuse() that doesn't follow redirects.

Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Alexander Larsson <alexl@redhat.com>

Add --fuse/--no-fuse flags to 'cfsctl mount' and 'cfsctl oci mount'
with automatic mount mode detection based on privilege level:

- Root in the init user namespace: kernel composefs mount (default)
- Non-init namespace with CAP_SYS_ADMIN: FUSE with overlayfs
  (kernel reads data directly via data-only lower layer)
- Non-init namespace without CAP_SYS_ADMIN: plain FUSE

The --fuse flag forces FUSE, --no-fuse forces kernel mount, and
omitting both auto-detects. When --upperdir is given, overlay mode
is always used regardless of capabilities.

By default the FUSE server daemonizes by re-executing itself as
--internal-fuse-serve, passing the repo, image, and overlay fds
via inherited file descriptors. The parent waits on a pipe for mount
readiness then returns, matching the kernel mount's fire-and-forget
behaviour. Use --foreground to keep the server in the calling process
(useful for tests and debugging).
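
The readiness handshake can be illustrated with a connected socket pair and a thread standing in for the re-executed server process. This is a simplified model under those assumptions: the actual code uses a pipe and a child process, and none of these function names come from the PR.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;

/// Server side: signals that the mount is ready by writing one byte.
pub fn signal_ready(writer: &mut UnixStream) -> io::Result<()> {
    writer.write_all(&[1])
}

/// Parent side: blocks until the server writes a byte (ready) or closes
/// its end without writing anything (startup failed).
pub fn wait_for_ready(mut reader: UnixStream) -> io::Result<bool> {
    let mut byte = [0u8; 1];
    Ok(reader.read(&mut byte)? == 1)
}

/// Runs `serve` on another thread with the write end and waits for its
/// readiness signal; the thread keeps running after this returns.
pub fn spawn_and_wait<F>(serve: F) -> io::Result<bool>
where
    F: FnOnce(UnixStream) + Send + 'static,
{
    let (reader, writer) = UnixStream::pair()?;
    // Dropping the join handle detaches the thread, like a daemon.
    thread::spawn(move || serve(writer));
    wait_for_ready(reader)
}
```

Treating end-of-file as failure means the parent never hangs if the server dies before mounting, since the write end is closed when the server goes away.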

Init namespace detection reads /proc/self/uid_map for the characteristic
"0 0 4294967295" identity mapping.

The composefs-fuse crate is an optional dependency behind the 'fuse'
feature (on by default). MountOptions gains has_overlay(), read_write(),
and into_overlay() accessors. serve_tree_fuse() gains an optional
ready_fd parameter for signaling mount readiness.

Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Alexander Larsson <alexl@redhat.com>

Add privileged_fuse_dumpfile_roundtrip test that validates the FUSE
implementation by building a synthetic filesystem with diverse content
(directories, inline files, external files, symlinks, xattrs, hardlinks,
character devices, FIFOs), mounting it via `cfsctl mount --fuse
--foreground`, and comparing the dumpfile output against the expected
canonical form.

The test uses --foreground so the FUSE server runs as a child process
that the test can manage directly (kill + unmount on cleanup).

The test also reads external file content from the FUSE mount to verify
the repository object serving path works correctly.

Based-on-work-by: Colin Walters <walters@verbum.org>
Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Alexander Larsson <alexl@redhat.com>
@alexlarsson alexlarsson force-pushed the fuse-with-overlayfs branch from 98989b4 to d82e9ab Compare June 26, 2026 16:03
@alexlarsson

Contributor Author

Ok, completely new fuse reimplementation. Much smaller and should start faster. I'd like to do some more reviewing of it next week, and handle some of the comments above too. But, it looks pretty sweet to me.

Comment thread crates/composefs-fuse/src/lib.rs

@cgwalters cgwalters left a comment

Collaborator

Overall looks good to me; you didn't address a few of my comments, was that intentional?

}

/// Check if an fd has fs-verity enabled, meaning its contents cannot change.
fn is_safe_to_mmap(fd: &impl AsFd) -> bool {

Collaborator

We could argue to move this check into the memmap2 crate, then we wouldn't need any unsafe here.

It's the same with sealed memfds

@alexlarsson

Contributor Author

> Overall looks good to me; you didn't address a few of my comments, was that intentional?

I mean to look at those comments next week, as they were mainly in a different area than this change. I just wanted people to have a look at the new approach earlier.
