Support fuse with overlayfs #328
Maybe some overlap with #306?
@cgwalters I don't think there is necessarily an overlap, except perhaps that the passthrough support is intended to solve the same performance issue (but it is not useful here because it is root-only). They are however complementary, in that the readdir+ and multi-threading will increase metadata performance.
That said, there might be code conflicts between the two; I'll have a look at that.
/// When true, the server follows overlay redirects and serves file
/// content from the repository. When false, it synthesizes
/// `user.overlay.*` xattrs for use as an overlayfs lower layer.
/// Defaults to false.
But actually...I am not sure we considered this at all but - is there any reason at all to have the direct serving? I am not sure there is a strong one...I think the FUSE implementation was kind of intended mainly for unprivileged mounts, but since nowadays overlay is allowed in user namespaces, I think we should automatically set "follow_redirects" = false if running unprivileged right?
Actually to say it a different way - don't we always want to just always use overlayfs, and only have the FUSE for unprivileged EROFS? In that case, when running privileged, we should not use user.overlay right? Or I guess we still can, but there's no reason to?
Honestly I'm not sure here. There is one way that follow_redirects mode is useful, in that it allows any non-root user to mount a composefs image in the "root" user namespace.
Like, if you're in the shell you can just mount a cfs and have some other program look into the mount. You cannot do that with an overlay mount. For that you need to be in a new user namespace (where you have cap_sysadmin) so you can do the mount. And if you create a new user+mnt namespace, programs in the root namespace can't see your mount. So, non-follow_redirects is primarily useful for container-like tools, whereas follow_redirects is for traditional command-line work.
So, I do think it would make sense to support both of these modes. I wonder though if the current composefs fuse implementation is ideal for this kind of use. It starts by parsing the entire erofs into a filesystem tree, which adds quite some latency. An implementation that just reads the erofs file into memory and does metadata lookups directly from that would probably be faster at startup. And, for a metadata-only implementation (non-follow-redirects) that would probably be pretty easy to implement, as all you have to do is read inode info.
I guess that could be later work though.
+1 to make the metadata-only serving the default.
> There is one way that follow_redirects mode is useful, in that it allows any non-root user to mount a composefs image in the "root" user namespace.
Yeah, but I'm not sure of any kind of "production" use case for that. Debugging can equally well just use APIs to inspect things.
follow_redirects is a confusing name, how about --standalone for the non-overlayfs case?
> It starts by parsing the entire erofs into a filesystem tree, which adds quite some latency. An implementation that just reads the erofs file into memory and does metadata lookups directly from that would probably be faster at startup.
Yeah, agreed, we should do it that way.
> It starts by parsing the entire erofs into a filesystem tree, which adds quite some latency. An implementation that just reads the erofs file into memory and does metadata lookups directly from that would probably be faster at startup.
> Yeah, agreed, we should do it that way.
Can we mmap it so multiple containers using the same erofs will share the memory?
At least if the image has fs-verity enabled, because then we can trust that it doesn't change behind our back.
3c4c36b to 588a7f0
Ok, I did some more work on this, picking up a bunch of changes from #306 with some modifications. This now has multi-threaded fuse, mount APIs to support mounting directly as well as via overlayfs with and without userxattrs, as well as verity=require. It also has mount CLI options for image/oci/ostree that mostly do the right thing by default (for example, they will use a non-overlay approach if we don't have cap_sysadmin). I didn't look at the varlink part yet. Also, this is still based on serving a filesystem, rather than directly serving from an erofs image.
588a7f0 to 98989b4
cgwalters left a comment
Cool, overall looking good
fuse: bool,
/// Force kernel mount instead of auto-detecting
#[cfg(feature = "fuse")]
#[arg(long, conflicts_with = "fuse")]
I think an enum is better --fuse=auto|yes|no. Could also have a #[clap(flatten)] shared struct to dedup
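A minimal sketch of the tri-state flag suggested here, using only the standard library. The name `FuseChoice` and the hand-written `FromStr` are assumptions for illustration; in cfsctl this would presumably be wired up through clap's own enum support rather than parsed by hand.

```rust
use std::str::FromStr;

/// Hypothetical tri-state replacement for the `--fuse` / `--no-fuse` pair.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum FuseChoice {
    /// Pick FUSE or a kernel mount based on the caller's privileges.
    #[default]
    Auto,
    /// Always use FUSE.
    Yes,
    /// Always use the kernel mount.
    No,
}

impl FromStr for FuseChoice {
    type Err = String;

    /// Accepts exactly the three spellings proposed in the review comment.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "auto" => Ok(Self::Auto),
            "yes" => Ok(Self::Yes),
            "no" => Ok(Self::No),
            other => Err(format!(
                "invalid --fuse value {other:?}, expected auto|yes|no"
            )),
        }
    }
}
```

With a single enum, the "both flags given" case cannot be expressed at all, so the `conflicts_with` attribute is no longer needed.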
std::fs::read_to_string("/proc/self/uid_map")
    .map(|s| s.trim() == "0 0 4294967295")
I think this is the canonical way to see if you're in the init user namespace. @giuseppe is there something better?
we have a similar function in containers/storage:
// hasFullUsersMappings checks whether the current user namespace has all the IDs mapped.
func hasFullUsersMappings() (bool, error) {
	content, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		return false, err
	}
	// The kernel rejects attempts to create mappings where either starting
	// point is (u32)-1: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/user_namespace.c?id=af3e9579ecfb#n1006 .
	// So, if the uid_map contains 4294967295, the entire IDs space is available in the
	// user namespace, so it is likely the initial user namespace.
	return bytes.Contains(content, []byte("4294967295")), nil
}
it will fail if you do something like: unshare --map-users 0:0:4294967295 ... but I guess that is not a common configuration
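For comparison, here is a sketch of both checks as pure functions over the file content, so they can be tested without touching `/proc`. The function names are hypothetical. One caveat, as far as I know: the kernel prints each uid_map field right-aligned in a 10-character column, so the raw content contains runs of spaces and a trimmed string comparison against "0 0 4294967295" may never match. Comparing whitespace-separated fields avoids that.

```rust
/// Exact check in the spirit of the diff under review: true only when the
/// map consists of the single identity mapping `0 0 4294967295`.
/// Fields are compared token by token, so column padding does not matter.
pub fn is_identity_uid_map(content: &str) -> bool {
    content.split_whitespace().eq(["0", "0", "4294967295"])
}

/// Looser check mirroring the containers/storage helper quoted above:
/// true if the full ID range shows up anywhere in the map.
pub fn has_full_users_mappings(content: &str) -> bool {
    content.contains("4294967295")
}

/// Hypothetical wrapper that reads the real file; any read error is
/// treated as "not the init user namespace".
pub fn in_init_user_namespace() -> bool {
    std::fs::read_to_string("/proc/self/uid_map")
        .map(|s| is_identity_uid_map(&s))
        .unwrap_or(false)
}
```

Both variants share the limitation mentioned above: a namespace created with a full identity mapping looks the same as the init namespace.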
    Ok(options)
}

#[cfg(feature = "fuse")]
I think it'd be nicer to have a separate mod fuse with all this stuff so we don't need a lot of #[cfg
}

#[cfg(feature = "fuse")]
fn detect_mount_mode(force_fuse: bool, no_fuse: bool, has_upper: bool) -> MountMode {
instead of the first two bools an enum would be way clearer
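A sketch of what the enum-based signature could look like. The decision rules are taken from the commit message for this change (kernel mount for root in the init namespace, FUSE with overlayfs when CAP_SYS_ADMIN is available or an upper dir is given, plain FUSE otherwise). The types `FuseChoice` and `Caller`, the `MountMode` variants, and the behaviour of a forced `Yes` are assumptions, not the PR's actual code.

```rust
/// Hypothetical replacement for the `force_fuse` / `no_fuse` bools.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FuseChoice {
    Auto,
    Yes,
    No,
}

/// Assumed shape of the result; the real `MountMode` may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MountMode {
    Kernel,
    FuseOverlay,
    Fuse,
}

/// Hypothetical summary of the caller's privileges, gathered elsewhere.
#[derive(Debug, Clone, Copy)]
pub struct Caller {
    /// Root in the init user namespace.
    pub init_ns_root: bool,
    /// CAP_SYS_ADMIN in the current user namespace.
    pub cap_sys_admin: bool,
}

pub fn detect_mount_mode(choice: FuseChoice, caller: Caller, has_upper: bool) -> MountMode {
    let use_fuse = match choice {
        FuseChoice::Yes => true,
        FuseChoice::No => false,
        // Auto: only fully privileged callers get the kernel mount.
        FuseChoice::Auto => !caller.init_ns_root,
    };
    if !use_fuse {
        MountMode::Kernel
    } else if has_upper || caller.cap_sys_admin {
        // An upper dir always implies overlay mode.
        MountMode::FuseOverlay
    } else {
        MountMode::Fuse
    }
}
```

The `match` makes the three-way choice exhaustive, so adding a variant later forces every caller to be revisited.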
} else if no_fuse {
    false
} else {
    !(getuid().is_root() && in_init_user_namespace())
Hmm do we really need to check the userns? I think we could just always check for has_cap_sys_admin?
wouldn't that be enough in a rootless environment?
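For reference, a capability check of the kind discussed here can be sketched by parsing the `CapEff` line of `/proc/self/status`. The function names are hypothetical, and real code would more likely use a capability crate than parse procfs by hand. Note that this only reports CAP_SYS_ADMIN relative to the current user namespace, which is exactly the distinction the question above is about.

```rust
/// Bit index of CAP_SYS_ADMIN in the Linux capability sets.
const CAP_SYS_ADMIN: u32 = 21;

/// Parse the `CapEff:` line of a `/proc/<pid>/status` dump and report
/// whether CAP_SYS_ADMIN is in the effective set. Returns `None` if the
/// line is missing or is not valid hexadecimal.
pub fn status_has_cap_sys_admin(status: &str) -> Option<bool> {
    let hex = status
        .lines()
        .find_map(|line| line.strip_prefix("CapEff:"))?
        .trim();
    let mask = u64::from_str_radix(hex, 16).ok()?;
    Some((mask & (1u64 << CAP_SYS_ADMIN)) != 0)
}

/// Hypothetical wrapper reading the current process; any failure is
/// treated as "capability not held".
pub fn has_cap_sys_admin() -> bool {
    std::fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|s| status_has_cap_sys_admin(&s))
        .unwrap_or(false)
}
```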
    std::mem::forget(work_fd);
}

clear_cloexec(&image_fd);
I think https://docs.rs/cap-std-ext/latest/cap_std_ext/cmdext/struct.CmdFds.html is better than this
#[arg(long, value_parser = ["fuse", "fuse-overlay"])]
mode: String,
Clearer as an enum right?
pub fn run_internal_fuse_serve(args: InternalFuseServeArgs) -> Result<()> {
    use std::os::fd::FromRawFd;

    let image_fd = unsafe { std::os::fd::OwnedFd::from_raw_fd(args.image_fd) };
Perhaps we could use the socket activation protocol to add a little bit more safety?
Rather than serving a tree we serve directly from an erofs image passed as an fd. This should allow much less latency at startup, as we don't have to parse the entire file. It also allows us to use memmap, which should be safe at least for fs-verity (i.e. readonly) files.

Fuse inodes are erofs nids, except the root inode which is always 1 in fuse. Fortunately erofs nids can't be 1, so we just map 1 <-> root nid.

fuse requires the memory to be Send and 'static, which is problematic for the self-referencing that happens if we store both Image and the owning buffer in ComposefsFuse. For now we just leak the erofs data chunk to make it 'static, as we expect the fuse process to keep it around until exit anyway.

Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Alexander Larsson <alexl@redhat.com>
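The inode mapping and the buffer leak described in this commit message can be sketched as follows. `InoMap` and `leak_image` are hypothetical names, and the sketch uses a plain `Vec<u8>` where the commit uses a memory map; it only illustrates the two ideas, not the actual ComposefsFuse code.

```rust
/// FUSE reserves inode number 1 for the root directory.
const FUSE_ROOT_INO: u64 = 1;

/// Translates between FUSE inode numbers and erofs nids.
/// Relies on the commit message's observation that no erofs nid is 1,
/// so swapping 1 and the root nid cannot collide with another inode.
pub struct InoMap {
    root_nid: u64,
}

impl InoMap {
    pub fn new(root_nid: u64) -> Self {
        Self { root_nid }
    }

    /// FUSE inode number -> erofs nid.
    pub fn nid_from_ino(&self, ino: u64) -> u64 {
        if ino == FUSE_ROOT_INO {
            self.root_nid
        } else {
            ino
        }
    }

    /// erofs nid -> FUSE inode number.
    pub fn ino_from_nid(&self, nid: u64) -> u64 {
        if nid == self.root_nid {
            FUSE_ROOT_INO
        } else {
            nid
        }
    }
}

/// Leak an owned buffer to obtain a `'static` slice. The memory is never
/// freed, which is acceptable only because the server keeps the image
/// until the process exits.
pub fn leak_image(data: Vec<u8>) -> &'static [u8] {
    Box::leak(data.into_boxed_slice())
}
```

Leaking trades a one-time, process-lifetime allocation for not having to express a self-referential struct, which is the design choice the commit message describes.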
Add mount_fuse_overlay() which creates an overlayfs on top of a FUSE mount, using userxattr mode and data-only lower layers for file content. The FUSE server must already be running before calling this, since overlayfs probes the lower layer during setup.

OverlayMountOptions controls the overlay configuration: upper/work dirs for writable mounts, read-write mode, and fs-verity enforcement.

This is needed if using mount_fuse() that doesn't follow redirects.

Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Add --fuse/--no-fuse flags to 'cfsctl mount' and 'cfsctl oci mount' with automatic mount mode detection based on privilege level:

- Root in the init user namespace: kernel composefs mount (default)
- Non-init namespace with CAP_SYS_ADMIN: FUSE with overlayfs (kernel reads data directly via data-only lower layer)
- Non-init namespace without CAP_SYS_ADMIN: plain FUSE

The --fuse flag forces FUSE, --no-fuse forces kernel mount, and omitting both auto-detects. When --upperdir is given, overlay mode is always used regardless of capabilities.

By default the FUSE server daemonizes by re-executing itself as --internal-fuse-serve, passing the repo, image, and overlay fds via inherited file descriptors. The parent waits on a pipe for mount readiness then returns, matching the kernel mount's fire-and-forget behaviour. Use --foreground to keep the server in the calling process (useful for tests and debugging).

Init namespace detection reads /proc/self/uid_map for the characteristic "0 0 4294967295" identity mapping.

The composefs-fuse crate is an optional dependency behind the 'fuse' feature (on by default).

MountOptions gains has_overlay(), read_write(), and into_overlay() accessors. serve_tree_fuse() gains an optional ready_fd parameter for signaling mount readiness.

Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Alexander Larsson <alexl@redhat.com>
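The readiness handshake in this commit message (parent blocks until the server reports that the mount is up) can be illustrated with a small sketch. This is an assumption-laden stand-in: it uses a `UnixStream` pair from the standard library where the commit uses a pipe and a `ready_fd`, and `signal_ready` / `wait_ready` are hypothetical names.

```rust
use std::io::{self, Read, Write};
use std::os::unix::net::UnixStream;

/// Server side: report that the mount is ready by writing one byte.
pub fn signal_ready(mut ready: UnixStream) -> io::Result<()> {
    ready.write_all(&[1])
}

/// Parent side: block until the server reports readiness.
/// Returns `Ok(false)` if the other end was closed without a signal,
/// for example because the server exited before mounting.
pub fn wait_ready(mut ready: UnixStream) -> io::Result<bool> {
    let mut buf = [0u8; 1];
    Ok(ready.read(&mut buf)? == 1)
}
```

Treating end-of-file as failure means the parent does not hang forever when the server dies early, since the kernel closes the server's end on exit.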
Add privileged_fuse_dumpfile_roundtrip test that validates the FUSE implementation by building a synthetic filesystem with diverse content (directories, inline files, external files, symlinks, xattrs, hardlinks, character devices, FIFOs), mounting it via `cfsctl mount --fuse --foreground`, and comparing the dumpfile output against the expected canonical form.

The test uses --foreground so the FUSE server runs as a child process that the test can manage directly (kill + unmount on cleanup). The test also reads external file content from the FUSE mount to verify the repository object serving path works correctly.

Based-on-work-by: Colin Walters <walters@verbum.org>
Assisted-by: Claude Code (Opus 4.6)
Signed-off-by: Alexander Larsson <alexl@redhat.com>
98989b4 to d82e9ab
Ok, completely new fuse reimplementation. Much smaller and should start faster. I'd like to do some more reviewing of it next week, and handle some of the comments above too. But, it looks pretty sweet to me.
cgwalters left a comment
Overall looks good to me; you didn't address a few of my comments, was that intentional?
}

/// Check if an fd has fs-verity enabled, meaning its contents cannot change.
fn is_safe_to_mmap(fd: &impl AsFd) -> bool {
We could argue to move this check into the memmap2 crate, then we wouldn't need any unsafe here.
It's the same with sealed memfds
I mean to look at those comments next week, as they were mainly in a different area than this change. I just wanted people to have a look at the new approach earlier.
This adds support for a fuse version that only serves the erofs metadata, and helpers to mount an overlayfs on top of it using userxattrs. This allows rootless mounting of a composefs image where file content (but not metadata) is accessed directly from the regular filesystem by the kernel, and it should perform similarly to the rootful overlayfs mount.