Sandbox tools (bwrap, etc.) call unshare/setns/mount/pivot_root and
write to /proc/self/{uid,gid}_map to set up a namespace; under proot
these all fail because we have no real namespace. Pretend they
succeed and use proot's binding system to make the resulting paths
actually accessible.
- syscall/enter.c: void unshare/setns/umount; turn mount into a
runtime binding (bind/proc/sysfs/tmpfs); turn pivot_root into a
root-binding swap that re-exposes the old root at put_old;
redirect open of /proc/<pid|self>/{uid_map,gid_map,setgroups}
to /dev/null; also fix prctl(PR_SET_DUMPABLE) to actually
return 0 (previously leaked -ENOSYS).
- syscall/seccomp.c: filter PR_unshare/PR_setns so the handlers
above run under seccomp mode 2.
- path/canon.c: when a symlink's guest path aliases /proc via a
binding (e.g. /oldroot/proc/self), route through readlink_proc
so "self" resolves to the tracee's pid, not proot's.
- extension/mountinfo: append synthesized lines for runtime
bindings so bwrap's parse_mountinfo finds the mounts it just
asked for.
- path/temp.c: skip chmod on symlinks during temp-dir cleanup;
bwrap leaves /dev/{stdin,fd,...} symlinks pointing at
/proc/self/fd/N inside the emulated tmpfs dirs.
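The /proc/<pid|self>/{uid_map,gid_map,setgroups} redirection described for syscall/enter.c above comes down to a path match. A minimal sketch of such a matcher follows; the name is_idmap_path() is hypothetical and not taken from the patch, and the real handler presumably works on the canonicalized guest path rather than a raw string.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): return true when `path` is
 * /proc/self/<name> or /proc/<digits>/<name>, where <name> is one of
 * uid_map, gid_map or setgroups. A caller would then substitute
 * /dev/null for the path being opened.
 */
static bool is_idmap_path(const char *path)
{
    static const char *const names[] = { "uid_map", "gid_map", "setgroups" };
    const char *p;
    size_t i;

    if (strncmp(path, "/proc/", 6) != 0)
        return false;
    p = path + 6;

    if (strncmp(p, "self/", 5) == 0) {
        p += 5;
    } else {
        /* Require at least one digit, then a slash. */
        if (!isdigit((unsigned char)*p))
            return false;
        while (isdigit((unsigned char)*p))
            p++;
        if (*p != '/')
            return false;
        p++;
    }

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        if (strcmp(p, names[i]) == 0)
            return true;
    return false;
}
```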
bubblewrap calls clone(CLONE_NEWNS|SIGCHLD) directly (without going through unshare) and the Android kernel rejects it with EPERM when unprivileged user namespaces are disabled. Drop the namespace flags before the kernel sees them so the fork/thread itself still succeeds; PRoot keeps tracking the child via PTRACE_EVENT_CLONE. clone3 takes its flags from a struct clone_args in tracee memory, so read/write via peek_word/poke_word. Add both syscalls to the seccomp filter so the enter handler runs under seccomp mode 2.
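The flag stripping for clone/clone3 can be sketched as a plain mask operation. This is an illustration only: strip_namespace_flags() and the exact set of CLONE_NEW* flags in NAMESPACE_FLAGS are assumptions, not names or choices confirmed by the patch, and the real code reads and writes the flags word in tracee registers or memory (peek_word/poke_word) rather than on a local value.

```c
#include <linux/sched.h>   /* CLONE_* flag definitions */
#include <signal.h>        /* SIGCHLD */
#include <stdint.h>

/* Assumed set of namespace flags to drop; the patch may use a different set. */
#define NAMESPACE_FLAGS ((uint64_t)(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | \
                                    CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET | \
                                    CLONE_NEWCGROUP))

/*
 * Hypothetical helper: clear every namespace flag but keep everything
 * else (exit signal, CLONE_VM, CLONE_THREAD, ...), so the fork or
 * thread creation itself still goes through.
 */
static uint64_t strip_namespace_flags(uint64_t flags)
{
    return flags & ~NAMESPACE_FLAGS;
}
```

For clone the flags come from a syscall argument register; for clone3 the same mask would be applied to the flags field of struct clone_args before writing the word back.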
On some aarch64 kernels (notably Android) the SYSCALL_AVOIDER trick leaks -ENOSYS through to the tracee even though set_sysnum() ran: chdir under PROOT_NO_SECCOMP and bwrap's mount(NULL, "/", ...) both returned "Function not implemented". Add an exit-stage poke so the result is always 0 for these emulated syscalls, regardless of how the kernel handled SYSCALL_AVOIDER. Requires the syscalls to be filtered with FILTER_SYSEXIT under seccomp mode so the exit handler actually runs. Also poke SYSARG_RESULT=0 at enter for getcwd/chdir/fchdir, mirroring what mount/unshare/etc. already do.
Android's parent process installs a system-wide seccomp filter that traps mount/umount/pivot_root/unshare/setns with SIGSYS. Our regular sysenter handlers in enter.c never run for those syscalls because the kernel sends SIGSYS instead of executing the call, so bwrap was getting -ENOSYS from the SIGSYS handler's default branch. Add cases in handle_seccomp_event_common that pretend the syscall succeeded (mirroring what enter.c does), and apply the mount / pivot_root binding emulation so sandbox helpers like bubblewrap see the bindings they expect. The emulation helpers in enter.c are factored out into apply_emulated_mount() / apply_emulated_pivot_root() so the SIGSYS handler and the normal enter path share the same code.
bubblewrap reads /oldroot/proc/self/fd/<N> to verify the mount it just asked for. With only a single /oldroot binding pointing at the previous rootfs host path, /oldroot/proc resolved to <rootfs>/proc on the host (empty), not the real /proc, so the readlink failed. After installing the put_old binding, walk the existing non-root bindings and add a parallel <put_old>/<guest> binding for each. The host /proc bound at /proc thus also becomes reachable at /oldroot/proc, which is what bwrap (and similar sandbox helpers) expects.
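The parallel <put_old>/<guest> bindings amount to prefixing each existing guest path with put_old. A rough sketch of that path construction, assuming both paths are already canonical and absolute; mirror_under_put_old() is a hypothetical name, and the real code would register a binding instead of only formatting a string.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the guest path at which an existing
 * non-root binding should also appear after pivot_root.
 * Returns 0 on success, -1 for the root binding (covered by the
 * put_old binding itself) or when `out` is too small.
 */
static int mirror_under_put_old(const char *put_old, const char *guest,
                                char *out, size_t size)
{
    int n;

    if (strcmp(guest, "/") == 0)
        return -1;

    n = snprintf(out, size, "%s%s", put_old, guest);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

With put_old "/oldroot" and an existing binding at "/proc", this yields "/oldroot/proc", the path bwrap reads.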
Two related fixes for the bubblewrap-on-PRoot emulation:
1. Subsequent bwrap runs in the same shell were failing with "Creating newroot failed: No such file or directory" because the bindings added by the previous bwrap leaked into the parent. bubblewrap clones with CLONE_NEWNS (which we strip); remember that on the tracee, and in new_child() deep-copy the binding tree so emulated mount(2) calls in the child don't propagate back to the parent.
2. umount of a runtime bind was a silent no-op. Add emulate_umount() that removes the matching binding when its guest path exactly equals the unmount target, and call it from both the regular sysenter handler and the SIGSYS handler.
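Both fixes can be illustrated on a toy binding list. This is only a sketch under simplifying assumptions: struct binding here holds just a guest path in a singly linked list, whereas PRoot's real binding structures are richer, and every function name below is hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a binding: only the guest path is kept. */
struct binding {
    char *guest;
    struct binding *next;
};

/* Insert a binding at *head; returns false on allocation failure. */
static bool add_binding(struct binding **head, const char *guest)
{
    size_t len = strlen(guest) + 1;
    struct binding *b = malloc(sizeof(*b));

    if (b == NULL)
        return false;
    b->guest = malloc(len);
    if (b->guest == NULL) {
        free(b);
        return false;
    }
    memcpy(b->guest, guest, len);
    b->next = *head;
    *head = b;
    return true;
}

static void free_bindings(struct binding *head)
{
    while (head != NULL) {
        struct binding *next = head->next;
        free(head->guest);
        free(head);
        head = next;
    }
}

/* Deep copy, preserving order, so the child owns its own list.
 * Returns NULL for an empty list and also on allocation failure. */
static struct binding *copy_bindings(const struct binding *src)
{
    struct binding *head = NULL;
    struct binding **tail = &head;

    for (; src != NULL; src = src->next) {
        if (!add_binding(tail, src->guest)) {
            free_bindings(head);
            return NULL;
        }
        tail = &(*tail)->next;
    }
    return head;
}

/* Remove the binding whose guest path exactly equals `target`. */
static bool remove_binding(struct binding **head, const char *target)
{
    struct binding **link;

    for (link = head; *link != NULL; link = &(*link)->next) {
        if (strcmp((*link)->guest, target) == 0) {
            struct binding *dead = *link;
            *link = dead->next;
            free(dead->guest);
            free(dead);
            return true;
        }
    }
    return false;
}

static size_t count_bindings(const struct binding *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->next)
        n++;
    return n;
}
```

Because the child works on its own copy, removing or adding entries there leaves the parent's list untouched, which is the isolation the first fix is after.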
Previously emulated mount of fstype "devtmpfs" or "devpts" got the same empty-tmpdir treatment as "tmpfs", which meant the tracee saw an empty directory instead of any real device. Bind the host /dev (for devtmpfs) and /dev/pts (for devpts) instead, so things like opening /dev/null or a pty inside the sandbox actually work.
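The fstype handling can be pictured as a small lookup. A minimal sketch, assuming the host paths named in the message; host_source_for_fstype() is a hypothetical name, and returning NULL stands for "fall back to the empty temporary directory used for tmpfs".

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the host path to bind for an emulated
 * mount of `fstype`, or NULL when an empty temporary directory should
 * be used instead (the existing tmpfs behaviour).
 */
static const char *host_source_for_fstype(const char *fstype)
{
    if (strcmp(fstype, "devtmpfs") == 0)
        return "/dev";
    if (strcmp(fstype, "devpts") == 0)
        return "/dev/pts";
    return NULL;
}
```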
Bubblewrap's --unshare-net path calls loopback_setup(), which:
1. if_nametoindex("lo") -> ioctl(SIOCGIFINDEX, {ifr_name="lo"})
2. socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE)
3. bind() the netlink socket
4. sendto/recv RTM_NEWADDR + RTM_NEWLINK
On Android the underlying syscalls return EACCES because the real
caller lacks CAP_NET_ADMIN. We can't really set the loopback up but
we can make bwrap think we did.
- ioctl(SIOCGIFINDEX) for "lo" is intercepted and filled with index 1.
- socket(AF_NETLINK, ...) is silently rewritten to
socket(AF_UNIX, SOCK_DGRAM, ...). The resulting fd is tracked on
the tracee.
- bind/sendto/recvfrom on a tracked fd is voided. sendto records the
request's nlmsg_seq, recvfrom writes back a synthesised NLMSG_ERROR
reply with error=0, nlmsg_seq from the request, and nlmsg_pid set
to the tracee's pid (bwrap checks both).
- close() on a tracked fd removes it from the set so a reused fd
number doesn't keep being intercepted.
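The synthesised reply described above could be built roughly as follows. This is a sketch, not the patch's code: struct nl_ack and build_netlink_ack() are hypothetical names, and the real handler writes the bytes into the tracee's recvfrom buffer rather than into a local struct.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <stdint.h>
#include <string.h>

/* An NLMSG_ERROR message: netlink header followed by struct nlmsgerr. */
struct nl_ack {
    struct nlmsghdr hdr;
    struct nlmsgerr err;
};

/*
 * Hypothetical helper: fill `ack` with a success acknowledgement for
 * `request`. error = 0 means "no error"; the sequence number is copied
 * from the request and nlmsg_pid is set to the tracee's pid, the two
 * fields bwrap is said to check.
 */
static void build_netlink_ack(struct nl_ack *ack,
                              const struct nlmsghdr *request,
                              uint32_t tracee_pid)
{
    memset(ack, 0, sizeof(*ack));
    ack->hdr.nlmsg_len = NLMSG_LENGTH(sizeof(struct nlmsgerr));
    ack->hdr.nlmsg_type = NLMSG_ERROR;
    ack->hdr.nlmsg_seq = request->nlmsg_seq;
    ack->hdr.nlmsg_pid = tracee_pid;
    ack->err.error = 0;
    ack->err.msg = *request;   /* echo the original request header */
}
```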
Previously socket(AF_NETLINK, ...) was unconditionally rewritten to socket(AF_UNIX, SOCK_DGRAM, 0), which broke legitimate netlink users inside the rootfs (e.g. c-ares under dnf would observe a zero-byte recvmsg and abort with "Unexpected netlink response of size 0 on descriptor N (address family 1)"). The rewrite is only needed where the kernel refuses AF_NETLINK outright (typical on Android/Termux). Probe socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) once at first use and cache the outcome; only fall back to the AF_UNIX emulation when the probe fails. When the probe succeeds, no fd is tracked, so the dependent bind/sendto/recv intercepts stay inert.
Pulls in <unistd.h> for close(2) so the netlink probe builds without the implicit-declaration warning. While here, replace the magic -1/0/1 cache values with a named enum, fix the comment to call out the real reasons AF_NETLINK gets denied (SELinux, inherited seccomp, hardened containers) rather than implying it is always seccomp, and emit a VERBOSE note the first time the probe falls back so users can tell from -v output whether the AF_UNIX shim is active.
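The probe-once-and-cache logic with a named enum might look like the sketch below. The enum and function names are made up for illustration and are not taken from the patch; the VERBOSE note on first fallback is left out.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

/* Hypothetical names standing in for the former -1/0/1 cache values. */
enum nl_probe {
    NL_PROBE_UNKNOWN,   /* not probed yet */
    NL_PROBE_OK,        /* kernel allows AF_NETLINK: no emulation needed */
    NL_PROBE_DENIED     /* socket() failed: use the AF_UNIX shim */
};

static enum nl_probe nl_probe_cache = NL_PROBE_UNKNOWN;

/*
 * Try to open a NETLINK_ROUTE socket once and remember the outcome.
 * Later calls return the cached value without another syscall.
 */
static enum nl_probe probe_netlink(void)
{
    int fd;

    if (nl_probe_cache != NL_PROBE_UNKNOWN)
        return nl_probe_cache;

    fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (fd >= 0) {
        close(fd);
        nl_probe_cache = NL_PROBE_OK;
    } else {
        nl_probe_cache = NL_PROBE_DENIED;
    }
    return nl_probe_cache;
}
```

A caller would rewrite socket(AF_NETLINK, ...) and start tracking the fd only when probe_netlink() returns NL_PROBE_DENIED.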
Aims to make it possible to run programs under bwrap, which is used by Glycin (gdk-pixbuf).
Support for bwrap's --unshare-net and --unshare-all was initially missing; this is now fixed.