Hijack namespace requests + other fixes to make bwrap working #359

Open: sylirre wants to merge 10 commits into master from bwrap-fix

sylirre commented May 23, 2026

Aims to make it possible to run programs under bwrap (bubblewrap), which is used by Glycin (gdk-pixbuf).

[Screenshot: Screenshot_20260523-210908_Termux]

Update: bwrap's --unshare-net and --unshare-all were initially unsupported; this is now fixed.

sylirre added 6 commits May 23, 2026 18:01
Sandbox tools (bwrap, etc.) call unshare/setns/mount/pivot_root and
write to /proc/self/{uid,gid}_map to set up a namespace; under proot
these all fail because we have no real namespace.  Pretend they
succeed and use proot's binding system to make the resulting paths
actually accessible.

- syscall/enter.c: void unshare/setns/umount; turn mount into a
  runtime binding (bind/proc/sysfs/tmpfs); turn pivot_root into a
  root-binding swap that re-exposes the old root at put_old;
  redirect open of /proc/<pid|self>/{uid_map,gid_map,setgroups}
  to /dev/null; also fix prctl(PR_SET_DUMPABLE) to actually
  return 0 (previously leaked -ENOSYS).
- syscall/seccomp.c: filter PR_unshare/PR_setns so the handlers
  above run under seccomp mode 2.
- path/canon.c: when a symlink's guest path aliases /proc via a
  binding (e.g. /oldroot/proc/self), route through readlink_proc
  so "self" resolves to the tracee's pid, not proot's.
- extension/mountinfo: append synthesized lines for runtime
  bindings so bwrap's parse_mountinfo finds the mounts it just
  asked for.
- path/temp.c: skip chmod on symlinks during temp-dir cleanup;
  bwrap leaves /dev/{stdin,fd,...} symlinks pointing at
  /proc/self/fd/N inside the emulated tmpfs dirs.
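
As an illustration of the /proc/<pid|self>/{uid_map,gid_map,setgroups} redirect described above, here is a minimal sketch of the path check. The helper names is_userns_setup_file() and redirect_open() are hypothetical and are not taken from the patch; the real handler works on tracee paths inside syscall/enter.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true when `path` is exactly
 * /proc/<pid|self>/{uid_map,gid_map,setgroups}. */
static bool is_userns_setup_file(const char *path)
{
    static const char *const names[] = { "uid_map", "gid_map", "setgroups" };
    const char *p;
    size_t i;

    assert(path != NULL);

    if (strncmp(path, "/proc/", 6) != 0)
        return false;
    p = path + 6;

    if (strncmp(p, "self/", 5) == 0) {
        p += 5;
    } else {
        /* Otherwise expect a non-empty run of digits followed by '/'. */
        const char *digits = p;
        while (isdigit((unsigned char)*p))
            p++;
        if (p == digits || *p != '/')
            return false;
        p++;
    }

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
        if (strcmp(p, names[i]) == 0)
            return true;
    return false;
}

/* Path an emulated open() would use instead of the requested one. */
static const char *redirect_open(const char *path)
{
    return is_userns_setup_file(path) ? "/dev/null" : path;
}
```

With this sketch, redirect_open("/proc/self/uid_map") yields "/dev/null", while unrelated entries such as "/proc/self/status" are passed through unchanged.
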

bubblewrap calls clone(CLONE_NEWNS|SIGCHLD) directly (without going
through unshare) and the Android kernel rejects it with EPERM when
unprivileged user namespaces are disabled.  Drop the namespace
flags before the kernel sees them so the fork/thread itself still
succeeds; PRoot keeps tracking the child via PTRACE_EVENT_CLONE.

clone3 takes its flags from a struct clone_args in tracee memory,
so read/write via peek_word/poke_word.  Add both syscalls to the
seccomp filter so the enter handler runs under seccomp mode 2.

On some aarch64 kernels (notably Android) the SYSCALL_AVOIDER trick
leaks -ENOSYS through to the tracee even though set_sysnum() ran:
chdir under PROOT_NO_SECCOMP and bwrap's mount(NULL, "/", ...) both
returned "Function not implemented".  Add an exit-stage poke so the
result is always 0 for these emulated syscalls, regardless of how
the kernel handled SYSCALL_AVOIDER.

Requires the syscalls to be filtered with FILTER_SYSEXIT under
seccomp mode so the exit handler actually runs.

Also poke SYSARG_RESULT=0 at enter for getcwd/chdir/fchdir, mirroring
what mount/unshare/etc. already do.

Android's parent process installs a system-wide seccomp filter that
traps mount/umount/pivot_root/unshare/setns with SIGSYS.  Our regular
sysenter handlers in enter.c never run for those syscalls because the
kernel sends SIGSYS instead of executing the call, so bwrap was
getting -ENOSYS from the SIGSYS handler's default branch.

Add cases in handle_seccomp_event_common that pretend the syscall
succeeded (mirroring what enter.c does), and apply the mount /
pivot_root binding emulation so sandbox helpers like bubblewrap see
the bindings they expect.

The emulation helpers in enter.c are factored out into
apply_emulated_mount() / apply_emulated_pivot_root() so the SIGSYS
handler and the normal enter path share the same code.

bubblewrap reads /oldroot/proc/self/fd/<N> to verify the mount it
just asked for.  With only a single /oldroot binding pointing at the
previous rootfs host path, /oldroot/proc resolved to
<rootfs>/proc on the host (empty), not the real /proc, so the
readlink failed.

After installing the put_old binding, walk the existing non-root
bindings and add a parallel <put_old>/<guest> binding for each.  The
host /proc bound at /proc thus also becomes reachable at
/oldroot/proc, which is what bwrap (and similar sandbox helpers)
expects.
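
To make the mirroring step concrete, here is a small sketch that derives the parallel <put_old>/<guest> path for one binding. mirror_under_put_old() is a hypothetical helper, and it assumes both paths are already canonical (absolute, no trailing slash); PRoot's real binding code is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<put_old><guest>" into `out`.
 * Returns 0 on success, -1 for the root binding or if `out` is too small. */
static int mirror_under_put_old(const char *put_old, const char *guest,
                                char *out, size_t size)
{
    int n;

    assert(put_old != NULL && guest != NULL && out != NULL);

    /* The root itself is already covered by the put_old binding. */
    if (strcmp(guest, "/") == 0)
        return -1;

    n = snprintf(out, size, "%s%s", put_old, guest);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Called with put_old "/oldroot" and guest "/proc", the helper produces "/oldroot/proc"; a caller would then bind the same host path there as well.
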

Two related fixes for the bubblewrap-on-PRoot emulation:

1. Subsequent bwrap runs in the same shell were failing with
   "Creating newroot failed: No such file or directory" because the
   bindings added by the previous bwrap leaked into the parent.
   bubblewrap clones with CLONE_NEWNS (which we strip); remember
   that on the tracee, and in new_child() deep-copy the binding
   tree so emulated mount(2) calls in the child don't propagate
   back to the parent.

2. umount of a runtime bind was a silent no-op.  Add
   emulate_umount() that removes the matching binding when its
   guest path exactly equals the unmount target, and call it from
   both the regular sysenter handler and the SIGSYS handler.
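
Both fixes can be modelled on a toy binding list. The sketch below is an assumption-laden simplification: PRoot's real bindings are not a plain singly linked list, and struct binding, bindings_copy() and remove_exact_binding() are hypothetical names. It shows only the two properties the text relies on: the child gets its own copy, and an unmount removes a binding only on an exact guest-path match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct binding {
    char *host;
    char *guest;
    struct binding *next;
};

static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

static void bindings_free(struct binding *head)
{
    while (head != NULL) {
        struct binding *next = head->next;
        free(head->host);
        free(head->guest);
        free(head);
        head = next;
    }
}

static struct binding *binding_new(const char *host, const char *guest)
{
    struct binding *b = calloc(1, sizeof(*b));
    if (b == NULL)
        return NULL;
    b->host = dup_string(host);
    b->guest = dup_string(guest);
    if (b->host == NULL || b->guest == NULL) {
        bindings_free(b);
        return NULL;
    }
    return b;
}

/* Deep copy: the child's list shares no nodes or strings with the parent. */
static struct binding *bindings_copy(const struct binding *head)
{
    struct binding *copy = NULL, **tail = &copy;

    for (; head != NULL; head = head->next) {
        *tail = binding_new(head->host, head->guest);
        if (*tail == NULL) {
            bindings_free(copy);
            return NULL;
        }
        tail = &(*tail)->next;
    }
    return copy;
}

/* Removes the first binding whose guest path equals `target` exactly. */
static bool remove_exact_binding(struct binding **head, const char *target)
{
    assert(head != NULL && target != NULL);

    for (; *head != NULL; head = &(*head)->next) {
        if (strcmp((*head)->guest, target) == 0) {
            struct binding *gone = *head;
            *head = gone->next;
            gone->next = NULL;
            bindings_free(gone);
            return true;
        }
    }
    return false;
}
```

Because the child works on its own copy, removing or adding bindings there leaves the parent's list untouched, which is the isolation the first fix is after.
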

Previously emulated mount of fstype "devtmpfs" or "devpts" got the
same empty-tmpdir treatment as "tmpfs", which meant the tracee saw
an empty directory instead of any real device.  Bind the host
/dev (for devtmpfs) and /dev/pts (for devpts) instead, so things
like opening /dev/null or a pty inside the sandbox actually work.
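
The fstype dispatch can be sketched as a small lookup. Only the devtmpfs, devpts and tmpfs rows follow directly from the description above; the host paths for "proc" and "sysfs" are assumptions, and pick_backing() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum mount_backing {
    MOUNT_BIND_HOST,     /* bind an existing host path */
    MOUNT_EMPTY_TMPDIR,  /* back the mount with a fresh empty temp dir */
    MOUNT_UNHANDLED      /* fstype not covered by this sketch */
};

/* Chooses how an emulated mount of `fstype` is backed. On
 * MOUNT_BIND_HOST, *host_path is the host path to bind; otherwise NULL. */
static enum mount_backing pick_backing(const char *fstype,
                                       const char **host_path)
{
    static const struct {
        const char *fstype;
        const char *host;
    } host_backed[] = {
        { "devtmpfs", "/dev" },
        { "devpts",   "/dev/pts" },
        { "proc",     "/proc" },  /* assumption, not stated in the text */
        { "sysfs",    "/sys" },   /* assumption, not stated in the text */
    };
    size_t i;

    assert(fstype != NULL && host_path != NULL);
    *host_path = NULL;

    for (i = 0; i < sizeof(host_backed) / sizeof(host_backed[0]); i++) {
        if (strcmp(fstype, host_backed[i].fstype) == 0) {
            *host_path = host_backed[i].host;
            return MOUNT_BIND_HOST;
        }
    }
    if (strcmp(fstype, "tmpfs") == 0)
        return MOUNT_EMPTY_TMPDIR;
    return MOUNT_UNHANDLED;
}
```

Keeping the mapping in a table makes it easy to see which filesystem types expose real host content and which ones start out empty.
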

Bubblewrap's --unshare-net path calls loopback_setup(), which:

  1. if_nametoindex("lo") -> ioctl(SIOCGIFINDEX, {ifr_name="lo"})
  2. socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE)
  3. bind() the netlink socket
  4. sendto/recv RTM_NEWADDR + RTM_NEWLINK

On Android the underlying syscalls return EACCES because the real
caller lacks CAP_NET_ADMIN.  We can't really set the loopback up but
we can make bwrap think we did.

- ioctl(SIOCGIFINDEX) for "lo" is intercepted and filled with index 1.
- socket(AF_NETLINK, ...) is silently rewritten to
  socket(AF_UNIX, SOCK_DGRAM, ...).  The resulting fd is tracked on
  the tracee.
- bind/sendto/recvfrom on a tracked fd is voided.  sendto records the
  request's nlmsg_seq, recvfrom writes back a synthesised NLMSG_ERROR
  reply with error=0, nlmsg_seq from the request, and nlmsg_pid set
  to the tracee's pid (bwrap checks both).
- close() on a tracked fd removes it from the set so a reused fd
  number doesn't keep being intercepted.
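
The synthesised reply can be sketched with the standard netlink structures. build_netlink_ack() is a hypothetical name; the sketch only shows the message layout (an NLMSG_ERROR with error = 0, echoing the request's nlmsg_seq and carrying the tracee's pid), not how the patch writes it into tracee memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Builds an NLMSG_ERROR message with error = 0 (an ACK) into `buf`.
 * Returns the number of bytes written, or 0 if `buf` is too small. */
static size_t build_netlink_ack(const struct nlmsghdr *request,
                                uint32_t tracee_pid,
                                void *buf, size_t size)
{
    struct {
        struct nlmsghdr hdr;
        struct nlmsgerr err;
    } reply;

    assert(request != NULL && buf != NULL);

    if (size < sizeof(reply))
        return 0;

    memset(&reply, 0, sizeof(reply));
    reply.hdr.nlmsg_len = NLMSG_LENGTH(sizeof(struct nlmsgerr));
    reply.hdr.nlmsg_type = NLMSG_ERROR;
    reply.hdr.nlmsg_seq = request->nlmsg_seq;  /* echoed from the request */
    reply.hdr.nlmsg_pid = tracee_pid;
    reply.err.error = 0;                       /* 0 means success */
    reply.err.msg = *request;                  /* header of the acked request */

    memcpy(buf, &reply, sizeof(reply));
    return sizeof(reply);
}
```

A receiver that validates nlmsg_seq and nlmsg_pid, as bwrap is described to do, would accept this message as a successful acknowledgement.
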
sylirre added 2 commits May 24, 2026 23:02
Previously socket(AF_NETLINK, ...) was unconditionally rewritten to
socket(AF_UNIX, SOCK_DGRAM, 0), which broke legitimate netlink users
inside the rootfs (e.g. c-ares under dnf would observe a zero-byte
recvmsg and abort with "Unexpected netlink response of size 0 on
descriptor N (address family 1)").  The rewrite is only needed where
the kernel refuses AF_NETLINK outright (typical on Android/Termux).

Probe socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) once at first use
and cache the outcome; only fall back to the AF_UNIX emulation when
the probe fails.  When the probe succeeds, no fd is tracked, so the
dependent bind/sendto/recv intercepts stay inert.

Pulls in <unistd.h> for close(2) so the netlink probe builds without
the implicit-declaration warning.  While here, replace the magic
-1/0/1 cache values with a named enum, fix the comment to call out
the real reasons AF_NETLINK gets denied (SELinux, inherited seccomp,
hardened containers) rather than implying it is always seccomp, and
emit a VERBOSE note the first time the probe falls back so users can
tell from -v output whether the AF_UNIX shim is active.
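
A probe-once-and-cache scheme along the lines of the two commits above might look like the following sketch. The enum and function names are hypothetical, the VERBOSE note is omitted, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

/* Named states instead of magic -1/0/1 values. */
enum probe_state {
    PROBE_PENDING,  /* not probed yet */
    PROBE_NATIVE,   /* the kernel allows AF_NETLINK; no emulation */
    PROBE_EMULATE   /* AF_NETLINK is denied; fall back to the AF_UNIX shim */
};

static enum probe_state netlink_state = PROBE_PENDING;

/* Probes AF_NETLINK on first use and caches the outcome. */
static bool netlink_needs_emulation(void)
{
    if (netlink_state == PROBE_PENDING) {
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

        if (fd >= 0) {
            close(fd);
            netlink_state = PROBE_NATIVE;
        } else {
            netlink_state = PROBE_EMULATE;
        }
    }

    assert(netlink_state != PROBE_PENDING);
    return netlink_state == PROBE_EMULATE;
}
```

Only when netlink_needs_emulation() returns true would the socket() rewrite and fd tracking kick in; otherwise the dependent bind/sendto/recv intercepts have nothing to act on.
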