From 98284848cc39df361335c734fac728b8698551c0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E5=9F=83=E5=8D=9A=E6=8B=89=E9=85=B1?=
Date: Sat, 18 Apr 2026 02:48:27 +0800
Subject: [PATCH 1/2] Restore original sysnum in unmodifiable-sysnum workaround

When push_specific_regs fails because the kernel lacks the
NT_ARM_SYSTEM_CALL regset (so the NT_PRSTATUS push carrying
x8=-1/PR_void is rejected with EINVAL), proot falls back to poking all
six syscall argument registers to -1 and re-pushes NT_PRSTATUS.
Previously x8 was kept at -1 through this second push, so the kernel
saw an illegal sysnum and rejected the call with ENOSYS; the real error
code was then written to x0 at sysexit.

On some kernels this does not work: x8=-1 on restart triggers a
non-standard signal delivery path that synthesizes a SIGSEGV and kills
the tracee before it executes a single user-mode instruction.

Restore x8 to the original sysnum before re-pushing, and keep poking
all six argument registers to -1. The kernel then actually runs the
original syscall, but with all arguments set to -1 it fails naturally
inside the kernel (EFAULT/EBADF/EINVAL) without side effects, and proot
overrides the returned x0 at sysexit with the real error code.

The PR_brk special case (SYSARG_1=0) is also removed. It was a
defensive path against a non-compliant kernel mishandling brk(-1). With
x8 now explicitly restored, and the Linux brk system call returning the
current break unchanged when the requested address is invalid, the
extra poke is unnecessary.
---
 src/syscall/syscall.c | 31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/src/syscall/syscall.c b/src/syscall/syscall.c
index 0761a263..51056a0a 100644
--- a/src/syscall/syscall.c
+++ b/src/syscall/syscall.c
@@ -243,10 +243,28 @@ void translate_syscall(Tracee *tracee)
 					tracee->restart_how = PTRACE_SYSCALL;
 				}
 
-				/* Set syscall arguments to make it fail
-				 * TODO: More reliable way to make invalid arguments
-				 * For most syscalls we set all args to -1
-				 * Hoping there is among them invalid request/address/fd/value that will make syscall fail */
+				/* Handle syscall rejection when sysnum can't be modified.
+				 *
+				 * Normal path: proot sets sysnum to PR_void so the kernel runs
+				 * a harmless no-op, then overrides x0 at sysexit with the real
+				 * error code. On some kernels the NT_ARM_SYSTEM_CALL regset is
+				 * absent, so the NT_PRSTATUS push carrying x8=-1 (PR_void) is
+				 * rejected with EINVAL and we land in this workaround branch.
+				 *
+				 * Legacy strategy was to poke all 6 syscall args to -1 and
+				 * re-push NT_PRSTATUS, keeping x8=-1 so the kernel still saw
+				 * an illegal sysnum and rejected the call with ENOSYS. That
+				 * works on stock kernels, but on some kernels x8=-1 on restart
+				 * triggers a non-standard signal delivery path that
+				 * synthesizes a SIGSEGV and kills the tracee before it
+				 * executes a single user-mode instruction.
+				 *
+				 * Correct strategy: restore x8 to the original sysnum so the
+				 * kernel actually runs the rejected syscall, and poke all 6
+				 * args to -1 so the syscall fails naturally inside the kernel
+				 * (EFAULT/EBADF/EINVAL). The real error code is written to x0
+				 * by proot at sysexit. */
+				poke_reg(tracee, SYSARG_NUM, orig_sysnum); /* restore sysnum; x8=-1 triggers non-standard signal path on some kernels */
 				poke_reg(tracee, SYSARG_1, -1);
 				poke_reg(tracee, SYSARG_2, -1);
 				poke_reg(tracee, SYSARG_3, -1);
@@ -254,11 +272,6 @@ void translate_syscall(Tracee *tracee)
 				poke_reg(tracee, SYSARG_5, -1);
 				poke_reg(tracee, SYSARG_6, -1);
 
-				if (get_sysnum(tracee, ORIGINAL) == PR_brk) {
-					/* For brk() we pass 0 as first arg; this is used to query value without changing it */
-					poke_reg(tracee, SYSARG_1, 0);
-				}
-
 				/* Push regs again without changing syscall */
 				push_regs_status = push_specific_regs(tracee, false);
 				if (push_regs_status != 0) {

From 001a6cc21691006ccf1d1038bd0161feceab7bc5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E5=9F=83=E5=8D=9A=E6=8B=89=E9=85=B1?=
Date: Sat, 18 Apr 2026 14:31:22 +0800
Subject: [PATCH 2/2] Address Copilot review on PR #348 and cache sysnum-regset
 availability
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two follow-ups to the unmodifiable-sysnum workaround:

1) Rewrite the workaround comment in src/syscall/syscall.c to reflect
   the real failure point. The previous wording blamed the NT_PRSTATUS
   push carrying x8=-1, but push_specific_regs() bails out earlier: on
   arm64 it returns as soon as PTRACE_SETREGSET(NT_ARM_SYSTEM_CALL)
   returns EINVAL, so the general-register push is never even attempted.
   The comment now describes the syscall-number regset / register in
   architecture-agnostic terms (x8/x0 references kept only as concrete
   arm64 examples inside the explanation) and records why a
   "known-unsafe syscall" guard proposed by the reviewer is not added:
   (a) the legacy "keep sysnum=PR_void" path is strictly worse on
       affected kernels because it causes SIGSEGV and kills the tracee;
   (b) poisoning all six arg registers to -1 already traps the vast
       majority of side-effectful syscalls at the kernel's
       parameter-validation stage;
   (c) we have no empirically grounded list of syscalls that both reach
       this suppression path and cause harmful side effects when invoked
       with poisoned args, so a speculative allow/deny list would be
       dead code.
   sysexit still overrides the return-value register with the real
   error code.

2) Cache NT_ARM_SYSTEM_CALL availability in src/tracee/reg.c. The regset
   is a kernel-global capability: once PTRACE_SETREGSET rejects it with
   EINVAL, it will never succeed under the same running kernel. A
   process-local static bool short-circuits subsequent requests so the
   caller hits its workaround path immediately, without paying the cost
   of a guaranteed-failing ptrace call on every intercepted syscall.
   Only EINVAL is cached; ESRCH/EPERM/EFAULT are per-tracee state errors
   and are not treated as kernel-capability signals. The cache is
   memory-only and not persisted, so a new proot run always re-probes.
   This keeps behaviour correct across reboots, kernel upgrades, or
   chroots into different kernels, at the cost of a single failing
   ptrace per proot start on affected kernels.
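
Point (b) of item 1 can be checked from user space without ptrace. The
sketch below is only an illustration and is not part of the patch:
errno_with_poisoned_args() is a hypothetical helper that issues a raw
syscall with -1 in all six argument slots, which is what the workaround
arranges for the tracee. It assumes a Linux system with glibc's
syscall() wrapper.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper, not part of proot: invoke syscall `nr` with all
 * six argument slots set to -1, mirroring what the workaround pokes
 * into the tracee's argument registers. Returns 0 if the kernel
 * accepted the call, or the errno value it failed with. */
static int errno_with_poisoned_args(long nr)
{
	errno = 0;
	long ret = syscall(nr, -1L, -1L, -1L, -1L, -1L, -1L);
	return ret == -1 ? errno : 0;
}
```

Calling it with SYS_read or SYS_close reports EBADF, since the file
descriptor is validated first. Calling it with SYS_getpid reports 0,
because getpid takes no arguments and therefore still runs; this is the
limitation the rewritten comment accepts.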
---
 src/syscall/syscall.c | 52 ++++++++++++++++++++++++++++++-------------
 src/tracee/reg.c      | 20 ++++++++++++++++++-
 2 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/src/syscall/syscall.c b/src/syscall/syscall.c
index 51056a0a..b94e7b3f 100644
--- a/src/syscall/syscall.c
+++ b/src/syscall/syscall.c
@@ -243,28 +243,48 @@ void translate_syscall(Tracee *tracee)
 					tracee->restart_how = PTRACE_SYSCALL;
 				}
 
-				/* Handle syscall rejection when sysnum can't be modified.
+				/* Handle syscall rejection when the syscall number can't be modified.
 				 *
-				 * Normal path: proot sets sysnum to PR_void so the kernel runs
-				 * a harmless no-op, then overrides x0 at sysexit with the real
-				 * error code. On some kernels the NT_ARM_SYSTEM_CALL regset is
-				 * absent, so the NT_PRSTATUS push carrying x8=-1 (PR_void) is
-				 * rejected with EINVAL and we land in this workaround branch.
+				 * Normal path: proot sets the syscall number to PR_void so the
+				 * kernel runs a harmless no-op, then overrides the return-value
+				 * register at sysexit with the real error code. On some kernels
+				 * the dedicated syscall-number regset is absent/refused (on
+				 * arm64 this is PTRACE_SETREGSET(NT_ARM_SYSTEM_CALL) returning
+				 * EINVAL; see push_specific_regs() in tracee/reg.c which bails
+				 * out before even attempting the general-register push), and
+				 * we land in this workaround branch.
 				 *
 				 * Legacy strategy was to poke all 6 syscall args to -1 and
-				 * re-push NT_PRSTATUS, keeping x8=-1 so the kernel still saw
-				 * an illegal sysnum and rejected the call with ENOSYS. That
-				 * works on stock kernels, but on some kernels x8=-1 on restart
-				 * triggers a non-standard signal delivery path that
-				 * synthesizes a SIGSEGV and kills the tracee before it
-				 * executes a single user-mode instruction.
+				 * re-push the general register state while keeping the syscall
+				 * number set to PR_void, so the kernel still saw an illegal
+				 * syscall number and rejected the call with ENOSYS. That works
+				 * on stock kernels, but on some kernels restarting with the
+				 * syscall-number register set to PR_void triggers a
+				 * non-standard signal delivery path that synthesizes a SIGSEGV
+				 * and kills the tracee before it executes a single user-mode
+				 * instruction.
 				 *
-				 * Correct strategy: restore x8 to the original sysnum so the
+				 * Correct strategy: restore the original syscall number so the
 				 * kernel actually runs the rejected syscall, and poke all 6
 				 * args to -1 so the syscall fails naturally inside the kernel
-				 * (EFAULT/EBADF/EINVAL). The real error code is written to x0
-				 * by proot at sysexit. */
-				poke_reg(tracee, SYSARG_NUM, orig_sysnum); /* restore sysnum; x8=-1 triggers non-standard signal path on some kernels */
+				 * (EFAULT/EBADF/EINVAL). The real error code is written to the
+				 * return-value register by proot at sysexit.
+				 *
+				 * Known limitation:
+				 * syscalls that ignore arguments (e.g. getpid/sync) or take
+				 * fewer than 6 args will not necessarily fail inside the
+				 * kernel, so they will actually execute with whatever state
+				 * the tracee already has. We accept this: (a) the legacy
+				 * "keep sysnum=PR_void" path is strictly worse on affected
+				 * kernels: it kills the tracee with SIGSEGV; (b) -1 in every
+				 * arg slot already traps the overwhelming majority of
+				 * side-effectful syscalls at the kernel's parameter-validation
+				 * stage (EBADF/EFAULT/EINVAL); (c) we have no empirically
+				 * grounded list of syscalls that both reach this suppression
+				 * path and cause harmful side effects when run with poisoned
+				 * args, so a speculative allow/deny list would be dead code.
+				 * The real return value is still overridden at sysexit. */
+				poke_reg(tracee, SYSARG_NUM, orig_sysnum); /* restore original sysnum; PR_void in the syscall-number register triggers a non-standard SIGSEGV path on some kernels */
 				poke_reg(tracee, SYSARG_1, -1);
 				poke_reg(tracee, SYSARG_2, -1);
 				poke_reg(tracee, SYSARG_3, -1);
diff --git a/src/tracee/reg.c b/src/tracee/reg.c
index 3859f8e5..db5b4773 100644
--- a/src/tracee/reg.c
+++ b/src/tracee/reg.c
@@ -332,12 +332,30 @@ int push_specific_regs(Tracee *tracee, bool including_sysnum)
 
 		/* Update syscall number if needed. On arm64, a new
 		 * subcommand has been added to PTRACE_{S,G}ETREGSET
-		 * to allow write/read of current sycall number. */
+		 * to allow write/read of current syscall number.
+		 *
+		 * Kernel-capability cache: NT_ARM_SYSTEM_CALL is a
+		 * kernel-global feature; if it is rejected once with
+		 * EINVAL, it will never succeed under the same running
+		 * kernel. We short-circuit subsequent requests so the
+		 * caller (see syscall.c unmodifiable-sysnum workaround)
+		 * hits its fallback path immediately without paying the
+		 * cost of a guaranteed-failing ptrace on every intercepted
+		 * syscall. Only EINVAL is cached: ESRCH/EPERM/EFAULT are
+		 * per-tracee state issues, not kernel capability. Memory
+		 * only (not persisted); a new proot run re-probes. */
+		static bool sysnum_regset_unavailable = false;
 		if (including_sysnum && current_sysnum != REG(tracee, ORIGINAL, SYSARG_NUM)) {
+			if (sysnum_regset_unavailable) {
+				errno = EINVAL;
+				return -1;
+			}
 			regs.iov_base = &current_sysnum;
 			regs.iov_len = sizeof(current_sysnum);
 			status = ptrace(PTRACE_SETREGSET, tracee->pid, NT_ARM_SYSTEM_CALL, &regs);
 			if (status < 0) {
+				if (errno == EINVAL)
+					sysnum_regset_unavailable = true;
 				//note(tracee, WARNING, SYSTEM, "can't set the syscall number");
 				return status;
 			}
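
The capability cache added to reg.c can be exercised in isolation. The
sketch below is an illustration under stated assumptions rather than
the patch's code: set_sysnum_cached(), the set_sysnum_fn callback type,
and the fake_* test double are hypothetical names, and the callback
stands in for the real ptrace(PTRACE_SETREGSET, ..., NT_ARM_SYSTEM_CALL,
...) call so that the logic runs without a tracee.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the real ptrace(PTRACE_SETREGSET, ...)
 * call: returns 0 on success, or -1 with errno set. */
typedef int (*set_sysnum_fn)(int pid, long sysnum);

/* Process-local flag, set once the kernel has refused the regset. */
static bool sysnum_regset_unavailable = false;

/* Try to set the syscall number, skipping the backend call entirely
 * once the kernel is known not to support it. Only EINVAL is
 * remembered; other errors describe one tracee, not the kernel. */
static int set_sysnum_cached(set_sysnum_fn set_sysnum, int pid, long sysnum)
{
	if (sysnum_regset_unavailable) {
		errno = EINVAL;
		return -1;
	}

	int status = set_sysnum(pid, sysnum);
	if (status < 0 && errno == EINVAL)
		sysnum_regset_unavailable = true;
	return status;
}

/* Test double (assumption, for illustration only): always fails with
 * the errno chosen in fake_errno and counts how often it is called. */
static int fake_errno = EINVAL;
static int fake_calls = 0;

static int fake_set_sysnum(int pid, long sysnum)
{
	(void)pid;
	(void)sysnum;
	fake_calls++;
	errno = fake_errno;
	return -1;
}
```

With fake_errno set to ESRCH the flag stays clear and every request
reaches the backend. After a single EINVAL the flag is set, and later
requests fail with EINVAL without calling the backend again.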