seccomp: Make syscall skipping and nr changes more consistent

This fixes two issues that could cause incompatibility between
kernel versions:

 - If a tracer uses SECCOMP_RET_TRACE to select a syscall number
   higher than the largest known syscall, emulate the unknown
   vsyscall by returning -ENOSYS.  (This is unlikely to make a
   noticeable difference on x86-64 due to the way the system call
   entry works.)

 - On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.

This updates the documentation accordingly.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
index 597c3c5..1e469ef 100644
--- a/Documentation/prctl/seccomp_filter.txt
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -95,12 +95,15 @@
 
 SECCOMP_RET_TRAP:
 	Results in the kernel sending a SIGSYS signal to the triggering
-	task without executing the system call.  The kernel will
-	rollback the register state to just before the system call
-	entry such that a signal handler in the task will be able to
-	inspect the ucontext_t->uc_mcontext registers and emulate
-	system call success or failure upon return from the signal
-	handler.
+	task without executing the system call.  siginfo->si_call_addr
+	will show the address of the system call instruction, and
+	siginfo->si_syscall and siginfo->si_arch will indicate which
+	syscall was attempted.  The program counter will be as though
+	the syscall happened (i.e. it will not point to the syscall
+	instruction).  The return value register will contain an arch-
+	dependent value -- if resuming execution, set it to something
+	sensible.  (The architecture dependency is because replacing
+	it with -ENOSYS could overwrite some useful information.)
 
 	The SECCOMP_RET_DATA portion of the return value will be passed
 	as si_errno.
@@ -123,6 +126,18 @@
 	the BPF program return value will be available to the tracer
 	via PTRACE_GETEVENTMSG.
 
+	The tracer can skip the system call by changing the syscall number
+	to -1.  Alternatively, the tracer can change the system call
+	requested by changing the system call to a valid syscall number.  If
+	the tracer asks to skip the system call, then the system call will
+	appear to return the value that the tracer puts in the return value
+	register.
+
+	The seccomp check will not be run again after the tracer is
+	notified.  (This means that seccomp-based sandboxes MUST NOT
+	allow use of ptrace, even of other sandboxed processes, without
+	extreme care; ptracers can use this mechanism to escape.)
+
 SECCOMP_RET_ALLOW:
 	Results in the system call being executed.
 
@@ -161,3 +176,50 @@
 support seccomp filter with minor fixup: SIGSYS support and seccomp return
 value checking.  Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
 to its arch-specific Kconfig.
+
+
+
+Caveats
+-------
+
+The vDSO can cause some system calls to run entirely in userspace,
+leading to surprises when you run programs on different machines that
+fall back to real syscalls.  To minimize these surprises on x86, make
+sure you test with
+/sys/devices/system/clocksource/clocksource0/current_clocksource set to
+something like acpi_pm.
+
+On x86-64, vsyscall emulation is enabled by default.  (vsyscalls are
+legacy variants on vDSO calls.)  Currently, emulated vsyscalls will honor seccomp, with a few oddities:
+
+- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
+  the vsyscall entry for the given call and not the address after the
+  'syscall' instruction.  Any code which wants to restart the call
+  should be aware that (a) a ret instruction has been emulated and (b)
+  trying to resume the syscall will again trigger the standard vsyscall
+  emulation security checks, making resuming the syscall mostly
+  pointless.
+
+- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
+  but the syscall may not be changed to another system call using the
+  orig_rax register. It may only be changed to -1 order to skip the
+  currently emulated call. Any other change MAY terminate the process.
+  The rip value seen by the tracer will be the syscall entry address;
+  this is different from normal behavior.  The tracer MUST NOT modify
+  rip or rsp.  (Do not rely on other changes terminating the process.
+  They might work.  For example, on some kernels, choosing a syscall
+  that only exists in future kernels will be correctly emulated (by
+  returning -ENOSYS).
+
+To detect this quirky behavior, check for addr & ~0x0C00 ==
+0xFFFFFFFFFF600000.  (For SECCOMP_RET_TRACE, use rip.  For
+SECCOMP_RET_TRAP, use siginfo->si_call_addr.)  Do not check any other
+condition: future kernels may improve vsyscall emulation and current
+kernels in vsyscall=native mode will behave differently, but the
+instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
+cases.
+
+Note that modern systems are unlikely to use vsyscalls at all -- they
+are a legacy feature and they are considerably slower than standard
+syscalls.  New code will use the vDSO, and vDSO-issued system calls
+are indistinguishable from normal system calls.