[minijail] handle non-namespaced multithreaded use.
Multithreaded use of pid namespaces is still broken; see the block comment in
</libminijail.c>.
BUG=None
TEST=build
Change-Id: Ibeb9434146a231fd2fd7468572e4fec28a1c1b60
Signed-off-by: Elly Jones <ellyjones@chromium.org>
Reviewed-on: https://gerrit.chromium.org/gerrit/25234
Reviewed-by: Mike Frysinger <vapier@chromium.org>
diff --git a/libminijail.c b/libminijail.c
index 9348a59..1f7ac9d 100644
--- a/libminijail.c
+++ b/libminijail.c
@@ -808,7 +808,6 @@
int API minijail_run_pid(struct minijail *j, const char *filename,
char *const argv[], pid_t *pchild_pid)
{
- unsigned int pidns = j->flags.pids ? CLONE_NEWPID : 0;
char *oldenv, *oldenv_copy = NULL;
pid_t child_pid;
int pipe_fds[2];
@@ -831,7 +830,51 @@
if (setup_pipe(pipe_fds))
return -EFAULT;
- child_pid = syscall(SYS_clone, pidns | SIGCHLD, NULL);
+ /* Use sys_clone() if and only if we're creating a pid namespace.
+ *
+ * tl;dr: WARNING: do not mix pid namespaces and multithreading.
+ *
+ * In multithreaded programs, there are a bunch of locks inside libc,
+ * some of which may be held by other threads at the time that we call
+ * minijail_run_pid(). If we call fork(), glibc does its level best to
+ * ensure that we hold all of these locks before it calls clone()
+ * internally and drop them after clone() returns, but when we call
+ * sys_clone(2) directly, all that gets bypassed and we end up with a
+ * child address space where some of libc's important locks are held by
+ * other threads (which did not get cloned, and hence will never release
+ * those locks). This is okay so long as we call exec() immediately
+ * after, but a bunch of seemingly-innocent libc functions like setenv()
+ * take locks.
+ *
+ * Hence, only call sys_clone() if we need to, in order to get at pid
+ * namespacing. If we follow this path, the child's address space might
+ * have broken locks; you may only call functions that do not acquire
+ * any locks.
+ *
+ * Unfortunately, fork() acquires every lock it can get its hands on, as
+ * previously detailed, so this function is highly likely to deadlock
+ * later on (see "deadlock here") if we're multithreaded.
+ *
+ * We might hack around this by having the clone()d child (init of the
+ * pid namespace) return directly, rather than leaving the clone()d
+ * process hanging around to be init for the new namespace (and having
+ * its fork()ed child return in turn), but that process would be crippled
+ * with its libc locks potentially broken. We might try fork()ing in the
+ * parent before we clone() to ensure that we own all the locks, but
+ * then we have to have the forked child hanging around consuming
+ * resources (and possibly having file descriptors / shared memory
+ * regions / etc attached). We'd need to keep the child around to avoid
+ * having its children get reparented to init.
+ *
+ * TODO(ellyjones): figure out if the "forked child hanging around"
+ * problem is fixable or not. It would be nice if we worked in this
+ * case.
+ */
+ if (j->flags.pids)
+ child_pid = syscall(SYS_clone, CLONE_NEWPID | SIGCHLD, NULL);
+ else
+ child_pid = fork();
+
if (child_pid < 0) {
free(oldenv_copy);
return child_pid;
@@ -865,12 +908,15 @@
/* Jail this process and its descendants... */
minijail_enter(j);
- if (pidns) {
+ if (j->flags.pids) {
/*
* pid namespace: this process will become init inside the new
* namespace, so fork off a child to actually run the program
* (we don't want all programs we might exec to have to know
* how to be init).
+ *
+ * If we're multithreaded, we'll probably deadlock here. See
+ * WARNING above.
*/
child_pid = fork();
if (child_pid < 0)