| |
| unshare system call: |
| -------------------- |
| This document describes the new system call, unshare. The document |
| provides an overview of the feature, why it is needed, how it can |
| be used, its interface specification, design, implementation and |
| how it can be tested. |
| |
| Change Log: |
| ----------- |
| version 0.1 Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006 |
| |
| Contents: |
| --------- |
| 1) Overview |
| 2) Benefits |
| 3) Cost |
| 4) Requirements |
| 5) Functional Specification |
| 6) High Level Design |
| 7) Low Level Design |
| 8) Test Specification |
| 9) Future Work |
| |
| 1) Overview |
| ----------- |
| Most legacy operating system kernels support an abstraction of threads |
| as multiple execution contexts within a process. These kernels provide |
| special resources and mechanisms to maintain these "threads". The Linux |
| kernel, in a clever and simple manner, does not make distinction |
| between processes and "threads". The kernel allows processes to share |
| resources and thus they can achieve legacy "threads" behavior without |
| requiring additional data structures and mechanisms in the kernel. The |
| power of implementing threads in this manner comes not only from |
| its simplicity but also from allowing application programmers to work |
| outside the confinement of all-or-nothing shared resources of legacy |
| threads. On Linux, at the time of thread creation using the clone system |
| call, applications can selectively choose which resources to share |
| between threads. |
| |
| unshare system call adds a primitive to the Linux thread model that |
| allows threads to selectively 'unshare' any resources that were being |
| shared at the time of their creation. unshare was conceptualized by |
| Al Viro in the August of 2000, on the Linux-Kernel mailing list, as part |
| of the discussion on POSIX threads on Linux. unshare augments the |
| usefulness of Linux threads for applications that would like to control |
| shared resources without creating a new process. unshare is a natural |
| addition to the set of available primitives on Linux that implement |
| the concept of process/thread as a virtual machine. |
| |
| 2) Benefits |
| ----------- |
| unshare would be useful to large application frameworks such as PAM |
| where creating a new process to control sharing/unsharing of process |
| resources is not possible. Since namespaces are shared by default |
| when creating a new process using fork or clone, unshare can benefit |
| even non-threaded applications if they have a need to disassociate |
| from default shared namespace. The following lists two use-cases |
| where unshare can be used. |
| |
| 2.1 Per-security context namespaces |
| ----------------------------------- |
| unshare can be used to implement polyinstantiated directories using |
| the kernel's per-process namespace mechanism. Polyinstantiated directories, |
| such as per-user and/or per-security context instance of /tmp, /var/tmp or |
| per-security context instance of a user's home directory, isolate user |
| processes when working with these directories. Using unshare, a PAM |
| module can easily setup a private namespace for a user at login. |
| Polyinstantiated directories are required for Common Criteria certification |
| with Labeled System Protection Profile, however, with the availability |
| of shared-tree feature in the Linux kernel, even regular Linux systems |
| can benefit from setting up private namespaces at login and |
| polyinstantiating /tmp, /var/tmp and other directories deemed |
| appropriate by system administrators. |
| |
| 2.2 unsharing of virtual memory and/or open files |
| ------------------------------------------------- |
| Consider a client/server application where the server is processing |
| client requests by creating processes that share resources such as |
| virtual memory and open files. Without unshare, the server has to |
| decide what needs to be shared at the time of creating the process |
| which services the request. unshare allows the server an ability to |
| disassociate parts of the context during the servicing of the |
| request. For large and complex middleware application frameworks, this |
| ability to unshare after the process was created can be very |
| useful. |
| |
| 3) Cost |
| ------- |
| In order to not duplicate code and to handle the fact that unshare |
| works on an active task (as opposed to clone/fork working on a newly |
| allocated inactive task) unshare had to make minor reorganizational |
| changes to copy_* functions utilized by clone/fork system call. |
| There is a cost associated with altering existing, well tested and |
| stable code to implement a new feature that may not get exercised |
| extensively in the beginning. However, with proper design and code |
| review of the changes and creation of an unshare test for the LTP |
| the benefits of this new feature can exceed its cost. |
| |
| 4) Requirements |
| --------------- |
| unshare reverses sharing that was done using clone(2) system call, |
| so unshare should have a similar interface as clone(2). That is, |
| since flags in clone(int flags, void *stack) specifies what should |
| be shared, similar flags in unshare(int flags) should specify |
| what should be unshared. Unfortunately, this may appear to invert |
| the meaning of the flags from the way they are used in clone(2). |
| However, there was no easy solution that was less confusing and that |
| allowed incremental context unsharing in future without an ABI change. |
| |
| unshare interface should accommodate possible future addition of |
| new context flags without requiring a rebuild of old applications. |
| If and when new context flags are added, unshare design should allow |
| incremental unsharing of those resources on an as needed basis. |
| |
| 5) Functional Specification |
| --------------------------- |
| NAME |
| unshare - disassociate parts of the process execution context |
| |
| SYNOPSIS |
| #include <sched.h> |
| |
| int unshare(int flags); |
| |
| DESCRIPTION |
| unshare allows a process to disassociate parts of its execution |
| context that are currently being shared with other processes. Part |
| of execution context, such as the namespace, is shared by default |
| when a new process is created using fork(2), while other parts, |
| such as the virtual memory, open file descriptors, etc, may be |
| shared by explicit request to share them when creating a process |
| using clone(2). |
| |
| The main use of unshare is to allow a process to control its |
| shared execution context without creating a new process. |
| |
| The flags argument specifies one or bitwise-or'ed of several of |
| the following constants. |
| |
| CLONE_FS |
| If CLONE_FS is set, file system information of the caller |
| is disassociated from the shared file system information. |
| |
| CLONE_FILES |
| If CLONE_FILES is set, the file descriptor table of the |
| caller is disassociated from the shared file descriptor |
| table. |
| |
| CLONE_NEWNS |
| If CLONE_NEWNS is set, the namespace of the caller is |
| disassociated from the shared namespace. |
| |
| CLONE_VM |
| If CLONE_VM is set, the virtual memory of the caller is |
| disassociated from the shared virtual memory. |
| |
| RETURN VALUE |
| On success, zero returned. On failure, -1 is returned and errno is |
| |
| ERRORS |
| EPERM CLONE_NEWNS was specified by a non-root process (process |
| without CAP_SYS_ADMIN). |
| |
| ENOMEM Cannot allocate sufficient memory to copy parts of caller's |
| context that need to be unshared. |
| |
| EINVAL Invalid flag was specified as an argument. |
| |
| CONFORMING TO |
| The unshare() call is Linux-specific and should not be used |
| in programs intended to be portable. |
| |
| SEE ALSO |
| clone(2), fork(2) |
| |
| 6) High Level Design |
| -------------------- |
| Depending on the flags argument, the unshare system call allocates |
| appropriate process context structures, populates it with values from |
| the current shared version, associates newly duplicated structures |
| with the current task structure and releases corresponding shared |
| versions. Helper functions of clone (copy_*) could not be used |
| directly by unshare because of the following two reasons. |
| 1) clone operates on a newly allocated not-yet-active task |
| structure, where as unshare operates on the current active |
| task. Therefore unshare has to take appropriate task_lock() |
| before associating newly duplicated context structures |
| 2) unshare has to allocate and duplicate all context structures |
| that are being unshared, before associating them with the |
| current task and releasing older shared structures. Failure |
| do so will create race conditions and/or oops when trying |
| to backout due to an error. Consider the case of unsharing |
| both virtual memory and namespace. After successfully unsharing |
| vm, if the system call encounters an error while allocating |
| new namespace structure, the error return code will have to |
| reverse the unsharing of vm. As part of the reversal the |
| system call will have to go back to older, shared, vm |
| structure, which may not exist anymore. |
| |
| Therefore code from copy_* functions that allocated and duplicated |
| current context structure was moved into new dup_* functions. Now, |
| copy_* functions call dup_* functions to allocate and duplicate |
| appropriate context structures and then associate them with the |
| task structure that is being constructed. unshare system call on |
| the other hand performs the following: |
| 1) Check flags to force missing, but implied, flags |
| 2) For each context structure, call the corresponding unshare |
| helper function to allocate and duplicate a new context |
| structure, if the appropriate bit is set in the flags argument. |
| 3) If there is no error in allocation and duplication and there |
| are new context structures then lock the current task structure, |
| associate new context structures with the current task structure, |
| and release the lock on the current task structure. |
| 4) Appropriately release older, shared, context structures. |
| |
| 7) Low Level Design |
| ------------------- |
| Implementation of unshare can be grouped in the following 4 different |
| items: |
| a) Reorganization of existing copy_* functions |
| b) unshare system call service function |
| c) unshare helper functions for each different process context |
| d) Registration of system call number for different architectures |
| |
| 7.1) Reorganization of copy_* functions |
| Each copy function such as copy_mm, copy_namespace, copy_files, |
| etc, had roughly two components. The first component allocated |
| and duplicated the appropriate structure and the second component |
| linked it to the task structure passed in as an argument to the copy |
| function. The first component was split into its own function. |
| These dup_* functions allocated and duplicated the appropriate |
| context structure. The reorganized copy_* functions invoked |
| their corresponding dup_* functions and then linked the newly |
| duplicated structures to the task structure with which the |
| copy function was called. |
| |
| 7.2) unshare system call service function |
| * Check flags |
| Force implied flags. If CLONE_THREAD is set force CLONE_VM. |
| If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is |
| set and signals are also being shared, force CLONE_THREAD. If |
| CLONE_NEWNS is set, force CLONE_FS. |
| * For each context flag, invoke the corresponding unshare_* |
| helper routine with flags passed into the system call and a |
| reference to pointer pointing the new unshared structure |
| * If any new structures are created by unshare_* helper |
| functions, take the task_lock() on the current task, |
| modify appropriate context pointers, and release the |
| task lock. |
| * For all newly unshared structures, release the corresponding |
| older, shared, structures. |
| |
| 7.3) unshare_* helper functions |
| For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND, |
| and CLONE_THREAD, return -EINVAL since they are not implemented yet. |
| For others, check the flag value to see if the unsharing is |
| required for that structure. If it is, invoke the corresponding |
| dup_* function to allocate and duplicate the structure and return |
| a pointer to it. |
| |
| 7.4) Appropriately modify architecture specific code to register the |
| new system call. |
| |
| 8) Test Specification |
| --------------------- |
| The test for unshare should test the following: |
| 1) Valid flags: Test to check that clone flags for signal and |
| signal handlers, for which unsharing is not implemented |
| yet, return -EINVAL. |
| 2) Missing/implied flags: Test to make sure that if unsharing |
| namespace without specifying unsharing of filesystem, correctly |
| unshares both namespace and filesystem information. |
| 3) For each of the four (namespace, filesystem, files and vm) |
| supported unsharing, verify that the system call correctly |
| unshares the appropriate structure. Verify that unsharing |
| them individually as well as in combination with each |
| other works as expected. |
| 4) Concurrent execution: Use shared memory segments and futex on |
| an address in the shm segment to synchronize execution of |
| about 10 threads. Have a couple of threads execute execve, |
| a couple _exit and the rest unshare with different combination |
| of flags. Verify that unsharing is performed as expected and |
| that there are no oops or hangs. |
| |
| 9) Future Work |
| -------------- |
| The current implementation of unshare does not allow unsharing of |
| signals and signal handlers. Signals are complex to begin with and |
| to unshare signals and/or signal handlers of a currently running |
| process is even more complex. If in the future there is a specific |
| need to allow unsharing of signals and/or signal handlers, it can |
| be incrementally added to unshare without affecting legacy |
| applications using unshare. |
| |