JANAK DESAI | 0d4c3e7 | 2006-02-07 12:58:56 -0800 | [diff] [blame] | 1 | |
| 2 | unshare system call: |
| 3 | -------------------- |
| 4 | This document describes the new system call, unshare. The document |
| 5 | provides an overview of the feature, why it is needed, how it can |
| 6 | be used, its interface specification, design, implementation and |
| 7 | how it can be tested. |
| 8 | |
| 9 | Change Log: |
| 10 | ----------- |
| 11 | version 0.1 Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006 |
| 12 | |
| 13 | Contents: |
| 14 | --------- |
| 15 | 1) Overview |
| 16 | 2) Benefits |
| 17 | 3) Cost |
| 18 | 4) Requirements |
| 19 | 5) Functional Specification |
| 20 | 6) High Level Design |
| 21 | 7) Low Level Design |
| 22 | 8) Test Specification |
| 23 | 9) Future Work |
| 24 | |
| 25 | 1) Overview |
| 26 | ----------- |
| 27 | Most legacy operating system kernels support an abstraction of threads |
| 28 | as multiple execution contexts within a process. These kernels provide |
| 29 | special resources and mechanisms to maintain these "threads". The Linux |
| 30 | kernel, in a clever and simple manner, does not make a distinction |
| 31 | between processes and "threads". The kernel allows processes to share |
| 32 | resources and thus they can achieve legacy "threads" behavior without |
| 33 | requiring additional data structures and mechanisms in the kernel. The |
| 34 | power of implementing threads in this manner comes not only from |
| 35 | its simplicity but also from allowing application programmers to work |
| 36 | outside the confinement of all-or-nothing shared resources of legacy |
| 37 | threads. On Linux, at the time of thread creation using the clone system |
| 38 | call, applications can selectively choose which resources to share |
| 39 | between threads. |
| 40 | |
| 41 | The unshare system call adds a primitive to the Linux thread model that |
| 42 | allows threads to selectively 'unshare' any resources that were being |
| 43 | shared at the time of their creation. unshare was conceptualized by |
| 44 | Al Viro in August of 2000, on the Linux-Kernel mailing list, as part |
| 45 | of the discussion on POSIX threads on Linux. unshare augments the |
| 46 | usefulness of Linux threads for applications that would like to control |
| 47 | shared resources without creating a new process. unshare is a natural |
| 48 | addition to the set of available primitives on Linux that implement |
| 49 | the concept of process/thread as a virtual machine. |
| 50 | |
| 51 | 2) Benefits |
| 52 | ----------- |
| 53 | unshare would be useful to large application frameworks such as PAM |
| 54 | where creating a new process to control sharing/unsharing of process |
| 55 | resources is not possible. Since namespaces are shared by default |
| 56 | when creating a new process using fork or clone, unshare can benefit |
| 57 | even non-threaded applications if they have a need to disassociate |
| 58 | from default shared namespace. The following lists two use-cases |
| 59 | where unshare can be used. |
| 60 | |
| 61 | 2.1 Per-security context namespaces |
| 62 | ----------------------------------- |
| 63 | unshare can be used to implement polyinstantiated directories using |
| 64 | the kernel's per-process namespace mechanism. Polyinstantiated directories, |
| 65 | such as per-user and/or per-security context instance of /tmp, /var/tmp or |
| 66 | per-security context instance of a user's home directory, isolate user |
| 67 | processes when working with these directories. Using unshare, a PAM |
| 68 | module can easily set up a private namespace for a user at login. |
| 69 | Polyinstantiated directories are required for Common Criteria certification |
| 70 | with Labeled System Protection Profile; however, with the availability |
| 71 | of the shared-tree feature in the Linux kernel, even regular Linux systems |
| 72 | can benefit from setting up private namespaces at login and |
| 73 | polyinstantiating /tmp, /var/tmp and other directories deemed |
| 74 | appropriate by system administrators. |
| 75 | |
| 76 | 2.2 unsharing of virtual memory and/or open files |
| 77 | ------------------------------------------------- |
| 78 | Consider a client/server application where the server is processing |
| 79 | client requests by creating processes that share resources such as |
| 80 | virtual memory and open files. Without unshare, the server has to |
| 81 | decide what needs to be shared at the time of creating the process |
| 82 | which services the request. unshare gives the server the ability to |
| 83 | disassociate parts of the context during the servicing of the |
| 84 | request. For large and complex middleware application frameworks, this |
| 85 | ability to unshare after the process was created can be very |
| 86 | useful. |
| 87 | |
| 88 | 3) Cost |
| 89 | ------- |
| 90 | In order to not duplicate code and to handle the fact that unshare |
| 91 | works on an active task (as opposed to clone/fork working on a newly |
| 92 | allocated inactive task) unshare had to make minor reorganizational |
| 93 | changes to copy_* functions utilized by clone/fork system call. |
| 94 | There is a cost associated with altering existing, well tested and |
| 95 | stable code to implement a new feature that may not get exercised |
| 96 | extensively in the beginning. However, with proper design and code |
| 97 | review of the changes and creation of an unshare test for the LTP |
| 98 | the benefits of this new feature can exceed its cost. |
| 99 | |
| 100 | 4) Requirements |
| 101 | --------------- |
| 102 | unshare reverses sharing that was done using the clone(2) system call, |
| 103 | so unshare should have an interface similar to clone(2). That is, |
| 104 | since flags in clone(int flags, void *stack) specify what should |
| 105 | be shared, similar flags in unshare(int flags) should specify |
| 106 | what should be unshared. Unfortunately, this may appear to invert |
| 107 | the meaning of the flags from the way they are used in clone(2). |
| 108 | However, there was no easy solution that was less confusing and that |
| 109 | allowed incremental context unsharing in the future without an ABI change. |
| 110 | |
| 111 | The unshare interface should accommodate possible future addition of |
| 112 | new context flags without requiring a rebuild of old applications. |
| 113 | If and when new context flags are added, the unshare design should allow |
| 114 | incremental unsharing of those resources on an as-needed basis. |
| 115 | |
| 116 | 5) Functional Specification |
| 117 | --------------------------- |
| 118 | NAME |
| 119 | unshare - disassociate parts of the process execution context |
| 120 | |
| 121 | SYNOPSIS |
| 122 | #include <sched.h> |
| 123 | |
| 124 | int unshare(int flags); |
| 125 | |
| 126 | DESCRIPTION |
| 127 | unshare allows a process to disassociate parts of its execution |
| 128 | context that are currently being shared with other processes. Part |
| 129 | of execution context, such as the namespace, is shared by default |
| 130 | when a new process is created using fork(2), while other parts, |
| 131 | such as the virtual memory, open file descriptors, etc, may be |
| 132 | shared by explicit request to share them when creating a process |
| 133 | using clone(2). |
| 134 | |
| 135 | The main use of unshare is to allow a process to control its |
| 136 | shared execution context without creating a new process. |
| 137 | |
| 138 | The flags argument specifies one, or the bitwise OR of several, of |
| 139 | the following constants. |
| 140 | |
| 141 | CLONE_FS |
| 142 | If CLONE_FS is set, file system information of the caller |
| 143 | is disassociated from the shared file system information. |
| 144 | |
| 145 | CLONE_FILES |
| 146 | If CLONE_FILES is set, the file descriptor table of the |
| 147 | caller is disassociated from the shared file descriptor |
| 148 | table. |
| 149 | |
| 150 | CLONE_NEWNS |
| 151 | If CLONE_NEWNS is set, the namespace of the caller is |
| 152 | disassociated from the shared namespace. |
| 153 | |
| 154 | CLONE_VM |
| 155 | If CLONE_VM is set, the virtual memory of the caller is |
| 156 | disassociated from the shared virtual memory. |
| 157 | |
| 158 | RETURN VALUE |
| 159 | On success, zero is returned. On failure, -1 is returned and errno is set. |
| 160 | |
| 161 | ERRORS |
| 162 | EPERM CLONE_NEWNS was specified by a non-root process (process |
| 163 | without CAP_SYS_ADMIN). |
| 164 | |
| 165 | ENOMEM Cannot allocate sufficient memory to copy parts of caller's |
| 166 | context that need to be unshared. |
| 167 | |
| 168 | EINVAL Invalid flag was specified as an argument. |
| 169 | |
| 170 | CONFORMING TO |
| 171 | The unshare() call is Linux-specific and should not be used |
| 172 | in programs intended to be portable. |
| 173 | |
| 174 | SEE ALSO |
| 175 | clone(2), fork(2) |
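
The interface specified above can be exercised with a minimal sketch such
as the following. The helper name unshare_fs_info is hypothetical and not
part of any library; the sketch assumes a glibc-style wrapper for unshare
declared in <sched.h> when _GNU_SOURCE is defined. It uses CLONE_FS because
that flag does not require CAP_SYS_ADMIN according to the ERRORS list.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: disassociate the caller's file system information
 * from whatever it currently shares it with.
 * Returns 0 on success or a negative errno value on failure.
 */
int unshare_fs_info(void)
{
    if (unshare(CLONE_FS) == -1) {
        int err = errno;    /* save errno before calling anything else */

        fprintf(stderr, "unshare(CLONE_FS): %s\n", strerror(err));
        return -err;
    }
    return 0;
}
```

A caller would typically treat any negative return value as one of the
errors listed above, for example -EINVAL for an unsupported flag.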
| 176 | |
| 177 | 6) High Level Design |
| 178 | -------------------- |
| 179 | Depending on the flags argument, the unshare system call allocates |
| 180 | appropriate process context structures, populates them with values from |
| 181 | the current shared version, associates newly duplicated structures |
| 182 | with the current task structure and releases corresponding shared |
| 183 | versions. Helper functions of clone (copy_*) could not be used |
| 184 | directly by unshare for the following two reasons. |
| 185 | 1) clone operates on a newly allocated not-yet-active task |
| 186 | structure, whereas unshare operates on the current active |
| 187 | task. Therefore unshare has to take appropriate task_lock() |
| 188 | before associating newly duplicated context structures. |
| 189 | 2) unshare has to allocate and duplicate all context structures |
| 190 | that are being unshared, before associating them with the |
| 191 | current task and releasing older shared structures. Failure |
| 192 | to do so will create race conditions and/or oops when trying |
| 193 | to back out due to an error. Consider the case of unsharing |
| 194 | both virtual memory and namespace. After successfully unsharing |
| 195 | vm, if the system call encounters an error while allocating |
| 196 | new namespace structure, the error return path will have to |
| 197 | reverse the unsharing of vm. As part of the reversal the |
| 198 | system call will have to go back to older, shared, vm |
| 199 | structure, which may not exist anymore. |
| 200 | |
| 201 | Therefore code from copy_* functions that allocated and duplicated |
| 202 | current context structure was moved into new dup_* functions. Now, |
| 203 | copy_* functions call dup_* functions to allocate and duplicate |
| 204 | appropriate context structures and then associate them with the |
| 205 | task structure that is being constructed. The unshare system call, on |
| 206 | the other hand, performs the following: |
| 207 | 1) Check flags to force missing, but implied, flags |
| 208 | 2) For each context structure, call the corresponding unshare |
| 209 | helper function to allocate and duplicate a new context |
| 210 | structure, if the appropriate bit is set in the flags argument. |
| 211 | 3) If there is no error in allocation and duplication and there |
| 212 | are new context structures then lock the current task structure, |
| 213 | associate new context structures with the current task structure, |
| 214 | and release the lock on the current task structure. |
| 215 | 4) Appropriately release older, shared, context structures. |
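
The ordering of steps 2 through 4 can be illustrated with a user-space
analogy. This is only a sketch under stated assumptions, not the kernel
code: struct ctx, struct task, the SK_* flag bits and every function name
below are hypothetical, a pthread mutex stands in for task_lock(), and the
reference counts are plain integers for brevity.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical flag bits for this sketch only (not the CLONE_* values). */
#define SK_FS    0x1u
#define SK_FILES 0x2u

/* Reference-counted stand-in for a context structure. */
struct ctx {
    int refs;
    int value;
};

struct task {
    pthread_mutex_t lock;   /* plays the role of task_lock() */
    struct ctx *fs;
    struct ctx *files;
};

/* dup_* analogue: allocate a private copy, or return NULL on failure. */
struct ctx *dup_ctx(const struct ctx *old)
{
    struct ctx *c = malloc(sizeof(*c));

    if (c) {
        c->refs = 1;
        c->value = old->value;
    }
    return c;
}

/* Drop one reference and free the structure when none remain. */
void put_ctx(struct ctx *c)
{
    if (c && --c->refs == 0)
        free(c);
}

/* Returns 0 on success, -1 if a copy could not be allocated. */
int unshare_sketch(struct task *t, unsigned int flags)
{
    struct ctx *new_fs = NULL, *new_files = NULL, *tmp;

    /* Step 2: duplicate everything first; the task is not touched yet. */
    if ((flags & SK_FS) && t->fs->refs > 1) {
        new_fs = dup_ctx(t->fs);
        if (!new_fs)
            goto fail;
    }
    if ((flags & SK_FILES) && t->files->refs > 1) {
        new_files = dup_ctx(t->files);
        if (!new_files)
            goto fail;
    }

    /* Step 3: swap pointers under the lock; nothing here can fail. */
    if (new_fs || new_files) {
        pthread_mutex_lock(&t->lock);
        if (new_fs) {
            tmp = t->fs; t->fs = new_fs; new_fs = tmp;
        }
        if (new_files) {
            tmp = t->files; t->files = new_files; new_files = tmp;
        }
        pthread_mutex_unlock(&t->lock);
    }

    /* Step 4: after the swap the new_* variables hold the old structures. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return 0;

fail:
    /* Only private copies exist at this point, so backing out is trivial. */
    put_ctx(new_fs);
    put_ctx(new_files);
    return -1;
}
```

Because the task is modified only after every allocation has succeeded,
the failure path never needs to restore an older shared structure.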
| 216 | |
| 217 | 7) Low Level Design |
| 218 | ------------------- |
| 219 | Implementation of unshare can be grouped into the following four |
| 220 | items: |
| 221 | a) Reorganization of existing copy_* functions |
| 222 | b) unshare system call service function |
| 223 | c) unshare helper functions for each different process context |
| 224 | d) Registration of system call number for different architectures |
| 225 | |
| 226 | 7.1) Reorganization of copy_* functions |
| 227 | Each copy function such as copy_mm, copy_namespace, copy_files, |
| 228 | etc, had roughly two components. The first component allocated |
| 229 | and duplicated the appropriate structure and the second component |
| 230 | linked it to the task structure passed in as an argument to the copy |
| 231 | function. The first component was split into its own function. |
| 232 | These dup_* functions allocated and duplicated the appropriate |
| 233 | context structure. The reorganized copy_* functions invoked |
| 234 | their corresponding dup_* functions and then linked the newly |
| 235 | duplicated structures to the task structure with which the |
| 236 | copy function was called. |
| 237 | |
| 238 | 7.2) unshare system call service function |
| 239 | * Check flags |
| 240 | Force implied flags. If CLONE_THREAD is set force CLONE_VM. |
| 241 | If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is |
| 242 | set and signals are also being shared, force CLONE_THREAD. If |
| 243 | CLONE_NEWNS is set, force CLONE_FS. |
| 244 | * For each context flag, invoke the corresponding unshare_* |
| 245 | helper routine with flags passed into the system call and a |
| 246 | reference to a pointer to the new unshared structure. |
| 247 | * If any new structures are created by unshare_* helper |
| 248 | functions, take the task_lock() on the current task, |
| 249 | modify appropriate context pointers, and release the |
| 250 | task lock. |
| 251 | * For all newly unshared structures, release the corresponding |
| 252 | older, shared, structures. |
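
The "Check flags" step can be sketched in user space as follows. The
function name force_implied_flags and its signals_shared parameter are
assumptions made for illustration; in the kernel the decision about whether
signals are shared would come from the current task rather than from an
argument. The rules are applied in the order listed above.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Hypothetical helper: return flags with the implied flags forced on.
 * signals_shared is non-zero when the caller shares signals with
 * other tasks.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    /* Unsharing a thread from its thread group implies unsharing vm. */
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;

    /* Unsharing vm implies unsharing signal handlers. */
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;

    /* Unsharing signal handlers while signals are shared implies thread. */
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;

    /* Unsharing the namespace implies unsharing file system information. */
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;

    return flags;
}
```

For example, passing CLONE_NEWNS alone yields CLONE_NEWNS | CLONE_FS, which
is the behavior that test 2 of the Test Specification checks for.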
| 253 | |
| 254 | 7.3) unshare_* helper functions |
| 255 | For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND, |
| 256 | and CLONE_THREAD, return -EINVAL since they are not implemented yet. |
| 257 | For others, check the flag value to see if the unsharing is |
| 258 | required for that structure. If it is, invoke the corresponding |
| 259 | dup_* function to allocate and duplicate the structure and return |
| 260 | a pointer to it. |
| 261 | |
| 262 | 7.4) Appropriately modify architecture specific code to register the |
| 263 | new system call. |
| 264 | |
| 265 | 8) Test Specification |
| 266 | --------------------- |
| 267 | The test for unshare should test the following: |
| 268 | 1) Valid flags: Test to check that clone flags for signal and |
| 269 | signal handlers, for which unsharing is not implemented |
| 270 | yet, return -EINVAL. |
| 271 | 2) Missing/implied flags: Test to make sure that unsharing the |
| 272 | namespace without specifying unsharing of the filesystem correctly |
| 273 | unshares both namespace and filesystem information. |
| 274 | 3) For each of the four (namespace, filesystem, files and vm) |
| 275 | supported unsharing, verify that the system call correctly |
| 276 | unshares the appropriate structure. Verify that unsharing |
| 277 | them individually as well as in combination with each |
| 278 | other works as expected. |
| 279 | 4) Concurrent execution: Use shared memory segments and futex on |
| 280 | an address in the shm segment to synchronize execution of |
| 281 | about 10 threads. Have a couple of threads execute execve, |
| 282 | a couple _exit and the rest unshare with different combinations |
| 283 | of flags. Verify that unsharing is performed as expected and |
| 284 | that there are no oops or hangs. |
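
One possible building block for such tests is a probe that calls unshare
in a forked child, so that the test driver's own context is never altered.
This is a sketch only: probe_unshare is a hypothetical name, and the result
for a given flag depends on the kernel version and on the privileges of the
caller, so no particular outcome is assumed here.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical test helper: call unshare(flags) in a child process.
 * Returns 0 if unshare succeeded, the child's errno value (greater than
 * zero) if it failed, or -1 if the probe itself could not be run.
 */
int probe_unshare(int flags)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: report the result through the 8-bit exit status. */
        _exit(unshare(flags) == 0 ? 0 : errno);
    }

    /* Parent: collect the child's result. */
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

A test for item 1 could then compare probe_unshare(CLONE_SIGHAND) against
EINVAL, and a test for item 3 could call the probe once per supported flag
and once per combination.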
| 285 | |
| 286 | 9) Future Work |
| 287 | -------------- |
| 288 | The current implementation of unshare does not allow unsharing of |
| 289 | signals and signal handlers. Signals are complex to begin with, and |
| 290 | unsharing signals and/or signal handlers of a currently running |
| 291 | process is even more complex. If in the future there is a specific |
| 292 | need to allow unsharing of signals and/or signal handlers, it can |
| 293 | be incrementally added to unshare without affecting legacy |
| 294 | applications using unshare. |
| 295 | |