				2	unshare system call:
				3	--------------------
				4	This document describes the new system call, unshare. The document
				5	provides an overview of the feature, why it is needed, how it can
				6	be used, its interface specification, design, implementation and
				7	how it can be tested.
				8
				9	Change Log:
				10	-----------
				11	version 0.1 Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006
				12
				13	Contents:
				14	---------
				15	1) Overview
				16	2) Benefits
				17	3) Cost
				18	4) Requirements
				19	5) Functional Specification
				20	6) High Level Design
				21	7) Low Level Design
				22	8) Test Specification
				23	9) Future Work
				24
				25	1) Overview
				26	-----------
				27	Most legacy operating system kernels support an abstraction of threads
				28	as multiple execution contexts within a process. These kernels provide
				29	special resources and mechanisms to maintain these "threads". The Linux
				30	kernel, in a clever and simple manner, makes no distinction
				31	between processes and "threads". The kernel allows processes to share
				32	resources and thus they can achieve legacy "threads" behavior without
				33	requiring additional data structures and mechanisms in the kernel. The
				34	power of implementing threads in this manner comes not only from
				35	its simplicity but also from allowing application programmers to work
				36	outside the confinement of all-or-nothing shared resources of legacy
				37	threads. On Linux, at the time of thread creation using the clone system
				38	call, applications can selectively choose which resources to share
				39	between threads.
				40
				41	The unshare system call adds a primitive to the Linux thread model that
				42	allows threads to selectively 'unshare' any resources that were being
				43	shared at the time of their creation. unshare was conceptualized by
				44	Al Viro in August of 2000, on the Linux-Kernel mailing list, as part
				45	of the discussion on POSIX threads on Linux. unshare augments the
				46	usefulness of Linux threads for applications that would like to control
				47	shared resources without creating a new process. unshare is a natural
				48	addition to the set of available primitives on Linux that implement
				49	the concept of process/thread as a virtual machine.
				50
				51	2) Benefits
				52	-----------
				53	unshare would be useful to large application frameworks such as PAM
				54	where creating a new process to control sharing/unsharing of process
				55	resources is not possible. Since namespaces are shared by default
				56	when creating a new process using fork or clone, unshare can benefit
				57	even non-threaded applications if they have a need to disassociate
				58	from the default shared namespace. The following lists two use cases
				59	where unshare can be used.
				60
				61	2.1 Per-security context namespaces
				62	-----------------------------------
				63	unshare can be used to implement polyinstantiated directories using
				64	the kernel's per-process namespace mechanism. Polyinstantiated directories,
				65	such as per-user and/or per-security context instance of /tmp, /var/tmp or
				66	per-security context instance of a user's home directory, isolate user
				67	processes when working with these directories. Using unshare, a PAM
				68	module can easily set up a private namespace for a user at login.
				69	Polyinstantiated directories are required for Common Criteria certification
				70	with Labeled System Protection Profile; however, with the availability
				71	of the shared-tree feature in the Linux kernel, even regular Linux systems
				72	can benefit from setting up private namespaces at login and
				73	polyinstantiating /tmp, /var/tmp and other directories deemed
				74	appropriate by system administrators.
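
As a rough illustration of the PAM use case, the following C sketch shows
how a login-time helper might give the calling process a private namespace
and bind a per-user directory over /tmp. The function names and the
"<base>/<user>" directory layout are assumptions made for this example and
do not come from any existing PAM module. Both the unshare and the mount
calls need CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Build "<base>/<user>", the hypothetical per-user instance directory.
 * Returns 0 on success, -EINVAL for an unusable user name, and
 * -ENAMETOOLONG if the result does not fit in buf. */
int polyinst_path(char *buf, size_t size, const char *base, const char *user)
{
	int n;

	assert(buf && base && user);
	/* Refuse names that are empty or could escape the base directory. */
	if (user[0] == '\0' || strchr(user, '/'))
		return -EINVAL;
	n = snprintf(buf, size, "%s/%s", base, user);
	if (n < 0 || (size_t)n >= size)
		return -ENAMETOOLONG;
	return 0;
}

/* Give the caller a private namespace and bind the per-user instance
 * directory over /tmp. Returns 0 or a negative errno value. */
int setup_private_tmp(const char *base, const char *user)
{
	char inst[256];
	int err = polyinst_path(inst, sizeof(inst), base, user);

	if (err)
		return err;
	if (unshare(CLONE_NEWNS) == -1)
		return -errno;
	/* Keep later mounts from propagating back to the original namespace. */
	if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
		return -errno;
	/* Processes in this namespace now see the instance directory as /tmp. */
	if (mount(inst, "/tmp", NULL, MS_BIND, NULL) == -1)
		return -errno;
	return 0;
}
```

A caller running with sufficient privilege could invoke, for example,
setup_private_tmp("/tmp-inst", "alice") before starting the user's session;
the "/tmp-inst" location is likewise only an example.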
				75
				76	2.2 unsharing of virtual memory and/or open files
				77	-------------------------------------------------
				78	Consider a client/server application where the server is processing
				79	client requests by creating processes that share resources such as
				80	virtual memory and open files. Without unshare, the server has to
				81	decide what needs to be shared at the time of creating the process
				82	which services the request. unshare gives the server the ability to
				83	disassociate parts of the context during the servicing of the
				84	request. For large and complex middleware application frameworks, this
				85	ability to unshare after the process was created can be very
				86	useful.
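
The scenario above can be sketched in C as follows. This is a minimal
illustration, not code from a real server: a "server" creates a worker with
clone(CLONE_FILES) so that both share one descriptor table, and the worker
later calls unshare(CLONE_FILES) before closing a descriptor. The names
worker_fn and worker_close_is_isolated, and the use of a pipe as the shared
descriptor, are assumptions made for the example. It also assumes the stack
grows downward, as on most architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define WORKER_STACK_SIZE (64 * 1024)

/* Worker body: detach from the shared descriptor table, then close the
 * descriptor. Exit status 0 = unshared and closed, 1 = unshare failed. */
int worker_fn(void *arg)
{
	int fd;

	assert(arg != NULL);
	fd = *(int *)arg;
	if (unshare(CLONE_FILES) == -1)
		return 1;	/* leave the shared table untouched */
	close(fd);		/* affects only the worker's private table */
	return 0;
}

/* Returns 1 if the server's descriptor survived the worker's close(),
 * 0 if the worker was not allowed to unshare, -1 on any other error. */
int worker_close_is_isolated(void)
{
	int p[2], status = 0, result = -1;
	char *stack;
	pid_t pid;

	if (pipe(p) == -1)
		return -1;
	stack = malloc(WORKER_STACK_SIZE);
	if (!stack) {
		close(p[0]);
		close(p[1]);
		return -1;
	}
	/* CLONE_FILES: the worker shares the server's descriptor table. */
	pid = clone(worker_fn, stack + WORKER_STACK_SIZE,
		    CLONE_FILES | SIGCHLD, &p[0]);
	if (pid != -1 && waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
		if (WEXITSTATUS(status) == 0)
			result = (fcntl(p[0], F_GETFD) != -1) ? 1 : -1;
		else
			result = 0;
	}
	free(stack);
	close(p[0]);
	close(p[1]);
	return result;
}
```

Without the unshare call, the worker's close() would act on the shared
table and the server would find its descriptor closed as well.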
				87
				88	3) Cost
				89	-------
				90	In order to not duplicate code and to handle the fact that unshare
				91	works on an active task (as opposed to clone/fork working on a newly
				92	allocated inactive task) unshare had to make minor reorganizational
				93	changes to the copy_* functions used by the clone/fork system calls.
				94	There is a cost associated with altering existing, well tested and
				95	stable code to implement a new feature that may not get exercised
				96	extensively in the beginning. However, with proper design and code
				97	review of the changes and creation of an unshare test for the LTP,
				98	the benefits of this new feature can exceed its cost.
				99
				100	4) Requirements
				101	---------------
				102	unshare reverses sharing that was done using the clone(2) system call,
				103	so unshare should have an interface similar to clone(2). That is,
				104	since flags in clone(int flags, void *stack) specify what should
				105	be shared, similar flags in unshare(int flags) should specify
				106	what should be unshared. Unfortunately, this may appear to invert
				107	the meaning of the flags from the way they are used in clone(2).
				108	However, there was no easy solution that was less confusing and that
				109	allowed incremental context unsharing in the future without an ABI change.
				110
				111	The unshare interface should accommodate possible future addition of
				112	new context flags without requiring a rebuild of old applications.
				113	If and when new context flags are added, the unshare design should allow
				114	incremental unsharing of those resources on an as-needed basis.
				115
				116	5) Functional Specification
				117	---------------------------
				118	NAME
				119	unshare - disassociate parts of the process execution context
				120
				121	SYNOPSIS
				122	#include <sched.h>
				123
				124	int unshare(int flags);
				125
				126	DESCRIPTION
				127	unshare allows a process to disassociate parts of its execution
				128	context that are currently being shared with other processes. Part
				129	of execution context, such as the namespace, is shared by default
				130	when a new process is created using fork(2), while other parts,
				131	such as the virtual memory, open file descriptors, etc., may be
				132	shared by explicit request to share them when creating a process
				133	using clone(2).
				134
				135	The main use of unshare is to allow a process to control its
				136	shared execution context without creating a new process.
				137
				138	The flags argument is one of the following constants, or the
				139	bitwise OR of several of them.
				140
				141	CLONE_FS
				142	If CLONE_FS is set, file system information of the caller
				143	is disassociated from the shared file system information.
				144
				145	CLONE_FILES
				146	If CLONE_FILES is set, the file descriptor table of the
				147	caller is disassociated from the shared file descriptor
				148	table.
				149
				150	CLONE_NEWNS
				151	If CLONE_NEWNS is set, the namespace of the caller is
				152	disassociated from the shared namespace.
				153
				154	CLONE_VM
				155	If CLONE_VM is set, the virtual memory of the caller is
				156	disassociated from the shared virtual memory.
				157
				158	RETURN VALUE
				159	On success, zero is returned. On failure, -1 is returned and errno is set.
				160
				161	ERRORS
				162	EPERM CLONE_NEWNS was specified by a non-root process (process
				163	without CAP_SYS_ADMIN).
				164
				165	ENOMEM Cannot allocate sufficient memory to copy parts of caller's
				166	context that need to be unshared.
				167
				168	EINVAL Invalid flag was specified as an argument.
				169
				170	CONFORMING TO
				171	The unshare() call is Linux-specific and should not be used
				172	in programs intended to be portable.
				173
				174	SEE ALSO
				175	clone(2), fork(2)
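
A caller might wrap the interface specified above as in the following C
sketch. The helper names try_unshare and unshare_strerror are hypothetical
and exist only for this example; the messages simply restate the ERRORS
section.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Call unshare() and fold the (-1, errno) convention into one value:
 * 0 on success, a negative errno value on failure. */
int try_unshare(int flags)
{
	if (unshare(flags) == -1)
		return -errno;
	return 0;
}

/* Map the error codes listed under ERRORS to short explanations.
 * 'err' is a value returned by try_unshare(). */
const char *unshare_strerror(int err)
{
	assert(err <= 0);
	switch (err) {
	case 0:
		return "success";
	case -EPERM:
		return "CLONE_NEWNS requested without CAP_SYS_ADMIN";
	case -ENOMEM:
		return "not enough memory to copy the context";
	case -EINVAL:
		return "invalid flag";
	default:
		return "unexpected error";
	}
}
```

For example, a process that wants a private file descriptor table could
call try_unshare(CLONE_FILES) and report unshare_strerror() of the result
if it is nonzero.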
				176
				177	6) High Level Design
				178	--------------------
				179	Depending on the flags argument, the unshare system call allocates
				180	appropriate process context structures, populates it with values from
				181	the current shared version, associates newly duplicated structures
				182	with the current task structure and releases corresponding shared
				183	versions. Helper functions of clone (copy_*) could not be used
				184	directly by unshare for the following two reasons.
				185	1) clone operates on a newly allocated not-yet-active task
				186	structure, whereas unshare operates on the current active
				187	task. Therefore unshare has to take appropriate task_lock()
				188	before associating newly duplicated context structures.
				189	2) unshare has to allocate and duplicate all context structures
				190	that are being unshared, before associating them with the
				191	current task and releasing older shared structures. Failure
				192	to do so will create race conditions and/or oopses when trying
				193	to back out due to an error. Consider the case of unsharing
				194	both virtual memory and namespace. After successfully unsharing
				195	vm, if the system call encounters an error while allocating
				196	new namespace structure, the error return path will have to
				197	reverse the unsharing of vm. As part of the reversal the
				198	system call will have to go back to older, shared, vm
				199	structure, which may not exist anymore.
				200
				201	Therefore code from copy_* functions that allocated and duplicated
				202	current context structure was moved into new dup_* functions. Now,
				203	copy_* functions call dup_* functions to allocate and duplicate
				204	appropriate context structures and then associate them with the
				205	task structure that is being constructed. unshare system call on
				206	the other hand performs the following:
				207	1) Check flags to force missing, but implied, flags
				208	2) For each context structure, call the corresponding unshare
				209	helper function to allocate and duplicate a new context
				210	structure, if the appropriate bit is set in the flags argument.
				211	3) If there is no error in allocation and duplication and there
				212	are new context structures then lock the current task structure,
				213	associate new context structures with the current task structure,
				214	and release the lock on the current task structure.
				215	4) Appropriately release older, shared, context structures.
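
Steps 2 to 4 above can be modelled in user space as in the following C
sketch. It is only an illustration of the ordering: all structure and
function names (sim_ctx, sim_task, sim_dup, sim_put, sim_unshare) are
hypothetical, a pthread mutex stands in for task_lock(), and the fail_files
parameter exists solely to inject an allocation failure. Step 1 (forcing
implied flags) is left out.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define SIM_FS    0x1
#define SIM_FILES 0x2

/* Hypothetical stand-in for a kernel context structure. */
struct sim_ctx {
	int refcount;	/* number of tasks sharing this structure */
	int payload;	/* whatever state the structure carries */
};

struct sim_task {
	pthread_mutex_t lock;	/* plays the role of task_lock() */
	struct sim_ctx *fs;
	struct sim_ctx *files;
};

/* Allocate a private copy; 'fail' injects an allocation failure. */
struct sim_ctx *sim_dup(const struct sim_ctx *old, int fail)
{
	struct sim_ctx *copy = fail ? NULL : malloc(sizeof(*copy));

	if (copy) {
		copy->refcount = 1;
		copy->payload = old->payload;
	}
	return copy;
}

/* Drop one reference and free the structure if it was the last one. */
void sim_put(struct sim_ctx *ctx)
{
	if (ctx && --ctx->refcount == 0)
		free(ctx);
}

/* Returns 0 on success or -ENOMEM; on failure the task is unchanged. */
int sim_unshare(struct sim_task *t, int flags, int fail_files)
{
	struct sim_ctx *new_fs = NULL, *new_files = NULL;
	struct sim_ctx *old_fs = NULL, *old_files = NULL;

	assert(t && t->fs && t->files);

	/* Step 2: duplicate everything first, without touching the task. */
	if (flags & SIM_FS) {
		new_fs = sim_dup(t->fs, 0);
		if (!new_fs)
			return -ENOMEM;
	}
	if (flags & SIM_FILES) {
		new_files = sim_dup(t->files, fail_files);
		if (!new_files) {
			sim_put(new_fs);	/* only a private copy is dropped */
			return -ENOMEM;
		}
	}

	/* Step 3: switch pointers under the lock; nothing here can fail. */
	pthread_mutex_lock(&t->lock);
	if (new_fs) {
		old_fs = t->fs;
		t->fs = new_fs;
	}
	if (new_files) {
		old_files = t->files;
		t->files = new_files;
	}
	pthread_mutex_unlock(&t->lock);

	/* Step 4: release the older, shared structures. */
	sim_put(old_fs);
	sim_put(old_files);
	return 0;
}
```

Because every allocation happens before the first pointer is switched, the
error path never has to return to a shared structure that may already have
been released.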
				216
				217	7) Low Level Design
				218	-------------------
				219	Implementation of unshare can be grouped into the following four
				220	items:
				221	a) Reorganization of existing copy_* functions
				222	b) unshare system call service function
				223	c) unshare helper functions for each different process context
				224	d) Registration of system call number for different architectures
				225
				226	7.1) Reorganization of copy_* functions
				227	Each copy function such as copy_mm, copy_namespace, copy_files,
				228	etc., had roughly two components. The first component allocated
				229	and duplicated the appropriate structure and the second component
				230	linked it to the task structure passed in as an argument to the copy
				231	function. The first component was split into its own function.
				232	These dup_* functions allocated and duplicated the appropriate
				233	context structure. The reorganized copy_* functions invoked
				234	their corresponding dup_* functions and then linked the newly
				235	duplicated structures to the task structure with which the
				236	copy function was called.
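
The split can be pictured with the following user-space C sketch. The
types model_fs and model_task and the functions dup_model_fs and
copy_model_fs are hypothetical models written for this example; they are
not the kernel's structures or its copy_*/dup_* functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space model of a shareable context structure. */
struct model_fs {
	int refcount;
	char cwd[64];
};

struct model_task {
	struct model_fs *fs;
};

/* First component: allocate and duplicate. Knows nothing about tasks,
 * so both the clone path and the unshare path can call it. */
struct model_fs *dup_model_fs(const struct model_fs *old)
{
	struct model_fs *copy = malloc(sizeof(*copy));

	if (!copy)
		return NULL;
	copy->refcount = 1;
	memcpy(copy->cwd, old->cwd, sizeof(copy->cwd));
	return copy;
}

/* Second component: link a structure to the task being constructed.
 * share != 0 models clone with the sharing flag set; otherwise the
 * new task receives its own duplicate. */
int copy_model_fs(int share, struct model_task *child,
		  struct model_fs *parent_fs)
{
	assert(child && parent_fs);
	if (share) {
		parent_fs->refcount++;
		child->fs = parent_fs;
		return 0;
	}
	child->fs = dup_model_fs(parent_fs);
	return child->fs ? 0 : -ENOMEM;
}
```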
				237
				238	7.2) unshare system call service function
				239	* Check flags
				240	Force implied flags. If CLONE_THREAD is set, force CLONE_VM.
				241	If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is
				242	set and signals are also being shared, force CLONE_THREAD. If
				243	CLONE_NEWNS is set, force CLONE_FS.
				244	* For each context flag, invoke the corresponding unshare_*
				245	helper routine with flags passed into the system call and a
				246	reference to the pointer that will point to the new unshared structure.
				247	* If any new structures are created by unshare_* helper
				248	functions, take the task_lock() on the current task,
				249	modify appropriate context pointers, and release the
				250	task lock.
				251	* For all newly unshared structures, release the corresponding
				252	older, shared, structures.
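
The flag check in the first step can be written down directly. The sketch
below applies the four rules in the order they are listed; the function
name force_implied_flags and the signals_shared parameter (standing in for
"signals are also being shared") are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return 'flags' with every implied flag added. 'signals_shared' is
 * nonzero when the caller's signals are shared with another task. */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
	/* Unsharing the thread group implies unsharing virtual memory. */
	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;
	/* Unsharing virtual memory implies unsharing signal handlers. */
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;
	/* Unsharing handlers while signals are shared implies CLONE_THREAD. */
	if ((flags & CLONE_SIGHAND) && signals_shared)
		flags |= CLONE_THREAD;
	/* Unsharing the namespace implies unsharing filesystem information. */
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;

	/* Postcondition: a namespace unshare always carries CLONE_FS. */
	assert(!(flags & CLONE_NEWNS) || (flags & CLONE_FS));
	return flags;
}
```

For example, a request for CLONE_NEWNS alone leaves this function as
CLONE_NEWNS | CLONE_FS, and a request for CLONE_FILES is left unchanged.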
				253
				254	7.3) unshare_* helper functions
				255	For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND,
				256	and CLONE_THREAD, return -EINVAL since they are not implemented yet.
				257	For others, check the flag value to see if the unsharing is
				258	required for that structure. If it is, invoke the corresponding
				259	dup_* function to allocate and duplicate the structure and return
				260	a pointer to it.
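
The contract of the helpers can be sketched in user-space C as follows.
The structure model_files and the functions dup_model_files,
unshare_model_sighand and unshare_model_files are hypothetical models for
this example and are not the kernel's helpers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

/* Hypothetical user-space model of a file descriptor table. */
struct model_files {
	int refcount;
	int next_fd;
};

/* Allocate and duplicate, in the spirit of a dup_* function. */
struct model_files *dup_model_files(const struct model_files *old)
{
	struct model_files *copy = malloc(sizeof(*copy));

	if (copy) {
		copy->refcount = 1;
		copy->next_fd = old->next_fd;
	}
	return copy;
}

/* Helper for a context whose unsharing is not implemented:
 * asking for it is an error, not asking for it is fine. */
int unshare_model_sighand(unsigned long flags)
{
	return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/* Helper for an implemented context: duplicate only when the flag
 * asks for it. *new_out stays NULL when nothing had to be unshared. */
int unshare_model_files(unsigned long flags, const struct model_files *cur,
			struct model_files **new_out)
{
	assert(cur && new_out);
	*new_out = NULL;
	if (!(flags & CLONE_FILES))
		return 0;
	*new_out = dup_model_files(cur);
	return *new_out ? 0 : -ENOMEM;
}
```

The caller can then tell from a non-NULL *new_out which structures need to
be switched under the task lock.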
				261
				262	7.4) Appropriately modify architecture specific code to register the
				263	new system call.
				264
				265	8) Test Specification
				266	---------------------
				267	The test for unshare should test the following:
				268	1) Valid flags: Test to check that clone flags for signal and
				269	signal handlers, for which unsharing is not implemented
				270	yet, return -EINVAL.
				271	2) Missing/implied flags: Test to make sure that unsharing the
				272	namespace without specifying unsharing of the filesystem correctly
				273	unshares both namespace and filesystem information.
				274	3) For each of the four (namespace, filesystem, files and vm)
				275	supported unsharing, verify that the system call correctly
				276	unshares the appropriate structure. Verify that unsharing
				277	them individually as well as in combination with each
				278	other works as expected.
				279	4) Concurrent execution: Use shared memory segments and futex on
				280	an address in the shm segment to synchronize execution of
				281	about 10 threads. Have a couple of threads execute execve,
				282	a couple _exit and the rest unshare with different combinations
				283	of flags. Verify that unsharing is performed as expected and
				284	that there are no oopses or hangs.
				285
				286	9) Future Work
				287	--------------
				288	The current implementation of unshare does not allow unsharing of
				289	signals and signal handlers. Signals are complex to begin with and
				290	to unshare signals and/or signal handlers of a currently running
				291	process is even more complex. If in the future there is a specific
				292	need to allow unsharing of signals and/or signal handlers, it can
				293	be incrementally added to unshare without affecting legacy
				294	applications using unshare.
				295