Blame - Documentation/cgroup-v2.txt - kernel/msm-5.4

blob: 74cdeaed9f7afb4f0b514f1213b38b2eda106f56

Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1	================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	2	Control Group v2
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	3	================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	4
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	5	:Date: October, 2015
				6	:Author: Tejun Heo <tj@kernel.org>
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	7
				8	This is the authoritative documentation on the design, interface and
				9	conventions of cgroup v2. It describes all userland-visible aspects
				10	of cgroup including core and specific controller behaviors. All
				11	future changes must be reflected in this document. Documentation for
W. Trevor King	9a2ddda	2016-01-27 13:01:52 -0800	[diff] [blame]	12	v1 is available under Documentation/cgroup-v1/.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	13
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	14	.. CONTENTS
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	15
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	16	1. Introduction
				17	1-1. Terminology
				18	1-2. What is cgroup?
				19	2. Basic Operations
				20	2-1. Mounting
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	21	2-2. Organizing Processes and Threads
				22	2-2-1. Processes
				23	2-2-2. Threads
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	24	2-3. [Un]populated Notification
				25	2-4. Controlling Controllers
				26	2-4-1. Enabling and Disabling
				27	2-4-2. Top-down Constraint
				28	2-4-3. No Internal Process Constraint
				29	2-5. Delegation
				30	2-5-1. Model of Delegation
				31	2-5-2. Delegation Containment
				32	2-6. Guidelines
				33	2-6-1. Organize Once and Control
				34	2-6-2. Avoid Name Collisions
				35	3. Resource Distribution Models
				36	3-1. Weights
				37	3-2. Limits
				38	3-3. Protections
				39	3-4. Allocations
				40	4. Interface Files
				41	4-1. Format
				42	4-2. Conventions
				43	4-3. Core Interface Files
				44	5. Controllers
				45	5-1. CPU
				46	5-1-1. CPU Interface Files
				47	5-2. Memory
				48	5-2-1. Memory Interface Files
				49	5-2-2. Usage Guidelines
				50	5-2-3. Memory Ownership
				51	5-3. IO
				52	5-3-1. IO Interface Files
				53	5-3-2. Writeback
				54	5-4. PID
				55	5-4-1. PID Interface Files
Roman Gushchin	4ad5a32	2017-12-13 19:49:03 +0000	[diff] [blame]	56	5-5. Device
				57	5-6. RDMA
				58	5-6-1. RDMA Interface Files
				59	5-7. Misc
				60	5-7-1. perf_event
Maciej S. Szmigiero	c4e0842	2018-01-10 23:33:19 +0100	[diff] [blame]	61	5-N. Non-normative information
				62	5-N-1. CPU controller root cgroup process behaviour
				63	5-N-2. IO controller root cgroup process behaviour
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	64	6. Namespace
				65	6-1. Basics
				66	6-2. The Root and Views
				67	6-3. Migration and setns(2)
				68	6-4. Interaction with Other Namespaces
				69	P. Information on Kernel Programming
				70	P-1. Filesystem Support for Writeback
				71	D. Deprecated v1 Core Features
				72	R. Issues with v1 and Rationales for v2
				73	R-1. Multiple Hierarchies
				74	R-2. Thread Granularity
				75	R-3. Competition Between Inner Nodes and Threads
				76	R-4. Other Interface Issues
				77	R-5. Controller Issues and Remedies
				78	R-5-1. Memory
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	79
				80
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	81	Introduction
				82	============
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	83
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	84	Terminology
				85	-----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	86
				87	"cgroup" stands for "control group" and is never capitalized. The
				88	singular form is used to designate the whole feature and also as a
				89	qualifier as in "cgroup controllers". When explicitly referring to
				90	multiple individual control groups, the plural form "cgroups" is used.
				91
				92
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	93	What is cgroup?
				94	---------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	95
				96	cgroup is a mechanism to organize processes hierarchically and
				97	distribute system resources along the hierarchy in a controlled and
				98	configurable manner.
				99
				100	cgroup is largely composed of two parts - the core and controllers.
				101	cgroup core is primarily responsible for hierarchically organizing
				102	processes. A cgroup controller is usually responsible for
				103	distributing a specific type of system resource along the hierarchy
				104	although there are utility controllers which serve purposes other than
				105	resource distribution.
				106
				107	cgroups form a tree structure and every process in the system belongs
				108	to one and only one cgroup. All threads of a process belong to the
				109	same cgroup. On creation, all processes are put in the cgroup that
				110	the parent process belongs to at the time. A process can be migrated
				111	to another cgroup. Migration of a process doesn't affect already
				112	existing descendant processes.
				113
				114	Following certain structural constraints, controllers may be enabled or
				115	disabled selectively on a cgroup. All controller behaviors are
				116	hierarchical - if a controller is enabled on a cgroup, it affects all
				117	processes which belong to the cgroups comprising the inclusive
				118	sub-hierarchy of the cgroup. When a controller is enabled on a nested
				119	cgroup, it always restricts the resource distribution further. The
				120	restrictions set closer to the root in the hierarchy can not be
				121	overridden from further away.
				122
				123
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	124	Basic Operations
				125	================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	126
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	127	Mounting
				128	--------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	129
				130	Unlike v1, cgroup v2 has only a single hierarchy. The cgroup v2
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	131	hierarchy can be mounted with the following mount command::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	132
				133	  # mount -t cgroup2 none $MOUNT_POINT
				134
				135	cgroup2 filesystem has the magic number 0x63677270 ("cgrp"). All
				136	controllers which support v2 and are not bound to a v1 hierarchy are
				137	automatically bound to the v2 hierarchy and show up at the root.
				138	Controllers which are not in active use in the v2 hierarchy can be
				139	bound to other hierarchies. This allows mixing v2 hierarchy with the
				140	legacy v1 multiple hierarchies in a fully backward compatible way.
				141
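As a rough illustration of the paragraph above, the following Python
sketch decodes the magic number into its four ASCII bytes and picks
cgroup2 entries out of text laid out like /proc/mounts. The helper
names and the sample mount table are made up for this example and are
not part of any kernel interface.

```python
CGROUP2_MAGIC = 0x63677270  # value quoted in the text above


def magic_as_text(magic):
    """Decode a 32-bit filesystem magic number into its four ASCII bytes."""
    return magic.to_bytes(4, "big").decode("ascii")


def cgroup2_mounts(mounts_text):
    """Return (mount point, options) for every cgroup2 entry.

    mounts_text is expected to use the /proc/mounts layout:
    device, mount point, filesystem type, options, dump, pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "cgroup2":
            found.append((fields[1], fields[3].split(",")))
    return found


# Hypothetical mount table used only for illustration.
sample = (
    "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0\n"
    "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0\n"
)
magic_text = magic_as_text(CGROUP2_MAGIC)
mounts = cgroup2_mounts(sample)
print(magic_text, mounts)
```

On a live system the same function could be fed the contents of
/proc/self/mounts to find where the v2 hierarchy is mounted.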
				142	A controller can be moved across hierarchies only after the controller
				143	is no longer referenced in its current hierarchy. Because per-cgroup
				144	controller states are destroyed asynchronously and controllers may
				145	have lingering references, a controller may not show up immediately on
				146	the v2 hierarchy after the final umount of the previous hierarchy.
				147	Similarly, a controller should be fully disabled to be moved out of
				148	the unified hierarchy and it may take some time for the disabled
				149	controller to become available for other hierarchies; furthermore, due
				150	to inter-controller dependencies, other controllers may need to be
				151	disabled too.
				152
				153	While useful for development and manual configurations, moving
				154	controllers dynamically between the v2 and other hierarchies is
				155	strongly discouraged for production use. It is recommended to decide
				156	the hierarchies and controller associations before starting to use the
				157	controllers after system boot.
				158
Johannes Weiner	1619b6d	2016-02-16 13:21:14 -0500	[diff] [blame]	159	During transition to v2, system management software might still
				160	automount the v1 cgroup filesystem and so hijack all controllers
				161	during boot, before manual intervention is possible. To make testing
				162	and experimenting easier, the kernel parameter cgroup_no_v1= allows
				163	disabling controllers in v1 and makes them always available in v2.
				164
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	165	cgroup v2 currently supports the following mount options.
				166
				167	nsdelegate
				168
				169	Consider cgroup namespaces as delegation boundaries. This
				170	option is system wide and can only be set on mount or modified
				171	through remount from the init namespace. The mount option is
				172	ignored on non-init namespace mounts. Please refer to the
				173	Delegation section for details.
				174
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	175
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	176	Organizing Processes and Threads
				177	--------------------------------
				178
				179	Processes
				180	~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	181
				182	Initially, only the root cgroup exists to which all processes belong.
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	183	A child cgroup can be created by creating a sub-directory::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	184
				185	  # mkdir $CGROUP_NAME
				186
				187	A given cgroup may have multiple child cgroups forming a tree
				188	structure. Each cgroup has a read-writable interface file
				189	"cgroup.procs". When read, it lists the PIDs of all processes which
				190	belong to the cgroup one-per-line. The PIDs are not ordered and the
				191	same PID may show up more than once if the process got moved to
				192	another cgroup and then back or the PID got recycled while reading.
				193
				194	A process can be migrated into a cgroup by writing its PID to the
				195	target cgroup's "cgroup.procs" file. Only one process can be migrated
				196	on a single write(2) call. If a process is composed of multiple
				197	threads, writing the PID of any thread migrates all threads of the
				198	process.
				199
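The read and write rules for "cgroup.procs" described above can be
sketched in Python as follows. This is a minimal sketch, not a
reference implementation: the function names are invented here, and the
demo runs against a scratch directory that merely stands in for a
cgroup directory, so no real migration takes place.

```python
import os
import tempfile


def read_pids(cgroup_dir):
    """Return the PIDs listed in cgroup.procs with duplicates collapsed."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return {int(line) for line in f if line.strip()}


def migrate(pid, cgroup_dir):
    """Move one process by issuing a single write(2) that carries its PID."""
    fd = os.open(os.path.join(cgroup_dir, "cgroup.procs"), os.O_WRONLY)
    try:
        os.write(fd, str(pid).encode("ascii"))
    finally:
        os.close(fd)


# Demo against a scratch directory standing in for a cgroup directory.
with tempfile.TemporaryDirectory() as scratch:
    procs_path = os.path.join(scratch, "cgroup.procs")
    with open(procs_path, "w") as f:
        f.write("12\n7\n12\n")      # PID 12 shows up twice
    listed = read_pids(scratch)
    open(procs_path, "w").close()   # empty the stand-in file
    migrate(4242, scratch)
    with open(procs_path) as f:
        written = f.read()
```

Collapsing the listing into a set sidesteps the repeated-PID case
mentioned above. Migrating several processes means calling migrate()
once per PID.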
				200	When a process forks a child process, the new process is born into the
				201	cgroup that the forking process belongs to at the time of the
				202	operation. After exit, a process stays associated with the cgroup
				203	that it belonged to at the time of exit until it's reaped; however, a
				204	zombie process does not appear in "cgroup.procs" and thus can't be
				205	moved to another cgroup.
				206
				207	A cgroup which doesn't have any children or live processes can be
				208	destroyed by removing the directory. Note that a cgroup which doesn't
				209	have any children and is associated only with zombie processes is
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	210	considered empty and can be removed::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	211
				212	  # rmdir $CGROUP_NAME
				213
				214	"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
				215	cgroup is in use in the system, this file may contain multiple lines,
				216	one for each hierarchy. The entry for cgroup v2 is always in the
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	217	format "0::$PATH"::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	218
				219	  # cat /proc/842/cgroup
				220	  ...
				221	  0::/test-cgroup/test-cgroup-nested
				222
				223	If the process becomes a zombie and the cgroup it was associated with
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	224	is removed subsequently, " (deleted)" is appended to the path::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	225
				226	  # cat /proc/842/cgroup
				227	  ...
				228	  0::/test-cgroup/test-cgroup-nested (deleted)
				229
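A small Python sketch of how the "0::$PATH" entry and the " (deleted)"
suffix described above might be parsed. The function name and the
legacy line in the sample are assumptions made for illustration. Note
that a cgroup whose real name ends in " (deleted)" cannot be told apart
by this simple approach.

```python
def v2_membership(proc_cgroup_text):
    """Return (path, deleted) for the cgroup v2 entry, or None if absent."""
    suffix = " (deleted)"
    for line in proc_cgroup_text.splitlines():
        if line.startswith("0::"):
            path = line[len("0::"):]
            if path.endswith(suffix):
                return path[: -len(suffix)], True
            return path, False
    return None


# The first line imitates a legacy v1 entry and is purely illustrative.
live = v2_membership("4:memory:/legacy\n0::/test-cgroup/test-cgroup-nested\n")
gone = v2_membership("0::/test-cgroup/test-cgroup-nested (deleted)\n")
print(live, gone)
```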
				230
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	231	Threads
				232	~~~~~~~
				233
				234	cgroup v2 supports thread granularity for a subset of controllers to
				235	support use cases requiring hierarchical resource distribution across
				236	the threads of a group of processes. By default, all threads of a
				237	process belong to the same cgroup, which also serves as the resource
				238	domain to host resource consumptions which are not specific to a
				239	process or thread. The thread mode allows threads to be spread across
				240	a subtree while still maintaining the common resource domain for them.
				241
				242	Controllers which support thread mode are called threaded controllers.
				243	The ones which don't are called domain controllers.
				244
				245	Marking a cgroup threaded makes it join the resource domain of its
				246	parent as a threaded cgroup. The parent may be another threaded
				247	cgroup whose resource domain is further up in the hierarchy. The root
				248	of a threaded subtree, that is, the nearest ancestor which is not
				249	threaded, is called the threaded domain or thread root interchangeably and
				250	serves as the resource domain for the entire subtree.
				251
				252	Inside a threaded subtree, threads of a process can be put in
				253	different cgroups and are not subject to the no internal process
				254	constraint - threaded controllers can be enabled on non-leaf cgroups
				255	whether they have threads in them or not.
				256
				257	As the threaded domain cgroup hosts all the domain resource
				258	consumptions of the subtree, it is considered to have internal
				259	resource consumptions whether there are processes in it or not and
				260	can't have populated child cgroups which aren't threaded. Because the
				261	root cgroup is not subject to the no internal process constraint, it can
				262	serve both as a threaded domain and a parent to domain cgroups.
				263
				264	The current operation mode or type of the cgroup is shown in the
				265	"cgroup.type" file which indicates whether the cgroup is a normal
				266	domain, a domain which is serving as the domain of a threaded subtree,
				267	or a threaded cgroup.
				268
				269	On creation, a cgroup is always a domain cgroup and can be made
				270	threaded by writing "threaded" to the "cgroup.type" file. The
				271	operation is single direction::
				272
				273	  # echo threaded > cgroup.type
				274
				275	Once threaded, the cgroup can't be made a domain again. To enable the
				276	thread mode, the following conditions must be met.
				277
				278	- As the cgroup will join the parent's resource domain, the parent
				279	must either be a valid (threaded) domain or a threaded cgroup.
				280
Tejun Heo	918a8c2	2017-07-23 08:18:26 -0400	[diff] [blame]	281	- When the parent is an unthreaded domain, it must not have any domain
				282	controllers enabled or populated domain children. The root is
				283	exempt from this requirement.
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	284
				285	Topology-wise, a cgroup can be in an invalid state. Please consider
Vladimir Rutsky	2877cbe	2018-01-02 17:27:41 +0100	[diff] [blame]	286	the following topology::
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	287
				288	  A (threaded domain) - B (threaded) - C (domain, just created)
				289
				290	C is created as a domain but isn't connected to a parent which can
				291	host child domains. C can't be used until it is turned into a
				292	threaded cgroup. "cgroup.type" file will report "domain (invalid)" in
				293	these cases. Operations which fail due to invalid topology use
				294	EOPNOTSUPP as the errno.
				295
				296	A domain cgroup is turned into a threaded domain when one of its child
				297	cgroups becomes threaded or threaded controllers are enabled in the
				298	"cgroup.subtree_control" file while there are processes in the cgroup.
				299	A threaded domain reverts to a normal domain when the conditions
				300	clear.
				301
				302	When read, "cgroup.threads" contains the list of the thread IDs of all
				303	threads in the cgroup. Except that the operations are per-thread
				304	instead of per-process, "cgroup.threads" has the same format and
				305	behaves the same way as "cgroup.procs". While "cgroup.threads" can be
				306	written to in any cgroup, as it can only move threads inside the same
				307	threaded domain, its operations are confined inside each threaded
				308	subtree.
				309
				310	The threaded domain cgroup serves as the resource domain for the whole
				311	subtree, and, while the threads can be scattered across the subtree,
				312	all the processes are considered to be in the threaded domain cgroup.
				313	"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
				314	processes in the subtree and is not readable in the subtree proper.
				315	However, "cgroup.procs" can be written to from anywhere in the subtree
				316	to migrate all threads of the matching process to the cgroup.
				317
				318	Only threaded controllers can be enabled in a threaded subtree. When
				319	a threaded controller is enabled inside a threaded subtree, it only
				320	accounts for and controls resource consumptions associated with the
				321	threads in the cgroup and its descendants. All consumptions which
				322	aren't tied to a specific thread belong to the threaded domain cgroup.
				323
				324	Because a threaded subtree is exempt from the no internal process
				325	constraint, a threaded controller must be able to handle competition
				326	between threads in a non-leaf cgroup and its child cgroups. Each
				327	threaded controller defines how such competitions are handled.
				328
				329
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	330	[Un]populated Notification
				331	--------------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	332
				333	Each non-root cgroup has a "cgroup.events" file which contains a
				334	"populated" field indicating whether the cgroup's sub-hierarchy has
				335	live processes in it. Its value is 0 if there is no live process in
				336	the cgroup and its descendants; otherwise, 1. poll and [id]notify
				337	events are triggered when the value changes. This can be used, for
				338	example, to start a clean-up operation after all processes of a given
				339	sub-hierarchy have exited. The populated state updates and
				340	notifications are recursive. Consider the following sub-hierarchy
				341	where the numbers in the parentheses represent the numbers of processes
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	342	in each cgroup::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	343
				344	  A(4) - B(0) - C(1)
				345	              \ D(0)
				346
				347	A, B and C's "populated" fields would be 1 while D's would be 0. After
				348	the one process in C exits, B and C's "populated" fields would flip to
				349	"0" and file modified events will be generated on the "cgroup.events"
				350	files of both cgroups.
				351
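The recursive nature of the "populated" field can be modelled in a few
lines of Python. This is only a toy model of the example above, with D
taken to be a child of B. The dictionaries and the function name are
invented here and do not reflect how the kernel tracks the state.

```python
def populated(tree, counts, name):
    """Return 1 if `name` or any of its descendants has a live process."""
    if counts[name] > 0:
        return 1
    children = tree.get(name, [])
    return int(any(populated(tree, counts, child) for child in children))


tree = {"A": ["B"], "B": ["C", "D"]}          # parent -> children
counts = {"A": 4, "B": 0, "C": 1, "D": 0}     # live processes per cgroup

before = {name: populated(tree, counts, name) for name in counts}
counts["C"] = 0                               # the one process in C exits
after = {name: populated(tree, counts, name) for name in counts}

# cgroups whose "cgroup.events" file would see a modification event
flipped = sorted(name for name in counts if before[name] != after[name])
print(before, after, flipped)
```

A stays populated because it still holds four processes of its own, so
only B and C flip.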
				352
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	353	Controlling Controllers
				354	-----------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	355
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	356	Enabling and Disabling
				357	~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	358
				359	Each cgroup has a "cgroup.controllers" file which lists all
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	360	controllers available for the cgroup to enable::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	361
				362	  # cat cgroup.controllers
				363	  cpu io memory
				364
				365	No controller is enabled by default. Controllers can be enabled and
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	366	disabled by writing to the "cgroup.subtree_control" file::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	367
				368	  # echo "+cpu +memory -io" > cgroup.subtree_control
				369
				370	Only controllers which are listed in "cgroup.controllers" can be
				371	enabled. When multiple operations are specified as above, they either
				372	all succeed or all fail. If multiple operations on the same controller
				373	are specified, the last one is effective.
				374
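The write semantics just described (all-or-nothing, last operation on a
controller wins) can be mimicked with the following Python model. It is
a sketch under stated assumptions: the function name is invented, plain
sets stand in for the two interface files, and ValueError stands in for
whatever errno the kernel would actually return.

```python
def apply_subtree_control(enabled, available, request):
    """Model one write to cgroup.subtree_control and return the new set.

    Nothing is changed if any operation is invalid; ValueError is raised.
    """
    wanted = {}  # controller name -> True to enable, False to disable
    for token in request.split():
        sign, name = token[0], token[1:]
        if sign not in ("+", "-") or not name:
            raise ValueError("malformed operation: %r" % token)
        if sign == "+" and name not in available:
            raise ValueError("controller not available: %r" % name)
        wanted[name] = (sign == "+")   # a later operation overrides an earlier one
    result = set(enabled)
    for name, enable in wanted.items():
        if enable:
            result.add(name)
        else:
            result.discard(name)
    return result


available = {"cpu", "io", "memory"}            # as listed in cgroup.controllers
result = apply_subtree_control(set(), available, "+cpu +memory -io")
last_wins = apply_subtree_control({"cpu"}, available, "+io -io")
try:
    apply_subtree_control(set(), available, "+cpu +pids")
    rejected = False
except ValueError:
    rejected = True                            # "+cpu" is not applied either
```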
				375	Enabling a controller in a cgroup indicates that the distribution of
				376	the target resource across its immediate children will be controlled.
				377	Consider the following sub-hierarchy. The enabled controllers are
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	378	listed in parentheses::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	379
				380	  A(cpu,memory) - B(memory) - C()
				381	                            \ D()
				382
				383	As A has "cpu" and "memory" enabled, A will control the distribution
				384	of CPU cycles and memory to its children, in this case, B. As B has
				385	"memory" enabled but not "cpu", C and D will compete freely on CPU
				386	cycles but their division of memory available to B will be controlled.
				387
				388	As a controller regulates the distribution of the target resource to
				389	the cgroup's children, enabling it creates the controller's interface
				390	files in the child cgroups. In the above example, enabling "cpu" on B
				391	would create the "cpu." prefixed controller interface files in C and
				392	D. Likewise, disabling "memory" from B would remove the "memory."
				393	prefixed controller interface files from C and D. This means that the
				394	controller interface files - anything which doesn't start with
				395	"cgroup." - are owned by the parent rather than the cgroup itself.
				396
				397
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	398	Top-down Constraint
				399	~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	400
				401	Resources are distributed top-down and a cgroup can further distribute
				402	a resource only if the resource has been distributed to it from the
				403	parent. This means that all non-root "cgroup.subtree_control" files
				404	can only contain controllers which are enabled in the parent's
				405	"cgroup.subtree_control" file. A controller can be enabled only if
				406	the parent has the controller enabled and a controller can't be
				407	disabled if one or more children have it enabled.
				408
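The two rules above can be checked against the A/B/C/D example from the
previous section with a short Python model. The data structures and
function names are hypothetical. A is treated as the top of the
modelled hierarchy, so its own parent is not checked.

```python
def can_enable(subtree_control, parent_of, cgroup, controller):
    """A cgroup may enable only what its parent has enabled."""
    parent = parent_of.get(cgroup)
    if parent is None:      # top of the model: parent not represented here
        return True
    return controller in subtree_control[parent]


def can_disable(subtree_control, parent_of, cgroup, controller):
    """A controller can't be disabled while a child still has it enabled."""
    children = [c for c, p in parent_of.items() if p == cgroup]
    return all(controller not in subtree_control[c] for c in children)


# cgroup -> contents of its "cgroup.subtree_control"
subtree_control = {"A": {"cpu", "memory"}, "B": {"memory"}, "C": set(), "D": set()}
parent_of = {"B": "A", "C": "B", "D": "B"}

checks = {
    "B +cpu": can_enable(subtree_control, parent_of, "B", "cpu"),
    "B +io": can_enable(subtree_control, parent_of, "B", "io"),
    "A -memory": can_disable(subtree_control, parent_of, "A", "memory"),
    "A -cpu": can_disable(subtree_control, parent_of, "A", "cpu"),
}
print(checks)
```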
				409
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	410	No Internal Process Constraint
				411	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	412
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	413	Non-root cgroups can distribute domain resources to their children
				414	only when they don't have any processes of their own. In other words,
				415	only domain cgroups which don't contain any processes can have domain
				416	controllers enabled in their "cgroup.subtree_control" files.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	417
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	418	This guarantees that, when a domain controller is looking at the part
				419	of the hierarchy which has it enabled, processes are always only on
				420	the leaves. This rules out situations where child cgroups compete
				421	against internal processes of the parent.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	422
				423	The root cgroup is exempt from this restriction. Root contains
				424	processes and anonymous resource consumption which can't be associated
				425	with any other cgroups and requires special treatment from most
				426	controllers. How resource consumption in the root cgroup is governed
Maciej S. Szmigiero	c4e0842	2018-01-10 23:33:19 +0100	[diff] [blame]	427	is up to each controller (for more information on this topic please
				428	refer to the Non-normative information section in the Controllers
				429	chapter).
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	430
				431	Note that the restriction doesn't get in the way if there is no
				432	enabled controller in the cgroup's "cgroup.subtree_control". This is
				433	important as otherwise it wouldn't be possible to create children of a
				434	populated cgroup. To control resource distribution of a cgroup, the
				435	cgroup must create children and transfer all its processes to the
				436	children before enabling controllers in its "cgroup.subtree_control"
				437	file.
				438
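The order of operations in the last paragraph (create a child, move
every process into it, then enable controllers) might look like the
Python sketch below. The function name and the "leaf" child name are
invented for illustration. The demo runs on a scratch directory, where
the files are ordinary files, so unlike on a real cgroup2 mount the
parent's "cgroup.procs" is not emptied by the writes. On a live system
the sequence is also racy if processes fork while it runs.

```python
import os
import tempfile


def push_down_and_enable(cgroup_dir, leaf_name, controllers):
    """Create a child, move all processes into it, then enable controllers."""
    leaf = os.path.join(cgroup_dir, leaf_name)
    os.mkdir(leaf)
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        pids = [line.strip() for line in f if line.strip()]
    for pid in pids:
        # Reopen per PID so that each PID goes out in its own write.
        with open(os.path.join(leaf, "cgroup.procs"), "a") as dst:
            dst.write(pid + "\n")
    request = " ".join("+" + name for name in controllers)
    with open(os.path.join(cgroup_dir, "cgroup.subtree_control"), "w") as ctl:
        ctl.write(request + "\n")
    return pids


# Demo on a scratch directory standing in for a populated cgroup.
with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "cgroup.procs"), "w") as f:
        f.write("101\n102\n")
    moved = push_down_and_enable(scratch, "leaf", ["cpu", "memory"])
    with open(os.path.join(scratch, "leaf", "cgroup.procs")) as f:
        leaf_procs = f.read()
    with open(os.path.join(scratch, "cgroup.subtree_control")) as f:
        control = f.read()
```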
				439
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	440	Delegation
				441	----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	442
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	443	Model of Delegation
				444	~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	445
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	446	A cgroup can be delegated in two ways. First, to a less privileged
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	447	user by granting write access of the directory and its "cgroup.procs",
				448	"cgroup.threads" and "cgroup.subtree_control" files to the user.
				449	Second, if the "nsdelegate" mount option is set, automatically to a
				450	cgroup namespace on namespace creation.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	451
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	452	Because the resource control interface files in a given directory
				453	control the distribution of the parent's resources, the delegatee
				454	shouldn't be allowed to write to them. For the first method, this is
				455	achieved by not granting access to these files. For the second, the
				456	kernel rejects writes to all files other than "cgroup.procs" and
				457	"cgroup.subtree_control" on a namespace root from inside the
				458	namespace.
				459
				460	The end results are equivalent for both delegation types. Once
				461	delegated, the user can build a sub-hierarchy under the directory,
				462	organize processes inside it as it sees fit and further distribute the
				463	resources it received from the parent. The limits and other settings
				464	of all resource controllers are hierarchical and regardless of what
				465	happens in the delegated sub-hierarchy, nothing can escape the
				466	resource restrictions imposed by the parent.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	467
				468	Currently, cgroup doesn't impose any restrictions on the number of
				469	cgroups in or nesting depth of a delegated sub-hierarchy; however,
				470	this may be limited explicitly in the future.
				471
				472
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	473	Delegation Containment
				474	~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	475
				476	A delegated sub-hierarchy is contained in the sense that processes
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	477	can't be moved into or out of the sub-hierarchy by the delegatee.
				478
				479	For delegations to a less privileged user, this is achieved by
				480	requiring the following conditions for a process with a non-root euid
				481	to migrate a target process into a cgroup by writing its PID to the
				482	"cgroup.procs" file.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	483
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	484	- The writer must have write access to the "cgroup.procs" file.
				485
				486	- The writer must have write access to the "cgroup.procs" file of the
				487	common ancestor of the source and destination cgroups.
				488
Tejun Heo	576dd46	2017-01-20 11:29:54 -0500	[diff] [blame]	489	The above two constraints ensure that while a delegatee may migrate
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	490	processes around freely in the delegated sub-hierarchy it can't pull
				491	in from or push out to outside the sub-hierarchy.
				492
				493	For example, let's assume cgroups C0 and C1 have been delegated to
				494	user U0 who created C00, C01 under C0 and C10 under C1 as follows and
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	495	all processes under C0 and C1 belong to U0::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	496
				497	  ~~~~~~~~~~~~~ - C0 - C00
				498	  ~ cgroup    ~      \ C01
				499	  ~ hierarchy ~
				500	  ~~~~~~~~~~~~~ - C1 - C10
				501
				502	Let's also say U0 wants to write the PID of a process which is
				503	currently in C10 into "C00/cgroup.procs". U0 has write access to the
Tejun Heo	576dd46	2017-01-20 11:29:54 -0500	[diff] [blame]	504	file; however, the common ancestor of the source cgroup C10 and the
				505	destination cgroup C00 is above the points of delegation and U0 would
				506	not have write access to its "cgroup.procs" file and thus the write
				507	will be denied with -EACCES.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	508
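The C0/C1 example above can be replayed with a small Python model of
the two write-access conditions. The function name and the set of
writable paths are assumptions made for this sketch. The kernel checks
real file permissions, which the model reduces to set membership.

```python
import posixpath


def may_migrate(writable, src, dst):
    """Model the two conditions for a writer with a non-root euid.

    `writable` holds the cgroup paths whose "cgroup.procs" file the
    writer can write to.
    """
    ancestor = posixpath.commonpath([src, dst])
    return dst in writable and ancestor in writable


# U0 was delegated C0 and C1 and created C00, C01 and C10 below them.
u0_writable = {"/C0", "/C0/C00", "/C0/C01", "/C1", "/C1/C10"}

across = may_migrate(u0_writable, "/C1/C10", "/C0/C00")   # common ancestor "/"
within = may_migrate(u0_writable, "/C0/C01", "/C0/C00")   # common ancestor "/C0"
print(across, within)
```

The first call models the denied write from the example. The second
stays inside the delegated C0 subtree and is allowed.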
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	509	For delegations to namespaces, containment is achieved by requiring
				510	that both the source and destination cgroups are reachable from the
				511	namespace of the process which is attempting the migration. If either
				512	is not reachable, the migration is rejected with -ENOENT.
				513
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	514
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	515	Guidelines
				516	----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	517
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	518	Organize Once and Control
				519	~~~~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	520
				521	Migrating a process across cgroups is a relatively expensive operation
				522	and stateful resources such as memory are not moved together with the
				523	process. This is an explicit design decision as there often exist
				524	inherent trade-offs between migration and various hot paths in terms
				525	of synchronization cost.
				526
				527	As such, migrating processes across cgroups frequently as a means to
				528	apply different resource restrictions is discouraged. A workload
				529	should be assigned to a cgroup according to the system's logical and
				530	resource structure once on start-up. Dynamic adjustments to resource
				531	distribution can be made by changing controller configuration through
				532	the interface files.
				533
				534
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	535	Avoid Name Collisions
				536	~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	537
				538	Interface files for a cgroup and its children cgroups occupy the same
				539	directory and it is possible to create children cgroups which collide
				540	with interface files.
				541
				542	All cgroup core interface files are prefixed with "cgroup." and each
				543	controller's interface files are prefixed with the controller name and
				544	a dot. A controller's name is composed of lowercase letters and
				545	'_'s but never begins with an '_' so it can be used as the prefix
				546	character for collision avoidance. Also, interface file names won't
				547	start or end with terms which are often used in categorizing workloads
				548	such as job, service, slice, unit or workload.
				549
				550	cgroup doesn't do anything to prevent name collisions and it's the
				551	user's responsibility to avoid them.
				552
				553
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	554	Resource Distribution Models
				555	============================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	556
				557	cgroup controllers implement several resource distribution schemes
				558	depending on the resource type and expected use cases. This section
				559	describes major schemes in use along with their expected behaviors.
				560
				561
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	562	Weights
				563	-------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	564
				565	A parent's resource is distributed by adding up the weights of all
				566	active children and giving each the fraction matching the ratio of its
				567	weight against the sum. As only children which can make use of the
				568	resource at the moment participate in the distribution, this is
				569	work-conserving. Due to the dynamic nature, this model is usually
				570	used for stateless resources.
				571
				572	All weights are in the range [1, 10000] with the default at 100. This
				573	allows symmetric multiplicative biases in both directions at fine
				574	enough granularity while staying in the intuitive range.
				575
				576	As long as the weight is in range, all configuration combinations are
				577	valid and there is no reason to reject configuration changes or
				578	process migrations.
				579
				580	"cpu.weight" proportionally distributes CPU cycles to active children
				581	and is an example of this type.
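
As a rough illustration of the ratio described above, the sketch below computes each active child's share from its weight. The helper name weight_shares is hypothetical, and the code only does the arithmetic; it does not read or write any cgroup file.

```shell
# weight_shares W1 W2 ...: print each active child's share of the parent's
# resource as a percentage, i.e. 100 * weight / (sum of active weights).
# Hypothetical helper; inactive children are simply left out of the list.
weight_shares() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '
        { w[NR] = $1; sum += $1 }
        END { for (i = 1; i <= NR; i++) printf "%.1f\n", 100 * w[i] / sum }'
}

# Three active children with weights 100, 100 and 200.
weight_shares 100 100 200
```

If the third child goes idle, "weight_shares 100 100" gives the two remaining children half each, which matches the work-conserving behavior of this model.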
				582
				583
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	584	Limits
				585	------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	586
				587	A child can only consume up to the configured amount of the resource.
				588	Limits can be over-committed - the sum of the limits of children can
				589	exceed the amount of resource available to the parent.
				590
				591	Limits are in the range [0, max] and default to "max", which is a no-op.
				592
				593	As limits can be over-committed, all configuration combinations are
				594	valid and there is no reason to reject configuration changes or
				595	process migrations.
				596
				597	"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
				598	on an IO device and is an example of this type.
				599
				600
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	601	Protections
				602	-----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	603
				604	A cgroup is protected up to the configured amount of the resource
				605	if the usages of all its ancestors are under their
				606	protected levels. Protections can be hard guarantees or best effort
				607	soft boundaries. Protections can also be over-committed in which case
				608	only up to the amount available to the parent is protected among
				609	children.
				610
				611	Protections are in the range [0, max] and default to 0, which is
				612	a no-op.
				613
				614	As protections can be over-committed, all configuration combinations
				615	are valid and there is no reason to reject configuration changes or
				616	process migrations.
				617
				618	"memory.low" implements best-effort memory protection and is an
				619	example of this type.
				620
				621
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	622	Allocations
				623	-----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	624
				625	A cgroup is exclusively allocated a certain amount of a finite
				626	resource. Allocations can't be over-committed - the sum of the
				627	allocations of children cannot exceed the amount of resource
				628	available to the parent.
				629
				630	Allocations are in the range [0, max] and default to 0, which is no
				631	resource.
				632
				633	As allocations can't be over-committed, some configuration
				634	combinations are invalid and should be rejected. Also, if the
				635	resource is mandatory for execution of processes, process migrations
				636	may be rejected.
				637
				638	"cpu.rt.max" hard-allocates realtime slices and is an example of this
				639	type.
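
The rejection rule for allocations can be sketched as a simple sum check. The function below is a hypothetical illustration of that rule, not how any controller implements it; it treats every amount as a plain integer in the same unit.

```shell
# allocation_fits PARENT CHILD...: succeed only if the children's
# allocations add up to no more than the amount available to the parent.
# Hypothetical helper; POSIX sh has no local variables, so "parent",
# "total" and "child" are set in the caller's shell.
allocation_fits() {
    parent=$1; shift
    total=0
    for child in "$@"; do
        total=$((total + child))
    done
    [ "$total" -le "$parent" ]
}
```

With a parent amount of 100, "allocation_fits 100 40 60" succeeds while "allocation_fits 100 40 70" fails. Under the Limits model the second combination would be accepted, because limits may be over-committed.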
				640
				641
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	642	Interface Files
				643	===============
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	644
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	645	Format
				646	------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	647
				648	All interface files should be in one of the following formats whenever
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	649	possible::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	650
				651	New-line separated values
				652	(when only one value can be written at once)
				653
				654	VAL0\n
				655	VAL1\n
				656	...
				657
				658	Space separated values
				659	(when read-only or multiple values can be written at once)
				660
				661	VAL0 VAL1 ...\n
				662
				663	Flat keyed
				664
				665	KEY0 VAL0\n
				666	KEY1 VAL1\n
				667	...
				668
				669	Nested keyed
				670
				671	KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
				672	KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
				673	...
				674
				675	For a writable file, the format for writing should generally match
				676	reading; however, controllers may allow omitting later fields or
				677	implement restricted shortcuts for most common use cases.
				678
				679	For both flat and nested keyed files, only the values for a single key
				680	can be written at a time. For nested keyed files, the sub key pairs
				681	may be specified in any order and not all pairs have to be specified.
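
Because sub key pairs may appear in any order, a reader of a nested keyed file should look values up by name rather than by position. The following sketch shows one way to do that with awk; nested_keyed_get is a hypothetical helper name, and the sample keys in the usage note are the placeholders from the format listing above.

```shell
# nested_keyed_get FILE KEY SUB_KEY: print VAL from the line
# "KEY ... SUB_KEY=VAL ..." of a nested keyed file. Every field after KEY
# is inspected, so the order of the sub key pairs does not matter.
# Exits non-zero if the key or sub key is not found. Hypothetical helper.
nested_keyed_get() {
    awk -v key="$2" -v sub_key="$3" '
        $1 == key {
            for (i = 2; i <= NF; i++) {
                n = index($i, "=")
                if (n > 0 && substr($i, 1, n - 1) == sub_key) {
                    print substr($i, n + 1)
                    found = 1
                }
            }
        }
        END { exit !found }' "$1"
}
```

For a file containing "KEY1 SUB_KEY1=7 SUB_KEY0=20", the call "nested_keyed_get FILE KEY1 SUB_KEY0" prints 20 even though SUB_KEY0 is listed second.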
				682
				683
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	684	Conventions
				685	-----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	686
				687	- Settings for a single feature should be contained in a single file.
				688
				689	- The root cgroup should be exempt from resource control and thus
				690	shouldn't have resource control interface files. Also,
				691	informational files on the root cgroup which end up showing global
				692	information available elsewhere shouldn't exist.
				693
				694	- If a controller implements weight based resource distribution, its
				695	interface file should be named "weight" and have the range [1,
				696	10000] with 100 as the default. The values are chosen to allow
				697	enough and symmetric bias in both directions while keeping it
				698	intuitive (the default is 100%).
				699
				700	- If a controller implements an absolute resource guarantee and/or
				701	limit, the interface files should be named "min" and "max"
				702	respectively. If a controller implements best effort resource
				703	guarantee and/or limit, the interface files should be named "low"
				704	and "high" respectively.
				705
				706	In the above four control files, the special token "max" should be
				707	used to represent upward infinity for both reading and writing.
				708
				709	- If a setting has a configurable default value and keyed specific
				710	overrides, the default entry should be keyed with "default" and
				711	appear as the first entry in the file.
				712
				713	The default value can be updated by writing either "default $VAL" or
				714	"$VAL".
				715
				716	When writing to update a specific override, "default" can be used as
				717	the value to indicate removal of the override. Override entries
				718	with "default" as the value must not appear when read.
				719
				720	For example, a setting which is keyed by major:minor device numbers
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	721	with integer values may look like the following::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	722
				723	# cat cgroup-example-interface-file
				724	default 150
				725	8:0 300
				726
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	727	The default value can be updated by::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	728
				729	# echo 125 > cgroup-example-interface-file
				730
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	731	or::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	732
				733	# echo "default 125" > cgroup-example-interface-file
				734
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	735	An override can be set by::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	736
				737	# echo "8:16 170" > cgroup-example-interface-file
				738
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	739	and cleared by::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	740
				741	# echo "8:0 default" > cgroup-example-interface-file
				742	# cat cgroup-example-interface-file
				743	default 125
				744	8:16 170
				745
				746	- For events which are not very high frequency, an interface file
				747	"events" should be created which lists event key value pairs.
				748	Whenever a notifiable event happens, a file modified event should be
				749	generated on the file.
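
For the "default" keyed convention described in this list, a reader usually wants the effective value for a key: the override if one exists, otherwise the default. The sketch below assumes a flat keyed file laid out like the cgroup-example-interface-file listing; effective_value is a hypothetical helper name.

```shell
# effective_value FILE KEY: print the override stored for KEY, or the
# value of the "default" entry when KEY has no override.
# Hypothetical helper operating on a "KEY VAL" per line file.
effective_value() {
    awk -v key="$2" '
        $1 == "default" { def = $2 }
        $1 == key       { val = $2; found = 1 }
        END { print (found ? val : def) }' "$1"
}
```

Against the final state of the example file ("default 125" and "8:16 170"), the key 8:16 resolves to 170 and the key 8:0, whose override was cleared, resolves to 125.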
				750
				751
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	752	Core Interface Files
				753	--------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	754
				755	All cgroup core files are prefixed with "cgroup."
				756
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	757	cgroup.type
				758
				759	A read-write single value file which exists on non-root
				760	cgroups.
				761
				762	When read, it indicates the current type of the cgroup, which
				763	can be one of the following values.
				764
				765	- "domain" : A normal valid domain cgroup.
				766
				767	- "domain threaded" : A threaded domain cgroup which is
				768	serving as the root of a threaded subtree.
				769
				770	- "domain invalid" : A cgroup which is in an invalid state.
				771	It can't be populated or have controllers enabled. It may
				772	be allowed to become a threaded cgroup.
				773
				774	- "threaded" : A threaded cgroup which is a member of a
				775	threaded subtree.
				776
				777	A cgroup can be turned into a threaded cgroup by writing
				778	"threaded" to this file.
				779
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	780	cgroup.procs
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	781	A read-write new-line separated values file which exists on
				782	all cgroups.
				783
				784	When read, it lists the PIDs of all processes which belong to
				785	the cgroup one-per-line. The PIDs are not ordered and the
				786	same PID may show up more than once if the process got moved
				787	to another cgroup and then back or the PID got recycled while
				788	reading.
				789
				790	A PID can be written to migrate the process associated with
				791	the PID to the cgroup. The writer should match all of the
				792	following conditions.
				793
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	794	- It must have write access to the "cgroup.procs" file.
				795
				796	- It must have write access to the "cgroup.procs" file of the
				797	common ancestor of the source and destination cgroups.
				798
				799	When delegating a sub-hierarchy, write access to this file
				800	should be granted along with the containing directory.
				801
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	802	In a threaded cgroup, reading this file fails with EOPNOTSUPP
				803	as all the processes belong to the thread root. Writing is
				804	supported and moves every thread of the process to the cgroup.
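
A minimal sketch of reading and writing "cgroup.procs" follows. Both function names are hypothetical, and CGROUP_DIR stands for the cgroup's directory under wherever the cgroup2 hierarchy is mounted. Since the file may list the same PID more than once, the reader de-duplicates its output.

```shell
# cg_pids CGROUP_DIR: list the PIDs in the cgroup, sorted numerically and
# de-duplicated, because the kernel does not order the list and may
# report a PID more than once. Hypothetical helper.
cg_pids() {
    sort -nu "$1/cgroup.procs"
}

# cg_move_pid CGROUP_DIR PID: migrate the process PID into the cgroup by
# writing its PID to cgroup.procs, one PID per write. Hypothetical helper.
cg_move_pid() {
    printf '%s\n' "$2" > "$1/cgroup.procs"
}
```

The write fails unless the caller meets the access conditions listed above, so callers should check the exit status of cg_move_pid.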
				805
				806	cgroup.threads
				807	A read-write new-line separated values file which exists on
				808	all cgroups.
				809
				810	When read, it lists the TIDs of all threads which belong to
				811	the cgroup one-per-line. The TIDs are not ordered and the
				812	same TID may show up more than once if the thread got moved to
				813	another cgroup and then back or the TID got recycled while
				814	reading.
				815
				816	A TID can be written to migrate the thread associated with the
				817	TID to the cgroup. The writer should match all of the
				818	following conditions.
				819
				820	- It must have write access to the "cgroup.threads" file.
				821
				822	- The cgroup that the thread is currently in must be in the
				823	same resource domain as the destination cgroup.
				824
				825	- It must have write access to the "cgroup.procs" file of the
				826	common ancestor of the source and destination cgroups.
				827
				828	When delegating a sub-hierarchy, write access to this file
				829	should be granted along with the containing directory.
				830
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	831	cgroup.controllers
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	832	A read-only space separated values file which exists on all
				833	cgroups.
				834
				835	It shows a space separated list of all controllers available to
				836	the cgroup. The controllers are not ordered.
				837
				838	cgroup.subtree_control
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	839	A read-write space separated values file which exists on all
				840	cgroups. Starts out empty.
				841
				842	When read, it shows a space separated list of the controllers
				843	which are enabled to control resource distribution from the
				844	cgroup to its children.
				845
				846	A space separated list of controllers prefixed with '+' or '-'
				847	can be written to enable or disable controllers. A controller
				848	name prefixed with '+' enables the controller and '-'
				849	disables. If a controller appears more than once on the list,
				850	the last one is effective. When multiple enable and disable
				851	operations are specified, either all succeed or all fail.
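
Since multiple operations in one write either all succeed or all fail, it can be convenient to send them together. The sketch below does that; cg_subtree_control is a hypothetical helper name and CGROUP_DIR is the directory of the cgroup being configured.

```shell
# cg_subtree_control CGROUP_DIR OP...: apply operations such as "+cpu" or
# "-memory" with a single write, so that they take effect as one unit.
# Hypothetical helper; "dir" is set in the caller's shell.
cg_subtree_control() {
    dir=$1; shift
    # "$*" joins the remaining arguments with single spaces.
    printf '%s\n' "$*" > "$dir/cgroup.subtree_control"
}
```

For example, "cg_subtree_control CGROUP_DIR +cpu +memory" asks for both controllers to be enabled for the children of that cgroup.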
				852
				853	cgroup.events
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	854	A read-only flat-keyed file which exists on non-root cgroups.
				855	The following entries are defined. Unless specified
				856	otherwise, a value change in this file generates a file
				857	modified event.
				858
				859	populated
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	860	1 if the cgroup or its descendants contains any live
				861	processes; otherwise, 0.
				862
Roman Gushchin	1a926e0	2017-07-28 18:28:44 +0100	[diff] [blame]	863	cgroup.max.descendants
				864	A read-write single value file. The default is "max".
				865
				866	Maximum allowed number of descendant cgroups.
				867	If the actual number of descendants is equal to or larger than this,
				868	an attempt to create a new cgroup in the hierarchy will fail.
				869
				870	cgroup.max.depth
				871	A read-write single value file. The default is "max".
				872
				873	Maximum allowed descent depth below the current cgroup.
				874	If the actual descent depth is equal to or larger than this,
				875	an attempt to create a new child cgroup will fail.
				876
Roman Gushchin	ec39225	2017-08-02 17:55:31 +0100	[diff] [blame]	877	cgroup.stat
				878	A read-only flat-keyed file with the following entries:
				879
				880	nr_descendants
				881	Total number of visible descendant cgroups.
				882
				883	nr_dying_descendants
				884	Total number of dying descendant cgroups. A cgroup becomes
				885	dying after being deleted by a user. The cgroup will remain
				886	in dying state for some undefined time (which can depend
				887	on system load) before being completely destroyed.
				888
				889	A process can't enter a dying cgroup under any circumstances,
				890	and a dying cgroup can't revive.
				891
				892	A dying cgroup can consume system resources, but not beyond the
				893	limits which were active at the moment of cgroup deletion.
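
The interaction between "cgroup.max.descendants" and the nr_descendants entry of "cgroup.stat" can be sketched as a pre-check before creating a child. This is a simplified, hypothetical helper: it inspects a single cgroup only and ignores limits set on its ancestors and the separate "cgroup.max.depth" limit, which may also cause creation to fail.

```shell
# cg_has_room CGROUP_DIR: succeed if the number of visible descendants is
# still below cgroup.max.descendants for this one cgroup.
# Returns 2 if either file cannot be read. Hypothetical helper; "max" and
# "nr" are set in the caller's shell.
cg_has_room() {
    max=$(cat "$1/cgroup.max.descendants") || return 2
    # The special token "max" means no limit is configured.
    [ "$max" = "max" ] && return 0
    nr=$(awk '$1 == "nr_descendants" { print $2 }' "$1/cgroup.stat") || return 2
    [ "$nr" -lt "$max" ]
}
```

The check is advisory only; another process may create a cgroup between the check and the mkdir, so the mkdir result still has to be handled.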
				894
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	895
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	896	Controllers
				897	===========
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	898
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	899	CPU
				900	---
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	901
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	902	The "cpu" controller regulates distribution of CPU cycles. This
				903	controller implements weight and absolute bandwidth limit models for
				904	normal scheduling policy and absolute bandwidth allocation model for
				905	realtime scheduling policy.
				906
Tejun Heo	c2f31b7	2017-12-05 09:10:17 -0800	[diff] [blame]	907	WARNING: cgroup2 doesn't yet support control of realtime processes and
				908	the cpu controller can only be enabled when all RT processes are in
				909	the root cgroup. Be aware that system management software may already
				910	have placed RT processes into nonroot cgroups during the system boot
				911	process, and these processes may need to be moved to the root cgroup
				912	before the cpu controller can be enabled.
				913
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	914
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	915	CPU Interface Files
				916	~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	917
				918	All time durations are in microseconds.
				919
				920	cpu.stat
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	921	A read-only flat-keyed file which exists on non-root cgroups.
Tejun Heo	d41bf8c	2017-10-23 16:18:27 -0700	[diff] [blame]	922	This file exists whether the controller is enabled or not.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	923
Tejun Heo	d41bf8c	2017-10-23 16:18:27 -0700	[diff] [blame]	924	It always reports the following three stats:
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	925
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	926	- usage_usec
				927	- user_usec
				928	- system_usec
Tejun Heo	d41bf8c	2017-10-23 16:18:27 -0700	[diff] [blame]	929
				930	and the following three when the controller is enabled:
				931
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	932	- nr_periods
				933	- nr_throttled
				934	- throttled_usec
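
One use of these counters is to estimate how often a cgroup hits its bandwidth limit. The sketch below assumes that nr_periods counts elapsed enforcement periods and nr_throttled counts the periods in which the cgroup was throttled; this section does not define the entries, so treat that reading as an assumption. cpu_throttle_pct is a hypothetical helper name.

```shell
# cpu_throttle_pct CPU_STAT_FILE: print the percentage of periods in which
# the cgroup was throttled, or "n/a" when nr_periods is missing or zero
# (for example when the controller is not enabled). Hypothetical helper.
cpu_throttle_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else print "n/a"
        }' "$1"
}
```

The values are looked up by key, so the helper keeps working if further entries are added to the file.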
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	935
				936	cpu.weight
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	937	A read-write single value file which exists on non-root
				938	cgroups. The default is "100".
				939
				940	The weight is in the range [1, 10000].
				941
Tejun Heo	0d59363	2017-09-25 09:00:19 -0700	[diff] [blame]	942	cpu.weight.nice
				943	A read-write single value file which exists on non-root
				944	cgroups. The default is "0".
				945
				946	The nice value is in the range [-20, 19].
				947
				948	This interface file is an alternative interface for
				949	"cpu.weight" and allows reading and setting weight using the
				950	same values used by nice(2). Because the range is smaller and
				951	granularity is coarser for the nice values, the read value is
				952	the closest approximation of the current weight.
				953
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	954	cpu.max
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	955	A read-write two value file which exists on non-root cgroups.
				956	The default is "max 100000".
				957
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	958	The maximum bandwidth limit. It's in the following format::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	959
				960	$MAX $PERIOD
				961
				962	which indicates that the group may consume up to $MAX in each
				963	$PERIOD duration. "max" for $MAX indicates no limit. If only
				964	one number is written, $MAX is updated.
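
Since both fields are durations in microseconds, dividing $MAX by $PERIOD gives the limit expressed as a number of CPUs. The sketch below does that conversion on the contents of a "cpu.max" file; cpu_max_cpus is a hypothetical helper name.

```shell
# cpu_max_cpus CPU_MAX_FILE: print the bandwidth limit as a number of
# CPUs ($MAX / $PERIOD), or "unlimited" when $MAX is the token "max".
# Hypothetical helper.
cpu_max_cpus() {
    awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f\n", $1 / $2
    }' "$1"
}
```

A file containing "50000 100000" yields 0.50, i.e. half of one CPU in each period, and the default "max 100000" yields "unlimited".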
				965
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	966
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	967	Memory
				968	------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	969
				970	The "memory" controller regulates distribution of memory. Memory is
				971	stateful and implements both limit and protection models. Due to the
				972	intertwining between memory usage and reclaim pressure and the
				973	stateful nature of memory, the distribution model is relatively
				974	complex.
				975
				976	While not completely water-tight, all major memory usages by a given
				977	cgroup are tracked so that the total memory consumption can be
				978	accounted and controlled to a reasonable extent. Currently, the
				979	following types of memory usages are tracked.
				980
				981	- Userland memory - page cache and anonymous memory.
				982
				983	- Kernel data structures such as dentries and inodes.
				984
				985	- TCP socket buffers.
				986
				987	The above list may expand in the future for better coverage.
				988
				989
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	990	Memory Interface Files
				991	~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	992
				993	All memory amounts are in bytes. If a value which is not aligned to
				994	PAGE_SIZE is written, the value may be rounded up to the closest
				995	PAGE_SIZE multiple when read back.
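
The rounding mentioned above can be reproduced with integer arithmetic, which helps when comparing a written value against what is read back. This is only a sketch of the arithmetic; round_up_to_page is a hypothetical helper name, and the page size is taken from "getconf PAGE_SIZE" unless given explicitly.

```shell
# round_up_to_page BYTES [PAGE_SIZE]: print BYTES rounded up to the
# closest multiple of PAGE_SIZE. PAGE_SIZE defaults to the value reported
# by getconf. Hypothetical helper; "page" is set in the caller's shell.
round_up_to_page() {
    page=${2:-$(getconf PAGE_SIZE)}
    echo $(( ($1 + page - 1) / page * page ))
}
```

With a 4096 byte page, a written value of 5000 would be expected to read back as 8192, while 4096 is already aligned and stays unchanged.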
				996
				997	memory.current
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	998	A read-only single value file which exists on non-root
				999	cgroups.
				1000
				1001	The total amount of memory currently being used by the cgroup
				1002	and its descendants.
				1003
				1004	memory.low
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1005	A read-write single value file which exists on non-root
				1006	cgroups. The default is "0".
				1007
				1008	Best-effort memory protection. If the memory usages of a
				1009	cgroup and all its ancestors are below their low boundaries,
				1010	the cgroup's memory won't be reclaimed unless memory can be
				1011	reclaimed from unprotected cgroups.
				1012
				1013	Putting more memory than generally available under this
				1014	protection is discouraged.
				1015
				1016	memory.high
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1017	A read-write single value file which exists on non-root
				1018	cgroups. The default is "max".
				1019
				1020	Memory usage throttle limit. This is the main mechanism to
				1021	control memory usage of a cgroup. If a cgroup's usage goes
				1022	over the high boundary, the processes of the cgroup are
				1023	throttled and put under heavy reclaim pressure.
				1024
				1025	Going over the high limit never invokes the OOM killer and
				1026	under extreme conditions the limit may be breached.
				1027
				1028	memory.max
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1029	A read-write single value file which exists on non-root
				1030	cgroups. The default is "max".
				1031
				1032	Memory usage hard limit. This is the final protection
				1033	mechanism. If a cgroup's memory usage reaches this limit and
				1034	can't be reduced, the OOM killer is invoked in the cgroup.
				1035	Under certain circumstances, the usage may go over the limit
				1036	temporarily.
				1037
				1038	This is the ultimate protection mechanism. As long as the
				1039	high limit is used and monitored properly, this limit's
				1040	utility is limited to providing the final safety net.
				1041
				1042	memory.events
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1043	A read-only flat-keyed file which exists on non-root cgroups.
				1044	The following entries are defined. Unless specified
				1045	otherwise, a value change in this file generates a file
				1046	modified event.
				1047
				1048	low
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1049	The number of times the cgroup is reclaimed due to
				1050	high memory pressure even though its usage is under
				1051	the low boundary. This usually indicates that the low
				1052	boundary is over-committed.
				1053
				1054	high
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1055	The number of times processes of the cgroup are
				1056	throttled and routed to perform direct memory reclaim
				1057	because the high memory boundary was exceeded. For a
				1058	cgroup whose memory usage is capped by the high limit
				1059	rather than global memory pressure, this event's
				1060	occurrences are expected.
				1061
				1062	max
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1063	The number of times the cgroup's memory usage was
				1064	about to go over the max boundary. If direct reclaim
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1065	fails to bring it down, the cgroup goes to OOM state.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1066
				1067	oom
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1068	The number of times the cgroup's memory usage
				1069	reached the limit and allocation was about to fail.
				1070
				1071	Depending on context, the result could be invocation of the OOM
Vladimir Rutsky	2877cbe	2018-01-02 17:27:41 +0100	[diff] [blame]	1072	killer and retrying the allocation, or failing the allocation.
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1073
				1074	A failed allocation, in turn, could be returned to
Vladimir Rutsky	2877cbe	2018-01-02 17:27:41 +0100	[diff] [blame]	1075	userspace as -ENOMEM or silently ignored in cases like
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1076	disk readahead. For now, OOM in a memory cgroup kills
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1077	tasks iff the shortage has happened inside a page fault.
				1078
				1079	oom_kill
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1080	The number of processes belonging to this cgroup
				1081	killed by any kind of OOM killer.
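
A monitoring agent woken by a file modified event typically wants to know which counters moved. The sketch below compares two saved copies of "memory.events" and prints the increase per key; events_delta is a hypothetical helper name, and how the snapshots are taken is left to the caller.

```shell
# events_delta OLD_FILE NEW_FILE: for every key whose counter grew between
# the two snapshots, print "KEY INCREASE". Hypothetical helper; assumes
# OLD_FILE is not empty (the NR == FNR test relies on that).
events_delta() {
    awk '
        NR == FNR { old[$1] = $2; next }   # first file: remember old values
        $2 > old[$1] { print $1, $2 - old[$1] }' "$1" "$2"
}
```

If only the "high" counter rises, the cgroup is being throttled at its high boundary; a rise in "oom_kill" means processes in the cgroup were actually killed.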
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1082
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1083	memory.stat
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1084	A read-only flat-keyed file which exists on non-root cgroups.
				1085
				1086	This breaks down the cgroup's memory footprint into different
				1087	types of memory, type-specific details, and other information
				1088	on the state and past events of the memory management system.
				1089
				1090	All memory amounts are in bytes.
				1091
				1092	The entries are ordered to be human readable, and new entries
				1093	can show up in the middle. Don't rely on items remaining in a
				1094	fixed position; use the keys to look up specific values!
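
A sketch of the key-based lookup recommended above follows. It never depends on the line number of an entry, so new entries appearing in the middle of the file do not affect it; memstat_get is a hypothetical helper name.

```shell
# memstat_get MEMORY_STAT_FILE KEY: print the value stored under KEY in a
# flat keyed file such as memory.stat, wherever the entry appears.
# Exits non-zero if the key is absent. Hypothetical helper.
memstat_get() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}
```

A caller can treat a non-zero exit status as a sign that the running kernel does not report the requested entry.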
				1095
				1096	anon
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1097	Amount of memory used in anonymous mappings such as
				1098	brk(), sbrk(), and mmap(MAP_ANONYMOUS)
				1099
				1100	file
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1101	Amount of memory used to cache filesystem data,
				1102	including tmpfs and shared memory.
				1103
Vladimir Davydov	12580e4	2016-03-17 14:17:38 -0700	[diff] [blame]	1104	kernel_stack
Vladimir Davydov	12580e4	2016-03-17 14:17:38 -0700	[diff] [blame]	1105	Amount of memory allocated to kernel stacks.
				1106
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1107	slab
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1108	Amount of memory used for storing in-kernel data
				1109	structures.
				1110
Johannes Weiner	4758e19	2016-02-02 16:57:41 -0800	[diff] [blame]	1111	sock
Johannes Weiner	4758e19	2016-02-02 16:57:41 -0800	[diff] [blame]	1112	Amount of memory used in network transmission buffers
				1113
Johannes Weiner	9a4caf1	2017-05-03 14:52:45 -0700	[diff] [blame]	1114	shmem
Johannes Weiner	9a4caf1	2017-05-03 14:52:45 -0700	[diff] [blame]	1115	Amount of cached filesystem data that is swap-backed,
				1116	such as tmpfs, shm segments, shared anonymous mmap()s
				1117
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1118	file_mapped
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1119	Amount of cached filesystem data mapped with mmap()
				1120
				1121	file_dirty
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1122	Amount of cached filesystem data that was modified but
				1123	not yet written back to disk
				1124
				1125	file_writeback
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1126	Amount of cached filesystem data that was modified and
				1127	is currently being written back to disk
				1128
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1129	inactive_anon, active_anon, inactive_file, active_file, unevictable
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1130	Amount of memory, swap-backed and filesystem-backed,
				1131	on the internal memory management lists used by the
				1132	page reclaim algorithm
				1133
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1134	slab_reclaimable
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1135	Part of "slab" that might be reclaimed, such as
				1136	dentries and inodes.
				1137
				1138	slab_unreclaimable
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1139	Part of "slab" that cannot be reclaimed on memory
				1140	pressure.
				1141
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1142	pgfault
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1143	Total number of page faults incurred
				1144
				1145	pgmajfault
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1146	Number of major page faults incurred
				1147
Roman Gushchin	b340959	2017-05-12 15:47:09 -0700	[diff] [blame]	1148	workingset_refault
				1149
				1150	Number of refaults of previously evicted pages
				1151
				1152	workingset_activate
				1153
				1154	Number of refaulted pages that were immediately activated
				1155
				1156	workingset_nodereclaim
				1157
				1158	Number of times a shadow node has been reclaimed
				1159
Roman Gushchin	2262185	2017-07-06 15:40:25 -0700	[diff] [blame]	1160	pgrefill
				1161
				1162	Amount of scanned pages (in an active LRU list)
				1163
				1164	pgscan
				1165
				1166	Amount of scanned pages (in an inactive LRU list)
				1167
				1168	pgsteal
				1169
				1170	Amount of reclaimed pages
				1171
				1172	pgactivate
				1173
				1174	Amount of pages moved to the active LRU list
				1175
				1176	pgdeactivate
				1177
				1178	Amount of pages moved to the inactive LRU list
				1179
				1180	pglazyfree
				1181
				1182	Amount of pages postponed to be freed under memory pressure
				1183
				1184	pglazyfreed
				1185
				1186	Amount of reclaimed lazyfree pages
				1187
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1188	memory.swap.current
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1189	A read-only single value file which exists on non-root
				1190	cgroups.
				1191
				1192	The total amount of swap currently being used by the cgroup
				1193	and its descendants.
				1194
				1195	memory.swap.max
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1196	A read-write single value file which exists on non-root
				1197	cgroups. The default is "max".
				1198
				1199	Swap usage hard limit. If a cgroup's swap usage reaches this
Vladimir Rutsky	2877cbe	2018-01-02 17:27:41 +0100	[diff] [blame]	1200	limit, anonymous memory of the cgroup will not be swapped out.
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1201
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1202
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1203	Usage Guidelines
				1204	~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1205
				1206	"memory.high" is the main mechanism to control memory usage.
				1207	Over-committing on high limit (sum of high limits > available memory)
				1208	and letting global memory pressure distribute memory according to
				1209	usage is a viable strategy.
				1210
				1211	Because breach of the high limit doesn't trigger the OOM killer but
				1212	throttles the offending cgroup, a management agent has ample
				1213	opportunities to monitor and take appropriate actions such as granting
				1214	more memory or terminating the workload.
				1215
				1216	Determining whether a cgroup has enough memory is not trivial as
				1217	memory usage doesn't indicate whether the workload can benefit from
				1218	more memory. For example, a workload which writes data received from
				1219	network to a file can use all available memory but can also perform
				1220	just as well with a small amount of memory. A measure of memory
				1221	pressure - how much the workload is being impacted due to lack of
				1222	memory - is necessary to determine whether a workload needs more
				1223	memory; unfortunately, a memory pressure monitoring mechanism isn't
				1224	implemented yet.
				1225
				1226
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1227	Memory Ownership
				1228	~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1229
				1230	A memory area is charged to the cgroup which instantiated it and stays
				1231	charged to the cgroup until the area is released. Migrating a process
				1232	to a different cgroup doesn't move the memory usages that it
				1233	instantiated while in the previous cgroup to the new cgroup.
				1234
				1235	A memory area may be used by processes belonging to different cgroups.
				1236	To which cgroup the area will be charged is indeterminate; however,
				1237	over time, the memory area is likely to end up in a cgroup which has
				1238	enough memory allowance to avoid high reclaim pressure.
				1239
				1240	If a cgroup sweeps a considerable amount of memory which is expected
				1241	to be accessed repeatedly by other cgroups, it may make sense to use
				1242	POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
				1243	belonging to the affected files to ensure correct memory ownership.
				1244
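The POSIX_FADV_DONTNEED hint mentioned above could be issued from
userspace roughly as follows. This is a minimal sketch in Python,
assuming a Linux system where os.posix_fadvise() is available; the
helper name drop_page_cache is hypothetical. The kernel only discards
clean pages in response to this advice, so the sketch flushes the file
first.

```python
import os


def drop_page_cache(path):
    """Ask the kernel to drop the cached pages of `path`.

    Returns True if the advice was issued, False if posix_fadvise()
    is not available on this platform.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # Dirty pages are not discarded by DONTNEED, so write them back first.
        os.fsync(fd)
        # offset=0, len=0 covers the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return True
```

A cgroup that has just swept a large file would call this helper on
that file once it is done, so that pages faulted back in later are
charged to the cgroups that actually use them.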
				1245
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1246	IO
				1247	--
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1248
				1249	The "io" controller regulates the distribution of IO resources. This
				1250	controller implements both weight based and absolute bandwidth or IOPS
				1251	limit distribution; however, weight based distribution is available
				1252	only if cfq-iosched is in use and neither scheme is available for
				1253	blk-mq devices.
				1254
				1255
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1256	IO Interface Files
				1257	~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1258
				1259	io.stat
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1260	A read-only nested-keyed file which exists on non-root
				1261	cgroups.
				1262
				1263	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
				1264	The following nested keys are defined.
				1265
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1266	====== ===================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1267	rbytes Bytes read
				1268	wbytes Bytes written
				1269	rios Number of read IOs
				1270	wios Number of write IOs
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1271	====== ===================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1272
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1273	An example read output follows::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1274
				1275	8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
				1276	8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
				1277
				1278	io.weight
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1279	A read-write flat-keyed file which exists on non-root cgroups.
				1280	The default is "default 100".
				1281
				1282	The first line is the default weight applied to devices
				1283	without specific override. The rest are overrides keyed by
				1284	$MAJ:$MIN device numbers and not ordered. The weights are in
				1285	the range [1, 10000] and specify the relative amount of IO time
				1286	the cgroup can use in relation to its siblings.
				1287
				1288	The default weight can be updated by writing either "default
				1289	$WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
				1290	"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
				1291
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1292	An example read output follows::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1293
				1294	default 100
				1295	8:16 200
				1296	8:0 50
				1297
				1298	io.max
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1299	A read-write nested-keyed file which exists on non-root
				1300	cgroups.
				1301
				1302	BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN
				1303	device numbers and not ordered. The following nested keys are
				1304	defined.
				1305
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1306	===== ==================================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1307	rbps Max read bytes per second
				1308	wbps Max write bytes per second
				1309	riops Max read IO operations per second
				1310	wiops Max write IO operations per second
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1311	===== ==================================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1312
				1313	When writing, any number of nested key-value pairs can be
				1314	specified in any order. "max" can be specified as the value
				1315	to remove a specific limit. If the same key is specified
				1316	multiple times, the outcome is undefined.
				1317
				1318	BPS and IOPS are measured in each IO direction and IOs are
				1319	delayed if limit is reached. Temporary bursts are allowed.
				1320
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1321	Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1322
				1323	echo "8:16 rbps=2097152 wiops=120" > io.max
				1324
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1325	Reading returns the following::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1326
				1327	8:16 rbps=2097152 wbps=max riops=max wiops=120
				1328
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1329	Write IOPS limit can be removed by writing the following::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1330
				1331	echo "8:16 wiops=max" > io.max
				1332
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1333	Reading now returns the following::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1334
				1335	8:16 rbps=2097152 wbps=max riops=max wiops=max
				1336
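The nested-keyed format shared by io.stat and io.max is simple to
handle from a management agent. The following is a minimal sketch in
Python; the names parse_nested_keyed and format_io_max are
illustrative and not part of any kernel or library interface, and the
"positive integer or max" check is this sketch's own sanity check
rather than a statement of what the kernel accepts.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def parse_nested_keyed(text):
    """Parse nested-keyed content into {line_key: {nested_key: value}}.

    Purely numeric values become ints; anything else (such as "max")
    is kept as a string.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        entry = {}
        for pair in fields[1:]:
            name, _, value = pair.partition("=")
            entry[name] = int(value) if value.isdigit() else value
        result[fields[0]] = entry
    return result


def format_io_max(major, minor, **limits):
    """Build a line to write to io.max for device major:minor.

    Taking the limits as keyword arguments means a key cannot be
    given twice, which avoids the undefined duplicate-key case.
    """
    parts = ["%d:%d" % (major, minor)]
    for key, value in limits.items():
        if key not in IO_MAX_KEYS:
            raise ValueError("unknown io.max key: %s" % key)
        if value != "max" and not (type(value) is int and value > 0):
            raise ValueError("invalid limit for %s: %r" % (key, value))
        parts.append("%s=%s" % (key, value))
    return " ".join(parts)


stats = parse_nested_keyed(
    "8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353\n"
    "8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252\n"
)
print(stats["8:16"]["wbytes"])  # 314773504

line = format_io_max(8, 16, rbps=2097152, wiops=120)
print(line)  # 8:16 rbps=2097152 wiops=120
```

The string returned by format_io_max() is what would be written to the
cgroup's io.max file, and parse_nested_keyed() can be applied to the
content read back from either io.stat or io.max.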
				1337
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1338	Writeback
				1339	~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1340
				1341	Page cache is dirtied through buffered writes and shared mmaps and
				1342	written asynchronously to the backing filesystem by the writeback
				1343	mechanism. Writeback sits between the memory and IO domains and
				1344	regulates the proportion of dirty memory by balancing dirtying and
				1345	write IOs.
				1346
				1347	The io controller, in conjunction with the memory controller,
				1348	implements control of page cache writeback IOs. The memory controller
				1349	defines the memory domain that dirty memory ratio is calculated and
				1350	maintained for and the io controller defines the io domain which
				1351	writes out dirty pages for the memory domain. Both system-wide and
				1352	per-cgroup dirty memory states are examined and the more restrictive
				1353	of the two is enforced.
				1354
				1355	cgroup writeback requires explicit support from the underlying
				1356	filesystem. Currently, cgroup writeback is implemented on ext2, ext4
				1357	and btrfs. On other filesystems, all writeback IOs are attributed to
				1358	the root cgroup.
				1359
				1360	There are inherent differences in memory and writeback management
				1361	which affect how cgroup ownership is tracked. Memory is tracked per
				1362	page while writeback is tracked per inode. For the purpose of writeback, an
				1363	inode is assigned to a cgroup and all IO requests to write dirty pages
				1364	from the inode are attributed to that cgroup.
				1365
				1366	As cgroup ownership for memory is tracked per page, there can be pages
				1367	which are associated with different cgroups than the one the inode is
				1368	associated with. These are called foreign pages. The writeback
				1369	constantly keeps track of foreign pages and, if a particular foreign
				1370	cgroup becomes the majority over a certain period of time, switches
				1371	the ownership of the inode to that cgroup.
				1372
				1373	While this model is enough for most use cases where a given inode is
				1374	mostly dirtied by a single cgroup even when the main writing cgroup
				1375	changes over time, use cases where multiple cgroups write to a single
				1376	inode simultaneously are not supported well. In such circumstances, a
				1377	significant portion of IOs are likely to be attributed incorrectly.
				1378	As memory controller assigns page ownership on the first use and
				1379	doesn't update it until the page is released, even if writeback
				1380	strictly follows page ownership, multiple cgroups dirtying overlapping
				1381	areas wouldn't work as expected. It's recommended to avoid such usage
				1382	patterns.
				1383
				1384	The sysctl knobs which affect writeback behavior are applied to cgroup
				1385	writeback as follows.
				1386
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1387	vm.dirty_background_ratio, vm.dirty_ratio
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1388	These ratios apply the same to cgroup writeback with the
				1389	amount of available memory capped by limits imposed by the
				1390	memory controller and system-wide clean memory.
				1391
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1392	vm.dirty_background_bytes, vm.dirty_bytes
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1393	For cgroup writeback, this is calculated into ratio against
				1394	total available memory and applied the same way as
				1395	vm.dirty[_background]_ratio.
				1396
				1397
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1398	PID
				1399	---
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1400
				1401	The process number controller is used to allow a cgroup to stop any
				1402	new tasks from being fork()'d or clone()'d after a specified limit is
				1403	reached.
				1404
				1405	The number of tasks in a cgroup can be exhausted in ways which other
				1406	controllers cannot prevent, thus warranting its own controller. For
				1407	example, a fork bomb is likely to exhaust the number of tasks before
				1408	hitting memory restrictions.
				1409
				1410	Note that PIDs used in this controller refer to TIDs, process IDs as
				1411	used by the kernel.
				1412
				1413
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1414	PID Interface Files
				1415	~~~~~~~~~~~~~~~~~~~
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1416
				1417	pids.max
Tobias Klauser	312eb71	2017-02-17 18:44:11 +0100	[diff] [blame]	1418	A read-write single value file which exists on non-root
				1419	cgroups. The default is "max".
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1420
Tobias Klauser	312eb71	2017-02-17 18:44:11 +0100	[diff] [blame]	1421	Hard limit of number of processes.
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1422
				1423	pids.current
Tobias Klauser	312eb71	2017-02-17 18:44:11 +0100	[diff] [blame]	1424	A read-only single value file which exists on all cgroups.
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1425
Tobias Klauser	312eb71	2017-02-17 18:44:11 +0100	[diff] [blame]	1426	The number of processes currently in the cgroup and its
				1427	descendants.
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1428
				1429	Organisational operations are not blocked by cgroup policies, so it is
				1430	possible to have pids.current > pids.max. This can be done by either
				1431	setting the limit to be smaller than pids.current, or attaching enough
				1432	processes to the cgroup such that pids.current is larger than
				1433	pids.max. However, it is not possible to violate a cgroup PID policy
				1434	through fork() or clone(). These will return -EAGAIN if the creation
				1435	of a new process would cause a cgroup policy to be violated.
				1436
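Because pids.current may legitimately exceed pids.max, an agent that
computes the remaining headroom needs to handle both the "max" value
and the over-limit case. A minimal sketch in Python follows; the
function name pids_headroom is hypothetical, its arguments are assumed
to be the raw contents of pids.current and pids.max, and limits set on
ancestor cgroups are ignored.

```python
def pids_headroom(current_text, max_text):
    """Return how many more tasks may be created, or None if unlimited.

    The result is clamped at 0 because pids.current can be larger
    than pids.max after a limit change or a migration.
    """
    limit = max_text.strip()
    if limit == "max":
        return None
    return max(0, int(limit) - int(current_text.strip()))


print(pids_headroom("12\n", "max\n"))   # None
print(pids_headroom("12\n", "100\n"))   # 88
print(pids_headroom("150\n", "100\n"))  # 0
```

A headroom of 0 means that fork() and clone() in the cgroup are
expected to fail with -EAGAIN until tasks exit or the limit is raised.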
				1437
Roman Gushchin	4ad5a32	2017-12-13 19:49:03 +0000	[diff] [blame]	1438	Device controller
				1439	-----------------
				1440
				1441	Device controller manages access to device files. It includes both
				1442	creation of new device files (using mknod), and access to the
				1443	existing device files.
				1444
				1445	Cgroup v2 device controller has no interface files and is implemented
				1446	on top of cgroup BPF. To control access to device files, a user may
				1447	create bpf programs of the BPF_CGROUP_DEVICE type and attach them
				1448	to cgroups. On an attempt to access a device file, corresponding
				1449	BPF programs will be executed, and depending on the return value
				1450	the attempt will succeed or fail with -EPERM.
				1451
				1452	A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx
				1453	structure, which describes the device access attempt: access type
				1454	(mknod/read/write) and device (type, major and minor numbers).
				1455	If the program returns 0, the attempt fails with -EPERM, otherwise
				1456	it succeeds.
				1457
				1458	An example of BPF_CGROUP_DEVICE program may be found in the kernel
				1459	source tree in the tools/testing/selftests/bpf/dev_cgroup.c file.
				1460
				1461
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1462	RDMA
				1463	----
Tejun Heo	968ebff	2017-01-29 14:35:20 -0500	[diff] [blame]	1464
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1465	The "rdma" controller regulates the distribution and accounting of
				1466	RDMA resources.
				1467
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1468	RDMA Interface Files
				1469	~~~~~~~~~~~~~~~~~~~~
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1470
				1471	rdma.max
				1472	A read-write nested-keyed file that exists for all cgroups
				1473	except root and describes the currently configured resource limits
				1474	for an RDMA/IB device.
				1475
				1476	Lines are keyed by device name and are not ordered.
				1477	Each line contains space separated resource name and its configured
				1478	limit that can be distributed.
				1479
				1480	The following nested keys are defined.
				1481
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1482	========== =============================
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1483	hca_handle Maximum number of HCA Handles
				1484	hca_object Maximum number of HCA Objects
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1485	========== =============================
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1486
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1487	An example for mlx4 and ocrdma devices follows::
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1488
				1489	mlx4_0 hca_handle=2 hca_object=2000
				1490	ocrdma1 hca_handle=3 hca_object=max
				1491
				1492	rdma.current
				1493	A read-only file that describes current resource usage.
				1494	It exists for all cgroups except root.
				1495
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1496	An example for mlx4 and ocrdma devices follows::
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1497
				1498	mlx4_0 hca_handle=1 hca_object=20
				1499	ocrdma1 hca_handle=1 hca_object=23
				1500
				1501
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1502	Misc
				1503	----
Tejun Heo	63f1ca5	2017-02-02 13:50:35 -0500	[diff] [blame]	1504
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1505	perf_event
				1506	~~~~~~~~~~
Tejun Heo	968ebff	2017-01-29 14:35:20 -0500	[diff] [blame]	1507
				1508	perf_event controller, if not mounted on a legacy hierarchy, is
				1509	automatically enabled on the v2 hierarchy so that perf events can
				1510	always be filtered by cgroup v2 path. The controller can still be
				1511	moved to a legacy hierarchy after v2 hierarchy is populated.
				1512
				1513
Maciej S. Szmigiero	c4e0842	2018-01-10 23:33:19 +0100	[diff] [blame]	1514	Non-normative information
				1515	-------------------------
				1516
				1517	This section contains information that isn't considered to be a part of
				1518	the stable kernel API and so is subject to change.
				1519
				1520
				1521	CPU controller root cgroup process behaviour
				1522	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1523
				1524	When distributing CPU cycles in the root cgroup each thread in this
				1525	cgroup is treated as if it was hosted in a separate child cgroup of the
				1526	root cgroup. This child cgroup weight is dependent on its thread nice
				1527	level.
				1528
				1529	For details of this mapping see sched_prio_to_weight array in
				1530	kernel/sched/core.c file (values from this array should be scaled
				1531	appropriately so the neutral - nice 0 - value is 100 instead of 1024).
				1532
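As a worked example of the scaling described above, the following
Python sketch converts a few nice levels into the equivalent cgroup
weight. Only a handful of entries of sched_prio_to_weight are
reproduced here, and plain linear scaling by 100/1024 is an
assumption of this sketch; the text above only says the values should
be "scaled appropriately".

```python
# Selected entries of sched_prio_to_weight (kernel/sched/core.c),
# keyed by nice level. The full array covers nice -20 through 19.
SCHED_PRIO_TO_WEIGHT = {
    -20: 88761,
    -10: 9548,
    -5: 3121,
    0: 1024,
    5: 335,
    10: 110,
    19: 15,
}

NICE_0_WEIGHT = 1024    # scheduler weight of a nice 0 thread
CGROUP_NEUTRAL = 100    # default cgroup weight


def nice_to_cgroup_weight(nice):
    """Scale a scheduler weight so that nice 0 maps to weight 100."""
    return SCHED_PRIO_TO_WEIGHT[nice] * CGROUP_NEUTRAL / NICE_0_WEIGHT


print(nice_to_cgroup_weight(0))  # 100.0
```

Under this scaling a nice 19 thread in the root cgroup competes like a
child cgroup with a weight of roughly 1.5, and a nice -20 thread like
one with a weight of roughly 8668.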
				1533
				1534	IO controller root cgroup process behaviour
				1535	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1536
				1537	Root cgroup processes are hosted in an implicit leaf child node.
				1538	When distributing IO resources this implicit child node is taken into
				1539	account as if it was a normal child cgroup of the root cgroup with a
				1540	weight value of 200.
				1541
				1542
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1543	Namespace
				1544	=========
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1545
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1546	Basics
				1547	------
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1548
				1549	cgroup namespace provides a mechanism to virtualize the view of the
				1550	"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
				1551	flag can be used with clone(2) and unshare(2) to create a new cgroup
				1552	namespace. The process running inside the cgroup namespace will have
				1553	its "/proc/$PID/cgroup" output restricted to cgroupns root. The
				1554	cgroupns root is the cgroup of the process at the time of creation of
				1555	the cgroup namespace.
				1556
				1557	Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
				1558	complete path of the cgroup of a process. In a container setup where
				1559	a set of cgroups and namespaces are intended to isolate processes the
				1560	"/proc/$PID/cgroup" file may leak potential system level information
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1561	to the isolated processes. For example::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1562
				1563	# cat /proc/self/cgroup
				1564	0::/batchjobs/container_id1
				1565
				1566	The path '/batchjobs/container_id1' can be considered as system-data
				1567	and undesirable to expose to the isolated processes. cgroup namespace
				1568	can be used to restrict visibility of this path. For example, before
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1569	creating a cgroup namespace, one would see::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1570
				1571	# ls -l /proc/self/ns/cgroup
				1572	lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
				1573	# cat /proc/self/cgroup
				1574	0::/batchjobs/container_id1
				1575
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1576	After unsharing a new namespace, the view changes::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1577
				1578	# ls -l /proc/self/ns/cgroup
				1579	lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
				1580	# cat /proc/self/cgroup
				1581	0::/
				1582
				1583	When some thread from a multi-threaded process unshares its cgroup
				1584	namespace, the new cgroupns gets applied to the entire process (all
				1585	the threads). This is natural for the v2 hierarchy; however, for the
				1586	legacy hierarchies, this may be unexpected.
				1587
				1588	A cgroup namespace is alive as long as there are processes inside or
				1589	mounts pinning it. When the last usage goes away, the cgroup
				1590	namespace is destroyed. The cgroupns root and the actual cgroups
				1591	remain.
				1592
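Tools that need the cgroup path of a process typically read
"/proc/$PID/cgroup" and pick out the v2 entry, which has hierarchy ID
0 and an empty controller list. A minimal sketch in Python follows;
the function names are illustrative, and the path returned is the one
visible from the reader's own cgroup namespace.

```python
def parse_proc_cgroup(text):
    """Return a list of (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        # Split on the first two colons only; the path is the remainder.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def v2_path(text):
    """Return the cgroup v2 path from the "0::$PATH" entry, or None."""
    for hierarchy_id, controllers, path in parse_proc_cgroup(text):
        if hierarchy_id == 0 and not controllers:
            return path
    return None


print(v2_path("0::/batchjobs/container_id1\n"))  # /batchjobs/container_id1
```

Run inside a freshly unshared cgroup namespace, the same code would
return "/" for the calling process, matching the output shown above.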
				1593
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1594	The Root and Views
				1595	------------------
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1596
				1597	The 'cgroupns root' for a cgroup namespace is the cgroup in which the
				1598	process calling unshare(2) is running. For example, if a process in
				1599	/batchjobs/container_id1 cgroup calls unshare, cgroup
				1600	/batchjobs/container_id1 becomes the cgroupns root. For the
				1601	init_cgroup_ns, this is the real root ('/') cgroup.
				1602
				1603	The cgroupns root cgroup does not change even if the namespace creator
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1604	process later moves to a different cgroup::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1605
				1606	# ~/unshare -c # unshare cgroupns in some cgroup
				1607	# cat /proc/self/cgroup
				1608	0::/
				1609	# mkdir sub_cgrp_1
				1610	# echo 0 > sub_cgrp_1/cgroup.procs
				1611	# cat /proc/self/cgroup
				1612	0::/sub_cgrp_1
				1613
				1614	Each process gets its namespace-specific view of "/proc/$PID/cgroup".
				1615
				1616	Processes running inside the cgroup namespace will be able to see
				1617	cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1618	From within an unshared cgroupns::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1619
				1620	# sleep 100000 &
				1621	[1] 7353
				1622	# echo 7353 > sub_cgrp_1/cgroup.procs
				1623	# cat /proc/7353/cgroup
				1624	0::/sub_cgrp_1
				1625
				1626	From the initial cgroup namespace, the real cgroup path will be
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1627	visible::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1628
				1629	$ cat /proc/7353/cgroup
				1630	0::/batchjobs/container_id1/sub_cgrp_1
				1631
				1632	From a sibling cgroup namespace (that is, a namespace rooted at a
				1633	different cgroup), the cgroup path relative to its own cgroup
				1634	namespace root will be shown. For instance, if PID 7353's cgroup
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1635	namespace root is at '/batchjobs/container_id2', then it will see::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1636
				1637	# cat /proc/7353/cgroup
				1638	0::/../container_id2/sub_cgrp_1
				1639
				1640	Note that the relative path always starts with '/' to indicate that
				1641	it is relative to the cgroup namespace root of the caller.
				1642
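The views described in this section can be modelled as a relative path
computation against the reader's cgroupns root. The following Python
sketch only illustrates the resulting strings and is not how the
kernel implements them; the function name cgroup_path_in_namespace is
hypothetical.

```python
import posixpath


def cgroup_path_in_namespace(real_path, ns_root):
    """Return real_path as seen from a cgroup namespace rooted at ns_root.

    Both arguments are absolute paths in the initial cgroup namespace.
    The result always starts with '/'.
    """
    rel = posixpath.relpath(real_path, ns_root)
    return "/" if rel == "." else "/" + rel


root = "/batchjobs/container_id1"
print(cgroup_path_in_namespace("/batchjobs/container_id1/sub_cgrp_1", root))
# /sub_cgrp_1
print(cgroup_path_in_namespace("/batchjobs/container_id2/sub_cgrp_1", root))
# /../container_id2/sub_cgrp_1
```

Passing "/" as ns_root reproduces the view from the initial cgroup
namespace, where the full path is visible.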
				1643
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1644	Migration and setns(2)
				1645	----------------------
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1646
				1647	Processes inside a cgroup namespace can move into and out of the
				1648	namespace root if they have proper access to external cgroups. For
				1649	example, from inside a namespace with cgroupns root at
				1650	/batchjobs/container_id1, and assuming that the global hierarchy is
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1651	still accessible inside cgroupns::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1652
				1653	# cat /proc/7353/cgroup
				1654	0::/sub_cgrp_1
				1655	# echo 7353 > batchjobs/container_id2/cgroup.procs
				1656	# cat /proc/7353/cgroup
				1657	0::/../container_id2
				1658
				1659	Note that this kind of setup is not encouraged. A task inside cgroup
				1660	namespace should only be exposed to its own cgroupns hierarchy.
				1661
				1662	setns(2) to another cgroup namespace is allowed when:
				1663
				1664	(a) the process has CAP_SYS_ADMIN against its current user namespace
				1665	(b) the process has CAP_SYS_ADMIN against the target cgroup
				1666	namespace's userns
				1667
				1668	No implicit cgroup changes happen with attaching to another cgroup
				1669	namespace. It is expected that someone moves the attaching
				1670	process under the target cgroup namespace root.
				1671
				1672
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1673	Interaction with Other Namespaces
				1674	---------------------------------
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1675
				1676	Namespace specific cgroup hierarchy can be mounted by a process
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1677	running inside a non-init cgroup namespace::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1678
				1679	# mount -t cgroup2 none $MOUNT_POINT
				1680
				1681	This will mount the unified cgroup hierarchy with cgroupns root as the
				1682	filesystem root. The process needs CAP_SYS_ADMIN against its user and
				1683	mount namespaces.
				1684
				1685	The virtualization of /proc/self/cgroup file combined with restricting
				1686	the view of cgroup hierarchy by namespace-private cgroupfs mount
				1687	provides a properly isolated cgroup view inside the container.
				1688
				1689
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1690	Information on Kernel Programming
				1691	=================================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1692
				1693	This section contains kernel programming information in the areas
				1694	where interacting with cgroup is necessary. cgroup core and
				1695	controllers are not covered.
				1696
				1697
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1698	Filesystem Support for Writeback
				1699	--------------------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1700
				1701	A filesystem can support cgroup writeback by updating
				1702	address_space_operations->writepage[s]() to annotate bio's using the
				1703	following two functions.
				1704
				1705	wbc_init_bio(@wbc, @bio)
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1706	Should be called for each bio carrying writeback data and
				1707	associates the bio with the inode's owner cgroup. Can be
				1708	called anytime between bio allocation and submission.
				1709
				1710	wbc_account_io(@wbc, @page, @bytes)
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1711	Should be called for each data segment being written out.
				1712	While this function doesn't care exactly when it's called
				1713	during the writeback session, it's the easiest and most
				1714	natural to call it as data segments are added to a bio.
				1715
				1716	With writeback bio's annotated, cgroup support can be enabled per
				1717	super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
				1718	selective disabling of cgroup writeback support which is helpful when
				1719	certain filesystem features, e.g. journaled data mode, are
				1720	incompatible.
				1721
				1722	wbc_init_bio() binds the specified bio to its cgroup. Depending on
				1723	the configuration, the bio may be executed at a lower priority and if
				1724	the writeback session is holding shared resources, e.g. a journal
				1725	entry, may lead to priority inversion. There is no one easy solution
				1726	for the problem. Filesystems can try to work around specific problem
				1727	cases by skipping wbc_init_bio() or using bio_associate_blkcg()
				1728	directly.
				1729
				1730
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1731	Deprecated v1 Core Features
				1732	===========================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1733
				1734	- Multiple hierarchies including named ones are not supported.
				1735
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	1736	- None of the v1 mount options are supported.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1737
				1738	- The "tasks" file is removed and "cgroup.procs" is not sorted.
				1739
				1740	- "cgroup.clone_children" is removed.
				1741
				1742	- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
				1743	at the root instead.
				1744
				1745
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1746	Issues with v1 and Rationales for v2
				1747	====================================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1748
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1749	Multiple Hierarchies
				1750	--------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1751
				1752	cgroup v1 allowed an arbitrary number of hierarchies and each
				1753	hierarchy could host any number of controllers. While this seemed to
				1754	provide a high level of flexibility, it wasn't useful in practice.
				1755
				1756	For example, as there is only one instance of each controller, utility
				1757	type controllers such as freezer which can be useful in all
				1758	hierarchies could only be used in one. The issue is exacerbated by
				1759	the fact that controllers couldn't be moved to another hierarchy once
				1760	hierarchies were populated. Another issue was that all controllers
				1761	bound to a hierarchy were forced to have exactly the same view of the
				1762	hierarchy. It wasn't possible to vary the granularity depending on
				1763	the specific controller.
				1764
				1765	In practice, these issues heavily limited which controllers could be
				1766	put on the same hierarchy and most configurations resorted to putting
				1767	each controller on its own hierarchy. Only closely related ones, such
				1768	as the cpu and cpuacct controllers, made sense to be put on the same
				1769	hierarchy. This often meant that userland ended up managing multiple
				1770	similar hierarchies repeating the same steps on each hierarchy
				1771	whenever a hierarchy management operation was necessary.
				1772
				1773	Furthermore, support for multiple hierarchies came at a steep cost.
				1774	It greatly complicated cgroup core implementation but more importantly
				1775	the support for multiple hierarchies restricted how cgroup could be
				1776	used in general and what controllers were able to do.
				1777
				1778	There was no limit on how many hierarchies there might be, which meant
				1779	that a thread's cgroup membership couldn't be described in finite
				1780	length. The key might contain any number of entries and was unlimited
				1781	in length, which made it highly awkward to manipulate and led to the
				1782	addition of controllers which existed only to identify membership,
				1783	which in turn exacerbated the original problem of a proliferating
				1784	number of hierarchies.
				1785
				1786	Also, as a controller couldn't have any expectation regarding the
				1787	topologies of hierarchies other controllers might be on, each
				1788	controller had to assume that all other controllers were attached to
				1789	completely orthogonal hierarchies. This made it impossible, or at
				1790	least very cumbersome, for controllers to cooperate with each other.
				1791
				1792	In most use cases, putting controllers on hierarchies which are
				1793	completely orthogonal to each other isn't necessary. What usually is
				1794	called for is the ability to have differing levels of granularity
				1795	depending on the specific controller. In other words, the hierarchy may
				1796	be collapsed from leaf towards root when viewed from specific
				1797	controllers. For example, a given configuration might not care about
				1798	how memory is distributed beyond a certain level while still wanting
				1799	to control how CPU cycles are distributed.
				1800
				1801
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1802	Thread Granularity
				1803	------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1804
				1805	cgroup v1 allowed threads of a process to belong to different cgroups.
				1806	This didn't make sense for some controllers and those controllers
				1807	ended up implementing different ways to ignore such situations but
				1808	much more importantly it blurred the line between API exposed to
				1809	individual applications and system management interface.
				1810
				1811	Generally, in-process knowledge is available only to the process
				1812	itself; thus, unlike service-level organization of processes,
				1813	categorizing threads of a process requires active participation from
				1814	the application which owns the target process.
				1815
				1816	cgroup v1 had an ambiguously defined delegation model which got abused
				1817	in combination with thread granularity. cgroups were delegated to
				1818	individual applications so that they could create and manage their own
				1819	sub-hierarchies and control resource distributions along them. This
				1820	effectively raised cgroup to the status of a syscall-like API exposed
				1821	to lay programs.
				1822
				1823	First of all, cgroup has a fundamentally inadequate interface to be
				1824	exposed this way. For a process to access its own knobs, it has to
				1825	extract the path on the target hierarchy from /proc/self/cgroup,
				1826	construct the path by appending the name of the knob to the path, open
				1827	and then read and/or write to it. This is not only extremely clunky
				1828	and unusual but also inherently racy. There is no conventional way to
				1829	define a transaction across the required steps and nothing can guarantee
				1830	that the process would actually be operating on its own sub-hierarchy.
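
The steps described above can be sketched as follows. This is only an
illustration of how many separate operations are involved, not a
recommended interface. The helper names and the example mount point
are assumptions; the only format relied upon is that each line of
/proc/self/cgroup reads "hierarchy-ID:controller-list:cgroup-path".

```python
import os


def parse_proc_cgroup(text):
    """Map each controller list in /proc/<pid>/cgroup content to its path."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path may contain ':'.
        _hierarchy_id, controllers, path = line.split(":", 2)
        entries[controllers] = path
    return entries


def knob_path(mount_point, cgroup_path, knob):
    """Append the cgroup path and the knob name to the hierarchy mount point."""
    return os.path.join(mount_point, cgroup_path.lstrip("/"), knob)


def read_own_knob(mount_point, controllers, knob):
    """Hypothetical helper: read one of the calling process's own knobs."""
    # Step 1: extract our path on the target hierarchy.
    with open("/proc/self/cgroup") as f:
        path = parse_proc_cgroup(f.read())[controllers]
    # The process may be migrated to another cgroup at this point, in
    # which case the path built below refers to a cgroup it has left.
    # Steps 2 and 3: construct the knob path, then open and read it.
    with open(knob_path(mount_point, path, knob)) as f:
        return f.read()
```

Nothing ties the lookup in the first step to the open in the last one,
which is the race the paragraph above refers to.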
				1831
				1832	cgroup controllers implemented a number of knobs which would never be
				1833	accepted as public APIs because they were just adding control knobs to
				1834	a system-management pseudo filesystem. cgroup ended up with interface
				1835	knobs which were not properly abstracted or refined and directly
				1836	revealed kernel internal details. These knobs got exposed to
				1837	individual applications through the ill-defined delegation mechanism
				1838	effectively abusing cgroup as a shortcut to implementing public APIs
				1839	without going through the required scrutiny.
				1840
				1841	This was painful for both userland and kernel. Userland ended up with
				1842	misbehaving and poorly abstracted interfaces, and the kernel ended up
				1843	inadvertently exposing, and being locked into, constructs.
				1844
				1845
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1846	Competition Between Inner Nodes and Threads
				1847	-------------------------------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1848
				1849	cgroup v1 allowed threads to be in any cgroup, which created an
				1850	interesting problem where threads belonging to a parent cgroup and its
				1851	child cgroups competed for resources. This was nasty as two
				1852	different types of entities competed and there was no obvious way to
				1853	settle it. Different controllers did different things.
				1854
				1855	The cpu controller considered threads and cgroups as equivalents and
				1856	mapped nice levels to cgroup weights. This worked for some cases but
				1857	fell flat when children wanted to be allocated specific ratios of CPU
				1858	cycles and the number of internal threads fluctuated - the ratios
				1859	constantly changed as the number of competing entities fluctuated.
				1860	There also were other issues. The mapping from nice level to weight
				1861	wasn't obvious or universal, and there were various other knobs which
				1862	simply weren't available for threads.
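
The fluctuation can be illustrated with a small calculation. This is a
simplified model rather than the scheduler's actual code: it assumes
one child cgroup competing with the parent's internal threads, with
every entity runnable and CPU time split in proportion to weight. The
function name is hypothetical; 1024 is both the weight of a nice-0
thread and the default value of cpu.shares.

```python
NICE_0_WEIGHT = 1024  # weight of a nice-0 thread; also the default cpu.shares


def child_cgroup_share(child_shares, internal_thread_weights):
    """Fraction of the parent's CPU time that goes to the child cgroup.

    The child cgroup counts as a single entity next to each internal
    thread, so its fraction shrinks as more threads join the parent.
    """
    total = child_shares + sum(internal_thread_weights)
    return child_shares / total


# One internal thread: the child cgroup gets half of the parent's time.
one_thread = child_cgroup_share(1024, [NICE_0_WEIGHT])
# Three internal threads: the same child now gets only a quarter.
three_threads = child_cgroup_share(1024, [NICE_0_WEIGHT] * 3)
```

The child's configured weight never changed between the two cases;
only the number of competing internal threads did.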
				1863
				1864	The io controller implicitly created a hidden leaf node for each
				1865	cgroup to host the threads. The hidden leaf had its own copies of all
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1866	the knobs with ``leaf_`` prefixed. While this allowed equivalent
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1867	control over internal threads, it came with serious drawbacks. It
				1868	always added an extra layer of nesting which wouldn't be necessary
				1869	otherwise, made the interface messy and significantly complicated the
				1870	implementation.
				1871
				1872	The memory controller didn't have a way to control what happened
				1873	between internal tasks and child cgroups and the behavior was not
				1874	clearly defined. There were attempts to add ad-hoc behaviors and
				1875	knobs to tailor the behavior to specific workloads which would have
				1876	led to problems extremely difficult to resolve in the long term.
				1877
				1878	Multiple controllers struggled with internal tasks and came up with
				1879	different ways to deal with it; unfortunately, all the approaches were
				1880	severely flawed and, furthermore, the widely different behaviors
				1881	made cgroup as a whole highly inconsistent.
				1882
				1883	This clearly is a problem which needs to be addressed from cgroup core
				1884	in a uniform way.
				1885
				1886
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1887	Other Interface Issues
				1888	----------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1889
				1890	cgroup v1 grew without oversight and developed a large number of
				1891	idiosyncrasies and inconsistencies. One issue on the cgroup core side
				1892	was how an empty cgroup was notified - a userland helper binary was
				1893	forked and executed for each event. The event delivery wasn't
				1894	recursive or delegatable. The limitations of the mechanism also led
				1895	to an in-kernel event delivery filtering mechanism, further
				1896	complicating the interface.
				1897
				1898	Controller interfaces were problematic too. An extreme example is
				1899	controllers completely ignoring hierarchical organization and treating
				1900	all cgroups as if they were all located directly under the root
				1901	cgroup. Some controllers exposed a large amount of inconsistent
				1902	implementation details to userland.
				1903
				1904	There also was no consistency across controllers. When a new cgroup
				1905	was created, some controllers defaulted to not imposing extra
				1906	restrictions while others disallowed any resource usage until
				1907	explicitly configured. Configuration knobs for the same type of
				1908	control used widely differing naming schemes and formats. Statistics
				1909	and information knobs were named arbitrarily and used different
				1910	formats and units even in the same controller.
				1911
				1912	cgroup v2 establishes common conventions where appropriate and updates
				1913	controllers so that they expose minimal and consistent interfaces.
				1914
				1915
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1916	Controller Issues and Remedies
				1917	------------------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1918
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1919	Memory
				1920	~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1921
				1922	The original lower boundary, the soft limit, is defined as a limit
				1923	that is per default unset. As a result, the set of cgroups that
				1924	global reclaim prefers is opt-in, rather than opt-out. The costs for
				1925	optimizing these mostly negative lookups are so high that the
				1926	implementation, despite its enormous size, does not even provide the
				1927	basic desirable behavior. First off, the soft limit has no
				1928	hierarchical meaning. All configured groups are organized in a global
				1929	rbtree and treated like equal peers, regardless of where they are
				1930	located in the hierarchy. This makes subtree delegation impossible.
				1931	Second, the soft limit reclaim pass is so aggressive that it not only
				1932	introduces high allocation latencies into the system, but also impacts
				1933	system performance due to overreclaim, to the point where the feature
				1934	becomes self-defeating.
				1935
				1936	The memory.low boundary on the other hand is a top-down allocated
				1937	reserve. A cgroup enjoys reclaim protection when it and all its
				1938	ancestors are below their low boundaries, which makes delegation of
				1939	subtrees possible. Secondly, new cgroups have no reserve per default
				1940	and in the common case most cgroups are eligible for the preferred
				1941	reclaim pass. This allows the new low boundary to be efficiently
				1942	implemented with just a minor addition to the generic reclaim code,
				1943	without the need for out-of-band data structures and reclaim passes.
				1944	Because the generic reclaim code considers all cgroups except for the
				1945	ones running low in the preferred first reclaim pass, overreclaim of
				1946	individual groups is eliminated as well, resulting in much better
				1947	overall workload performance.
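
The protection rule stated above can be written down as a toy
predicate. It follows the sentence literally and is not the kernel's
implementation; the Cgroup class and its fields are hypothetical, and
the root cgroup is left out of the model, so a top-level cgroup has no
parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    name: str
    usage: int                        # current memory usage in bytes
    low: int = 0                      # memory.low; new cgroups have no reserve
    parent: Optional["Cgroup"] = None  # None for a top-level cgroup


def is_protected(cgroup):
    """True if the cgroup and all its ancestors are below their low boundary."""
    node = cgroup
    while node is not None:
        if node.usage >= node.low:
            return False
        node = node.parent
    return True
```

Because the check walks up the tree, a delegated subtree cannot give
itself more protection than its ancestors were granted, and a cgroup
left at the default of 0 is always eligible for the preferred reclaim
pass.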
				1948
				1949	The original high boundary, the hard limit, is defined as a strict
				1950	limit that can not budge, even if the OOM killer has to be called.
				1951	But this generally goes against the goal of making the most out of the
				1952	available memory. The memory consumption of workloads varies during
				1953	runtime, and that requires users to overcommit. But doing that with a
				1954	strict upper limit requires either a fairly accurate prediction of the
				1955	working set size or adding slack to the limit. Since working set size
				1956	estimation is hard and error prone, and getting it wrong results in
				1957	OOM kills, most users tend to err on the side of a looser limit and
				1958	end up wasting precious resources.
				1959
				1960	The memory.high boundary on the other hand can be set much more
				1961	conservatively. When hit, it throttles allocations by forcing them
				1962	into direct reclaim to work off the excess, but it never invokes the
				1963	OOM killer. As a result, a high boundary that is chosen too
				1964	aggressively will not terminate the processes, but instead it will
				1965	lead to gradual performance degradation. The user can monitor this
				1966	and make corrections until the minimal memory footprint that still
				1967	gives acceptable performance is found.
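
The monitoring step can be sketched as below. The helper names are
hypothetical and the caller is assumed to pass the directory of a
cgroup on a mounted cgroup v2 hierarchy. The sketch relies only on
memory.high accepting a byte value and on memory.events being a flat
keyed file with a "high" entry.

```python
import os


def read_flat_keyed(path):
    """Parse a flat keyed file ("KEY VALUE" per line) into a dict of ints."""
    values = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            key, value = line.split()
            values[key] = int(value)
    return values


def high_events(cgroup_dir):
    """Number of times the cgroup was throttled for exceeding memory.high."""
    events = read_flat_keyed(os.path.join(cgroup_dir, "memory.events"))
    return events.get("high", 0)


def set_memory_high(cgroup_dir, limit_bytes):
    """Write a new high boundary, in bytes, to memory.high."""
    with open(os.path.join(cgroup_dir, "memory.high"), "w") as f:
        f.write(str(limit_bytes))
```

A tuning loop could lower the boundary with set_memory_high(), let the
workload run, and compare high_events() before and after along with
the workload's own performance metrics to decide whether to lower the
boundary further or back off.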
				1968
				1969	In extreme cases, with many concurrent allocations and a complete
				1970	breakdown of reclaim progress within the group, the high boundary can
				1971	be exceeded. But even then it's mostly better to satisfy the
				1972	allocation from the slack available in other groups or the rest of the
				1973	system than killing the group. Otherwise, memory.max is there to
				1974	limit this type of spillover and ultimately contain buggy or even
				1975	malicious applications.
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1976
Johannes Weiner	b6e6edc	2016-03-17 14:20:28 -0700	[diff] [blame]	1977	Setting the original memory.limit_in_bytes below the current usage was
				1978	subject to a race condition, where concurrent charges could cause the
				1979	limit setting to fail. memory.max on the other hand will first set the
				1980	limit to prevent new charges, and then reclaim and OOM kill until the
				1981	new limit is met - or the task writing to memory.max is killed.
				1982
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1983	The combined memory+swap accounting and limiting is replaced by real
				1984	control over swap space.
				1985
				1986	The main argument for a combined memory+swap facility in the original
				1987	cgroup design was that global or parental pressure would always be
				1988	able to swap all anonymous memory of a child group, regardless of the
				1989	child's own (possibly untrusted) configuration. However, untrusted
				1990	groups can sabotage swapping by other means - such as referencing their
				1991	anonymous memory in a tight loop - and an admin can not assume full
				1992	swappability when overcommitting untrusted jobs.
				1993
				1994	For trusted jobs, on the other hand, a combined counter is not an
				1995	intuitive userspace interface, and it flies in the face of the idea
				1996	that cgroup controllers should account and limit specific physical
				1997	resources. Swap space is a resource like all others in the system,
				1998	and that's why unified hierarchy allows distributing it separately.