================
Control Group v2
================

:Date: October, 2015
:Author: Tejun Heo <tj@kernel.org>

This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
v1 is available under Documentation/cgroup-v1/.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	13
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	14	.. CONTENTS
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	15
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	16	1. Introduction
				17	1-1. Terminology
				18	1-2. What is cgroup?
				19	2. Basic Operations
				20	2-1. Mounting
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	21	2-2. Organizing Processes and Threads
				22	2-2-1. Processes
				23	2-2-2. Threads
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	24	2-3. [Un]populated Notification
				25	2-4. Controlling Controllers
				26	2-4-1. Enabling and Disabling
				27	2-4-2. Top-down Constraint
				28	2-4-3. No Internal Process Constraint
				29	2-5. Delegation
				30	2-5-1. Model of Delegation
				31	2-5-2. Delegation Containment
				32	2-6. Guidelines
				33	2-6-1. Organize Once and Control
				34	2-6-2. Avoid Name Collisions
				35	3. Resource Distribution Models
				36	3-1. Weights
				37	3-2. Limits
				38	3-3. Protections
				39	3-4. Allocations
				40	4. Interface Files
				41	4-1. Format
				42	4-2. Conventions
				43	4-3. Core Interface Files
				44	5. Controllers
				45	5-1. CPU
				46	5-1-1. CPU Interface Files
				47	5-2. Memory
				48	5-2-1. Memory Interface Files
				49	5-2-2. Usage Guidelines
				50	5-2-3. Memory Ownership
				51	5-3. IO
				52	5-3-1. IO Interface Files
				53	5-3-2. Writeback
				54	5-4. PID
				55	5-4-1. PID Interface Files
Roman Gushchin	4ad5a32	2017-12-13 19:49:03 +0000	[diff] [blame]	56	5-5. Device
				57	5-6. RDMA
				58	5-6-1. RDMA Interface Files
				59	5-7. Misc
				60	5-7-1. perf_event
Maciej S. Szmigiero	c4e0842	2018-01-10 23:33:19 +0100	[diff] [blame]	61	5-N. Non-normative information
				62	5-N-1. CPU controller root cgroup process behaviour
				63	5-N-2. IO controller root cgroup process behaviour
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	64	6. Namespace
				65	6-1. Basics
				66	6-2. The Root and Views
				67	6-3. Migration and setns(2)
				68	6-4. Interaction with Other Namespaces
				69	P. Information on Kernel Programming
				70	P-1. Filesystem Support for Writeback
				71	D. Deprecated v1 Core Features
				72	R. Issues with v1 and Rationales for v2
				73	R-1. Multiple Hierarchies
				74	R-2. Thread Granularity
				75	R-3. Competition Between Inner Nodes and Threads
				76	R-4. Other Interface Issues
				77	R-5. Controller Issues and Remedies
				78	R-5-1. Memory
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	79
				80
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	81	Introduction
				82	============
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	83
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	84	Terminology
				85	-----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	86
				87	"cgroup" stands for "control group" and is never capitalized. The
				88	singular form is used to designate the whole feature and also as a
				89	qualifier as in "cgroup controllers". When explicitly referring to
				90	multiple individual control groups, the plural form "cgroups" is used.
				91
				92
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	93	What is cgroup?
				94	---------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	95
				96	cgroup is a mechanism to organize processes hierarchically and
				97	distribute system resources along the hierarchy in a controlled and
				98	configurable manner.
				99
				100	cgroup is largely composed of two parts - the core and controllers.
				101	cgroup core is primarily responsible for hierarchically organizing
				102	processes. A cgroup controller is usually responsible for
				103	distributing a specific type of system resource along the hierarchy
				104	although there are utility controllers which serve purposes other than
				105	resource distribution.
				106
				107	cgroups form a tree structure and every process in the system belongs
				108	to one and only one cgroup. All threads of a process belong to the
				109	same cgroup. On creation, all processes are put in the cgroup that
				110	the parent process belongs to at the time. A process can be migrated
				111	to another cgroup. Migration of a process doesn't affect already
				112	existing descendant processes.
				113
Following certain structural constraints, controllers may be enabled or
disabled selectively on a cgroup. All controller behaviors are
hierarchical - if a controller is enabled on a cgroup, it affects all
processes which belong to the cgroups comprising the inclusive
sub-hierarchy of the cgroup. When a controller is enabled on a nested
cgroup, it always restricts the resource distribution further. The
restrictions set closer to the root in the hierarchy cannot be
overridden from further away.
				122
				123
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	124	Basic Operations
				125	================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	126
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	127	Mounting
				128	--------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	129
Unlike v1, cgroup v2 has only a single hierarchy. The cgroup v2
hierarchy can be mounted with the following mount command::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	132
				133	# mount -t cgroup2 none $MOUNT_POINT
				134
				135	cgroup2 filesystem has the magic number 0x63677270 ("cgrp"). All
				136	controllers which support v2 and are not bound to a v1 hierarchy are
				137	automatically bound to the v2 hierarchy and show up at the root.
				138	Controllers which are not in active use in the v2 hierarchy can be
				139	bound to other hierarchies. This allows mixing v2 hierarchy with the
				140	legacy v1 multiple hierarchies in a fully backward compatible way.
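
The magic number is just the ASCII bytes of "cgrp" read as a big-endian
32-bit integer. A minimal sketch of that check follows. The helper name
magic_to_ascii is made up for illustration; the constant name mirrors
CGROUP2_SUPER_MAGIC from the kernel's magic-number header.

```python
# Decode the cgroup2 filesystem magic number into its ASCII spelling.
CGROUP2_SUPER_MAGIC = 0x63677270


def magic_to_ascii(magic: int) -> str:
    """Interpret a 32-bit magic number as four big-endian ASCII bytes."""
    return magic.to_bytes(4, byteorder="big").decode("ascii")


print(magic_to_ascii(CGROUP2_SUPER_MAGIC))  # cgrp
```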
				141
				142	A controller can be moved across hierarchies only after the controller
				143	is no longer referenced in its current hierarchy. Because per-cgroup
				144	controller states are destroyed asynchronously and controllers may
				145	have lingering references, a controller may not show up immediately on
				146	the v2 hierarchy after the final umount of the previous hierarchy.
				147	Similarly, a controller should be fully disabled to be moved out of
				148	the unified hierarchy and it may take some time for the disabled
				149	controller to become available for other hierarchies; furthermore, due
				150	to inter-controller dependencies, other controllers may need to be
				151	disabled too.
				152
While useful for development and manual configurations, moving
controllers dynamically between the v2 and other hierarchies is
strongly discouraged for production use. It is recommended to decide
the hierarchies and controller associations before starting to use the
controllers after system boot.
				158
During transition to v2, system management software might still
automount the v1 cgroup filesystem and so hijack all controllers
during boot, before manual intervention is possible. To make testing
and experimenting easier, the kernel parameter cgroup_no_v1= allows
disabling controllers in v1 and makes them always available in v2.
				164
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	165	cgroup v2 currently supports the following mount options.
				166
				167	nsdelegate
				168
				169	Consider cgroup namespaces as delegation boundaries. This
				170	option is system wide and can only be set on mount or modified
				171	through remount from the init namespace. The mount option is
				172	ignored on non-init namespace mounts. Please refer to the
				173	Delegation section for details.
				174
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	175
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	176	Organizing Processes and Threads
				177	--------------------------------
				178
				179	Processes
				180	~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	181
				182	Initially, only the root cgroup exists to which all processes belong.
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	183	A child cgroup can be created by creating a sub-directory::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	184
				185	# mkdir $CGROUP_NAME
				186
				187	A given cgroup may have multiple child cgroups forming a tree
				188	structure. Each cgroup has a read-writable interface file
				189	"cgroup.procs". When read, it lists the PIDs of all processes which
				190	belong to the cgroup one-per-line. The PIDs are not ordered and the
				191	same PID may show up more than once if the process got moved to
				192	another cgroup and then back or the PID got recycled while reading.
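
Because the listing is unordered and may repeat a PID, a reader will
usually want to de-duplicate it. A minimal sketch, assuming the file
contents have already been read into a string (the function name
parse_procs is hypothetical):

```python
from typing import List


def parse_procs(text: str) -> List[int]:
    """Return the unique PIDs found in the contents of "cgroup.procs"."""
    # One PID per line; blank lines are skipped and duplicates collapse
    # because the PIDs are collected into a set before sorting.
    pids = {int(line) for line in text.splitlines() if line.strip()}
    return sorted(pids)


print(parse_procs("842\n17\n842\n"))  # [17, 842]
```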
				193
				194	A process can be migrated into a cgroup by writing its PID to the
				195	target cgroup's "cgroup.procs" file. Only one process can be migrated
				196	on a single write(2) call. If a process is composed of multiple
				197	threads, writing the PID of any thread migrates all threads of the
				198	process.
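
The migration described above amounts to a single write of one PID. A
hedged sketch follows; the function name migrate and its arguments are
made up for illustration, and on a real system the call needs write
permission on the target cgroup's "cgroup.procs":

```python
import os


def migrate(cgroup_dir: str, pid: int) -> None:
    """Move the process `pid` into the cgroup at `cgroup_dir`."""
    # Only one process can be migrated per write(2), so the file is
    # opened, a single PID is written, and the file is closed again.
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(f"{pid}\n")
```

To move several processes, call the function once per PID rather than
writing several PIDs in one go.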
				199
				200	When a process forks a child process, the new process is born into the
				201	cgroup that the forking process belongs to at the time of the
				202	operation. After exit, a process stays associated with the cgroup
				203	that it belonged to at the time of exit until it's reaped; however, a
				204	zombie process does not appear in "cgroup.procs" and thus can't be
				205	moved to another cgroup.
				206
				207	A cgroup which doesn't have any children or live processes can be
				208	destroyed by removing the directory. Note that a cgroup which doesn't
				209	have any children and is associated only with zombie processes is
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	210	considered empty and can be removed::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	211
				212	# rmdir $CGROUP_NAME
				213
				214	"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
				215	cgroup is in use in the system, this file may contain multiple lines,
				216	one for each hierarchy. The entry for cgroup v2 is always in the
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	217	format "0::$PATH"::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	218
				219	# cat /proc/842/cgroup
				220	...
				221	0::/test-cgroup/test-cgroup-nested
				222
				223	If the process becomes a zombie and the cgroup it was associated with
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	224	is removed subsequently, " (deleted)" is appended to the path::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	225
				226	# cat /proc/842/cgroup
				227	...
				228	0::/test-cgroup/test-cgroup-nested (deleted)
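
The two outputs above can be handled with a small parser. This is a
sketch under stated assumptions: the function name v2_membership is
hypothetical, the v1 line in the sample is made up, and a cgroup whose
real name ends in " (deleted)" would be misread as removed.

```python
from typing import Optional, Tuple

DELETED_SUFFIX = " (deleted)"


def v2_membership(proc_cgroup_text: str) -> Optional[Tuple[str, bool]]:
    """Return (path, deleted) for the cgroup v2 entry, or None if absent."""
    for line in proc_cgroup_text.splitlines():
        # The cgroup v2 entry is always in the format "0::$PATH".
        if line.startswith("0::"):
            path = line[len("0::"):]
            deleted = path.endswith(DELETED_SUFFIX)
            if deleted:
                path = path[: -len(DELETED_SUFFIX)]
            return path, deleted
    return None


sample = "1:name=systemd:/\n0::/test-cgroup/test-cgroup-nested (deleted)\n"
print(v2_membership(sample))  # ('/test-cgroup/test-cgroup-nested', True)
```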
				229
				230
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	231	Threads
				232	~~~~~~~
				233
				234	cgroup v2 supports thread granularity for a subset of controllers to
				235	support use cases requiring hierarchical resource distribution across
				236	the threads of a group of processes. By default, all threads of a
				237	process belong to the same cgroup, which also serves as the resource
				238	domain to host resource consumptions which are not specific to a
				239	process or thread. The thread mode allows threads to be spread across
				240	a subtree while still maintaining the common resource domain for them.
				241
				242	Controllers which support thread mode are called threaded controllers.
				243	The ones which don't are called domain controllers.
				244
Marking a cgroup threaded makes it join the resource domain of its
parent as a threaded cgroup. The parent may be another threaded
cgroup whose resource domain is further up in the hierarchy. The root
of a threaded subtree, that is, the nearest ancestor which is not
threaded, is called the threaded domain or thread root interchangeably
and serves as the resource domain for the entire subtree.
				251
				252	Inside a threaded subtree, threads of a process can be put in
				253	different cgroups and are not subject to the no internal process
				254	constraint - threaded controllers can be enabled on non-leaf cgroups
				255	whether they have threads in them or not.
				256
As the threaded domain cgroup hosts all the domain resource
consumptions of the subtree, it is considered to have internal
resource consumptions whether there are processes in it or not and
can't have populated child cgroups which aren't threaded. Because the
root cgroup is not subject to the no internal process constraint, it
can serve both as a threaded domain and a parent to domain cgroups.
				263
				264	The current operation mode or type of the cgroup is shown in the
				265	"cgroup.type" file which indicates whether the cgroup is a normal
				266	domain, a domain which is serving as the domain of a threaded subtree,
				267	or a threaded cgroup.
				268
On creation, a cgroup is always a domain cgroup and can be made
threaded by writing "threaded" to the "cgroup.type" file. The
operation is one-way::
				272
				273	# echo threaded > cgroup.type
				274
				275	Once threaded, the cgroup can't be made a domain again. To enable the
				276	thread mode, the following conditions must be met.
				277
- As the cgroup will join the parent's resource domain, the parent
  must either be a valid (threaded) domain or a threaded cgroup.
				280
Tejun Heo	918a8c2	2017-07-23 08:18:26 -0400	[diff] [blame]	281	- When the parent is an unthreaded domain, it must not have any domain
				282	controllers enabled or populated domain children. The root is
				283	exempt from this requirement.
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	284
				285	Topology-wise, a cgroup can be in an invalid state. Please consider
Vladimir Rutsky	2877cbe	2018-01-02 17:27:41 +0100	[diff] [blame]	286	the following topology::
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	287
				288	A (threaded domain) - B (threaded) - C (domain, just created)
				289
				290	C is created as a domain but isn't connected to a parent which can
				291	host child domains. C can't be used until it is turned into a
				292	threaded cgroup. "cgroup.type" file will report "domain (invalid)" in
				293	these cases. Operations which fail due to invalid topology use
				294	EOPNOTSUPP as the errno.
				295
A domain cgroup is turned into a threaded domain when one of its child
cgroups becomes threaded or threaded controllers are enabled in the
"cgroup.subtree_control" file while there are processes in the cgroup.
A threaded domain reverts to a normal domain when the conditions
clear.
				301
				302	When read, "cgroup.threads" contains the list of the thread IDs of all
				303	threads in the cgroup. Except that the operations are per-thread
				304	instead of per-process, "cgroup.threads" has the same format and
				305	behaves the same way as "cgroup.procs". While "cgroup.threads" can be
				306	written to in any cgroup, as it can only move threads inside the same
				307	threaded domain, its operations are confined inside each threaded
				308	subtree.
				309
				310	The threaded domain cgroup serves as the resource domain for the whole
				311	subtree, and, while the threads can be scattered across the subtree,
				312	all the processes are considered to be in the threaded domain cgroup.
				313	"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
				314	processes in the subtree and is not readable in the subtree proper.
				315	However, "cgroup.procs" can be written to from anywhere in the subtree
				316	to migrate all threads of the matching process to the cgroup.
				317
				318	Only threaded controllers can be enabled in a threaded subtree. When
				319	a threaded controller is enabled inside a threaded subtree, it only
				320	accounts for and controls resource consumptions associated with the
				321	threads in the cgroup and its descendants. All consumptions which
				322	aren't tied to a specific thread belong to the threaded domain cgroup.
				323
Because a threaded subtree is exempt from the no internal process
constraint, a threaded controller must be able to handle competition
between threads in a non-leaf cgroup and its child cgroups. Each
threaded controller defines how such competitions are handled.
				328
				329
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	330	[Un]populated Notification
				331	--------------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	332
Each non-root cgroup has a "cgroup.events" file which contains a
"populated" field indicating whether the cgroup's sub-hierarchy has
live processes in it. Its value is 0 if there is no live process in
the cgroup and its descendants; otherwise, 1. poll and [id]notify
				337	events are triggered when the value changes. This can be used, for
				338	example, to start a clean-up operation after all processes of a given
				339	sub-hierarchy have exited. The populated state updates and
				340	notifications are recursive. Consider the following sub-hierarchy
				341	where the numbers in the parentheses represent the numbers of processes
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	342	in each cgroup::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	343
  A(4) - B(0) - C(1)
              \ D(0)
				346
A, B and C's "populated" fields would be 1 while D's would be 0. After
the one process in C exits, B and C's "populated" fields would flip to
"0" and file modified events will be generated on the "cgroup.events"
files of both cgroups.
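
The recursive rule can be modelled in a few lines. This is an
illustrative sketch, not kernel code: the tree and process counts come
from the example above, with D read as a child of B, and the function
name populated is chosen only to match the field name.

```python
from typing import Dict, List


def populated(tree: Dict[str, List[str]], procs: Dict[str, int], name: str) -> int:
    """Return 1 if `name` or any of its descendants has a live process."""
    if procs.get(name, 0) > 0:
        return 1
    return int(any(populated(tree, procs, child) for child in tree.get(name, [])))


tree = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}
procs = {"A": 4, "B": 0, "C": 1, "D": 0}

before = {name: populated(tree, procs, name) for name in tree}
procs["C"] = 0  # the one process in C exits
after = {name: populated(tree, procs, name) for name in tree}

# cgroups whose "populated" value changed, and so would see an event
changed = sorted(name for name in tree if before[name] != after[name])
print(changed)  # ['B', 'C']
```

A keeps its value of 1 because it still has processes of its own, so no
event would be generated there.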
				351
				352
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	353	Controlling Controllers
				354	-----------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	355
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	356	Enabling and Disabling
				357	~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	358
				359	Each cgroup has a "cgroup.controllers" file which lists all
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	360	controllers available for the cgroup to enable::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	361
				362	# cat cgroup.controllers
				363	cpu io memory
				364
				365	No controller is enabled by default. Controllers can be enabled and
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	366	disabled by writing to the "cgroup.subtree_control" file::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	367
				368	# echo "+cpu +memory -io" > cgroup.subtree_control
				369
Only controllers which are listed in "cgroup.controllers" can be
enabled. When multiple operations are specified as above, either they
all succeed or they all fail. If multiple operations on the same
controller are specified, the last one is effective.
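
The write semantics above (availability check, all-or-nothing, last
operation wins) can be sketched as a pure function. The name
apply_subtree_control and the use of ValueError are assumptions for
illustration; the kernel reports failures through errno values instead.

```python
from typing import Dict, Iterable, Set


def apply_subtree_control(enabled: Set[str], available: Iterable[str], request: str) -> Set[str]:
    """Return the new enabled set after a write such as "+cpu +memory -io"."""
    available = set(available)
    ops: Dict[str, str] = {}
    # Validate everything first so that a bad token changes nothing.
    for token in request.split():
        sign, name = token[0], token[1:]
        if sign not in ("+", "-") or not name:
            raise ValueError(f"malformed operation: {token!r}")
        if sign == "+" and name not in available:
            raise ValueError(f"controller not available: {name!r}")
        ops[name] = sign  # a later operation on the same controller wins
    result = set(enabled)
    for name, sign in ops.items():
        if sign == "+":
            result.add(name)
        else:
            result.discard(name)
    return result


print(sorted(apply_subtree_control(set(), ["cpu", "io", "memory"], "+cpu +memory -io")))
# ['cpu', 'memory']
```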
				374
				375	Enabling a controller in a cgroup indicates that the distribution of
				376	the target resource across its immediate children will be controlled.
				377	Consider the following sub-hierarchy. The enabled controllers are
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	378	listed in parentheses::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	379
  A(cpu,memory) - B(memory) - C()
                            \ D()
				382
As A has "cpu" and "memory" enabled, A will control the distribution
of CPU cycles and memory to its children, in this case, B. As B has
"memory" enabled but not "cpu", C and D will compete freely on CPU
cycles but their division of memory available to B will be controlled.
				387
As a controller regulates the distribution of the target resource to
the cgroup's children, enabling it creates the controller's interface
files in the child cgroups. In the above example, enabling "cpu" on B
would create the "cpu." prefixed controller interface files in C and
D. Likewise, disabling "memory" from B would remove the "memory."
prefixed controller interface files from C and D. This means that the
controller interface files - anything which doesn't start with
"cgroup." - are owned by the parent rather than the cgroup itself.
				396
				397
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	398	Top-down Constraint
				399	~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	400
				401	Resources are distributed top-down and a cgroup can further distribute
				402	a resource only if the resource has been distributed to it from the
				403	parent. This means that all non-root "cgroup.subtree_control" files
				404	can only contain controllers which are enabled in the parent's
				405	"cgroup.subtree_control" file. A controller can be enabled only if
				406	the parent has the controller enabled and a controller can't be
				407	disabled if one or more children have it enabled.
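
The two rules in the last sentence can be expressed as simple
predicates. This is an illustrative model only; the function names are
hypothetical and each set stands for the contents of one cgroup's
"cgroup.subtree_control" file.

```python
from typing import Iterable, Set


def can_enable(parent_subtree_control: Set[str], controller: str) -> bool:
    """A non-root cgroup may enable only what its parent has enabled."""
    return controller in parent_subtree_control


def can_disable(children_subtree_control: Iterable[Set[str]], controller: str) -> bool:
    """A controller can't be disabled while any child still has it enabled."""
    return all(controller not in child for child in children_subtree_control)


# Hypothetical hierarchy: A has "cpu" and "memory" enabled, its child B "memory".
a_control = {"cpu", "memory"}
b_control = {"memory"}

print(can_enable(a_control, "cpu"))       # True: B may enable "cpu"
print(can_enable(a_control, "io"))        # False: A does not have "io" enabled
print(can_disable([b_control], "memory")) # False: B still has "memory" enabled
```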
				408
				409
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	410	No Internal Process Constraint
				411	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	412
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	413	Non-root cgroups can distribute domain resources to their children
				414	only when they don't have any processes of their own. In other words,
				415	only domain cgroups which don't contain any processes can have domain
				416	controllers enabled in their "cgroup.subtree_control" files.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	417
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	418	This guarantees that, when a domain controller is looking at the part
				419	of the hierarchy which has it enabled, processes are always only on
				420	the leaves. This rules out situations where child cgroups compete
				421	against internal processes of the parent.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	422
				423	The root cgroup is exempt from this restriction. Root contains
				424	processes and anonymous resource consumption which can't be associated
				425	with any other cgroups and requires special treatment from most
				426	controllers. How resource consumption in the root cgroup is governed
Maciej S. Szmigiero	c4e0842	2018-01-10 23:33:19 +0100	[diff] [blame]	427	is up to each controller (for more information on this topic please
				428	refer to the Non-normative information section in the Controllers
				429	chapter).
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	430
				431	Note that the restriction doesn't get in the way if there is no
				432	enabled controller in the cgroup's "cgroup.subtree_control". This is
				433	important as otherwise it wouldn't be possible to create children of a
				434	populated cgroup. To control resource distribution of a cgroup, the
				435	cgroup must create children and transfer all its processes to the
				436	children before enabling controllers in its "cgroup.subtree_control"
				437	file.
				438
				439
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	440	Delegation
				441	----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	442
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	443	Model of Delegation
				444	~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	445
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	446	A cgroup can be delegated in two ways. First, to a less privileged
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	447	user by granting write access of the directory and its "cgroup.procs",
				448	"cgroup.threads" and "cgroup.subtree_control" files to the user.
				449	Second, if the "nsdelegate" mount option is set, automatically to a
				450	cgroup namespace on namespace creation.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	451
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	452	Because the resource control interface files in a given directory
				453	control the distribution of the parent's resources, the delegatee
				454	shouldn't be allowed to write to them. For the first method, this is
				455	achieved by not granting access to these files. For the second, the
				456	kernel rejects writes to all files other than "cgroup.procs" and
				457	"cgroup.subtree_control" on a namespace root from inside the
				458	namespace.
				459
The end results are equivalent for both delegation types. Once
delegated, the user can build a sub-hierarchy under the directory,
organize processes inside it as it sees fit and further distribute the
resources it received from the parent. The limits and other settings
of all resource controllers are hierarchical and, regardless of what
happens in the delegated sub-hierarchy, nothing can escape the
resource restrictions imposed by the parent.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	467
				468	Currently, cgroup doesn't impose any restrictions on the number of
				469	cgroups in or nesting depth of a delegated sub-hierarchy; however,
				470	this may be limited explicitly in the future.
				471
				472
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	473	Delegation Containment
				474	~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	475
				476	A delegated sub-hierarchy is contained in the sense that processes
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	477	can't be moved into or out of the sub-hierarchy by the delegatee.
				478
				479	For delegations to a less privileged user, this is achieved by
				480	requiring the following conditions for a process with a non-root euid
				481	to migrate a target process into a cgroup by writing its PID to the
				482	"cgroup.procs" file.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	483
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	484	- The writer must have write access to the "cgroup.procs" file.
				485
				486	- The writer must have write access to the "cgroup.procs" file of the
				487	common ancestor of the source and destination cgroups.
				488
Tejun Heo	576dd46	2017-01-20 11:29:54 -0500	[diff] [blame]	489	The above two constraints ensure that while a delegatee may migrate
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	490	processes around freely in the delegated sub-hierarchy it can't pull
				491	in from or push out to outside the sub-hierarchy.
				492
For example, let's assume cgroups C0 and C1 have been delegated to
user U0 who created C00, C01 under C0 and C10 under C1 as follows and
all processes under C0 and C1 belong to U0::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	496
  ~~~~~~~~~~~~~ - C0 - C00
  ~ cgroup    ~      \ C01
  ~ hierarchy ~
  ~~~~~~~~~~~~~ - C1 - C10
				501
Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs". U0 has write access to the
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation and U0 would
not have write access to its "cgroup.procs" file and thus the write
will be denied with -EACCES.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	508
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	509	For delegations to namespaces, containment is achieved by requiring
				510	that both the source and destination cgroups are reachable from the
				511	namespace of the process which is attempting the migration. If either
				512	is not reachable, the migration is rejected with -ENOENT.
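
The user-delegation check from the C0/C1 example can be sketched as
follows. The model makes several assumptions: cgroups are plain path
strings, C0 and C1 are placed directly under the root purely for
illustration, the set lists the cgroups whose "cgroup.procs" the writer
may write, and both failure cases are reported as -EACCES. The name
check_migration is hypothetical.

```python
import errno
import posixpath
from typing import Set


def check_migration(writable_procs: Set[str], src: str, dst: str) -> int:
    """Return 0 if the migration is allowed, else a negative errno."""
    # The writer needs write access to the destination "cgroup.procs".
    if dst not in writable_procs:
        return -errno.EACCES
    # The writer also needs write access to "cgroup.procs" of the
    # common ancestor of the source and destination cgroups.
    ancestor = posixpath.commonpath([src, dst])
    if ancestor not in writable_procs:
        return -errno.EACCES
    return 0


# U0 was delegated C0 and C1 and created C00, C01 and C10 beneath them.
u0_writable = {"/C0", "/C0/C00", "/C0/C01", "/C1", "/C1/C10"}

print(check_migration(u0_writable, "/C0/C01", "/C0/C00"))  # 0
print(check_migration(u0_writable, "/C1/C10", "/C0/C00") == -errno.EACCES)  # True
```

The second call fails because the common ancestor of C10 and C00 is the
root, which lies above both points of delegation.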
				513
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	514
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	515	Guidelines
				516	----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	517
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	518	Organize Once and Control
				519	~~~~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	520
				521	Migrating a process across cgroups is a relatively expensive operation
				522	and stateful resources such as memory are not moved together with the
				523	process. This is an explicit design decision as there often exist
				524	inherent trade-offs between migration and various hot paths in terms
				525	of synchronization cost.
				526
				527	As such, migrating processes across cgroups frequently as a means to
				528	apply different resource restrictions is discouraged. A workload
				529	should be assigned to a cgroup according to the system's logical and
				530	resource structure once on start-up. Dynamic adjustments to resource
				531	distribution can be made by changing controller configuration through
				532	the interface files.
				533
				534
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	535	Avoid Name Collisions
				536	~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	537
				538	Interface files for a cgroup and its children cgroups occupy the same
				539	directory and it is possible to create children cgroups which collide
				540	with interface files.
				541
All cgroup core interface files are prefixed with "cgroup." and each
controller's interface files are prefixed with the controller name and
a dot. A controller's name is composed of lowercase letters and
'_'s but never begins with an '_' so it can be used as the prefix
character for collision avoidance. Also, interface file names won't
start or end with terms which are often used in categorizing workloads
such as job, service, slice, unit or workload.
				549
				550	cgroup doesn't do anything to prevent name collisions and it's the
				551	user's responsibility to avoid them.
				552
				553
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	554	Resource Distribution Models
				555	============================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	556
				557	cgroup controllers implement several resource distribution schemes
				558	depending on the resource type and expected use cases. This section
				559	describes major schemes in use along with their expected behaviors.
				560
				561
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	562	Weights
				563	-------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	564
				565	A parent's resource is distributed by adding up the weights of all
				566	active children and giving each the fraction matching the ratio of its
				567	weight against the sum. As only children which can make use of the
				568	resource at the moment participate in the distribution, this is
				569	work-conserving. Due to the dynamic nature, this model is usually
				570	used for stateless resources.
				571
				572	All weights are in the range [1, 10000] with the default at 100. This
				573	allows symmetric multiplicative biases in both directions at fine
				574	enough granularity while staying in the intuitive range.
				575
				576	As long as the weight is in range, all configuration combinations are
				577	valid and there is no reason to reject configuration changes or
				578	process migrations.
				579
				580	"cpu.weight" proportionally distributes CPU cycles to active children
				581	and is an example of this type.
				582
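The weight model above can be sketched in a few lines. This is a minimal illustration, not the scheduler's implementation; the function name ``distribute_by_weight`` and the child names are hypothetical.

```python
def distribute_by_weight(total, weights, active):
    """Split `total` among the active children in proportion to their weights.

    `weights` maps child name -> configured weight, `active` lists the
    children which can use the resource right now (hypothetical helper).
    """
    for name, weight in weights.items():
        if not 1 <= weight <= 10000:
            raise ValueError(f"weight of {name} is outside [1, 10000]")
    # Only active children take part, which makes the model work-conserving.
    active_sum = sum(weights[name] for name in active)
    return {name: total * weights[name] / active_sum for name in active}


weights = {"a": 100, "b": 200, "c": 100}
# All children active: "b" has half of the total weight.
print(distribute_by_weight(100.0, weights, ["a", "b", "c"]))
# "c" goes idle: its share is split between "a" and "b".
print(distribute_by_weight(90.0, weights, ["a", "b"]))
```

Because idle children drop out of the sum, no part of the parent's resource is left unused while any child can consume it.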
				583
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	584	Limits
				585	------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	586
				587	A child can only consume up to the configured amount of the resource.
				588	Limits can be over-committed - the sum of the limits of children can
				589	exceed the amount of resource available to the parent.
				590
				591	Limits are in the range [0, max] and default to "max", which is a no-op.
				592
				593	As limits can be over-committed, all configuration combinations are
				594	valid and there is no reason to reject configuration changes or
				595	process migrations.
				596
				597	"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
				598	on an IO device and is an example of this type.
				599
				600
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	601	Protections
				602	-----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	603
				604	A cgroup is protected to be allocated up to the configured amount of
				605	the resource if the usages of all its ancestors are under their
				606	protected levels. Protections can be hard guarantees or best effort
				607	soft boundaries. Protections can also be over-committed in which case
				608	only up to the amount available to the parent is protected among
				609	children.
				610
				611	Protections are in the range [0, max] and default to 0, which is a
				612	no-op.
				613
				614	As protections can be over-committed, all configuration combinations
				615	are valid and there is no reason to reject configuration changes or
				616	process migrations.
				617
				618	"memory.low" implements best-effort memory protection and is an
				619	example of this type.
				620
				621
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	622	Allocations
				623	-----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	624
				625	A cgroup is exclusively allocated a certain amount of a finite
				626	resource. Allocations can't be over-committed - the sum of the
				627	allocations of children cannot exceed the amount of resource
				628	available to the parent.
				629
				630	Allocations are in the range [0, max] and default to 0, which is no
				631	resource.
				632
				633	As allocations can't be over-committed, some configuration
				634	combinations are invalid and should be rejected. Also, if the
				635	resource is mandatory for execution of processes, process migrations
				636	may be rejected.
				637
				638	"cpu.rt.max" hard-allocates realtime slices and is an example of this
				639	type.
				640
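Unlike limits and protections, allocations need a validity check at configuration time. The following is a minimal sketch of such a check, assuming a hypothetical helper ``check_allocations``; it is not taken from any controller.

```python
def check_allocations(parent_amount, child_allocations):
    """Return the unallocated remainder, or raise if over-committed.

    `child_allocations` maps child name -> requested amount (hypothetical
    helper; a real controller would reject the offending write instead).
    """
    total = sum(child_allocations.values())
    if total > parent_amount:
        raise ValueError(
            f"children request {total} but the parent only has {parent_amount}"
        )
    return parent_amount - total


print(check_allocations(100, {"a": 40, "b": 50}))
```

The same combination would be accepted without any check under the limit or protection models, as both can be over-committed.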
				641
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	642	Interface Files
				643	===============
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	644
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	645	Format
				646	------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	647
				648	All interface files should be in one of the following formats whenever
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	649	possible::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	650
				651	New-line separated values
				652	(when only one value can be written at once)
				653
				654	VAL0\n
				655	VAL1\n
				656	...
				657
				658	Space separated values
				659	(when read-only or multiple values can be written at once)
				660
				661	VAL0 VAL1 ...\n
				662
				663	Flat keyed
				664
				665	KEY0 VAL0\n
				666	KEY1 VAL1\n
				667	...
				668
				669	Nested keyed
				670
				671	KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
				672	KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
				673	...
				674
				675	For a writable file, the format for writing should generally match
				676	reading; however, controllers may allow omitting later fields or
				677	implement restricted shortcuts for most common use cases.
				678
				679	For both flat and nested keyed files, only the values for a single key
				680	can be written at a time. For nested keyed files, the sub key pairs
				681	may be specified in any order and not all pairs have to be specified.
				682
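Readers of the keyed formats can be small. The sketch below parses flat keyed and nested keyed content which has already been read into a string; the function names and the sample data are illustrative assumptions.

```python
def parse_flat_keyed(text):
    """Parse "KEY VAL" lines into a dict mapping key -> value string."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            key, value = line.split(None, 1)
            result[key] = value
    return result


def parse_nested_keyed(text):
    """Parse "KEY SUB_KEY=VAL ..." lines into a dict of dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            # Sub key pairs may come in any order, so keep them in a dict.
            result[fields[0]] = dict(f.split("=", 1) for f in fields[1:])
    return result


# Sample content made up for illustration.
print(parse_flat_keyed("populated 1\n"))
print(parse_nested_keyed("8:0 rbps=1048576 wbps=max\n8:16 wbps=max rbps=max\n"))
```

Values are kept as strings because a field may hold either a number or a special token such as "max".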
				683
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	684	Conventions
				685	-----------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	686
				687	- Settings for a single feature should be contained in a single file.
				688
				689	- The root cgroup should be exempt from resource control and thus
				690	shouldn't have resource control interface files. Also,
				691	informational files on the root cgroup which end up showing global
				692	information available elsewhere shouldn't exist.
				693
				694	- If a controller implements weight based resource distribution, its
				695	interface file should be named "weight" and have the range [1,
				696	10000] with 100 as the default. The values are chosen to allow
				697	enough and symmetric bias in both directions while keeping it
				698	intuitive (the default is 100%).
				699
				700	- If a controller implements an absolute resource guarantee and/or
				701	limit, the interface files should be named "min" and "max"
				702	respectively. If a controller implements best effort resource
				703	guarantee and/or limit, the interface files should be named "low"
				704	and "high" respectively.
				705
				706	In the above four control files, the special token "max" should be
				707	used to represent upward infinity for both reading and writing.
				708
				709	- If a setting has a configurable default value and keyed specific
				710	overrides, the default entry should be keyed with "default" and
				711	appear as the first entry in the file.
				712
				713	The default value can be updated by writing either "default $VAL" or
				714	"$VAL".
				715
				716	When writing to update a specific override, "default" can be used as
				717	the value to indicate removal of the override. Override entries
				718	with "default" as the value must not appear when read.
				719
				720	For example, a setting which is keyed by major:minor device numbers
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	721	with integer values may look like the following::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	722
				723	# cat cgroup-example-interface-file
				724	default 150
				725	8:0 300
				726
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	727	The default value can be updated by::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	728
				729	# echo 125 > cgroup-example-interface-file
				730
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	731	or::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	732
				733	# echo "default 125" > cgroup-example-interface-file
				734
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	735	An override can be set by::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	736
				737	# echo "8:16 170" > cgroup-example-interface-file
				738
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	739	and cleared by::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	740
				741	# echo "8:0 default" > cgroup-example-interface-file
				742	# cat cgroup-example-interface-file
				743	default 125
				744	8:16 170
				745
				746	- For events which are not very high frequency, an interface file
				747	"events" should be created which lists event key value pairs.
				748	Whenever a notifiable event happens, a file modified event should be
				749	generated on the file.
				750
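The default and override convention described above can be modelled on a plain dictionary. This is a sketch under the assumption of integer values; ``apply_write`` and ``render`` are hypothetical names and no real interface file is touched.

```python
def apply_write(entries, data):
    """Apply one write to a dict modelling a default-plus-overrides file."""
    fields = data.split()
    if len(fields) == 1:
        # A bare "$VAL" updates the default.
        entries["default"] = int(fields[0])
    elif len(fields) == 2:
        key, value = fields
        if value == "default":
            # "KEY default" removes the override for KEY.
            entries.pop(key, None)
        else:
            # Covers both "default $VAL" and "KEY $VAL".
            entries[key] = int(value)
    else:
        raise ValueError(f"cannot parse {data!r}")
    return entries


def render(entries):
    """Render the file content with the "default" entry first."""
    lines = [f"default {entries['default']}"]
    lines += [f"{k} {v}" for k, v in entries.items() if k != "default"]
    return "\n".join(lines) + "\n"


state = {"default": 150, "8:0": 300}
for data in ("125", "8:16 170", "8:0 default"):
    apply_write(state, data)
print(render(state), end="")
```

The three writes mirror the shell example above: update the default, set an override for 8:16, and clear the override for 8:0.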
				751
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	752	Core Interface Files
				753	--------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	754
				755	All cgroup core files are prefixed with "cgroup."
				756
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	757	cgroup.type
				758
				759	A read-write single value file which exists on non-root
				760	cgroups.
				761
				762	When read, it indicates the current type of the cgroup, which
				763	can be one of the following values.
				764
				765	- "domain" : A normal valid domain cgroup.
				766
				767	- "domain threaded" : A threaded domain cgroup which is
				768	serving as the root of a threaded subtree.
				769
				770	- "domain invalid" : A cgroup which is in an invalid state.
				771	It can't be populated or have controllers enabled. It may
				772	be allowed to become a threaded cgroup.
				773
				774	- "threaded" : A threaded cgroup which is a member of a
				775	threaded subtree.
				776
				777	A cgroup can be turned into a threaded cgroup by writing
				778	"threaded" to this file.
				779
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	780	cgroup.procs
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	781	A read-write new-line separated values file which exists on
				782	all cgroups.
				783
				784	When read, it lists the PIDs of all processes which belong to
				785	the cgroup one-per-line. The PIDs are not ordered and the
				786	same PID may show up more than once if the process got moved
				787	to another cgroup and then back or the PID got recycled while
				788	reading.
				789
				790	A PID can be written to migrate the process associated with
				791	the PID to the cgroup. The writer should match all of the
				792	following conditions.
				793
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	794	- It must have write access to the "cgroup.procs" file.
				795
				796	- It must have write access to the "cgroup.procs" file of the
				797	common ancestor of the source and destination cgroups.
				798
				799	When delegating a sub-hierarchy, write access to this file
				800	should be granted along with the containing directory.
				801
Tejun Heo	8cfd814	2017-07-21 11:14:51 -0400	[diff] [blame]	802	In a threaded cgroup, reading this file fails with EOPNOTSUPP
				803	as all the processes belong to the thread root. Writing is
				804	supported and moves every thread of the process to the cgroup.
				805
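The two write-access conditions for "cgroup.procs" can be worked out from the cgroup paths alone. The sketch below lists the files a writer needs write access to; the helper name and the mount point /sys/fs/cgroup are assumptions made for illustration.

```python
import posixpath


def files_needing_write_access(src_cgroup, dst_cgroup):
    """List the "cgroup.procs" files the migrating writer must be able to write.

    Both arguments are absolute cgroup directory paths (hypothetical helper).
    """
    ancestor = posixpath.commonpath([src_cgroup, dst_cgroup])
    return [
        posixpath.join(dst_cgroup, "cgroup.procs"),
        posixpath.join(ancestor, "cgroup.procs"),
    ]


print(files_needing_write_access("/sys/fs/cgroup/a/b", "/sys/fs/cgroup/a/c"))
```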
				806	cgroup.threads
				807	A read-write new-line separated values file which exists on
				808	all cgroups.
				809
				810	When read, it lists the TIDs of all threads which belong to
				811	the cgroup one-per-line. The TIDs are not ordered and the
				812	same TID may show up more than once if the thread got moved to
				813	another cgroup and then back or the TID got recycled while
				814	reading.
				815
				816	A TID can be written to migrate the thread associated with the
				817	TID to the cgroup. The writer should match all of the
				818	following conditions.
				819
				820	- It must have write access to the "cgroup.threads" file.
				821
				822	- The cgroup that the thread is currently in must be in the
				823	same resource domain as the destination cgroup.
				824
				825	- It must have write access to the "cgroup.procs" file of the
				826	common ancestor of the source and destination cgroups.
				827
				828	When delegating a sub-hierarchy, write access to this file
				829	should be granted along with the containing directory.
				830
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	831	cgroup.controllers
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	832	A read-only space separated values file which exists on all
				833	cgroups.
				834
				835	It shows a space separated list of all controllers available to
				836	the cgroup. The controllers are not ordered.
				837
				838	cgroup.subtree_control
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	839	A read-write space separated values file which exists on all
				840	cgroups. Starts out empty.
				841
				842	When read, it shows a space separated list of the controllers
				843	which are enabled to control resource distribution from the
				844	cgroup to its children.
				845
				846	A space separated list of controllers prefixed with '+' or '-'
				847	can be written to enable or disable controllers. A controller
				848	name prefixed with '+' enables the controller and '-'
				849	disables. If a controller appears more than once on the list,
				850	the last one is effective. When multiple enable and disable
				851	operations are specified, either all succeed or all fail.
				852
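The write semantics of "cgroup.subtree_control" (the last occurrence wins, all succeed or all fail) can be modelled as follows. This is a sketch with a hypothetical function name; the kernel can reject a write for further reasons which are not modelled here.

```python
def apply_subtree_control(enabled, available, data):
    """Return the new set of enabled controllers after writing `data`.

    Raises ValueError without changing anything if any token is invalid.
    """
    requested = {}
    for token in data.split():
        op, name = token[0], token[1:]
        if op not in "+-" or name not in available:
            raise ValueError(f"invalid token {token!r}")
        # A later token for the same controller replaces an earlier one.
        requested[name] = op
    result = set(enabled)
    for name, op in requested.items():
        if op == "+":
            result.add(name)
        else:
            result.discard(name)
    return result


available = {"cpu", "io", "memory"}
print(sorted(apply_subtree_control({"memory"}, available, "+cpu +io -cpu")))
```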
				853	cgroup.events
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	854	A read-only flat-keyed file which exists on non-root cgroups.
				855	The following entries are defined. Unless specified
				856	otherwise, a value change in this file generates a file
				857	modified event.
				858
				859	populated
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	860	1 if the cgroup or its descendants contain any live
				861	processes; otherwise, 0.
				862
Roman Gushchin	1a926e0	2017-07-28 18:28:44 +0100	[diff] [blame]	863	cgroup.max.descendants
				864	A read-write single value file. The default is "max".
				865
				866	Maximum allowed number of descendant cgroups.
				867	If the actual number of descendants is equal or larger,
				868	an attempt to create a new cgroup in the hierarchy will fail.
				869
				870	cgroup.max.depth
				871	A read-write single value file. The default is "max".
				872
				873	Maximum allowed descent depth below the current cgroup.
				874	If the actual descent depth is equal or larger,
				875	an attempt to create a new child cgroup will fail.
				876
Roman Gushchin	ec39225	2017-08-02 17:55:31 +0100	[diff] [blame]	877	cgroup.stat
				878	A read-only flat-keyed file with the following entries:
				879
				880	nr_descendants
				881	Total number of visible descendant cgroups.
				882
				883	nr_dying_descendants
				884	Total number of dying descendant cgroups. A cgroup becomes
				885	dying after being deleted by a user. The cgroup will remain
				886	in dying state for some undefined time (which can depend
				887	on system load) before being completely destroyed.
				888
				889	A process can't enter a dying cgroup under any circumstances, and
				890	a dying cgroup can't revive.
				891
				892	A dying cgroup can consume system resources not exceeding
				893	the limits which were active at the moment of cgroup deletion.
				894
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	895
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	896	Controllers
				897	===========
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	898
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	899	CPU
				900	---
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	901
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	902	The "cpu" controller regulates distribution of CPU cycles. This
				903	controller implements weight and absolute bandwidth limit models for
				904	normal scheduling policy and absolute bandwidth allocation model for
				905	realtime scheduling policy.
				906
Tejun Heo	c2f31b7	2017-12-05 09:10:17 -0800	[diff] [blame]	907	WARNING: cgroup2 doesn't yet support control of realtime processes and
				908	the cpu controller can only be enabled when all RT processes are in
				909	the root cgroup. Be aware that system management software may already
				910	have placed RT processes into nonroot cgroups during the system boot
				911	process, and these processes may need to be moved to the root cgroup
				912	before the cpu controller can be enabled.
				913
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	914
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	915	CPU Interface Files
				916	~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	917
				918	All time durations are in microseconds.
				919
				920	cpu.stat
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	921	A read-only flat-keyed file which exists on non-root cgroups.
Tejun Heo	d41bf8c	2017-10-23 16:18:27 -0700	[diff] [blame]	922	This file exists whether the controller is enabled or not.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	923
Tejun Heo	d41bf8c	2017-10-23 16:18:27 -0700	[diff] [blame]	924	It always reports the following three stats:
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	925
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	926	- usage_usec
				927	- user_usec
				928	- system_usec
Tejun Heo	d41bf8c	2017-10-23 16:18:27 -0700	[diff] [blame]	929
				930	and the following three when the controller is enabled:
				931
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	932	- nr_periods
				933	- nr_throttled
				934	- throttled_usec
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	935
				936	cpu.weight
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	937	A read-write single value file which exists on non-root
				938	cgroups. The default is "100".
				939
				940	The weight in the range [1, 10000].
				941
Tejun Heo	0d59363	2017-09-25 09:00:19 -0700	[diff] [blame]	942	cpu.weight.nice
				943	A read-write single value file which exists on non-root
				944	cgroups. The default is "0".
				945
				946	The nice value is in the range [-20, 19].
				947
				948	This interface file is an alternative interface for
				949	"cpu.weight" and allows reading and setting weight using the
				950	same values used by nice(2). Because the range is smaller and
				951	granularity is coarser for the nice values, the read value is
				952	the closest approximation of the current weight.
				953
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	954	cpu.max
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	955	A read-write two value file which exists on non-root cgroups.
				956	The default is "max 100000".
				957
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	958	The maximum bandwidth limit. It's in the following format::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	959
				960	$MAX $PERIOD
				961
				962	which indicates that the group may consume up to $MAX in each
				963	$PERIOD duration. "max" for $MAX indicates no limit. If only
				964	one number is written, $MAX is updated.
				965
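The "$MAX $PERIOD" format and the single-number shortcut can be illustrated with a small model. The function names are hypothetical, and the settings are plain tuples rather than a real "cpu.max" file.

```python
def write_cpu_max(current, data):
    """Return the (max, period) pair which results from writing `data`.

    `current` is the existing (max, period) pair; max is an int in
    microseconds or the string "max".
    """
    _, cur_period = current
    fields = data.split()
    if len(fields) == 1:
        # Only one value written: $MAX is updated, $PERIOD is kept.
        new_max, new_period = fields[0], cur_period
    elif len(fields) == 2:
        new_max, new_period = fields[0], int(fields[1])
    else:
        raise ValueError(f"cannot parse {data!r}")
    if new_max != "max":
        new_max = int(new_max)
    return new_max, new_period


def cpu_fraction(setting):
    """Allowed CPU time as a multiple of one CPU, or None if unlimited."""
    limit, period = setting
    return None if limit == "max" else limit / period


setting = write_cpu_max(("max", 100000), "50000")
print(setting, cpu_fraction(setting))
```

Writing "50000" to the default setting keeps the 100000 microsecond period and allows the group half of one CPU in each period.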
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	966
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	967	Memory
				968	------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	969
				970	The "memory" controller regulates distribution of memory. Memory is
				971	stateful and implements both limit and protection models. Due to the
				972	intertwining between memory usage and reclaim pressure and the
				973	stateful nature of memory, the distribution model is relatively
				974	complex.
				975
				976	While not completely water-tight, all major memory usages by a given
				977	cgroup are tracked so that the total memory consumption can be
				978	accounted and controlled to a reasonable extent. Currently, the
				979	following types of memory usages are tracked.
				980
				981	- Userland memory - page cache and anonymous memory.
				982
				983	- Kernel data structures such as dentries and inodes.
				984
				985	- TCP socket buffers.
				986
				987	The above list may expand in the future for better coverage.
				988
				989
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	990	Memory Interface Files
				991	~~~~~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	992
				993	All memory amounts are in bytes. If a value which is not aligned to
				994	PAGE_SIZE is written, the value may be rounded up to the closest
				995	PAGE_SIZE multiple when read back.
				996
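The rounding can be reproduced with integer arithmetic. The sketch below assumes a PAGE_SIZE of 4096 bytes, which is common but not universal; the helper name is hypothetical.

```python
def round_up_to_page(value, page_size=4096):
    """Round `value` up to the closest multiple of `page_size`."""
    # Negating twice turns floor division into ceiling division.
    return -(-value // page_size) * page_size


print(round_up_to_page(1000000))
```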
				997	memory.current
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	998	A read-only single value file which exists on non-root
				999	cgroups.
				1000
				1001	The total amount of memory currently being used by the cgroup
				1002	and its descendants.
				1003
Roman Gushchin	bf8d5d5	2018-06-07 17:07:46 -0700	[diff] [blame]	1004	memory.min
				1005	A read-write single value file which exists on non-root
				1006	cgroups. The default is "0".
				1007
				1008	Hard memory protection. If the memory usage of a cgroup
				1009	is within its effective min boundary, the cgroup's memory
				1010	won't be reclaimed under any conditions. If there is no
				1011	unprotected reclaimable memory available, OOM killer
				1012	is invoked.
				1013
				1014	Effective min boundary is limited by memory.min values of
				1015	all ancestor cgroups. If there is memory.min overcommitment
				1016	(child cgroup or cgroups are requiring more protected memory
				1017	than parent will allow), then each child cgroup will get
				1018	the part of parent's protection proportional to its
				1019	actual memory usage below memory.min.
				1020
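The overcommitment rule above can be sketched as follows. This follows the wording of the paragraph rather than the kernel's exact code, and the function name, the integer rounding, and the sample numbers are assumptions.

```python
def effective_min(parent_effective, children):
    """Return each child's effective memory.min.

    `children` maps child name -> (configured memory.min, current usage),
    all in bytes (hypothetical helper).
    """
    configured_total = sum(cfg for cfg, _ in children.values())
    if configured_total <= parent_effective:
        # No overcommitment: every child keeps its configured protection.
        return {name: cfg for name, (cfg, _) in children.items()}
    # Only usage below the configured memory.min counts as protected.
    protected = {name: min(cfg, usage) for name, (cfg, usage) in children.items()}
    protected_total = sum(protected.values())
    if protected_total == 0:
        return {name: 0 for name in children}
    return {
        name: min(children[name][0],
                  parent_effective * protected[name] // protected_total)
        for name in children
    }


# Parent protects 100 bytes, children ask for 160 in total.
print(effective_min(100, {"a": (80, 60), "b": (80, 20)}))
```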
				1021	Putting more memory than generally available under this
				1022	protection is discouraged and may lead to constant OOMs.
				1023
				1024	If a memory cgroup is not populated with processes,
				1025	its memory.min is ignored.
				1026
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1027	memory.low
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1028	A read-write single value file which exists on non-root
				1029	cgroups. The default is "0".
				1030
Roman Gushchin	7854207	2018-06-07 17:06:29 -0700	[diff] [blame]	1031	Best-effort memory protection. If the memory usage of a
				1032	cgroup is within its effective low boundary, the cgroup's
				1033	memory won't be reclaimed unless memory can be reclaimed
				1034	from unprotected cgroups.
				1035
				1036	Effective low boundary is limited by memory.low values of
				1037	all ancestor cgroups. If there is memory.low overcommitment
Roman Gushchin	bf8d5d5	2018-06-07 17:07:46 -0700	[diff] [blame]	1038	(child cgroup or cgroups are requiring more protected memory
Roman Gushchin	7854207	2018-06-07 17:06:29 -0700	[diff] [blame]	1039	than parent will allow), then each child cgroup will get
Roman Gushchin	bf8d5d5	2018-06-07 17:07:46 -0700	[diff] [blame]	1040	the part of parent's protection proportional to its
Roman Gushchin	7854207	2018-06-07 17:06:29 -0700	[diff] [blame]	1041	actual memory usage below memory.low.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1042
				1043	Putting more memory than generally available under this
				1044	protection is discouraged.
				1045
				1046	memory.high
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1047	A read-write single value file which exists on non-root
				1048	cgroups. The default is "max".
				1049
				1050	Memory usage throttle limit. This is the main mechanism to
				1051	control memory usage of a cgroup. If a cgroup's usage goes
				1052	over the high boundary, the processes of the cgroup are
				1053	throttled and put under heavy reclaim pressure.
				1054
				1055	Going over the high limit never invokes the OOM killer and
				1056	under extreme conditions the limit may be breached.
				1057
				1058	memory.max
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1059	A read-write single value file which exists on non-root
				1060	cgroups. The default is "max".
				1061
				1062	Memory usage hard limit. This is the final protection
				1063	mechanism. If a cgroup's memory usage reaches this limit and
				1064	can't be reduced, the OOM killer is invoked in the cgroup.
				1065	Under certain circumstances, the usage may go over the limit
				1066	temporarily.
				1067
				1068	This is the ultimate protection mechanism. As long as the
				1069	high limit is used and monitored properly, this limit's
				1070	utility is limited to providing the final safety net.
				1071
				1072	memory.events
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1073	A read-only flat-keyed file which exists on non-root cgroups.
				1074	The following entries are defined. Unless specified
				1075	otherwise, a value change in this file generates a file
				1076	modified event.
				1077
				1078	low
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1079	The number of times the cgroup is reclaimed due to
				1080	high memory pressure even though its usage is under
				1081	the low boundary. This usually indicates that the low
				1082	boundary is over-committed.
				1083
				1084	high
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1085	The number of times processes of the cgroup are
				1086	throttled and routed to perform direct memory reclaim
				1087	because the high memory boundary was exceeded. For a
				1088	cgroup whose memory usage is capped by the high limit
				1089	rather than global memory pressure, this event's
				1090	occurrences are expected.
				1091
				1092	max
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1093	The number of times the cgroup's memory usage was
				1094	about to go over the max boundary. If direct reclaim
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1095	fails to bring it down, the cgroup goes to OOM state.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1096
				1097	oom
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1098	The number of times the cgroup's memory usage
				1099	reached the limit and allocation was about to fail.
				1100
				1101	Depending on context, the result could be invocation of the OOM
Vladimir Rutsky	2877cbe	2018-01-02 17:27:41 +0100	[diff] [blame]	1102	killer and retrying the allocation, or failing the allocation.
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1103
				1104	A failed allocation could in turn be returned to
Vladimir Rutsky	2877cbe	2018-01-02 17:27:41 +0100	[diff] [blame]	1105	userspace as -ENOMEM or silently ignored in cases like
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1106	disk readahead. For now, OOM in a memory cgroup kills
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1107	tasks if and only if the shortage happened inside a page fault.
				1108
				1109	oom_kill
Konstantin Khlebnikov	8e675f7	2017-07-06 15:40:28 -0700	[diff] [blame]	1110	The number of processes belonging to this cgroup
				1111	killed by any kind of OOM killer.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1112
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1113	memory.stat
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1114	A read-only flat-keyed file which exists on non-root cgroups.
				1115
				1116	This breaks down the cgroup's memory footprint into different
				1117	types of memory, type-specific details, and other information
				1118	on the state and past events of the memory management system.
				1119
				1120	All memory amounts are in bytes.
				1121
				1122	The entries are ordered to be human readable, and new entries
				1123	can show up in the middle. Don't rely on items remaining in a
				1124	fixed position; use the keys to look up specific values!
				1125
				1126	anon
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1127	Amount of memory used in anonymous mappings such as
				1128	brk(), sbrk(), and mmap(MAP_ANONYMOUS)
				1129
				1130	file
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1131	Amount of memory used to cache filesystem data,
				1132	including tmpfs and shared memory.
				1133
Vladimir Davydov	12580e4	2016-03-17 14:17:38 -0700	[diff] [blame]	1134	kernel_stack
Vladimir Davydov	12580e4	2016-03-17 14:17:38 -0700	[diff] [blame]	1135	Amount of memory allocated to kernel stacks.
				1136
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1137	slab
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1138	Amount of memory used for storing in-kernel data
				1139	structures.
				1140
Johannes Weiner	4758e19	2016-02-02 16:57:41 -0800	[diff] [blame]	1141	sock
Johannes Weiner	4758e19	2016-02-02 16:57:41 -0800	[diff] [blame]	1142	Amount of memory used in network transmission buffers
				1143
Johannes Weiner	9a4caf1	2017-05-03 14:52:45 -0700	[diff] [blame]	1144	shmem
Johannes Weiner	9a4caf1	2017-05-03 14:52:45 -0700	[diff] [blame]	1145	Amount of cached filesystem data that is swap-backed,
				1146	such as tmpfs, shm segments, shared anonymous mmap()s
				1147
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1148	file_mapped
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1149	Amount of cached filesystem data mapped with mmap()
				1150
				1151	file_dirty
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1152	Amount of cached filesystem data that was modified but
				1153	not yet written back to disk
				1154
				1155	file_writeback
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1156	Amount of cached filesystem data that was modified and
				1157	is currently being written back to disk
				1158
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1159	inactive_anon, active_anon, inactive_file, active_file, unevictable
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1160	Amount of memory, swap-backed and filesystem-backed,
				1161	on the internal memory management lists used by the
				1162	page reclaim algorithm
				1163
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1164	slab_reclaimable
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1165	Part of "slab" that might be reclaimed, such as
				1166	dentries and inodes.
				1167
				1168	slab_unreclaimable
Vladimir Davydov	27ee57c	2016-03-17 14:17:35 -0700	[diff] [blame]	1169	Part of "slab" that cannot be reclaimed on memory
				1170	pressure.
				1171
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1172	pgfault
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1173	Total number of page faults incurred
				1174
				1175	pgmajfault
Johannes Weiner	587d9f7	2016-01-20 15:03:19 -0800	[diff] [blame]	1176	Number of major page faults incurred
				1177
Roman Gushchin	b340959	2017-05-12 15:47:09 -0700	[diff] [blame]	1178	workingset_refault
				1179
				1180	Number of refaults of previously evicted pages
				1181
				1182	workingset_activate
				1183
				1184	Number of refaulted pages that were immediately activated
				1185
				1186	workingset_nodereclaim
				1187
				1188	Number of times a shadow node has been reclaimed
				1189
Roman Gushchin	2262185	2017-07-06 15:40:25 -0700	[diff] [blame]	1190	pgrefill
				1191
				1192	Amount of scanned pages (in an active LRU list)
				1193
				1194	pgscan
				1195
				1196	Amount of scanned pages (in an inactive LRU list)
				1197
				1198	pgsteal
				1199
				1200	Amount of reclaimed pages
				1201
				1202	pgactivate
				1203
				1204	Amount of pages moved to the active LRU list
				1205
				1206	pgdeactivate
				1207
				1208	Amount of pages moved to the inactive LRU list
				1209
				1210	pglazyfree
				1211
				1212	Amount of pages postponed to be freed under memory pressure
				1213
				1214	pglazyfreed
				1215
				1216	Amount of reclaimed lazyfree pages
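
As a minimal sketch of how the counters above might be consumed, the following Python snippet parses flat-keyed text ("KEY VALUE" per line) and derives the fraction of scanned pages that were reclaimed. The helper names and the sample values are hypothetical, and the pgsteal/pgscan ratio is only one illustrative way to read these counters, not a metric defined by this document.

```python
def parse_flat_keyed(text):
    """Parse flat-keyed cgroup file content ("KEY VALUE" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        stats[key] = int(value)
    return stats


def reclaim_efficiency(stats):
    """Return pgsteal / pgscan, or None if no pages were scanned."""
    scanned = stats.get("pgscan", 0)
    if scanned == 0:
        return None
    return stats.get("pgsteal", 0) / scanned


# Hypothetical sample values, not taken from a real system.
sample = """\
pgfault 1200
pgmajfault 3
pgscan 400
pgsteal 100
"""
stats = parse_flat_keyed(sample)
print(stats["pgmajfault"], reclaim_efficiency(stats))  # 3 0.25
```

On a live system the text would come from reading the cgroup's "memory.stat" file instead of the inline sample.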
				1217
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1218	memory.swap.current
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1219	A read-only single value file which exists on non-root
				1220	cgroups.
				1221
				1222	The total amount of swap currently being used by the cgroup
				1223	and its descendants.
				1224
				1225	memory.swap.max
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1226	A read-write single value file which exists on non-root
				1227	cgroups. The default is "max".
				1228
				1229	Swap usage hard limit. If a cgroup's swap usage reaches this
Vladimir Rutsky	2877cbe	2018-01-02 17:27:41 +0100	[diff] [blame]	1230	limit, anonymous memory of the cgroup will not be swapped out.
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	1231
Tejun Heo	f3a53a3	2018-06-07 17:05:35 -0700	[diff] [blame]	1232	memory.swap.events
				1233	A read-only flat-keyed file which exists on non-root cgroups.
				1234	The following entries are defined. Unless specified
				1235	otherwise, a value change in this file generates a file
				1236	modified event.
				1237
				1238	max
				1239	The number of times the cgroup's swap usage was about
				1240	to go over the max boundary and swap allocation
				1241	failed.
				1242
				1243	fail
				1244	The number of times swap allocation failed either
				1245	because of running out of swap system-wide or max
				1246	limit.
				1247
Tejun Heo	be09102	2018-06-07 17:09:21 -0700	[diff] [blame]	1248	When reduced under the current usage, the existing swap
				1249	entries are reclaimed gradually and the swap usage may stay
				1250	higher than the limit for an extended period of time. This
				1251	reduces the impact on the workload and memory management.
				1252
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1253
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1254	Usage Guidelines
				1255	~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1256
				1257	"memory.high" is the main mechanism to control memory usage.
				1258	Over-committing on high limit (sum of high limits > available memory)
				1259	and letting global memory pressure distribute memory according to
				1260	usage is a viable strategy.
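
The over-commit condition above (sum of high limits > available memory) can be checked with simple arithmetic. The sketch below assumes all high limits are finite byte counts (a limit of "max" would need separate handling); the function name and the figures are made up for illustration.

```python
def high_overcommit_ratio(high_limits, available_bytes):
    """Return sum(high limits) / available memory; above 1.0 means over-committed."""
    if available_bytes <= 0:
        raise ValueError("available_bytes must be positive")
    return sum(high_limits) / available_bytes


GiB = 1 << 30
# Hypothetical memory.high settings of three sibling cgroups on an 8 GiB machine.
limits = [6 * GiB, 6 * GiB, 4 * GiB]
ratio = high_overcommit_ratio(limits, 8 * GiB)
print(ratio)  # 2.0
```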
				1261
				1262	Because breach of the high limit doesn't trigger the OOM killer but
				1263	throttles the offending cgroup, a management agent has ample
				1264	opportunities to monitor and take appropriate actions such as granting
				1265	more memory or terminating the workload.
				1266
				1267	Determining whether a cgroup has enough memory is not trivial as
				1268	memory usage doesn't indicate whether the workload can benefit from
				1269	more memory. For example, a workload which writes data received from
				1270	network to a file can use all available memory but can also perform
				1271	just as well with a small amount of memory. A measure of memory
				1272	pressure - how much the workload is being impacted due to lack of
				1273	memory - is necessary to determine whether a workload needs more
				1274	memory; unfortunately, a memory pressure monitoring mechanism isn't
				1275	implemented yet.
				1276
				1277
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1278	Memory Ownership
				1279	~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1280
				1281	A memory area is charged to the cgroup which instantiated it and stays
				1282	charged to the cgroup until the area is released. Migrating a process
				1283	to a different cgroup doesn't move the memory usages that it
				1284	instantiated while in the previous cgroup to the new cgroup.
				1285
				1286	A memory area may be used by processes belonging to different cgroups.
				1287	Which cgroup the area will be charged to is non-deterministic; however,
				1288	over time, the memory area is likely to end up in a cgroup which has
				1289	enough memory allowance to avoid high reclaim pressure.
				1290
				1291	If a cgroup sweeps a considerable amount of memory which is expected
				1292	to be accessed repeatedly by other cgroups, it may make sense to use
				1293	POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
				1294	belonging to the affected files to ensure correct memory ownership.
				1295
				1296
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1297	IO
				1298	--
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1299
				1300	The "io" controller regulates the distribution of IO resources. This
				1301	controller implements both weight based and absolute bandwidth or IOPS
				1302	limit distribution; however, weight based distribution is available
				1303	only if cfq-iosched is in use and neither scheme is available for
				1304	blk-mq devices.
				1305
				1306
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1307	IO Interface Files
				1308	~~~~~~~~~~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1309
				1310	io.stat
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1311	A read-only nested-keyed file which exists on non-root
				1312	cgroups.
				1313
				1314	Lines are keyed by $MAJ:$MIN device numbers and not ordered.
				1315	The following nested keys are defined.
				1316
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1317	====== ===================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1318	rbytes Bytes read
				1319	wbytes Bytes written
				1320	rios Number of read IOs
				1321	wios Number of write IOs
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1322	====== ===================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1323
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1324	An example read output follows::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1325
				1326	8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
				1327	8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
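
A nested-keyed file such as "io.stat" can be parsed line by line: the first field is the $MAJ:$MIN key and the rest are KEY=VALUE pairs. The sketch below is one possible parser in Python; the function name is hypothetical and the input is the example output shown above.

```python
def parse_nested_keyed(text):
    """Parse nested-keyed content into {line_key: {nested_key: int}}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        dev, pairs = fields[0], fields[1:]
        entry = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            entry[key] = int(value)
        result[dev] = entry
    return result


sample = """\
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
"""
io_stat = parse_nested_keyed(sample)
# Lines are not ordered, so look devices up by key rather than by position.
total_written = sum(dev["wbytes"] for dev in io_stat.values())
print(total_written)  # 613781504
```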
				1328
				1329	io.weight
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1330	A read-write flat-keyed file which exists on non-root cgroups.
				1331	The default is "default 100".
				1332
				1333	The first line is the default weight applied to devices
				1334	without specific override. The rest are overrides keyed by
				1335	$MAJ:$MIN device numbers and not ordered. The weights are in
				1336	the range [1, 10000] and specify the relative amount of IO time
				1337	the cgroup can use in relation to its siblings.
				1338
				1339	The default weight can be updated by writing either "default
				1340	$WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
				1341	"$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
				1342
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1343	An example read output follows::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1344
				1345	default 100
				1346	8:16 200
				1347	8:0 50
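
The write formats described above can be generated programmatically. The following Python sketch builds the string to write to "io.weight" and rejects weights outside [1, 10000]; the helper name is hypothetical, and the caller is assumed to write the returned string to the file.

```python
def format_io_weight(weight, device=None):
    """Build a string suitable for writing to io.weight.

    weight: an integer in [1, 10000], or "default" to unset a device override.
    device: "MAJ:MIN" for an override, or None for the default weight.
    """
    if weight == "default":
        if device is None:
            raise ValueError('"default" only unsets a per-device override')
        return f"{device} default"
    value = int(weight)
    if not 1 <= value <= 10000:
        raise ValueError("weight must be in the range [1, 10000]")
    if device is None:
        return f"default {value}"
    return f"{device} {value}"


print(format_io_weight(100))              # default 100
print(format_io_weight(200, "8:16"))      # 8:16 200
print(format_io_weight("default", "8:0")) # 8:0 default
```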
				1348
				1349	io.max
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1350	A read-write nested-keyed file which exists on non-root
				1351	cgroups.
				1352
				1353	BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN
				1354	device numbers and not ordered. The following nested keys are
				1355	defined.
				1356
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1357	===== ==================================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1358	rbps Max read bytes per second
				1359	wbps Max write bytes per second
				1360	riops Max read IO operations per second
				1361	wiops Max write IO operations per second
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1362	===== ==================================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1363
				1364	When writing, any number of nested key-value pairs can be
				1365	specified in any order. "max" can be specified as the value
				1366	to remove a specific limit. If the same key is specified
				1367	multiple times, the outcome is undefined.
				1368
				1369	BPS and IOPS are measured in each IO direction and IOs are
				1370	delayed if limit is reached. Temporary bursts are allowed.
				1371
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1372	Setting read limit at 2M BPS and write at 120 IOPS for 8:16::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1373
				1374	echo "8:16 rbps=2097152 wiops=120" > io.max
				1375
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1376	Reading returns the following::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1377
				1378	8:16 rbps=2097152 wbps=max riops=max wiops=120
				1379
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1380	Write IOPS limit can be removed by writing the following::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1381
				1382	echo "8:16 wiops=max" > io.max
				1383
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1384	Reading now returns the following::
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1385
				1386	8:16 rbps=2097152 wbps=max riops=max wiops=max
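
The write semantics illustrated above (keys that are not mentioned keep their previous value, and "max" removes a limit) can be modelled in a few lines of Python. This is only a userspace model of the documented behaviour, not kernel code; the function names are hypothetical, and rejecting duplicate keys is a choice made here because the text leaves that outcome undefined.

```python
IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def apply_io_max_write(current, write):
    """Return a new {device: limits} dict after applying one io.max write."""
    dev, *pairs = write.split()
    # A device without an entry starts with every limit at "max".
    limits = dict(current.get(dev, {key: "max" for key in IO_MAX_KEYS}))
    seen = set()
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key not in IO_MAX_KEYS:
            raise ValueError(f"unknown key: {key}")
        if key in seen:
            raise ValueError(f"duplicate key: {key}")
        seen.add(key)
        limits[key] = "max" if value == "max" else int(value)
    updated = dict(current)
    updated[dev] = limits
    return updated


def format_io_max(dev, limits):
    """Render one device entry the way a read of io.max is shown above."""
    return dev + " " + " ".join(f"{key}={limits[key]}" for key in IO_MAX_KEYS)


state = apply_io_max_write({}, "8:16 rbps=2097152 wiops=120")
print(format_io_max("8:16", state["8:16"]))
state = apply_io_max_write(state, "8:16 wiops=max")
print(format_io_max("8:16", state["8:16"]))
```

The two printed lines reproduce the two read outputs shown in the examples above.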
				1387
				1388
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1389	Writeback
				1390	~~~~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1391
				1392	Page cache is dirtied through buffered writes and shared mmaps and
				1393	written asynchronously to the backing filesystem by the writeback
				1394	mechanism. Writeback sits between the memory and IO domains and
				1395	regulates the proportion of dirty memory by balancing dirtying and
				1396	write IOs.
				1397
				1398	The io controller, in conjunction with the memory controller,
				1399	implements control of page cache writeback IOs. The memory controller
				1400	defines the memory domain that dirty memory ratio is calculated and
				1401	maintained for and the io controller defines the io domain which
				1402	writes out dirty pages for the memory domain. Both system-wide and
				1403	per-cgroup dirty memory states are examined and the more restrictive
				1404	of the two is enforced.
				1405
				1406	cgroup writeback requires explicit support from the underlying
				1407	filesystem. Currently, cgroup writeback is implemented on ext2, ext4
				1408	and btrfs. On other filesystems, all writeback IOs are attributed to
				1409	the root cgroup.
				1410
				1411	There are inherent differences in memory and writeback management
				1412	which affect how cgroup ownership is tracked. Memory is tracked per
				1413	page while writeback is tracked per inode. For the purpose of writeback, an
				1414	inode is assigned to a cgroup and all IO requests to write dirty pages
				1415	from the inode are attributed to that cgroup.
				1416
				1417	As cgroup ownership for memory is tracked per page, there can be pages
				1418	which are associated with different cgroups than the one the inode is
				1419	associated with. These are called foreign pages. The writeback
				1420	constantly keeps track of foreign pages and, if a particular foreign
				1421	cgroup becomes the majority over a certain period of time, switches
				1422	the ownership of the inode to that cgroup.
				1423
				1424	While this model is enough for most use cases where a given inode is
				1425	mostly dirtied by a single cgroup even when the main writing cgroup
				1426	changes over time, use cases where multiple cgroups write to a single
				1427	inode simultaneously are not supported well. In such circumstances, a
				1428	significant portion of IOs are likely to be attributed incorrectly.
				1429	As the memory controller assigns page ownership on the first use and
				1430	doesn't update it until the page is released, even if writeback
				1431	strictly follows page ownership, multiple cgroups dirtying overlapping
				1432	areas wouldn't work as expected. It's recommended to avoid such usage
				1433	patterns.
				1434
				1435	The sysctl knobs which affect writeback behavior are applied to cgroup
				1436	writeback as follows.
				1437
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1438	vm.dirty_background_ratio, vm.dirty_ratio
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1439	These ratios apply the same to cgroup writeback with the
				1440	amount of available memory capped by limits imposed by the
				1441	memory controller and system-wide clean memory.
				1442
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1443	vm.dirty_background_bytes, vm.dirty_bytes
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1444	For cgroup writeback, this is calculated into ratio against
				1445	total available memory and applied the same way as
				1446	vm.dirty[_background]_ratio.
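
The conversion described above can be illustrated with a rough calculation: the byte value is turned into a ratio against total available memory, and that ratio is then applied to the memory available to the cgroup. The Python sketch below is a simplified model under that reading; the function and parameter names are hypothetical, and the kernel's actual computation involves further details not covered here.

```python
def cgroup_dirty_threshold(dirty_bytes, total_available, cgroup_available):
    """Model: convert a vm.dirty_bytes style value to a ratio, then apply it
    to the memory available to one cgroup. All arguments are in bytes."""
    if total_available <= 0:
        raise ValueError("total_available must be positive")
    ratio = dirty_bytes / total_available
    return int(ratio * cgroup_available)


MiB = 1 << 20
GiB = 1 << 30
# Hypothetical numbers: 256 MiB limit, 16 GiB available system-wide,
# 4 GiB available to the cgroup after memory controller limits.
print(cgroup_dirty_threshold(256 * MiB, 16 * GiB, 4 * GiB))  # 67108864
```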
				1447
				1448
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1449	PID
				1450	---
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1451
				1452	The process number controller is used to allow a cgroup to stop any
				1453	new tasks from being fork()'d or clone()'d after a specified limit is
				1454	reached.
				1455
				1456	The number of tasks in a cgroup can be exhausted in ways which other
				1457	controllers cannot prevent, thus warranting its own controller. For
				1458	example, a fork bomb is likely to exhaust the number of tasks before
				1459	hitting memory restrictions.
				1460
				1461	Note that PIDs used in this controller refer to TIDs, process IDs as
				1462	used by the kernel.
				1463
				1464
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1465	PID Interface Files
				1466	~~~~~~~~~~~~~~~~~~~
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1467
				1468	pids.max
Tobias Klauser	312eb71	2017-02-17 18:44:11 +0100	[diff] [blame]	1469	A read-write single value file which exists on non-root
				1470	cgroups. The default is "max".
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1471
Tobias Klauser	312eb71	2017-02-17 18:44:11 +0100	[diff] [blame]	1472	Hard limit of number of processes.
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1473
				1474	pids.current
Tobias Klauser	312eb71	2017-02-17 18:44:11 +0100	[diff] [blame]	1475	A read-only single value file which exists on all cgroups.
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1476
Tobias Klauser	312eb71	2017-02-17 18:44:11 +0100	[diff] [blame]	1477	The number of processes currently in the cgroup and its
				1478	descendants.
Hans Ragas	20c56e5	2017-01-10 17:42:34 +0000	[diff] [blame]	1479
				1480	Organisational operations are not blocked by cgroup policies, so it is
				1481	possible to have pids.current > pids.max. This can be done by either
				1482	setting the limit to be smaller than pids.current, or attaching enough
				1483	processes to the cgroup such that pids.current is larger than
				1484	pids.max. However, it is not possible to violate a cgroup PID policy
				1485	through fork() or clone(). These will return -EAGAIN if the creation
				1486	of a new process would cause a cgroup policy to be violated.
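
The difference between organisational operations and fork()/clone() can be shown with a toy model. The Python class below is not kernel code: it tracks a single cgroup without descendants, and its class and method names are hypothetical. It only demonstrates that attaching or lowering the limit may leave pids.current above pids.max, whereas a fork that would exceed the limit is refused with EAGAIN.

```python
import errno


class PidsCgroup:
    """Toy model of pids.max / pids.current for one cgroup."""

    def __init__(self, limit="max", current=0):
        self.limit = limit      # an integer or the string "max"
        self.current = current

    def fork(self):
        """Model fork()/clone(): refused if it would violate the limit."""
        if self.limit != "max" and self.current + 1 > self.limit:
            raise OSError(errno.EAGAIN, "fork would exceed pids.max")
        self.current += 1

    def attach(self, count=1):
        """Model migrating processes into the cgroup: never blocked."""
        self.current += count

    def set_max(self, limit):
        """Model writing pids.max: allowed even below pids.current."""
        self.limit = limit


cg = PidsCgroup(limit=2)
cg.fork()
cg.fork()
try:
    cg.fork()
except OSError as err:
    print(err.errno == errno.EAGAIN)  # True
cg.attach(3)
print(cg.current, cg.limit)  # 5 2
```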
				1487
				1488
Roman Gushchin	4ad5a32	2017-12-13 19:49:03 +0000	[diff] [blame]	1489	Device controller
				1490	-----------------
				1491
				1492	Device controller manages access to device files. It includes both
				1493	creation of new device files (using mknod), and access to the
				1494	existing device files.
				1495
				1496	Cgroup v2 device controller has no interface files and is implemented
				1497	on top of cgroup BPF. To control access to device files, a user may
				1498	create bpf programs of the BPF_CGROUP_DEVICE type and attach them
				1499	to cgroups. On an attempt to access a device file, corresponding
				1500	BPF programs will be executed, and depending on the return value
				1501	the attempt will succeed or fail with -EPERM.
				1502
				1503	A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx
				1504	structure, which describes the device access attempt: access type
				1505	(mknod/read/write) and device (type, major and minor numbers).
				1506	If the program returns 0, the attempt fails with -EPERM, otherwise
				1507	it succeeds.
				1508
				1509	An example of a BPF_CGROUP_DEVICE program may be found in the kernel
				1510	source tree in the tools/testing/selftests/bpf/dev_cgroup.c file.
				1511
				1512
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1513	RDMA
				1514	----
Tejun Heo	968ebff	2017-01-29 14:35:20 -0500	[diff] [blame]	1515
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1516	The "rdma" controller regulates the distribution and accounting of
				1517	RDMA resources.
				1518
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1519	RDMA Interface Files
				1520	~~~~~~~~~~~~~~~~~~~~
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1521
				1522	rdma.max
				1523	A read-write nested-keyed file that exists for all the cgroups
				1524	except root and describes the currently configured resource limits
				1525	for an RDMA/IB device.
				1526
				1527	Lines are keyed by device name and are not ordered.
				1528	Each line contains space separated resource name and its configured
				1529	limit that can be distributed.
				1530
				1531	The following nested keys are defined.
				1532
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1533	========== =============================
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1534	hca_handle Maximum number of HCA Handles
				1535	hca_object Maximum number of HCA Objects
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1536	========== =============================
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1537
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1538	An example for mlx4 and ocrdma device follows::
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1539
				1540	mlx4_0 hca_handle=2 hca_object=2000
				1541	ocrdma1 hca_handle=3 hca_object=max
				1542
				1543	rdma.current
				1544	A read-only file that describes current resource usage.
				1545	It exists for all the cgroups except root.
				1546
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1547	An example for mlx4 and ocrdma device follows::
Parav Pandit	9c1e67f	2017-01-10 00:02:15 +0000	[diff] [blame]	1548
				1549	mlx4_0 hca_handle=1 hca_object=20
				1550	ocrdma1 hca_handle=1 hca_object=23
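
Since "rdma.max" and "rdma.current" share the same layout, the two can be combined to see how much of each resource is still available. The Python sketch below uses the example contents shown above; the function names are hypothetical, and a limit of "max" is represented as None (no finite headroom).

```python
def parse_rdma(text):
    """Parse rdma.max / rdma.current style content; "max" becomes None."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        entry = {}
        for pair in fields[1:]:
            key, _, value = pair.partition("=")
            entry[key] = None if value == "max" else int(value)
        devices[fields[0]] = entry
    return devices


def rdma_headroom(max_text, current_text):
    """Return {device: {resource: remaining}}; None means no finite limit."""
    limits = parse_rdma(max_text)
    usage = parse_rdma(current_text)
    headroom = {}
    for dev, dev_limits in limits.items():
        used = usage.get(dev, {})
        headroom[dev] = {
            res: None if limit is None else limit - used.get(res, 0)
            for res, limit in dev_limits.items()
        }
    return headroom


rdma_max = "mlx4_0 hca_handle=2 hca_object=2000\nocrdma1 hca_handle=3 hca_object=max\n"
rdma_current = "mlx4_0 hca_handle=1 hca_object=20\nocrdma1 hca_handle=1 hca_object=23\n"
print(rdma_headroom(rdma_max, rdma_current))
```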
				1551
				1552
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1553	Misc
				1554	----
Tejun Heo	63f1ca5	2017-02-02 13:50:35 -0500	[diff] [blame]	1555
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1556	perf_event
				1557	~~~~~~~~~~
Tejun Heo	968ebff	2017-01-29 14:35:20 -0500	[diff] [blame]	1558
				1559	perf_event controller, if not mounted on a legacy hierarchy, is
				1560	automatically enabled on the v2 hierarchy so that perf events can
				1561	always be filtered by cgroup v2 path. The controller can still be
				1562	moved to a legacy hierarchy after v2 hierarchy is populated.
				1563
				1564
Maciej S. Szmigiero	c4e0842	2018-01-10 23:33:19 +0100	[diff] [blame]	1565	Non-normative information
				1566	-------------------------
				1567
				1568	This section contains information that isn't considered to be a part of
				1569	the stable kernel API and so is subject to change.
				1570
				1571
				1572	CPU controller root cgroup process behaviour
				1573	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1574
				1575	When distributing CPU cycles in the root cgroup each thread in this
				1576	cgroup is treated as if it was hosted in a separate child cgroup of the
				1577	root cgroup. This child cgroup weight is dependent on its thread nice
				1578	level.
				1579
				1580	For details of this mapping see sched_prio_to_weight array in
				1581	kernel/sched/core.c file (values from this array should be scaled
				1582	appropriately so the neutral - nice 0 - value is 100 instead of 1024).
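
The scaling mentioned above can be written out explicitly. The Python sketch below assumes the sched_prio_to_weight values as commonly found in kernel/sched/core.c (reproduced here as an assumption; check them against your kernel tree) and rescales them so that nice 0 maps to 100. The function name is hypothetical.

```python
# Assumed copy of sched_prio_to_weight from kernel/sched/core.c,
# indexed by nice + 20 (nice -20 first, nice 19 last).
SCHED_PRIO_TO_WEIGHT = [
    88761, 71755, 56483, 46273, 36291,
    29154, 23254, 18705, 14949, 11916,
    9548, 7620, 6100, 4904, 3906,
    3121, 2501, 1991, 1586, 1277,
    1024, 820, 655, 526, 423,
    335, 272, 215, 172, 137,
    110, 87, 70, 56, 45,
    36, 29, 23, 18, 15,
]


def root_thread_weight(nice):
    """Approximate cgroup weight of a root cgroup thread at the given nice level."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in the range [-20, 19]")
    return SCHED_PRIO_TO_WEIGHT[nice + 20] * 100 / 1024


print(root_thread_weight(0))  # 100.0
```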
				1583
				1584
				1585	IO controller root cgroup process behaviour
				1586	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				1587
				1588	Root cgroup processes are hosted in an implicit leaf child node.
				1589	When distributing IO resources this implicit child node is taken into
				1590	account as if it was a normal child cgroup of the root cgroup with a
				1591	weight value of 200.
				1592
				1593
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1594	Namespace
				1595	=========
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1596
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1597	Basics
				1598	------
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1599
				1600	cgroup namespace provides a mechanism to virtualize the view of the
				1601	"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone
				1602	flag can be used with clone(2) and unshare(2) to create a new cgroup
				1603	namespace. The process running inside the cgroup namespace will have
				1604	its "/proc/$PID/cgroup" output restricted to cgroupns root. The
				1605	cgroupns root is the cgroup of the process at the time of creation of
				1606	the cgroup namespace.
				1607
				1608	Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
				1609	complete path of the cgroup of a process. In a container setup where
				1610	a set of cgroups and namespaces are intended to isolate processes the
				1611	"/proc/$PID/cgroup" file may leak potential system level information
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1612	to the isolated processes. For example::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1613
				1614	# cat /proc/self/cgroup
				1615	0::/batchjobs/container_id1
				1616
				1617	The path '/batchjobs/container_id1' can be considered as system-data
				1618	and undesirable to expose to the isolated processes. cgroup namespace
				1619	can be used to restrict visibility of this path. For example, before
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1620	creating a cgroup namespace, one would see::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1621
				1622	# ls -l /proc/self/ns/cgroup
				1623	lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
				1624	# cat /proc/self/cgroup
				1625	0::/batchjobs/container_id1
				1626
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1627	After unsharing a new namespace, the view changes::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1628
				1629	# ls -l /proc/self/ns/cgroup
				1630	lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
				1631	# cat /proc/self/cgroup
				1632	0::/
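
Each line of "/proc/$PID/cgroup" has the form "hierarchy-ID:controller-list:path", and the v2 entry is the one with hierarchy ID 0 and an empty controller list. As a small sketch, the Python helpers below (hypothetical names) extract the v2 path from such content; the inputs are the two outputs shown above.

```python
def parse_proc_cgroup(text):
    """Parse /proc/$PID/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue
        hier_id, controllers, path = line.split(":", 2)
        entries.append((int(hier_id), controllers.split(",") if controllers else [], path))
    return entries


def v2_path(text):
    """Return the cgroup v2 path, or None if there is no v2 entry."""
    for hier_id, controllers, path in parse_proc_cgroup(text):
        if hier_id == 0 and not controllers:
            return path
    return None


print(v2_path("0::/batchjobs/container_id1\n"))  # /batchjobs/container_id1
print(v2_path("0::/\n"))                         # /
```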
				1633
				1634	When some thread from a multi-threaded process unshares its cgroup
				1635	namespace, the new cgroupns gets applied to the entire process (all
				1636	the threads). This is natural for the v2 hierarchy; however, for the
				1637	legacy hierarchies, this may be unexpected.
				1638
				1639	A cgroup namespace is alive as long as there are processes inside or
				1640	mounts pinning it. When the last usage goes away, the cgroup
				1641	namespace is destroyed. The cgroupns root and the actual cgroups
				1642	remain.
				1643
				1644
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1645	The Root and Views
				1646	------------------
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1647
				1648	The 'cgroupns root' for a cgroup namespace is the cgroup in which the
				1649	process calling unshare(2) is running. For example, if a process in
				1650	/batchjobs/container_id1 cgroup calls unshare, cgroup
				1651	/batchjobs/container_id1 becomes the cgroupns root. For the
				1652	init_cgroup_ns, this is the real root ('/') cgroup.
				1653
				1654	The cgroupns root cgroup does not change even if the namespace creator
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1655	process later moves to a different cgroup::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1656
				1657	# ~/unshare -c # unshare cgroupns in some cgroup
				1658	# cat /proc/self/cgroup
				1659	0::/
				1660	# mkdir sub_cgrp_1
				1661	# echo 0 > sub_cgrp_1/cgroup.procs
				1662	# cat /proc/self/cgroup
				1663	0::/sub_cgrp_1
				1664
				1665	Each process gets its namespace-specific view of "/proc/$PID/cgroup".
				1666
				1667	Processes running inside the cgroup namespace will be able to see
				1668	cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1669	From within an unshared cgroupns::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1670
				1671	# sleep 100000 &
				1672	[1] 7353
				1673	# echo 7353 > sub_cgrp_1/cgroup.procs
				1674	# cat /proc/7353/cgroup
				1675	0::/sub_cgrp_1
				1676
				1677	From the initial cgroup namespace, the real cgroup path will be
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1678	visible::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1679
				1680	$ cat /proc/7353/cgroup
				1681	0::/batchjobs/container_id1/sub_cgrp_1
				1682
				1683	From a sibling cgroup namespace (that is, a namespace rooted at a
				1684	different cgroup), the cgroup path relative to its own cgroup
				1685	namespace root will be shown. For instance, if PID 7353's cgroup
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1686	namespace root is at '/batchjobs/container_id2', then it will see::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1687
				1688	# cat /proc/7353/cgroup
				1689	0::/../container_id2/sub_cgrp_1
				1690
				1691	Note that the relative path always starts with '/' to indicate that
				1692	it's relative to the cgroup namespace root of the caller.
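
The way a path is presented relative to the viewer's cgroupns root can be approximated with ordinary path arithmetic. The Python sketch below is a userspace illustration only, not how the kernel computes it; the function name is hypothetical, and both arguments are assumed to be absolute paths in the real (initial namespace) hierarchy.

```python
import posixpath


def cgroup_path_as_seen_from(real_path, ns_root):
    """Return real_path as it would be shown to a reader whose cgroupns root is ns_root."""
    rel = posixpath.relpath(real_path, ns_root)
    # The displayed path always starts with '/', even when it climbs with '..'.
    return "/" if rel == "." else "/" + rel


root = "/batchjobs/container_id1"
print(cgroup_path_as_seen_from("/batchjobs/container_id1/sub_cgrp_1", root))  # /sub_cgrp_1
print(cgroup_path_as_seen_from("/batchjobs/container_id2", root))             # /../container_id2
print(cgroup_path_as_seen_from("/batchjobs/container_id1/sub_cgrp_1", "/"))
```

The last call models a reader in the initial cgroup namespace, which sees the full path.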
				1693
				1694
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1695	Migration and setns(2)
				1696	----------------------
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1697
				1698	Processes inside a cgroup namespace can move into and out of the
				1699	namespace root if they have proper access to external cgroups. For
				1700	example, from inside a namespace with cgroupns root at
				1701	/batchjobs/container_id1, and assuming that the global hierarchy is
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1702	still accessible inside cgroupns::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1703
				1704	# cat /proc/7353/cgroup
				1705	0::/sub_cgrp_1
				1706	# echo 7353 > batchjobs/container_id2/cgroup.procs
				1707	# cat /proc/7353/cgroup
				1708	0::/../container_id2
				1709
				1710	Note that this kind of setup is not encouraged. A task inside cgroup
				1711	namespace should only be exposed to its own cgroupns hierarchy.
				1712
				1713	setns(2) to another cgroup namespace is allowed when:
				1714
				1715	(a) the process has CAP_SYS_ADMIN against its current user namespace
				1716	(b) the process has CAP_SYS_ADMIN against the target cgroup
				1717	namespace's userns
				1718
				1719	No implicit cgroup changes happen with attaching to another cgroup
				1720	namespace. It is expected that someone moves the attaching
				1721	process under the target cgroup namespace root.
				1722
				1723
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1724	Interaction with Other Namespaces
				1725	---------------------------------
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1726
				1727	Namespace specific cgroup hierarchy can be mounted by a process
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1728	running inside a non-init cgroup namespace::
Serge Hallyn	d4021f6	2016-01-29 02:54:10 -0600	[diff] [blame]	1729
				1730	# mount -t cgroup2 none $MOUNT_POINT
				1731
				1732	This will mount the unified cgroup hierarchy with cgroupns root as the
				1733	filesystem root. The process needs CAP_SYS_ADMIN against its user and
				1734	mount namespaces.
				1735
				1736	The virtualization of /proc/self/cgroup file combined with restricting
				1737	the view of cgroup hierarchy by namespace-private cgroupfs mount
				1738	provides a properly isolated cgroup view inside the container.
				1739
				1740
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1741	Information on Kernel Programming
				1742	=================================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1743
				1744	This section contains kernel programming information in the areas
				1745	where interacting with cgroup is necessary. cgroup core and
				1746	controllers are not covered.
				1747
				1748
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1749	Filesystem Support for Writeback
				1750	--------------------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1751
				1752	A filesystem can support cgroup writeback by updating
				1753	address_space_operations->writepage[s]() to annotate bio's using the
				1754	following two functions.
				1755
				1756	wbc_init_bio(@wbc, @bio)
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1757	Should be called for each bio carrying writeback data and
				1758	associates the bio with the inode's owner cgroup. Can be
				1759	called anytime between bio allocation and submission.
				1760
				1761	wbc_account_io(@wbc, @page, @bytes)
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1762	Should be called for each data segment being written out.
				1763	While this function doesn't care exactly when it's called
				1764	during the writeback session, it's the easiest and most
				1765	natural to call it as data segments are added to a bio.
				1766
				1767	With writeback bio's annotated, cgroup support can be enabled per
				1768	super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
				1769	selective disabling of cgroup writeback support which is helpful when
				1770	certain filesystem features, e.g. journaled data mode, are
				1771	incompatible.
				1772
				1773	wbc_init_bio() binds the specified bio to its cgroup. Depending on
				1774	the configuration, the bio may be executed at a lower priority and if
				1775	the writeback session is holding shared resources, e.g. a journal
				1776	entry, may lead to priority inversion. There is no one easy solution
				1777	for the problem. Filesystems can try to work around specific problem
				1778	cases by skipping wbc_init_bio() or using bio_associate_blkcg()
				1779	directly.
				1780
				1781
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1782	Deprecated v1 Core Features
				1783	===========================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1784
				1785	- Multiple hierarchies including named ones are not supported.
				1786
Tejun Heo	5136f63	2017-06-27 14:30:28 -0400	[diff] [blame]	1787	- None of the v1 mount options are supported.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1788
				1789	- The "tasks" file is removed and "cgroup.procs" is not sorted.
				1790
				1791	- "cgroup.clone_children" is removed.
				1792
				1793	- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
				1794	at the root instead.
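
As a minimal sketch of the last point, the controllers available to a v2
hierarchy can be read from the "cgroup.controllers" file at the root, which
holds a space-separated list of controller names. The mount point
/sys/fs/cgroup and the helper names below are assumptions made for
illustration; they are not part of the cgroup interface.

```python
from pathlib import Path

# Assumed mount point of the cgroup v2 hierarchy; adjust for the local system.
CGROUP2_ROOT = Path("/sys/fs/cgroup")


def parse_controllers(text):
    """Split the space-separated contents of a cgroup.controllers file."""
    return text.split()


def available_controllers(root=CGROUP2_ROOT):
    """Return the controllers listed in <root>/cgroup.controllers."""
    return parse_controllers((Path(root) / "cgroup.controllers").read_text())
```

Calling available_controllers() with no argument reads the assumed default
mount point; passing another directory allows the same code to inspect a
delegated subtree.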
				1795
				1796
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1797	Issues with v1 and Rationales for v2
				1798	====================================
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1799
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1800	Multiple Hierarchies
				1801	--------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1802
				1803	cgroup v1 allowed an arbitrary number of hierarchies and each
				1804	hierarchy could host any number of controllers. While this seemed to
				1805	provide a high level of flexibility, it wasn't useful in practice.
				1806
				1807	For example, as there is only one instance of each controller, utility
				1808	type controllers such as freezer which can be useful in all
				1809	hierarchies could only be used in one. The issue is exacerbated by
				1810	the fact that controllers couldn't be moved to another hierarchy once
				1811	hierarchies were populated. Another issue was that all controllers
				1812	bound to a hierarchy were forced to have exactly the same view of the
				1813	hierarchy. It wasn't possible to vary the granularity depending on
				1814	the specific controller.
				1815
				1816	In practice, these issues heavily limited which controllers could be
				1817	put on the same hierarchy and most configurations resorted to putting
				1818	each controller on its own hierarchy. Only closely related ones, such
				1819	as the cpu and cpuacct controllers, made sense to be put on the same
				1820	hierarchy. This often meant that userland ended up managing multiple
				1821	similar hierarchies repeating the same steps on each hierarchy
				1822	whenever a hierarchy management operation was necessary.
				1823
				1824	Furthermore, support for multiple hierarchies came at a steep cost.
				1825	It greatly complicated cgroup core implementation but more importantly
				1826	the support for multiple hierarchies restricted how cgroup could be
				1827	used in general and what controllers were able to do.
				1828
				1829	There was no limit on how many hierarchies there might be, which meant
				1830	that a thread's cgroup membership couldn't be described in finite
				1831	length. The key might contain any number of entries and was unlimited
				1832	in length, which made it highly awkward to manipulate and led to
				1833	addition of controllers which existed only to identify membership,
				1834	which in turn exacerbated the original problem of proliferating number
				1835	of hierarchies.
				1836
				1837	Also, as a controller couldn't have any expectation regarding the
				1838	topologies of hierarchies other controllers might be on, each
				1839	controller had to assume that all other controllers were attached to
				1840	completely orthogonal hierarchies. This made it impossible, or at
				1841	least very cumbersome, for controllers to cooperate with each other.
				1842
				1843	In most use cases, putting controllers on hierarchies which are
				1844	completely orthogonal to each other isn't necessary. What usually is
				1845	called for is the ability to have differing levels of granularity
				1846	depending on the specific controller. In other words, hierarchy may
				1847	be collapsed from leaf towards root when viewed from specific
				1848	controllers. For example, a given configuration might not care about
				1849	how memory is distributed beyond a certain level while still wanting
				1850	to control how CPU cycles are distributed.
				1851
				1852
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1853	Thread Granularity
				1854	------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1855
				1856	cgroup v1 allowed threads of a process to belong to different cgroups.
				1857	This didn't make sense for some controllers and those controllers
				1858	ended up implementing different ways to ignore such situations but
				1859	much more importantly it blurred the line between API exposed to
				1860	individual applications and system management interface.
				1861
				1862	Generally, in-process knowledge is available only to the process
				1863	itself; thus, unlike service-level organization of processes,
				1864	categorizing threads of a process requires active participation from
				1865	the application which owns the target process.
				1866
				1867	cgroup v1 had an ambiguously defined delegation model which got abused
				1868	in combination with thread granularity. cgroups were delegated to
				1869	individual applications so that they could create and manage their own
				1870	sub-hierarchies and control resource distributions along them. This
				1871	effectively raised cgroup to the status of a syscall-like API exposed
				1872	to lay programs.
				1873
				1874	First of all, cgroup has a fundamentally inadequate interface to be
				1875	exposed this way. For a process to access its own knobs, it has to
				1876	extract the path on the target hierarchy from /proc/self/cgroup,
				1877	construct the path by appending the name of the knob to the path, open
				1878	and then read and/or write to it. This is not only extremely clunky
				1879	and unusual but also inherently racy. There is no conventional way to
				1880	define transaction across the required steps and nothing can guarantee
				1881	that the process would actually be operating on its own sub-hierarchy.
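
The steps a v1 process had to go through can be sketched as follows. The
sketch assumes the "hierarchy-ID:controller-list:cgroup-path" line format of
/proc/<pid>/cgroup; the function names, the sample text and the mount point
used in the usage note are hypothetical.

```python
import os


def find_cgroup_path(proc_cgroup_text, controller):
    """Return the cgroup path recorded for `controller`, or None.

    Step 1: scan the text of /proc/<pid>/cgroup, where each line has the
    form "hierarchy-ID:controller-list:cgroup-path".
    """
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _hierarchy_id, controllers, path = parts
        if controller in controllers.split(","):
            return path
    return None


def knob_path(proc_cgroup_text, controller, knob, mount_root):
    """Build the filesystem path of `knob` for the cgroup found in step 1."""
    path = find_cgroup_path(proc_cgroup_text, controller)
    if path is None:
        raise LookupError(f"controller {controller!r} not found")
    # Step 2: append the cgroup path and the knob name to the mount point.
    return os.path.join(mount_root, path.lstrip("/"), knob)
```

A caller would pass the contents of /proc/self/cgroup and then open the
returned path as step 3. Nothing ties that open to the lookup: the process
may be migrated to another cgroup in between, which is the race described
above.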
				1882
				1883	cgroup controllers implemented a number of knobs which would never be
				1884	accepted as public APIs because they were just adding control knobs to
				1885	system-management pseudo filesystem. cgroup ended up with interface
				1886	knobs which were not properly abstracted or refined and directly
				1887	revealed kernel internal details. These knobs got exposed to
				1888	individual applications through the ill-defined delegation mechanism
				1889	effectively abusing cgroup as a shortcut to implementing public APIs
				1890	without going through the required scrutiny.
				1891
				1892	This was painful for both userland and the kernel. Userland ended up
				1893	with misbehaving and poorly abstracted interfaces, and the kernel ended
				1894	up inadvertently exposing, and being locked into, such constructs.
				1895
				1896
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1897	Competition Between Inner Nodes and Threads
				1898	-------------------------------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1899
				1900	cgroup v1 allowed threads to be in any cgroups which created an
				1901	interesting problem where threads belonging to a parent cgroup and its
				1902	children cgroups competed for resources. This was nasty as two
				1903	different types of entities competed and there was no obvious way to
				1904	settle it. Different controllers did different things.
				1905
				1906	The cpu controller considered threads and cgroups as equivalents and
				1907	mapped nice levels to cgroup weights. This worked for some cases but
				1908	fell flat when children wanted to be allocated specific ratios of CPU
				1909	cycles and the number of internal threads fluctuated - the ratios
				1910	constantly changed as the number of competing entities fluctuated.
				1911	There also were other issues. The mapping from nice level to weight
				1912	wasn't obvious or universal, and there were various other knobs which
				1913	simply weren't available for threads.
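
The fluctuating ratio can be illustrated with a small calculation. This is a
sketch under simplifying assumptions: the child cgroup and every internal
thread of the parent compete as peers by weight, and the value 1024 used for
both a child cgroup and a nice-0 thread is chosen for illustration only.

```python
def child_cgroup_share(child_weight, internal_thread_weights):
    """Fraction of the parent's CPU time received by one child cgroup.

    The child competes as a peer with every thread that lives directly in
    the parent, so its share is its weight over the sum of all weights.
    """
    total = child_weight + sum(internal_thread_weights)
    return child_weight / total
```

With one internal thread, child_cgroup_share(1024, [1024]) is 0.5; with
three, child_cgroup_share(1024, [1024] * 3) is 0.25. The child's
configuration did not change, yet its share halved because the number of
internal threads grew.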
				1914
				1915	The io controller implicitly created a hidden leaf node for each
				1916	cgroup to host the threads. The hidden leaf had its own copies of all
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1917	the knobs with ``leaf_`` prefixed. While this allowed equivalent
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1918	control over internal threads, it came with serious drawbacks. It
				1919	always added an extra layer of nesting which wouldn't be necessary
				1920	otherwise, made the interface messy and significantly complicated the
				1921	implementation.
				1922
				1923	The memory controller didn't have a way to control what happened
				1924	between internal tasks and child cgroups and the behavior was not
				1925	clearly defined. There were attempts to add ad-hoc behaviors and
				1926	knobs to tailor the behavior to specific workloads which would have
				1927	led to problems extremely difficult to resolve in the long term.
				1928
				1929	Multiple controllers struggled with internal tasks and came up with
				1930	different ways to deal with it; unfortunately, all the approaches were
				1931	severely flawed and, furthermore, the widely different behaviors
				1932	made cgroup as a whole highly inconsistent.
				1933
				1934	This clearly is a problem which needs to be addressed from cgroup core
				1935	in a uniform way.
				1936
				1937
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1938	Other Interface Issues
				1939	----------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1940
				1941	cgroup v1 grew without oversight and developed a large number of
				1942	idiosyncrasies and inconsistencies. One issue on the cgroup core side
				1943	was how an empty cgroup was notified - a userland helper binary was
				1944	forked and executed for each event. The event delivery wasn't
				1945	recursive or delegatable. The limitations of the mechanism also led
				1946	to an in-kernel event delivery filtering mechanism, further complicating
				1947	the interface.
				1948
				1949	Controller interfaces were problematic too. An extreme example is
				1950	controllers completely ignoring hierarchical organization and treating
				1951	all cgroups as if they were all located directly under the root
				1952	cgroup. Some controllers exposed a large amount of inconsistent
				1953	implementation details to userland.
				1954
				1955	There also was no consistency across controllers. When a new cgroup
				1956	was created, some controllers defaulted to not imposing extra
				1957	restrictions while others disallowed any resource usage until
				1958	explicitly configured. Configuration knobs for the same type of
				1959	control used widely differing naming schemes and formats. Statistics
				1960	and information knobs were named arbitrarily and used different
				1961	formats and units even in the same controller.
				1962
				1963	cgroup v2 establishes common conventions where appropriate and updates
				1964	controllers so that they expose minimal and consistent interfaces.
				1965
				1966
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1967	Controller Issues and Remedies
				1968	------------------------------
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1969
Mauro Carvalho Chehab	633b11b	2017-05-14 08:48:40 -0300	[diff] [blame]	1970	Memory
				1971	~~~~~~
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1972
				1973	The original lower boundary, the soft limit, is defined as a limit
				1974	that is per default unset. As a result, the set of cgroups that
				1975	global reclaim prefers is opt-in, rather than opt-out. The costs for
				1976	optimizing these mostly negative lookups are so high that the
				1977	implementation, despite its enormous size, does not even provide the
				1978	basic desirable behavior. First off, the soft limit has no
				1979	hierarchical meaning. All configured groups are organized in a global
				1980	rbtree and treated like equal peers, regardless of where they are located
				1981	in the hierarchy. This makes subtree delegation impossible. Second,
				1982	the soft limit reclaim pass is so aggressive that it not only
				1983	introduces high allocation latencies into the system, but also impacts
				1984	system performance due to overreclaim, to the point where the feature
				1985	becomes self-defeating.
				1986
				1987	The memory.low boundary on the other hand is a top-down allocated
Roman Gushchin	7854207	2018-06-07 17:06:29 -0700	[diff] [blame]	1988	reserve. A cgroup enjoys reclaim protection when it's within its low,
				1989	which makes delegation of subtrees possible.
Tejun Heo	6c29209	2015-11-16 11:13:34 -0500	[diff] [blame]	1990
				1991	The original high boundary, the hard limit, is defined as a strict
				1992	limit that can not budge, even if the OOM killer has to be called.
				1993	But this generally goes against the goal of making the most out of the
				1994	available memory. The memory consumption of workloads varies during
				1995	runtime, and that requires users to overcommit. But doing that with a
				1996	strict upper limit requires either a fairly accurate prediction of the
				1997	working set size or adding slack to the limit. Since working set size
				1998	estimation is hard and error prone, and getting it wrong results in
				1999	OOM kills, most users tend to err on the side of a looser limit and
				2000	end up wasting precious resources.
				2001
				2002	The memory.high boundary on the other hand can be set much more
				2003	conservatively. When hit, it throttles allocations by forcing them
				2004	into direct reclaim to work off the excess, but it never invokes the
				2005	OOM killer. As a result, a high boundary that is chosen too
				2006	aggressively will not terminate the processes, but instead it will
				2007	lead to gradual performance degradation. The user can monitor this
				2008	and make corrections until the minimal memory footprint that still
				2009	gives acceptable performance is found.
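
One possible way to do this monitoring is to sample the "high" counter of the
cgroup's memory.events file, which is a flat-keyed file with one "KEY VALUE"
pair per line. The sketch below only parses such text and compares two
samples; the function names are hypothetical and the tuning policy built on
top of the result is left to the user.

```python
def parse_flat_keyed(text):
    """Parse a flat-keyed file ("KEY VALUE" per line) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        counters[key] = int(value)
    return counters


def high_events_since(previous, current):
    """Number of times memory.high was hit between two memory.events samples."""
    return current.get("high", 0) - previous.get("high", 0)
```

A steadily growing difference between samples suggests that the high boundary
is set too low for the workload, while a difference of zero suggests there is
room to lower it further.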
				2010
				2011	In extreme cases, with many concurrent allocations and a complete
				2012	breakdown of reclaim progress within the group, the high boundary can
				2013	be exceeded. But even then it's mostly better to satisfy the
				2014	allocation from the slack available in other groups or the rest of the
				2015	system than killing the group. Otherwise, memory.max is there to
				2016	limit this type of spillover and ultimately contain buggy or even
				2017	malicious applications.
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	2018
Johannes Weiner	b6e6edc	2016-03-17 14:20:28 -0700	[diff] [blame]	2019	Setting the original memory.limit_in_bytes below the current usage was
				2020	subject to a race condition, where concurrent charges could cause the
				2021	limit setting to fail. memory.max on the other hand will first set the
				2022	limit to prevent new charges, and then reclaim and OOM kill until the
				2023	new limit is met - or the task writing to memory.max is killed.
				2024
Vladimir Davydov	3e24b19	2016-01-20 15:03:13 -0800	[diff] [blame]	2025	The combined memory+swap accounting and limiting is replaced by real
				2026	control over swap space.
				2027
				2028	The main argument for a combined memory+swap facility in the original
				2029	cgroup design was that global or parental pressure would always be
				2030	able to swap all anonymous memory of a child group, regardless of the
				2031	child's own (possibly untrusted) configuration. However, untrusted
				2032	groups can sabotage swapping by other means - such as referencing their
				2033	anonymous memory in a tight loop - and an admin can not assume full
				2034	swappability when overcommitting untrusted jobs.
				2035
				2036	For trusted jobs, on the other hand, a combined counter is not an
				2037	intuitive userspace interface, and it flies in the face of the idea
				2038	that cgroup controllers should account and limit specific physical
				2039	resources. Swap space is a resource like all others in the system,
				2040	and that's why unified hierarchy allows distributing it separately.