Merge branch 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cgroup v2 interface is now official. It's no longer hidden behind a
devel flag and can be mounted using the new cgroup2 fs type.
Unfortunately, cpu v2 interface hasn't made it yet due to the
discussion around in-process hierarchical resource distribution and
only memory and io controllers can be used on the v2 interface at the
moment.
- The existing documentation which has always been a bit of mess is
relocated under Documentation/cgroup-v1/. Documentation/cgroup-v2.txt
is added as the authoritative documentation for the v2 interface.
- Some features are added through for-4.5-ancestor-test branch to
enable netfilter xt_cgroup match to use cgroup v2 paths. The actual
netfilter changes will be merged through the net tree which pulled in
the said branch.
- Various cleanups
* 'for-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rename cgroup documentations
cgroup: fix a typo.
cgroup: Remove resource_counter.txt in Documentation/cgroup-legacy/00-INDEX.
cgroup: demote subsystem init messages to KERN_DEBUG
cgroup: Fix uninitialized variable warning
cgroup: put controller Kconfig options in meaningful order
cgroup: clean up the kernel configuration menu nomenclature
cgroup_pids: fix a typo.
Subject: cgroup: Fix incomplete dd command in blkio documentation
cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends
cpuset: Replace all instances of time_t with time64_t
cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation
cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/
cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type
diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroup-v1/00-INDEX
similarity index 95%
rename from Documentation/cgroups/00-INDEX
rename to Documentation/cgroup-v1/00-INDEX
index 3f5a40f..6ad425f 100644
--- a/Documentation/cgroups/00-INDEX
+++ b/Documentation/cgroup-v1/00-INDEX
@@ -24,7 +24,5 @@
- Network priority cgroups details and usages.
pids.txt
- Process number cgroups details and usages.
-resource_counter.txt
- - Resource Counter API.
unified-hierarchy.txt
- Description the new/next cgroup interface.
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroup-v1/blkio-controller.txt
similarity index 80%
rename from Documentation/cgroups/blkio-controller.txt
rename to Documentation/cgroup-v1/blkio-controller.txt
index 52fa9f3..673dc34 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroup-v1/blkio-controller.txt
@@ -84,8 +84,7 @@
- Run dd to read a file and see if rate is throttled to 1MB/s or not.
- # dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
- # iflag=direct
+ # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
1024+0 records in
1024+0 records out
4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
@@ -374,82 +373,3 @@
groups and put applications in that group which are not driving enough
IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
on individual groups and throughput should improve.
-
-Writeback
-=========
-
-Page cache is dirtied through buffered writes and shared mmaps and
-written asynchronously to the backing filesystem by the writeback
-mechanism. Writeback sits between the memory and IO domains and
-regulates the proportion of dirty memory by balancing dirtying and
-write IOs.
-
-On traditional cgroup hierarchies, relationships between different
-controllers cannot be established making it impossible for writeback
-to operate accounting for cgroup resource restrictions and all
-writeback IOs are attributed to the root cgroup.
-
-If both the blkio and memory controllers are used on the v2 hierarchy
-and the filesystem supports cgroup writeback, writeback operations
-correctly follow the resource restrictions imposed by both memory and
-blkio controllers.
-
-Writeback examines both system-wide and per-cgroup dirty memory status
-and enforces the more restrictive of the two. Also, writeback control
-parameters which are absolute values - vm.dirty_bytes and
-vm.dirty_background_bytes - are distributed across cgroups according
-to their current writeback bandwidth.
-
-There's a peculiarity stemming from the discrepancy in ownership
-granularity between memory controller and writeback. While memory
-controller tracks ownership per page, writeback operates on inode
-basis. cgroup writeback bridges the gap by tracking ownership by
-inode but migrating ownership if too many foreign pages, pages which
-don't match the current inode ownership, have been encountered while
-writing back the inode.
-
-This is a conscious design choice as writeback operations are
-inherently tied to inodes making strictly following page ownership
-complicated and inefficient. The only use case which suffers from
-this compromise is multiple cgroups concurrently dirtying disjoint
-regions of the same inode, which is an unlikely use case and decided
-to be unsupported. Note that as memory controller assigns page
-ownership on the first use and doesn't update it until the page is
-released, even if cgroup writeback strictly follows page ownership,
-multiple cgroups dirtying overlapping areas wouldn't work as expected.
-In general, write-sharing an inode across multiple cgroups is not well
-supported.
-
-Filesystem support for cgroup writeback
----------------------------------------
-
-A filesystem can make writeback IOs cgroup-aware by updating
-address_space_operations->writepage[s]() to annotate bio's using the
-following two functions.
-
-* wbc_init_bio(@wbc, @bio)
-
- Should be called for each bio carrying writeback data and associates
- the bio with the inode's owner cgroup. Can be called anytime
- between bio allocation and submission.
-
-* wbc_account_io(@wbc, @page, @bytes)
-
- Should be called for each data segment being written out. While
- this function doesn't care exactly when it's called during the
- writeback session, it's the easiest and most natural to call it as
- data segments are added to a bio.
-
-With writeback bio's annotated, cgroup support can be enabled per
-super_block by setting MS_CGROUPWB in ->s_flags. This allows for
-selective disabling of cgroup writeback support which is helpful when
-certain filesystem features, e.g. journaled data mode, are
-incompatible.
-
-wbc_init_bio() binds the specified bio to its cgroup. Depending on
-the configuration, the bio may be executed at a lower priority and if
-the writeback session is holding shared resources, e.g. a journal
-entry, may lead to priority inversion. There is no one easy solution
-for the problem. Filesystems can try to work around specific problem
-cases by skipping wbc_init_bio() or using bio_associate_blkcg()
-directly.
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroup-v1/cgroups.txt
similarity index 100%
rename from Documentation/cgroups/cgroups.txt
rename to Documentation/cgroup-v1/cgroups.txt
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroup-v1/cpuacct.txt
similarity index 100%
rename from Documentation/cgroups/cpuacct.txt
rename to Documentation/cgroup-v1/cpuacct.txt
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroup-v1/cpusets.txt
similarity index 100%
rename from Documentation/cgroups/cpusets.txt
rename to Documentation/cgroup-v1/cpusets.txt
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroup-v1/devices.txt
similarity index 100%
rename from Documentation/cgroups/devices.txt
rename to Documentation/cgroup-v1/devices.txt
diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroup-v1/freezer-subsystem.txt
similarity index 100%
rename from Documentation/cgroups/freezer-subsystem.txt
rename to Documentation/cgroup-v1/freezer-subsystem.txt
diff --git a/Documentation/cgroups/hugetlb.txt b/Documentation/cgroup-v1/hugetlb.txt
similarity index 100%
rename from Documentation/cgroups/hugetlb.txt
rename to Documentation/cgroup-v1/hugetlb.txt
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroup-v1/memcg_test.txt
similarity index 100%
rename from Documentation/cgroups/memcg_test.txt
rename to Documentation/cgroup-v1/memcg_test.txt
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroup-v1/memory.txt
similarity index 100%
rename from Documentation/cgroups/memory.txt
rename to Documentation/cgroup-v1/memory.txt
diff --git a/Documentation/cgroups/net_cls.txt b/Documentation/cgroup-v1/net_cls.txt
similarity index 100%
rename from Documentation/cgroups/net_cls.txt
rename to Documentation/cgroup-v1/net_cls.txt
diff --git a/Documentation/cgroups/net_prio.txt b/Documentation/cgroup-v1/net_prio.txt
similarity index 100%
rename from Documentation/cgroups/net_prio.txt
rename to Documentation/cgroup-v1/net_prio.txt
diff --git a/Documentation/cgroups/pids.txt b/Documentation/cgroup-v1/pids.txt
similarity index 100%
rename from Documentation/cgroups/pids.txt
rename to Documentation/cgroup-v1/pids.txt
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
new file mode 100644
index 0000000..31d1f7b
--- /dev/null
+++ b/Documentation/cgroup-v2.txt
@@ -0,0 +1,1293 @@
+
+Control Group v2
+
+October, 2015 Tejun Heo <tj@kernel.org>
+
+This is the authoritative documentation on the design, interface and
+conventions of cgroup v2. It describes all userland-visible aspects
+of cgroup including core and specific controller behaviors. All
+future changes must be reflected in this document. Documentation for
+v1 is available under Documentation/cgroup-legacy/.
+
+CONTENTS
+
+1. Introduction
+ 1-1. Terminology
+ 1-2. What is cgroup?
+2. Basic Operations
+ 2-1. Mounting
+ 2-2. Organizing Processes
+ 2-3. [Un]populated Notification
+ 2-4. Controlling Controllers
+ 2-4-1. Enabling and Disabling
+ 2-4-2. Top-down Constraint
+ 2-4-3. No Internal Process Constraint
+ 2-5. Delegation
+ 2-5-1. Model of Delegation
+ 2-5-2. Delegation Containment
+ 2-6. Guidelines
+ 2-6-1. Organize Once and Control
+ 2-6-2. Avoid Name Collisions
+3. Resource Distribution Models
+ 3-1. Weights
+ 3-2. Limits
+ 3-3. Protections
+ 3-4. Allocations
+4. Interface Files
+ 4-1. Format
+ 4-2. Conventions
+ 4-3. Core Interface Files
+5. Controllers
+ 5-1. CPU
+ 5-1-1. CPU Interface Files
+ 5-2. Memory
+ 5-2-1. Memory Interface Files
+ 5-2-2. Usage Guidelines
+ 5-2-3. Memory Ownership
+ 5-3. IO
+ 5-3-1. IO Interface Files
+ 5-3-2. Writeback
+P. Information on Kernel Programming
+ P-1. Filesystem Support for Writeback
+D. Deprecated v1 Core Features
+R. Issues with v1 and Rationales for v2
+ R-1. Multiple Hierarchies
+ R-2. Thread Granularity
+ R-3. Competition Between Inner Nodes and Threads
+ R-4. Other Interface Issues
+ R-5. Controller Issues and Remedies
+ R-5-1. Memory
+
+
+1. Introduction
+
+1-1. Terminology
+
+"cgroup" stands for "control group" and is never capitalized. The
+singular form is used to designate the whole feature and also as a
+qualifier as in "cgroup controllers". When explicitly referring to
+multiple individual control groups, the plural form "cgroups" is used.
+
+
+1-2. What is cgroup?
+
+cgroup is a mechanism to organize processes hierarchically and
+distribute system resources along the hierarchy in a controlled and
+configurable manner.
+
+cgroup is largely composed of two parts - the core and controllers.
+cgroup core is primarily responsible for hierarchically organizing
+processes. A cgroup controller is usually responsible for
+distributing a specific type of system resource along the hierarchy
+although there are utility controllers which serve purposes other than
+resource distribution.
+
+cgroups form a tree structure and every process in the system belongs
+to one and only one cgroup. All threads of a process belong to the
+same cgroup. On creation, all processes are put in the cgroup that
+the parent process belongs to at the time. A process can be migrated
+to another cgroup. Migration of a process doesn't affect already
+existing descendant processes.
+
+Following certain structural constraints, controllers may be enabled or
+disabled selectively on a cgroup. All controller behaviors are
+hierarchical - if a controller is enabled on a cgroup, it affects all
+processes which belong to the cgroups consisting the inclusive
+sub-hierarchy of the cgroup. When a controller is enabled on a nested
+cgroup, it always restricts the resource distribution further. The
+restrictions set closer to the root in the hierarchy can not be
+overridden from further away.
+
+
+2. Basic Operations
+
+2-1. Mounting
+
+Unlike v1, cgroup v2 has only single hierarchy. The cgroup v2
+hierarchy can be mounted with the following mount command.
+
+ # mount -t cgroup2 none $MOUNT_POINT
+
+cgroup2 filesystem has the magic number 0x63677270 ("cgrp"). All
+controllers which support v2 and are not bound to a v1 hierarchy are
+automatically bound to the v2 hierarchy and show up at the root.
+Controllers which are not in active use in the v2 hierarchy can be
+bound to other hierarchies. This allows mixing v2 hierarchy with the
+legacy v1 multiple hierarchies in a fully backward compatible way.
+
+A controller can be moved across hierarchies only after the controller
+is no longer referenced in its current hierarchy. Because per-cgroup
+controller states are destroyed asynchronously and controllers may
+have lingering references, a controller may not show up immediately on
+the v2 hierarchy after the final umount of the previous hierarchy.
+Similarly, a controller should be fully disabled to be moved out of
+the unified hierarchy and it may take some time for the disabled
+controller to become available for other hierarchies; furthermore, due
+to inter-controller dependencies, other controllers may need to be
+disabled too.
+
+While useful for development and manual configurations, moving
+controllers dynamically between the v2 and other hierarchies is
+strongly discouraged for production use. It is recommended to decide
+the hierarchies and controller associations before starting using the
+controllers after system boot.
+
+
+2-2. Organizing Processes
+
+Initially, only the root cgroup exists to which all processes belong.
+A child cgroup can be created by creating a sub-directory.
+
+ # mkdir $CGROUP_NAME
+
+A given cgroup may have multiple child cgroups forming a tree
+structure. Each cgroup has a read-writable interface file
+"cgroup.procs". When read, it lists the PIDs of all processes which
+belong to the cgroup one-per-line. The PIDs are not ordered and the
+same PID may show up more than once if the process got moved to
+another cgroup and then back or the PID got recycled while reading.
+
+A process can be migrated into a cgroup by writing its PID to the
+target cgroup's "cgroup.procs" file. Only one process can be migrated
+on a single write(2) call. If a process is composed of multiple
+threads, writing the PID of any thread migrates all threads of the
+process.
+
+When a process forks a child process, the new process is born into the
+cgroup that the forking process belongs to at the time of the
+operation. After exit, a process stays associated with the cgroup
+that it belonged to at the time of exit until it's reaped; however, a
+zombie process does not appear in "cgroup.procs" and thus can't be
+moved to another cgroup.
+
+A cgroup which doesn't have any children or live processes can be
+destroyed by removing the directory. Note that a cgroup which doesn't
+have any children and is associated only with zombie processes is
+considered empty and can be removed.
+
+ # rmdir $CGROUP_NAME
+
+"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy
+cgroup is in use in the system, this file may contain multiple lines,
+one for each hierarchy. The entry for cgroup v2 is always in the
+format "0::$PATH".
+
+ # cat /proc/842/cgroup
+ ...
+ 0::/test-cgroup/test-cgroup-nested
+
+If the process becomes a zombie and the cgroup it was associated with
+is removed subsequently, " (deleted)" is appended to the path.
+
+ # cat /proc/842/cgroup
+ ...
+ 0::/test-cgroup/test-cgroup-nested (deleted)
+
+
+2-3. [Un]populated Notification
+
+Each non-root cgroup has a "cgroup.events" file which contains
+"populated" field indicating whether the cgroup's sub-hierarchy has
+live processes in it. Its value is 0 if there is no live process in
+the cgroup and its descendants; otherwise, 1. poll and [id]notify
+events are triggered when the value changes. This can be used, for
+example, to start a clean-up operation after all processes of a given
+sub-hierarchy have exited. The populated state updates and
+notifications are recursive. Consider the following sub-hierarchy
+where the numbers in the parentheses represent the numbers of processes
+in each cgroup.
+
+ A(4) - B(0) - C(1)
+ \ D(0)
+
+A, B and C's "populated" fields would be 1 while D's 0. After the one
+process in C exits, B and C's "populated" fields would flip to "0" and
+file modified events will be generated on the "cgroup.events" files of
+both cgroups.
+
+
+2-4. Controlling Controllers
+
+2-4-1. Enabling and Disabling
+
+Each cgroup has a "cgroup.controllers" file which lists all
+controllers available for the cgroup to enable.
+
+ # cat cgroup.controllers
+ cpu io memory
+
+No controller is enabled by default. Controllers can be enabled and
+disabled by writing to the "cgroup.subtree_control" file.
+
+ # echo "+cpu +memory -io" > cgroup.subtree_control
+
+Only controllers which are listed in "cgroup.controllers" can be
+enabled. When multiple operations are specified as above, either they
+all succeed or fail. If multiple operations on the same controller
+are specified, the last one is effective.
+
+Enabling a controller in a cgroup indicates that the distribution of
+the target resource across its immediate children will be controlled.
+Consider the following sub-hierarchy. The enabled controllers are
+listed in parentheses.
+
+ A(cpu,memory) - B(memory) - C()
+ \ D()
+
+As A has "cpu" and "memory" enabled, A will control the distribution
+of CPU cycles and memory to its children, in this case, B. As B has
+"memory" enabled but not "CPU", C and D will compete freely on CPU
+cycles but their division of memory available to B will be controlled.
+
+As a controller regulates the distribution of the target resource to
+the cgroup's children, enabling it creates the controller's interface
+files in the child cgroups. In the above example, enabling "cpu" on B
+would create the "cpu." prefixed controller interface files in C and
+D. Likewise, disabling "memory" from B would remove the "memory."
+prefixed controller interface files from C and D. This means that the
+controller interface files - anything which doesn't start with
+"cgroup." are owned by the parent rather than the cgroup itself.
+
+
+2-4-2. Top-down Constraint
+
+Resources are distributed top-down and a cgroup can further distribute
+a resource only if the resource has been distributed to it from the
+parent. This means that all non-root "cgroup.subtree_control" files
+can only contain controllers which are enabled in the parent's
+"cgroup.subtree_control" file. A controller can be enabled only if
+the parent has the controller enabled and a controller can't be
+disabled if one or more children have it enabled.
+
+
+2-4-3. No Internal Process Constraint
+
+Non-root cgroups can only distribute resources to their children when
+they don't have any processes of their own. In other words, only
+cgroups which don't contain any processes can have controllers enabled
+in their "cgroup.subtree_control" files.
+
+This guarantees that, when a controller is looking at the part of the
+hierarchy which has it enabled, processes are always only on the
+leaves. This rules out situations where child cgroups compete against
+internal processes of the parent.
+
+The root cgroup is exempt from this restriction. Root contains
+processes and anonymous resource consumption which can't be associated
+with any other cgroups and requires special treatment from most
+controllers. How resource consumption in the root cgroup is governed
+is up to each controller.
+
+Note that the restriction doesn't get in the way if there is no
+enabled controller in the cgroup's "cgroup.subtree_control". This is
+important as otherwise it wouldn't be possible to create children of a
+populated cgroup. To control resource distribution of a cgroup, the
+cgroup must create children and transfer all its processes to the
+children before enabling controllers in its "cgroup.subtree_control"
+file.
+
+
+2-5. Delegation
+
+2-5-1. Model of Delegation
+
+A cgroup can be delegated to a less privileged user by granting write
+access of the directory and its "cgroup.procs" file to the user. Note
+that resource control interface files in a given directory control the
+distribution of the parent's resources and thus must not be delegated
+along with the directory.
+
+Once delegated, the user can build sub-hierarchy under the directory,
+organize processes as it sees fit and further distribute the resources
+it received from the parent. The limits and other settings of all
+resource controllers are hierarchical and regardless of what happens
+in the delegated sub-hierarchy, nothing can escape the resource
+restrictions imposed by the parent.
+
+Currently, cgroup doesn't impose any restrictions on the number of
+cgroups in or nesting depth of a delegated sub-hierarchy; however,
+this may be limited explicitly in the future.
+
+
+2-5-2. Delegation Containment
+
+A delegated sub-hierarchy is contained in the sense that processes
+can't be moved into or out of the sub-hierarchy by the delegatee. For
+a process with a non-root euid to migrate a target process into a
+cgroup by writing its PID to the "cgroup.procs" file, the following
+conditions must be met.
+
+- The writer's euid must match either uid or suid of the target process.
+
+- The writer must have write access to the "cgroup.procs" file.
+
+- The writer must have write access to the "cgroup.procs" file of the
+ common ancestor of the source and destination cgroups.
+
+The above three constraints ensure that while a delegatee may migrate
+processes around freely in the delegated sub-hierarchy it can't pull
+in from or push out to outside the sub-hierarchy.
+
+For an example, let's assume cgroups C0 and C1 have been delegated to
+user U0 who created C00, C01 under C0 and C10 under C1 as follows and
+all processes under C0 and C1 belong to U0.
+
+ ~~~~~~~~~~~~~ - C0 - C00
+ ~ cgroup ~ \ C01
+ ~ hierarchy ~
+ ~~~~~~~~~~~~~ - C1 - C10
+
+Let's also say U0 wants to write the PID of a process which is
+currently in C10 into "C00/cgroup.procs". U0 has write access to the
+file and uid match on the process; however, the common ancestor of the
+source cgroup C10 and the destination cgroup C00 is above the points
+of delegation and U0 would not have write access to its "cgroup.procs"
+files and thus the write will be denied with -EACCES.
+
+
+2-6. Guidelines
+
+2-6-1. Organize Once and Control
+
+Migrating a process across cgroups is a relatively expensive operation
+and stateful resources such as memory are not moved together with the
+process. This is an explicit design decision as there often exist
+inherent trade-offs between migration and various hot paths in terms
+of synchronization cost.
+
+As such, migrating processes across cgroups frequently as a means to
+apply different resource restrictions is discouraged. A workload
+should be assigned to a cgroup according to the system's logical and
+resource structure once on start-up. Dynamic adjustments to resource
+distribution can be made by changing controller configuration through
+the interface files.
+
+
+2-6-2. Avoid Name Collisions
+
+Interface files for a cgroup and its children cgroups occupy the same
+directory and it is possible to create children cgroups which collide
+with interface files.
+
+All cgroup core interface files are prefixed with "cgroup." and each
+controller's interface files are prefixed with the controller name and
+a dot. A controller's name is composed of lower case alphabets and
+'_'s but never begins with an '_' so it can be used as the prefix
+character for collision avoidance. Also, interface file names won't
+start or end with terms which are often used in categorizing workloads
+such as job, service, slice, unit or workload.
+
+cgroup doesn't do anything to prevent name collisions and it's the
+user's responsibility to avoid them.
+
+
+3. Resource Distribution Models
+
+cgroup controllers implement several resource distribution schemes
+depending on the resource type and expected use cases. This section
+describes major schemes in use along with their expected behaviors.
+
+
+3-1. Weights
+
+A parent's resource is distributed by adding up the weights of all
+active children and giving each the fraction matching the ratio of its
+weight against the sum. As only children which can make use of the
+resource at the moment participate in the distribution, this is
+work-conserving. Due to the dynamic nature, this model is usually
+used for stateless resources.
+
+All weights are in the range [1, 10000] with the default at 100. This
+allows symmetric multiplicative biases in both directions at fine
+enough granularity while staying in the intuitive range.
+
+As long as the weight is in range, all configuration combinations are
+valid and there is no reason to reject configuration changes or
+process migrations.
+
+"cpu.weight" proportionally distributes CPU cycles to active children
+and is an example of this type.
+
+
+3-2. Limits
+
+A child can only consume upto the configured amount of the resource.
+Limits can be over-committed - the sum of the limits of children can
+exceed the amount of resource available to the parent.
+
+Limits are in the range [0, max] and defaults to "max", which is noop.
+
+As limits can be over-committed, all configuration combinations are
+valid and there is no reason to reject configuration changes or
+process migrations.
+
+"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
+on an IO device and is an example of this type.
+
+
+3-3. Protections
+
+A cgroup is protected to be allocated upto the configured amount of
+the resource if the usages of all its ancestors are under their
+protected levels. Protections can be hard guarantees or best effort
+soft boundaries. Protections can also be over-committed in which case
+only upto the amount available to the parent is protected among
+children.
+
+Protections are in the range [0, max] and defaults to 0, which is
+noop.
+
+As protections can be over-committed, all configuration combinations
+are valid and there is no reason to reject configuration changes or
+process migrations.
+
+"memory.low" implements best-effort memory protection and is an
+example of this type.
+
+
+3-4. Allocations
+
+A cgroup is exclusively allocated a certain amount of a finite
+resource. Allocations can't be over-committed - the sum of the
+allocations of children can not exceed the amount of resource
+available to the parent.
+
+Allocations are in the range [0, max] and defaults to 0, which is no
+resource.
+
+As allocations can't be over-committed, some configuration
+combinations are invalid and should be rejected. Also, if the
+resource is mandatory for execution of processes, process migrations
+may be rejected.
+
+"cpu.rt.max" hard-allocates realtime slices and is an example of this
+type.
+
+
+4. Interface Files
+
+4-1. Format
+
+All interface files should be in one of the following formats whenever
+possible.
+
+ New-line separated values
+ (when only one value can be written at once)
+
+ VAL0\n
+ VAL1\n
+ ...
+
+ Space separated values
+ (when read-only or multiple values can be written at once)
+
+ VAL0 VAL1 ...\n
+
+ Flat keyed
+
+ KEY0 VAL0\n
+ KEY1 VAL1\n
+ ...
+
+ Nested keyed
+
+ KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
+ KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
+ ...
+
+For a writable file, the format for writing should generally match
+reading; however, controllers may allow omitting later fields or
+implement restricted shortcuts for most common use cases.
+
+For both flat and nested keyed files, only the values for a single key
+can be written at a time. For nested keyed files, the sub key pairs
+may be specified in any order and not all pairs have to be specified.
+
+
+4-2. Conventions
+
+- Settings for a single feature should be contained in a single file.
+
+- The root cgroup should be exempt from resource control and thus
+ shouldn't have resource control interface files. Also,
+ informational files on the root cgroup which end up showing global
+ information available elsewhere shouldn't exist.
+
+- If a controller implements weight based resource distribution, its
+ interface file should be named "weight" and have the range [1,
+ 10000] with 100 as the default. The values are chosen to allow
+ enough and symmetric bias in both directions while keeping it
+ intuitive (the default is 100%).
+
+- If a controller implements an absolute resource guarantee and/or
+ limit, the interface files should be named "min" and "max"
+ respectively. If a controller implements best effort resource
+ guarantee and/or limit, the interface files should be named "low"
+ and "high" respectively.
+
+ In the above four control files, the special token "max" should be
+ used to represent upward infinity for both reading and writing.
+
+- If a setting has a configurable default value and keyed specific
+ overrides, the default entry should be keyed with "default" and
+ appear as the first entry in the file.
+
+ The default value can be updated by writing either "default $VAL" or
+ "$VAL".
+
+ When writing to update a specific override, "default" can be used as
+ the value to indicate removal of the override. Override entries
+ with "default" as the value must not appear when read.
+
+ For example, a setting which is keyed by major:minor device numbers
+ with integer values may look like the following.
+
+ # cat cgroup-example-interface-file
+ default 150
+ 8:0 300
+
+ The default value can be updated by
+
+ # echo 125 > cgroup-example-interface-file
+
+ or
+
+ # echo "default 125" > cgroup-example-interface-file
+
+ An override can be set by
+
+ # echo "8:16 170" > cgroup-example-interface-file
+
+ and cleared by
+
+ # echo "8:0 default" > cgroup-example-interface-file
+ # cat cgroup-example-interface-file
+ default 125
+ 8:16 170
+
+- For events which are not very high frequency, an interface file
+ "events" should be created which lists event key value pairs.
+ Whenever a notifiable event happens, file modified event should be
+ generated on the file.
+
+
+4-3. Core Interface Files
+
+All cgroup core files are prefixed with "cgroup."
+
+ cgroup.procs
+
+ A read-write new-line separated values file which exists on
+ all cgroups.
+
+ When read, it lists the PIDs of all processes which belong to
+ the cgroup one-per-line. The PIDs are not ordered and the
+ same PID may show up more than once if the process got moved
+ to another cgroup and then back or the PID got recycled while
+ reading.
+
+ A PID can be written to migrate the process associated with
+ the PID to the cgroup. The writer should match all of the
+ following conditions.
+
+ - Its euid is either root or must match either uid or suid of
+ the target process.
+
+ - It must have write access to the "cgroup.procs" file.
+
+ - It must have write access to the "cgroup.procs" file of the
+ common ancestor of the source and destination cgroups.
+
+ When delegating a sub-hierarchy, write access to this file
+ should be granted along with the containing directory.
+
+ cgroup.controllers
+
+ A read-only space separated values file which exists on all
+ cgroups.
+
+ It shows space separated list of all controllers available to
+ the cgroup. The controllers are not ordered.
+
+ cgroup.subtree_control
+
+ A read-write space separated values file which exists on all
+ cgroups. Starts out empty.
+
+ When read, it shows space separated list of the controllers
+ which are enabled to control resource distribution from the
+ cgroup to its children.
+
+ Space separated list of controllers prefixed with '+' or '-'
+ can be written to enable or disable controllers. A controller
+ name prefixed with '+' enables the controller and '-'
+ disables. If a controller appears more than once on the list,
+ the last one is effective. When multiple enable and disable
+ operations are specified, either all succeed or all fail.
+
+ cgroup.events
+
+ A read-only flat-keyed file which exists on non-root cgroups.
+ The following entries are defined. Unless specified
+ otherwise, a value change in this file generates a file
+ modified event.
+
+ populated
+
+ 1 if the cgroup or its descendants contains any live
+ processes; otherwise, 0.
+
+
+5. Controllers
+
+5-1. CPU
+
+[NOTE: The interface for the cpu controller hasn't been merged yet]
+
+The "cpu" controllers regulates distribution of CPU cycles. This
+controller implements weight and absolute bandwidth limit models for
+normal scheduling policy and absolute bandwidth allocation model for
+realtime scheduling policy.
+
+
+5-1-1. CPU Interface Files
+
+All time durations are in microseconds.
+
+ cpu.stat
+
+ A read-only flat-keyed file which exists on non-root cgroups.
+
+ It reports the following six stats.
+
+ usage_usec
+ user_usec
+ system_usec
+ nr_periods
+ nr_throttled
+ throttled_usec
+
+ cpu.weight
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "100".
+
+ The weight in the range [1, 10000].
+
+ cpu.max
+
+ A read-write two value file which exists on non-root cgroups.
+ The default is "max 100000".
+
+ The maximum bandwidth limit. It's in the following format.
+
+ $MAX $PERIOD
+
+ which indicates that the group may consume upto $MAX in each
+ $PERIOD duration. "max" for $MAX indicates no limit. If only
+ one number is written, $MAX is updated.
+
+ cpu.rt.max
+
+ [NOTE: The semantics of this file is still under discussion and the
+ interface hasn't been merged yet]
+
+ A read-write two value file which exists on all cgroups.
+ The default is "0 100000".
+
+ The maximum realtime runtime allocation. Over-committing
+ configurations are disallowed and process migrations are
+ rejected if not enough bandwidth is available. It's in the
+ following format.
+
+ $MAX $PERIOD
+
+ which indicates that the group may consume upto $MAX in each
+ $PERIOD duration. If only one number is written, $MAX is
+ updated.
+
+
+5-2. Memory
+
+The "memory" controller regulates distribution of memory. Memory is
+stateful and implements both limit and protection models. Due to the
+intertwining between memory usage and reclaim pressure and the
+stateful nature of memory, the distribution model is relatively
+complex.
+
+While not completely water-tight, all major memory usages by a given
+cgroup are tracked so that the total memory consumption can be
+accounted and controlled to a reasonable extent. Currently, the
+following types of memory usages are tracked.
+
+- Userland memory - page cache and anonymous memory.
+
+- Kernel data structures such as dentries and inodes.
+
+- TCP socket buffers.
+
+The above list may expand in the future for better coverage.
+
+
+5-2-1. Memory Interface Files
+
+All memory amounts are in bytes. If a value which is not aligned to
+PAGE_SIZE is written, the value may be rounded up to the closest
+PAGE_SIZE multiple when read back.
+
+ memory.current
+
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The total amount of memory currently being used by the cgroup
+ and its descendants.
+
+ memory.low
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ Best-effort memory protection. If the memory usages of a
+ cgroup and all its ancestors are below their low boundaries,
+ the cgroup's memory won't be reclaimed unless memory can be
+ reclaimed from unprotected cgroups.
+
+ Putting more memory than generally available under this
+ protection is discouraged.
+
+ memory.high
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+ Memory usage throttle limit. This is the main mechanism to
+ control memory usage of a cgroup. If a cgroup's usage goes
+ over the high boundary, the processes of the cgroup are
+ throttled and put under heavy reclaim pressure.
+
+ Going over the high limit never invokes the OOM killer and
+ under extreme conditions the limit may be breached.
+
+ memory.max
+
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+ Memory usage hard limit. This is the final protection
+ mechanism. If a cgroup's memory usage reaches this limit and
+ can't be reduced, the OOM killer is invoked in the cgroup.
+ Under certain circumstances, the usage may go over the limit
+ temporarily.
+
+ This is the ultimate protection mechanism. As long as the
+ high limit is used and monitored properly, this limit's
+ utility is limited to providing the final safety net.
+
+ memory.events
+
+ A read-only flat-keyed file which exists on non-root cgroups.
+ The following entries are defined. Unless specified
+ otherwise, a value change in this file generates a file
+ modified event.
+
+ low
+
+ The number of times the cgroup is reclaimed due to
+ high memory pressure even though its usage is under
+ the low boundary. This usually indicates that the low
+ boundary is over-committed.
+
+ high
+
+ The number of times processes of the cgroup are
+ throttled and routed to perform direct memory reclaim
+ because the high memory boundary was exceeded. For a
+ cgroup whose memory usage is capped by the high limit
+ rather than global memory pressure, this event's
+ occurrences are expected.
+
+ max
+
+ The number of times the cgroup's memory usage was
+ about to go over the max boundary. If direct reclaim
+ fails to bring it down, the OOM killer is invoked.
+
+ oom
+
+ The number of times the OOM killer has been invoked in
+ the cgroup. This may not exactly match the number of
+ processes killed but should generally be close.
+
+
+5-2-2. General Usage
+
+"memory.high" is the main mechanism to control memory usage.
+Over-committing on high limit (sum of high limits > available memory)
+and letting global memory pressure to distribute memory according to
+usage is a viable strategy.
+
+Because breach of the high limit doesn't trigger the OOM killer but
+throttles the offending cgroup, a management agent has ample
+opportunities to monitor and take appropriate actions such as granting
+more memory or terminating the workload.
+
+Determining whether a cgroup has enough memory is not trivial as
+memory usage doesn't indicate whether the workload can benefit from
+more memory. For example, a workload which writes data received from
+network to a file can use all available memory but can also operate as
+performant with a small amount of memory. A measure of memory
+pressure - how much the workload is being impacted due to lack of
+memory - is necessary to determine whether a workload needs more
+memory; unfortunately, memory pressure monitoring mechanism isn't
+implemented yet.
+
+
+5-2-3. Memory Ownership
+
+A memory area is charged to the cgroup which instantiated it and stays
+charged to the cgroup until the area is released. Migrating a process
+to a different cgroup doesn't move the memory usages that it
+instantiated while in the previous cgroup to the new cgroup.
+
+A memory area may be used by processes belonging to different cgroups.
+To which cgroup the area will be charged is in-deterministic; however,
+over time, the memory area is likely to end up in a cgroup which has
+enough memory allowance to avoid high reclaim pressure.
+
+If a cgroup sweeps a considerable amount of memory which is expected
+to be accessed repeatedly by other cgroups, it may make sense to use
+POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
+belonging to the affected files to ensure correct memory ownership.
+
+
+5-3. IO
+
+The "io" controller regulates the distribution of IO resources. This
+controller implements both weight based and absolute bandwidth or IOPS
+limit distribution; however, weight based distribution is available
+only if cfq-iosched is in use and neither scheme is available for
+blk-mq devices.
+
+
+5-3-1. IO Interface Files
+
+ io.stat
+
+ A read-only nested-keyed file which exists on non-root
+ cgroups.
+
+ Lines are keyed by $MAJ:$MIN device numbers and not ordered.
+ The following nested keys are defined.
+
+ rbytes Bytes read
+ wbytes Bytes written
+ rios Number of read IOs
+ wios Number of write IOs
+
+ An example read output follows.
+
+ 8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353
+ 8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252
+
+ io.weight
+
+ A read-write flat-keyed file which exists on non-root cgroups.
+ The default is "default 100".
+
+ The first line is the default weight applied to devices
+ without specific override. The rest are overrides keyed by
+ $MAJ:$MIN device numbers and not ordered. The weights are in
+ the range [1, 10000] and specifies the relative amount IO time
+ the cgroup can use in relation to its siblings.
+
+ The default weight can be updated by writing either "default
+ $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing
+ "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default".
+
+ An example read output follows.
+
+ default 100
+ 8:16 200
+ 8:0 50
+
+ io.max
+
+ A read-write nested-keyed file which exists on non-root
+ cgroups.
+
+ BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN
+ device numbers and not ordered. The following nested keys are
+ defined.
+
+ rbps Max read bytes per second
+ wbps Max write bytes per second
+ riops Max read IO operations per second
+ wiops Max write IO operations per second
+
+ When writing, any number of nested key-value pairs can be
+ specified in any order. "max" can be specified as the value
+ to remove a specific limit. If the same key is specified
+ multiple times, the outcome is undefined.
+
+ BPS and IOPS are measured in each IO direction and IOs are
+ delayed if limit is reached. Temporary bursts are allowed.
+
+ Setting read limit at 2M BPS and write at 120 IOPS for 8:16.
+
+ echo "8:16 rbps=2097152 wiops=120" > io.max
+
+ Reading returns the following.
+
+ 8:16 rbps=2097152 wbps=max riops=max wiops=120
+
+ Write IOPS limit can be removed by writing the following.
+
+ echo "8:16 wiops=max" > io.max
+
+ Reading now returns the following.
+
+ 8:16 rbps=2097152 wbps=max riops=max wiops=max
+
+
+5-3-2. Writeback
+
+Page cache is dirtied through buffered writes and shared mmaps and
+written asynchronously to the backing filesystem by the writeback
+mechanism. Writeback sits between the memory and IO domains and
+regulates the proportion of dirty memory by balancing dirtying and
+write IOs.
+
+The io controller, in conjunction with the memory controller,
+implements control of page cache writeback IOs. The memory controller
+defines the memory domain that dirty memory ratio is calculated and
+maintained for and the io controller defines the io domain which
+writes out dirty pages for the memory domain. Both system-wide and
+per-cgroup dirty memory states are examined and the more restrictive
+of the two is enforced.
+
+cgroup writeback requires explicit support from the underlying
+filesystem. Currently, cgroup writeback is implemented on ext2, ext4
+and btrfs. On other filesystems, all writeback IOs are attributed to
+the root cgroup.
+
+There are inherent differences in memory and writeback management
+which affects how cgroup ownership is tracked. Memory is tracked per
+page while writeback per inode. For the purpose of writeback, an
+inode is assigned to a cgroup and all IO requests to write dirty pages
+from the inode are attributed to that cgroup.
+
+As cgroup ownership for memory is tracked per page, there can be pages
+which are associated with different cgroups than the one the inode is
+associated with. These are called foreign pages. The writeback
+constantly keeps track of foreign pages and, if a particular foreign
+cgroup becomes the majority over a certain period of time, switches
+the ownership of the inode to that cgroup.
+
+While this model is enough for most use cases where a given inode is
+mostly dirtied by a single cgroup even when the main writing cgroup
+changes over time, use cases where multiple cgroups write to a single
+inode simultaneously are not supported well. In such circumstances, a
+significant portion of IOs are likely to be attributed incorrectly.
+As memory controller assigns page ownership on the first use and
+doesn't update it until the page is released, even if writeback
+strictly follows page ownership, multiple cgroups dirtying overlapping
+areas wouldn't work as expected. It's recommended to avoid such usage
+patterns.
+
+The sysctl knobs which affect writeback behavior are applied to cgroup
+writeback as follows.
+
+ vm.dirty_background_ratio
+ vm.dirty_ratio
+
+ These ratios apply the same to cgroup writeback with the
+ amount of available memory capped by limits imposed by the
+ memory controller and system-wide clean memory.
+
+ vm.dirty_background_bytes
+ vm.dirty_bytes
+
+ For cgroup writeback, this is calculated into ratio against
+ total available memory and applied the same way as
+ vm.dirty[_background]_ratio.
+
+
+P. Information on Kernel Programming
+
+This section contains kernel programming information in the areas
+where interacting with cgroup is necessary. cgroup core and
+controllers are not covered.
+
+
+P-1. Filesystem Support for Writeback
+
+A filesystem can support cgroup writeback by updating
+address_space_operations->writepage[s]() to annotate bio's using the
+following two functions.
+
+ wbc_init_bio(@wbc, @bio)
+
+ Should be called for each bio carrying writeback data and
+ associates the bio with the inode's owner cgroup. Can be
+ called anytime between bio allocation and submission.
+
+ wbc_account_io(@wbc, @page, @bytes)
+
+ Should be called for each data segment being written out.
+ While this function doesn't care exactly when it's called
+ during the writeback session, it's the easiest and most
+ natural to call it as data segments are added to a bio.
+
+With writeback bio's annotated, cgroup support can be enabled per
+super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
+selective disabling of cgroup writeback support which is helpful when
+certain filesystem features, e.g. journaled data mode, are
+incompatible.
+
+wbc_init_bio() binds the specified bio to its cgroup. Depending on
+the configuration, the bio may be executed at a lower priority and if
+the writeback session is holding shared resources, e.g. a journal
+entry, may lead to priority inversion. There is no one easy solution
+for the problem. Filesystems can try to work around specific problem
+cases by skipping wbc_init_bio() or using bio_associate_blkcg()
+directly.
+
+
+D. Deprecated v1 Core Features
+
+- Multiple hierarchies including named ones are not supported.
+
+- All mount options and remounting are not supported.
+
+- The "tasks" file is removed and "cgroup.procs" is not sorted.
+
+- "cgroup.clone_children" is removed.
+
+- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
+ at the root instead.
+
+
+R. Issues with v1 and Rationales for v2
+
+R-1. Multiple Hierarchies
+
+cgroup v1 allowed an arbitrary number of hierarchies and each
+hierarchy could host any number of controllers. While this seemed to
+provide a high level of flexibility, it wasn't useful in practice.
+
+For example, as there is only one instance of each controller, utility
+type controllers such as freezer which can be useful in all
+hierarchies could only be used in one. The issue is exacerbated by
+the fact that controllers couldn't be moved to another hierarchy once
+hierarchies were populated. Another issue was that all controllers
+bound to a hierarchy were forced to have exactly the same view of the
+hierarchy. It wasn't possible to vary the granularity depending on
+the specific controller.
+
+In practice, these issues heavily limited which controllers could be
+put on the same hierarchy and most configurations resorted to putting
+each controller on its own hierarchy. Only closely related ones, such
+as the cpu and cpuacct controllers, made sense to be put on the same
+hierarchy. This often meant that userland ended up managing multiple
+similar hierarchies repeating the same steps on each hierarchy
+whenever a hierarchy management operation was necessary.
+
+Furthermore, support for multiple hierarchies came at a steep cost.
+It greatly complicated cgroup core implementation but more importantly
+the support for multiple hierarchies restricted how cgroup could be
+used in general and what controllers was able to do.
+
+There was no limit on how many hierarchies there might be, which meant
+that a thread's cgroup membership couldn't be described in finite
+length. The key might contain any number of entries and was unlimited
+in length, which made it highly awkward to manipulate and led to
+addition of controllers which existed only to identify membership,
+which in turn exacerbated the original problem of proliferating number
+of hierarchies.
+
+Also, as a controller couldn't have any expectation regarding the
+topologies of hierarchies other controllers might be on, each
+controller had to assume that all other controllers were attached to
+completely orthogonal hierarchies. This made it impossible, or at
+least very cumbersome, for controllers to cooperate with each other.
+
+In most use cases, putting controllers on hierarchies which are
+completely orthogonal to each other isn't necessary. What usually is
+called for is the ability to have differing levels of granularity
+depending on the specific controller. In other words, hierarchy may
+be collapsed from leaf towards root when viewed from specific
+controllers. For example, a given configuration might not care about
+how memory is distributed beyond a certain level while still wanting
+to control how CPU cycles are distributed.
+
+
+R-2. Thread Granularity
+
+cgroup v1 allowed threads of a process to belong to different cgroups.
+This didn't make sense for some controllers and those controllers
+ended up implementing different ways to ignore such situations but
+much more importantly it blurred the line between API exposed to
+individual applications and system management interface.
+
+Generally, in-process knowledge is available only to the process
+itself; thus, unlike service-level organization of processes,
+categorizing threads of a process requires active participation from
+the application which owns the target process.
+
+cgroup v1 had an ambiguously defined delegation model which got abused
+in combination with thread granularity. cgroups were delegated to
+individual applications so that they can create and manage their own
+sub-hierarchies and control resource distributions along them. This
+effectively raised cgroup to the status of a syscall-like API exposed
+to lay programs.
+
+First of all, cgroup has a fundamentally inadequate interface to be
+exposed this way. For a process to access its own knobs, it has to
+extract the path on the target hierarchy from /proc/self/cgroup,
+construct the path by appending the name of the knob to the path, open
+and then read and/or write to it. This is not only extremely clunky
+and unusual but also inherently racy. There is no conventional way to
+define transaction across the required steps and nothing can guarantee
+that the process would actually be operating on its own sub-hierarchy.
+
+cgroup controllers implemented a number of knobs which would never be
+accepted as public APIs because they were just adding control knobs to
+system-management pseudo filesystem. cgroup ended up with interface
+knobs which were not properly abstracted or refined and directly
+revealed kernel internal details. These knobs got exposed to
+individual applications through the ill-defined delegation mechanism
+effectively abusing cgroup as a shortcut to implementing public APIs
+without going through the required scrutiny.
+
+This was painful for both userland and kernel. Userland ended up with
+misbehaving and poorly abstracted interfaces and kernel exposing and
+locked into constructs inadvertently.
+
+
+R-3. Competition Between Inner Nodes and Threads
+
+cgroup v1 allowed threads to be in any cgroups which created an
+interesting problem where threads belonging to a parent cgroup and its
+children cgroups competed for resources. This was nasty as two
+different types of entities competed and there was no obvious way to
+settle it. Different controllers did different things.
+
+The cpu controller considered threads and cgroups as equivalents and
+mapped nice levels to cgroup weights. This worked for some cases but
+fell flat when children wanted to be allocated specific ratios of CPU
+cycles and the number of internal threads fluctuated - the ratios
+constantly changed as the number of competing entities fluctuated.
+There also were other issues. The mapping from nice level to weight
+wasn't obvious or universal, and there were various other knobs which
+simply weren't available for threads.
+
+The io controller implicitly created a hidden leaf node for each
+cgroup to host the threads. The hidden leaf had its own copies of all
+the knobs with "leaf_" prefixed. While this allowed equivalent
+control over internal threads, it was with serious drawbacks. It
+always added an extra layer of nesting which wouldn't be necessary
+otherwise, made the interface messy and significantly complicated the
+implementation.
+
+The memory controller didn't have a way to control what happened
+between internal tasks and child cgroups and the behavior was not
+clearly defined. There were attempts to add ad-hoc behaviors and
+knobs to tailor the behavior to specific workloads which would have
+led to problems extremely difficult to resolve in the long term.
+
+Multiple controllers struggled with internal tasks and came up with
+different ways to deal with it; unfortunately, all the approaches were
+severely flawed and, furthermore, the widely different behaviors
+made cgroup as a whole highly inconsistent.
+
+This clearly is a problem which needs to be addressed from cgroup core
+in a uniform way.
+
+
+R-4. Other Interface Issues
+
+cgroup v1 grew without oversight and developed a large number of
+idiosyncrasies and inconsistencies. One issue on the cgroup core side
+was how an empty cgroup was notified - a userland helper binary was
+forked and executed for each event. The event delivery wasn't
+recursive or delegatable. The limitations of the mechanism also led
+to in-kernel event delivery filtering mechanism further complicating
+the interface.
+
+Controller interfaces were problematic too. An extreme example is
+controllers completely ignoring hierarchical organization and treating
+all cgroups as if they were all located directly under the root
+cgroup. Some controllers exposed a large amount of inconsistent
+implementation details to userland.
+
+There also was no consistency across controllers. When a new cgroup
+was created, some controllers defaulted to not imposing extra
+restrictions while others disallowed any resource usage until
+explicitly configured. Configuration knobs for the same type of
+control used widely differing naming schemes and formats. Statistics
+and information knobs were named arbitrarily and used different
+formats and units even in the same controller.
+
+cgroup v2 establishes common conventions where appropriate and updates
+controllers so that they expose minimal and consistent interfaces.
+
+
+R-5. Controller Issues and Remedies
+
+R-5-1. Memory
+
+The original lower boundary, the soft limit, is defined as a limit
+that is per default unset. As a result, the set of cgroups that
+global reclaim prefers is opt-in, rather than opt-out. The costs for
+optimizing these mostly negative lookups are so high that the
+implementation, despite its enormous size, does not even provide the
+basic desirable behavior. First off, the soft limit has no
+hierarchical meaning. All configured groups are organized in a global
+rbtree and treated like equal peers, regardless where they are located
+in the hierarchy. This makes subtree delegation impossible. Second,
+the soft limit reclaim pass is so aggressive that it not just
+introduces high allocation latencies into the system, but also impacts
+system performance due to overreclaim, to the point where the feature
+becomes self-defeating.
+
+The memory.low boundary on the other hand is a top-down allocated
+reserve. A cgroup enjoys reclaim protection when it and all its
+ancestors are below their low boundaries, which makes delegation of
+subtrees possible. Secondly, new cgroups have no reserve per default
+and in the common case most cgroups are eligible for the preferred
+reclaim pass. This allows the new low boundary to be efficiently
+implemented with just a minor addition to the generic reclaim code,
+without the need for out-of-band data structures and reclaim passes.
+Because the generic reclaim code considers all cgroups except for the
+ones running low in the preferred first reclaim pass, overreclaim of
+individual groups is eliminated as well, resulting in much better
+overall workload performance.
+
+The original high boundary, the hard limit, is defined as a strict
+limit that can not budge, even if the OOM killer has to be called.
+But this generally goes against the goal of making the most out of the
+available memory. The memory consumption of workloads varies during
+runtime, and that requires users to overcommit. But doing that with a
+strict upper limit requires either a fairly accurate prediction of the
+working set size or adding slack to the limit. Since working set size
+estimation is hard and error prone, and getting it wrong results in
+OOM kills, most users tend to err on the side of a looser limit and
+end up wasting precious resources.
+
+The memory.high boundary on the other hand can be set much more
+conservatively. When hit, it throttles allocations by forcing them
+into direct reclaim to work off the excess, but it never invokes the
+OOM killer. As a result, a high boundary that is chosen too
+aggressively will not terminate the processes, but instead it will
+lead to gradual performance degradation. The user can monitor this
+and make corrections until the minimal memory footprint that still
+gives acceptable performance is found.
+
+In extreme cases, with many concurrent allocations and a complete
+breakdown of reclaim progress within the group, the high boundary can
+be exceeded. But even then it's mostly better to satisfy the
+allocation from the slack available in other groups or the rest of the
+system than killing the group. Otherwise, memory.max is there to
+limit this type of spillover and ultimately contain buggy or even
+malicious applications.
diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt
deleted file mode 100644
index 781b1d4..0000000
--- a/Documentation/cgroups/unified-hierarchy.txt
+++ /dev/null
@@ -1,647 +0,0 @@
-
-Cgroup unified hierarchy
-
-April, 2014 Tejun Heo <tj@kernel.org>
-
-This document describes the changes made by unified hierarchy and
-their rationales. It will eventually be merged into the main cgroup
-documentation.
-
-CONTENTS
-
-1. Background
-2. Basic Operation
- 2-1. Mounting
- 2-2. cgroup.subtree_control
- 2-3. cgroup.controllers
-3. Structural Constraints
- 3-1. Top-down
- 3-2. No internal tasks
-4. Delegation
- 4-1. Model of delegation
- 4-2. Common ancestor rule
-5. Other Changes
- 5-1. [Un]populated Notification
- 5-2. Other Core Changes
- 5-3. Controller File Conventions
- 5-3-1. Format
- 5-3-2. Control Knobs
- 5-4. Per-Controller Changes
- 5-4-1. io
- 5-4-2. cpuset
- 5-4-3. memory
-6. Planned Changes
- 6-1. CAP for resource control
-
-
-1. Background
-
-cgroup allows an arbitrary number of hierarchies and each hierarchy
-can host any number of controllers. While this seems to provide a
-high level of flexibility, it isn't quite useful in practice.
-
-For example, as there is only one instance of each controller, utility
-type controllers such as freezer which can be useful in all
-hierarchies can only be used in one. The issue is exacerbated by the
-fact that controllers can't be moved around once hierarchies are
-populated. Another issue is that all controllers bound to a hierarchy
-are forced to have exactly the same view of the hierarchy. It isn't
-possible to vary the granularity depending on the specific controller.
-
-In practice, these issues heavily limit which controllers can be put
-on the same hierarchy and most configurations resort to putting each
-controller on its own hierarchy. Only closely related ones, such as
-the cpu and cpuacct controllers, make sense to put on the same
-hierarchy. This often means that userland ends up managing multiple
-similar hierarchies repeating the same steps on each hierarchy
-whenever a hierarchy management operation is necessary.
-
-Unfortunately, support for multiple hierarchies comes at a steep cost.
-Internal implementation in cgroup core proper is dazzlingly
-complicated but more importantly the support for multiple hierarchies
-restricts how cgroup is used in general and what controllers can do.
-
-There's no limit on how many hierarchies there may be, which means
-that a task's cgroup membership can't be described in finite length.
-The key may contain any varying number of entries and is unlimited in
-length, which makes it highly awkward to handle and leads to addition
-of controllers which exist only to identify membership, which in turn
-exacerbates the original problem.
-
-Also, as a controller can't have any expectation regarding what shape
-of hierarchies other controllers would be on, each controller has to
-assume that all other controllers are operating on completely
-orthogonal hierarchies. This makes it impossible, or at least very
-cumbersome, for controllers to cooperate with each other.
-
-In most use cases, putting controllers on hierarchies which are
-completely orthogonal to each other isn't necessary. What usually is
-called for is the ability to have differing levels of granularity
-depending on the specific controller. In other words, hierarchy may
-be collapsed from leaf towards root when viewed from specific
-controllers. For example, a given configuration might not care about
-how memory is distributed beyond a certain level while still wanting
-to control how CPU cycles are distributed.
-
-Unified hierarchy is the next version of cgroup interface. It aims to
-address the aforementioned issues by having more structure while
-retaining enough flexibility for most use cases. Various other
-general and controller-specific interface issues are also addressed in
-the process.
-
-
-2. Basic Operation
-
-2-1. Mounting
-
-Currently, unified hierarchy can be mounted with the following mount
-command. Note that this is still under development and scheduled to
-change soon.
-
- mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
-
-All controllers which support the unified hierarchy and are not bound
-to other hierarchies are automatically bound to unified hierarchy and
-show up at the root of it. Controllers which are enabled only in the
-root of unified hierarchy can be bound to other hierarchies. This
-allows mixing unified hierarchy with the traditional multiple
-hierarchies in a fully backward compatible way.
-
-A controller can be moved across hierarchies only after the controller
-is no longer referenced in its current hierarchy. Because per-cgroup
-controller states are destroyed asynchronously and controllers may
-have lingering references, a controller may not show up immediately on
-the unified hierarchy after the final umount of the previous
-hierarchy. Similarly, a controller should be fully disabled to be
-moved out of the unified hierarchy and it may take some time for the
-disabled controller to become available for other hierarchies;
-furthermore, due to dependencies among controllers, other controllers
-may need to be disabled too.
-
-While useful for development and manual configurations, dynamically
-moving controllers between the unified and other hierarchies is
-strongly discouraged for production use. It is recommended to decide
-the hierarchies and controller associations before starting using the
-controllers.
-
-
-2-2. cgroup.subtree_control
-
-All cgroups on unified hierarchy have a "cgroup.subtree_control" file
-which governs which controllers are enabled on the children of the
-cgroup. Let's assume a hierarchy like the following.
-
- root - A - B - C
- \ D
-
-root's "cgroup.subtree_control" file determines which controllers are
-enabled on A. A's on B. B's on C and D. This coincides with the
-fact that controllers on the immediate sub-level are used to
-distribute the resources of the parent. In fact, it's natural to
-assume that resource control knobs of a child belong to its parent.
-Enabling a controller in a "cgroup.subtree_control" file declares that
-distribution of the respective resources of the cgroup will be
-controlled. Note that this means that controller enable states are
-shared among siblings.
-
-When read, the file contains a space-separated list of currently
-enabled controllers. A write to the file should contain a
-space-separated list of controllers with '+' or '-' prefixed (without
-the quotes). Controllers prefixed with '+' are enabled and '-'
-disabled. If a controller is listed multiple times, the last entry
-wins. The specific operations are executed atomically - either all
-succeed or fail.
-
-
-2-3. cgroup.controllers
-
-Read-only "cgroup.controllers" file contains a space-separated list of
-controllers which can be enabled in the cgroup's
-"cgroup.subtree_control" file.
-
-In the root cgroup, this lists controllers which are not bound to
-other hierarchies and the content changes as controllers are bound to
-and unbound from other hierarchies.
-
-In non-root cgroups, the content of this file equals that of the
-parent's "cgroup.subtree_control" file as only controllers enabled
-from the parent can be used in its children.
-
-
-3. Structural Constraints
-
-3-1. Top-down
-
-As it doesn't make sense to nest control of an uncontrolled resource,
-all non-root "cgroup.subtree_control" files can only contain
-controllers which are enabled in the parent's "cgroup.subtree_control"
-file. A controller can be enabled only if the parent has the
-controller enabled and a controller can't be disabled if one or more
-children have it enabled.
-
-
-3-2. No internal tasks
-
-One long-standing issue that cgroup faces is the competition between
-tasks belonging to the parent cgroup and its children cgroups. This
-is inherently nasty as two different types of entities compete and
-there is no agreed-upon obvious way to handle it. Different
-controllers are doing different things.
-
-The cpu controller considers tasks and cgroups as equivalents and maps
-nice levels to cgroup weights. This works for some cases but falls
-flat when children should be allocated specific ratios of CPU cycles
-and the number of internal tasks fluctuates - the ratios constantly
-change as the number of competing entities fluctuates. There also are
-other issues. The mapping from nice level to weight isn't obvious or
-universal, and there are various other knobs which simply aren't
-available for tasks.
-
-The io controller implicitly creates a hidden leaf node for each
-cgroup to host the tasks. The hidden leaf has its own copies of all
-the knobs with "leaf_" prefixed. While this allows equivalent control
-over internal tasks, it's with serious drawbacks. It always adds an
-extra layer of nesting which may not be necessary, makes the interface
-messy and significantly complicates the implementation.
-
-The memory controller currently doesn't have a way to control what
-happens between internal tasks and child cgroups and the behavior is
-not clearly defined. There have been attempts to add ad-hoc behaviors
-and knobs to tailor the behavior to specific workloads. Continuing
-this direction will lead to problems which will be extremely difficult
-to resolve in the long term.
-
-Multiple controllers struggle with internal tasks and came up with
-different ways to deal with it; unfortunately, all the approaches in
-use now are severely flawed and, furthermore, the widely different
-behaviors make cgroup as whole highly inconsistent.
-
-It is clear that this is something which needs to be addressed from
-cgroup core proper in a uniform way so that controllers don't need to
-worry about it and cgroup as a whole shows a consistent and logical
-behavior. To achieve that, unified hierarchy enforces the following
-structural constraint:
-
- Except for the root, only cgroups which don't contain any task may
- have controllers enabled in their "cgroup.subtree_control" files.
-
-Combined with other properties, this guarantees that, when a
-controller is looking at the part of the hierarchy which has it
-enabled, tasks are always only on the leaves. This rules out
-situations where child cgroups compete against internal tasks of the
-parent.
-
-There are two things to note. Firstly, the root cgroup is exempt from
-the restriction. Root contains tasks and anonymous resource
-consumption which can't be associated with any other cgroup and
-requires special treatment from most controllers. How resource
-consumption in the root cgroup is governed is up to each controller.
-
-Secondly, the restriction doesn't take effect if there is no enabled
-controller in the cgroup's "cgroup.subtree_control" file. This is
-important as otherwise it wouldn't be possible to create children of a
-populated cgroup. To control resource distribution of a cgroup, the
-cgroup must create children and transfer all its tasks to the children
-before enabling controllers in its "cgroup.subtree_control" file.
-
-
-4. Delegation
-
-4-1. Model of delegation
-
-A cgroup can be delegated to a less privileged user by granting write
-access of the directory and its "cgroup.procs" file to the user. Note
-that the resource control knobs in a given directory concern the
-resources of the parent and thus must not be delegated along with the
-directory.
-
-Once delegated, the user can build sub-hierarchy under the directory,
-organize processes as it sees fit and further distribute the resources
-it got from the parent. The limits and other settings of all resource
-controllers are hierarchical and regardless of what happens in the
-delegated sub-hierarchy, nothing can escape the resource restrictions
-imposed by the parent.
-
-Currently, cgroup doesn't impose any restrictions on the number of
-cgroups in or nesting depth of a delegated sub-hierarchy; however,
-this may in the future be limited explicitly.
-
-
-4-2. Common ancestor rule
-
-On the unified hierarchy, to write to a "cgroup.procs" file, in
-addition to the usual write permission to the file and uid match, the
-writer must also have write access to the "cgroup.procs" file of the
-common ancestor of the source and destination cgroups. This prevents
-delegatees from smuggling processes across disjoint sub-hierarchies.
-
-Let's say cgroups C0 and C1 have been delegated to user U0 who created
-C00, C01 under C0 and C10 under C1 as follows.
-
- ~~~~~~~~~~~~~ - C0 - C00
- ~ cgroup ~ \ C01
- ~ hierarchy ~
- ~~~~~~~~~~~~~ - C1 - C10
-
-C0 and C1 are separate entities in terms of resource distribution
-regardless of their relative positions in the hierarchy. The
-resources the processes under C0 are entitled to are controlled by
-C0's ancestors and may be completely different from C1. It's clear
-that the intention of delegating C0 to U0 is allowing U0 to organize
-the processes under C0 and further control the distribution of C0's
-resources.
-
-On traditional hierarchies, if a task has write access to "tasks" or
-"cgroup.procs" file of a cgroup and its uid agrees with the target, it
-can move the target to the cgroup. In the above example, U0 will not
-only be able to move processes in each sub-hierarchy but also across
-the two sub-hierarchies, effectively allowing it to violate the
-organizational and resource restrictions implied by the hierarchical
-structure above C0 and C1.
-
-On the unified hierarchy, let's say U0 wants to write the pid of a
-process which has a matching uid and is currently in C10 into
-"C00/cgroup.procs". U0 obviously has write access to the file and
-migration permission on the process; however, the common ancestor of
-the source cgroup C10 and the destination cgroup C00 is above the
-points of delegation and U0 would not have write access to its
-"cgroup.procs" and thus be denied with -EACCES.
-
-
-5. Other Changes
-
-5-1. [Un]populated Notification
-
-cgroup users often need a way to determine when a cgroup's
-subhierarchy becomes empty so that it can be cleaned up. cgroup
-currently provides release_agent for it; unfortunately, this mechanism
-is riddled with issues.
-
-- It delivers events by forking and execing a userland binary
- specified as the release_agent. This is a long deprecated method of
- notification delivery. It's extremely heavy, slow and cumbersome to
- integrate with larger infrastructure.
-
-- There is single monitoring point at the root. There's no way to
- delegate management of a subtree.
-
-- The event isn't recursive. It triggers when a cgroup doesn't have
- any tasks or child cgroups. Events for internal nodes trigger only
- after all children are removed. This again makes it impossible to
- delegate management of a subtree.
-
-- Events are filtered from the kernel side. A "notify_on_release"
- file is used to subscribe to or suppress release events. This is
- unnecessarily complicated and probably done this way because event
- delivery itself was expensive.
-
-Unified hierarchy implements "populated" field in "cgroup.events"
-interface file which can be used to monitor whether the cgroup's
-subhierarchy has tasks in it or not. Its value is 0 if there is no
-task in the cgroup and its descendants; otherwise, 1. poll and
-[id]notify events are triggered when the value changes.
-
-This is significantly lighter and simpler and trivially allows
-delegating management of subhierarchy - subhierarchy monitoring can
-block further propagation simply by putting itself or another process
-in the subhierarchy and monitor events that it's interested in from
-there without interfering with monitoring higher in the tree.
-
-In unified hierarchy, the release_agent mechanism is no longer
-supported and the interface files "release_agent" and
-"notify_on_release" do not exist.
-
-
-5-2. Other Core Changes
-
-- None of the mount options is allowed.
-
-- remount is disallowed.
-
-- rename(2) is disallowed.
-
-- The "tasks" file is removed. Everything should at process
- granularity. Use the "cgroup.procs" file instead.
-
-- The "cgroup.procs" file is not sorted. pids will be unique unless
- they got recycled in-between reads.
-
-- The "cgroup.clone_children" file is removed.
-
-- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged
- to before exiting. If the cgroup is removed before the zombie is
- reaped, " (deleted)" is appeneded to the path.
-
-
-5-3. Controller File Conventions
-
-5-3-1. Format
-
-In general, all controller files should be in one of the following
-formats whenever possible.
-
-- Values only files
-
- VAL0 VAL1...\n
-
-- Flat keyed files
-
- KEY0 VAL0\n
- KEY1 VAL1\n
- ...
-
-- Nested keyed files
-
- KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
- KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
- ...
-
-For a writeable file, the format for writing should generally match
-reading; however, controllers may allow omitting later fields or
-implement restricted shortcuts for most common use cases.
-
-For both flat and nested keyed files, only the values for a single key
-can be written at a time. For nested keyed files, the sub key pairs
-may be specified in any order and not all pairs have to be specified.
-
-
-5-3-2. Control Knobs
-
-- Settings for a single feature should generally be implemented in a
- single file.
-
-- In general, the root cgroup should be exempt from resource control
- and thus shouldn't have resource control knobs.
-
-- If a controller implements ratio based resource distribution, the
- control knob should be named "weight" and have the range [1, 10000]
- and 100 should be the default value. The values are chosen to allow
- enough and symmetric bias in both directions while keeping it
- intuitive (the default is 100%).
-
-- If a controller implements an absolute resource guarantee and/or
- limit, the control knobs should be named "min" and "max"
- respectively. If a controller implements best effort resource
- gurantee and/or limit, the control knobs should be named "low" and
- "high" respectively.
-
- In the above four control files, the special token "max" should be
- used to represent upward infinity for both reading and writing.
-
-- If a setting has configurable default value and specific overrides,
- the default settings should be keyed with "default" and appear as
- the first entry in the file. Specific entries can use "default" as
- its value to indicate inheritance of the default value.
-
-- For events which are not very high frequency, an interface file
- "events" should be created which lists event key value pairs.
- Whenever a notifiable event happens, file modified event should be
- generated on the file.
-
-
-5-4. Per-Controller Changes
-
-5-4-1. io
-
-- blkio is renamed to io. The interface is overhauled anyway. The
- new name is more in line with the other two major controllers, cpu
- and memory, and better suited given that it may be used for cgroup
- writeback without involving block layer.
-
-- Everything including stat is always hierarchical making separate
- recursive stat files pointless and, as no internal node can have
- tasks, leaf weights are meaningless. The operation model is
- simplified and the interface is overhauled accordingly.
-
- io.stat
-
- The stat file. The reported stats are from the point where
- bio's are issued to request_queue. The stats are counted
- independent of which policies are enabled. Each line in the
- file follows the following format. More fields may later be
- added at the end.
-
- $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wrios=$WIOS
-
- io.weight
-
- The weight setting, currently only available and effective if
- cfq-iosched is in use for the target device. The weight is
- between 1 and 10000 and defaults to 100. The first line
- always contains the default weight in the following format to
- use when per-device setting is missing.
-
- default $WEIGHT
-
- Subsequent lines list per-device weights of the following
- format.
-
- $MAJ:$MIN $WEIGHT
-
- Writing "$WEIGHT" or "default $WEIGHT" changes the default
- setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
- while "$MAJ:$MIN default" clears it.
-
- This file is available only on non-root cgroups.
-
- io.max
-
- The maximum bandwidth and/or iops setting, only available if
- blk-throttle is enabled. The file is of the following format.
-
- $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS
-
- ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
- read/write IOs per second. "max" indicates no limit. Writing
- to the file follows the same format but the individual
- settings may be omitted or specified in any order.
-
- This file is available only on non-root cgroups.
-
-
-5-4-2. cpuset
-
-- Tasks are kept in empty cpusets after hotplug and take on the masks
- of the nearest non-empty ancestor, instead of being moved to it.
-
-- A task can be moved into an empty cpuset, and again it takes on the
- masks of the nearest non-empty ancestor.
-
-
-5-4-3. memory
-
-- use_hierarchy is on by default and the cgroup file for the flag is
- not created.
-
-- The original lower boundary, the soft limit, is defined as a limit
- that is per default unset. As a result, the set of cgroups that
- global reclaim prefers is opt-in, rather than opt-out. The costs
- for optimizing these mostly negative lookups are so high that the
- implementation, despite its enormous size, does not even provide the
- basic desirable behavior. First off, the soft limit has no
- hierarchical meaning. All configured groups are organized in a
- global rbtree and treated like equal peers, regardless where they
- are located in the hierarchy. This makes subtree delegation
- impossible. Second, the soft limit reclaim pass is so aggressive
- that it not just introduces high allocation latencies into the
- system, but also impacts system performance due to overreclaim, to
- the point where the feature becomes self-defeating.
-
- The memory.low boundary on the other hand is a top-down allocated
- reserve. A cgroup enjoys reclaim protection when it and all its
- ancestors are below their low boundaries, which makes delegation of
- subtrees possible. Secondly, new cgroups have no reserve per
- default and in the common case most cgroups are eligible for the
- preferred reclaim pass. This allows the new low boundary to be
- efficiently implemented with just a minor addition to the generic
- reclaim code, without the need for out-of-band data structures and
- reclaim passes. Because the generic reclaim code considers all
- cgroups except for the ones running low in the preferred first
- reclaim pass, overreclaim of individual groups is eliminated as
- well, resulting in much better overall workload performance.
-
-- The original high boundary, the hard limit, is defined as a strict
- limit that can not budge, even if the OOM killer has to be called.
- But this generally goes against the goal of making the most out of
- the available memory. The memory consumption of workloads varies
- during runtime, and that requires users to overcommit. But doing
- that with a strict upper limit requires either a fairly accurate
- prediction of the working set size or adding slack to the limit.
- Since working set size estimation is hard and error prone, and
- getting it wrong results in OOM kills, most users tend to err on the
- side of a looser limit and end up wasting precious resources.
-
- The memory.high boundary on the other hand can be set much more
- conservatively. When hit, it throttles allocations by forcing them
- into direct reclaim to work off the excess, but it never invokes the
- OOM killer. As a result, a high boundary that is chosen too
- aggressively will not terminate the processes, but instead it will
- lead to gradual performance degradation. The user can monitor this
- and make corrections until the minimal memory footprint that still
- gives acceptable performance is found.
-
- In extreme cases, with many concurrent allocations and a complete
- breakdown of reclaim progress within the group, the high boundary
- can be exceeded. But even then it's mostly better to satisfy the
- allocation from the slack available in other groups or the rest of
- the system than killing the group. Otherwise, memory.max is there
- to limit this type of spillover and ultimately contain buggy or even
- malicious applications.
-
-- The original control file names are unwieldy and inconsistent in
- many different ways. For example, the upper boundary hit count is
- exported in the memory.failcnt file, but an OOM event count has to
- be manually counted by listening to memory.oom_control events, and
- lower boundary / soft limit events have to be counted by first
- setting a threshold for that value and then counting those events.
- Also, usage and limit files encode their units in the filename.
- That makes the filenames very long, even though this is not
- information that a user needs to be reminded of every time they type
- out those names.
-
- To address these naming issues, as well as to signal clearly that
- the new interface carries a new configuration model, the naming
- conventions in it necessarily differ from the old interface.
-
-- The original limit files indicate the state of an unset limit with a
- Very High Number, and a configured limit can be unset by echoing -1
- into those files. But that very high number is implementation and
- architecture dependent and not very descriptive. And while -1 can
- be understood as an underflow into the highest possible value, -2 or
- -10M etc. do not work, so it's not consistent.
-
- memory.low, memory.high, and memory.max will use the string "max" to
- indicate and set the highest possible value.
-
-6. Planned Changes
-
-6-1. CAP for resource control
-
-Unified hierarchy will require one of the capabilities(7), which is
-yet to be decided, for all resource control related knobs. Process
-organization operations - creation of sub-cgroups and migration of
-processes in sub-hierarchies may be delegated by changing the
-ownership and/or permissions on the cgroup directory and
-"cgroup.procs" interface file; however, all operations which affect
-resource control - writes to a "cgroup.subtree_control" file or any
-controller-specific knobs - will require an explicit CAP privilege.
-
-This, in part, is to prevent the cgroup interface from being
-inadvertently promoted to programmable API used by non-privileged
-binaries. cgroup exposes various aspects of the system in ways which
-aren't properly abstracted for direct consumption by regular programs.
-This is an administration interface much closer to sysctl knobs than
-system calls. Even the basic access model, being filesystem path
-based, isn't suitable for direct consumption. There's no way to
-access "my cgroup" in a race-free way or make multiple operations
-atomic against migration to another cgroup.
-
-Another aspect is that, for better or for worse, the cgroup interface
-goes through far less scrutiny than regular interfaces for
-unprivileged userland. The upside is that cgroup is able to expose
-useful features which may not be suitable for general consumption in a
-reasonable time frame. It provides a relatively short path between
-internal details and userland-visible interface. Of course, this
-shortcut comes with high risk. We go through what we go through for
-general kernel APIs for good reasons. It may end up leaking internal
-details in a way which can exert significant pain by locking the
-kernel into a contract that can't be maintained in a reasonable
-manner.
-
-Also, due to the specific nature, cgroup and its controllers don't
-tend to attract attention from a wide scope of developers. cgroup's
-short history is already fraught with severely mis-designed
-interfaces, unnecessary commitments to and exposing of internal
-details, broken and dangerous implementations of various features.
-
-Keeping cgroup as an administration interface is both advantageous for
-its role and imperative given its nature. Some of the cgroup features
-may make sense for unprivileged access. If deemed justified, those
-must be further abstracted and implemented as a different interface,
-be it a system call or process-private filesystem, and survive through
-the scrutiny that any interface for general consumption is required to
-go through.
-
-Requiring CAP is not a complete solution but should serve as a
-significant deterrent against spraying cgroup usages in non-privileged
-programs.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index e5f4164..7f540f7 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -34,17 +34,12 @@
/* define the enumeration of all cgroup subsystems */
#define SUBSYS(_x) _x ## _cgrp_id,
-#define SUBSYS_TAG(_t) CGROUP_ ## _t, \
- __unused_tag_ ## _t = CGROUP_ ## _t - 1,
enum cgroup_subsys_id {
#include <linux/cgroup_subsys.h>
CGROUP_SUBSYS_COUNT,
};
-#undef SUBSYS_TAG
#undef SUBSYS
-#define CGROUP_CANFORK_COUNT (CGROUP_CANFORK_END - CGROUP_CANFORK_START)
-
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_NO_REF = (1 << 0), /* no reference counting for this css */
@@ -66,7 +61,6 @@
/* cgroup_root->flags */
enum {
- CGRP_ROOT_SANE_BEHAVIOR = (1 << 0), /* __DEVEL__sane_behavior specified */
CGRP_ROOT_NOPREFIX = (1 << 1), /* mounted subsystems have no named prefix */
CGRP_ROOT_XATTR = (1 << 2), /* supports extended attributes */
};
@@ -439,9 +433,9 @@
int (*can_attach)(struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup_taskset *tset);
void (*attach)(struct cgroup_taskset *tset);
- int (*can_fork)(struct task_struct *task, void **priv_p);
- void (*cancel_fork)(struct task_struct *task, void *priv);
- void (*fork)(struct task_struct *task, void *priv);
+ int (*can_fork)(struct task_struct *task);
+ void (*cancel_fork)(struct task_struct *task);
+ void (*fork)(struct task_struct *task);
void (*exit)(struct task_struct *task);
void (*free)(struct task_struct *task);
void (*bind)(struct cgroup_subsys_state *root_css);
@@ -527,7 +521,6 @@
#else /* CONFIG_CGROUPS */
-#define CGROUP_CANFORK_COUNT 0
#define CGROUP_SUBSYS_COUNT 0
static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk) {}
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 322a284..2162dca 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -97,12 +97,9 @@
struct pid *pid, struct task_struct *tsk);
void cgroup_fork(struct task_struct *p);
-extern int cgroup_can_fork(struct task_struct *p,
- void *ss_priv[CGROUP_CANFORK_COUNT]);
-extern void cgroup_cancel_fork(struct task_struct *p,
- void *ss_priv[CGROUP_CANFORK_COUNT]);
-extern void cgroup_post_fork(struct task_struct *p,
- void *old_ss_priv[CGROUP_CANFORK_COUNT]);
+extern int cgroup_can_fork(struct task_struct *p);
+extern void cgroup_cancel_fork(struct task_struct *p);
+extern void cgroup_post_fork(struct task_struct *p);
void cgroup_exit(struct task_struct *p);
void cgroup_free(struct task_struct *p);
@@ -562,13 +559,9 @@
struct dentry *dentry) { return -EINVAL; }
static inline void cgroup_fork(struct task_struct *p) {}
-static inline int cgroup_can_fork(struct task_struct *p,
- void *ss_priv[CGROUP_CANFORK_COUNT])
-{ return 0; }
-static inline void cgroup_cancel_fork(struct task_struct *p,
- void *ss_priv[CGROUP_CANFORK_COUNT]) {}
-static inline void cgroup_post_fork(struct task_struct *p,
- void *ss_priv[CGROUP_CANFORK_COUNT]) {}
+static inline int cgroup_can_fork(struct task_struct *p) { return 0; }
+static inline void cgroup_cancel_fork(struct task_struct *p) {}
+static inline void cgroup_post_fork(struct task_struct *p) {}
static inline void cgroup_exit(struct task_struct *p) {}
static inline void cgroup_free(struct task_struct *p) {}
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 1a96fda..0df0336a 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -6,14 +6,8 @@
/*
* This file *must* be included with SUBSYS() defined.
- * SUBSYS_TAG() is a noop if undefined.
*/
-#ifndef SUBSYS_TAG
-#define __TMP_SUBSYS_TAG
-#define SUBSYS_TAG(_x)
-#endif
-
#if IS_ENABLED(CONFIG_CPUSETS)
SUBSYS(cpuset)
#endif
@@ -58,17 +52,10 @@
SUBSYS(hugetlb)
#endif
-/*
- * Subsystems that implement the can_fork() family of callbacks.
- */
-SUBSYS_TAG(CANFORK_START)
-
#if IS_ENABLED(CONFIG_CGROUP_PIDS)
SUBSYS(pids)
#endif
-SUBSYS_TAG(CANFORK_END)
-
/*
* The following subsystems are not supported on the default hierarchy.
*/
@@ -76,11 +63,6 @@
SUBSYS(debug)
#endif
-#ifdef __TMP_SUBSYS_TAG
-#undef __TMP_SUBSYS_TAG
-#undef SUBSYS_TAG
-#endif
-
/*
* DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
*/
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index accb036..b283d56 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -54,6 +54,7 @@
#define SMB_SUPER_MAGIC 0x517B
#define CGROUP_SUPER_MAGIC 0x27e0eb
+#define CGROUP2_SUPER_MAGIC 0x63677270
#define STACK_END_MAGIC 0x57AC6E9D
diff --git a/init/Kconfig b/init/Kconfig
index 235c7a2..5481b49 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -940,95 +940,24 @@
if CGROUPS
-config CGROUP_DEBUG
- bool "Example debug cgroup subsystem"
- default n
- help
- This option enables a simple cgroup subsystem that
- exports useful debugging information about the cgroups
- framework.
-
- Say N if unsure.
-
-config CGROUP_FREEZER
- bool "Freezer cgroup subsystem"
- help
- Provides a way to freeze and unfreeze all tasks in a
- cgroup.
-
-config CGROUP_PIDS
- bool "PIDs cgroup subsystem"
- help
- Provides enforcement of process number limits in the scope of a
- cgroup. Any attempt to fork more processes than is allowed in the
- cgroup will fail. PIDs are fundamentally a global resource because it
- is fairly trivial to reach PID exhaustion before you reach even a
- conservative kmemcg limit. As a result, it is possible to grind a
- system to halt without being limited by other cgroup policies. The
- PIDs cgroup subsystem is designed to stop this from happening.
-
- It should be noted that organisational operations (such as attaching
- to a cgroup hierarchy will *not* be blocked by the PIDs subsystem),
- since the PIDs limit only affects a process's ability to fork, not to
- attach to a cgroup.
-
-config CGROUP_DEVICE
- bool "Device controller for cgroups"
- help
- Provides a cgroup implementing whitelists for devices which
- a process in the cgroup can mknod or open.
-
-config CPUSETS
- bool "Cpuset support"
- help
- This option will let you create and manage CPUSETs which
- allow dynamically partitioning a system into sets of CPUs and
- Memory Nodes and assigning tasks to run only within those sets.
- This is primarily useful on large SMP or NUMA systems.
-
- Say N if unsure.
-
-config PROC_PID_CPUSET
- bool "Include legacy /proc/<pid>/cpuset file"
- depends on CPUSETS
- default y
-
-config CGROUP_CPUACCT
- bool "Simple CPU accounting cgroup subsystem"
- help
- Provides a simple Resource Controller for monitoring the
- total CPU consumed by the tasks in a cgroup.
-
config PAGE_COUNTER
bool
config MEMCG
- bool "Memory Resource Controller for Control Groups"
+ bool "Memory controller"
select PAGE_COUNTER
select EVENTFD
help
- Provides a memory resource controller that manages both anonymous
- memory and page cache. (See Documentation/cgroups/memory.txt)
+ Provides control over the memory footprint of tasks in a cgroup.
config MEMCG_SWAP
- bool "Memory Resource Controller Swap Extension"
+ bool "Swap controller"
depends on MEMCG && SWAP
help
- Add swap management feature to memory resource controller. When you
- enable this, you can limit mem+swap usage per cgroup. In other words,
- when you disable this, memory resource controller has no cares to
- usage of swap...a process can exhaust all of the swap. This extension
- is useful when you want to avoid exhaustion swap but this itself
- adds more overheads and consumes memory for remembering information.
- Especially if you use 32bit system or small memory system, please
- be careful about enabling this. When memory resource controller
- is disabled by boot option, this will be automatically disabled and
- there will be no overhead from this. Even when you set this config=y,
- if boot option "swapaccount=0" is set, swap will not be accounted.
- Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
- size is 4096bytes, 512k per 1Gbytes of swap.
+ Provides control over the swap space consumed by tasks in a cgroup.
+
config MEMCG_SWAP_ENABLED
- bool "Memory Resource Controller Swap Extension enabled by default"
+ bool "Swap controller enabled by default"
depends on MEMCG_SWAP
default y
help
@@ -1052,34 +981,43 @@
the kmem extension can use it to guarantee that no group of processes
will ever exhaust kernel resources alone.
-config CGROUP_HUGETLB
- bool "HugeTLB Resource Controller for Control Groups"
- depends on HUGETLB_PAGE
- select PAGE_COUNTER
+config BLK_CGROUP
+ bool "IO controller"
+ depends on BLOCK
default n
- help
- Provides a cgroup Resource Controller for HugeTLB pages.
- When you enable this, you can put a per cgroup limit on HugeTLB usage.
- The limit is enforced during page fault. Since HugeTLB doesn't
- support page reclaim, enforcing the limit at page fault time implies
- that, the application will get SIGBUS signal if it tries to access
- HugeTLB pages beyond its limit. This requires the application to know
- beforehand how much HugeTLB pages it would require for its use. The
- control group is tracked in the third page lru pointer. This means
- that we cannot use the controller with huge page less than 3 pages.
+ ---help---
+ Generic block IO controller cgroup interface. This is the common
+ cgroup interface which should be used by various IO controlling
+ policies.
-config CGROUP_PERF
- bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
- depends on PERF_EVENTS && CGROUPS
- help
- This option extends the per-cpu mode to restrict monitoring to
- threads which belong to the cgroup specified and run on the
- designated cpu.
+ Currently, CFQ IO scheduler uses it to recognize task groups and
+ control disk bandwidth allocation (proportional time slice allocation)
+ to such task groups. It is also used by bio throttling logic in
+ block layer to implement upper limit in IO rates on a device.
- Say N if unsure.
+ This option only enables generic Block IO controller infrastructure.
+ One needs to also enable actual IO controlling logic/policy. For
+ enabling proportional weight division of disk bandwidth in CFQ, set
+ CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
+ CONFIG_BLK_DEV_THROTTLING=y.
+
+ See Documentation/cgroups/blkio-controller.txt for more information.
+
+config DEBUG_BLK_CGROUP
+ bool "IO controller debugging"
+ depends on BLK_CGROUP
+ default n
+ ---help---
+ Enable some debugging help. Currently it exports additional stat
+ files in a cgroup which can be useful for debugging.
+
+config CGROUP_WRITEBACK
+ bool
+ depends on MEMCG && BLK_CGROUP
+ default y
menuconfig CGROUP_SCHED
- bool "Group CPU scheduler"
+ bool "CPU controller"
default n
help
This feature lets CPU scheduler recognize task groups and control CPU
@@ -1116,41 +1054,90 @@
endif #CGROUP_SCHED
-config BLK_CGROUP
- bool "Block IO controller"
- depends on BLOCK
+config CGROUP_PIDS
+ bool "PIDs controller"
+ help
+ Provides enforcement of process number limits in the scope of a
+ cgroup. Any attempt to fork more processes than is allowed in the
+ cgroup will fail. PIDs are fundamentally a global resource because it
+ is fairly trivial to reach PID exhaustion before you reach even a
+ conservative kmemcg limit. As a result, it is possible to grind a
+ system to halt without being limited by other cgroup policies. The
+ PIDs cgroup subsystem is designed to stop this from happening.
+
+ It should be noted that organisational operations (such as attaching
+ to a cgroup hierarchy will *not* be blocked by the PIDs subsystem),
+ since the PIDs limit only affects a process's ability to fork, not to
+ attach to a cgroup.
+
+config CGROUP_FREEZER
+ bool "Freezer controller"
+ help
+ Provides a way to freeze and unfreeze all tasks in a
+ cgroup.
+
+config CGROUP_HUGETLB
+ bool "HugeTLB controller"
+ depends on HUGETLB_PAGE
+ select PAGE_COUNTER
default n
- ---help---
- Generic block IO controller cgroup interface. This is the common
- cgroup interface which should be used by various IO controlling
- policies.
+ help
+ Provides a cgroup controller for HugeTLB pages.
+ When you enable this, you can put a per cgroup limit on HugeTLB usage.
+ The limit is enforced during page fault. Since HugeTLB doesn't
+ support page reclaim, enforcing the limit at page fault time implies
+ that, the application will get SIGBUS signal if it tries to access
+ HugeTLB pages beyond its limit. This requires the application to know
+ beforehand how much HugeTLB pages it would require for its use. The
+ control group is tracked in the third page lru pointer. This means
+ that we cannot use the controller with huge page less than 3 pages.
- Currently, CFQ IO scheduler uses it to recognize task groups and
- control disk bandwidth allocation (proportional time slice allocation)
- to such task groups. It is also used by bio throttling logic in
- block layer to implement upper limit in IO rates on a device.
+config CPUSETS
+ bool "Cpuset controller"
+ help
+ This option will let you create and manage CPUSETs which
+ allow dynamically partitioning a system into sets of CPUs and
+ Memory Nodes and assigning tasks to run only within those sets.
+ This is primarily useful on large SMP or NUMA systems.
- This option only enables generic Block IO controller infrastructure.
- One needs to also enable actual IO controlling logic/policy. For
- enabling proportional weight division of disk bandwidth in CFQ, set
- CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set
- CONFIG_BLK_DEV_THROTTLING=y.
+ Say N if unsure.
- See Documentation/cgroups/blkio-controller.txt for more information.
-
-config DEBUG_BLK_CGROUP
- bool "Enable Block IO controller debugging"
- depends on BLK_CGROUP
- default n
- ---help---
- Enable some debugging help. Currently it exports additional stat
- files in a cgroup which can be useful for debugging.
-
-config CGROUP_WRITEBACK
- bool
- depends on MEMCG && BLK_CGROUP
+config PROC_PID_CPUSET
+ bool "Include legacy /proc/<pid>/cpuset file"
+ depends on CPUSETS
default y
+config CGROUP_DEVICE
+ bool "Device controller"
+ help
+ Provides a cgroup controller implementing whitelists for
+ devices which a process in the cgroup can mknod or open.
+
+config CGROUP_CPUACCT
+ bool "Simple CPU accounting controller"
+ help
+ Provides a simple controller for monitoring the
+ total CPU consumed by the tasks in a cgroup.
+
+config CGROUP_PERF
+ bool "Perf controller"
+ depends on PERF_EVENTS
+ help
+ This option extends the perf per-cpu mode to restrict monitoring
+ to threads which belong to the cgroup specified and run on the
+ designated cpu.
+
+ Say N if unsure.
+
+config CGROUP_DEBUG
+ bool "Example controller"
+ default n
+ help
+ This option enables a simple controller that exports
+ debugging information about the cgroups framework.
+
+ Say N.
+
endif # CGROUPS
config CHECKPOINT_RESTORE
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index fe95970..c03a640 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -211,6 +211,7 @@
/* Ditto for the can_fork callback. */
static unsigned long have_canfork_callback __read_mostly;
+static struct file_system_type cgroup2_fs_type;
static struct cftype cgroup_dfl_base_files[];
static struct cftype cgroup_legacy_base_files[];
@@ -1623,10 +1624,6 @@
all_ss = true;
continue;
}
- if (!strcmp(token, "__DEVEL__sane_behavior")) {
- opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
- continue;
- }
if (!strcmp(token, "noprefix")) {
opts->flags |= CGRP_ROOT_NOPREFIX;
continue;
@@ -1693,15 +1690,6 @@
return -ENOENT;
}
- if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
- pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
- if (nr_opts != 1) {
- pr_err("sane_behavior: no other mount options allowed\n");
- return -EINVAL;
- }
- return 0;
- }
-
/*
* If the 'all' option was specified select all the subsystems,
* otherwise if 'none', 'name=' and a subsystem name options were
@@ -1981,6 +1969,7 @@
int flags, const char *unused_dev_name,
void *data)
{
+ bool is_v2 = fs_type == &cgroup2_fs_type;
struct super_block *pinned_sb = NULL;
struct cgroup_subsys *ss;
struct cgroup_root *root;
@@ -1997,6 +1986,17 @@
if (!use_task_css_set_links)
cgroup_enable_task_cg_lists();
+ if (is_v2) {
+ if (data) {
+ pr_err("cgroup2: unknown option \"%s\"\n", (char *)data);
+ return ERR_PTR(-EINVAL);
+ }
+ cgrp_dfl_root_visible = true;
+ root = &cgrp_dfl_root;
+ cgroup_get(&root->cgrp);
+ goto out_mount;
+ }
+
mutex_lock(&cgroup_mutex);
/* First find the desired set of subsystems */
@@ -2004,15 +2004,6 @@
if (ret)
goto out_unlock;
- /* look for a matching existing root */
- if (opts.flags & CGRP_ROOT_SANE_BEHAVIOR) {
- cgrp_dfl_root_visible = true;
- root = &cgrp_dfl_root;
- cgroup_get(&root->cgrp);
- ret = 0;
- goto out_unlock;
- }
-
/*
* Destruction of cgroup root is asynchronous, so subsystems may
* still be dying after the previous unmount. Let's drain the
@@ -2123,9 +2114,10 @@
if (ret)
return ERR_PTR(ret);
-
+out_mount:
dentry = kernfs_mount(fs_type, flags, root->kf_root,
- CGROUP_SUPER_MAGIC, &new_sb);
+ is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
+ &new_sb);
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
@@ -2168,6 +2160,12 @@
.kill_sb = cgroup_kill_sb,
};
+static struct file_system_type cgroup2_fs_type = {
+ .name = "cgroup2",
+ .mount = cgroup_mount,
+ .kill_sb = cgroup_kill_sb,
+};
+
/**
* task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
* @task: target task
@@ -4039,7 +4037,7 @@
goto out_err;
/*
- * Migrate tasks one-by-one until @form is empty. This fails iff
+ * Migrate tasks one-by-one until @from is empty. This fails iff
* ->can_attach() fails.
*/
do {
@@ -5171,7 +5169,7 @@
{
struct cgroup_subsys_state *css;
- printk(KERN_INFO "Initializing cgroup subsys %s\n", ss->name);
+ pr_debug("Initializing cgroup subsys %s\n", ss->name);
mutex_lock(&cgroup_mutex);
@@ -5329,6 +5327,7 @@
WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
WARN_ON(register_filesystem(&cgroup_fs_type));
+ WARN_ON(register_filesystem(&cgroup2_fs_type));
WARN_ON(!proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations));
return 0;
@@ -5472,19 +5471,6 @@
.release = single_release,
};
-static void **subsys_canfork_priv_p(void *ss_priv[CGROUP_CANFORK_COUNT], int i)
-{
- if (CGROUP_CANFORK_START <= i && i < CGROUP_CANFORK_END)
- return &ss_priv[i - CGROUP_CANFORK_START];
- return NULL;
-}
-
-static void *subsys_canfork_priv(void *ss_priv[CGROUP_CANFORK_COUNT], int i)
-{
- void **private = subsys_canfork_priv_p(ss_priv, i);
- return private ? *private : NULL;
-}
-
/**
* cgroup_fork - initialize cgroup related fields during copy_process()
* @child: pointer to task_struct of forking parent process.
@@ -5507,14 +5493,13 @@
* returns an error, the fork aborts with that error code. This allows for
* a cgroup subsystem to conditionally allow or deny new forks.
*/
-int cgroup_can_fork(struct task_struct *child,
- void *ss_priv[CGROUP_CANFORK_COUNT])
+int cgroup_can_fork(struct task_struct *child)
{
struct cgroup_subsys *ss;
int i, j, ret;
for_each_subsys_which(ss, i, &have_canfork_callback) {
- ret = ss->can_fork(child, subsys_canfork_priv_p(ss_priv, i));
+ ret = ss->can_fork(child);
if (ret)
goto out_revert;
}
@@ -5526,7 +5511,7 @@
if (j >= i)
break;
if (ss->cancel_fork)
- ss->cancel_fork(child, subsys_canfork_priv(ss_priv, j));
+ ss->cancel_fork(child);
}
return ret;
@@ -5539,15 +5524,14 @@
* This calls the cancel_fork() callbacks if a fork failed *after*
* cgroup_can_fork() succeded.
*/
-void cgroup_cancel_fork(struct task_struct *child,
- void *ss_priv[CGROUP_CANFORK_COUNT])
+void cgroup_cancel_fork(struct task_struct *child)
{
struct cgroup_subsys *ss;
int i;
for_each_subsys(ss, i)
if (ss->cancel_fork)
- ss->cancel_fork(child, subsys_canfork_priv(ss_priv, i));
+ ss->cancel_fork(child);
}
/**
@@ -5560,8 +5544,7 @@
* cgroup_task_iter_start() - to guarantee that the new task ends up on its
* list.
*/
-void cgroup_post_fork(struct task_struct *child,
- void *old_ss_priv[CGROUP_CANFORK_COUNT])
+void cgroup_post_fork(struct task_struct *child)
{
struct cgroup_subsys *ss;
int i;
@@ -5605,7 +5588,7 @@
* and addition to css_set.
*/
for_each_subsys_which(ss, i, &have_fork_callback)
- ss->fork(child, subsys_canfork_priv(old_ss_priv, i));
+ ss->fork(child);
}
/**
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 2d3df82..1b72d56 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -200,7 +200,7 @@
* to do anything as freezer_attach() will put @task into the appropriate
* state.
*/
-static void freezer_fork(struct task_struct *task, void *private)
+static void freezer_fork(struct task_struct *task)
{
struct freezer *freezer;
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
index b50d5a1..303097b 100644
--- a/kernel/cgroup_pids.c
+++ b/kernel/cgroup_pids.c
@@ -134,7 +134,7 @@
*
* This function follows the set limit. It will fail if the charge would cause
* the new value to exceed the hierarchical limit. Returns 0 if the charge
- * succeded, otherwise -EAGAIN.
+ * succeeded, otherwise -EAGAIN.
*/
static int pids_try_charge(struct pids_cgroup *pids, int num)
{
@@ -209,7 +209,7 @@
* task_css_check(true) in pids_can_fork() and pids_cancel_fork() relies
* on threadgroup_change_begin() held by the copy_process().
*/
-static int pids_can_fork(struct task_struct *task, void **priv_p)
+static int pids_can_fork(struct task_struct *task)
{
struct cgroup_subsys_state *css;
struct pids_cgroup *pids;
@@ -219,7 +219,7 @@
return pids_try_charge(pids, 1);
}
-static void pids_cancel_fork(struct task_struct *task, void *priv)
+static void pids_cancel_fork(struct task_struct *task)
{
struct cgroup_subsys_state *css;
struct pids_cgroup *pids;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 02a8ea5..3e945fc 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -51,6 +51,7 @@
#include <linux/stat.h>
#include <linux/string.h>
#include <linux/time.h>
+#include <linux/time64.h>
#include <linux/backing-dev.h>
#include <linux/sort.h>
@@ -68,7 +69,7 @@
struct fmeter {
int cnt; /* unprocessed events count */
int val; /* most recent output value */
- time_t time; /* clock (secs) when val computed */
+ time64_t time; /* clock (secs) when val computed */
spinlock_t lock; /* guards read or write of above */
};
@@ -1374,7 +1375,7 @@
*/
#define FM_COEF 933 /* coefficient for half-life of 10 secs */
-#define FM_MAXTICKS ((time_t)99) /* useless computing more ticks than this */
+#define FM_MAXTICKS ((u32)99) /* useless computing more ticks than this */
#define FM_MAXCNT 1000000 /* limit cnt to avoid overflow */
#define FM_SCALE 1000 /* faux fixed point scale */
@@ -1390,8 +1391,11 @@
/* Internal meter update - process cnt events and update value */
static void fmeter_update(struct fmeter *fmp)
{
- time_t now = get_seconds();
- time_t ticks = now - fmp->time;
+ time64_t now;
+ u32 ticks;
+
+ now = ktime_get_seconds();
+ ticks = now - fmp->time;
if (ticks == 0)
return;
diff --git a/kernel/fork.c b/kernel/fork.c
index 291b08c..6774e6b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1250,7 +1250,6 @@
{
int retval;
struct task_struct *p;
- void *cgrp_ss_priv[CGROUP_CANFORK_COUNT] = {};
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
@@ -1527,7 +1526,7 @@
* between here and cgroup_post_fork() if an organisation operation is in
* progress.
*/
- retval = cgroup_can_fork(p, cgrp_ss_priv);
+ retval = cgroup_can_fork(p);
if (retval)
goto bad_fork_free_pid;
@@ -1609,7 +1608,7 @@
write_unlock_irq(&tasklist_lock);
proc_fork_connector(p);
- cgroup_post_fork(p, cgrp_ss_priv);
+ cgroup_post_fork(p);
threadgroup_change_end(current);
perf_event_fork(p);
@@ -1619,7 +1618,7 @@
return p;
bad_fork_cancel_cgroup:
- cgroup_cancel_fork(p, cgrp_ss_priv);
+ cgroup_cancel_fork(p);
bad_fork_free_pid:
if (pid != &init_struct_pid)
free_pid(pid);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 77d97a6..44253ad 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8342,7 +8342,7 @@
sched_offline_group(tg);
}
-static void cpu_cgroup_fork(struct task_struct *task, void *private)
+static void cpu_cgroup_fork(struct task_struct *task)
{
sched_move_task(task);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc10620..14cb1db 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4813,7 +4813,7 @@
static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
struct mem_cgroup *from;
struct task_struct *leader, *p;
struct mm_struct *mm;