Cgroup unified hierarchy

April, 2014    Tejun Heo <tj@kernel.org>

This document describes the changes made by unified hierarchy and
their rationales. It will eventually be merged into the main cgroup
documentation.

CONTENTS

1. Background
2. Basic Operation
  2-1. Mounting
  2-2. cgroup.subtree_control
  2-3. cgroup.controllers
3. Structural Constraints
  3-1. Top-down
  3-2. No internal tasks
4. Delegation
  4-1. Model of delegation
  4-2. Common ancestor rule
5. Other Changes
  5-1. [Un]populated Notification
  5-2. Other Core Changes
  5-3. Controller File Conventions
    5-3-1. Format
    5-3-2. Control Knobs
  5-4. Per-Controller Changes
    5-4-1. io
    5-4-2. cpuset
    5-4-3. memory
6. Planned Changes
  6-1. CAP for resource control


1. Background

cgroup allows an arbitrary number of hierarchies and each hierarchy
can host any number of controllers. While this seems to provide a
high level of flexibility, it isn't very useful in practice.

For example, as there is only one instance of each controller, utility
type controllers such as freezer which can be useful in all
hierarchies can only be used in one. The issue is exacerbated by the
fact that controllers can't be moved around once hierarchies are
populated. Another issue is that all controllers bound to a hierarchy
are forced to have exactly the same view of the hierarchy. It isn't
possible to vary the granularity depending on the specific controller.

In practice, these issues heavily limit which controllers can be put
on the same hierarchy and most configurations resort to putting each
controller on its own hierarchy. Only closely related ones, such as
the cpu and cpuacct controllers, make sense to put on the same
hierarchy. This often means that userland ends up managing multiple
similar hierarchies, repeating the same steps on each hierarchy
whenever a hierarchy management operation is necessary.

Unfortunately, support for multiple hierarchies comes at a steep cost.
Internal implementation in cgroup core proper is dazzlingly
complicated but more importantly the support for multiple hierarchies
restricts how cgroup is used in general and what controllers can do.

There's no limit on how many hierarchies there may be, which means
that a task's cgroup membership can't be described in finite length.
The key may contain any number of entries and is unlimited in
length, which makes it highly awkward to handle and leads to addition
of controllers which exist only to identify membership, which in turn
exacerbates the original problem.

Also, as a controller can't have any expectation regarding what shape
of hierarchies other controllers would be on, each controller has to
assume that all other controllers are operating on completely
orthogonal hierarchies. This makes it impossible, or at least very
cumbersome, for controllers to cooperate with each other.

In most use cases, putting controllers on hierarchies which are
completely orthogonal to each other isn't necessary. What usually is
called for is the ability to have differing levels of granularity
depending on the specific controller. In other words, hierarchy may
be collapsed from leaf towards root when viewed from specific
controllers. For example, a given configuration might not care about
how memory is distributed beyond a certain level while still wanting
to control how CPU cycles are distributed.

Unified hierarchy is the next version of the cgroup interface. It aims
to address the aforementioned issues by having more structure while
retaining enough flexibility for most use cases. Various other
general and controller-specific interface issues are also addressed in
the process.


2. Basic Operation

2-1. Mounting

Currently, unified hierarchy can be mounted with the following mount
command. Note that this is still under development and scheduled to
change soon.

 mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT

All controllers which support the unified hierarchy and are not bound
to other hierarchies are automatically bound to the unified hierarchy
and show up at the root of it. Controllers which are enabled only in
the root of the unified hierarchy can be bound to other hierarchies.
This allows mixing unified hierarchy with the traditional multiple
hierarchies in a fully backward compatible way.

A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy. Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the unified hierarchy after the final umount of the previous
hierarchy. Similarly, a controller should be fully disabled to be
moved out of the unified hierarchy and it may take some time for the
disabled controller to become available for other hierarchies;
furthermore, due to dependencies among controllers, other controllers
may need to be disabled too.

While useful for development and manual configurations, dynamically
moving controllers between the unified and other hierarchies is
strongly discouraged for production use. It is recommended to decide
the hierarchies and controller associations before starting to use the
controllers.


2-2. cgroup.subtree_control

All cgroups on unified hierarchy have a "cgroup.subtree_control" file
which governs which controllers are enabled on the children of the
cgroup. Let's assume a hierarchy like the following.

  root - A - B - C
               \ D

root's "cgroup.subtree_control" file determines which controllers are
enabled on A, A's on B, and B's on C and D. This coincides with the
fact that controllers on the immediate sub-level are used to
distribute the resources of the parent. In fact, it's natural to
assume that resource control knobs of a child belong to its parent.
Enabling a controller in a "cgroup.subtree_control" file declares that
distribution of the respective resources of the cgroup will be
controlled. Note that this means that controller enable states are
shared among siblings.

When read, the file contains a space-separated list of currently
enabled controllers. A write to the file should contain a
space-separated list of controllers with '+' or '-' prefixed (without
the quotes). Controllers prefixed with '+' are enabled and '-'
disabled. If a controller is listed multiple times, the last entry
wins. The specific operations are executed atomically - either all
succeed or all fail.
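
As a rough sketch of the write format described above, the helper
below checks that every token carries a '+' or '-' prefix and then
issues all of them in a single write. The function name
subtree_control_write is a hypothetical illustration and not part of
cgroup; its first argument is assumed to be a cgroup directory on the
unified hierarchy.

```shell
# Hypothetical helper, not part of cgroup: validate each token and write
# all of them to DIR/cgroup.subtree_control in one write.
subtree_control_write() (
	dir=$1
	shift
	for tok in "$@"; do
		case $tok in
		[+-]?*) ;;	# '+' or '-' followed by a controller name
		*) echo "bad token: $tok" >&2; exit 1 ;;
		esac
	done
	# "$*" joins the tokens with single spaces.
	printf '%s\n' "$*" > "$dir/cgroup.subtree_control"
)
```

For example, "subtree_control_write $CGROUP_DIR +memory -io" would
request that memory be enabled and io be disabled for the children of
$CGROUP_DIR, where $CGROUP_DIR stands for whichever cgroup is being
configured.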


2-3. cgroup.controllers

The read-only "cgroup.controllers" file contains a space-separated
list of controllers which can be enabled in the cgroup's
"cgroup.subtree_control" file.

In the root cgroup, this lists controllers which are not bound to
other hierarchies and the content changes as controllers are bound to
and unbound from other hierarchies.

In non-root cgroups, the content of this file equals that of the
parent's "cgroup.subtree_control" file as only controllers enabled
from the parent can be used in its children.
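
A minimal sketch of how userland might consult this file before
touching "cgroup.subtree_control" follows. The helper name
controller_available is an assumption made for illustration.

```shell
# Hypothetical helper: succeed if controller NAME is listed in
# DIR/cgroup.controllers, i.e. it may be enabled in DIR's
# cgroup.subtree_control.
controller_available() (
	dir=$1
	name=$2
	# The file holds one space-separated list; the unquoted
	# expansion splits it into words.
	for c in $(cat "$dir/cgroup.controllers"); do
		[ "$c" = "$name" ] && exit 0
	done
	exit 1
)
```

The check is advisory only. In the root cgroup the list can change as
controllers are bound to and unbound from other hierarchies, so a
later write may still fail.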


3. Structural Constraints

3-1. Top-down

As it doesn't make sense to nest control of an uncontrolled resource,
all non-root "cgroup.subtree_control" files can only contain
controllers which are enabled in the parent's "cgroup.subtree_control"
file. A controller can be enabled only if the parent has the
controller enabled and a controller can't be disabled if one or more
children have it enabled.


3-2. No internal tasks

One long-standing issue that cgroup faces is the competition between
tasks belonging to the parent cgroup and its children cgroups. This
is inherently nasty as two different types of entities compete and
there is no agreed-upon obvious way to handle it. Different
controllers are doing different things.

The cpu controller considers tasks and cgroups as equivalents and maps
nice levels to cgroup weights. This works for some cases but falls
flat when children should be allocated specific ratios of CPU cycles
and the number of internal tasks fluctuates - the ratios constantly
change as the number of competing entities fluctuates. There also are
other issues. The mapping from nice level to weight isn't obvious or
universal, and there are various other knobs which simply aren't
available for tasks.

The io controller implicitly creates a hidden leaf node for each
cgroup to host the tasks. The hidden leaf has its own copies of all
the knobs with "leaf_" prefixed. While this allows equivalent control
over internal tasks, it comes with serious drawbacks. It always adds
an extra layer of nesting which may not be necessary, makes the
interface messy and significantly complicates the implementation.

The memory controller currently doesn't have a way to control what
happens between internal tasks and child cgroups and the behavior is
not clearly defined. There have been attempts to add ad-hoc behaviors
and knobs to tailor the behavior to specific workloads. Continuing
this direction will lead to problems which will be extremely difficult
to resolve in the long term.

Multiple controllers struggle with internal tasks and have come up
with different ways to deal with them; unfortunately, all the
approaches in use now are severely flawed and, furthermore, the widely
different behaviors make cgroup as a whole highly inconsistent.

It is clear that this is something which needs to be addressed from
cgroup core proper in a uniform way so that controllers don't need to
worry about it and cgroup as a whole shows a consistent and logical
behavior. To achieve that, unified hierarchy enforces the following
structural constraint:

  Except for the root, only cgroups which don't contain any task may
  have controllers enabled in their "cgroup.subtree_control" files.

Combined with other properties, this guarantees that, when a
controller is looking at the part of the hierarchy which has it
enabled, tasks are always only on the leaves. This rules out
situations where child cgroups compete against internal tasks of the
parent.

There are two things to note. Firstly, the root cgroup is exempt from
the restriction. Root contains tasks and anonymous resource
consumption which can't be associated with any other cgroup and
requires special treatment from most controllers. How resource
consumption in the root cgroup is governed is up to each controller.

Secondly, the restriction doesn't take effect if there is no enabled
controller in the cgroup's "cgroup.subtree_control" file. This is
important as otherwise it wouldn't be possible to create children of a
populated cgroup. To control resource distribution of a cgroup, the
cgroup must create children and transfer all its tasks to the children
before enabling controllers in its "cgroup.subtree_control" file.
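
The constraint can be approximated from userland with a pre-check
like the sketch below. The helper name may_enable_controllers is
hypothetical, and the check is only a convenience: the kernel remains
the enforcer, and the answer may be stale by the time it is acted on.

```shell
# Hypothetical userland pre-check for a non-root cgroup DIR.
# Succeeds if DIR lists no process in its own cgroup.procs, i.e.
# enabling controllers in DIR/cgroup.subtree_control would not run
# into the "no internal tasks" constraint.
may_enable_controllers() (
	# The content is read rather than testing the file size, as
	# the size reported for a kernel interface file is not assumed
	# to be meaningful.
	procs=$(cat "$1/cgroup.procs") || exit 1
	[ -z "$procs" ]
)
```

If the check fails, the processes listed in the cgroup's own
"cgroup.procs" would first have to be moved into child cgroups, as
described above.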


4. Delegation

4-1. Model of delegation

A cgroup can be delegated to a less privileged user by granting write
access to the directory and its "cgroup.procs" file to the user. Note
that the resource control knobs in a given directory concern the
resources of the parent and thus must not be delegated along with the
directory.
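
One way to picture this is the sketch below, which assumes that
delegation is done by changing ownership. The helper name
delegate_cgroup is hypothetical. It deliberately avoids a recursive
chown so that the resource control knobs in the directory keep their
current owner.

```shell
# Hypothetical helper: hand DIR and DIR/cgroup.procs to USER and
# leave every other file in DIR (the resource control knobs) alone.
delegate_cgroup() (
	dir=$1
	user=$2
	chown "$user" "$dir" "$dir/cgroup.procs"
)
```

Changing ownership to another user normally requires privilege, so on
a real hierarchy this would be run by the administrator who owns the
parent.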

Once delegated, the user can build a sub-hierarchy under the directory,
organize processes as it sees fit and further distribute the resources
it got from the parent. The limits and other settings of all resource
controllers are hierarchical and regardless of what happens in the
delegated sub-hierarchy, nothing can escape the resource restrictions
imposed by the parent.

Currently, cgroup doesn't impose any restrictions on the number of
cgroups in or nesting depth of a delegated sub-hierarchy; however,
this may in the future be limited explicitly.


4-2. Common ancestor rule

On the unified hierarchy, to write to a "cgroup.procs" file, in
addition to the usual write permission to the file and uid match, the
writer must also have write access to the "cgroup.procs" file of the
common ancestor of the source and destination cgroups. This prevents
delegatees from smuggling processes across disjoint sub-hierarchies.

Let's say cgroups C0 and C1 have been delegated to user U0 who created
C00, C01 under C0 and C10 under C1 as follows.

  ~~~~~~~~~~~~~ - C0 - C00
  ~ cgroup    ~      \ C01
  ~ hierarchy ~
  ~~~~~~~~~~~~~ - C1 - C10

C0 and C1 are separate entities in terms of resource distribution
regardless of their relative positions in the hierarchy. The
resources the processes under C0 are entitled to are controlled by
C0's ancestors and may be completely different from C1. It's clear
that the intention of delegating C0 to U0 is allowing U0 to organize
the processes under C0 and further control the distribution of C0's
resources.

On traditional hierarchies, if a task has write access to "tasks" or
"cgroup.procs" file of a cgroup and its uid agrees with the target, it
can move the target to the cgroup. In the above example, U0 will not
only be able to move processes in each sub-hierarchy but also across
the two sub-hierarchies, effectively allowing it to violate the
organizational and resource restrictions implied by the hierarchical
structure above C0 and C1.

On the unified hierarchy, let's say U0 wants to write the pid of a
process which has a matching uid and is currently in C10 into
"C00/cgroup.procs". U0 obviously has write access to the file and
migration permission on the process; however, the common ancestor of
the source cgroup C10 and the destination cgroup C00 is above the
points of delegation. U0 would not have write access to its
"cgroup.procs" and the write would thus be denied with -EACCES.
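
To make the rule concrete, the sketch below computes the common
ancestor of two cgroup paths purely as a string operation. The helper
name common_ancestor is hypothetical, and the paths are assumed to be
written relative to the top of the hierarchy in the diagram, so that
C00 is /C0/C00. It only models which cgroup the permission check would
look at and does not perform the check itself.

```shell
# Hypothetical helper: print the deepest cgroup path that is an
# ancestor of (or equal to) both arguments.  Paths start with "/".
common_ancestor() (
	a=${1#/}
	b=${2#/}
	common=
	while [ -n "$a" ] && [ -n "$b" ]; do
		# Compare the leading component of each path.
		ca=${a%%/*}
		cb=${b%%/*}
		[ "$ca" = "$cb" ] || break
		common="$common/$ca"
		# Drop the matched component; an exhausted path becomes "".
		case $a in */*) a=${a#*/} ;; *) a= ;; esac
		case $b in */*) b=${b#*/} ;; *) b= ;; esac
	done
	printf '%s\n' "${common:-/}"
)
```

With the example above, "common_ancestor /C0/C00 /C1/C10" prints "/",
which is above both points of delegation, whereas
"common_ancestor /C0/C00 /C0/C01" prints "/C0", which U0 can write to.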


5. Other Changes

5-1. [Un]populated Notification

cgroup users often need a way to determine when a cgroup's
subhierarchy becomes empty so that it can be cleaned up. cgroup
currently provides release_agent for it; unfortunately, this mechanism
is riddled with issues.

- It delivers events by forking and execing a userland binary
  specified as the release_agent. This is a long deprecated method of
  notification delivery. It's extremely heavy, slow and cumbersome to
  integrate with larger infrastructure.

- There is a single monitoring point at the root. There's no way to
  delegate management of a subtree.

- The event isn't recursive. It triggers when a cgroup doesn't have
  any tasks or child cgroups. Events for internal nodes trigger only
  after all children are removed. This again makes it impossible to
  delegate management of a subtree.

- Events are filtered from the kernel side. A "notify_on_release"
  file is used to subscribe to or suppress release events. This is
  unnecessarily complicated and probably done this way because event
  delivery itself was expensive.

Unified hierarchy implements a "populated" field in the
"cgroup.events" interface file which can be used to monitor whether
the cgroup's subhierarchy has tasks in it or not. Its value is 0 if
there is no task in the cgroup and its descendants; otherwise, 1. poll
and [id]notify events are triggered when the value changes.

This is significantly lighter and simpler and trivially allows
delegating management of subhierarchy - subhierarchy monitoring can
block further propagation simply by putting itself or another process
in the subhierarchy and monitoring events that it's interested in from
there without interfering with monitoring higher in the tree.

In unified hierarchy, the release_agent mechanism is no longer
supported and the interface files "release_agent" and
"notify_on_release" do not exist.
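
Reading the field could look like the sketch below. It assumes that
"cgroup.events" is a flat keyed file with one "KEY VAL" pair per line,
and the helper name cgroup_populated is hypothetical.

```shell
# Hypothetical helper: print the "populated" value (0 or 1) found in
# DIR/cgroup.events.
cgroup_populated() (
	awk '$1 == "populated" { print $2 }' "$1/cgroup.events"
)
```

A monitor would typically call such a helper again each time it
receives a poll or [id]notify event on the file, instead of re-reading
it in a busy loop.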


5-2. Other Core Changes

- None of the mount options is allowed.

- remount is disallowed.

- rename(2) is disallowed.

- The "tasks" file is removed. Everything should be at process
  granularity. Use the "cgroup.procs" file instead.

- The "cgroup.procs" file is not sorted. pids will be unique unless
  they got recycled in-between reads.

- The "cgroup.clone_children" file is removed.

- /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged
  to before exiting. If the cgroup is removed before the zombie is
  reaped, " (deleted)" is appended to the path.


5-3. Controller File Conventions

5-3-1. Format

In general, all controller files should be in one of the following
formats whenever possible.

- Values only files

  VAL0 VAL1...\n

- Flat keyed files

  KEY0 VAL0\n
  KEY1 VAL1\n
  ...

- Nested keyed files

  KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01...
  KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11...
  ...

For a writeable file, the format for writing should generally match
reading; however, controllers may allow omitting later fields or
implement restricted shortcuts for most common use cases.

For both flat and nested keyed files, only the values for a single key
can be written at a time. For nested keyed files, the sub key pairs
may be specified in any order and not all pairs have to be specified.
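
Since sub key pairs carry no fixed position, a reader of a nested
keyed file has to scan each field of the matching line. The following
is a minimal sketch of such a lookup; the helper name nested_get is
hypothetical.

```shell
# Hypothetical helper: print the value of SUBKEY on the line whose
# first field is KEY in a nested keyed FILE.  Every "SUB_KEY=VAL"
# field on the line is inspected because the order is not fixed.
nested_get() (
	awk -v key="$2" -v sub_key="$3" '
		$1 == key {
			for (i = 2; i <= NF; i++) {
				n = index($i, "=")
				if (n && substr($i, 1, n - 1) == sub_key)
					print substr($i, n + 1)
			}
		}' "$1"
)
```

Called as "nested_get FILE KEY1 SUB_KEY0" on the nested keyed example
above, it would print VAL10.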


5-3-2. Control Knobs

- Settings for a single feature should generally be implemented in a
  single file.

- In general, the root cgroup should be exempt from resource control
  and thus shouldn't have resource control knobs.

- If a controller implements ratio based resource distribution, the
  control knob should be named "weight" and have the range [1, 10000]
  and 100 should be the default value. The values are chosen to allow
  enough and symmetric bias in both directions while keeping it
  intuitive (the default is 100%).

- If a controller implements an absolute resource guarantee and/or
  limit, the control knobs should be named "min" and "max"
  respectively. If a controller implements best effort resource
  guarantee and/or limit, the control knobs should be named "low" and
  "high" respectively.

  In the above four control files, the special token "max" should be
  used to represent upward infinity for both reading and writing.

- If a setting has configurable default value and specific overrides,
  the default settings should be keyed with "default" and appear as
  the first entry in the file. Specific entries can use "default" as
  their value to indicate inheritance of the default value.

- For events which are not very high frequency, an interface file
  "events" should be created which lists event key value pairs.
  Whenever a notifiable event happens, a file modified event should be
  generated on the file.
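
The weight range and the "max" token lend themselves to a simple
sanity check before writing a knob. The sketch below is an
illustration only: the helper names are hypothetical, and treating
limit values as plain non-negative decimal integers is an assumption
that individual controllers may not share.

```shell
# Hypothetical validators for values about to be written to knobs.

# "weight" knobs take an integer in [1, 10000].
valid_weight() (
	case $1 in
	''|*[!0-9]*) exit 1 ;;	# empty or not purely decimal digits
	esac
	[ "$1" -ge 1 ] && [ "$1" -le 10000 ]
)

# "min"/"max"/"low"/"high" knobs take the special token "max" for
# upward infinity or (assumed here) a non-negative decimal integer.
valid_limit() (
	case $1 in
	max) exit 0 ;;
	''|*[!0-9]*) exit 1 ;;
	esac
	exit 0
)
```

Under these assumptions, 100 and 10000 are accepted as weights while
0 and 10001 are rejected, and "max" is accepted as a limit while -1
is not.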


5-4. Per-Controller Changes

5-4-1. io

- blkio is renamed to io. The interface is overhauled anyway. The
  new name is more in line with the other two major controllers, cpu
  and memory, and better suited given that it may be used for cgroup
  writeback without involving the block layer.

- Everything including stat is always hierarchical making separate
  recursive stat files pointless and, as no internal node can have
  tasks, leaf weights are meaningless. The operation model is
  simplified and the interface is overhauled accordingly.

  io.stat

    The stat file. The reported stats are from the point where
    bio's are issued to request_queue. The stats are counted
    independent of which policies are enabled. Each line in the
    file has the following format. More fields may later be
    added at the end.

      $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS

  io.weight

    The weight setting, currently only available and effective if
    cfq-iosched is in use for the target device. The weight is
    between 1 and 10000 and defaults to 100. The first line
    always contains the default weight in the following format to
    use when per-device setting is missing.

      default $WEIGHT

    Subsequent lines list per-device weights of the following
    format.

      $MAJ:$MIN $WEIGHT

    Writing "$WEIGHT" or "default $WEIGHT" changes the default
    setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight
    while "$MAJ:$MIN default" clears it.

    This file is available only on non-root cgroups.

  io.max

    The maximum bandwidth and/or iops setting, only available if
    blk-throttle is enabled. The file is of the following format.

      $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS

    ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are
    read/write IOs per second. "max" indicates no limit. Writing
    to the file follows the same format but the individual
    settings may be omitted or specified in any order.

    This file is available only on non-root cgroups.
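
Because io.stat may gain fields at the end of each line, a consumer is
better off matching fields by name than by position. The sketch below
adds up one named field over all devices; the helper name
io_stat_total is hypothetical, and the sum is subject to awk's
floating point precision for very large counters.

```shell
# Hypothetical helper: add up one io.stat field (e.g. rbytes) over
# all devices listed in FILE.  Fields are matched by name so that
# fields appended in the future are simply ignored.
io_stat_total() (
	awk -v field="$2" '
		{
			for (i = 2; i <= NF; i++) {
				split($i, kv, "=")
				if (kv[1] == field)
					total += kv[2]
			}
		}
		END { print total + 0 }' "$1"
)
```

For instance, "io_stat_total $CGROUP_DIR/io.stat rbytes" would report
the bytes read by the cgroup across all devices, with $CGROUP_DIR
standing for the cgroup of interest.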


5-4-2. cpuset

- Tasks are kept in empty cpusets after hotplug and take on the masks
  of the nearest non-empty ancestor, instead of being moved to it.

- A task can be moved into an empty cpuset, and again it takes on the
  masks of the nearest non-empty ancestor.


5-4-3. memory

- use_hierarchy is on by default and the cgroup file for the flag is
  not created.

- The original lower boundary, the soft limit, is defined as a limit
  that is unset by default. As a result, the set of cgroups that
  global reclaim prefers is opt-in, rather than opt-out. The costs
  for optimizing these mostly negative lookups are so high that the
  implementation, despite its enormous size, does not even provide the
  basic desirable behavior. First off, the soft limit has no
  hierarchical meaning. All configured groups are organized in a
  global rbtree and treated like equal peers, regardless of where they
  are located in the hierarchy. This makes subtree delegation
  impossible. Second, the soft limit reclaim pass is so aggressive
  that it not only introduces high allocation latencies into the
  system, but also impacts system performance due to overreclaim, to
  the point where the feature becomes self-defeating.

  The memory.low boundary on the other hand is a top-down allocated
  reserve. A cgroup enjoys reclaim protection when it and all its
  ancestors are below their low boundaries, which makes delegation of
  subtrees possible. Secondly, new cgroups have no reserve by
  default and in the common case most cgroups are eligible for the
  preferred reclaim pass. This allows the new low boundary to be
  efficiently implemented with just a minor addition to the generic
  reclaim code, without the need for out-of-band data structures and
  reclaim passes. Because the generic reclaim code considers all
  cgroups except for the ones running low in the preferred first
  reclaim pass, overreclaim of individual groups is eliminated as
  well, resulting in much better overall workload performance.

- The original high boundary, the hard limit, is defined as a strict
  limit that cannot budge, even if the OOM killer has to be called.
  But this generally goes against the goal of making the most out of
  the available memory. The memory consumption of workloads varies
  during runtime, and that requires users to overcommit. But doing
  that with a strict upper limit requires either a fairly accurate
  prediction of the working set size or adding slack to the limit.
  Since working set size estimation is hard and error prone, and
  getting it wrong results in OOM kills, most users tend to err on the
  side of a looser limit and end up wasting precious resources.

  The memory.high boundary on the other hand can be set much more
  conservatively. When hit, it throttles allocations by forcing them
  into direct reclaim to work off the excess, but it never invokes the
  OOM killer. As a result, a high boundary that is chosen too
  aggressively will not terminate the processes, but instead it will
  lead to gradual performance degradation. The user can monitor this
  and make corrections until the minimal memory footprint that still
  gives acceptable performance is found.

  In extreme cases, with many concurrent allocations and a complete
  breakdown of reclaim progress within the group, the high boundary
  can be exceeded. But even then it's mostly better to satisfy the
  allocation from the slack available in other groups or the rest of
  the system than to kill the group. Otherwise, memory.max is there
  to limit this type of spillover and ultimately contain buggy or even
  malicious applications.

| 571 | - The original control file names are unwieldy and inconsistent in |
| 572 | many different ways. For example, the upper boundary hit count is |
| 573 | exported in the memory.failcnt file, but an OOM event count has to |
| 574 | be manually counted by listening to memory.oom_control events, and |
| 575 | lower boundary / soft limit events have to be counted by first |
| 576 | setting a threshold for that value and then counting those events. |
| 577 | Also, usage and limit files encode their units in the filename. |
| 578 | That makes the filenames very long, even though this is not |
| 579 | information that a user needs to be reminded of every time they type |
| 580 | out those names. |
| 581 | |
| 582 | To address these naming issues, as well as to signal clearly that |
| 583 | the new interface carries a new configuration model, the naming |
| 584 | conventions in it necessarily differ from the old interface. |
| 585 | |
| 586 | - The original limit files indicate the state of an unset limit with a |
| 587 | Very High Number, and a configured limit can be unset by echoing -1 |
| 588 | into those files. But that very high number is implementation and |
| 589 | architecture dependent and not very descriptive. And while -1 can |
| 590 | be understood as an underflow into the highest possible value, -2 or |
| 591 | -10M etc. do not work, so it's not consistent. |
| 592 | |
| 593 | memory.low, memory.high, and memory.max will use the string "max" to |
| 594 | indicate and set the highest possible value. |
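A minimal sketch of how a tool might handle the "max" convention described above. The helper names parse_limit and format_limit are hypothetical and not part of any kernel or library interface; the sketch assumes a limit file holds either the string "max" or a plain decimal byte count.

```python
def parse_limit(text):
    """Return the limit in bytes, or None when the file reads "max".

    Hypothetical helper: None stands for "no limit configured".
    """
    value = text.strip()
    if value == "max":
        return None
    return int(value)


def format_limit(limit):
    """Return the string to write: "max" for None, else the byte count."""
    return "max" if limit is None else str(limit)
```

Representing the unset state as None keeps callers from depending on any architecture-specific "very high number".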
| 595 | |
| 596 | 6. Planned Changes |
| 597 | |
| 598 | 6-1. CAP for resource control |
| 599 | |
| 600 | Unified hierarchy will require one of the capabilities(7), which is |
| 601 | yet to be decided, for all resource control related knobs. Process |
| 602 | organization operations - creation of sub-cgroups and migration of |
| 603 | processes in sub-hierarchies - may be delegated by changing the |
| 604 | ownership and/or permissions on the cgroup directory and |
| 605 | "cgroup.procs" interface file; however, all operations which affect |
| 606 | resource control - writes to a "cgroup.subtree_control" file or any |
| 607 | controller-specific knobs - will require an explicit CAP privilege. |
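The delegation of process organization described above might be sketched as follows. The function name delegate_organization and its arguments are hypothetical; the sketch only assumes that the cgroup directory contains a "cgroup.procs" file, as stated in the text, and it leaves every resource control file untouched.

```python
import os


def delegate_organization(cgroup_dir, uid, gid):
    """Hand process-organization rights for one cgroup to uid:gid.

    Only the directory and its "cgroup.procs" file change owner.
    Resource control knobs in the same directory keep their owner.
    """
    os.chown(cgroup_dir, uid, gid)
    os.chown(os.path.join(cgroup_dir, "cgroup.procs"), uid, gid)
```

Changing ownership of a real cgroup directory normally requires privilege; the sketch does not check for it.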
| 608 | |
| 609 | This, in part, is to prevent the cgroup interface from being |
| 610 | inadvertently promoted to a programmable API used by non-privileged |
| 611 | binaries. cgroup exposes various aspects of the system in ways which |
| 612 | aren't properly abstracted for direct consumption by regular programs. |
| 613 | This is an administration interface much closer to sysctl knobs than |
| 614 | system calls. Even the basic access model, being filesystem path |
| 615 | based, isn't suitable for direct consumption. There's no way to |
| 616 | access "my cgroup" in a race-free way or make multiple operations |
| 617 | atomic against migration to another cgroup. |
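To illustrate the race mentioned above, here is a sketch of how a program might try to locate "my cgroup". It assumes that /proc/self/cgroup lists the unified hierarchy as a line of the form "0::<path>" and that the hierarchy is mounted at /sys/fs/cgroup; both are assumptions of this sketch, and unified_cgroup_path is a hypothetical helper.

```python
def unified_cgroup_path(proc_cgroup_text, mount_point="/sys/fs/cgroup"):
    """Map the contents of /proc/self/cgroup to a filesystem path.

    Returns None when no unified hierarchy entry is present.
    """
    for line in proc_cgroup_text.splitlines():
        hierarchy_id, controllers, path = line.split(":", 2)
        # Assumed format: hierarchy ID 0 with an empty controller list.
        if hierarchy_id == "0" and controllers == "":
            return mount_point.rstrip("/") + path
    return None

# The race: the process may be migrated to another cgroup after
# /proc/self/cgroup is read but before any file under the returned
# path is opened, so the path can name a cgroup the process has left.
```

Nothing in the path-based interface lets the lookup and the subsequent file accesses be performed as one atomic step.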
| 618 | |
| 619 | Another aspect is that, for better or for worse, the cgroup interface |
| 620 | goes through far less scrutiny than regular interfaces for |
| 621 | unprivileged userland. The upside is that cgroup is able to expose |
| 622 | useful features which may not be suitable for general consumption in a |
| 623 | reasonable time frame. It provides a relatively short path between |
| 624 | internal details and userland-visible interface. Of course, this |
| 625 | shortcut comes with high risk. General kernel APIs go through the |
| 626 | scrutiny they do for good reasons. The shortcut may end up leaking |
| 627 | internal details in a way which can cause significant pain by locking |
| 628 | the kernel into a contract that can't be maintained in a reasonable |
| 629 | manner. |
| 630 | |
| 631 | Also, due to their specific nature, cgroup and its controllers don't |
| 632 | tend to attract attention from a wide range of developers. cgroup's |
| 633 | short history is already fraught with severely mis-designed |
| 634 | interfaces, unnecessary commitments to and exposure of internal |
| 635 | details, and broken and dangerous implementations of various features. |
| 636 | |
| 637 | Keeping cgroup as an administration interface is both advantageous for |
| 638 | its role and imperative given its nature. Some of the cgroup features |
| 639 | may make sense for unprivileged access. If deemed justified, those |
| 640 | must be further abstracted and implemented as a different interface, |
| 641 | be it a system call or process-private filesystem, and survive the |
| 642 | scrutiny that any interface for general consumption is required to |
| 643 | go through. |
| 644 | |
| 645 | Requiring CAP is not a complete solution but should serve as a |
| 646 | significant deterrent against spraying cgroup usages in non-privileged |
| 647 | programs. |