CPUSETS
-------

Copyright (C) 2004 BULL SA.
Written by Simon.Derr@bull.net

Portions Copyright (c) 2004 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com>

CONTENTS:
=========

1. Cpusets
  1.1 What are cpusets ?
  1.2 Why are cpusets needed ?
  1.3 How are cpusets implemented ?
  1.4 What are exclusive cpusets ?
  1.5 What does notify_on_release do ?
  1.6 What is a marker_pid ?
  1.7 What is memory_pressure ?
  1.8 How do I use cpusets ?
2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Adding/removing cpus
  2.3 Setting flags
  2.4 Attaching processes
3. Questions
4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset. They form a nested
hierarchy visible in a virtual file system. These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Each task has a pointer to a cpuset. Multiple tasks may reference
the same cpuset. Requests by a task, using the sched_setaffinity(2)
system call to include CPUs in its CPU affinity mask, and using the
mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
in its memory policy, are both filtered through that task's cpuset,
filtering out any CPUs or Memory Nodes not in that cpuset. The
scheduler will not schedule a task on a CPU that is not allowed in
its cpus_allowed vector, and the kernel page allocator will not
allocate a page on a node that is not allowed in the requesting task's
mems_allowed vector.

User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA), presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database),
    * NUMA systems running large HPC applications with demanding
      performance characteristics, or
    * Servers running orthogonal workloads, such as RT applications
      requiring low latency alongside HPC applications that are
      throughput sensitive, for which cpu_exclusive cpusets are useful.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets. It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel (2.6.7 and above) mechanism to constrain
which CPUs and Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cpuset structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
   A cpu_exclusive cpuset is also associated with a sched domain.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_all_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in sched.c, a new API partition_sched_domains for handling
   sched domain changes associated with cpu_exclusive cpusets
   and related changes in both sched.c and arch/ia64/kernel/domain.c
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

In addition, a new file system of type "cpuset" may be mounted,
typically at /dev/cpuset, to enable browsing and modifying the cpusets
presently known to the kernel. No new system calls are added for
cpusets - all support for querying and modifying cpusets is via
this cpuset file system.

Each task under /proc has an added file named 'cpuset', displaying
the cpuset name, as the path relative to the root of the cpuset file
system.

The /proc/<pid>/status file for each task has two added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the format seen in the following example:

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Mems_allowed:   ffffffff,ffffffff

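As an illustration only, the masks shown above can be decoded into
lists of CPU or Memory Node numbers. The sketch below uses Python,
which is an arbitrary choice for this document, and the helper names
mask_to_ids and allowed_from_status are hypothetical. It assumes that
each comma-separated word is a 32-bit hexadecimal value and that the
most significant word comes first.

```python
def mask_to_ids(mask):
    """Return the sorted bit numbers set in a mask such as 'ffffffff,00000003'.

    Assumption: comma-separated 32-bit hex words, most significant first.
    """
    value = 0
    for word in mask.strip().split(","):
        value = (value << 32) | int(word, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def allowed_from_status(status_text):
    """Pick the Cpus_allowed and Mems_allowed lines out of status file text."""
    result = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("Cpus_allowed", "Mems_allowed"):
            result[key] = mask_to_ids(value)
    return result
```

On a kernel with cpusets, allowed_from_status() could be given the
contents of /proc/self/status to see where the calling task may run
and allocate memory.
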
Each cpuset is represented by a directory in the cpuset file system
containing the following files describing that cpuset:

 - cpus: list of CPUs in that cpuset
 - mems: list of Memory Nodes in that cpuset
 - memory_migrate flag: if set, move pages to cpuset's nodes
 - cpu_exclusive flag: is cpu placement exclusive?
 - mem_exclusive flag: is memory placement exclusive?
 - tasks: list of tasks (by pid) attached to that cpuset
 - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
 - marker_pid: pid of user task in co-ordinated operation sequence
 - memory_pressure: measure of how much paging pressure in cpuset

In addition, only the root cpuset has the following file:
 - memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command. The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task to a cpuset, automatically inherited at
fork by any children of that task, allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset. A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can only be marked exclusive if its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

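The three rules can be modelled in a few lines of user-space code.
What follows is a hedged sketch of the rules as worded in this
document, not the kernel's own validation code. The Cpuset class and
rule_violations function are hypothetical names. The overlap check
treats a sibling's exclusive flag the same as the cpuset's own, on the
assumption that the guarantee has to hold in both directions.

```python
from dataclasses import dataclass, field


@dataclass
class Cpuset:
    """Toy model of one cpuset: its CPUs, Memory Nodes and exclusive flags."""
    cpus: set = field(default_factory=set)
    mems: set = field(default_factory=set)
    cpu_exclusive: bool = False
    mem_exclusive: bool = False


def rule_violations(cs, parent, siblings):
    """Return a list of rule violations for 'cs'; an empty list means valid."""
    problems = []
    # Rule 1: CPUs and Memory Nodes must be a subset of the parent's.
    if not cs.cpus <= parent.cpus:
        problems.append("cpus not a subset of parent")
    if not cs.mems <= parent.mems:
        problems.append("mems not a subset of parent")
    # Rule 2: a cpuset can only be marked exclusive if its parent is.
    if cs.cpu_exclusive and not parent.cpu_exclusive:
        problems.append("cpu_exclusive needs a cpu_exclusive parent")
    if cs.mem_exclusive and not parent.mem_exclusive:
        problems.append("mem_exclusive needs a mem_exclusive parent")
    # Rule 3: exclusive resources may not overlap any sibling.
    for sib in siblings:
        if (cs.cpu_exclusive or sib.cpu_exclusive) and cs.cpus & sib.cpus:
            problems.append("cpus overlap a sibling")
        if (cs.mem_exclusive or sib.mem_exclusive) and cs.mems & sib.mems:
            problems.append("mems overlap a sibling")
    return problems
```

Only the parent and the siblings are consulted, which mirrors the
point made above: no scan of the whole hierarchy is needed.
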

1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpu_exclusive has a scheduler (sched) domain
associated with it. The sched domain consists of all CPUs in the
current cpuset that are not part of any exclusive child cpusets.
This ensures that the scheduler load balancing code only balances
against the CPUs that are in the sched domain as defined above and
not all of the CPUs in the system. This removes any overhead due to
load balancing code trying to pull tasks outside of the cpu_exclusive
cpuset only to be prevented by the tasks' cpus_allowed mask.

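The sched domain definition above amounts to a set difference. A
minimal sketch, using a hypothetical helper name and plain Python
sets in place of the kernel's CPU masks:

```python
def sched_domain_cpus(cpuset_cpus, exclusive_child_cpus):
    """CPUs of a cpu_exclusive cpuset left after removing the CPUs that
    belong to its cpu_exclusive children.

    'exclusive_child_cpus' is an iterable of CPU sets, one per
    cpu_exclusive child cpuset.
    """
    domain = set(cpuset_cpus)
    for child in exclusive_child_cpus:
        domain -= set(child)
    return domain
```

For a cpuset holding CPUs 0-7 with exclusive children on CPUs 0-1 and
2-3, the remaining domain is CPUs 4-7.
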
A cpuset that is mem_exclusive restricts kernel allocations for
page, buffer and other data commonly shared by the kernel across
multiple users. All cpusets, whether mem_exclusive or not, restrict
allocations of memory for user space. This enables configuring a
system so that several independent jobs can share common kernel data,
such as file system pages, while isolating each job's user allocation in
its own cpuset. To do this, construct a large mem_exclusive cpuset to
hold all the jobs, and construct child, non-mem_exclusive cpusets for
each individual job. Only a small amount of typical kernel memory,
such as requests from interrupt handlers, is allowed to be taken
outside even a mem_exclusive cpuset.


1.5 What does notify_on_release do ?
------------------------------------

If the notify_on_release flag is enabled (1) in a cpuset, then whenever
the last task in the cpuset leaves (exits or attaches to some other
cpuset) and the last child cpuset of that cpuset is removed, the
kernel runs the command /sbin/cpuset_release_agent, supplying the
pathname (relative to the mount point of the cpuset file system) of the
abandoned cpuset. This enables automatic removal of abandoned cpusets.
The default value of notify_on_release in the root cpuset at system
boot is disabled (0). The default value of other cpusets at creation
is the current value of their parent's notify_on_release setting.

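What a release agent does with the pathname it receives is up to user
code. One plausible action, sketched below, is to remove the abandoned
cpuset's directory. The function name remove_abandoned is
hypothetical, and the mount point is assumed to be /dev/cpuset unless
the caller says otherwise.

```python
import os


def remove_abandoned(relative_path, mount_point="/dev/cpuset"):
    """Remove the cpuset named by a path relative to the mount point.

    Returns True if the directory was removed, and False if rmdir
    failed, for instance because the cpuset is in use again or has
    already been removed.
    """
    target = os.path.join(mount_point, relative_path.lstrip("/"))
    try:
        os.rmdir(target)
    except OSError:
        return False
    return True
```

A program installed as /sbin/cpuset_release_agent could pass its
first command line argument straight to such a function.
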

1.6 What is a marker_pid ?
--------------------------

The marker_pid helps manage cpuset changes safely from user space.

The interface presented to user space for cpusets uses system wide
numbering of CPUs and Memory Nodes. It is the responsibility of
user level code, presumably in a library, to present cpuset-relative
numbering to applications when that would be more useful to them.

However if a task is moved to a different cpuset, or if the 'cpus' or
'mems' of a cpuset are changed, then we need a way for such library
code to detect that its cpuset-relative numbering has changed, when
expressed using system wide numbering.

The kernel cannot safely allow user code to lock kernel resources.
The kernel could deliver out-of-band notice of cpuset changes by
such mechanisms as signals or usermodehelper callbacks, however
this can't be synchronously delivered to library code linked in
applications without intruding on the IPC mechanisms available to
the app. The kernel could require user level code to do all the work,
tracking the cpuset state before and during changes, to verify no
unexpected change occurred, but this becomes an onerous task.

The "marker_pid" cpuset field provides a simple way to make this task
less onerous on user library code. A task writes its pid to a cpuset's
"marker_pid" at the start of a sequence of queries and updates,
and checks as it goes that the cpuset's marker_pid doesn't change.
The pread(2) system call does a seek and read in a single call.
If the marker_pid changes, the user code should retry the required
sequence of operations.

Anytime that a task modifies the "cpus" or "mems" of a cpuset,
unless its pid is in the cpuset's marker_pid field, the kernel zeros
this field.

The above was inspired by the load linked and store conditional
(ll/sc) instructions in the MIPS II instruction set.

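The write, query, re-check and retry sequence described above might
look roughly like this from user code. This is a sketch under
assumptions: the function names are hypothetical, the retry limit is
arbitrary, and the marker_pid file is assumed to read back as the
decimal pid that was written to it.

```python
import os


def read_file(path):
    """Read a small file with a single pread(2) call at offset 0."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, 4096, 0).decode().strip()
    finally:
        os.close(fd)


def consistent_snapshot(cpuset_dir, retries=5):
    """Return {'cpus': ..., 'mems': ...} read while marker_pid stayed ours."""
    me = str(os.getpid())
    marker = os.path.join(cpuset_dir, "marker_pid")
    for _ in range(retries):
        # Start of the sequence: claim the marker.
        with open(marker, "w") as f:
            f.write(me)
        snapshot = {
            name: read_file(os.path.join(cpuset_dir, name))
            for name in ("cpus", "mems")
        }
        # End of the sequence: if the marker still holds our pid, nobody
        # else changed 'cpus' or 'mems' while we were reading them.
        if read_file(marker) == me:
            return snapshot
    raise RuntimeError("cpuset kept changing; giving up after retries")
```

The shape is the same as an ll/sc loop: do the work optimistically,
then check whether the marker survived, and start over if it did not.
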

1.7 What is memory_pressure ?
-----------------------------

The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up in-use
memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure. It's up to the
batch manager or other user code to decide what to do about it and
take action.

==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero. So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.

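To make the units concrete, the sketch below converts a value read
from the memory_pressure file into reclaims per second, and shows what
a 10 second half-life implies once reclaims stop. The function names
are hypothetical, and the decay formula is an idealized model of a
half-life, not the kernel's actual filter arithmetic.

```python
def reclaims_per_second(raw):
    """Convert the file's integer (rate times 1000) to reclaims per second."""
    return int(raw) / 1000


def decayed(value, idle_seconds, half_life=10.0):
    """Idealized reading after 'idle_seconds' with no further reclaims:
    the value halves once per half-life."""
    return value * 0.5 ** (idle_seconds / half_life)
```

Under this model a reading of 1000 would fall to 500 after ten quiet
seconds and to 250 after twenty.
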

1.8 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement. If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset. If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its numa placement,
as queried by get_mempolicy(), doesn't change). If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its CPUs modified, then each task using that
cpuset does _not_ change its behavior automatically. In order to
minimize the impact on the critical scheduling code in the kernel,
tasks will continue to use their prior CPU placement until they
are rebound to their cpuset, by rewriting their pid to the 'tasks'
file of their cpuset. If a task had been bound to some subset of its
cpuset using the sched_setaffinity() call, and if any of that subset
is still allowed in its new cpuset settings, then the task will be
restricted to the intersection of the CPUs it was allowed on before,
and its new cpuset CPU placement. If, on the other hand, there is
no overlap between a task's prior placement and its new cpuset CPU
placement, then the task will be allowed to run on any CPU allowed
in its new cpuset. If a task is moved from one cpuset to another,
its CPU placement is updated in the same way as if the task's pid is
rewritten to the 'tasks' file of its current cpuset.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
but the processor placement is not updated, until that task's pid is
rewritten to the 'tasks' file of its cpuset. This is done to avoid
impacting the scheduler code in the kernel with a check for changes
in a task's processor placement.

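The placement rule that applies both to rebound CPUs and to MPOL_BIND
nodes (keep the overlap if there is one, else fall back to everything
the new cpuset allows) can be written as a short sketch. The function
name is hypothetical and Python sets stand in for the kernel's masks.

```python
def effective_placement(prior, cpuset_allowed):
    """Placement of a task after its cpuset changes.

    'prior' is the set of CPUs (or MPOL_BIND nodes) the task was
    restricted to before; 'cpuset_allowed' is what the new cpuset
    settings allow.
    """
    overlap = set(prior) & set(cpuset_allowed)
    # No overlap at all: the task may use anything in the new cpuset.
    return overlap if overlap else set(cpuset_allowed)
```

For CPUs this result takes effect only once the task's pid is
rewritten to the 'tasks' file, whereas for memory it takes effect on
the task's next page allocation.
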
Normally, once a page is allocated (given a physical page
of main memory), that page stays on the node where it was
allocated, so long as it remains allocated, even if the
cpuset's memory placement policy 'mems' subsequently changes.
If the cpuset flag file 'memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. Depending on the implementation,
this migration may either be done by swapping the page out,
so that the next time the page is referenced, it will be paged
into the task's new cpuset, usually on the node where it was
referenced, or this migration may be done by directly copying
the pages from the task's previous cpuset to the new cpuset,
where possible to the same node, relative to the new cpuset,
as the node that held the page, relative to the old cpuset.
Also if 'memory_migrate' is set true, then if that cpuset's
'mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'mems',
will be moved to nodes in the new setting of 'mems'. Again,
depending on the implementation, this might be done by swapping,
or by direct copying. In either case, pages that were not in
the task's prior cpuset, or in the cpuset's prior 'mems' setting,
will not be moved.

There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then the kernel will automatically update the cpus_allowed of all
tasks attached to that cpuset to allow all CPUs. When memory
hotplug functionality for removing Memory Nodes is available, a
similar exception is expected to apply there as well. In general,
the kernel prefers to violate cpuset placement, over starving a task
that has had all its allowed CPUs or Memory Nodes taken offline. User
code should reconfigure cpusets to only refer to online CPUs and Memory
Nodes when using hotplug to add or remove such resources.

There is a second exception to the above. GFP_ATOMIC requests are
kernel internal allocations that must be satisfied immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC alloc fails. If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it. It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /dev/cpuset
 2) mount -t cpuset none /dev/cpuset
 3) Create the new cpuset by doing mkdirs and writes (or echoes) in
    the /dev/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /dev/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset:

  mount -t cpuset none /dev/cpuset
  cd /dev/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpus
  /bin/echo 1 > mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

In the case that a change of cpuset includes wanting to move already
allocated memory pages, consider further the work of IWAMOTO
Toshihiro <iwamoto@valinux.co.jp> for page remapping and memory
hotremoval, which can be found at:

  http://people.valinux.co.jp/~iwamoto/mh.html

The integration of cpusets with such memory migration is not yet
available.

In the future, a C library interface to cpusets will likely be
available. For now, the only way to query or modify cpusets is
via the cpuset file system, using the various cd, mkdir, echo, cat,
rmdir commands from the shell, or their equivalent from C.

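Since everything goes through ordinary file operations, any language
that can mkdir and write files can drive cpusets. As a hedged sketch,
here is the "Charlie" setup expressed in Python rather than shell. The
function names are hypothetical, and the cpuset file system is assumed
to be mounted already at the mount point passed in.

```python
import os


def set_value(cpuset_dir, name, value):
    """Write one value to one cpuset file; a failed write raises OSError."""
    with open(os.path.join(cpuset_dir, name), "w") as f:
        f.write(f"{value}\n")


def make_cpuset(mount_point, name, cpus, mems):
    """Create a cpuset directory and set its 'cpus' and 'mems' files."""
    path = os.path.join(mount_point, name)
    os.mkdir(path)
    set_value(path, "cpus", cpus)
    set_value(path, "mems", mems)
    return path


def attach(cpuset_dir, pid):
    """Attach one task to the cpuset by writing its pid to 'tasks'."""
    set_value(cpuset_dir, "tasks", pid)
```

A founding task could call make_cpuset("/dev/cpuset", "Charlie",
"2-3", "1") followed by attach(path, os.getpid()) before it forks the
rest of the job.
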
The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type:
# mount -t cpuset none /dev/cpuset

Then under /dev/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /dev/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /dev/cpuset:
# cd /dev/cpuset
# mkdir my_cpuset

Now you want to do something with this cpuset.
# cd my_cpuset

In this directory you can find several files:
# ls
cpus  cpu_exclusive  mems  mem_exclusive  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties. By writing to these files you can manipulate
the cpuset.

Set some flags:
# /bin/echo 1 > cpu_exclusive

Add some cpus:
# /bin/echo 0-7 > cpus

Now attach your shell to this cpuset:
# /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory.
# mkdir my_sub_cs

To remove a cpuset, just use rmdir:
# rmdir my_sub_cs
This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories:

# /bin/echo 1-4 > cpus		-> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpus	-> set cpus list to cpus 1,2,3,4

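For user code that builds or reads these lists, a small converter
between the textual form and a list of numbers can help. The sketch
below uses hypothetical function names. It accepts the two forms shown
above and also a mix of them such as "0-2,7", which is assumed here to
be valid as well.

```python
def parse_list(text):
    """Turn a list such as '1-4' or '1,2,3,4' into a sorted list of ints."""
    ids = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, dash, last = part.partition("-")
        low = int(first)
        high = int(last) if dash else low
        if high < low:
            raise ValueError(f"bad range: {part!r}")
        ids.update(range(low, high + 1))
    return sorted(ids)


def format_list(ids):
    """Turn numbers into the compact range form, e.g. [1, 2, 3, 4] -> '1-4'."""
    ids = sorted(set(ids))
    parts = []
    i = 0
    while i < len(ids):
        j = i
        # Extend the run while the next number is consecutive.
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)
```
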
2.3 Setting flags
-----------------

The syntax is very simple:

# /bin/echo 1 > cpu_exclusive 	-> set flag 'cpu_exclusive'
# /bin/echo 0 > cpu_exclusive 	-> unset flag 'cpu_exclusive'

2.4 Attaching processes
-----------------------

# /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:

# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
	...
# /bin/echo PIDn > tasks

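When many tasks have to be attached, a loop that issues one write per
pid and remembers which writes failed keeps the one-pid-per-write rule
and still reports every error. This is a sketch with a hypothetical
function name, not a tool that exists today.

```python
import os


def attach_all(cpuset_dir, pids):
    """Attach each pid with its own write() to the 'tasks' file.

    Returns a dict mapping each pid that could not be attached to the
    OSError that was raised for it; an empty dict means all succeeded.
    """
    tasks = os.path.join(cpuset_dir, "tasks")
    failed = {}
    for pid in pids:
        try:
            fd = os.open(tasks, os.O_WRONLY)
            try:
                # Exactly one pid per write() call.
                os.write(fd, f"{pid}\n".encode())
            finally:
                os.close(fd)
        except OSError as err:
            failed[pid] = err
    return failed
```
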

3. Questions
============

Q: what's up with this '/bin/echo' ?
A: bash's builtin 'echo' command does not check calls to write() against
   errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q: When I attach processes, only the first one on the line really gets
   attached !
A: We can only return one error code per call to write(). So you should
   put only ONE pid in each write.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset