CPUSETS
-------

Copyright (C) 2004 BULL SA.
Written by Simon.Derr@bull.net

Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <clameter@sgi.com>
Modified by Paul Menage <menage@google.com>
Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

CONTENTS:
=========

1. Cpusets
  1.1 What are cpusets ?
  1.2 Why are cpusets needed ?
  1.3 How are cpusets implemented ?
  1.4 What are exclusive cpusets ?
  1.5 What is memory_pressure ?
  1.6 What is memory spread ?
  1.7 What is sched_load_balance ?
  1.8 What is sched_relax_domain_level ?
  1.9 How do I use cpusets ?
2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Adding/removing cpus
  2.3 Setting flags
  2.4 Attaching processes
3. Questions
4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/cgroups/cgroups.txt.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, filtering out any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.

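The filtering described above can be pictured as a set intersection.
The following Python sketch is only a user-space model, not kernel
code: the function name filter_request() and the example CPU numbers
are hypothetical, and raising an error for an empty result is a
modelling choice that mirrors the EINVAL which sched_setaffinity(2)
reports when no permitted CPU remains in the mask.

```python
def filter_request(requested, allowed):
    """Return the part of a requested CPU (or node) set that the cpuset allows.

    'requested' models the mask passed to sched_setaffinity(2) or mbind(2);
    'allowed' models the CPUs or Memory Nodes of the task's cpuset.
    """
    effective = set(requested) & set(allowed)
    if not effective:
        # Modelling assumption: an empty result is treated as an error.
        raise ValueError("request contains nothing allowed by the cpuset")
    return effective


# Hypothetical cpuset holding CPUs 0-3; the task asks for CPUs 2-5.
cpuset_cpus = {0, 1, 2, 3}
print(sorted(filter_request({2, 3, 4, 5}, cpuset_cpus)))  # [2, 3]
```

CPUs 4 and 5 are silently dropped because they are not in the cpuset;
the same intersection applies to Memory Nodes.
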
User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The location of the running jobs' pages may also be
moved when the memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few, simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page recovery to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example:

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63

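The two formats in the example above carry the same information.  As
an illustration, the Python sketch below converts the mask format into
the list format.  It assumes the mask is a comma-separated sequence of
32-bit hexadecimal words with the most significant word first, which
matches the example; the helper names mask_to_ids() and
ids_to_list_format() are hypothetical and not part of any kernel or
library interface.

```python
def mask_to_ids(mask):
    """Turn a mask such as 'ffffffff,ffffffff' into a sorted list of ids.

    Assumes comma-separated 32-bit hex words, most significant word first.
    """
    value = 0
    for word in mask.split(","):
        value = (value << 32) | int(word, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def ids_to_list_format(ids):
    """Collapse sorted ids into list format, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ranges = []
    for i in ids:
        if ranges and ranges[-1][1] == i - 1:
            ranges[-1][1] = i          # extend the current run
        else:
            ranges.append([i, i])      # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


print(ids_to_list_format(mask_to_ids("ffffffff,ffffffff")))  # 0-63
```
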
Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpuset.cpus: list of CPUs in that cpuset
 - cpuset.mems: list of Memory Nodes in that cpuset
 - cpuset.memory_migrate flag: if set, move pages to cpuset's nodes
 - cpuset.cpu_exclusive flag: is cpu placement exclusive?
 - cpuset.mem_exclusive flag: is memory placement exclusive?
 - cpuset.mem_hardwall flag: is memory allocation hardwalled?
 - cpuset.memory_pressure: measure of how much paging pressure in cpuset
 - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
 - cpuset.sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:
 - cpuset.memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

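A minimal Python sketch of that mkdir-then-write sequence follows.
The helper create_cpuset() is hypothetical, as are the mount point,
the cpuset name and the values in the commented usage line; on a real
system the directory must lie inside a mounted cpuset hierarchy and
the caller needs sufficient privileges.

```python
import os


def create_cpuset(parent_dir, name, cpus, mems):
    """Create a child cpuset directory and set its CPUs and Memory Nodes.

    'cpus' and 'mems' are strings in list format, such as "0-3" or "0".
    """
    path = os.path.join(parent_dir, name)
    os.mkdir(path)                      # a new cpuset is just a new directory
    for filename, value in (("cpuset.cpus", cpus), ("cpuset.mems", mems)):
        with open(os.path.join(path, filename), "w") as f:
            f.write(value)              # properties are set by writing files
    return path


# Hypothetical usage, assuming the hierarchy is mounted at /dev/cpuset:
# create_cpuset("/dev/cpuset", "batch", "0-3", "0")
```
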
The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

The attachment of each task, automatically inherited at fork by any
children of that task, to a cpuset allows organizing the work load
on a system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

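The three rules above can be expressed as a small checker.  The
Python sketch below is a user-space model only: the Cpuset class and
check_child() are hypothetical names, the exclusive rule is applied
separately to CPUs and to memory (an assumption about how "exclusive"
is meant), and an overlap is also rejected when the sibling rather
than the child is the exclusive one, since the rule then applies to
that sibling.

```python
from dataclasses import dataclass


@dataclass
class Cpuset:
    cpus: set
    mems: set
    cpu_exclusive: bool = False
    mem_exclusive: bool = False


def check_child(parent, child, siblings):
    """Return a list of rule violations for a proposed child cpuset."""
    problems = []
    # Rule 1: CPUs and Memory Nodes must be a subset of the parent's.
    if not (child.cpus <= parent.cpus and child.mems <= parent.mems):
        problems.append("not a subset of parent")
    # Rule 2: cannot be exclusive unless the parent is (checked per resource).
    if (child.cpu_exclusive and not parent.cpu_exclusive) or \
       (child.mem_exclusive and not parent.mem_exclusive):
        problems.append("exclusive but parent is not")
    # Rule 3: exclusive resources may not overlap any sibling.
    for sib in siblings:
        if (child.cpu_exclusive or sib.cpu_exclusive) and child.cpus & sib.cpus:
            problems.append("exclusive CPUs overlap a sibling")
        if (child.mem_exclusive or sib.mem_exclusive) and child.mems & sib.mems:
            problems.append("exclusive Memory Nodes overlap a sibling")
    return problems


root = Cpuset({0, 1, 2, 3}, {0, 1}, cpu_exclusive=True, mem_exclusive=True)
a = Cpuset({0, 1}, {0}, cpu_exclusive=True)
print(check_child(root, Cpuset({1, 2}, {1}), [a]))
```

Here the proposed child shares CPU 1 with the cpu-exclusive sibling
"a", so the third rule is reported as violated.
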
These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_map using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.  This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.


1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate at which the tasks in a cpuset are attempting to free up
in-use memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and with tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per-cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.

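To make the "running average with a half-life of 10 seconds" concrete,
here is a Python sketch of one possible filter with those properties.
It is an exponentially decaying rate estimate written for
illustration; it is not the kernel's filter, and the class name
PressureMeter, its methods and the floating-point arithmetic are all
assumptions.  Only the half-life and the "per second, times 1000"
scaling come from the text above.

```python
import math


class PressureMeter:
    """Decaying estimate of events per second (illustrative model only)."""

    def __init__(self, half_life=10.0):
        self.decay = math.log(2) / half_life   # per-second decay constant
        self.rate = 0.0                        # estimated events per second
        self.last = 0.0                        # time of the last update

    def _age(self, now):
        # Shrink the estimate so that it halves every half_life seconds.
        self.rate *= math.exp(-self.decay * (now - self.last))
        self.last = now

    def mark(self, now):
        """Record one direct reclaim attempt at time 'now' (in seconds)."""
        self._age(now)
        self.rate += self.decay

    def read(self, now):
        """Return the recent rate in reclaims per second, times 1000."""
        self._age(now)
        return int(self.rate * 1000)


meter = PressureMeter()
for i in range(20000):          # 100 reclaim attempts per second for 200 s
    meter.mark(i / 100.0)
print(meter.read(200.0))        # close to 100000
print(meter.read(210.0))        # about half of that, 10 seconds later
```

A monitor gets the current level with a single read() and needs no
history of its own, which is the property the section argues for.
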

1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related
in-kernel data structures.  They are called 'cpuset.memory_spread_page'
and 'cpuset.memory_spread_slab'.

If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as for inodes and dentries, evenly over all the nodes that the
faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect anonymous data segment or
stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing cpuset's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.

The implementation is simple.

Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PF_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

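The rotor behaviour can be modelled in a few lines of Python.  This is
a user-space illustration, not the kernel routine: the Task class, the
function mem_spread_node() and the initial rotor value of -1 are
assumptions made for the sketch; only the idea of a per-task rotor
stepping through mems_allowed comes from the text above.

```python
class Task:
    def __init__(self, mems_allowed):
        self.mems_allowed = sorted(mems_allowed)
        self.rotor = -1     # stands in for cpuset_mem_spread_rotor


def mem_spread_node(task):
    """Return the next allowed node after the rotor, wrapping around."""
    later = [node for node in task.mems_allowed if node > task.rotor]
    task.rotor = later[0] if later else task.mems_allowed[0]
    return task.rotor


task = Task({0, 2, 3})
print([mem_spread_node(task) for _ in range(5)])  # [0, 2, 3, 0, 2]
```

Successive allocations therefore visit the allowed nodes in turn,
which is what spreads the page cache or slab pages evenly.
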
This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that need
to access large file system data sets that need to be spread across
the several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.

1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched.c) automatically load balances
tasks.  If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced.  So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, except those
marked isolated using the kernel boot time "isolcpus=" argument.

This default load balancing across all CPUs is not well suited for
the following two situations:
 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "cpuset.sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
child cpusets should have this flag enabled.

When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
else anyway.

| 439 | There is an impedance mismatch here, between cpusets and sched domains. |
| 440 | Cpusets are hierarchical and nest. Sched domains are flat; they don't |
| 441 | overlap and each CPU is in at most one sched domain. |

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding. So if each of two partially
overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
form a single sched domain that is a superset of both. We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "cpuset.sched_load_balance" enabled,
and the sched domain configuration. If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above. In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary
to most cpuset flags.) When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(makes sure that all the CPUs in the cpus_allowed of that cpuset are
in the same sched domain.)

If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can. It will pick as fine a granularity partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled.

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system. This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.
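The idea of such a partition can be illustrated from user space. The
sketch below is not the kernel's algorithm; it only shows how
overlapping CPU sets collapse into pairwise disjoint ones. As an
assumption made for illustration, each load-balanced cpuset is given
as a decimal bitmask (bit N set means CPU N is in the cpuset), and
merge_domains is a hypothetical helper name. Shell arithmetic limits
this to small CPU numbers.

```shell
# merge_domains MASK...
# Prints the merged, pairwise disjoint masks, one per resulting sched
# domain. Any two input masks that share a CPU end up in the same
# output mask.
merge_domains() {
    out=""
    for m in "$@"; do
        kept=""
        for d in $out; do
            if [ $(( d & m )) -ne 0 ]; then
                # Overlap: absorb the existing domain into the new one.
                m=$(( d | m ))
            else
                kept="$kept $d"
            fi
        done
        out="$kept $m"
    done
    echo $out
}
```

For example, "merge_domains 3 6 48" (CPUs 0-1, CPUs 1-2, CPUs 4-5)
prints "7 48": the first two cpusets overlap on CPU 1 and are merged,
while the third stays separate.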

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:
 - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or the 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty
   CPUs and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
set up - one sched domain for each element (struct cpumask) in the
partition.

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.


1.8 What is sched_relax_domain_level ?
--------------------------------------

Within a sched domain, the scheduler migrates tasks in two ways: periodic
load balancing on the tick, and at the time of some scheduling events.

When a task is woken up, the scheduler tries to move the task to an idle
CPU. For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and is idle, then the
scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
extra tasks from other busy CPUs to help them before it goes idle.

Of course, finding movable tasks and/or idle CPUs has some searching
cost, so the scheduler might not search all CPUs in the domain
every time. In fact, on some architectures, the searching range on
events is limited to the same socket or node where the CPU is located,
while the load balance on tick searches all.

For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't migrate
woken task B from X to Z since Z is out of its searching range.
As a result, task B on CPU X needs to wait for task A or wait for load
balance on the next tick. For some applications in special situations,
waiting 1 tick may be too long.

The 'cpuset.sched_relax_domain_level' file allows you to request changing
this searching range as you like. This file takes an int value which
indicates the size of the searching range in levels, ideally as follows.
The initial value is -1, which indicates that the cpuset has no request.

  -1  : no request. use system default or follow request of others.
   0  : no search.
   1  : search siblings (hyperthreads in a core).
   2  : search cores in a package.
   3  : search cpus in a node [= system wide on non-NUMA system]
 ( 4  : search nodes in a chunk of node [on NUMA system] )
 ( 5  : search system wide [on NUMA system] )

The system default is architecture dependent. The system default
can be changed using the relax_domain_level= boot parameter.
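As a hedged sketch, a request can be written to the file like any
other cpuset file. The helper name set_relax_level is hypothetical,
and the check below only rejects values that are obviously not
integers of -1 or above; which upper levels are accepted is decided
by the kernel, not by this script.

```shell
# set_relax_level CPUSET_DIR LEVEL
# Write LEVEL to cpuset.sched_relax_domain_level in CPUSET_DIR after a
# basic sanity check. Returns non-zero if the value is malformed or the
# write fails.
set_relax_level() {
    dir=$1
    level=$2
    case $level in
        -1|[0-9]|[0-9][0-9]) ;;
        *) echo "invalid level: $level" >&2; return 1 ;;
    esac
    /bin/echo "$level" > "$dir/cpuset.sched_relax_domain_level"
}
```

For example: set_relax_level /sys/fs/cgroup/cpuset/my_cpuset 2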

This file is per-cpuset and affects the sched domain that the cpuset
belongs to. Therefore if the flag 'cpuset.sched_load_balance' of a cpuset
is disabled, then 'cpuset.sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.

If multiple cpusets are overlapping and hence they form a single sched
domain, the largest value among those is used. Be careful: if one
requests 0 and the others are -1, then 0 is used.
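The "largest value wins" rule can be sketched as a plain maximum over
the values requested by the overlapping cpusets. The helper name
effective_relax_level is hypothetical and the function only models
the rule stated above; it does not query the kernel.

```shell
# effective_relax_level LEVEL...
# Print the level that applies to a sched domain formed by cpusets
# requesting the given levels. A result of -1 means no cpuset made a
# request, so the system default applies.
effective_relax_level() {
    max=-1
    for level in "$@"; do
        if [ "$level" -gt "$max" ]; then
            max=$level
        fi
    done
    echo "$max"
}
```

For example, "effective_relax_level -1 0 -1" prints 0, which is the
case the warning above describes.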

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.

If your situation is:
 - The migration costs between each cpu can be assumed considerably
   small (for you) due to your special application's behavior or
   special hardware support for CPU cache etc.
 - The searching cost doesn't have an impact (for you) or you can make
   the searching cost small enough by keeping cpusets compact etc.
 - Low latency is required even if it sacrifices cache hit rate etc.
then increasing 'cpuset.sched_relax_domain_level' would benefit you.


1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement. If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset. If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change). If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately. Similarly,
if a task's pid is written to another cpuset's 'tasks' file, then its
allowed CPU placement is changed immediately. If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.
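One way to observe the resulting placement from the shell is to read
the Cpus_allowed_list and Mems_allowed_list fields of
/proc/<pid>/status. The sketch below assumes a Linux /proc; the helper
name task_placement and its optional second argument (an alternative
proc root, useful for trying the function on a copy of a status file)
are hypothetical.

```shell
# task_placement PID [PROCROOT]
# Print the CPU and Memory Node lists currently allowed to task PID,
# as reported in its status file. PROCROOT defaults to /proc.
task_placement() {
    status="${2:-/proc}/$1/status"
    grep -E '^(Cpus|Mems)_allowed_list:' "$status"
}
```

For example, "task_placement $$" shows the placement of the current
shell before and after it is attached to a different cpuset.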

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated on, so long as it remains allocated, even if the
cpuset's memory placement policy 'cpuset.mems' subsequently changes.
If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.

Also if 'cpuset.memory_migrate' is set true, then if that cpuset's
'cpuset.mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'cpuset.mems',
will be moved to nodes in the new setting of 'cpuset.mems'.
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'cpuset.mems' setting, will not be moved.
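A minimal sketch of using this from the shell: enable the flag first,
then change the Memory Nodes, so that already allocated pages follow
the new setting. The helper name move_cpuset_memory is hypothetical,
and the node list in the usage line is only an example.

```shell
# move_cpuset_memory CPUSET_DIR MEMS
# Turn on page migration for the cpuset in CPUSET_DIR, then set its
# Memory Nodes to MEMS (same list syntax as cpuset.cpus).
move_cpuset_memory() {
    dir=$1
    mems=$2
    /bin/echo 1 > "$dir/cpuset.memory_migrate" || return 1
    /bin/echo "$mems" > "$dir/cpuset.mems"
}
```

For example: move_cpuset_memory /sys/fs/cgroup/cpuset/Charlie 2-3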

There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus. But the moving of some (or all) tasks might fail if
cpuset is bound with another cgroup subsystem which has some restrictions
on task attaching. In this failing case, those tasks will stay
in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs. When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well. In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.
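To see ahead of time where tasks would land in the hotplug case, the
"nearest ancestor with non-empty cpus" rule can be approximated by
walking up the directory tree. This is a user-space sketch under the
assumption that the hierarchy is visible as directories containing
'cpuset.cpus' files; the helper name nearest_ancestor_with_cpus is
hypothetical and the kernel does not use this code.

```shell
# nearest_ancestor_with_cpus CPUSET_DIR ROOT
# Starting at the parent of CPUSET_DIR, print the first directory on
# the way up to ROOT whose cpuset.cpus is non-empty. Returns non-zero
# if none is found.
nearest_ancestor_with_cpus() {
    dir=$(dirname "$1")
    root=$2
    while :; do
        cpus=$(cat "$dir/cpuset.cpus" 2>/dev/null)
        if [ -n "$cpus" ]; then
            echo "$dir"
            return 0
        fi
        # Stop at the top cpuset, or at / if ROOT was never reached.
        if [ "$dir" = "$root" ] || [ "$dir" = "/" ]; then
            return 1
        fi
        dir=$(dirname "$dir")
    done
}
```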

There is a second exception to the above. GFP_ATOMIC requests are
kernel internal allocations that must be satisfied, immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC alloc fails. If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it. It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /sys/fs/cgroup/cpuset
 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /sys/fs/cgroup/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /sys/fs/cgroup/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset:

  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
  cd /sys/fs/cgroup/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpuset.cpus
  /bin/echo 1 > cpuset.mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset
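The creation steps of the sequence above can be wrapped in a small
function so that a failed write is noticed straight away. This is a
sketch only; make_cpuset is a hypothetical helper name, and the mount
point is an argument rather than being hard-coded.

```shell
# make_cpuset ROOT NAME CPUS MEMS
# Create the cpuset NAME below ROOT and give it the CPU list CPUS and
# the Memory Node list MEMS. Stops at the first step that fails.
make_cpuset() {
    root=$1
    name=$2
    cpus=$3
    mems=$4
    mkdir "$root/$name" || return 1
    /bin/echo "$cpus" > "$root/$name/cpuset.cpus" || return 1
    /bin/echo "$mems" > "$root/$name/cpuset.mems" || return 1
}
```

For example, "make_cpuset /sys/fs/cgroup/cpuset Charlie 2-3 1"
followed by "/bin/echo $$ > /sys/fs/cgroup/cpuset/Charlie/tasks"
corresponds to the Charlie example.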

There are ways to query or modify cpusets:
 - via the cpuset file system directly, using the various cd, mkdir, echo,
   cat, rmdir commands from the shell, or their equivalent from C.
 - via the C library libcpuset.
 - via the C library libcgroup.
   (http://sourceforge.net/projects/libcg/)
 - via the python application cset.
   (http://code.google.com/p/cpuset/)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying, using the cpusets can be done through the cpuset
virtual filesystem.

To mount it, type:
# mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset

Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /sys/fs/cgroup/cpuset:
# cd /sys/fs/cgroup/cpuset
# mkdir my_cpuset

Now you want to do something with this cpuset.
# cd my_cpuset

In this directory you can find several files:
# ls
cgroup.clone_children  cpuset.memory_pressure
cgroup.event_control   cpuset.memory_spread_page
cgroup.procs           cpuset.memory_spread_slab
cpuset.cpu_exclusive   cpuset.mems
cpuset.cpus            cpuset.sched_load_balance
cpuset.mem_exclusive   cpuset.sched_relax_domain_level
cpuset.mem_hardwall    notify_on_release
cpuset.memory_migrate  tasks

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties. By writing to these files you can manipulate
the cpuset.
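To read all of the cpuset-specific files in one go, a loop over the
'cpuset.*' names is enough. The helper name dump_cpuset is
hypothetical; it prints one name=value line per file and ignores the
generic cgroup files such as 'tasks'.

```shell
# dump_cpuset CPUSET_DIR
# Print every cpuset.* file in CPUSET_DIR as name=value.
dump_cpuset() {
    for f in "$1"/cpuset.*; do
        [ -f "$f" ] || continue
        printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
}
```

For example: dump_cpuset /sys/fs/cgroup/cpuset/my_cpuset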

Set some flags:
# /bin/echo 1 > cpuset.cpu_exclusive

Add some cpus:
# /bin/echo 0-7 > cpuset.cpus

Add some mems:
# /bin/echo 0-7 > cpuset.mems

Now attach your shell to this cpuset:
# /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory.
# mkdir my_sub_cs

To remove a cpuset, just use rmdir:
# rmdir my_sub_cs
This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).
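The two "in use" conditions can be checked before calling rmdir. The
following is a sketch with a hypothetical helper name, cpuset_in_use;
it looks for child directories and for a non-empty 'tasks' file, which
is how those conditions appear in the cpuset file system.

```shell
# cpuset_in_use CPUSET_DIR
# Succeeds (returns 0) if the cpuset has child cpusets or attached
# tasks, that is, if rmdir on it would be expected to fail.
cpuset_in_use() {
    dir=$1
    for sub in "$dir"/*/; do
        [ -d "$sub" ] && return 0
    done
    [ -n "$(cat "$dir/tasks" 2>/dev/null)" ]
}
```

For example: cpuset_in_use my_sub_cs || rmdir my_sub_cs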

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command

mount -t cpuset X /sys/fs/cgroup/cpuset

is equivalent to

mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset
echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories:

# /bin/echo 1-4 > cpuset.cpus		-> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset:

# /bin/echo 1-4,6 > cpuset.cpus	-> set cpus list to cpus 1,2,3,4,6

Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.

To remove all the CPUs:

# /bin/echo "" > cpuset.cpus		-> clear cpus list
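When scripting around these files it can help to turn a list such as
"1-4,6" into individual CPU numbers. The sketch below handles only the
two forms shown above (single numbers and N-M ranges, separated by
commas); expand_cpulist is a hypothetical helper name.

```shell
# expand_cpulist LIST
# Print the CPU numbers named by LIST, separated by spaces.
expand_cpulist() {
    result=""
    old_ifs=$IFS
    IFS=,
    set -- $1          # split the list on commas
    IFS=$old_ifs
    for part in "$@"; do
        case $part in
            *-*)
                lo=${part%-*}
                hi=${part#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    result="$result $i"
                    i=$((i + 1))
                done
                ;;
            "") ;;
            *) result="$result $part" ;;
        esac
    done
    echo $result
}
```

For example, "expand_cpulist 1-4,6" prints "1 2 3 4 6".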

2.3 Setting flags
-----------------

The syntax is very simple:

# /bin/echo 1 > cpuset.cpu_exclusive 	-> set flag 'cpuset.cpu_exclusive'
# /bin/echo 0 > cpuset.cpu_exclusive 	-> unset flag 'cpuset.cpu_exclusive'

2.4 Attaching processes
-----------------------

# /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:

# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
	...
# /bin/echo PIDn > tasks
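The one-pid-per-write rule is easy to follow with a loop. This is a
sketch; attach_pids is a hypothetical helper name. It reports each pid
that could not be attached and returns non-zero if any write failed.

```shell
# attach_pids TASKS_FILE PID...
# Write each PID to TASKS_FILE with its own write, one after another.
attach_pids() {
    tasks_file=$1
    shift
    failed=0
    for pid in "$@"; do
        if ! /bin/echo "$pid" > "$tasks_file"; then
            echo "could not attach $pid" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

For example: attach_pids /sys/fs/cgroup/cpuset/my_cpuset/tasks 1201 1202
(the pids are placeholders).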


3. Questions
============

Q: What's up with this '/bin/echo' ?
A: bash's builtin 'echo' command does not check calls to write() for
   errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.
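With /bin/echo the failure shows up in the exit status, so a script
can test it. The sketch below wraps that check; write_checked is a
hypothetical helper name.

```shell
# write_checked VALUE FILE
# Write VALUE to FILE with /bin/echo and report a failed write on
# stderr. Returns non-zero on failure.
write_checked() {
    if /bin/echo "$1" > "$2"; then
        return 0
    fi
    echo "write of '$1' to $2 failed" >&2
    return 1
}
```

For example: write_checked 1 cpuset.cpu_exclusive || exit 1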

Q: When I attach processes, only the first one on the line really gets
   attached !
A: We can only return one error code per call to write(). So you should
   write only ONE pid per call.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset