| CPUSETS |
| ------- |
| |
| Copyright (C) 2004 BULL SA. |
| Written by Simon.Derr@bull.net |
| |
| Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. |
| Modified by Paul Jackson <pj@sgi.com> |
| Modified by Christoph Lameter <clameter@sgi.com> |
| Modified by Paul Menage <menage@google.com> |
| |
| CONTENTS: |
| ========= |
| |
| 1. Cpusets |
| 1.1 What are cpusets ? |
| 1.2 Why are cpusets needed ? |
| 1.3 How are cpusets implemented ? |
| 1.4 What are exclusive cpusets ? |
| 1.5 What is memory_pressure ? |
| 1.6 What is memory spread ? |
| 1.7 What is sched_load_balance ? |
| 1.8 How do I use cpusets ? |
| 2. Usage Examples and Syntax |
| 2.1 Basic Usage |
| 2.2 Adding/removing cpus |
| 2.3 Setting flags |
| 2.4 Attaching processes |
| 3. Questions |
| 4. Contact |
| |
| 1. Cpusets |
| ========== |
| |
| 1.1 What are cpusets ? |
| ---------------------- |
| |
| Cpusets provide a mechanism for assigning a set of CPUs and Memory |
| Nodes to a set of tasks. In this document "Memory Node" refers to |
| an on-line node that contains memory. |
| |
| Cpusets constrain the CPU and Memory placement of tasks to only |
| the resources within a tasks current cpuset. They form a nested |
| hierarchy visible in a virtual file system. These are the essential |
| hooks, beyond what is already present, required to manage dynamic |
| job placement on large systems. |
| |
| Cpusets use the generic cgroup subsystem described in |
| Documentation/cgroup.txt. |
| |
| Requests by a task, using the sched_setaffinity(2) system call to |
| include CPUs in its CPU affinity mask, and using the mbind(2) and |
| set_mempolicy(2) system calls to include Memory Nodes in its memory |
| policy, are both filtered through that tasks cpuset, filtering out any |
| CPUs or Memory Nodes not in that cpuset. The scheduler will not |
| schedule a task on a CPU that is not allowed in its cpus_allowed |
| vector, and the kernel page allocator will not allocate a page on a |
| node that is not allowed in the requesting tasks mems_allowed vector. |
| |
| User level code may create and destroy cpusets by name in the cgroup |
| virtual file system, manage the attributes and permissions of these |
| cpusets and which CPUs and Memory Nodes are assigned to each cpuset, |
| specify and query to which cpuset a task is assigned, and list the |
| task pids assigned to a cpuset. |
| |
| |
| 1.2 Why are cpusets needed ? |
| ---------------------------- |
| |
| The management of large computer systems, with many processors (CPUs), |
| complex memory cache hierarchies and multiple Memory Nodes having |
| non-uniform access times (NUMA) presents additional challenges for |
| the efficient scheduling and memory placement of processes. |
| |
| Frequently more modest sized systems can be operated with adequate |
| efficiency just by letting the operating system automatically share |
| the available CPU and Memory resources amongst the requesting tasks. |
| |
| But larger systems, which benefit more from careful processor and |
| memory placement to reduce memory access times and contention, |
| and which typically represent a larger investment for the customer, |
| can benefit from explicitly placing jobs on properly sized subsets of |
| the system. |
| |
| This can be especially valuable on: |
| |
| * Web Servers running multiple instances of the same web application, |
| * Servers running different applications (for instance, a web server |
| and a database), or |
| * NUMA systems running large HPC applications with demanding |
| performance characteristics. |
| |
| These subsets, or "soft partitions" must be able to be dynamically |
| adjusted, as the job mix changes, without impacting other concurrently |
| executing jobs. The location of the running jobs pages may also be moved |
| when the memory locations are changed. |
| |
| The kernel cpuset patch provides the minimum essential kernel |
| mechanisms required to efficiently implement such subsets. It |
| leverages existing CPU and Memory Placement facilities in the Linux |
| kernel to avoid any additional impact on the critical scheduler or |
| memory allocator code. |
| |
| |
| 1.3 How are cpusets implemented ? |
| --------------------------------- |
| |
| Cpusets provide a Linux kernel mechanism to constrain which CPUs and |
| Memory Nodes are used by a process or set of processes. |
| |
| The Linux kernel already has a pair of mechanisms to specify on which |
| CPUs a task may be scheduled (sched_setaffinity) and on which Memory |
| Nodes it may obtain memory (mbind, set_mempolicy). |
| |
| Cpusets extends these two mechanisms as follows: |
| |
| - Cpusets are sets of allowed CPUs and Memory Nodes, known to the |
| kernel. |
| - Each task in the system is attached to a cpuset, via a pointer |
| in the task structure to a reference counted cgroup structure. |
| - Calls to sched_setaffinity are filtered to just those CPUs |
| allowed in that tasks cpuset. |
| - Calls to mbind and set_mempolicy are filtered to just |
| those Memory Nodes allowed in that tasks cpuset. |
| - The root cpuset contains all the systems CPUs and Memory |
| Nodes. |
| - For any cpuset, one can define child cpusets containing a subset |
| of the parents CPU and Memory Node resources. |
| - The hierarchy of cpusets can be mounted at /dev/cpuset, for |
| browsing and manipulation from user space. |
| - A cpuset may be marked exclusive, which ensures that no other |
| cpuset (except direct ancestors and descendents) may contain |
| any overlapping CPUs or Memory Nodes. |
| - You can list all the tasks (by pid) attached to any cpuset. |
| |
| The implementation of cpusets requires a few, simple hooks |
| into the rest of the kernel, none in performance critical paths: |
| |
| - in init/main.c, to initialize the root cpuset at system boot. |
| - in fork and exit, to attach and detach a task from its cpuset. |
| - in sched_setaffinity, to mask the requested CPUs by what's |
| allowed in that tasks cpuset. |
| - in sched.c migrate_all_tasks(), to keep migrating tasks within |
| the CPUs allowed by their cpuset, if possible. |
| - in the mbind and set_mempolicy system calls, to mask the requested |
| Memory Nodes by what's allowed in that tasks cpuset. |
| - in page_alloc.c, to restrict memory to allowed nodes. |
| - in vmscan.c, to restrict page recovery to the current cpuset. |
| |
| You should mount the "cgroup" filesystem type in order to enable |
| browsing and modifying the cpusets presently known to the kernel. No |
| new system calls are added for cpusets - all support for querying and |
| modifying cpusets is via this cpuset file system. |
| |
| The /proc/<pid>/status file for each task has two added lines, |
| displaying the tasks cpus_allowed (on which CPUs it may be scheduled) |
| and mems_allowed (on which Memory Nodes it may obtain memory), |
| in the format seen in the following example: |
| |
| Cpus_allowed: ffffffff,ffffffff,ffffffff,ffffffff |
| Mems_allowed: ffffffff,ffffffff |
| |
| Each cpuset is represented by a directory in the cgroup file system |
| containing (on top of the standard cgroup files) the following |
| files describing that cpuset: |
| |
| - cpus: list of CPUs in that cpuset |
| - mems: list of Memory Nodes in that cpuset |
| - memory_migrate flag: if set, move pages to cpusets nodes |
| - cpu_exclusive flag: is cpu placement exclusive? |
| - mem_exclusive flag: is memory placement exclusive? |
| - memory_pressure: measure of how much paging pressure in cpuset |
| |
| In addition, the root cpuset only has the following file: |
| - memory_pressure_enabled flag: compute memory_pressure? |
| |
| New cpusets are created using the mkdir system call or shell |
| command. The properties of a cpuset, such as its flags, allowed |
| CPUs and Memory Nodes, and attached tasks, are modified by writing |
| to the appropriate file in that cpusets directory, as listed above. |
| |
| The named hierarchical structure of nested cpusets allows partitioning |
| a large system into nested, dynamically changeable, "soft-partitions". |
| |
| The attachment of each task, automatically inherited at fork by any |
| children of that task, to a cpuset allows organizing the work load |
| on a system into related sets of tasks such that each set is constrained |
| to using the CPUs and Memory Nodes of a particular cpuset. A task |
| may be re-attached to any other cpuset, if allowed by the permissions |
| on the necessary cpuset file system directories. |
| |
| Such management of a system "in the large" integrates smoothly with |
| the detailed placement done on individual tasks and memory regions |
| using the sched_setaffinity, mbind and set_mempolicy system calls. |
| |
| The following rules apply to each cpuset: |
| |
| - Its CPUs and Memory Nodes must be a subset of its parents. |
| - It can only be marked exclusive if its parent is. |
| - If its cpu or memory is exclusive, they may not overlap any sibling. |
| |
| These rules, and the natural hierarchy of cpusets, enable efficient |
| enforcement of the exclusive guarantee, without having to scan all |
| cpusets every time any of them change to ensure nothing overlaps a |
| exclusive cpuset. Also, the use of a Linux virtual file system (vfs) |
| to represent the cpuset hierarchy provides for a familiar permission |
| and name space for cpusets, with a minimum of additional kernel code. |
| |
| The cpus and mems files in the root (top_cpuset) cpuset are |
| read-only. The cpus file automatically tracks the value of |
| cpu_online_map using a CPU hotplug notifier, and the mems file |
| automatically tracks the value of node_states[N_MEMORY]--i.e., |
| nodes with memory--using the cpuset_track_online_nodes() hook. |
| |
| |
| 1.4 What are exclusive cpusets ? |
| -------------------------------- |
| |
| If a cpuset is cpu or mem exclusive, no other cpuset, other than |
| a direct ancestor or descendent, may share any of the same CPUs or |
| Memory Nodes. |
| |
| A cpuset that is mem_exclusive restricts kernel allocations for |
| page, buffer and other data commonly shared by the kernel across |
| multiple users. All cpusets, whether mem_exclusive or not, restrict |
| allocations of memory for user space. This enables configuring a |
| system so that several independent jobs can share common kernel data, |
| such as file system pages, while isolating each jobs user allocation in |
| its own cpuset. To do this, construct a large mem_exclusive cpuset to |
| hold all the jobs, and construct child, non-mem_exclusive cpusets for |
| each individual job. Only a small amount of typical kernel memory, |
| such as requests from interrupt handlers, is allowed to be taken |
| outside even a mem_exclusive cpuset. |
| |
| |
| 1.5 What is memory_pressure ? |
| ----------------------------- |
| The memory_pressure of a cpuset provides a simple per-cpuset metric |
| of the rate that the tasks in a cpuset are attempting to free up in |
| use memory on the nodes of the cpuset to satisfy additional memory |
| requests. |
| |
| This enables batch managers monitoring jobs running in dedicated |
| cpusets to efficiently detect what level of memory pressure that job |
| is causing. |
| |
| This is useful both on tightly managed systems running a wide mix of |
| submitted jobs, which may choose to terminate or re-prioritize jobs that |
| are trying to use more memory than allowed on the nodes assigned them, |
| and with tightly coupled, long running, massively parallel scientific |
| computing jobs that will dramatically fail to meet required performance |
| goals if they start to use more memory than allowed to them. |
| |
| This mechanism provides a very economical way for the batch manager |
| to monitor a cpuset for signs of memory pressure. It's up to the |
| batch manager or other user code to decide what to do about it and |
| take action. |
| |
| ==> Unless this feature is enabled by writing "1" to the special file |
| /dev/cpuset/memory_pressure_enabled, the hook in the rebalance |
| code of __alloc_pages() for this metric reduces to simply noticing |
| that the cpuset_memory_pressure_enabled flag is zero. So only |
| systems that enable this feature will compute the metric. |
| |
| Why a per-cpuset, running average: |
| |
| Because this meter is per-cpuset, rather than per-task or mm, |
| the system load imposed by a batch scheduler monitoring this |
| metric is sharply reduced on large systems, because a scan of |
| the tasklist can be avoided on each set of queries. |
| |
| Because this meter is a running average, instead of an accumulating |
| counter, a batch scheduler can detect memory pressure with a |
| single read, instead of having to read and accumulate results |
| for a period of time. |
| |
| Because this meter is per-cpuset rather than per-task or mm, |
| the batch scheduler can obtain the key information, memory |
| pressure in a cpuset, with a single read, rather than having to |
| query and accumulate results over all the (dynamically changing) |
| set of tasks in the cpuset. |
| |
| A per-cpuset simple digital filter (requires a spinlock and 3 words |
| of data per-cpuset) is kept, and updated by any task attached to that |
| cpuset, if it enters the synchronous (direct) page reclaim code. |
| |
| A per-cpuset file provides an integer number representing the recent |
| (half-life of 10 seconds) rate of direct page reclaims caused by |
| the tasks in the cpuset, in units of reclaims attempted per second, |
| times 1000. |
| |
| |
| 1.6 What is memory spread ? |
| --------------------------- |
| There are two boolean flag files per cpuset that control where the |
| kernel allocates pages for the file system buffers and related in |
| kernel data structures. They are called 'memory_spread_page' and |
| 'memory_spread_slab'. |
| |
| If the per-cpuset boolean flag file 'memory_spread_page' is set, then |
| the kernel will spread the file system buffers (page cache) evenly |
| over all the nodes that the faulting task is allowed to use, instead |
| of preferring to put those pages on the node where the task is running. |
| |
| If the per-cpuset boolean flag file 'memory_spread_slab' is set, |
| then the kernel will spread some file system related slab caches, |
| such as for inodes and dentries evenly over all the nodes that the |
| faulting task is allowed to use, instead of preferring to put those |
| pages on the node where the task is running. |
| |
| The setting of these flags does not affect anonymous data segment or |
| stack segment pages of a task. |
| |
| By default, both kinds of memory spreading are off, and memory |
| pages are allocated on the node local to where the task is running, |
| except perhaps as modified by the tasks NUMA mempolicy or cpuset |
| configuration, so long as sufficient free memory pages are available. |
| |
| When new cpusets are created, they inherit the memory spread settings |
| of their parent. |
| |
| Setting memory spreading causes allocations for the affected page |
| or slab caches to ignore the tasks NUMA mempolicy and be spread |
| instead. Tasks using mbind() or set_mempolicy() calls to set NUMA |
| mempolicies will not notice any change in these calls as a result of |
| their containing tasks memory spread settings. If memory spreading |
| is turned off, then the currently specified NUMA mempolicy once again |
| applies to memory page allocations. |
| |
| Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag |
| files. By default they contain "0", meaning that the feature is off |
| for that cpuset. If a "1" is written to that file, then that turns |
| the named feature on. |
| |
| The implementation is simple. |
| |
| Setting the flag 'memory_spread_page' turns on a per-process flag |
| PF_SPREAD_PAGE for each task that is in that cpuset or subsequently |
| joins that cpuset. The page allocation calls for the page cache |
| is modified to perform an inline check for this PF_SPREAD_PAGE task |
| flag, and if set, a call to a new routine cpuset_mem_spread_node() |
| returns the node to prefer for the allocation. |
| |
| Similarly, setting 'memory_spread_cache' turns on the flag |
| PF_SPREAD_SLAB, and appropriately marked slab caches will allocate |
| pages from the node returned by cpuset_mem_spread_node(). |
| |
| The cpuset_mem_spread_node() routine is also simple. It uses the |
| value of a per-task rotor cpuset_mem_spread_rotor to select the next |
| node in the current tasks mems_allowed to prefer for the allocation. |
| |
| This memory placement policy is also known (in other contexts) as |
| round-robin or interleave. |
| |
| This policy can provide substantial improvements for jobs that need |
| to place thread local data on the corresponding node, but that need |
| to access large file system data sets that need to be spread across |
| the several nodes in the jobs cpuset in order to fit. Without this |
| policy, especially for jobs that might have one thread reading in the |
| data set, the memory allocation across the nodes in the jobs cpuset |
| can become very uneven. |
| |
| 1.7 What is sched_load_balance ? |
| -------------------------------- |
| |
| The kernel scheduler (kernel/sched.c) automatically load balances |
| tasks. If one CPU is underutilized, kernel code running on that |
| CPU will look for tasks on other more overloaded CPUs and move those |
| tasks to itself, within the constraints of such placement mechanisms |
| as cpusets and sched_setaffinity. |
| |
| The algorithmic cost of load balancing and its impact on key shared |
| kernel data structures such as the task list increases more than |
| linearly with the number of CPUs being balanced. So the scheduler |
| has support to partition the systems CPUs into a number of sched |
| domains such that it only load balances within each sched domain. |
| Each sched domain covers some subset of the CPUs in the system; |
| no two sched domains overlap; some CPUs might not be in any sched |
| domain and hence won't be load balanced. |
| |
| Put simply, it costs less to balance between two smaller sched domains |
| than one big one, but doing so means that overloads in one of the |
| two domains won't be load balanced to the other one. |
| |
| By default, there is one sched domain covering all CPUs, except those |
| marked isolated using the kernel boot time "isolcpus=" argument. |
| |
| This default load balancing across all CPUs is not well suited for |
| the following two situations: |
| 1) On large systems, load balancing across many CPUs is expensive. |
| If the system is managed using cpusets to place independent jobs |
| on separate sets of CPUs, full load balancing is unnecessary. |
| 2) Systems supporting realtime on some CPUs need to minimize |
| system overhead on those CPUs, including avoiding task load |
| balancing if that is not needed. |
| |
| When the per-cpuset flag "sched_load_balance" is enabled (the default |
| setting), it requests that all the CPUs in that cpusets allowed 'cpus' |
| be contained in a single sched domain, ensuring that load balancing |
| can move a task (not otherwised pinned, as by sched_setaffinity) |
| from any CPU in that cpuset to any other. |
| |
| When the per-cpuset flag "sched_load_balance" is disabled, then the |
| scheduler will avoid load balancing across the CPUs in that cpuset, |
| --except-- in so far as is necessary because some overlapping cpuset |
| has "sched_load_balance" enabled. |
| |
| So, for example, if the top cpuset has the flag "sched_load_balance" |
| enabled, then the scheduler will have one sched domain covering all |
| CPUs, and the setting of the "sched_load_balance" flag in any other |
| cpusets won't matter, as we're already fully load balancing. |
| |
| Therefore in the above two situations, the top cpuset flag |
| "sched_load_balance" should be disabled, and only some of the smaller, |
| child cpusets have this flag enabled. |
| |
| When doing this, you don't usually want to leave any unpinned tasks in |
| the top cpuset that might use non-trivial amounts of CPU, as such tasks |
| may be artificially constrained to some subset of CPUs, depending on |
| the particulars of this flag setting in descendent cpusets. Even if |
| such a task could use spare CPU cycles in some other CPUs, the kernel |
| scheduler might not consider the possibility of load balancing that |
| task to that underused CPU. |
| |
| Of course, tasks pinned to a particular CPU can be left in a cpuset |
| that disables "sched_load_balance" as those tasks aren't going anywhere |
| else anyway. |
| |
| There is an impedance mismatch here, between cpusets and sched domains. |
| Cpusets are hierarchical and nest. Sched domains are flat; they don't |
| overlap and each CPU is in at most one sched domain. |
| |
| It is necessary for sched domains to be flat because load balancing |
| across partially overlapping sets of CPUs would risk unstable dynamics |
| that would be beyond our understanding. So if each of two partially |
| overlapping cpusets enables the flag 'sched_load_balance', then we |
| form a single sched domain that is a superset of both. We won't move |
| a task to a CPU outside it cpuset, but the scheduler load balancing |
| code might waste some compute cycles considering that possibility. |
| |
| This mismatch is why there is not a simple one-to-one relation |
| between which cpusets have the flag "sched_load_balance" enabled, |
| and the sched domain configuration. If a cpuset enables the flag, it |
| will get balancing across all its CPUs, but if it disables the flag, |
| it will only be assured of no load balancing if no other overlapping |
| cpuset enables the flag. |
| |
| If two cpusets have partially overlapping 'cpus' allowed, and only |
| one of them has this flag enabled, then the other may find its |
| tasks only partially load balanced, just on the overlapping CPUs. |
| This is just the general case of the top_cpuset example given a few |
| paragraphs above. In the general case, as in the top cpuset case, |
| don't leave tasks that might use non-trivial amounts of CPU in |
| such partially load balanced cpusets, as they may be artificially |
| constrained to some subset of the CPUs allowed to them, for lack of |
| load balancing to the other CPUs. |
| |
| 1.7.1 sched_load_balance implementation details. |
| ------------------------------------------------ |
| |
| The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary |
| to most cpuset flags.) When enabled for a cpuset, the kernel will |
| ensure that it can load balance across all the CPUs in that cpuset |
| (makes sure that all the CPUs in the cpus_allowed of that cpuset are |
| in the same sched domain.) |
| |
| If two overlapping cpusets both have 'sched_load_balance' enabled, |
| then they will be (must be) both in the same sched domain. |
| |
| If, as is the default, the top cpuset has 'sched_load_balance' enabled, |
| then by the above that means there is a single sched domain covering |
| the whole system, regardless of any other cpuset settings. |
| |
| The kernel commits to user space that it will avoid load balancing |
| where it can. It will pick as fine a granularity partition of sched |
| domains as it can while still providing load balancing for any set |
| of CPUs allowed to a cpuset having 'sched_load_balance' enabled. |
| |
| The internal kernel cpuset to scheduler interface passes from the |
| cpuset code to the scheduler code a partition of the load balanced |
| CPUs in the system. This partition is a set of subsets (represented |
| as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all |
| the CPUs that must be load balanced. |
| |
| Whenever the 'sched_load_balance' flag changes, or CPUs come or go |
| from a cpuset with this flag enabled, or a cpuset with this flag |
| enabled is removed, the cpuset code builds a new such partition and |
| passes it to the scheduler sched domain setup code, to have the sched |
| domains rebuilt as necessary. |
| |
| This partition exactly defines what sched domains the scheduler should |
| setup - one sched domain for each element (cpumask_t) in the partition. |
| |
| The scheduler remembers the currently active sched domain partitions. |
| When the scheduler routine partition_sched_domains() is invoked from |
| the cpuset code to update these sched domains, it compares the new |
| partition requested with the current, and updates its sched domains, |
| removing the old and adding the new, for each change. |
| |
| 1.8 How do I use cpusets ? |
| -------------------------- |
| |
| In order to minimize the impact of cpusets on critical kernel |
| code, such as the scheduler, and due to the fact that the kernel |
| does not support one task updating the memory placement of another |
| task directly, the impact on a task of changing its cpuset CPU |
| or Memory Node placement, or of changing to which cpuset a task |
| is attached, is subtle. |
| |
| If a cpuset has its Memory Nodes modified, then for each task attached |
| to that cpuset, the next time that the kernel attempts to allocate |
| a page of memory for that task, the kernel will notice the change |
| in the tasks cpuset, and update its per-task memory placement to |
| remain within the new cpusets memory placement. If the task was using |
| mempolicy MPOL_BIND, and the nodes to which it was bound overlap with |
| its new cpuset, then the task will continue to use whatever subset |
| of MPOL_BIND nodes are still allowed in the new cpuset. If the task |
| was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed |
| in the new cpuset, then the task will be essentially treated as if it |
| was MPOL_BIND bound to the new cpuset (even though its numa placement, |
| as queried by get_mempolicy(), doesn't change). If a task is moved |
| from one cpuset to another, then the kernel will adjust the tasks |
| memory placement, as above, the next time that the kernel attempts |
| to allocate a page of memory for that task. |
| |
| If a cpuset has its CPUs modified, then each task using that |
| cpuset does _not_ change its behavior automatically. In order to |
| minimize the impact on the critical scheduling code in the kernel, |
| tasks will continue to use their prior CPU placement until they |
| are rebound to their cpuset, by rewriting their pid to the 'tasks' |
| file of their cpuset. If a task had been bound to some subset of its |
| cpuset using the sched_setaffinity() call, and if any of that subset |
| is still allowed in its new cpuset settings, then the task will be |
| restricted to the intersection of the CPUs it was allowed on before, |
| and its new cpuset CPU placement. If, on the other hand, there is |
| no overlap between a tasks prior placement and its new cpuset CPU |
| placement, then the task will be allowed to run on any CPU allowed |
| in its new cpuset. If a task is moved from one cpuset to another, |
| its CPU placement is updated in the same way as if the tasks pid is |
| rewritten to the 'tasks' file of its current cpuset. |
| |
| In summary, the memory placement of a task whose cpuset is changed is |
| updated by the kernel, on the next allocation of a page for that task, |
| but the processor placement is not updated, until that tasks pid is |
| rewritten to the 'tasks' file of its cpuset. This is done to avoid |
| impacting the scheduler code in the kernel with a check for changes |
| in a tasks processor placement. |
| |
| Normally, once a page is allocated (given a physical page |
| of main memory) then that page stays on whatever node it |
| was allocated, so long as it remains allocated, even if the |
| cpusets memory placement policy 'mems' subsequently changes. |
| If the cpuset flag file 'memory_migrate' is set true, then when |
| tasks are attached to that cpuset, any pages that task had |
| allocated to it on nodes in its previous cpuset are migrated |
| to the tasks new cpuset. The relative placement of the page within |
| the cpuset is preserved during these migration operations if possible. |
| For example if the page was on the second valid node of the prior cpuset |
| then the page will be placed on the second valid node of the new cpuset. |
| |
| Also if 'memory_migrate' is set true, then if that cpusets |
| 'mems' file is modified, pages allocated to tasks in that |
| cpuset, that were on nodes in the previous setting of 'mems', |
| will be moved to nodes in the new setting of 'mems.' |
| Pages that were not in the tasks prior cpuset, or in the cpusets |
| prior 'mems' setting, will not be moved. |
| |
| There is an exception to the above. If hotplug functionality is used |
| to remove all the CPUs that are currently assigned to a cpuset, |
| then the kernel will automatically update the cpus_allowed of all |
| tasks attached to CPUs in that cpuset to allow all CPUs. When memory |
| hotplug functionality for removing Memory Nodes is available, a |
| similar exception is expected to apply there as well. In general, |
| the kernel prefers to violate cpuset placement, over starving a task |
| that has had all its allowed CPUs or Memory Nodes taken offline. User |
| code should reconfigure cpusets to only refer to online CPUs and Memory |
| Nodes when using hotplug to add or remove such resources. |
| |
| There is a second exception to the above. GFP_ATOMIC requests are |
| kernel internal allocations that must be satisfied, immediately. |
| The kernel may drop some request, in rare cases even panic, if a |
| GFP_ATOMIC alloc fails. If the request cannot be satisfied within |
| the current tasks cpuset, then we relax the cpuset, and look for |
| memory anywhere we can find it. It's better to violate the cpuset |
| than stress the kernel. |
| |
| To start a new job that is to be contained within a cpuset, the steps are: |
| |
| 1) mkdir /dev/cpuset |
| 2) mount -t cgroup -ocpuset cpuset /dev/cpuset |
| 3) Create the new cpuset by doing mkdir's and write's (or echo's) in |
| the /dev/cpuset virtual file system. |
| 4) Start a task that will be the "founding father" of the new job. |
| 5) Attach that task to the new cpuset by writing its pid to the |
| /dev/cpuset tasks file for that cpuset. |
| 6) fork, exec or clone the job tasks from this founding father task. |
| |
| For example, the following sequence of commands will setup a cpuset |
| named "Charlie", containing just CPUs 2 and 3, and Memory Node 1, |
| and then start a subshell 'sh' in that cpuset: |
| |
| mount -t cgroup -ocpuset cpuset /dev/cpuset |
| cd /dev/cpuset |
| mkdir Charlie |
| cd Charlie |
| /bin/echo 2-3 > cpus |
| /bin/echo 1 > mems |
| /bin/echo $$ > tasks |
| sh |
| # The subshell 'sh' is now running in cpuset Charlie |
| # The next line should display '/Charlie' |
| cat /proc/self/cpuset |
| |
| In the future, a C library interface to cpusets will likely be |
| available. For now, the only way to query or modify cpusets is |
| via the cpuset file system, using the various cd, mkdir, echo, cat, |
| rmdir commands from the shell, or their equivalent from C. |
| |
| The sched_setaffinity calls can also be done at the shell prompt using |
| SGI's runon or Robert Love's taskset. The mbind and set_mempolicy |
| calls can be done at the shell prompt using the numactl command |
| (part of Andi Kleen's numa package). |
| |
| 2. Usage Examples and Syntax |
| ============================ |
| |
| 2.1 Basic Usage |
| --------------- |
| |
| Creating, modifying, using the cpusets can be done through the cpuset |
| virtual filesystem. |
| |
| To mount it, type: |
| # mount -t cgroup -o cpuset cpuset /dev/cpuset |
| |
| Then under /dev/cpuset you can find a tree that corresponds to the |
| tree of the cpusets in the system. For instance, /dev/cpuset |
| is the cpuset that holds the whole system. |
| |
| If you want to create a new cpuset under /dev/cpuset: |
| # cd /dev/cpuset |
| # mkdir my_cpuset |
| |
| Now you want to do something with this cpuset. |
| # cd my_cpuset |
| |
| In this directory you can find several files: |
| # ls |
| cpus cpu_exclusive mems mem_exclusive tasks |
| |
| Reading them will give you information about the state of this cpuset: |
| the CPUs and Memory Nodes it can use, the processes that are using |
| it, its properties. By writing to these files you can manipulate |
| the cpuset. |
| |
| Set some flags: |
| # /bin/echo 1 > cpu_exclusive |
| |
| Add some cpus: |
| # /bin/echo 0-7 > cpus |
| |
| Add some mems: |
| # /bin/echo 0-7 > mems |
| |
| Now attach your shell to this cpuset: |
| # /bin/echo $$ > tasks |
| |
| You can also create cpusets inside your cpuset by using mkdir in this |
| directory. |
| # mkdir my_sub_cs |
| |
| To remove a cpuset, just use rmdir: |
| # rmdir my_sub_cs |
| This will fail if the cpuset is in use (has cpusets inside, or has |
| processes attached). |
| |
| Note that for legacy reasons, the "cpuset" filesystem exists as a |
| wrapper around the cgroup filesystem. |
| |
| The command |
| |
| mount -t cpuset X /dev/cpuset |
| |
| is equivalent to |
| |
| mount -t cgroup -ocpuset X /dev/cpuset |
| echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent |
| |
| 2.2 Adding/removing cpus |
| ------------------------ |
| |
| This is the syntax to use when writing in the cpus or mems files |
| in cpuset directories: |
| |
| # /bin/echo 1-4 > cpus -> set cpus list to cpus 1,2,3,4 |
| # /bin/echo 1,2,3,4 > cpus -> set cpus list to cpus 1,2,3,4 |
| |
| 2.3 Setting flags |
| ----------------- |
| |
| The syntax is very simple: |
| |
| # /bin/echo 1 > cpu_exclusive -> set flag 'cpu_exclusive' |
| # /bin/echo 0 > cpu_exclusive -> unset flag 'cpu_exclusive' |
| |
| 2.4 Attaching processes |
| ----------------------- |
| |
| # /bin/echo PID > tasks |
| |
| Note that it is PID, not PIDs. You can only attach ONE task at a time. |
| If you have several tasks to attach, you have to do it one after another: |
| |
| # /bin/echo PID1 > tasks |
| # /bin/echo PID2 > tasks |
| ... |
| # /bin/echo PIDn > tasks |
| |
| |
| 3. Questions |
| ============ |
| |
| Q: what's up with this '/bin/echo' ? |
| A: bash's builtin 'echo' command does not check calls to write() against |
| errors. If you use it in the cpuset file system, you won't be |
| able to tell whether a command succeeded or failed. |
| |
| Q: When I attach processes, only the first of the line gets really attached ! |
| A: We can only return one error code per call to write(). So you should also |
| put only ONE pid. |
| |
| 4. Contact |
| ========== |
| |
| Web: http://www.bullopensource.org/cpuset |