| |
| What is Linux Memory Policy? |
| |
| In the Linux kernel, "memory policy" determines from which node the kernel will |
| allocate memory in a NUMA system or in an emulated NUMA system. Linux has |
| supported platforms with Non-Uniform Memory Access architectures since 2.4.?. |
| The current memory policy support was added to Linux 2.6 around May 2004. This |
| document attempts to describe the concepts and APIs of the 2.6 memory policy |
| support. |
| |
| Memory policies should not be confused with cpusets |
| (Documentation/cpusets.txt), which are an administrative mechanism for |
| restricting the nodes from which memory may be allocated by a set of |
| processes. Memory policies are a programming interface that a NUMA-aware |
| application can take advantage of. When both cpusets and policies are |
| applied to a task, the restrictions of the cpuset take priority. See |
| "MEMORY POLICIES AND CPUSETS" below for more details. |
| |
| MEMORY POLICY CONCEPTS |
| |
| Scope of Memory Policies |
| |
| The Linux kernel supports _scopes_ of memory policy, described here from |
| most general to most specific: |
| |
| System Default Policy: this policy is "hard coded" into the kernel. It |
| is the policy that governs all page allocations that aren't controlled |
| by one of the more specific policy scopes discussed below. When the |
| system is "up and running", the system default policy will use "local |
| allocation" described below. However, during boot up, the system |
| default policy will be set to interleave allocations across all nodes |
| with "sufficient" memory, so as not to overload the initial boot node |
| with boot-time allocations. |
| |
| Task/Process Policy: this is an optional, per-task policy. When defined |
| for a specific task, this policy controls all page allocations made by or |
| on behalf of the task that aren't controlled by a more specific scope. |
| If a task does not define a task policy, then all page allocations that |
| would have been controlled by the task policy "fall back" to the System |
| Default Policy. |
| |
| The task policy applies to the entire address space of a task. Thus, |
| it is inheritable, and indeed is inherited, across both fork() |
| [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task |
| to establish the task policy for a child task exec()'d from an |
| executable image that has no awareness of memory policy. See the |
| MEMORY POLICY APIs section, below, for an overview of the system call |
| that a task may use to set/change its task/process policy. |
| |
| In a multi-threaded task, task policies apply only to the thread |
| [Linux kernel task] that installs the policy and any threads |
| subsequently created by that thread. Any sibling threads existing |
| at the time a new task policy is installed retain their current |
| policy. |
| |
| A task policy applies only to pages allocated after the policy is |
| installed. Any pages already faulted in by the task when the task |
| changes its task policy remain where they were allocated based on |
| the policy at the time they were allocated. |
| |
| VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's |
| virtual address space. A task may define a specific policy for a range |
| of its virtual address space. See the MEMORY POLICY APIs section, |
| below, for an overview of the mbind() system call used to set a VMA |
| policy. |
| |
| A VMA policy will govern the allocation of pages that back this region of |
| the address space. Any regions of the task's address space that don't |
| have an explicit VMA policy will fall back to the task policy, which may |
| itself fall back to the System Default Policy. |
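
The fallback chain described above can be sketched in a few lines of C.
This is only an illustration: 'struct policy', 'system_default' and
effective_policy() are hypothetical names made up for this sketch, not
kernel definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a policy; NOT the kernel's struct mempolicy. */
struct policy {
    const char *name;
};

/* The system default policy always exists. */
const struct policy system_default = { "system default" };

/*
 * Pick the policy that governs an allocation: the most specific scope
 * that is defined wins.  NULL means "no policy defined at this scope".
 */
const struct policy *effective_policy(const struct policy *vma_pol,
                                      const struct policy *task_pol)
{
    if (vma_pol != NULL)
        return vma_pol;          /* explicit VMA policy */
    if (task_pol != NULL)
        return task_pol;         /* fall back to the task policy */
    return &system_default;      /* fall back to System Default Policy */
}
```

A range with no VMA policy in a task with no task policy therefore ends
up governed by the System Default Policy.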
| |
| VMA policies have a few complicating details: |
| |
| VMA policy applies ONLY to anonymous pages. These include pages |
| allocated for anonymous segments, such as the task stack and heap, and |
| any regions of the address space mmap()ed with the MAP_ANONYMOUS flag. |
| If a VMA policy is applied to a file mapping, it will be ignored if |
| the mapping used the MAP_SHARED flag. If the file mapping used the |
| MAP_PRIVATE flag, the VMA policy will only be applied when an |
| anonymous page is allocated on an attempt to write to the mapping-- |
| i.e., at Copy-On-Write. |
| |
| VMA policies are shared between all tasks that share a virtual address |
| space--a.k.a. threads--independent of when the policy is installed; and |
| they are inherited across fork(). However, because VMA policies refer |
| to a specific region of a task's address space, and because the address |
| space is discarded and recreated on exec*(), VMA policies are NOT |
| inheritable across exec(). Thus, only NUMA-aware applications may |
| use VMA policies. |
| |
| A task may install a new VMA policy on a sub-range of a previously |
| mmap()ed region. When this happens, Linux splits the existing virtual |
| memory area into 2 or 3 VMAs, each with its own policy. |
| |
| By default, VMA policy applies only to pages allocated after the policy |
| is installed. Any pages already faulted into the VMA range remain |
| where they were allocated based on the policy at the time they were |
| allocated. However, since 2.6.16, Linux supports page migration via |
| the mbind() system call, so that page contents can be moved to match |
| a newly installed policy. |
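
The "2 or 3 VMAs" remark above follows from simple range arithmetic. The
sketch below counts the pieces; vma_pieces_after_split() is a
hypothetical helper written for this document and does not correspond to
a kernel function.

```c
#include <assert.h>

/*
 * Count the VMAs that result when a policy is installed on [lo, hi)
 * inside an existing VMA [start, end).
 * Assumes start <= lo < hi <= end.
 */
int vma_pieces_after_split(unsigned long start, unsigned long end,
                           unsigned long lo, unsigned long hi)
{
    assert(start <= lo && lo < hi && hi <= end);

    int pieces = 1;          /* the sub-range that carries the new policy */
    if (lo > start)
        pieces++;            /* remainder below the sub-range */
    if (hi < end)
        pieces++;            /* remainder above the sub-range */
    return pieces;
}
```

A sub-range that touches one end of the VMA yields 2 pieces, one in the
middle yields 3, and a range covering the whole VMA needs no split.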
| |
| Shared Policy: Conceptually, shared policies apply to "memory objects" |
| mapped shared into one or more tasks' distinct address spaces. An |
| application installs shared policies the same way as VMA policies--using |
| the mbind() system call specifying a range of virtual addresses that map |
| the shared object. However, unlike VMA policies, which can be considered |
| to be an attribute of a range of a task's address space, shared policies |
| apply directly to the shared object. Thus, all tasks that attach to the |
| object share the policy, and all pages allocated for the shared object, |
| by any task, will obey the shared policy. |
| |
| As of 2.6.22, only shared memory segments, created by shmget() or |
| mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared |
| policy support was added to Linux, the associated data structures were |
| added to hugetlbfs shmem segments. At the time, hugetlbfs did not |
| support allocation at fault time--a.k.a lazy allocation--so hugetlbfs |
| shmem segments were never "hooked up" to the shared policy support. |
| Although hugetlbfs segments now support lazy allocation, their support |
| for shared policy has not been completed. |
| |
| As mentioned above [re: VMA policies], allocations of page cache |
| pages for regular files mmap()ed with MAP_SHARED ignore any VMA |
| policy installed on the virtual address range backed by the shared |
| file mapping. Rather, shared page cache pages, including pages backing |
| private mappings that have not yet been written by the task, follow |
| task policy, if any, else System Default Policy. |
| |
| The shared policy infrastructure supports different policies on subset |
| ranges of the shared object. However, Linux still splits the VMA of |
| the task that installs the policy for each range of distinct policy. |
| Thus, different tasks that attach to a shared memory segment can have |
| different VMA configurations mapping that one shared object. This |
| can be seen by examining the /proc/<pid>/numa_maps of tasks sharing |
| a shared memory region, when one task has installed shared policy on |
| one or more ranges of the region. |
| |
| Components of Memory Policies |
| |
| A Linux memory policy is a tuple consisting of a "mode" and an optional set |
| of nodes. The mode determines the behavior of the policy, while the |
| optional set of nodes can be viewed as the arguments to the behavior. |
| |
| Internally, memory policies are implemented by a reference counted |
| structure, struct mempolicy. Details of this structure will be discussed |
| in context, below, as required to explain the behavior. |
| |
| Note: in some functions AND in the struct mempolicy itself, the mode |
| is called "policy". However, to avoid confusion with the policy tuple, |
| this document will continue to use the term "mode". |
| |
| Linux memory policy supports the following 4 behavioral modes: |
| |
| Default Mode--MPOL_DEFAULT: The behavior specified by this mode is |
| context or scope dependent. |
| |
| As mentioned in the Policy Scope section above, during normal |
| system operation, the System Default Policy is hard coded to |
| contain the Default mode. |
| |
| In this context, default mode means "local" allocation--that is, |
| an attempt to allocate the page from the node associated with the cpu |
| where the fault occurs. If the "local" node has no memory, or the |
| node's memory is exhausted [no free pages available], local |
| allocation will "fall back to"--attempt to allocate pages from-- |
| "nearby" nodes, in order of increasing "distance". |
| |
| Implementation detail -- subject to change: "Fallback" uses |
| per node lists of sibling nodes--called zonelists--built at |
| boot time, or when nodes or memory are added to or removed from |
| the system [memory hotplug]. These per node zonelists are |
| constructed with nodes in order of increasing distance based |
| on information provided by the platform firmware. |
| |
| When a task/process policy or a shared policy contains the Default |
| mode, this also means "local allocation", as described above. |
| |
| In the context of a VMA, Default mode means "fall back to task |
| policy"--which may or may not specify Default mode. Thus, Default |
| mode cannot be counted on to mean local allocation when used |
| on a non-shared region of the address space. However, see |
| MPOL_PREFERRED below. |
| |
| The Default mode does not use the optional set of nodes. |
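
To make the distance-ordered fallback of local allocation concrete, here
is a rough sketch of how a fallback order could be derived from a row of
node distances. It is not the kernel's zonelist code;
build_fallback_order() and the 'dist' array are assumptions made for
this example.

```c
#include <stddef.h>

/*
 * Fill order[] with node ids sorted by increasing distance from the
 * local node.  dist[i] is the (hypothetical) firmware-reported distance
 * from the local node to node i; order[] must have room for nr_nodes
 * entries.  Nodes at equal distance keep the lower node id first.
 */
void build_fallback_order(const int *dist, size_t nr_nodes, int *order)
{
    for (size_t i = 0; i < nr_nodes; i++) {
        int node = (int)i;
        size_t j = i;

        /* Insertion sort: shift strictly farther nodes to the right. */
        while (j > 0 && dist[order[j - 1]] > dist[node]) {
            order[j] = order[j - 1];
            j--;
        }
        order[j] = node;
    }
}
```

With distances {20, 10, 40, 20}, node 1 is the local node and the
allocator would try nodes 1, 0, 3 and 2 in that order.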
| |
| MPOL_BIND: This mode specifies that memory must come from the |
| set of nodes specified by the policy. |
| |
| The memory policy APIs do not specify an order in which the nodes |
| will be searched. However, unlike "local allocation", the Bind |
| policy does not consider the distance between the nodes. Rather, |
| allocations will fall back to the nodes specified by the policy in |
| order of numeric node id. Like everything in Linux, this is subject |
| to change. |
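
The numeric-order fallback of the Bind mode can be pictured as a scan
over a bit mask. The helper below is a simplified model, not kernel
code: bind_pick_node() is a hypothetical name, and "has free pages" is
reduced to a second bit mask.

```c
/*
 * Return the lowest-numbered node that is both in the bind mask and has
 * free pages, or -1 if no node qualifies.  In both masks, bit n stands
 * for node n (a representation assumed for this sketch).
 */
int bind_pick_node(unsigned long bind_mask, unsigned long nodes_with_free)
{
    unsigned long candidates = bind_mask & nodes_with_free;
    int nbits = (int)(8 * sizeof(unsigned long));

    for (int node = 0; node < nbits; node++) {
        if (candidates & (1UL << node))
            return node;     /* first hit is the lowest node id */
    }
    return -1;               /* nothing in the bind set can satisfy it */
}
```

Note that distance plays no role here: with nodes 1 and 3 bound, node 1
is tried first even if node 3 is closer to the faulting cpu.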
| |
| MPOL_PREFERRED: This mode specifies that the allocation should be |
| attempted from the single node specified in the policy. If that |
| allocation fails, the kernel will search other nodes, exactly as |
| it would for a local allocation that started at the preferred node |
| in increasing distance from the preferred node. "Local" allocation |
| policy can be viewed as a Preferred policy that starts at the node |
| containing the cpu where the allocation takes place. |
| |
| Internally, the Preferred policy uses a single node--the |
| preferred_node member of struct mempolicy. A "distinguished" |
| value of this preferred_node, currently '-1', is interpreted |
| as "the node containing the cpu where the allocation takes |
| place"--local allocation. This is the way to specify |
| local allocation for a specific range of addresses--i.e. for |
| VMA policies. |
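
The role of the distinguished value can be shown in a couple of lines.
PREFERRED_LOCAL and preferred_start_node() are names invented for this
sketch; only the value -1 comes from the description above.

```c
#define PREFERRED_LOCAL (-1)   /* the "distinguished" value from the text */

/*
 * Return the node at which a Preferred allocation starts its search:
 * the policy's node, or the faulting cpu's node for local allocation.
 */
int preferred_start_node(int preferred_node, int local_node)
{
    if (preferred_node == PREFERRED_LOCAL)
        return local_node;
    return preferred_node;
}
```

From the starting node onward, the search proceeds in order of
increasing distance, exactly as for local allocation.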
| |
| MPOL_INTERLEAVE: This mode specifies that page allocations be |
| interleaved, on a page granularity, across the nodes specified in |
| the policy. This mode also behaves slightly differently, based on |
| the context where it is used: |
| |
| For allocation of anonymous pages and shared memory pages, |
| Interleave mode indexes the set of nodes specified by the policy |
| using the page offset of the faulting address into the segment |
| [VMA] containing the address modulo the number of nodes specified |
| by the policy. It then attempts to allocate a page, starting at |
| the selected node, as if the node had been specified by a Preferred |
| policy or had been selected by a local allocation. That is, |
| allocation will follow the per node zonelist. |
| |
| For allocation of page cache pages, Interleave mode indexes the set |
| of nodes specified by the policy using a node counter maintained |
| per task. This counter wraps around to the lowest specified node |
| after it reaches the highest specified node. This will tend to |
| spread the pages out over the nodes specified by the policy based |
| on the order in which they are allocated, rather than based on any |
| page offset into an address range or file. During system boot up, |
| the temporary interleaved system default policy works in this |
| mode. |
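
The two ways of choosing an interleave node can be compared side by
side. Both functions below are simplified models with hypothetical
names; the real kernel code takes more inputs into account. The node
list is assumed to be sorted by ascending node id.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Anonymous and shared memory pages: index the node list with the page
 * offset of the faulting address into its segment, modulo the number
 * of nodes in the policy.
 */
int interleave_by_offset(unsigned long addr, unsigned long vma_start,
                         unsigned long page_size,
                         const int *nodes, size_t nr_nodes)
{
    assert(nr_nodes > 0 && page_size > 0 && addr >= vma_start);

    unsigned long page_offset = (addr - vma_start) / page_size;
    return nodes[page_offset % nr_nodes];
}

/* Hypothetical per-task state: index of the next node to use. */
struct interleave_state {
    size_t next;
};

/*
 * Page cache pages: hand out nodes in allocation order using a per-task
 * counter that wraps from the highest node back to the lowest.
 */
int interleave_by_counter(struct interleave_state *st,
                          const int *nodes, size_t nr_nodes)
{
    assert(nr_nodes > 0 && st->next < nr_nodes);

    int node = nodes[st->next];
    st->next = (st->next + 1) % nr_nodes;   /* wrap around */
    return node;
}
```

The first variant gives the same node for the same page every time; the
second depends only on the order in which the task allocates pages.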
| |
| MEMORY POLICY APIs |
| |
| Linux supports 3 system calls for controlling memory policy. These APIs |
| always affect only the calling task, the calling task's address space, or |
| some shared object mapped into the calling task's address space. |
| |
| Note: the headers that define these APIs and the parameter data types |
| for user space applications reside in a package that is not part of |
| the Linux kernel. The kernel system call interfaces, with the 'sys_' |
| prefix, are defined in <linux/syscalls.h>; the mode and flag |
| definitions are defined in <linux/mempolicy.h>. |
| |
| Set [Task] Memory Policy: |
| |
| long set_mempolicy(int mode, const unsigned long *nmask, |
| unsigned long maxnode); |
| |
| Sets the calling task's "task/process memory policy" to the mode |
| specified by the 'mode' argument and the set of nodes defined |
| by 'nmask'. 'nmask' points to a bit mask of node ids containing |
| at least 'maxnode' ids. |
| |
| See the set_mempolicy(2) man page for more details. |
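
As a minimal sketch of a call to set_mempolicy(), the code below invokes
the system call directly through syscall(2) so that it does not depend
on the separately packaged wrapper library. nodemask_of() and
set_task_policy() are hypothetical helpers; the single-word mask is an
assumption that limits the sketch to low node ids.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <linux/mempolicy.h>   /* MPOL_* mode values */
#include <sys/syscall.h>       /* SYS_set_mempolicy */
#include <unistd.h>            /* syscall() */

/* Build a one-word node mask with one bit set per listed node id. */
unsigned long nodemask_of(const int *nodes, size_t n)
{
    unsigned long mask = 0;

    for (size_t i = 0; i < n; i++)
        mask |= 1UL << nodes[i];
    return mask;
}

/*
 * Install a task policy for the calling task.  Returns 0 on success, or
 * -1 with errno set (for example on a kernel without NUMA support, in a
 * sandbox that filters the call, or when a node is not allowed).
 */
long set_task_policy(int mode, unsigned long nmask_word)
{
    /* MPOL_DEFAULT takes an empty node set. */
    if (mode == MPOL_DEFAULT)
        return syscall(SYS_set_mempolicy, mode, NULL, 0UL);

    return syscall(SYS_set_mempolicy, mode, &nmask_word,
                   (unsigned long)(8 * sizeof(nmask_word)));
}
```

For instance, set_task_policy(MPOL_INTERLEAVE, nodemask_of(nodes, 2))
would request interleaving across two nodes, and
set_task_policy(MPOL_DEFAULT, 0) removes the task policy again.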
| |
| |
| Get [Task] Memory Policy or Related Information |
| |
| long get_mempolicy(int *mode, |
| unsigned long *nmask, unsigned long maxnode, |
| void *addr, int flags); |
| |
| Queries the "task/process memory policy" of the calling task, or |
| the policy or location of a specified virtual address, depending |
| on the 'flags' argument. |
| |
| See the get_mempolicy(2) man page for more details. |
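
A query of the calling task's policy might look like the sketch below,
again using syscall(2) directly. query_task_policy() is a hypothetical
helper; the caller is assumed to supply a mask large enough for every
node the system can have.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_get_mempolicy */
#include <unistd.h>        /* syscall() */

/*
 * Fetch the calling task's policy.  On success returns 0, stores the
 * mode in *mode and the policy's nodes in nmask[]; 'maxnode' is the
 * number of bits available in nmask[].  On failure returns -1 with
 * errno set.  With flags == 0 and addr == NULL the call reports the
 * task policy rather than the policy of a specific address.
 */
long query_task_policy(int *mode, unsigned long *nmask,
                       unsigned long maxnode)
{
    return syscall(SYS_get_mempolicy, mode, nmask, maxnode, NULL, 0UL);
}
```

A caller would typically declare something like 'unsigned long
mask[16]' and pass 8 * sizeof(mask) as 'maxnode'.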
| |
| |
| Install VMA/Shared Policy for a Range of Task's Address Space |
| |
| long mbind(void *start, unsigned long len, int mode, |
| const unsigned long *nmask, unsigned long maxnode, |
| unsigned flags); |
| |
| mbind() installs the policy specified by (mode, nmask, maxnode) as |
| a VMA policy for the range of the calling task's address space |
| specified by the 'start' and 'len' arguments. Additional actions |
| may be requested via the 'flags' argument. |
| |
| See the mbind(2) man page for more details. |
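
The following sketch maps an anonymous region and installs a VMA policy
on it before any page is touched, so that later page faults in the
region follow that policy. alloc_with_policy() is a hypothetical helper,
and no 'flags' are requested.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>      /* mmap(), munmap() */
#include <sys/syscall.h>   /* SYS_mbind */
#include <unistd.h>        /* syscall() */

/*
 * Map 'len' bytes of anonymous memory and install (mode, nmask,
 * maxnode) as the VMA policy for the new range.  Returns the mapping,
 * or NULL with errno set if either step fails.
 */
void *alloc_with_policy(size_t len, int mode,
                        const unsigned long *nmask, unsigned long maxnode)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Widen the integer arguments so each fills a full register. */
    if (syscall(SYS_mbind, p, (unsigned long)len, (unsigned long)mode,
                nmask, maxnode, 0UL) == -1) {
        int saved = errno;

        munmap(p, len);
        errno = saved;       /* report the mbind() failure */
        return NULL;
    }
    return p;
}
```

Because the policy is installed before the first touch, no page
migration is needed for the pages to obey it.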
| |
| MEMORY POLICY COMMAND LINE INTERFACE |
| |
| Although not strictly part of the Linux implementation of memory policy, |
| a command line tool, numactl(8), exists that allows one to: |
| |
| + set the task policy for a specified program via set_mempolicy(2), fork(2) and |
| exec(2) |
| |
| + set the shared policy for a shared memory segment via mbind(2) |
| |
| The numactl(8) tool is packaged with the run-time version of the library |
| containing the memory policy system call wrappers. Some distributions |
| package the headers and compile-time libraries in a separate development |
| package. |
| |
| |
| MEMORY POLICIES AND CPUSETS |
| |
| Memory policies work within cpusets as described above. For memory policies |
| that require a node or set of nodes, the nodes are restricted to the set of |
| nodes whose memories are allowed by the cpuset constraints. If the nodemask |
| specified for the policy contains nodes that are not allowed by the cpuset, or |
| the intersection of the set of nodes specified for the policy and the set of |
| nodes with memory is the empty set, the policy is considered invalid |
| and cannot be installed. |
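
The validity rule above amounts to two bit mask tests. The function
below is a simplified model with a hypothetical name; each argument is a
mask in which bit n stands for node n.

```c
/*
 * Return 1 if a policy's node set may be installed, 0 otherwise:
 *  - every policy node must be allowed by the cpuset, and
 *  - at least one policy node must have memory.
 */
int policy_nodes_valid(unsigned long policy_nodes,
                       unsigned long cpuset_allowed,
                       unsigned long nodes_with_memory)
{
    if (policy_nodes & ~cpuset_allowed)
        return 0;            /* names a node outside the cpuset */
    if ((policy_nodes & nodes_with_memory) == 0)
        return 0;            /* no requested node has any memory */
    return 1;
}
```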
| |
| The interaction of memory policies and cpusets can be problematic for a |
| couple of reasons: |
| |
| 1) the memory policy APIs take physical node ids as arguments. As mentioned |
| above, it is illegal to specify nodes that are not allowed in the cpuset. |
| The application must query the allowed nodes using the get_mempolicy() |
| API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and |
| restrict itself to those nodes. However, the resources available to a |
| cpuset can be changed by the system administrator, or a workload manager |
| application, at any time. So, a task may still get errors attempting to |
| specify policy nodes, and must query the allowed memories again. |
| |
| 2) when tasks in two cpusets share access to a memory region, such as shared |
| memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and |
| MAP_SHARED flags, and any of the tasks install shared policy on the region, |
| only nodes whose memories are allowed in both cpusets may be used in the |
| policies. Obtaining this information requires "stepping outside" the |
| memory policy APIs to use the cpuset information and requires that one |
| know in what cpusets other tasks might be attaching to the shared region. |
| Furthermore, if the cpusets' allowed memory sets are disjoint, "local" |
| allocation is the only valid policy. |
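
For point 1) above, querying the allowed nodes could look like the
sketch below. query_allowed_nodes() and restrict_to_allowed() are
hypothetical helpers. Since the cpuset can change at any time, the
result is only a snapshot and the query may have to be repeated after a
failed policy call.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <linux/mempolicy.h>   /* MPOL_F_MEMS_ALLOWED */
#include <sys/syscall.h>       /* SYS_get_mempolicy */
#include <unistd.h>            /* syscall() */

/*
 * Store the set of nodes the calling task may allocate from in nmask[],
 * which holds 'maxnode' bits.  Returns 0 on success, -1 with errno set
 * on failure.  The mode and addr arguments are not used for this query.
 */
long query_allowed_nodes(unsigned long *nmask, unsigned long maxnode)
{
    return syscall(SYS_get_mempolicy, NULL, nmask, maxnode, NULL,
                   (unsigned long)MPOL_F_MEMS_ALLOWED);
}

/* Drop nodes that are not currently allowed from a wanted mask word. */
unsigned long restrict_to_allowed(unsigned long wanted,
                                  unsigned long allowed)
{
    return wanted & allowed;
}
```

An application would intersect its wanted nodes with the snapshot before
calling set_mempolicy() or mbind(), and query again if that call fails.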