What is Linux Memory Policy?

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets (Documentation/cpusets.txt),
which are an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See "MEMORY POLICIES AND CPUSETS" below for more details.

MEMORY POLICY CONCEPTS

Scope of Memory Policies

The Linux kernel supports _scopes_ of memory policy, described here from
most general to most specific:

System Default Policy: this policy is "hard coded" into the kernel. It
is the policy that governs all page allocations that aren't controlled
by one of the more specific policy scopes discussed below. When the
system is "up and running", the system default policy will use "local
allocation" described below. However, during boot up, the system
default policy will be set to interleave allocations across all nodes
with "sufficient" memory, so as not to overload the initial boot node
with boot-time allocations.

Task/Process Policy: this is an optional, per-task policy. When defined
for a specific task, this policy controls all page allocations made by or
on behalf of the task that aren't controlled by a more specific scope.
If a task does not define a task policy, then all page allocations that
would have been controlled by the task policy "fall back" to the System
Default Policy.

The task policy applies to the entire address space of a task. Thus,
it is inheritable, and indeed is inherited, across both fork()
[clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
to establish the task policy for a child task exec()'d from an
executable image that has no awareness of memory policy. See the
MEMORY POLICY APIs section, below, for an overview of the system call
that a task may use to set/change its task/process policy.

In a multi-threaded task, task policies apply only to the thread
[Linux kernel task] that installs the policy and any threads
subsequently created by that thread. Any sibling threads existing
at the time a new task policy is installed retain their current
policy.

A task policy applies only to pages allocated after the policy is
installed. Any pages already faulted in by the task when the task
changes its task policy remain where they were allocated based on
the policy at the time they were allocated.

VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
virtual address space. A task may define a specific policy for a range
of its virtual address space. See the MEMORY POLICY APIs section,
below, for an overview of the mbind() system call used to set a VMA
policy.

A VMA policy will govern the allocation of pages that back this region of
the address space. Any regions of the task's address space that don't
have an explicit VMA policy will fall back to the task policy, which may
itself fall back to the System Default Policy.

VMA policies have a few complicating details:

VMA policy applies ONLY to anonymous pages. These include pages
allocated for anonymous segments, such as the task stack and heap, and
any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
If a VMA policy is applied to a file mapping, it will be ignored if
the mapping used the MAP_SHARED flag. If the file mapping used the
MAP_PRIVATE flag, the VMA policy will only be applied when an
anonymous page is allocated on an attempt to write to the mapping--
i.e., at Copy-On-Write.

VMA policies are shared between all tasks that share a virtual address
space--a.k.a. threads--independent of when the policy is installed; and
they are inherited across fork(). However, because VMA policies refer
to a specific region of a task's address space, and because the address
space is discarded and recreated on exec*(), VMA policies are NOT
inheritable across exec(). Thus, only NUMA-aware applications may
use VMA policies.

A task may install a new VMA policy on a sub-range of a previously
mmap()ed region. When this happens, Linux splits the existing virtual
memory area into 2 or 3 VMAs, each with its own policy.

By default, VMA policy applies only to pages allocated after the policy
is installed. Any pages already faulted into the VMA range remain
where they were allocated based on the policy at the time they were
allocated. However, since 2.6.16, Linux supports page migration via
the mbind() system call, so that page contents can be moved to match
a newly installed policy.

Shared Policy: Conceptually, shared policies apply to "memory objects"
mapped shared into one or more tasks' distinct address spaces. An
application installs shared policies the same way as VMA policies--using
the mbind() system call specifying a range of virtual addresses that map
the shared object. However, unlike VMA policies, which can be considered
to be an attribute of a range of a task's address space, shared policies
apply directly to the shared object. Thus, all tasks that attach to the
object share the policy, and all pages allocated for the shared object,
by any task, will obey the shared policy.

As of 2.6.22, only shared memory segments, created by shmget() or
mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
policy support was added to Linux, the associated data structures were
added to hugetlbfs shmem segments. At the time, hugetlbfs did not
support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
shmem segments were never "hooked up" to the shared policy support.
Although hugetlbfs segments now support lazy allocation, their support
for shared policy has not been completed.

As mentioned above [re: VMA policies], allocations of page cache
pages for regular files mmap()ed with MAP_SHARED ignore any VMA
policy installed on the virtual address range backed by the shared
file mapping. Rather, shared page cache pages, including pages backing
private mappings that have not yet been written by the task, follow
task policy, if any, else System Default Policy.

The shared policy infrastructure supports different policies on subset
ranges of the shared object. However, Linux still splits the VMA of
the task that installs the policy for each range of distinct policy.
Thus, different tasks that attach to a shared memory segment can have
different VMA configurations mapping that one shared object. This
can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
a shared memory region, when one task has installed shared policy on
one or more ranges of the region.

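The fallback between scopes described above can be sketched in C. This is a
simplified model, not kernel code: struct policy, system_default and
effective_policy() are hypothetical names introduced for illustration, a NULL
pointer stands for "no policy installed at this scope", and the special cases
for page cache pages are left out. A given address range is assumed to have
at most one of a shared policy or a VMA policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a policy; the kernel uses struct mempolicy. */
struct policy {
    const char *name;
};

/* The hard-coded system default: local allocation once the system is up. */
static const struct policy system_default = { "system default (local)" };

/*
 * Pick the policy that governs one page allocation.  A NULL argument
 * means "no policy installed at this scope".  The most specific scope
 * that has a policy wins.
 */
static const struct policy *
effective_policy(const struct policy *shared_pol,
                 const struct policy *vma_pol,
                 const struct policy *task_pol)
{
    assert(!(shared_pol && vma_pol));   /* assumption of this sketch */

    if (shared_pol)
        return shared_pol;      /* policy attached to a shared object */
    if (vma_pol)
        return vma_pol;         /* policy on this address range */
    if (task_pol)
        return task_pol;        /* per-task policy */
    return &system_default;     /* nothing more specific applies */
}
```

A task with no policies at all therefore ends up with the system default,
while a task policy is consulted only for ranges that carry no VMA or shared
policy of their own.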
Components of Memory Policies

A Linux memory policy consists of a "mode", optional mode flags, and an
optional set of nodes. The mode determines the behavior of the policy,
the optional mode flags determine the behavior of the mode, and the
optional set of nodes can be viewed as the arguments to the policy
behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be discussed
in context, below, as required to explain the behavior.

Note: in some functions AND in the struct mempolicy itself, the mode
is called "policy". However, to avoid confusion with the policy tuple,
this document will continue to use the term "mode".

Linux memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
context or scope dependent.

As mentioned in the Policy Scope section above, during normal
system operation, the System Default Policy is hard coded to
contain the Default mode.

In this context, default mode means "local" allocation--that is,
the kernel attempts to allocate the page from the node associated
with the cpu where the fault occurs. If the "local" node has no
memory, or the node's memory is exhausted [no free pages available],
local allocation will "fall back to"--attempt to allocate pages from--
"nearby" nodes, in order of increasing "distance".

Implementation detail -- subject to change: "Fallback" uses
a per node list of sibling nodes--called zonelists--built at
boot time, or when nodes or memory are added to or removed from
the system [memory hotplug]. These per node zonelists are
constructed with nodes in order of increasing distance based
on information provided by the platform firmware.

When a task/process policy or a shared policy contains the Default
mode, this also means "local allocation", as described above.

In the context of a VMA, Default mode means "fall back to task
policy"--which may or may not specify Default mode. Thus, Default
mode cannot be counted on to mean local allocation when used
on a non-shared region of the address space. However, see
MPOL_PREFERRED below.

It is an error for the set of nodes specified for this policy to
be non-empty.

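The "increasing distance" fallback order used by local allocation can be
illustrated with a small C sketch. It is a model only: NR_NODES, the
distance[][] table and build_fallback_order() are hypothetical, and the
kernel's real zonelists also deal with memory zones, which are ignored here.

```c
#include <assert.h>

#define NR_NODES 4   /* hypothetical node count for this sketch */

/*
 * Fill order[] with all node ids sorted by increasing distance from
 * 'local'.  distance[a][b] plays the role of the firmware-provided
 * node distance table; ties keep the lower node id first.
 */
static void build_fallback_order(const int distance[NR_NODES][NR_NODES],
                                 int local, int order[NR_NODES])
{
    assert(local >= 0 && local < NR_NODES);

    for (int i = 0; i < NR_NODES; i++)
        order[i] = i;

    /* Insertion sort: stable, so equal distances stay in id order. */
    for (int i = 1; i < NR_NODES; i++) {
        int node = order[i];
        int j = i - 1;

        while (j >= 0 &&
               distance[local][order[j]] > distance[local][node]) {
            order[j + 1] = order[j];
            j--;
        }
        order[j + 1] = node;
    }
}
```

The local node always comes first because its distance to itself is the
smallest entry in its row; an allocation then walks order[] until some node
has a free page.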
MPOL_BIND: This mode specifies that memory must come from the
set of nodes specified by the policy. Memory will be allocated from
the node in the set with sufficient free memory that is closest to
the node where the allocation takes place.

MPOL_PREFERRED: This mode specifies that the allocation should be
attempted from the single node specified in the policy. If that
allocation fails, the kernel will search other nodes in order of
increasing distance from the preferred node, exactly as it would for a
local allocation that started at the preferred node. "Local" allocation
policy can be viewed as a Preferred policy that starts at the node
containing the cpu where the allocation takes place.

Internally, the Preferred policy uses a single node--the
preferred_node member of struct mempolicy. A "distinguished"
value of this preferred_node, currently '-1', is interpreted
as "the node containing the cpu where the allocation takes
place"--local allocation. This is the way to specify
local allocation for a specific range of addresses--i.e. for
VMA policies.

MPOL_INTERLEAVE: This mode specifies that page allocations be
interleaved, on a page granularity, across the nodes specified in
the policy. This mode also behaves slightly differently, based on
the context where it is used:

For allocation of anonymous pages and shared memory pages,
Interleave mode indexes the set of nodes specified by the policy
using the page offset of the faulting address into the segment
[VMA] containing the address modulo the number of nodes specified
by the policy. It then attempts to allocate a page, starting at
the selected node, as if the node had been specified by a Preferred
policy or had been selected by a local allocation. That is,
allocation will follow the per node zonelist.

For allocation of page cache pages, Interleave mode indexes the set
of nodes specified by the policy using a node counter maintained
per task. This counter wraps around to the lowest specified node
after it reaches the highest specified node. This will tend to
spread the pages out over the nodes specified by the policy based
on the order in which they are allocated, rather than based on any
page offset into an address range or file. During system boot up,
the temporary interleaved system default policy works in this
mode.

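The two ways of choosing an interleave node can be sketched as follows. This
is an illustrative model, not the kernel's implementation: the node set is a
single unsigned long in which bit i stands for node i, and nth_node(),
node_count(), interleave_by_offset() and interleave_by_counter() are
hypothetical helper names. The per-task counter is modelled as a plain index
into the node set.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the n-th (0-based) node id set in 'nodes', a bit mask in which
 * bit i stands for node i.  Returns -1 if fewer than n+1 bits are set.
 */
static int nth_node(unsigned long nodes, unsigned int n)
{
    for (int node = 0; node < (int)(8 * sizeof(nodes)); node++) {
        if (nodes & (1UL << node)) {
            if (n == 0)
                return node;
            n--;
        }
    }
    return -1;
}

/* Count the nodes in the mask. */
static unsigned int node_count(unsigned long nodes)
{
    unsigned int count = 0;

    for (; nodes; nodes &= nodes - 1)
        count++;
    return count;
}

/*
 * Anonymous and shared memory pages: index the node set with the page
 * offset of the faulting address into its VMA, modulo the node count.
 */
static int interleave_by_offset(unsigned long nodes,
                                unsigned long page_offset)
{
    unsigned int count = node_count(nodes);

    assert(count > 0);
    return nth_node(nodes, (unsigned int)(page_offset % count));
}

/*
 * Page cache pages: walk the node set with a per-task counter that wraps
 * from the highest node in the set back to the lowest.
 */
static int interleave_by_counter(unsigned long nodes, unsigned int *counter)
{
    unsigned int count = node_count(nodes);
    int node;

    assert(count > 0);
    node = nth_node(nodes, *counter % count);
    *counter = (*counter + 1) % count;
    return node;
}
```

With nodes 1, 3 and 5 in the set, page offsets 0, 1, 2, 3 select nodes
1, 3, 5, 1, so a given page of a segment always maps to the same starting
node, whereas the counter variant depends only on allocation order.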
Linux memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by
the user should not be remapped if the task or VMA's set of allowed
nodes changes after the memory policy has been defined.

Without this flag, anytime a mempolicy is rebound because of a
change in the set of allowed nodes, the node (Preferred) or
nodemask (Bind, Interleave) is remapped to the new set of
allowed nodes. This may result in nodes being used that were
previously undesired.

With this flag, if the user-specified nodes overlap with the
nodes allowed by the task's cpuset, then the memory policy is
applied to their intersection. If the two sets of nodes do not
overlap, the Default policy is used.

For example, consider a task that is attached to a cpuset with
mems 1-3 that sets an Interleave policy over the same set. If
the cpuset's mems change to 3-5, then without this flag the
Interleave will now occur over nodes 3, 4, and 5. With this flag,
however, since only node 3 is allowed from the user's nodemask,
the "interleave" only occurs over that node. If no nodes from the
user's nodemask are now allowed, the Default behavior is used.

MPOL_F_STATIC_NODES cannot be used with MPOL_F_RELATIVE_NODES.

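The effect of MPOL_F_STATIC_NODES on the example above can be modelled with
plain bit masks. The sketch below is illustrative only; static_nodes() and
node_range() are hypothetical helpers, and a node set is again one unsigned
long with bit i standing for node i.

```c
#include <assert.h>

/*
 * Model of MPOL_F_STATIC_NODES: the effective nodes are the user's
 * nodes that the task is currently allowed to use.  A result of 0
 * (no overlap) means the Default behavior applies.
 */
static unsigned long static_nodes(unsigned long user_nodes,
                                  unsigned long allowed_nodes)
{
    return user_nodes & allowed_nodes;
}

/* Build a mask with bits lo..hi set, e.g. node_range(1, 3) for mems 1-3. */
static unsigned long node_range(unsigned int lo, unsigned int hi)
{
    unsigned long mask = 0;

    assert(lo <= hi && hi < 8 * sizeof(unsigned long));
    for (unsigned int node = lo; node <= hi; node++)
        mask |= 1UL << node;
    return mask;
}
```

For a user nodemask of 1-3, static_nodes() yields 1-3 while the cpuset
allows 1-3, only node 3 once the cpuset changes to 3-5, and the empty set
(Default behavior) if the cpuset moves to a disjoint range.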
MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed
by the user will be mapped relative to the task or VMA's set of
allowed nodes. The kernel stores the user-passed nodemask, and if
the set of allowed nodes changes, then that original nodemask will
be remapped relative to the new set of allowed nodes.

Without this flag (and without MPOL_F_STATIC_NODES), anytime a
mempolicy is rebound because of a change in the set of allowed
nodes, the node (Preferred) or nodemask (Bind, Interleave) is
remapped to the new set of allowed nodes. That remap may not
preserve the relative nature of the user's passed nodemask to its
set of allowed nodes upon successive rebinds: a nodemask of
1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
allowed nodes is restored to its original state.

With this flag, the remap is done so that the node numbers from
the user's passed nodemask are relative to the set of allowed
nodes. In other words, if nodes 0, 2, and 4 are set in the user's
nodemask, the policy will be effected over the first (and in the
Bind or Interleave case, the third and fifth) nodes in the set of
allowed nodes. The nodemask passed by the user represents nodes
relative to the task or VMA's set of allowed nodes.

If the user's nodemask includes nodes that are outside the range
of the new set of allowed nodes (for example, node 5 is set in
the user's nodemask when the set of allowed nodes is only 0-3),
then the remap wraps around to the beginning of the nodemask and,
if not already set, sets the node in the mempolicy nodemask.

For example, consider a task that is attached to a cpuset with
mems 2-5 that sets an Interleave policy over the same set with
MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
interleave now occurs over nodes 3,5-7. If the cpuset's mems
then change to 0,2-3,5, then the interleave occurs over nodes
0,2-3,5.

Thanks to the consistent remapping, applications preparing
nodemasks to specify memory policies using this flag should
disregard their current, actual cpuset imposed memory placement
and prepare the nodemask as if they were always located on
memory nodes 0 to N-1, where N is the number of memory nodes the
policy is intended to manage. Let the kernel then remap to the
set of memory nodes allowed by the task's cpuset, as that may
change over time.

MPOL_F_RELATIVE_NODES cannot be used with MPOL_F_STATIC_NODES.

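The relative remap, including the wrap-around, can be worked through with a
short C model. It follows the description above rather than the kernel
source: relative_nodes() and node_weight() are hypothetical names, and node
sets are single unsigned long masks with bit i standing for node i.

```c
#include <assert.h>

#define MAX_NODES (8 * sizeof(unsigned long))

/* Number of nodes set in the mask. */
static unsigned int node_weight(unsigned long nodes)
{
    unsigned int count = 0;

    for (; nodes; nodes &= nodes - 1)
        count++;
    return count;
}

/*
 * Map the user's relative nodemask onto the allowed nodes:
 *  1. fold: relative node r becomes position r modulo the number of
 *     allowed nodes (this is the wrap-around);
 *  2. onto: position p selects the p-th (0-based) allowed node.
 */
static unsigned long relative_nodes(unsigned long user_nodes,
                                    unsigned long allowed_nodes)
{
    unsigned int n_allowed = node_weight(allowed_nodes);
    unsigned long folded = 0;
    unsigned long result = 0;
    unsigned int position = 0;

    assert(n_allowed > 0);

    for (unsigned int r = 0; r < MAX_NODES; r++)
        if (user_nodes & (1UL << r))
            folded |= 1UL << (r % n_allowed);

    for (unsigned int node = 0; node < MAX_NODES; node++) {
        if (!(allowed_nodes & (1UL << node)))
            continue;
        if (folded & (1UL << position))
            result |= 1UL << node;
        position++;
    }
    return result;
}
```

Under this model a user nodemask of 2-5 maps to 2-5 when mems are 2-5, to
3,5-7 when mems are 3-7 (relative node 5 wraps to position 0, which is
node 3), and to 0,2-3,5 when mems are 0,2-3,5.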
MEMORY POLICY APIs

Linux supports 3 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

Note: the headers that define these APIs and the parameter data types
for user space applications reside in a package that is not part of
the Linux kernel. The kernel system call interfaces, with the 'sys_'
prefix, are defined in <linux/syscalls.h>; the mode and flag
definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy:

    long set_mempolicy(int mode, const unsigned long *nmask,
                       unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined
by 'nmask'. 'nmask' points to a bit mask of node ids containing
at least 'maxnode' ids. Optional mode flags may be passed by
combining the 'mode' argument with the flag (for example:
MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.


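A minimal sketch of calling set_mempolicy() follows. To stay independent of
the user space wrapper library it goes through syscall(2) directly and takes
the mode definitions from <linux/mempolicy.h>; interleave_task() is a
hypothetical helper, and using a single unsigned long as the node mask is an
assumption that limits the example to low-numbered nodes. The call can fail,
for instance on a kernel without NUMA support or for a node that does not
exist, so the return value must be checked.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>

/*
 * Interleave the calling task's future allocations over the nodes whose
 * bits are set in 'nodes' (bit i = node i).  'mode_flags' may be 0 or a
 * mode flag such as MPOL_F_STATIC_NODES.  Returns 0 on success, or -1
 * with errno set.
 */
static int interleave_task(unsigned long nodes, int mode_flags)
{
    int mode = MPOL_INTERLEAVE | mode_flags;
    unsigned long maxnode = 8 * sizeof(nodes);  /* bits in the mask */

    assert(nodes != 0);     /* Interleave needs at least one node */
    return (int)syscall(SYS_set_mempolicy, mode, &nodes, maxnode);
}
```

A caller would typically invoke this before fork()/exec() of a program that
is not NUMA-aware, for example interleave_task(0x3, MPOL_F_STATIC_NODES) for
nodes 0 and 1, and report errno if -1 is returned.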
Get [Task] Memory Policy or Related Information

    long get_mempolicy(int *mode,
                       unsigned long *nmask, unsigned long maxnode,
                       void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or
the policy or location of a specified virtual address, depending
on the 'flags' argument.

See the get_mempolicy(2) man page for more details.


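The two kinds of query can be sketched as below, again via syscall(2) so
that no wrapper library is needed. task_policy_mode() and node_of_address()
are hypothetical helpers. The node mask is not requested (NULL, 0), which
keeps the example short; the flag combination MPOL_F_NODE | MPOL_F_ADDR is
taken from get_mempolicy(2). Both calls may fail on kernels without NUMA
support.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>

/* Mode of the calling task's policy, or -1 with errno set. */
static int task_policy_mode(void)
{
    int mode;

    /* addr == NULL and flags == 0 select the task policy. */
    if (syscall(SYS_get_mempolicy, &mode, NULL, 0UL, NULL, 0UL) != 0)
        return -1;
    return mode;
}

/* Node currently backing the page at 'addr', or -1 with errno set. */
static int node_of_address(void *addr)
{
    int node;

    assert(addr != NULL);
    if (syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
                (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR)) != 0)
        return -1;
    return node;
}
```

For a task that never installed a policy, task_policy_mode() is expected to
report MPOL_DEFAULT; node_of_address() is a simple way to check where a page
actually landed after a policy was applied.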
Install VMA/Shared Policy for a Range of Task's Address Space

    long mbind(void *start, unsigned long len, int mode,
               const unsigned long *nmask, unsigned long maxnode,
               unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as
a VMA policy for the range of the calling task's address space
specified by the 'start' and 'len' arguments. Additional actions
may be requested via the 'flags' argument.

See the mbind(2) man page for more details.

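As an illustration, the sketch below maps an anonymous region and installs a
Bind policy on it before any page is touched, so that the pages are placed
according to the VMA policy when they are first faulted in. alloc_bound() is
a hypothetical helper, the single unsigned long node mask is an assumption,
and mbind() is reached through syscall(2). The call fails if the kernel lacks
NUMA support or the requested nodes are not available to the task.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <linux/mempolicy.h>

/*
 * Map 'len' bytes of anonymous memory and bind the range to the nodes in
 * 'nodes' (bit i = node i) before any page is touched.  Returns the
 * mapping, or NULL on failure with errno set by mmap() or mbind().
 */
static void *alloc_bound(size_t len, unsigned long nodes)
{
    void *p;

    assert(len > 0 && nodes != 0);

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* No extra 'flags': the policy only affects future page faults. */
    if (syscall(SYS_mbind, p, (unsigned long)len,
                (unsigned long)MPOL_BIND, &nodes,
                (unsigned long)(8 * sizeof(nodes)), 0UL) != 0) {
        int saved = errno;

        munmap(p, len);
        errno = saved;
        return NULL;
    }
    return p;
}
```

Because the range is private and anonymous, this installs a VMA policy; the
same call on a range that maps a shared memory segment would install a
shared policy on the underlying object instead.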
MEMORY POLICY COMMAND LINE INTERFACE

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.


MEMORY POLICIES AND CPUSETS

Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints. If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used. If the
result is the empty set, the policy is considered invalid and cannot be
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If
any of the tasks installs shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies. Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information and requires that one know in what cpusets other tasks might
be attaching to the shared region. Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.