.. Mike Rapoport | 45c9a74 | 2018-05-14 11:13:40 +0300

.. _admin_guide_transhuge:

============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means of
backing virtual memory with huge pages that supports the automatic
promotion and demotion of page sizes without the shortcomings of
hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem,
but in the future it can expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

Applications run faster for two reasons. The first is that a single
page fault is taken for each 2M virtual region touched by userland,
which reduces the kernel enter/exit frequency by a factor of 512. This
only matters the first time the memory is accessed during the lifetime
of a memory mapping, and it comes with the downside of a larger
clear-page or copy-page operation in each page fault, so it is of
little practical interest. The second, long-lasting and much more
important factor affects all subsequent accesses to the memory for the
whole runtime of the application. It consists of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will map a much larger amount of virtual
   memory, in turn reducing the number of TLB misses. With
   virtualization and nested pagetables the TLB can hold the larger
   mapping only if both KVM and the Linux guest are using
   hugepages, but a significant speedup already happens if only one of
   the two is using hugepages, because the TLB miss is going to run
   faster.
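
As a rough illustration of the two factors, the following sketch works
through the arithmetic for the 4K/2M sizes presumed in the note above. The
TLB entry count is a hypothetical figure chosen only for illustration; real
TLB sizes vary by CPU.

```shell
# Page sizes presumed throughout this document.
base_page=$((4 * 1024))
huge_page=$((2 * 1024 * 1024))

# Base pages (and therefore first-touch page faults) covered by one huge page.
pages_per_hugepage=$((huge_page / base_page))

# Hypothetical TLB with 64 entries; real sizes depend on the CPU.
tlb_entries=64
reach_base_kib=$((tlb_entries * base_page / 1024))
reach_huge_kib=$((tlb_entries * huge_page / 1024))

echo "base pages per huge page: $pages_per_hugepage"
echo "TLB reach with 4K pages: ${reach_base_kib} KiB"
echo "TLB reach with 2M pages: ${reach_huge_kib} KiB"
```

With these assumed numbers the same TLB covers 256 KiB of virtual memory
with base pages and 131072 KiB (128 MiB) with huge pages.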

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and using the madvise(2) and prctl(2) system calls.

Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows paging
and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications can however be further optimized to take advantage of
this feature, just as they have been optimized before to avoid
a flood of mmap system calls for every malloc(4k). Optimizing userland
is by no means mandatory, and khugepaged can already take care of
long-lived page allocations even for hugepage-unaware applications that
deal with large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no benefit. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk wasting memory by using hugepages should use
madvise(MADV_HUGEPAGE) on their critical mmapped regions.

.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of::

        echo always >/sys/kernel/mm/transparent_hugepage/enabled
        echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
        echo never >/sys/kernel/mm/transparent_hugepage/enabled
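
Reading the same file back lists all accepted values with the active one in
square brackets, for example ``always [madvise] never``. A minimal sketch
for extracting the active value follows; ``thp_active_value`` is a
hypothetical helper name, and its optional argument defaults to the
``enabled`` file shown above.

```shell
# Print the bracketed (active) value from a THP sysfs selector file.
# thp_active_value is a hypothetical helper, not a kernel-provided tool.
thp_active_value() {
    file=${1:-/sys/kernel/mm/transparent_hugepage/enabled}
    # The active choice is the only token wrapped in [ ].
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}
```

The same helper should work for other selector files of this shape, such as
``defrag``, by passing their path as the argument.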

It's also possible to limit the defrag efforts the VM makes to generate
anonymous hugepages when they're not immediately free: defrag can be
restricted to madvise regions, or disabled entirely so that the
allocation simply falls back to regular pages unless hugepages are
immediately available. Clearly if we spend CPU time to defrag memory,
we would expect to gain even more from using hugepages later instead
of regular pages. This isn't always guaranteed, but it may be more
likely when the allocation is for a MADV_HUGEPAGE region.

::

        echo always >/sys/kernel/mm/transparent_hugepage/defrag
        echo defer >/sys/kernel/mm/transparent_hugepage/defrag
        echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
        echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
        echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
        means that an application requesting THP will stall on
        allocation failure and directly reclaim pages and compact
        memory in an effort to allocate a THP immediately. This may be
        desirable for virtual machines that benefit heavily from THP
        use and are willing to delay the VM start to utilise them.

defer
        means that an application will wake kswapd in the background
        to reclaim pages and wake kcompactd to compact memory so that
        THP is available in the near future. It's the responsibility
        of khugepaged to then install the THP pages later.

defer+madvise
        will enter direct reclaim and compaction like ``always``, but
        only for regions that have used madvise(MADV_HUGEPAGE); all
        other regions will wake kswapd in the background to reclaim
        pages and wake kcompactd to compact memory so that THP is
        available in the near future.

madvise
        will enter direct reclaim like ``always`` but only for regions
        that have used madvise(MADV_HUGEPAGE). This is the default
        behaviour.

never
        never tries to defrag memory; the allocation simply falls back
        to regular pages unless hugepages are immediately available.
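
The per-mode behaviour above can be summarised as a small lookup. The sketch
below only restates the descriptions in this section and is not kernel code;
``thp_defrag_action`` and its output words (``direct``, ``background``,
``none``) are hypothetical names. The ``none`` result for regions without
the madvise hint under ``madvise`` is an assumption based on the statement
that allocations fall back to regular pages when defrag is not attempted.

```shell
# Map a defrag mode and a region type to the fault-time behaviour described
# above. Arguments: mode, and "yes"/"no" for whether the region used
# madvise(MADV_HUGEPAGE). Hypothetical helper for illustration only.
thp_defrag_action() {
    mode=$1
    madvised=$2
    case "$mode" in
        always)
            echo direct ;;          # stall, reclaim and compact directly
        defer)
            echo background ;;      # wake kswapd/kcompactd, do not stall
        defer+madvise)
            if [ "$madvised" = yes ]; then echo direct; else echo background; fi ;;
        madvise)
            # Assumption: regions without the hint get no defrag effort.
            if [ "$madvised" = yes ]; then echo direct; else echo none; fi ;;
        never)
            echo none ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
}
```

For example, ``thp_defrag_action defer+madvise no`` prints ``background``.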

By default the kernel tries to use the huge zero page on a read page
fault to an anonymous mapping. It's possible to disable the huge zero
page by writing 0 or enable it again by writing 1::

        echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
        echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory allocation
library) may want to know the size (in bytes) of a transparent hugepage::

        cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
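
A script can combine this value with the base page size to learn how many
base pages one hugepage spans. The following is a minimal sketch;
``thp_pages_per_hugepage`` and both of its optional arguments are
hypothetical conveniences, not part of any kernel interface.

```shell
# Print how many base pages make up one transparent hugepage.
#   $1  file holding the hugepage size in bytes (defaults to hpage_pmd_size)
#   $2  base page size in bytes (defaults to the running system's page size)
thp_pages_per_hugepage() {
    size_file=${1:-/sys/kernel/mm/transparent_hugepage/hpage_pmd_size}
    base_page=${2:-$(getconf PAGESIZE)}
    hpage_size=$(cat "$size_file") || return 1
    echo $((hpage_size / base_page))
}
```

With the 4K/2M sizes presumed in this document the result is 512.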

khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise", and it'll
be automatically shut down if it's set to "never".

Khugepaged controls
-------------------

khugepaged usually runs at low frequency, so while one may not want to
invoke defrag algorithms synchronously during page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

        echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
        echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

        /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

        /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

        /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed::

        /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of full scan passes completed::

        /sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

        /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value makes programs use additional memory. A lower value
yields less THP performance. The value of max_ptes_none has very
little effect on CPU time, so that cost can be ignored.
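
To make the memory trade-off concrete: each small page that was not already
mapped and is filled in during a collapse costs one base page of additional
memory, so the upper bound per collapsed hugepage is ``max_ptes_none`` times
the base page size. A small sketch of that arithmetic, assuming 4K base
pages; the helper name and the sample values are illustrative and are not
read from a live system.

```shell
# Upper bound, in KiB, on the extra memory one collapse may allocate.
# $1: max_ptes_none value, $2: base page size in bytes (assumed 4096 here).
thp_collapse_overhead_kib() {
    echo $(( $1 * ${2:-4096} / 1024 ))
}

thp_collapse_overhead_kib 511   # prints 2044: almost a whole 2M hugepage
thp_collapse_overhead_kib 0     # prints 0: no extra pages may be allocated
```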

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

        /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.
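
When tuning these knobs it can help to capture all of them at once. The
sketch below prints each attribute in the khugepaged directory as
``name=value``; ``thp_khugepaged_dump`` is a hypothetical helper name and
the directory argument is optional.

```shell
# Print every khugepaged attribute in a directory as name=value.
# thp_khugepaged_dump is a hypothetical helper used only in this sketch.
thp_khugepaged_dump() {
    dir=${1:-/sys/kernel/mm/transparent_hugepage/khugepaged}
    for attr in "$dir"/*; do
        # Skip anything that is not a regular file (or an unmatched glob).
        [ -f "$attr" ] || continue
        printf '%s=%s\n' "${attr##*/}" "$(cat "$attr")"
    done
}
```

Saving the output before and after a change gives a simple record of which
values were modified.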

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.
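
To check whether the running kernel was booted with this parameter, the
command line can be inspected through ``/proc/cmdline``. A minimal sketch,
with ``thp_boot_param`` as a hypothetical helper name:

```shell
# Print the value of each transparent_hugepage= parameter found in a kernel
# command line file, or nothing if the parameter is absent.
# $1 defaults to /proc/cmdline.
thp_boot_param() {
    cmdline_file=${1:-/proc/cmdline}
    # Put each parameter on its own line, then keep only the matching value.
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^transparent_hugepage=//p'
}
```

An empty result means the boot time default was not overridden on the
command line.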

Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with the mount option
``huge=``. It can have the following values:

always
        Attempt to allocate huge pages every time we need a new page;

never
        Do not allocate huge pages;

within_size
        Only allocate huge page if it will be fully within i_size.
        Also respect fadvise()/madvise() hints;

advise
        Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
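
To see which policy each tmpfs mount currently uses, the mount table can be
scanned for the ``huge=`` option. The sketch below assumes that a policy set
at mount time appears among the options in ``/proc/mounts``, and reports the
documented default ``never`` for mounts that show no such option;
``tmpfs_huge_policy`` is a hypothetical helper name.

```shell
# List tmpfs mount points and their huge= policy from a mounts table.
# $1 defaults to /proc/mounts. Assumption: mounts without an explicit
# huge= option use the default policy, "never".
tmpfs_huge_policy() {
    mounts_file=${1:-/proc/mounts}
    awk '$3 == "tmpfs" {
        policy = "never"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /^huge=/)
                policy = substr(opts[i], 6)
        print $2, policy
    }' "$mounts_file"
}
```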

There's also a sysfs knob to control hugepage allocation policy for the
internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
The mount is used for SysV SHM, memfds, shared anonymous mmaps (of
/dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.

In addition to policies listed above, shmem_enabled allows two further
values:

deny
        For use in emergencies, to force the huge option off from
        all mounts;

force
        Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled values and tmpfs mount option only affect
future behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to the
regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using anonymous transparent huge pages,
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
for each mapping.

The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.
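
Counting the per-mapping fields can be done with a short awk filter. The
sketch below sums the AnonHugePages values of one smaps file;
``smaps_anon_hugepages_kb`` is a hypothetical helper name, and the values
are assumed to be reported in kB as in other smaps fields.

```shell
# Sum the AnonHugePages fields of one smaps file and print the total in kB.
# $1 is the file to read, e.g. /proc/PID/smaps for the process of interest.
smaps_anon_hugepages_kb() {
    awk '/^AnonHugePages:/ { total += $2 } END { print total + 0 }' "$1"
}
```

Given the cost noted above, it is better to call this on demand than in a
tight polling loop.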

There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
        is incremented every time a huge page is successfully
        allocated to handle a page fault. This applies to both the
        first time a page is faulted and for COW faults.

thp_collapse_alloc
        is incremented by khugepaged when it has found
        a range of pages to collapse into one huge page and has
        successfully allocated a new huge page to store the data.

thp_fault_fallback
        is incremented if a page fault fails to allocate
        a huge page and instead falls back to using small pages.

thp_collapse_alloc_failed
        is incremented if khugepaged found a range
        of pages that should be collapsed into one huge page but failed
        the allocation.

thp_file_alloc
        is incremented every time a file huge page is successfully
        allocated.

thp_file_mapped
        is incremented every time a file huge page is mapped into
        user address space.

thp_split_page
        is incremented every time a huge page is split into base
        pages. This can happen for a variety of reasons but a common
        reason is that a huge page is old and is being reclaimed.
        This action implies splitting all PMDs that map the page.

thp_split_page_failed
        is incremented if the kernel fails to split a huge
        page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
        is incremented when a huge page is put onto the split
        queue. This happens when a huge page is partially unmapped and
        splitting it would free up some memory. Pages on the split queue
        are going to be split under memory pressure.

thp_split_pmd
        is incremented every time a PMD is split into a table of PTEs.
        This can happen, for instance, when an application calls mprotect()
        or munmap() on part of a huge page. It doesn't split the huge page,
        only the page table entry.

thp_zero_page_alloc
        is incremented every time a huge zero page is
        successfully allocated. It includes allocations which were
        dropped due to a race with another allocation. Note, it doesn't
        count every map of the huge zero page, only its allocation.

thp_zero_page_alloc_failed
        is incremented if the kernel fails to allocate
        a huge zero page and falls back to using small pages.

thp_swpout
        is incremented every time a huge page is swapped out in one
        piece without splitting.

thp_swpout_fallback
        is incremented if a huge page has to be split before swapout,
        usually because the kernel failed to allocate contiguous swap
        space for the huge page.
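
One way to read these counters together is to relate fallbacks to all THP
fault attempts. The following sketch derives a fallback percentage from
thp_fault_alloc and thp_fault_fallback; the ratio itself is an illustrative
metric rather than something the kernel reports, and
``thp_fault_fallback_pct`` is a hypothetical helper name.

```shell
# Print the percentage of THP page faults that fell back to small pages,
# computed from thp_fault_alloc and thp_fault_fallback.
# $1 defaults to /proc/vmstat. Prints 0 when no THP faults were recorded.
thp_fault_fallback_pct() {
    vmstat_file=${1:-/proc/vmstat}
    awk '$1 == "thp_fault_alloc"    { alloc = $2 }
         $1 == "thp_fault_fallback" { fallback = $2 }
         END {
             total = alloc + fallback
             if (total == 0) print 0
             else printf "%d\n", 100 * fallback / total
         }' "$vmstat_file"
}
```

The counters are cumulative since boot, so the percentage describes the
whole uptime rather than recent behaviour.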

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
        is incremented every time a process stalls to run
        memory compaction so that a huge page is free for use.

compact_success
        is incremented if the system compacted memory and
        freed a huge page for use.

compact_fail
        is incremented if the system tried to compact memory
        but failed.

compact_pages_moved
        is incremented each time a page is moved. If
        this value is increasing rapidly, it implies that the system
        is copying a lot of data to satisfy the huge page allocation.
        It is possible that the cost of copying exceeds any savings
        from reduced TLB misses.

compact_pagemigrate_failed
        is incremented when the underlying mechanism
        for moving a page failed.

compact_blocks_moved
        is incremented each time memory compaction examines
        a huge page aligned range of pages.
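
Whether a counter is "increasing rapidly" is easiest to judge from the
difference between two snapshots taken some time apart. A minimal sketch,
with ``vmstat_delta`` as a hypothetical helper name; a counter missing from
a snapshot is treated as 0.

```shell
# Print how much one /proc/vmstat counter grew between two saved snapshots.
# $1: counter name, $2: earlier snapshot file, $3: later snapshot file.
vmstat_delta() {
    before=$(awk -v name="$1" '$1 == name { print $2 }' "$2")
    after=$(awk -v name="$1" '$1 == name { print $2 }' "$3")
    echo $(( ${after:-0} - ${before:-0} ))
}
```

Snapshots can be taken by copying ``/proc/vmstat`` to a file before and
after the interval of interest.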

It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages_nodemask and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be naturally aligned to the
hugepage size. posix_memalign() can provide that guarantee.
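
Natural alignment here means that the start address is a multiple of the
hugepage size. The arithmetic for rounding an address up to the next such
boundary can be sketched as follows, assuming the 2M size presumed in this
document; ``thp_align_up`` is a hypothetical helper name.

```shell
# Round an address up to the next multiple of the hugepage size.
# $1: address, $2: hugepage size in bytes (must be a power of two;
# defaults to the 2M presumed in this document).
thp_align_up() {
    size=${2:-2097152}
    echo $(( ($1 + size - 1) & ~(size - 1) ))
}

thp_align_up 19088743     # 0x1234567 rounds up to 0x1400000, prints 20971520
thp_align_up 4194304      # already aligned, prints 4194304
```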

Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.