.. _hugetlbfs_reserve:

=====================
Hugetlbfs Reservation
=====================

Overview
========

Huge pages as described at :ref:`hugetlbpage` are typically
preallocated for application use. These huge pages are instantiated in a
task's address space at page fault time if the VMA indicates huge pages are
to be used. If no huge page exists at page fault time, the task is sent
a SIGBUS and often dies an unhappy death. Shortly after huge page support
was added, it was determined that it would be better to detect a shortage
of huge pages at mmap() time. The idea is that if there were not enough
huge pages to cover the mapping, the mmap() would fail. This was first
done with a simple check in the code at mmap() time to determine if there
were enough free huge pages to cover the mapping. Like most things in the
kernel, the code has evolved over time. However, the basic idea was to
'reserve' huge pages at mmap() time to ensure that huge pages would be
available for page faults in that mapping. The description below attempts to
describe how huge page reserve processing is done in the v4.10 kernel.


Audience
========
This description is primarily targeted at kernel developers who are modifying
hugetlbfs code.


The Data Structures
===================

resv_huge_pages
    This is a global (per-hstate) count of reserved huge pages. Reserved
    huge pages are only available to the task which reserved them.
    Therefore, the number of huge pages generally available is computed
    as (``free_huge_pages - resv_huge_pages``).
Reserve Map
    A reserve map is described by the structure::

        struct resv_map {
            struct kref refs;
            spinlock_t lock;
            struct list_head regions;
            long adds_in_progress;
            struct list_head region_cache;
            long region_cache_count;
        };

    There is one reserve map for each huge page mapping in the system.
    The regions list within the resv_map describes the regions within
    the mapping. A region is described as::

        struct file_region {
            struct list_head link;
            long from;
            long to;
        };

    The 'from' and 'to' fields of the file region structure are huge page
    indices into the mapping. Depending on the type of mapping, a
    region in the resv_map may indicate reservations exist for the
    range, or reservations do not exist.
Flags for MAP_PRIVATE Reservations
    These are stored in the bottom bits of the reservation map pointer.

    ``#define HPAGE_RESV_OWNER    (1UL << 0)``
        Indicates this task is the owner of the reservations
        associated with the mapping.
    ``#define HPAGE_RESV_UNMAPPED (1UL << 1)``
        Indicates task originally mapping this range (and creating
        reserves) has unmapped a page from this task (the child)
        due to a failed COW.
Page Flags
    The PagePrivate page flag is used to indicate that a huge page
    reservation must be restored when the huge page is freed. More
    details will be discussed in the "Freeing huge pages" section.
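
As an illustration of how flag bits can share a word with the reservation map
pointer, here is a minimal user-space sketch. It is not the kernel's code:
``toy_resv_map``, ``toy_pack()``, ``toy_map()``, ``toy_flag_set()`` and the
``HPAGE_RESV_MASK`` macro are names assumed for this example, and the scheme
only works because the map is assumed to be at least 4-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

#define HPAGE_RESV_OWNER    (1UL << 0)
#define HPAGE_RESV_UNMAPPED (1UL << 1)
/* Assumed helper macro for this sketch: all flag bits at once. */
#define HPAGE_RESV_MASK     (HPAGE_RESV_OWNER | HPAGE_RESV_UNMAPPED)

/* Stand-in for struct resv_map; only its alignment matters here. */
struct toy_resv_map {
	long placeholder;
};

/* Combine a map pointer and flag bits into a single word. */
static uintptr_t toy_pack(struct toy_resv_map *map, unsigned long flags)
{
	/* The low two pointer bits must be free to hold the flags. */
	assert(((uintptr_t)map & HPAGE_RESV_MASK) == 0);
	assert((flags & ~HPAGE_RESV_MASK) == 0);
	return (uintptr_t)map | flags;
}

/* Recover the map pointer by masking the flag bits off. */
static struct toy_resv_map *toy_map(uintptr_t word)
{
	return (struct toy_resv_map *)(word & ~(uintptr_t)HPAGE_RESV_MASK);
}

/* Test a single flag bit. */
static int toy_flag_set(uintptr_t word, unsigned long flag)
{
	return (word & flag) != 0;
}
```

Packing a map with ``HPAGE_RESV_OWNER`` and then calling ``toy_map()`` yields
the original pointer, while ``toy_flag_set()`` reports the owner bit as set and
the unmapped bit as clear.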


Reservation Map Location (Private or Shared)
============================================

A huge page mapping or segment is either private or shared. If private,
it is typically only available to a single address space (task). If shared,
it can be mapped into multiple address spaces (tasks). The location and
semantics of the reservation map are significantly different for the two types
of mappings. Location differences are:

- For private mappings, the reservation map hangs off the VMA structure.
  Specifically, vma->vm_private_data. This reserve map is created at the
  time the mapping (mmap(MAP_PRIVATE)) is created.
- For shared mappings, the reservation map hangs off the inode. Specifically,
  inode->i_mapping->private_data. Since shared mappings are always backed
  by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode
  contains a reservation map. As a result, the reservation map is allocated
  when the inode is created.


Creating Reservations
=====================
Reservations are created when a huge page backed shared memory segment is
created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB).
These operations result in a call to the routine hugetlb_reserve_pages()::

    int hugetlb_reserve_pages(struct inode *inode,
                              long from, long to,
                              struct vm_area_struct *vma,
                              vm_flags_t vm_flags)

The first thing hugetlb_reserve_pages() does is check whether the NORESERVE
flag was specified in either the shmget() or mmap() call. If NORESERVE
was specified, then this routine returns immediately as no reservations
are desired.

The arguments 'from' and 'to' are huge page indices into the mapping or
underlying file. For shmget(), 'from' is always 0 and 'to' corresponds to
the length of the segment/mapping. For mmap(), the offset argument could
be used to specify the offset into the underlying file. In such a case
the 'from' and 'to' arguments have been adjusted by this offset.

One of the big differences between PRIVATE and SHARED mappings is the way
in which reservations are represented in the reservation map.

- For shared mappings, an entry in the reservation map indicates a reservation
  exists or did exist for the corresponding page. As reservations are
  consumed, the reservation map is not modified.
- For private mappings, the lack of an entry in the reservation map indicates
  a reservation exists for the corresponding page. As reservations are
  consumed, entries are added to the reservation map. Therefore, the
  reservation map can also be used to determine which reservations have
  been consumed.
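
The two interpretations of a map entry can be sketched in user-space C. This is
only a model under stated assumptions: ``toy_region`` stands in for
struct file_region, the map is a plain array rather than a linked list, and
``toy_entry_present()`` and ``toy_reservation_exists()`` are hypothetical names.
For private mappings the page index is assumed to lie within the range reserved
by the owner.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct file_region: covers page indices [from, to). */
struct toy_region {
	long from;
	long to;
};

/* Return true if page index idx falls inside any region of the toy map. */
static bool toy_entry_present(const struct toy_region *map, size_t n, long idx)
{
	for (size_t i = 0; i < n; i++) {
		assert(map[i].from <= map[i].to);
		if (idx >= map[i].from && idx < map[i].to)
			return true;
	}
	return false;
}

/*
 * Interpret the map for one page index.
 * Shared:  an entry means a reservation exists (or did exist).
 * Private: the absence of an entry means a reservation exists;
 *          an entry means the reservation was already consumed.
 */
static bool toy_reservation_exists(bool shared, const struct toy_region *map,
				   size_t n, long idx)
{
	bool present = toy_entry_present(map, n, idx);

	return shared ? present : !present;
}
```

With a map holding the regions [0, 2) and [5, 6), page 1 has a reservation if
the mapping is shared but not if it is private, and page 3 is the reverse.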

For private mappings, hugetlb_reserve_pages() creates the reservation map and
hangs it off the VMA structure. In addition, the HPAGE_RESV_OWNER flag is set
to indicate this VMA owns the reservations.

The reservation map is consulted to determine how many huge page reservations
are needed for the current mapping/segment. For private mappings, this is
always the value (to - from). However, for shared mappings it is possible
that some reservations may already exist within the range (to - from). See the
section :ref:`Reservation Map Modifications <resv_map_modifications>`
for details on how this is accomplished.

The mapping may be associated with a subpool. If so, the subpool is consulted
to ensure there is sufficient space for the mapping. It is possible that the
subpool has set aside reservations that can be used for the mapping. See the
section :ref:`Subpool Reservations <sub_pool_resv>` for more details.

After consulting the reservation map and subpool, the number of needed new
reservations is known. The routine hugetlb_acct_memory() is called to check
for and take the requested number of reservations. hugetlb_acct_memory()
calls into routines that potentially allocate and adjust surplus page counts.
However, within those routines the code is simply checking to ensure there
are enough free huge pages to accommodate the reservation. If there are,
the global reservation count resv_huge_pages is adjusted something like the
following::

    if (resv_needed <= (free_huge_pages - resv_huge_pages))
        resv_huge_pages += resv_needed;

Note that the global lock hugetlb_lock is held when checking and adjusting
these counters.
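
A worked example may help. The sketch below models only the two counters and
the availability rule ``free_huge_pages - resv_huge_pages`` given in the data
structures section. It is a user-space toy, not hugetlb_acct_memory() itself:
``toy_hstate`` and ``toy_reserve()`` are hypothetical names, and surplus pages
and locking are deliberately ignored.

```c
#include <assert.h>

/* Toy model of the two per-hstate counters discussed in the text. */
struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

/*
 * Try to take resv_needed new reservations.
 * Returns 0 on success, -1 if too few unreserved free pages remain.
 */
static int toy_reserve(struct toy_hstate *h, long resv_needed)
{
	long available = h->free_huge_pages - h->resv_huge_pages;

	assert(resv_needed >= 0);
	if (resv_needed > available)
		return -1;
	h->resv_huge_pages += resv_needed;
	return 0;
}
```

Starting from 10 free pages with 4 already reserved, a request for 6
reservations succeeds and raises the reserved count to 10. A further request
for a single page then fails, because no unreserved free pages remain.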

If there were enough free huge pages and the global count resv_huge_pages
was adjusted, then the reservation map associated with the mapping is
modified to reflect the reservations. In the case of a shared mapping, a
file_region will exist that includes the range 'from' - 'to'. For private
mappings, no modifications are made to the reservation map as lack of an
entry indicates a reservation exists.

If hugetlb_reserve_pages() was successful, the global reservation count and
reservation map associated with the mapping will be modified as required to
ensure reservations exist for the range 'from' - 'to'.

.. _consume_resv:

Consuming Reservations/Allocating a Huge Page
=============================================

Reservations are consumed when huge pages associated with the reservations
are allocated and instantiated in the corresponding mapping. The allocation
is performed within the routine alloc_huge_page()::

    struct page *alloc_huge_page(struct vm_area_struct *vma,
                                 unsigned long addr, int avoid_reserve)

alloc_huge_page is passed a VMA pointer and a virtual address, so it can
consult the reservation map to determine if a reservation exists. In addition,
alloc_huge_page takes the argument avoid_reserve which indicates reserves
should not be used even if it appears they have been set aside for the
specified address. The avoid_reserve argument is most often used in the case
of Copy on Write and Page Migration where additional copies of an existing
page are being allocated.

The helper routine vma_needs_reservation() is called to determine if a
reservation exists for the address within the mapping (vma). See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for detailed
information on what this routine does.
The value returned from vma_needs_reservation() is generally
0 or 1. 0 if a reservation exists for the address, 1 if no reservation exists.
If a reservation does not exist, and there is a subpool associated with the
mapping, the subpool is consulted to determine if it contains reservations.
If the subpool contains reservations, one can be used for this allocation.
However, in every case the avoid_reserve argument overrides the use of
a reservation for the allocation. After determining whether a reservation
exists and can be used for the allocation, the routine dequeue_huge_page_vma()
is called. This routine takes two arguments related to reservations:

- avoid_reserve, this is the same value/argument passed to alloc_huge_page()
- chg, even though this argument is of type long only the values 0 or 1 are
  passed to dequeue_huge_page_vma. If the value is 0, it indicates a
  reservation exists (see the section "Reservations and Memory Policy" for
  possible issues). If the value is 1, it indicates a reservation does not
  exist and the page must be taken from the global free pool if possible.

The free lists associated with the memory policy of the VMA are searched for
a free page. If a page is found, the value free_huge_pages is decremented
when the page is removed from the free list. If there was a reservation
associated with the page, the following adjustments are made::

    SetPagePrivate(page);   /* Indicates allocating this page consumed
                             * a reservation, and if an error is
                             * encountered such that the page must be
                             * freed, the reservation will be restored. */
    resv_huge_pages--;      /* Decrement the global reservation count */

Note, if no huge page can be found that satisfies the VMA's memory policy
an attempt will be made to allocate one using the buddy allocator. This
brings up the issue of surplus huge pages and overcommit which is beyond
the scope of reservations. Even if a surplus page is allocated, the same
reservation based adjustments as above will be made: SetPagePrivate(page) and
resv_huge_pages--.
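
The counter adjustments made when a page is taken from the free list can be
modelled in a few lines of user-space C. This is a sketch under assumptions,
not dequeue_huge_page_vma(): ``toy_hstate``, ``toy_page`` and ``toy_dequeue()``
are hypothetical names, the ``restore_reserve`` field stands in for the
PagePrivate flag, memory policy and surplus pages are ignored, and the rule
that an allocation without a reservation may only use unreserved free pages is
an assumption drawn from the availability formula
``free_huge_pages - resv_huge_pages``.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

struct toy_page {
	bool restore_reserve;	/* models the PagePrivate flag */
};

/*
 * Take one page from the toy free pool.
 * use_reservation models "chg == 0 and avoid_reserve not set".
 * Returns 0 on success, -1 if no suitable free page exists.
 */
static int toy_dequeue(struct toy_hstate *h, struct toy_page *page,
		       bool use_reservation)
{
	long unreserved = h->free_huge_pages - h->resv_huge_pages;

	if (h->free_huge_pages <= 0)
		return -1;
	/* Without a reservation, only unreserved free pages may be used. */
	if (!use_reservation && unreserved <= 0)
		return -1;

	h->free_huge_pages--;
	page->restore_reserve = false;
	if (use_reservation) {
		assert(h->resv_huge_pages > 0);
		page->restore_reserve = true;	/* SetPagePrivate(page) */
		h->resv_huge_pages--;
	}
	return 0;
}
```

With two free pages and one reservation, an allocation that uses the
reservation leaves one free page and no reservations, and marks the page so
the reservation can be restored on an error path.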

After obtaining a new huge page, (page)->private is set to the value of
the subpool associated with the page if it exists. This will be used for
subpool accounting when the page is freed.

The routine vma_commit_reservation() is then called to adjust the reserve
map based on the consumption of the reservation. In general, this involves
ensuring the page is represented within a file_region structure of the reserve
map. For shared mappings where the reservation was present, an entry
in the reserve map already existed so no change is made. However, if there
was no reservation in a shared mapping or this was a private mapping a new
entry must be created.

It is possible that the reserve map could have been changed between the call
to vma_needs_reservation() at the beginning of alloc_huge_page() and the
call to vma_commit_reservation() after the page was allocated. This would
be possible if hugetlb_reserve_pages was called for the same page in a shared
mapping. In such cases, the reservation count and subpool free page count
will be off by one. This rare condition can be identified by comparing the
return value from vma_needs_reservation and vma_commit_reservation. If such
a race is detected, the subpool and global reserve counts are adjusted to
compensate. See the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>` for more
information on these routines.


Instantiate Huge Pages
======================

After huge page allocation, the page is typically added to the page tables
of the allocating task. Before this, pages in a shared mapping are added
to the page cache and pages in private mappings are added to an anonymous
reverse mapping. In both cases, the PagePrivate flag is cleared. Therefore,
when a huge page that has been instantiated is freed no adjustment is made
to the global reservation count (resv_huge_pages).


Freeing Huge Pages
==================

Huge page freeing is performed by the routine free_huge_page(). This routine
is the destructor for hugetlbfs compound pages. As a result, it is only
passed a pointer to the page struct. When a huge page is freed, reservation
accounting may need to be performed. This would be the case if the page was
associated with a subpool that contained reserves, or the page is being freed
on an error path where a global reserve count must be restored.

The page->private field points to any subpool associated with the page.
If the PagePrivate flag is set, it indicates the global reserve count should
be adjusted (see the section
:ref:`Consuming Reservations/Allocating a Huge Page <consume_resv>`
for information on how these are set).

The routine first calls hugepage_subpool_put_pages() for the page. If this
routine returns a value of 0 (which does not equal the value passed, 1) it
indicates reserves are associated with the subpool, and this newly freed page
must be used to keep the number of subpool reserves above the minimum size.
Therefore, the global resv_huge_pages counter is incremented in this case.

If the PagePrivate flag was set in the page, the global resv_huge_pages counter
will always be incremented.
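
The two conditions that restore a global reservation on free can be sketched as
follows. This is a user-space model under assumptions, not free_huge_page():
``toy_hstate``, ``toy_page`` and ``toy_free_huge_page()`` are hypothetical
names, ``restore_reserve`` stands in for the PagePrivate flag, the subpool call
is reduced to its return value, and surplus pages are ignored. The sketch
assumes the global count is incremented once even when both conditions hold.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_hstate {
	long free_huge_pages;
	long resv_huge_pages;
};

struct toy_page {
	bool restore_reserve;	/* models the PagePrivate flag */
};

/*
 * Return one page to the toy free pool.
 * subpool_put_ret models the value returned by
 * hugepage_subpool_put_pages() when passed 1: a return of 0 means the
 * page is needed to keep the subpool at its minimum size.
 */
static void toy_free_huge_page(struct toy_hstate *h, struct toy_page *page,
			       long subpool_put_ret)
{
	bool restore = page->restore_reserve;

	assert(subpool_put_ret == 0 || subpool_put_ret == 1);
	page->restore_reserve = false;		/* ClearPagePrivate(page) */
	if (subpool_put_ret == 0)
		restore = true;

	h->free_huge_pages++;
	if (restore)
		h->resv_huge_pages++;
}
```

Freeing a page whose flag is set, or a page that the subpool keeps as a
reserve, increments both counters. Freeing an ordinary instantiated page only
increments ``free_huge_pages``.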

.. _sub_pool_resv:

Subpool Reservations
====================

There is a struct hstate associated with each huge page size. The hstate
tracks all huge pages of the specified size. A subpool represents a subset
of pages within an hstate that is associated with a mounted hugetlbfs
filesystem.

When a hugetlbfs filesystem is mounted a min_size option can be specified
which indicates the minimum number of huge pages required by the filesystem.
If this option is specified, the number of huge pages corresponding to
min_size are reserved for use by the filesystem. This number is tracked in
the min_hpages field of a struct hugepage_subpool. At mount time,
hugetlb_acct_memory(min_hpages) is called to reserve the specified number of
huge pages. If they can not be reserved, the mount fails.

The routines hugepage_subpool_get/put_pages() are called when pages are
obtained from or released back to a subpool. They perform all subpool
accounting, and track any reservations associated with the subpool.
hugepage_subpool_get/put_pages are passed the number of huge pages by which
to adjust the subpool 'used page' count (up for get, down for put). Normally,
they return the same value that was passed or an error if not enough pages
exist in the subpool.

However, if reserves are associated with the subpool a return value less
than the passed value may be returned. This return value indicates the
number of additional global pool adjustments which must be made. For example,
suppose a subpool contains 3 reserved huge pages and someone asks for 5.
The 3 reserved pages associated with the subpool can be used to satisfy part
of the request. But, 2 pages must be obtained from the global pools. To
relay this information to the caller, the value 2 is returned. The caller
is then responsible for attempting to obtain the additional two pages from
the global pools.
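
The example of 3 reserved pages and a request for 5 can be reproduced with a
small user-space model. It covers only the reserve portion of the accounting:
``toy_subpool``, its ``rsv_hpages`` field and ``toy_subpool_get_pages()`` are
names assumed for this sketch, and the maximum size check and the 'used page'
count are left out.

```c
#include <assert.h>

/* Toy subpool: only the count of reserves still held by the subpool. */
struct toy_subpool {
	long rsv_hpages;
};

/*
 * Take delta pages from the subpool, using its reserves first.
 * Returns how many pages the caller must still obtain from the
 * global pools.
 */
static long toy_subpool_get_pages(struct toy_subpool *sp, long delta)
{
	long from_reserves;

	assert(delta >= 0 && sp->rsv_hpages >= 0);
	from_reserves = delta < sp->rsv_hpages ? delta : sp->rsv_hpages;
	sp->rsv_hpages -= from_reserves;
	return delta - from_reserves;
}
```

A subpool holding 3 reserves returns 2 for a request of 5 and is left with no
reserves. The same subpool returns 0 for a request of 2 and keeps 1 reserve.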


COW and Reservations
====================

Since shared mappings all point to and use the same underlying pages, the
biggest reservation concern for COW is private mappings. In this case,
two tasks can be pointing at the same previously allocated page. One task
attempts to write to the page, so a new page must be allocated so that each
task points to its own page.

When the page was originally allocated, the reservation for that page was
consumed. When an attempt to allocate a new page is made as a result of
COW, it is possible that no huge pages are free and the allocation
will fail.

When the private mapping was originally created, the owner of the mapping
was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the reservation
map of the owner. Since the owner created the mapping, the owner owns all
the reservations associated with the mapping. Therefore, when a write fault
occurs and there is no page available, different action is taken for the owner
and non-owner of the reservation.

In the case where the faulting task is not the owner, the fault will fail and
the task will typically receive a SIGBUS.

If the owner is the faulting task, we want it to succeed since it owned the
original reservation. To accomplish this, the page is unmapped from the
non-owning task. In this way, the only reference is from the owning task.
In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer
of the non-owning task. The non-owning task may receive a SIGBUS if it later
faults on a non-present page. But, the original owner of the
mapping/reservation will behave as expected.


.. _resv_map_modifications:

Reservation Map Modifications
=============================

The following low level routines are used to make modifications to a
reservation map. Typically, these routines are not called directly. Rather,
a reservation map helper routine is called which calls one of these low level
routines. These low level routines are fairly well documented in the source
code (mm/hugetlb.c). These routines are::

    long region_chg(struct resv_map *resv, long f, long t);
    long region_add(struct resv_map *resv, long f, long t);
    void region_abort(struct resv_map *resv, long f, long t);
    long region_count(struct resv_map *resv, long f, long t);

Operations on the reservation map typically involve two steps:

1) region_chg() is called to examine the reserve map and determine how
   many pages in the specified range [f, t) are NOT currently represented.

   The calling code performs global checks and allocations to determine if
   there are enough huge pages for the operation to succeed.

2)
  a) If the operation can succeed, region_add() is called to actually modify
     the reservation map for the same range [f, t) previously passed to
     region_chg().
  b) If the operation can not succeed, region_abort is called for the same
     range [f, t) to abort the operation.

Note that this is a two step process where region_add() and region_abort()
are guaranteed to succeed after a prior call to region_chg() for the same
range. region_chg() is responsible for pre-allocating any data structures
necessary to ensure the subsequent operations (specifically region_add())
will succeed.

As mentioned above, region_chg() determines the number of pages in the range
which are NOT currently represented in the map. This number is returned to
the caller. region_add() returns the number of pages in the range added to
the map. In most cases, the return value of region_add() is the same as the
return value of region_chg(). However, in the case of shared mappings it is
possible for changes to the reservation map to be made between the calls to
region_chg() and region_add(). In this case, the return value of region_add()
will not match the return value of region_chg(). It is likely that in such
cases global counts and subpool accounting will be incorrect and in need of
adjustment. It is the responsibility of the caller to check for this condition
and make the appropriate adjustments.
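
The two step protocol, and the mismatch a caller has to look for, can be
illustrated with a toy map. This sketch makes several simplifying assumptions:
the map is a fixed-size array of booleans rather than a list of file_region
structures, there is no locking or pre-allocation, and ``toy_resv_map``,
``toy_region_chg()`` and ``toy_region_add()`` are hypothetical names that only
mimic the return values described above.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAP_PAGES 64

/* Toy reserve map: one boolean per huge page index. */
struct toy_resv_map {
	bool present[TOY_MAP_PAGES];
};

/* Count the pages in [f, t) that are NOT represented in the map. */
static long toy_region_chg(const struct toy_resv_map *m, long f, long t)
{
	long missing = 0;

	assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
	for (long i = f; i < t; i++)
		if (!m->present[i])
			missing++;
	return missing;
}

/* Add [f, t) to the map and return how many pages were newly added. */
static long toy_region_add(struct toy_resv_map *m, long f, long t)
{
	long added = 0;

	assert(0 <= f && f <= t && t <= TOY_MAP_PAGES);
	for (long i = f; i < t; i++) {
		if (!m->present[i]) {
			m->present[i] = true;
			added++;
		}
	}
	return added;
}
```

If another caller adds page 1 between ``toy_region_chg(&m, 0, 4)`` and
``toy_region_add(&m, 0, 4)`` on an empty map, the first call returns 4 and the
second returns 3. The difference of one page is what the caller must
compensate for in the global and subpool counts.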

The routine region_del() is called to remove regions from a reservation map.
It is typically called in the following situations:

- When a file in the hugetlbfs filesystem is being removed, the inode will
  be released and the reservation map freed. Before freeing the reservation
  map, all the individual file_region structures must be freed. In this case
  region_del is passed the range [0, LONG_MAX).
- When a hugetlbfs file is being truncated. In this case, all allocated pages
  after the new file size must be freed. In addition, any file_region entries
  in the reservation map past the new end of file must be deleted. In this
  case, region_del is passed the range [new_end_of_file, LONG_MAX).
- When a hole is being punched in a hugetlbfs file. In this case, huge pages
  are removed from the middle of the file one at a time. As the pages are
  removed, region_del() is called to remove the corresponding entry from the
  reservation map. In this case, region_del is passed the range
  [page_idx, page_idx + 1).

In every case, region_del() will return the number of pages removed from the
reservation map. In VERY rare cases, region_del() can fail. This can only
happen in the hole punch case where it has to split an existing file_region
entry and can not allocate a new structure. In this error case, region_del()
will return -ENOMEM. The problem here is that the reservation map will
indicate that there is a reservation for the page. However, the subpool and
global reservation counts will not reflect the reservation. To handle this
situation, the routine hugetlb_fix_reserve_counts() is called to adjust the
counters so that they correspond with the reservation map entry that could
not be deleted.

region_count() is called when unmapping a private huge page mapping. In
private mappings, the lack of an entry in the reservation map indicates that
a reservation exists. Therefore, by counting the number of entries in the
reservation map we know how many reservations were consumed and how many are
outstanding (outstanding = (end - start) - region_count(resv, start, end)).
Since the mapping is going away, the subpool and global reservation counts
are decremented by the number of outstanding reservations.
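
The outstanding reservation formula can be checked with a short user-space
sketch. As before this is a model, not the kernel routine: ``toy_region``
stands in for struct file_region, the map is a plain array whose regions are
assumed not to overlap each other, and ``toy_region_count()`` and
``toy_outstanding()`` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct file_region: covers page indices [from, to). */
struct toy_region {
	long from;
	long to;
};

/* Count the pages of [f, t) that are covered by entries in the map. */
static long toy_region_count(const struct toy_region *map, size_t n,
			     long f, long t)
{
	long count = 0;

	assert(f <= t);
	for (size_t i = 0; i < n; i++) {
		long lo = map[i].from > f ? map[i].from : f;
		long hi = map[i].to < t ? map[i].to : t;

		if (hi > lo)
			count += hi - lo;
	}
	return count;
}

/* Private mapping: reservations in [start, end) not yet consumed. */
static long toy_outstanding(const struct toy_region *map, size_t n,
			    long start, long end)
{
	return (end - start) - toy_region_count(map, n, start, end);
}
```

For a private mapping of 8 huge pages whose map holds the regions [0, 2) and
[5, 6), three reservations were consumed and five are still outstanding.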

.. _resv_map_helpers:

Reservation Map Helper Routines
===============================

Several helper routines exist to query and modify the reservation maps.
These routines are only interested in reservations for a specific huge
page, so they just pass in an address instead of a range. In addition,
they pass in the associated VMA. From the VMA, the type of mapping (private
or shared) and the location of the reservation map (inode or VMA) can be
determined. These routines simply call the underlying routines described
in the section "Reservation Map Modifications". However, they do take into
account the 'opposite' meaning of reservation map entries for private and
shared mappings and hide this detail from the caller::

    long vma_needs_reservation(struct hstate *h,
                               struct vm_area_struct *vma,
                               unsigned long addr)

This routine calls region_chg() for the specified page. If no reservation
exists, 1 is returned. If a reservation exists, 0 is returned::

    long vma_commit_reservation(struct hstate *h,
                                struct vm_area_struct *vma,
                                unsigned long addr)

This calls region_add() for the specified page. As in the case of region_chg
and region_add, this routine is to be called after a previous call to
vma_needs_reservation. It will add a reservation entry for the page. It
returns 1 if the reservation was added and 0 if not. The return value should
be compared with the return value of the previous call to
vma_needs_reservation. An unexpected difference indicates the reservation
map was modified between calls::

    void vma_end_reservation(struct hstate *h,
                             struct vm_area_struct *vma,
                             unsigned long addr)

This calls region_abort() for the specified page. As in the case of region_chg
and region_abort, this routine is to be called after a previous call to
vma_needs_reservation. It will abort/end the in progress reservation add
operation::

    long vma_add_reservation(struct hstate *h,
                             struct vm_area_struct *vma,
                             unsigned long addr)

This is a special wrapper routine to help facilitate reservation cleanup
on error paths. It is only called from the routine restore_reserve_on_error().
This routine is used in conjunction with vma_needs_reservation in an attempt
to add a reservation to the reservation map. It takes into account the
different reservation map semantics for private and shared mappings. Hence,
region_add is called for shared mappings (as an entry present in the map
indicates a reservation), and region_del is called for private mappings (as
the absence of an entry in the map indicates a reservation). See the section
"Reservation Cleanup in Error Paths" for more information on what needs to
be done on error paths.


Reservation Cleanup in Error Paths
==================================

As mentioned in the section
:ref:`Reservation Map Helper Routines <resv_map_helpers>`, reservation
map modifications are performed in two steps. First vma_needs_reservation
is called before a page is allocated. If the allocation is successful,
then vma_commit_reservation is called. If not, vma_end_reservation is called.
Global and subpool reservation counts are adjusted based on success or failure
of the operation and all is well.

Additionally, after a huge page is instantiated the PagePrivate flag is
cleared so that accounting when the page is ultimately freed is correct.

However, there are several instances where errors are encountered after a huge
page is allocated but before it is instantiated. In this case, the page
allocation has consumed the reservation and made the appropriate subpool,
reservation map and global count adjustments. If the page is freed at this
time (before instantiation and clearing of PagePrivate), then free_huge_page
will increment the global reservation count. However, the reservation map
indicates the reservation was consumed. This resulting inconsistent state
will cause the 'leak' of a reserved huge page. The global reserve count will
be higher than it should and prevent allocation of a pre-allocated page.

The routine restore_reserve_on_error() attempts to handle this situation. It
is fairly well documented. The intention of this routine is to restore
the reservation map to the way it was before the page allocation. In this
way, the state of the reservation map will correspond to the global reservation
count after the page is freed.

The routine restore_reserve_on_error itself may encounter errors while
attempting to restore the reservation map entry. In this case, it will
simply clear the PagePrivate flag of the page. In this way, the global
reserve count will not be incremented when the page is freed. However, the
reservation map will continue to look as though the reservation was consumed.
A page can still be allocated for the address, but it will not use a reserved
page as originally intended.

There is some code (most notably userfaultfd) which can not call
restore_reserve_on_error. In this case, it simply modifies the PagePrivate
flag so that a reservation will not be leaked when the huge page is freed.


Reservations and Memory Policy
==============================
Per-node huge page lists existed in struct hstate when git was first used
to manage Linux code. The concept of reservations was added some time later.
When reservations were added, no attempt was made to take memory policy
into account. While cpusets are not exactly the same as memory policy, this
comment in hugetlb_acct_memory sums up the interaction between reservations
and cpusets/memory policy::

    /*
     * When cpuset is configured, it breaks the strict hugetlb page
     * reservation as the accounting is done on a global variable. Such
     * reservation is completely rubbish in the presence of cpuset because
     * the reservation is not checked against page availability for the
     * current cpuset. Application can still potentially OOM'ed by kernel
     * with lack of free htlb page in cpuset that the task is in.
     * Attempt to enforce strict accounting with cpuset is almost
     * impossible (or too ugly) because cpuset is too fluid that
     * task or memory node can be dynamically moved between cpusets.
     *
     * The change of semantics for shared hugetlb mapping with cpuset is
     * undesirable. However, in order to preserve some of the semantics,
     * we fall back to check against current free page availability as
     * a best attempt and hopefully to minimize the impact of changing
     * semantics that cpuset has.
     */

Huge page reservations were added to prevent unexpected page allocation
failures (OOM) at page fault time. However, if an application makes use
of cpusets or memory policy there is no guarantee that huge pages will be
available on the required nodes. This is true even if there are a sufficient
number of global reservations.

Hugetlbfs regression testing
============================

The most complete set of hugetlb tests is in the libhugetlbfs repository.
If you modify any hugetlb related code, use the libhugetlbfs test suite
to check for regressions. In addition, if you add any new hugetlb
functionality, please add appropriate tests to libhugetlbfs.

--
Mike Kravetz, 7 April 2017