| |
| This document describes the Linux memory management "Unevictable LRU" |
| infrastructure and the use of this infrastructure to manage several types |
| of "unevictable" pages. The document attempts to provide the overall |
| rationale behind this mechanism and the rationale for some of the design |
| decisions that drove the implementation. The latter design rationale is |
| discussed in the context of an implementation description. Admittedly, one |
| can obtain the implementation details--the "what does it do?"--by reading the |
| code. One hopes that the descriptions below add value by providing the answer |
| to "why does it do that?". |
| |
| Unevictable LRU Infrastructure: |
| |
| The Unevictable LRU adds an additional LRU list to track unevictable pages |
| and to hide these pages from vmscan. This mechanism is based on a patch by |
| Larry Woodman of Red Hat to address several scalability problems with page |
| reclaim in Linux. The problems have been observed at customer sites on large |
| memory x86_64 systems. For example, a non-NUMA x86_64 platform with 128GB |
| of main memory will have over 32 million 4k pages in a single zone. When a |
| large fraction of these pages are not evictable for any reason [see below], |
| vmscan will spend a lot of time scanning the LRU lists looking for the small |
| fraction of pages that are evictable. This can result in a situation where |
| all cpus are spending 100% of their time in vmscan for hours or days on end, |
| with the system completely unresponsive. |
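| |
| The page count quoted above follows from simple arithmetic. A quick check, |
| assuming binary units [128GB = 128 * 2^30 bytes, 4k = 4096 bytes]: |

```python
# Count the 4 KiB pages in 128 GiB of main memory.
GIB = 1 << 30
PAGE_SIZE = 4 * 1024           # 4k pages, as in the example above

total_pages = (128 * GIB) // PAGE_SIZE
print(total_pages)             # 33554432, i.e. 2**25
```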
| |
| The Unevictable LRU infrastructure addresses the following classes of |
| unevictable pages: |
| |
| + pages owned by ramfs |
| + pages mapped into SHM_LOCKed shared memory regions |
| + pages mapped into VM_LOCKED [mlock()ed] vmas |
| |
| The infrastructure might be able to handle other conditions that make pages |
| unevictable, either by definition or by circumstance, in the future. |
| |
| |
| The Unevictable LRU List |
| |
| The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list |
| called the "unevictable" list and an associated page flag, PG_unevictable, to |
| indicate that the page is being managed on the unevictable list. The |
| PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active |
| flag in that it indicates on which LRU list a page resides when PG_lru is set. |
| The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU |
| Kconfig option. |
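| |
| The relationship between the three flags can be sketched as a toy user-space |
| model. The bit values and the helper name lru_list_of() are hypothetical and |
| chosen for illustration only; they are not the kernel's definitions. |

```python
# Illustrative bit values; the real flag numbers differ.
PG_LRU = 1 << 0
PG_ACTIVE = 1 << 1
PG_UNEVICTABLE = 1 << 2

def lru_list_of(flags):
    """Name the LRU list a page is on, or None if it is not on any list."""
    if not flags & PG_LRU:
        return None
    if flags & PG_UNEVICTABLE:
        # PG_unevictable and PG_active are mutually exclusive.
        if flags & PG_ACTIVE:
            raise ValueError("PG_active and PG_unevictable both set")
        return "unevictable"
    return "active" if flags & PG_ACTIVE else "inactive"

print(lru_list_of(PG_LRU | PG_UNEVICTABLE))
```

| The sketch ignores the file backed versus swap backed split of the evictable |
| lists, which does not apply to the unevictable list. |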
| |
| The Unevictable LRU infrastructure maintains unevictable pages on an additional |
| LRU list for a few reasons: |
| |
| 1) We get to "treat unevictable pages just like we treat other pages in the |
| system, which means we get to use the same code to manipulate them, the |
| same code to isolate them (for migrate, etc.), the same code to keep track |
| of the statistics, etc..." [Rik van Riel] |
| |
| 2) We want to be able to migrate unevictable pages between nodes--for memory |
| defragmentation, workload management and memory hotplug. The linux kernel |
| can only migrate pages that it can successfully isolate from the lru lists. |
| If we were to maintain pages elsewhere than on an lru-like list, where they |
| can be found by isolate_lru_page(), we would prevent their migration, unless |
| we reworked migration code to find the unevictable pages. |
| |
| |
| The unevictable LRU list does not differentiate between file backed and swap |
| backed [anon] pages. This differentiation is only important while the pages |
| are, in fact, evictable. |
| |
| The unevictable LRU list benefits from the "arrayification" of the per-zone |
| LRU lists and statistics originally proposed and posted by Christoph Lameter. |
| |
| The unevictable list does not use the lru pagevec mechanism. Rather, |
| unevictable pages are placed directly on the page's zone's unevictable |
| list under the zone lru_lock. The reason for this is to prevent stranding |
| of pages on the unevictable list when one task has the page isolated from the |
| lru and other tasks are changing the "evictability" state of the page. |
| |
| |
| Unevictable LRU and Memory Controller Interaction |
| |
| The memory controller data structure automatically gets a per zone unevictable |
| lru list as a result of the "arrayification" of the per-zone LRU lists. The |
| memory controller tracks the movement of pages to and from the unevictable list. |
| When a memory control group comes under memory pressure, the controller will |
| not attempt to reclaim pages on the unevictable list. This has a couple of |
| effects. Because the pages are "hidden" from reclaim on the unevictable list, |
| the reclaim process can be more efficient, dealing only with pages that have |
| a chance of being reclaimed. On the other hand, if too many of the pages |
| charged to the control group are unevictable, the evictable portion of the |
| working set of the tasks in the control group may not fit into the available |
| memory. This can cause the control group to thrash or to oom-kill tasks. |
| |
| |
| Unevictable LRU: Detecting Unevictable Pages |
| |
| The function page_evictable(page, vma) in vmscan.c determines whether a |
| page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, |
| page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's |
| address space using a wrapper function. Wrapper functions are used to set, |
| clear and test the flag to reduce the requirement for #ifdef's throughout the |
| source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. |
| This flag remains for the life of the inode. |
| |
| For shared memory regions, AS_UNEVICTABLE is set when an application |
| successfully SHM_LOCKs the region and is removed when the region is |
| SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page |
| tables for the region as does, for example, mlock(). So, we make no special |
| effort to push any pages in the SHM_LOCKed region to the unevictable list. |
| Vmscan will do this when/if it encounters the pages during reclaim. On |
| SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the |
| unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed |
| region is destroyed, the pages are also "rescued" from the unevictable list in |
| the process of freeing them. |
| |
| page_evictable() detects mlock()ed pages by testing an additional page flag, |
| PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a |
| non-NULL vma is supplied, page_evictable() will check whether the vma is |
| VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and |
| update the appropriate statistics if the vma is VM_LOCKED. This method allows |
| efficient "culling" of pages in the fault path that are being faulted in to |
| VM_LOCKED vmas. |
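| |
| The checks described in this section can be sketched as a toy user-space |
| model. This is not the kernel code: the Page, Mapping and Vma classes and the |
| stats dictionary are hypothetical stand-ins for struct page, the address |
| space, the vma and the zone statistics. |

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mapping:
    unevictable: bool = False     # models AS_UNEVICTABLE

@dataclass
class Vma:
    locked: bool = False          # models VM_LOCKED

@dataclass
class Page:
    mapping: Optional[Mapping] = None
    mlocked: bool = False         # models PG_mlocked

stats = {"nr_mlock": 0}           # stand-in for the zone statistics

def is_mlocked_vma(vma, page):
    """If the vma is locked, mark the page mlocked and count it once."""
    if not vma.locked:
        return False
    if not page.mlocked:
        page.mlocked = True
        stats["nr_mlock"] += 1
    return True

def page_evictable(page, vma=None):
    if page.mapping is not None and page.mapping.unevictable:
        return False              # ramfs or SHM_LOCKed region
    if page.mlocked:
        return False
    if vma is not None and is_mlocked_vma(vma, page):
        return False              # culled, e.g. in the fault path
    return True
```

| Passing no vma models the callers that have none at hand; in that case only |
| the address space flag and PG_mlocked are consulted. |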
| |
| |
| Unevictable Pages and Vmscan [shrink_*_list()] |
| |
| If unevictable pages are culled in the fault path, or moved to the unevictable |
| list at mlock() or mmap() time, vmscan will never encounter the pages until |
| they have become evictable again, for example, via munlock() and have been |
| "rescued" from the unevictable list. However, there may be situations where we |
| decide, for the sake of expediency, to leave an unevictable page on one of the |
| regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for |
| such pages in all of the shrink_{active|inactive|page}_list() functions and |
| will "cull" such pages that it encounters--that is, it diverts those pages to |
| the unevictable list for the zone being scanned. |
| |
| There may be situations where a page is mapped into a VM_LOCKED vma, but the |
| page is not marked as PageMlocked. Such pages will make it all the way to |
| shrink_page_list() where they will be detected when vmscan walks the reverse |
| map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() |
| will cull the page at that point. |
| |
| To "cull" an unevictable page, vmscan simply puts the page back on the lru |
| list using putback_lru_page()--the inverse operation to isolate_lru_page()-- |
| after dropping the page lock. Because the condition which makes the page |
| unevictable may change once the page is unlocked, putback_lru_page() will |
| recheck the unevictable state of a page that it places on the unevictable lru |
| list. If the page has become evictable, putback_lru_page() removes it from |
| the list and retries, including the page_evictable() test. Because such a |
| race is a rare event and movement of pages onto the unevictable list should be |
| rare, these extra evictability checks should not occur in the majority of calls |
| to putback_lru_page(). |
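| |
| The recheck-and-retry logic can be sketched as follows. This is a toy model |
| under stated assumptions: the lists are plain Python lists, evictable pages |
| always go to a list named "inactive", and the is_evictable callback is a |
| hypothetical stand-in for page_evictable() whose answer may change between |
| calls to simulate a race. |

```python
def putback_lru_page(page, lists, is_evictable):
    """Put an isolated page back on an LRU list; return the list name used."""
    while True:
        if is_evictable(page):
            lists["inactive"].append(page)
            return "inactive"
        lists["unevictable"].append(page)
        # The state may have changed while the page was off the LRU, so
        # recheck now that the page is visible on the unevictable list.
        if not is_evictable(page):
            return "unevictable"
        # The page became evictable in the meantime: undo and retry.
        lists["unevictable"].remove(page)

lists = {"inactive": [], "unevictable": []}
print(putback_lru_page("page A", lists, lambda page: False))
```

| In the common case the second check agrees with the first and the loop body |
| runs only once. |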
| |
| |
| Mlocked Page: Prior Work |
| |
| The "Unevictable Mlocked Pages" infrastructure is based on work originally |
| posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". |
| Nick posted his patch as an alternative to a patch posted by Christoph |
| Lameter to achieve the same objective--hiding mlocked pages from vmscan. |
| In Nick's patch, he used one of the struct page lru list link fields as a count |
| of VM_LOCKED vmas that map the page. This use of the link field for a count |
| prevented the management of the pages on an LRU list. Thus, mlocked pages were |
| not migratable as isolate_lru_page() could not find them and the lru list link |
| field was not available to the migration subsystem. Nick resolved this by |
| putting mlocked pages back on the lru list before attempting to isolate them, |
| thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated |
| with the Unevictable LRU work, the count was replaced by walking the reverse |
| map to determine whether any VM_LOCKED vmas mapped the page. More on this |
| below. |
| |
| |
| Mlocked Pages: Basic Management |
| |
| Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of |
| unevictable pages. When such a page has been "noticed" by the memory |
| management subsystem, the page is marked with the PG_mlocked [PageMlocked()] |
| flag. A PageMlocked() page will be placed on the unevictable LRU list when |
| it is added to the LRU. Pages can be "noticed" by memory management in |
| several places: |
| |
| 1) in the mlock()/mlockall() system call handlers. |
| 2) in the mmap() system call handler when mmap()ing a region with the |
| MAP_LOCKED flag, or mmap()ing a region in a task that has called |
| mlockall() with the MCL_FUTURE flag. Both of these conditions result |
| in the VM_LOCKED flag being set for the vma. |
| 3) in the fault path, if mlocked pages are "culled" in the fault path, |
| and when a VM_LOCKED stack segment is expanded. |
| 4) as mentioned above, in vmscan:shrink_page_list() when attempting to |
| reclaim a page in a VM_LOCKED vma via try_to_unmap(). |
| |
| Mlocked pages become unlocked and are rescued from the unevictable list when: |
| |
| 1) mapped in a range unlocked via the munlock()/munlockall() system calls. |
| 2) munmap()ed out of the last VM_LOCKED vma that maps the page, including |
| unmapping at task exit. |
| 3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. |
| 4) before a page is COWed in a VM_LOCKED vma. |
| |
| |
| Mlocked Pages: mlock()/mlockall() System Call Handling |
| |
| Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() |
| for each vma in the range specified by the call. In the case of mlockall(), |
| this is the entire active address space of the task. Note that mlock_fixup() |
| is used for both mlock()ing and munlock()ing a range of memory. A call to |
| mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED |
| is treated as a no-op--mlock_fixup() simply returns. |
| |
| If the vma passes the filtering described in "Mlocked Pages: Filtering Special |
| Vmas" below, mlock_fixup() will attempt to merge the vma with its neighbors or |
| split off a subset of the vma if the range does not cover the entire vma. Once the |
| vma has been merged or split or neither, mlock_fixup() will call |
| __mlock_vma_pages_range() to fault in the pages via get_user_pages() and |
| to mark the pages as mlocked via mlock_vma_page(). |
| |
| Note that the vma being mlocked might be mapped with PROT_NONE. In this case, |
| get_user_pages() will be unable to fault in the pages. That's OK. If pages |
| do end up getting faulted into this VM_LOCKED vma, we'll handle them in the |
| fault path or in vmscan. |
| |
| Also note that a page returned by get_user_pages() could be truncated or |
| migrated out from under us, while we're trying to mlock it. To detect |
| this, __mlock_vma_pages_range() tests the page_mapping after acquiring |
| the page lock. If the page is still associated with its mapping, we'll |
| go ahead and call mlock_vma_page(). If the mapping is gone, we just |
| unlock the page and move on. Worst case, this results in a page mapped |
| in a VM_LOCKED vma remaining on a normal LRU list without being |
| PageMlocked(). Again, vmscan will detect and cull such pages. |
| |
| mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will |
| TestSetPageMlocked() for each page returned by get_user_pages(). We use |
| TestSetPageMlocked() because the page might already be mlocked by another |
| task/vma and we don't want to do extra work. We especially do not want to |
| count an mlocked page more than once in the statistics. If the page was |
| already mlocked, mlock_vma_page() is done. |
| |
| If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the |
| page from the LRU, as it is likely on the appropriate active or inactive list |
| at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will |
| putback the page--putback_lru_page()--which will notice that the page is now |
| mlocked and divert the page to the zone's unevictable LRU list. If |
| mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle |
| it later if/when it attempts to reclaim the page. |
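| |
| The flag handling in mlock_vma_page() can be sketched as a toy model. The |
| Page class, the stats dictionary, the helper page_test_set_mlocked() and the |
| can_isolate parameter are hypothetical; the isolate and putback steps are |
| collapsed into a single list assignment. |

```python
class Page:
    def __init__(self, lru="inactive"):
        self.mlocked = False        # models PG_mlocked
        self.lru = lru              # name of the LRU list holding the page

stats = {"nr_mlock": 0}             # stand-in for the zone statistics

def page_test_set_mlocked(page):
    """Set the flag and return its previous value, like TestSetPageMlocked()."""
    was_mlocked = page.mlocked
    page.mlocked = True
    return was_mlocked

def mlock_vma_page(page, can_isolate=True):
    if page_test_set_mlocked(page):
        return                      # already mlocked: no work, no double count
    stats["nr_mlock"] += 1
    if can_isolate:
        # isolate_lru_page() followed by putback_lru_page(): the putback
        # notices PG_mlocked and diverts the page to the unevictable list.
        page.lru = "unevictable"
    # Otherwise the page stays where it is and vmscan culls it later.

page = Page()
mlock_vma_page(page)
mlock_vma_page(page)                # second call is a no-op
print(page.lru, stats["nr_mlock"])
```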
| |
| |
| Mlocked Pages: Filtering Special Vmas |
| |
| mlock_fixup() filters several classes of "special" vmas: |
| |
| 1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind |
| these mappings are inherently pinned, so we don't need to mark them as |
| mlocked. In any case, most of the pages have no struct page in which to |
| so mark the page. Because of this, get_user_pages() will fail for these |
| vmas, so there is no sense in attempting to visit them. |
| |
| 2) vmas mapping hugetlbfs pages are already effectively pinned into memory. |
| We neither need nor want to mlock() these pages. However, to preserve the |
| prior behavior of mlock()--before the unevictable/mlock changes-- |
| mlock_fixup() will call make_pages_present() in the hugetlbfs vma range |
| to allocate the huge pages and populate the ptes. |
| |
| 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of |
| kernel pages, such as the vdso page, relay channel pages, etc. These pages |
| are inherently unevictable and are not managed on the LRU lists. |
| mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls |
| make_pages_present() to populate the ptes. |
| |
| Note that for all of these special vmas, mlock_fixup() does not set the |
| VM_LOCKED flag. Therefore, we won't have to deal with them later during |
| munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() |
| account these vmas against the task's "locked_vm". |
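| |
| The three filter classes can be summarized in a small decision function. |
| This is a sketch only: the bit values are made up, VM_HUGETLB is assumed here |
| as the marker for hugetlbfs vmas, and the action names "skip", "populate" and |
| "mlock" are labels invented for this example. |

```python
# Illustrative bit values; the real flag numbers differ.
VM_IO = 1 << 0
VM_PFNMAP = 1 << 1
VM_HUGETLB = 1 << 2        # assumed marker for hugetlbfs vmas
VM_DONTEXPAND = 1 << 3
VM_RESERVED = 1 << 4

def mlock_filter(vm_flags):
    """Return the action taken for a vma with the given flags."""
    if vm_flags & (VM_IO | VM_PFNMAP):
        return "skip"          # inherently pinned, often no struct page
    if vm_flags & (VM_HUGETLB | VM_DONTEXPAND | VM_RESERVED):
        return "populate"      # make_pages_present() only
    return "mlock"             # normal vma: set VM_LOCKED, mlock the pages

for flags in (VM_IO, VM_HUGETLB, VM_RESERVED, 0):
    print(flags, mlock_filter(flags))
```

| Only the "mlock" case sets VM_LOCKED and is charged to locked_vm. |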
| |
| Mlocked Pages: Downgrading the Mmap Semaphore |
| |
| mlock_fixup() must be called with the mmap semaphore held for write, because |
| it may have to merge or split vmas. However, mlocking a large region of |
| memory can take a long time--especially if vmscan must reclaim pages to |
| satisfy the region's requirements. Faulting in a large region with the mmap |
| semaphore held for write can hold off other faults on the address space, in |
| the case of a multi-threaded task. It can also hold off scans of the task's |
| address space via /proc. While testing under heavy load, it was observed that |
| the ps(1) command could be held off for many minutes while a large segment was |
| mlock()ed down. |
| |
| To address this issue, and to make the system more responsive during mlock()ing |
| of large segments, mlock_fixup() downgrades the mmap semaphore to read mode |
| during the call to __mlock_vma_pages_range(). This works fine. However, the |
| callers of mlock_fixup() expect the semaphore to be returned in write mode. |
| So, mlock_fixup() "upgrades" the semaphore to write mode. Linux does not |
| support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore |
| and reacquire it in write mode. In a multi-threaded task, it is possible for |
| the task memory map to change while the semaphore is dropped. Therefore, |
| mlock_fixup() looks up the vma at the range start address after reacquiring |
| the semaphore in write mode and verifies that it still covers the original |
| range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of |
| mlock_fixup() have been changed to deal with this new error condition. |
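| |
| The revalidation step after reacquiring the semaphore can be sketched as |
| below. This is a toy model: the address space is a sorted list of |
| (start, end) tuples, find_vma() is a simplified linear search, and |
| revalidate_range() is a hypothetical name for the check described above. |

```python
import errno

EAGAIN = errno.EAGAIN

def find_vma(vmas, addr):
    """Return the first vma whose end lies above addr, or None."""
    for start, end in vmas:          # vmas are sorted by start address
        if addr < end:
            return (start, end)
    return None

def revalidate_range(vmas, start, end):
    """Return 0 if one vma still covers [start, end), else -EAGAIN."""
    vma = find_vma(vmas, start)
    if vma is None or not (vma[0] <= start and end <= vma[1]):
        return -EAGAIN
    return 0

# The map is unchanged in the first call; in the second, another thread
# split the vma while the semaphore was dropped.
print(revalidate_range([(0x1000, 0x5000)], 0x2000, 0x4000))
print(revalidate_range([(0x1000, 0x3000), (0x3000, 0x5000)], 0x2000, 0x4000))
```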
| |
| Note: when munlocking a region, all of the pages should already be resident-- |
| unless we have racing threads mlocking() and munlocking() regions. So, |
| unlocking should not have to wait for page allocations nor faults of any kind. |
| Therefore mlock_fixup() does not downgrade the semaphore for munlock(). |
| |
| |
| Mlocked Pages: munlock()/munlockall() System Call Handling |
| |
| The munlock() and munlockall() system calls are handled by the same functions-- |
| do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock |
| vs lock operation indicated by an argument. So, these system calls are also |
| handled by mlock_fixup(). Again, if called for an already munlock()ed vma, |
| mlock_fixup() simply returns. Because of the vma filtering discussed above, |
| VM_LOCKED will not be set in any "special" vmas. So, these vmas will be |
| ignored for munlock. |
| |
| If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off |
| the specified range. The range is then munlocked via the function |
| __mlock_vma_pages_range()--the same function used to mlock a vma range-- |
| passing a flag to indicate that munlock() is being performed. |
| |
| Because the vma access protections could have been changed to PROT_NONE after |
| faulting in and mlocking pages, get_user_pages() was unreliable for visiting |
| these pages for munlocking. Because we don't want to leave pages mlocked, |
| get_user_pages() was enhanced to accept a flag to ignore the permissions when |
| fetching the pages--all of which should be resident as a result of previous |
| mlock()ing. |
| |
| For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling |
| munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked |
| flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() |
| uses the Test*PageMlocked() function to handle the case where the page might |
| have already been unlocked by another task. If the page was mlocked, |
| munlock_vma_page() updates the zone statistics for the number of mlocked |
| pages. Note, however, that at this point we haven't checked whether the page |
| is mapped by other VM_LOCKED vmas. |
| |
| We can't call try_to_munlock(), the function that walks the reverse map to check |
| for other VM_LOCKED vmas, without first isolating the page from the LRU. |
| try_to_munlock() is a variant of try_to_unmap() and thus requires that the page |
| not be on an lru list. [More on these below.] However, the call to |
| isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). |
| So, we go ahead and clear PG_mlocked up front, as this might be the only chance |
| we have. If we can successfully isolate the page, we go ahead and |
| try_to_munlock(), which will restore the PG_mlocked flag and update the zone |
| page statistics if it finds another vma holding the page mlocked. If we fail |
| to isolate the page, we'll have left a potentially mlocked page on the LRU. |
| This is fine, because we'll catch it later when/if vmscan tries to reclaim the |
| page. This should be relatively rare. |
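| |
| The "clear first, then recheck" ordering can be sketched as a toy model. The |
| page is a plain dictionary, and the can_isolate and other_locked_vma |
| parameters are hypothetical stand-ins for the outcome of isolate_lru_page() |
| and of the reverse map walk in try_to_munlock(). |

```python
def munlock_vma_page(page, stats, can_isolate, other_locked_vma):
    if not page["mlocked"]:
        return                       # already unlocked by another task
    page["mlocked"] = False          # TestClearPageMlocked(), done up front
    stats["nr_mlock"] -= 1
    if not can_isolate:
        return                       # leave the page for vmscan to sort out
    if other_locked_vma:
        # try_to_munlock() found another VM_LOCKED vma: restore the flag.
        page["mlocked"] = True
        stats["nr_mlock"] += 1
    # Putback; "inactive" stands in for whichever normal list applies.
    page["lru"] = "unevictable" if page["mlocked"] else "inactive"

stats = {"nr_mlock": 1}
page = {"mlocked": True, "lru": "unevictable"}
munlock_vma_page(page, stats, can_isolate=True, other_locked_vma=False)
print(page, stats)
```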
| |
| Mlocked Pages: Migrating Them... |
| |
| A page that is being migrated has been isolated from the lru lists and is |
| held locked across unmapping of the page, updating the page's mapping |
| [address_space] entry and copying the contents and state, until the |
| page table entry has been replaced with an entry that refers to the new |
| page. Linux supports migration of mlocked pages and other unevictable |
| pages. This involves simply moving the PageMlocked and PageUnevictable states |
| from the old page to the new page. |
| |
| Note that page migration can race with mlocking or munlocking of the same |
| page. This has been discussed from the mlock/munlock perspective in the |
| respective sections above. Both processes [migration, m[un]locking], hold |
| the page locked. This provides the first level of synchronization. Page |
| migration zeros out the page_mapping of the old page before unlocking it, |
| so m[un]lock can skip these pages by testing the page mapping under page |
| lock. |
| |
| When completing page migration, we place the new and old pages back onto the |
| lru after dropping the page lock. The "unneeded" page--old page on success, |
| new page on failure--will be freed when the reference count held by the |
| migration process is released. To ensure that we don't strand pages on the |
| unevictable list because of a race between munlock and migration, page |
| migration uses the putback_lru_page() function to add migrated pages back to |
| the lru. |
| |
| |
| Mlocked Pages: mmap(MAP_LOCKED) System Call Handling |
| |
| In addition to the mlock()/mlockall() system calls, an application can request |
| that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() |
| call. Furthermore, any mmap() call or brk() call that expands the heap by a |
| task that has previously called mlockall() with the MCL_FUTURE flag will result |
| in the newly mapped memory being mlocked. Before the unevictable/mlock changes, |
| the kernel simply called make_pages_present() to allocate pages and populate |
| the page table. |
| |
| To mlock a range of memory under the unevictable/mlock infrastructure, the |
| mmap() handler and task address space expansion functions call |
| mlock_vma_pages_range() specifying the vma and the address range to mlock. |
| mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in |
| "Mlocked Pages: Filtering Special Vmas". It will clear the VM_LOCKED flag, which |
| will have already been set by the caller, in filtered vmas. Thus these vmas need |
| not be visited for munlock when the region is unmapped. |
| |
| For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to |
| fault/allocate the pages and mlock them. Again, like mlock_fixup(), |
| mlock_vma_pages_range() downgrades the mmap semaphore to read mode before |
| attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore |
| back to write mode before returning. |
| |
| The callers of mlock_vma_pages_range() will have already added the memory |
| range to be mlocked to the task's "locked_vm". To account for filtered vmas, |
| mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the |
| callers then subtract a non-negative return value from the task's locked_vm. |
| A negative return value represents an error--for example, from get_user_pages() |
| attempting to fault in a vma with PROT_NONE access. In this case, we leave |
| the memory range accounted as locked_vm, as the protections could be changed |
| later and pages allocated into that region. |
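| |
| A minimal sketch of this accounting, using a hypothetical helper name and |
| plain integers in place of the mm_struct fields: |

```python
def account_locked_vm(locked_vm, nr_pages, ret):
    """Model the caller's locked_vm bookkeeping around the mlock call.

    ret is the value returned by mlock_vma_pages_range(): the number of
    pages NOT mlocked, or a negative error code.
    """
    locked_vm += nr_pages            # charged before the call
    if ret >= 0:
        locked_vm -= ret             # give back what was filtered out
    return locked_vm                 # on error the charge is left in place

print(account_locked_vm(100, 16, 0))     # normal vma, everything mlocked
print(account_locked_vm(100, 16, 16))    # filtered vma, nothing mlocked
print(account_locked_vm(100, 16, -14))   # error, charge is kept
```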
| |
| |
| Mlocked Pages: munmap()/exit()/exec() System Call Handling |
| |
| When unmapping an mlocked region of memory, whether by an explicit call to |
| munmap() or via an internal unmap from exit() or exec() processing, we must |
| munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. |
| Before the unevictable/mlock changes, mlocking did not mark the pages in any |
| way, so unmapping them required no processing. |
| |
| To munlock a range of memory under the unevictable/mlock infrastructure, the |
| munmap() handler and task address space tear down functions call |
| munlock_vma_pages_all(). The name reflects the observation that one always |
| specifies the entire vma range when munlock()ing during unmap of a region. |
| Because of the vma filtering when mlocking() regions, only "normal" vmas that |
| actually contain mlocked pages will be passed to munlock_vma_pages_all(). |
| |
| munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() |
| for the munlock case, calls __munlock_vma_pages_range() to walk the page table |
| for the vma's memory range and munlock_vma_page() each resident page mapped by |
| the vma. This effectively munlocks the page, only if this is the last |
| VM_LOCKED vma that maps the page. |
| |
| |
| Mlocked Page: try_to_unmap() |
| |
| [Note: the code changes represented by this section are really quite small |
| compared to the text to describe what is happening and why, and to discuss the |
| implications.] |
| |
| Pages can, of course, be mapped into multiple vmas. Some of these vmas may |
| have the VM_LOCKED flag set. It is possible for a page mapped into one or more |
| VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one |
| of the active or inactive LRU lists. This could happen if, for example, a |
| task in the process of munlock()ing the page could not isolate the page from |
| the LRU. As a result, vmscan/shrink_page_list() might encounter such a page |
| as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To |
| handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED |
| vmas while it is walking a page's reverse map. |
| |
| try_to_unmap() is always called, by either vmscan for reclaim or for page |
| migration, with the argument page locked and isolated from the LRU. BUG_ON() |
| assertions enforce this requirement. Separate functions handle anonymous and |
| mapped file pages, as these types of pages have different reverse map |
| mechanisms. |
| |
| try_to_unmap_anon() |
| |
| To unmap anonymous pages, each vma in the list anchored in the anon_vma must be |
| visited--at least until a VM_LOCKED vma is encountered. If the page is being |
| unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked |
| pages are migratable. However, for reclaim, if the page is mapped into a |
| VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap |
| semaphore of the mm_struct to which the vma belongs in read mode. If this is |
| successful, try_to_unmap() will mlock the page via mlock_vma_page()--we |
| wouldn't have gotten to try_to_unmap() if the page were already mlocked--and |
| will return SWAP_MLOCK, indicating that the page is unevictable. If the |
| mmap semaphore cannot be acquired, we are not sure whether the page is really |
| unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. |
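| |
| The scan over the anon_vma list can be sketched as a toy model. The vmas |
| and the page are plain dictionaries, the key names are invented for this |
| example, and the return codes are represented by their names rather than |
| their numeric values. The sketch also assumes that every unlocked vma is |
| unmapped successfully. |

```python
def try_to_unmap_anon(page, vmas, migration=False):
    for vma in vmas:
        if vma["locked"] and not migration:
            # Reclaim stops at the first VM_LOCKED vma.
            if not vma["mmap_sem_available"]:
                return "SWAP_AGAIN"      # read-trylock failed, state unknown
            page["mlocked"] = True       # mlock_vma_page()
            return "SWAP_MLOCK"
        vma["mapped"] = False            # unmap the pte in this vma
    return "SWAP_SUCCESS"

page = {"mlocked": False}
vmas = [
    {"locked": False, "mmap_sem_available": True, "mapped": True},
    {"locked": True, "mmap_sem_available": True, "mapped": True},
]
print(try_to_unmap_anon(page, vmas))
```

| With migration=True the locked vma does not stop the scan, reflecting that |
| mlocked pages are migratable. |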
| |
| try_to_unmap_file() -- linear mappings |
| |
| Unmapping of a mapped file page works the same, except that the scan visits |
| all vmas that map the page's index/page offset in the page's mapping's |
| reverse map priority search tree. It must also visit each vma in the page's |
| mapping's non-linear list, if the list is non-empty. As for anonymous pages, |
| on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will |
| attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, |
| returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. |
| |
| try_to_unmap_file() -- non-linear mappings |
| |
| If a page's mapping contains a non-empty non-linear mapping vma list, then |
| try_to_un{map|lock}() must also visit each vma in that list to determine |
| whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit |
| all vmas in the non-linear list to ensure that the page is not/should not be |
| mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. |
| However, there is no easy way to determine whether the page is actually mapped |
| in a given vma--either for unmapping or testing whether the VM_LOCKED vma |
| actually pins the page. |
| |
| So, try_to_unmap_file() handles non-linear mappings by scanning a certain |
| number of pages--a "cluster"--in each non-linear vma associated with the page's |
| mapping, for each file mapped page that vmscan tries to unmap. If this happens |
| to unmap the page we're trying to unmap, try_to_unmap() will notice this on |
| return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it |
| will return SWAP_AGAIN, causing vmscan to recirculate this page. We take |
| advantage of the cluster scan in try_to_unmap_cluster() as follows: |
| |
| For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap |
| semaphore of the associated mm_struct for read without blocking. If this |
| attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will |
| retain the mmap semaphore for the scan; otherwise it drops it here. Then, |
| for each page in the cluster, if we're holding the mmap semaphore for a locked |
| vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This |
| call is a no-op if the page is already locked, but will mlock any pages in |
| the non-linear mapping that happen to be unlocked. If one of the pages so |
| mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will |
| return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan |
| to cull the page, rather than recirculating it on the inactive list. Again, |
| if try_to_unmap_cluster() cannot acquire the vma's mmap semaphore, it returns |
| SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but |
| couldn't be mlocked. |
| |
| |
| Mlocked pages: try_to_munlock() Reverse Map Scan |
| |
| TODO/FIXME: a better name might be page_mlocked()--analogous to the |
| page_referenced() reverse map walker. |
| |
| When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() |
| System Call Handling" above--tries to munlock a page, it needs to |
| determine whether or not the page is mapped by any VM_LOCKED vma, without |
| actually attempting to unmap all ptes from the page. For this purpose, the |
| unevictable/mlock infrastructure introduced a variant of try_to_unmap() called |
| try_to_munlock(). |
| |
| try_to_munlock() calls the same functions as try_to_unmap() for anonymous and |
| mapped file pages with an additional argument specifying unlock versus unmap |
| processing. Again, these functions walk the respective reverse maps looking |
| for VM_LOCKED vmas. When such a vma is found for anonymous pages and file |
| pages mapped in linear VMAs, as in the try_to_unmap() case, the functions |
| attempt to acquire the associated mmap semaphore, mlock the page via |
| mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the |
| pre-clearing of the page's PG_mlocked done by munlock_vma_page(). |
| |
| If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap |
| semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() |
| to recycle the page on the inactive list and hope that it has better luck |
| with the page next time. |
| |
| For file pages mapped into non-linear vmas, the try_to_munlock() logic works |
| slightly differently. On encountering a VM_LOCKED non-linear vma that might |
| map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking |
| the page. munlock_vma_page() will just leave the page unlocked and let |
| vmscan deal with it--the usual fallback position. |
| |
| Note that try_to_munlock()'s reverse map walk must visit every vma in a page's |
| reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. |
| However, the scan can terminate when it encounters a VM_LOCKED vma and can |
| successfully acquire the vma's mmap semaphore for read and mlock the page. |
| Although try_to_munlock() can be called many [very many!] times when |
| munlock()ing a large region or tearing down a large address space that has been |
| mlocked via mlockall(), overall this is a fairly rare event. |
| |
| Mlocked Page: Page Reclaim in shrink_*_list() |
| |
| shrink_active_list() culls any obviously unevictable pages--i.e., |
| !page_evictable(page, NULL)--diverting these to the unevictable lru |
| list. However, shrink_active_list() only sees unevictable pages that |
| made it onto the active/inactive lru lists. Note that these pages do not |
| have PageUnevictable set--otherwise, they would be on the unevictable list and |
| shrink_active_list() would never see them. |
| |
| Some examples of these unevictable pages on the LRU lists are: |
| |
| 1) ramfs pages that have been placed on the lru lists when first allocated. |
| |
| 2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to |
| allocate or fault in the pages in the shared memory region. This happens |
| when an application accesses the page the first time after SHM_LOCKing |
| the segment. |
| |
| 3) Mlocked pages that could not be isolated from the lru and moved to the |
| unevictable list in mlock_vma_page(). |
| |
| 4) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't |
| acquire the vma's mmap semaphore to test the flags and set PageMlocked. |
| munlock_vma_page() was forced to let the page back on to the normal |
| LRU list for vmscan to handle. |
| |
| shrink_inactive_list() also culls any unevictable pages that it finds on |
| the inactive lists, again diverting them to the appropriate zone's unevictable |
| lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became |
| SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or |
| pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from |
| the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice |
| the latter, but will pass them on to shrink_page_list(). |
| |
| shrink_page_list() again culls obviously unevictable pages that it could |
| encounter for reasons similar to those in shrink_inactive_list(). Pages mapped into |
| VM_LOCKED vmas but without PG_mlocked set will make it all the way to |
| try_to_unmap(). shrink_page_list() will divert them to the unevictable list |
| when try_to_unmap() returns SWAP_MLOCK, as discussed above. |