Started Oct 1999 by Kanoj Sarcar <kanojsarcar@yahoo.com>

The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization is done
in the Linux vm code.

page_table_lock & mmap_sem
--------------------------------------

Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, an mm_count increment and a matching mmdrop are done
in swap_out(). Page stealers hold kernel_lock to protect against a bunch
of races. The vma list of the victim mm is also scanned by the stealer,
and the page_table_lock is used to preserve list sanity against the
process adding to or deleting from the list. This also guarantees
existence of the vma. Vma existence is not guaranteed once
try_to_swap_out() drops the page_table_lock. To guarantee the existence
of the underlying file structure, a get_file is done before the swapout()
method is invoked. The page passed into swapout() is guaranteed not to be
reused for a different purpose because the page reference count due to
being present in the user's pte is not released till after swapout()
returns.
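
The pinning of the victim mm described above can be sketched in user
space roughly as follows. This is only a model, assuming C11 atomics in
place of the kernel's atomic_t; the names mm_sketch, mm_sketch_grab and
mm_sketch_drop are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mm_struct; only the count is modelled. */
struct mm_sketch {
    atomic_int mm_count;   /* references keeping the structure itself alive */
    bool freed;            /* set instead of freeing, so the state can be inspected */
};

static void mm_sketch_init(struct mm_sketch *mm)
{
    atomic_init(&mm->mm_count, 1);   /* the owner's reference */
    mm->freed = false;
}

/* Models the mm_count increment done before the victim mm is scanned. */
static void mm_sketch_grab(struct mm_sketch *mm)
{
    assert(!mm->freed);              /* pinning a dead mm would be a bug */
    atomic_fetch_add(&mm->mm_count, 1);
}

/* Models mmdrop(): whoever drops the last reference tears the mm down. */
static void mm_sketch_drop(struct mm_sketch *mm)
{
    if (atomic_fetch_sub(&mm->mm_count, 1) == 1)
        mm->freed = true;
}
```

In this model, a stealer that calls mm_sketch_grab() before scanning can
still use the mm after the owner has dropped its own reference; the mm is
only marked freed when the stealer calls mm_sketch_drop() in turn.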

Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list* must prevent
kswapd from looking at the chain.

The rules are:
1. To scan the vmlist (look but don't touch) you must hold the
   mmap_sem with read bias, i.e. down_read(&mm->mmap_sem)
2. To modify the vmlist you need to hold the mmap_sem with
   read&write bias, i.e. down_write(&mm->mmap_sem) *AND*
   you need to take the page_table_lock.
3. The swapper takes _just_ the page_table_lock; this is done
   because the mmap_sem can be an extremely long lived lock
   and the swapper just cannot sleep on that.
4. The exception to rule 2 is expand_stack, which just
   takes the read lock and the page_table_lock; this is ok
   because it doesn't really modify fields anybody relies on.
5. You must be able to guarantee that while holding mmap_sem
   or page_table_lock of mm A, you will not try to get either lock
   for mm B.

The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
   The update of mmap_cache is racy (page stealer can race with other code
   that invokes find_vma with mmap_sem held), but that is okay, since it
   is a hint. This can be fixed, if desired, by having find_vma grab the
   page_table_lock.
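
Why a stale hint is tolerable can be seen in a simplified, single-threaded
sketch of the lookup. The structures and find_vma_sketch() below are
hypothetical and model only the logic: the hint is validated against the
address before it is trusted, so a wrong hint costs a list walk, not a
wrong answer. The sketch does not model the concurrent update itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for vm_area_struct and mm_struct. */
struct vma_sketch {
    unsigned long vm_start, vm_end;
    struct vma_sketch *vm_next;       /* list is sorted by address */
};

struct mm_sketch {
    struct vma_sketch *mmap;          /* head of the vmlist */
    struct vma_sketch *mmap_cache;    /* last lookup result: a hint only */
};

/* Return the first vma with addr < vm_end, or NULL if there is none. */
static struct vma_sketch *find_vma_sketch(struct mm_sketch *mm,
                                          unsigned long addr)
{
    struct vma_sketch *vma = mm->mmap_cache;

    /* Trust the hint only if it really covers addr. */
    if (vma && vma->vm_start <= addr && addr < vma->vm_end)
        return vma;

    /* Hint missing or stale: fall back to walking the list. */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
        if (addr < vma->vm_end)
            break;

    if (vma)
        mm->mmap_cache = vma;         /* the update the text calls racy */
    return vma;
}
```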

Code that adds/deletes elements from the vmlist chain:
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove

Code that changes vm_start/vm_end/vm_flags:VM_LOCKED of vma's on
the list:
1. expand_stack
2. mprotect
3. mlock
4. mremap

It is advisable that changes to vm_start/vm_end be protected, although
in some cases it is not really needed. E.g., vm_start is modified by
expand_stack(), and it is hard to come up with a destructive scenario
even without the vmlist protection in this case.

The page_table_lock nests with the inode i_mmap_lock and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
held.

The page_table_lock is grabbed while holding the kernel_lock spinning monitor.

The page_table_lock is a spin lock.

Note: the page_table_lock (PTL) can also be used to guarantee that no new
clones using the mm start up ... this is a loose form of stability on
mm_users. For example, it is used in copy_mm to protect against a racing
tlb_gather_mmu single address space optimization, so that the
zap_page_range (from vmtruncate) does not miss sending IPIs to cloned
threads that might be spawned underneath it and go to user mode to drag
pte's into tlbs.

swap_list_lock/swap_device_lock
-------------------------------
The swap devices are chained in priority order from the "swap_list" header.
The "swap_list" is used for the round-robin swaphandle allocation strategy.
The number of free swaphandles is maintained in "nr_swap_pages". These two
together are protected by the swap_list_lock.

The swap_device_lock, which is per swap device, protects the reference
counts on the corresponding swaphandles, maintained in the "swap_map"
array, and the "highest_bit" and "lowest_bit" fields.

Both of these are spinlocks, and are never acquired from interrupt level.
The locking hierarchy is swap_list_lock -> swap_device_lock.
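
The hierarchy can be sketched in user space as follows, assuming POSIX
mutexes in place of the two spinlocks and a single swap device. The type
swap_dev_sketch, the variable dev and get_swap_slot() are hypothetical;
the sketch does not maintain lowest_bit/highest_bit or the priority
chain, it only shows the order in which the locks are taken.

```c
#include <assert.h>
#include <pthread.h>

#define SLOTS 4

/* Hypothetical stand-in for one swap device. */
struct swap_dev_sketch {
    pthread_mutex_t swap_device_lock;  /* protects swap_map and the bit fields */
    unsigned short swap_map[SLOTS];    /* per-swaphandle reference counts */
    int lowest_bit, highest_bit;
};

/* Protects the device list and nr_swap_pages. */
static pthread_mutex_t swap_list_lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_swap_pages = SLOTS;

static struct swap_dev_sketch dev = {
    .swap_device_lock = PTHREAD_MUTEX_INITIALIZER,
    .lowest_bit = 0,
    .highest_bit = SLOTS - 1,
};

/* Allocate one swaphandle; returns its slot number, or -1 if none is free. */
static int get_swap_slot(void)
{
    int slot = -1;

    pthread_mutex_lock(&swap_list_lock);            /* outer lock first */
    if (nr_swap_pages > 0) {
        pthread_mutex_lock(&dev.swap_device_lock);  /* then the inner lock */
        for (int i = dev.lowest_bit; i <= dev.highest_bit; i++) {
            if (dev.swap_map[i] == 0) {
                dev.swap_map[i] = 1;                /* first reference */
                slot = i;
                break;
            }
        }
        pthread_mutex_unlock(&dev.swap_device_lock);
        assert(slot >= 0);     /* nr_swap_pages said a free slot exists */
        nr_swap_pages--;
    }
    pthread_mutex_unlock(&swap_list_lock);
    return slot;
}
```

Taking the locks in the opposite order anywhere else would allow two
callers to deadlock against each other, which is why the order is fixed.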

To prevent races between swap space deletion or async readahead swapins
deciding whether a swap handle is being used, i.e. worthy of being read in
from disk, and an unmap -> swap_free making the handle unused, the swap
delete and readahead code grabs a temporary reference on the swaphandle to
prevent warning messages from swap_duplicate <- read_swap_cache_async.

Swap cache locking
------------------
Pages are added into the swap cache with kernel_lock held, to make sure
that multiple pages are not being added (and hence lost) by associating
all of them with the same swaphandle.

Pages are guaranteed not to be removed from the swap cache (scache) if the
page is "shared": i.e., other processes hold a reference on the page or the
associated swap handle. The only code that does not follow this rule is
shrink_mmap, which deletes pages from the swap cache if no process has a
reference on the page (multiple processes might have references on the
corresponding swap handle though). lookup_swap_cache() races with
shrink_mmap when establishing a reference on a scache page, so it must
check whether the page it located is still in the swapcache, or whether
shrink_mmap deleted it. (This race is due to the fact that shrink_mmap
looks at the page ref count with pagecache_lock, but then drops
pagecache_lock before deleting the page from the scache).
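
The recheck that lookup_swap_cache() needs can be sketched as a
"take the reference, then validate" pattern. This is a user-space model
assuming C11 atomics; page_sketch, its fields and swap_cache_get() are
hypothetical, and the candidate pointer stands for whatever page the
cache lookup found a moment earlier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only what the recheck needs. */
struct page_sketch {
    atomic_int count;            /* page reference count */
    atomic_bool in_swap_cache;   /* cleared when the page leaves the scache */
    unsigned long entry;         /* swaphandle the page is associated with */
};

/*
 * Take a reference on a page that an earlier lookup found for `entry`.
 * Returns the page with the reference held, or NULL if the page left the
 * swap cache (or was reused for another entry) in the meantime.
 */
static struct page_sketch *swap_cache_get(struct page_sketch *candidate,
                                          unsigned long entry)
{
    assert(candidate != NULL);

    /* Establish our reference first ... */
    atomic_fetch_add(&candidate->count, 1);

    /* ... then check that the page is still what we looked up. */
    if (!atomic_load(&candidate->in_swap_cache) || candidate->entry != entry) {
        atomic_fetch_sub(&candidate->count, 1);   /* lost the race: undo */
        return NULL;
    }
    return candidate;
}
```

A caller that gets NULL back treats the lookup as a miss and falls back to
whatever it would do if the page had never been in the swap cache.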

do_wp_page and do_swap_page have MP races in them while trying to figure
out whether a page is "shared", by looking at the page_count + swap_count.
To preserve the sum of the counts, the page lock _must_ be acquired before
calling is_page_shared (else processes might switch their swap_count refs
to the page count refs, after the page count ref has been snapshotted).
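
The reason the page lock is needed can be sketched in user space as
follows, assuming a POSIX mutex as a stand-in for the page lock. The type
page_sketch and the functions are hypothetical, and the "sum greater
than 1" test is an assumption made for illustration, not the kernel's
exact is_page_shared() logic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page plus its swaphandle count. */
struct page_sketch {
    pthread_mutex_t lock;   /* models the page lock */
    int page_count;         /* references on the page itself */
    int swap_count;         /* references on the associated swaphandle */
};

static void page_sketch_init(struct page_sketch *p, int page_count,
                             int swap_count)
{
    pthread_mutex_init(&p->lock, NULL);
    p->page_count = page_count;
    p->swap_count = swap_count;
}

/* Caller must hold p->lock so both counts are read as one snapshot. */
static bool is_page_shared_locked(const struct page_sketch *p)
{
    return p->page_count + p->swap_count > 1;   /* assumed threshold */
}

static bool page_is_shared(struct page_sketch *p)
{
    bool shared;

    pthread_mutex_lock(&p->lock);
    shared = is_page_shared_locked(p);
    pthread_mutex_unlock(&p->lock);
    return shared;
}

/* A process switching its swaphandle reference to a page reference. */
static void swap_ref_to_page_ref(struct page_sketch *p)
{
    pthread_mutex_lock(&p->lock);
    assert(p->swap_count > 0);
    p->swap_count--;        /* the sum of the two counts is unchanged ... */
    p->page_count++;        /* ... but only as seen from under the lock */
    pthread_mutex_unlock(&p->lock);
}
```

Without the lock, a reader could sample page_count before the switch and
swap_count after it, miss the moving reference in both, and wrongly
conclude that the page is not shared.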

Swap device deletion code currently breaks all the scache assumptions,
since it grabs neither mmap_sem nor page_table_lock.