Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
well as for non __GFP_IO allocations.

The first reason why a caller may avoid reclaim is that the caller cannot
sleep due to holding a spinlock or being in interrupt context. The second
may be that the caller is willing to fail the allocation without incurring
the overhead of page reclaim. This may happen for opportunistic high-order
allocation requests that have order-0 fallback options. In such cases,
the caller may also wish to avoid waking kswapd.
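
To make this concrete, the allocator-side decisions reduce to simple
tests on the gfp mask. A minimal sketch, not the kernel's actual fast
path; try_fast_path(), wake_kswapd() and reclaim_and_retry() are
invented helpers, while gfp_t, __GFP_KSWAPD_RECLAIM and
__GFP_DIRECT_RECLAIM are real kernel names:

    /* Sketch: how an allocator might act on the reclaim-related gfp
     * bits described above. */
    static struct page *alloc_sketch(gfp_t gfp_mask, unsigned int order)
    {
        struct page *page;

        page = try_fast_path(gfp_mask, order);       /* invented */
        if (page)
            return page;

        /* Only callers that allow it get kswapd woken up. */
        if (gfp_mask & __GFP_KSWAPD_RECLAIM)
            wake_kswapd();                           /* invented */

        /* Callers that cannot sleep, or that prefer to fail fast,
         * never enter direct reclaim. */
        if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
            return NULL;

        return reclaim_and_retry(gfp_mask, order);   /* invented */
    }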

Non __GFP_IO allocation requests are made to prevent file system
deadlocks.
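
For example, a block driver allocating memory while servicing I/O must
not let reclaim issue further I/O back into the same device. A minimal
sketch; GFP_NOIO and kmalloc() are real kernel APIs, the surrounding
driver context is invented:

    #include <linux/slab.h>

    /* Sketch: allocation on an I/O path.  GFP_NOIO permits blocking
     * but clears __GFP_IO, so reclaim cannot start new I/O that could
     * deadlock on this very device. */
    static void *alloc_io_buffer(size_t size)
    {
        return kmalloc(size, GFP_NOIO);
    }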

In the absence of non-sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, that
is, only when needed (i.e., when a zone's free memory hits 0), instead
of making it a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
OTOH, if there are a lot of free dma pages, it is preferable to satisfy
regular memory requests by allocating one from the dma pool, instead
of incurring the overhead of regular zone balancing.
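
The fallback policy above amounts to walking the zone classes from the
highest one the caller can use down toward dma. A minimal sketch under
that reading; apart from struct page, every name here is invented for
illustration:

    /* Sketch: satisfy a request from the highest usable zone class,
     * dipping into lower classes such as dma only when the higher
     * ones are short on free pages. */
    enum zone_class { CLASS_DMA, CLASS_NORMAL, CLASS_HIGHMEM };

    static struct page *alloc_with_fallback(enum zone_class highest,
                                            unsigned int order)
    {
        int zc;

        for (zc = highest; zc >= CLASS_DMA; zc--) {
            if (free_pages_in(zc) > watermark_of(zc))   /* invented */
                return take_page(zc, order);            /* invented */
        }
        return NULL;    /* every usable class is below its watermark */
    }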

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running on production machines of varying memory sizes, and seems to
be doing fine even in the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.

In 2.3, zone balancing can be done in one of two ways: depending on the
zone size (and possibly the size of lower class zones), we can decide
at init time how many free pages we should aim for while balancing any
zone. The good part is that, while balancing, we do not need to look at
the sizes of lower class zones; the bad part is that we might balance
too frequently because we ignore possibly lower usage in the lower class
zones. Also, with a slight change in the allocation routine, it is
possible to reduce the memclass() macro to a simple equality check.
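
A minimal sketch of this first approach; struct zone_sk and both
functions are invented for illustration, and the 1/64th target is just
an example value:

    /* Sketch: each zone gets a free-page target computed once at
     * init, and the balancing check then looks at that zone alone. */
    struct zone_sk {
        unsigned long size;             /* pages in this zone */
        unsigned long free_pages;
        unsigned long balance_target;   /* fixed at init */
    };

    static void zone_init_target(struct zone_sk *z)
    {
        z->balance_target = z->size / 64;   /* example: 1/64th */
    }

    static int zone_needs_balancing(const struct zone_sk *z)
    {
        /* No lower class zones consulted: cheap, but may balance
         * too often when those zones still have plenty free. */
        return z->free_pages < z->balance_target;
    }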

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.
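
A minimal sketch of this second approach, reusing the illustrative
struct zone_sk from the previous sketch; zones[] is assumed to be
ordered lowest class (dma) first:

    /* Sketch: balance a zone class only when the free memory of that
     * zone _and_ all lower class zones falls below 1/64th of their
     * combined size. */
    static int class_needs_balancing(const struct zone_sk *zones, int zc)
    {
        unsigned long free = 0, size = 0;
        int i;

        for (i = 0; i <= zc; i++) {     /* the zone plus lower classes */
            free += zones[i].free_pages;
            size += zones[i].size;
        }
        return free < size / 64;
    }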

Note that if the size of the regular zone is huge compared to the dma
zone, it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution then
becomes more attractive.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into the regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done, probably
because all allocation requests are coming from interrupt context and
all process contexts are sleeping. For 2.3, kswapd does not really
need to balance the highmem zone, since interrupt context does not
request highmem pages. kswapd looks at the zone_wake_kswapd field in
the zone structure to decide whether a zone needs balancing.
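
A minimal sketch of that kswapd-side scan; struct zone_bal and
reclaim_pages_from() are invented, while zone_wake_kswapd is the field
named in the text:

    /* Sketch: kswapd walks the zones it is responsible for and
     * reclaims from those flagged by the allocator. */
    struct zone_bal {
        int zone_wake_kswapd;   /* set by the allocator on low memory */
        /* ... free page counts, watermarks, etc. ... */
    };

    static void kswapd_scan(struct zone_bal *zones, int nr_zones)
    {
        int i;

        for (i = 0; i < nr_zones; i++) {
            if (!zones[i].zone_wake_kswapd)
                continue;
            reclaim_pages_from(&zones[i]);      /* invented */
            zones[i].zone_wake_kswapd = 0;
        }
    }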

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd:
These are per-zone fields, used to determine when a zone needs to be
balanced. When the number of free pages falls below watermark[WMARK_MIN],
the hysteretic field low_on_memory gets set. This stays set until the
number of free pages rises to watermark[WMARK_HIGH]. While low_on_memory
is set, page allocation requests will try to free some pages in the zone
(provided GFP_WAIT is set in the request). Orthogonal to this is the
decision to poke kswapd to free some zone pages. That decision is not
hysteresis based, and is made when the number of free pages is below
watermark[WMARK_LOW], in which case zone_wake_kswapd is also set.
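
A minimal sketch of this hysteresis on the allocation path; the struct
and helpers are invented, the watermark indices and the two flags follow
the text, and __GFP_DIRECT_RECLAIM stands in for the text's GFP_WAIT
(renamed in 2015):

    /* Sketch: low_on_memory is hysteretic (set below min, cleared
     * only at high), while the kswapd wakeup is a plain low check. */
    enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

    struct zone_wm {
        unsigned long watermark[NR_WMARK];
        unsigned long free_pages;
        int low_on_memory;
        int zone_wake_kswapd;
    };

    static void zone_alloc_check(struct zone_wm *z, gfp_t gfp_mask)
    {
        if (z->free_pages < z->watermark[WMARK_MIN])
            z->low_on_memory = 1;
        else if (z->free_pages >= z->watermark[WMARK_HIGH])
            z->low_on_memory = 0;

        if (z->low_on_memory && (gfp_mask & __GFP_DIRECT_RECLAIM))
            try_to_free_zone_pages(z);          /* invented */

        if (z->free_pages < z->watermark[WMARK_LOW]) {
            z->zone_wake_kswapd = 1;
            wake_up_kswapd();                   /* invented */
        }
    }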


(Good) Ideas that I have heard:
1. Dynamic experience should influence balancing: number of failed requests
for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
dma pages. (lkd@tantalophile.demon.co.uk)