mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake

Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
cheaper than scanning a number of lists. To avoid synchronization
overhead, counter deltas are maintained on a per-cpu basis and drained both
periodically and when the delta is above a threshold. On systems with a
large number of CPUs, the difference between the estimated and real value
of NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
real number of free pages in the buddy allocator, the VM can allocate pages
below the min watermark, at worst reducing the real number of free pages to
zero. Even if the OOM killer kills a victim to free memory, it may not free
any memory if the exit path requires a new page, resulting in livelock.

This patch introduces a zone_page_state_snapshot() function (courtesy of
Christoph) that takes a slightly more accurate view of an arbitrary vmstat
counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
the watermark being accidentally broken. The estimate is not perfect and
may result in cache line bounces but is expected to be lighter than the IPI
calls necessary to continually drain the per-cpu counters while kswapd is
awake.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
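
As an illustration of the idea described in the message, the following is a minimal userspace model, not the kernel implementation. It contrasts a cheap read of a counter with a snapshot read that also adds the pending per-cpu deltas. All names (model_counter, model_account, model_read_fast, model_read_snapshot, MODEL_NR_CPUS) are hypothetical, and clamping negative sums to zero is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for this model */

/* Hypothetical stand-in for one zone counter such as NR_FREE_PAGES. */
struct model_counter {
	long global;			/* cheap to read, possibly stale */
	long cpu_delta[MODEL_NR_CPUS];	/* per-cpu deltas not yet drained */
};

/* Record a change on one CPU without touching the shared value. */
static void model_account(struct model_counter *c, size_t cpu, long delta)
{
	assert(cpu < MODEL_NR_CPUS);
	c->cpu_delta[cpu] += delta;
}

/* Cheap read: ignores the pending per-cpu deltas. */
static unsigned long model_read_fast(const struct model_counter *c)
{
	return c->global < 0 ? 0 : (unsigned long)c->global;
}

/* Snapshot read: adds every CPU's pending delta to the shared value. */
static unsigned long model_read_snapshot(const struct model_counter *c)
{
	long x = c->global;

	for (size_t cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		x += c->cpu_delta[cpu];

	/* Clamp: a transiently negative sum is reported as zero. */
	return x < 0 ? 0 : (unsigned long)x;
}
```

In this model, a shared value of 100 with pending deltas of -30, -30, -30 and -5 reads as 100 through the cheap path but as 5 through the snapshot path, which is the kind of gap the message describes.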

commit: aa45484031ddee09b06350ab8528bfe5b2c76d1c
author: Christoph Lameter <cl@linux.com> Thu Sep 09 16:38:17 2010 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> Thu Sep 09 18:57:25 2010 -0700
tree: 6758072232db9a54453022ec3e6cede35d52001c
parent: 72853e2991a2702ae93aaf889ac7db743a415dd3
diff --git a/mm/vmstat.c b/mm/vmstat.c
index a8d6b59..355a9e6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c

@@ -138,11 +138,24 @@
 	int threshold;
 
 	for_each_populated_zone(zone) {
+		unsigned long max_drift, tolerate_drift;
+
 		threshold = calculate_threshold(zone);
 
 		for_each_online_cpu(cpu)
 			per_cpu_ptr(zone->pageset, cpu)->stat_threshold
 							= threshold;
+
+		/*
+		 * Only set percpu_drift_mark if there is a danger that
+		 * NR_FREE_PAGES reports the low watermark is ok when in fact
+		 * the min watermark could be breached by an allocation
+		 */
+		tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
+		max_drift = num_online_cpus() * threshold;
+		if (max_drift > tolerate_drift)
+			zone->percpu_drift_mark = high_wmark_pages(zone) +
+					max_drift;
 	}
 }
 
@@ -813,7 +826,7 @@
 		   "\n        scanned  %lu"
 		   "\n        spanned  %lu"
 		   "\n        present  %lu",
-		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone_nr_free_pages(zone),
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
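
The arithmetic in the first hunk can be tried out in isolation with a small sketch. The helper below is hypothetical (model_drift_mark is not a kernel function) and takes the watermarks, CPU count and threshold as plain parameters. Returning 0 to mean "no mark set" is an assumption of this sketch; the hunk itself simply leaves percpu_drift_mark untouched in that case.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the drift-mark arithmetic in the hunk.
 * Returns the mark in pages, or 0 when the worst-case per-cpu drift
 * fits within the gap between the low and min watermarks.
 */
static unsigned long model_drift_mark(unsigned long min_wmark,
				      unsigned long low_wmark,
				      unsigned long high_wmark,
				      unsigned int online_cpus,
				      unsigned int threshold)
{
	unsigned long tolerate_drift, max_drift;

	assert(low_wmark >= min_wmark);

	/* Drift that can be absorbed before min could be breached. */
	tolerate_drift = low_wmark - min_wmark;

	/* Worst case: every CPU holds a delta of up to threshold pages. */
	max_drift = (unsigned long)online_cpus * threshold;

	if (max_drift > tolerate_drift)
		return high_wmark + max_drift;

	return 0;
}
```

With made-up watermarks of 1000, 1250 and 1500 pages, 64 CPUs and a threshold of 125 give a worst-case drift of 8000 pages, well above the 250 pages of tolerance, so the mark would be 1500 + 8000 = 9500 pages. With 2 CPUs and a threshold of 100 the drift is 200 pages and no mark is needed.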