Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	1	Documentation for /proc/sys/vm/* kernel version 2.6.29
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	2	(c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	3	(c) 2008 Peter W. Morreale <pmorreale@novell.com>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	4
				5	For general info and legal blurb, please look in README.
				6
				7	==============================================================
				8
				9	This file contains the documentation for the sysctl files in
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	10	/proc/sys/vm and is valid for Linux kernel version 2.6.29.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	11
				12	The files in this directory can be used to tune the operation
				13	of the virtual memory (VM) subsystem of the Linux kernel and
				14	the writeout of dirty data to disk.
				15
				16	Default values and initialization routines for most of these
				17	files can be found in mm/swap.c.
				18
				19	Currently, these files are in /proc/sys/vm:
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	20
				21	- block_dump
Mel Gorman	76ab0f5	2010-05-24 14:32:28 -0700	[diff] [blame^]	22	- compact_memory
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	23	- dirty_background_bytes
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	24	- dirty_background_ratio
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	25	- dirty_bytes
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	26	- dirty_expire_centisecs
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	27	- dirty_ratio
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	28	- dirty_writeback_centisecs
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	29	- drop_caches
				30	- hugepages_treat_as_movable
				31	- hugetlb_shm_group
				32	- laptop_mode
				33	- legacy_va_layout
				34	- lowmem_reserve_ratio
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	35	- max_map_count
Andi Kleen	6a46079	2009-09-16 11:50:15 +0200	[diff] [blame]	36	- memory_failure_early_kill
				37	- memory_failure_recovery
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	38	- min_free_kbytes
Christoph Lameter	0ff3849	2006-09-25 23:31:52 -0700	[diff] [blame]	39	- min_slab_ratio
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	40	- min_unmapped_ratio
				41	- mmap_min_addr
Nishanth Aravamudan	d5dbac8	2007-12-17 16:20:25 -0800	[diff] [blame]	42	- nr_hugepages
				43	- nr_overcommit_hugepages
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	44	- nr_pdflush_threads
				45	- nr_trim_pages (only if CONFIG_MMU=n)
				46	- numa_zonelist_order
				47	- oom_dump_tasks
				48	- oom_kill_allocating_task
				49	- overcommit_memory
				50	- overcommit_ratio
				51	- page-cluster
				52	- panic_on_oom
				53	- percpu_pagelist_fraction
				54	- stat_interval
				55	- swappiness
				56	- vfs_cache_pressure
				57	- zone_reclaim_mode
				58
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	59	==============================================================
				60
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	61	block_dump
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	62
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	63	block_dump enables block I/O debugging when set to a nonzero value. More
				64	information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	65
				66	==============================================================
				67
Mel Gorman	76ab0f5	2010-05-24 14:32:28 -0700	[diff] [blame^]	68	compact_memory
				69
Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important, for example, in the allocation
of huge pages, although processes will also directly compact memory as
required.
				74
				75	==============================================================
				76
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	77	dirty_background_bytes
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	78
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	79	Contains the amount of dirty memory at which the pdflush background writeback
				80	daemon will start writeback.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	81
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	82	If dirty_background_bytes is written, dirty_background_ratio becomes a function
				83	of its value (dirty_background_bytes / the amount of dirtyable system memory).
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	84
				85	==============================================================
				86
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	87	dirty_background_ratio
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	88
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	89	Contains, as a percentage of total system memory, the number of pages at which
				90	the pdflush background writeback daemon will start writing out dirty data.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	91
				92	==============================================================
				93
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	94	dirty_bytes
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	95
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	96	Contains the amount of dirty memory at which a process generating disk writes
				97	will itself start writeback.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	98
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	99	If dirty_bytes is written, dirty_ratio becomes a function of its value
				100	(dirty_bytes / the amount of dirtyable system memory).
				101
Andrea Righi	9e4a5bd	2009-04-30 15:08:57 -0700	[diff] [blame]	102	Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
				103	value lower than this limit will be ignored and the old configuration will be
				104	retained.
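
The two rules above (the two-page minimum, and the ratio implied by a byte
value) can be illustrated with a minimal sketch. It is not kernel code: the
function names are hypothetical and a 4096-byte page size is assumed.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; real systems may differ


def accept_dirty_bytes(new_value, old_value, page_size=PAGE_SIZE):
    """Return the dirty_bytes value in effect after a write (hypothetical helper)."""
    # Values below two pages are ignored and the old configuration is retained.
    if new_value < 2 * page_size:
        return old_value
    return new_value


def implied_dirty_ratio(dirty_bytes, dirtyable_bytes):
    """Percentage of dirtyable memory that a dirty_bytes setting corresponds to."""
    return 100.0 * dirty_bytes / dirtyable_bytes
```

For example, with 1 GiB of dirtyable memory, a dirty_bytes value of 256 MiB
corresponds to a ratio of 25 percent.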
				105
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	106	==============================================================
				107
				108	dirty_expire_centisecs
				109
This tunable is used to define when dirty data is old enough to be eligible
for writeout by the pdflush daemons. It is expressed in hundredths of a second.
Data which has been dirty in memory for longer than this interval will be
written out the next time a pdflush daemon wakes up.
				114
				115	==============================================================
				116
				117	dirty_ratio
				118
				119	Contains, as a percentage of total system memory, the number of pages at which
				120	a process which is generating disk writes will itself start writing out dirty
				121	data.
				122
				123	==============================================================
				124
				125	dirty_writeback_centisecs
				126
The pdflush writeback daemons will periodically wake up and write "old" data
out to disk. This tunable expresses the interval between those wakeups, in
hundredths of a second.
				130
				131	Setting this to zero disables periodic writeback altogether.
				132
				133	==============================================================
				134
				135	drop_caches
				136
				137	Writing to this will cause the kernel to drop clean caches, dentries and
				138	inodes from memory, causing that memory to become free.
				139
				140	To free pagecache:
				141	echo 1 > /proc/sys/vm/drop_caches
				142	To free dentries and inodes:
				143	echo 2 > /proc/sys/vm/drop_caches
				144	To free pagecache, dentries and inodes:
				145	echo 3 > /proc/sys/vm/drop_caches
				146
Because this is a non-destructive operation, dirty objects are not freed; the
user should run `sync' first so that they become clean and freeable.
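
The three documented values behave like a two-bit mask. The sketch below
decodes a value into the caches it asks the kernel to drop. Treating the
value as a bitmask is an assumption that is consistent with the values 1, 2
and 3 listed above; the function name is hypothetical.

```python
def drop_caches_targets(value):
    """List the caches a drop_caches value would free (illustrative only)."""
    targets = []
    if value & 1:
        targets.append("pagecache")
    if value & 2:
        targets.append("dentries and inodes")
    return targets
```

For instance, drop_caches_targets(3) names both the pagecache and the
dentries and inodes.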
				149
				150	==============================================================
				151
				152	hugepages_treat_as_movable
				153
				154	This parameter is only useful when kernelcore= is specified at boot time to
				155	create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages
				156	are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero
				157	value written to hugepages_treat_as_movable allows huge pages to be allocated
				158	from ZONE_MOVABLE.
				159
				160	Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge
				161	pages pool can easily grow or shrink within. Assuming that applications are
				162	not running that mlock() a lot of memory, it is likely the huge pages pool
				163	can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value
				164	into nr_hugepages and triggering page reclaim.
				165
				166	==============================================================
				167
				168	hugetlb_shm_group
				169
hugetlb_shm_group contains the group id that is allowed to create SysV
shared memory segments using hugetlb pages.
				172
				173	==============================================================
				174
				175	laptop_mode
				176
				177	laptop_mode is a knob that controls "laptop mode". All the things that are
				178	controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt.
				179
				180	==============================================================
				181
				182	legacy_va_layout
				183
If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.
				186
				187	==============================================================
				188
				189	lowmem_reserve_ratio
				190
				191	For some specialised workloads on highmem machines it is dangerous for
				192	the kernel to allow process memory to be allocated from the "lowmem"
				193	zone. This is because that memory could then be pinned via the mlock()
				194	system call, or by unavailability of swapspace.
				195
				196	And on large highmem machines this lack of reclaimable lowmem memory
				197	can be fatal.
				198
				199	So the Linux page allocator has a mechanism which prevents allocations
				200	which _could_ use highmem from using too much lowmem. This means that
				201	a certain amount of lowmem is defended from the possibility of being
				202	captured into pinned user memory.
				203
				204	(The same argument applies to the old 16 megabyte ISA DMA region. This
				205	mechanism will also defend that region from allocations which could use
				206	highmem or lowmem).
				207
				208	The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
				209	in defending these lower zones.
				210
				211	If you have a machine which uses highmem or ISA DMA and your
				212	applications are using mlock(), or if you are running with no swap then
				213	you probably should change the lowmem_reserve_ratio setting.
				214
lowmem_reserve_ratio is an array. You can see its values by reading this file.
				216	-
				217	% cat /proc/sys/vm/lowmem_reserve_ratio
				218	256 256 32
				219	-
Note: the number of elements is one fewer than the number of zones, because
the highest zone's value is not needed for the following calculation.

These values are not used directly, however. The kernel calculates the number
of protection pages for each zone from them. The results are shown as an array
of protection pages in /proc/zoneinfo, as in the following example from an
x86-64 box. Each zone has an array of protection pages like this.
				227
				228	-
				229	Node 0, zone DMA
				230	pages free 1355
				231	min 3
				232	low 3
				233	high 4
				234	:
				235	:
				236	numa_other 0
				237	protection: (0, 2004, 2004, 2004)
				238	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				239	pagesets
				240	cpu: 0 pcp: 0
				241	:
				242	-
These protections are added to the watermark when judging whether this zone
should be used for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are requested from this DMA zone and
watermark[WMARK_HIGH] is used as the watermark, the kernel judges that this
zone should not be used because pages_free (1355) is smaller than
watermark + protection[2] (4 + 2004 = 2008). If this protection value were 0,
this zone would be used for the normal page request. If the request is for the
DMA zone (index=0), protection[0] (=0) is used.
				252
zone[i]'s protection[j] is calculated by the following expression.

(i < j):
  zone[i]->protection[j]
  = (total sum of present_pages from zone[i+1] to zone[j] on the node)
    / lowmem_reserve_ratio[i];
(i = j):
  (should not be protected) = 0;
(i > j):
  (not necessary, but appears as 0)
				263
The default values of lowmem_reserve_ratio[i] are
    256 (if zone[i] is the DMA or DMA32 zone)
    32 (others).
As the expression above shows, each value is the reciprocal of a ratio.
256 means 1/256, so the number of protection pages is about 0.39% of the total
present pages of the higher zones on the node.

If you would like to protect more pages, use smaller values.
The minimum value is 1 (1/1 -> 100%).
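
The protection expression and the allocation check from the /proc/zoneinfo
example can be sketched as follows. This is an illustration, not the kernel
implementation: the function names and the present_pages figures are
hypothetical, chosen so that the DMA zone's protection against the next zone
comes out at 2004 pages as in the example. The text only says a zone is
skipped when free pages are "smaller than" the sum, so the handling of
exact equality here is an assumption.

```python
def protection(i, j, present_pages, reserve_ratio):
    """Protection pages of zone[i] against allocations targeting zone[j]."""
    if i >= j:
        # i == j: not protected; i > j: not needed, reported as 0.
        return 0
    # Sum of present_pages from zone[i+1] to zone[j], divided by the ratio.
    higher_pages = sum(present_pages[i + 1 : j + 1])
    return higher_pages // reserve_ratio[i]


def zone_usable(free_pages, watermark, protection_pages):
    """True if the zone has more free pages than watermark + protection."""
    # Assumption: equality is treated as "not usable".
    return free_pages > watermark + protection_pages


# Hypothetical node with zones DMA, DMA32 and Normal (sizes in pages).
present_pages = [3976, 513024, 1048576]
reserve_ratio = [256, 256, 32]
```

With these figures, protection(0, 1, present_pages, reserve_ratio) gives 2004,
and zone_usable(1355, 4, 2004) is false, matching the judgement in the example
above (1355 is smaller than 4 + 2004 = 2008).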
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	273
				274	==============================================================
				275
				276	max_map_count:
				277
				278	This file contains the maximum number of memory map areas a process
				279	may have. Memory map areas are used as a side-effect of calling
				280	malloc, directly by mmap and mprotect, and also when loading shared
				281	libraries.
				282
				283	While most applications need less than a thousand maps, certain
				284	programs, particularly malloc debuggers, may consume lots of them,
				285	e.g., up to one or two maps per allocation.
				286
				287	The default value is 65536.
				288
Andi Kleen	6a46079	2009-09-16 11:50:15 +0200	[diff] [blame]	289	=============================================================
				290
				291	memory_failure_early_kill:
				292
Control how to kill processes when an uncorrected memory error (typically
a 2-bit error in a memory module) that cannot be handled by the kernel is
detected in the background by hardware. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
no other up-to-date copy of the data, it will kill processes to prevent any
data corruption from propagating.
				300
				301	1: Kill all processes that have the corrupted and not reloadable page mapped
				302	as soon as the corruption is detected. Note this is not supported
				303	for a few types of pages, like kernel internally allocated data or
				304	the swap cache, but works for the majority of user pages.
				305
0: Only unmap the corrupted page from all processes and only kill a process
that tries to access it.
				308
				309	The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
				310	handle this if they want to.
				311
				312	This is only active on architectures/platforms with advanced machine
				313	check handling and depends on the hardware capabilities.
				314
Applications can override this setting individually with the PR_MCE_KILL prctl.
				316
				317	==============================================================
				318
				319	memory_failure_recovery
				320
Enable memory failure recovery (when supported by the platform).
				322
				323	1: Attempt recovery.
				324
				325	0: Always panic on a memory failure.
				326
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	327	==============================================================
				328
				329	min_free_kbytes:
				330
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	331	This is used to force the Linux VM to keep a minimum number
Mel Gorman	4185896	2009-06-16 15:32:12 -0700	[diff] [blame]	332	of kilobytes free. The VM uses this number to compute a
				333	watermark[WMARK_MIN] value for each lowmem zone in the system.
				334	Each lowmem zone gets a number of reserved free pages based
				335	proportionally on its size.
Rohit Seth	8ad4b1f	2006-01-08 01:00:40 -0800	[diff] [blame]	336
Matt LaPlante	d919588	2008-07-25 19:45:33 -0700	[diff] [blame]	337	Some minimal amount of memory is needed to satisfy PF_MEMALLOC
Pavel Machek	2495089	2007-10-16 23:31:28 -0700	[diff] [blame]	338	allocations; if you set this to lower than 1024KB, your system will
				339	become subtly broken, and prone to deadlock under high loads.
				340
				341	Setting this too high will OOM your machine instantly.
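
As a rough illustration of "based proportionally on its size", the sketch
below splits the total reserve across lowmem zones in proportion to their
page counts. It is a simplification under stated assumptions (4 KiB pages,
hypothetical zone sizes and function name), not the kernel's actual
watermark code.

```python
def per_zone_min_pages(min_free_kbytes, zone_pages, page_size_kb=4):
    """Split the min_free_kbytes reserve across zones by relative size."""
    total_min_pages = min_free_kbytes // page_size_kb
    total_pages = sum(zone_pages.values())
    return {
        name: total_min_pages * pages // total_pages
        for name, pages in zone_pages.items()
    }
```

With min_free_kbytes=4096 (1024 pages) and two zones of 4000 and 12000 pages,
the smaller zone is assigned 256 pages and the larger one 768.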
				342
Christoph Lameter	9614634	2006-07-03 00:24:13 -0700	[diff] [blame]	343	=============================================================
				344
Christoph Lameter	0ff3849	2006-09-25 23:31:52 -0700	[diff] [blame]	345	min_slab_ratio:
				346
				347	This is available only on NUMA kernels.
				348
				349	A percentage of the total pages in each zone. On Zone reclaim
				350	(fallback from the local zone occurs) slabs will be reclaimed if more
				351	than this percentage of pages in a zone are reclaimable slab pages.
This ensures that slab growth stays under control even in NUMA
				353	systems that rarely perform global reclaim.
				354
				355	The default is 5 percent.
				356
				357	Note that slab reclaim is triggered in a per zone / node fashion.
				358	The process of reclaiming slab memory is currently not node specific
				359	and may not be fast.
				360
				361	=============================================================
				362
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	363	min_unmapped_ratio:
KAMEZAWA Hiroyuki	fadd8fb	2006-06-23 02:03:13 -0700	[diff] [blame]	364
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	365	This is available only on NUMA kernels.
Yasunori Goto	2b744c0	2007-05-06 14:49:59 -0700	[diff] [blame]	366
Mel Gorman	90afa5d	2009-06-16 15:33:20 -0700	[diff] [blame]	367	This is a percentage of the total pages in each zone. Zone reclaim will
				368	only occur if more than this percentage of pages are in a state that
				369	zone_reclaim_mode allows to be reclaimed.
				370
				371	If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
				372	against all file-backed unmapped pages including swapcache pages and tmpfs
				373	files. Otherwise, only unmapped pages backed by normal files but not tmpfs
				374	files and similar are considered.
Yasunori Goto	2b744c0	2007-05-06 14:49:59 -0700	[diff] [blame]	375
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	376	The default is 1 percent.
David Rientjes	fe071d7	2007-10-16 23:25:56 -0700	[diff] [blame]	377
Eric Paris	ed03218	2007-06-28 15:55:21 -0400	[diff] [blame]	378	==============================================================
				379
				380	mmap_min_addr
				381
This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.
				390
KAMEZAWA Hiroyuki	f0c0b2b	2007-07-15 23:38:01 -0700	[diff] [blame]	391	==============================================================
				392
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	393	nr_hugepages
				394
				395	Change the minimum size of the hugepage pool.
				396
				397	See Documentation/vm/hugetlbpage.txt
				398
				399	==============================================================
				400
				401	nr_overcommit_hugepages
				402
				403	Change the maximum size of the hugepage pool. The maximum is
				404	nr_hugepages + nr_overcommit_hugepages.
				405
				406	See Documentation/vm/hugetlbpage.txt
				407
				408	==============================================================
				409
				410	nr_pdflush_threads
				411
				412	The current number of pdflush threads. This value is read-only.
				413	The value changes according to the number of dirty pages in the system.
				414
Matt LaPlante	19f5946	2009-04-27 15:06:31 +0200	[diff] [blame]	415	When necessary, additional pdflush threads are created, one per second, up to
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	416	nr_pdflush_threads_max.
				417
				418	==============================================================
				419
				420	nr_trim_pages
				421
				422	This is available only on NOMMU kernels.
				423
				424	This value adjusts the excess page trimming behaviour of power-of-2 aligned
				425	NOMMU mmap allocations.
				426
				427	A value of 0 disables trimming of allocations entirely, while a value of 1
				428	trims excess pages aggressively. Any value >= 1 acts as the watermark where
				429	trimming of allocations is initiated.
				430
				431	The default value is 1.
				432
				433	See Documentation/nommu-mmap.txt for more information.
				434
				435	==============================================================
				436
KAMEZAWA Hiroyuki	f0c0b2b	2007-07-15 23:38:01 -0700	[diff] [blame]	437	numa_zonelist_order
				438
This sysctl is only for NUMA.
Where memory is allocated from is controlled by zonelists.
(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 to keep the explanation
simple. You may be able to read ZONE_DMA as ZONE_DMA32...)

In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows.
ZONE_NORMAL -> ZONE_DMA
This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.
				448
In the NUMA case, you can think of the following 2 types of order.
Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL.
				451
				452	(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
				453	(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.
				454
Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before ZONE_NORMAL is exhausted. This increases the possibility
of out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small.

Type(B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type(A) is called "Node" order. Type(B) is "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone order" orders the zonelists by zone type, then by node within each
zone. Specify "[Zz]one" for zone order.
				469
Specify "[Dd]efault" to request automatic configuration. Autoconfiguration
will select "node" order in the following cases:
(1) if the DMA zone does not exist, or
(2) if the DMA zone comprises greater than 50% of the available memory, or
(3) if any node's DMA zone comprises greater than 60% of its local memory and
    the amount of local memory is big enough.
				476
				477	Otherwise, "zone" order will be selected. Default order is recommended unless
				478	this is causing problems for your system/application.
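
The difference between the two orders can be made concrete with a small
sketch that builds both zonelists for the 2-node example. The function names
and the data layout are hypothetical, and remote nodes are simply visited in
ascending node id, which is an assumption made to keep the sketch short.

```python
# Preferred zone first, as in the GFP_KERNEL example (ZONE_NORMAL -> ZONE_DMA).
ZONE_PRIORITY = ["ZONE_NORMAL", "ZONE_DMA"]


def _nodes_local_first(local_node, node_zones):
    # Assumption: remote nodes follow the local node in ascending id order.
    return [local_node] + [n for n in sorted(node_zones) if n != local_node]


def node_order(local_node, node_zones):
    """Order by node first, then by zone within each node (type A)."""
    nodes = _nodes_local_first(local_node, node_zones)
    return [(n, z) for n in nodes for z in ZONE_PRIORITY if z in node_zones[n]]


def zone_order(local_node, node_zones):
    """Order by zone type first, then by node within each zone (type B)."""
    nodes = _nodes_local_first(local_node, node_zones)
    return [(n, z) for z in ZONE_PRIORITY for n in nodes if z in node_zones[n]]


# The 2-node example: only Node(0) has a DMA zone.
example = {0: {"ZONE_NORMAL", "ZONE_DMA"}, 1: {"ZONE_NORMAL"}}
```

Calling node_order(0, example) reproduces list (A), and zone_order(0, example)
reproduces list (B), where Node(0)'s ZONE_DMA comes last.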
Nishanth Aravamudan	d5dbac8	2007-12-17 16:20:25 -0800	[diff] [blame]	479
				480	==============================================================
				481
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	482	oom_dump_tasks
Nishanth Aravamudan	d5dbac8	2007-12-17 16:20:25 -0800	[diff] [blame]	483
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	484	Enables a system-wide task dump (excluding kernel threads) to be
				485	produced when the kernel performs an OOM-killing and includes such
				486	information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
				487	name. This is helpful to determine why the OOM killer was invoked
				488	and to identify the rogue task that caused it.
Nishanth Aravamudan	d5dbac8	2007-12-17 16:20:25 -0800	[diff] [blame]	489
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	490	If this is set to zero, this information is suppressed. On very
				491	large systems with thousands of tasks it may not be feasible to dump
				492	the memory state information for each one. Such systems should not
				493	be forced to incur a performance penalty in OOM conditions when the
				494	information may not be desired.
				495
				496	If this is set to non-zero, this information is shown whenever the
				497	OOM killer actually kills a memory-hogging task.
				498
				499	The default value is 0.
Nishanth Aravamudan	d5dbac8	2007-12-17 16:20:25 -0800	[diff] [blame]	500
				501	==============================================================
				502
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	503	oom_kill_allocating_task
Nishanth Aravamudan	d5dbac8	2007-12-17 16:20:25 -0800	[diff] [blame]	504
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	505	This enables or disables killing the OOM-triggering task in
				506	out-of-memory situations.
Nishanth Aravamudan	d5dbac8	2007-12-17 16:20:25 -0800	[diff] [blame]	507
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	508	If this is set to zero, the OOM killer will scan through the entire
				509	tasklist and select a task based on heuristics to kill. This normally
				510	selects a rogue memory-hogging task that frees up a large amount of
				511	memory when killed.
				512
				513	If this is set to non-zero, the OOM killer simply kills the task that
				514	triggered the out-of-memory condition. This avoids the expensive
				515	tasklist scan.
				516
				517	If panic_on_oom is selected, it takes precedence over whatever value
				518	is used in oom_kill_allocating_task.
				519
				520	The default value is 0.
Paul Mundt	dd8632a	2009-01-08 12:04:47 +0000	[diff] [blame]	521
				522	==============================================================
				523
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	524	overcommit_memory:
Paul Mundt	dd8632a	2009-01-08 12:04:47 +0000	[diff] [blame]	525
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	526	This value contains a flag that enables memory overcommitment.
Paul Mundt	dd8632a	2009-01-08 12:04:47 +0000	[diff] [blame]	527
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	528	When this flag is 0, the kernel attempts to estimate the amount
				529	of free memory left when userspace requests more memory.
Paul Mundt	dd8632a	2009-01-08 12:04:47 +0000	[diff] [blame]	530
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	531	When this flag is 1, the kernel pretends there is always enough
				532	memory until it actually runs out.
Paul Mundt	dd8632a	2009-01-08 12:04:47 +0000	[diff] [blame]	533
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	534	When this flag is 2, the kernel uses a "never overcommit"
				535	policy that attempts to prevent any overcommit of memory.
Paul Mundt	dd8632a	2009-01-08 12:04:47 +0000	[diff] [blame]	536
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	537	This feature can be very useful because there are a lot of
				538	programs that malloc() huge amounts of memory "just-in-case"
				539	and don't use much of it.
				540
				541	The default value is 0.
				542
				543	See Documentation/vm/overcommit-accounting and
				544	security/commoncap.c::cap_vm_enough_memory() for more information.
				545
				546	==============================================================
				547
				548	overcommit_ratio:
				549
				550	When overcommit_memory is set to 2, the committed address
				551	space is not permitted to exceed swap plus this percentage
				552	of physical RAM. See above.
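
The limit described above is simple arithmetic, sketched here with
hypothetical function names and sizes in kilobytes. It illustrates the rule
as stated in this section and does not model any other adjustments the kernel
may apply.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Swap plus overcommit_ratio percent of physical RAM."""
    return swap_kb + ram_kb * overcommit_ratio // 100


def allocation_allowed(committed_kb, request_kb, limit_kb):
    """Under overcommit_memory=2, refuse requests that would exceed the limit."""
    return committed_kb + request_kb <= limit_kb
```

For example, with 8,000,000 kB of RAM, 2,000,000 kB of swap and a ratio of
50, the committed address space may not exceed 6,000,000 kB.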
				553
				554	==============================================================
				555
				556	page-cluster
				557
page-cluster controls the number of pages which are written to swap in
a single attempt, i.e. the swap I/O size.
				560
				561	It is a logarithmic value - setting it to zero means "1 page", setting
				562	it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
				563
				564	The default value is three (eight pages at a time). There may be some
				565	small benefits in tuning this to a different value if your workload is
				566	swap-intensive.
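
Because the value is logarithmic, the number of pages per swap I/O is a power
of two. A one-line sketch (the function name is hypothetical):

```python
def pages_per_swap_io(page_cluster):
    """Number of pages handled in a single swap attempt: 2 ** page_cluster."""
    return 1 << page_cluster
```

The default of 3 therefore corresponds to pages_per_swap_io(3), i.e. eight
pages.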
				567
				568	=============================================================
				569
				570	panic_on_oom
				571
This enables or disables the panic-on-out-of-memory feature.

If this is set to 0, the kernel will kill some rogue process using the
OOM killer. Usually, the OOM killer can kill rogue processes and the
system will survive.

If this is set to 1, the kernel panics when out-of-memory happens.
However, if a process limits its allocations to certain nodes using
mempolicy/cpusets, and those nodes run out of memory, one process
may be killed by the OOM killer. No panic occurs in this case, because
other nodes' memory may still be free, which means the system as a whole
may not be in a fatal state yet.

If this is set to 2, the kernel always panics, even in the case described
above. Even if OOM happens under a memory cgroup, the whole system panics.

The default value is 0.
Values 1 and 2 are intended for failover in clusters. Please select either
according to your failover policy.
panic_on_oom=2 combined with kdump gives you a very strong tool to investigate
why OOM happens, because you can get a snapshot.
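
The three settings can be summarised as a small decision function. This is a
sketch of the behaviour described in this section only; the function name and
the "constrained" flag (meaning the OOM condition is confined to a subset of
nodes by mempolicy/cpusets) are illustrative, not kernel interfaces.

```python
def oom_action(panic_on_oom, constrained):
    """Return "panic" or "kill" for an out-of-memory event."""
    if panic_on_oom == 2:
        # Always panic, even when the OOM is confined to some nodes.
        return "panic"
    if panic_on_oom == 1 and not constrained:
        # Panic only when the whole system is out of memory.
        return "panic"
    # Otherwise the OOM killer selects and kills a process.
    return "kill"
```

For example, oom_action(1, constrained=True) returns "kill", while
oom_action(2, constrained=True) returns "panic".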
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	594
				595	=============================================================
				596
				597	percpu_pagelist_fraction
				598
This is the fraction of pages in each zone that may at most be allocated to
each per-cpu page list (the high mark, pcp->high). The minimum value is 8,
which means that no more than 1/8th of the pages in each zone can be
allocated to any single per_cpu_pagelist. This entry only changes the value
of hot per-cpu pagelists. A user can specify a number like 100 to allocate
1/100th of each zone to each per-cpu page list.

The batch value of each per-cpu pagelist is also updated as a result. It is
set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8).

The initial value is zero. The kernel does not use this value at boot time to
set the high water marks for each per-cpu page list.
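
The relationship between the fraction, the high mark and the batch value can
be sketched as below. The function name is hypothetical, PAGE_SHIFT is
assumed to be 12 (4 KiB pages), the lower bound of 1 on batch is an
assumption, and the unset initial value of zero is not modelled.

```python
PAGE_SHIFT = 12  # assumed: 4 KiB pages


def pcp_limits(zone_pages, fraction, page_shift=PAGE_SHIFT):
    """Return (high, batch) for one per-cpu page list of a zone."""
    if fraction < 8:
        raise ValueError("percpu_pagelist_fraction must be at least 8")
    high = zone_pages // fraction
    # batch is high/4, capped at PAGE_SHIFT * 8 (and at least 1, by assumption).
    batch = min(max(1, high // 4), page_shift * 8)
    return high, batch
```

For a zone of 100000 pages and a fraction of 100, this gives a high mark of
1000 pages and a batch of 96, since 1000/4 exceeds the cap of 12 * 8.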
				611
				612	==============================================================
				613
				614	stat_interval
				615
				616	The time interval between which vm statistics are updated. The default
				617	is 1 second.
				618
				619	==============================================================
				620
				621	swappiness
				622
This control is used to define how aggressively the kernel will swap
memory pages. Higher values will increase aggressiveness, lower values
decrease the amount of swap.
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	626
				627	The default value is 60.
				628
				629	==============================================================
				630
				631	vfs_cache_pressure
				632	------------------
				633
				634	Controls the tendency of the kernel to reclaim the memory which is used for
				635	caching of directory and inode objects.
				636
				637	At the default value of vfs_cache_pressure=100 the kernel will attempt to
				638	reclaim dentries and inodes at a "fair" rate with respect to pagecache and
				639	swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
Jan Kara	55c37a8	2009-09-21 17:01:40 -0700	[diff] [blame]	640	to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
				641	never reclaim dentries and inodes due to memory pressure and this can easily
				642	lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
Peter W Morreale	db0fb18	2009-01-15 13:50:42 -0800	[diff] [blame]	643	causes the kernel to prefer to reclaim dentries and inodes.
				644
				645	==============================================================
				646
				647	zone_reclaim_mode:
				648
				649	Zone_reclaim_mode allows someone to set more or less aggressive approaches to
				650	reclaim memory when a zone runs out of memory. If it is set to zero then no
				651	zone reclaim occurs. Allocations will be satisfied from other zones / nodes
				652	in the system.
				653
This value is the bitwise OR of:
				655
				656	1 = Zone reclaim on
				657	2 = Zone reclaim writes dirty pages out
				658	4 = Zone reclaim swaps pages
				659
				660	zone_reclaim_mode is set during bootup to 1 if it is determined that pages
				661	from remote zones will cause a measurable performance reduction. The
				662	page allocator will then reclaim easily reusable pages (those page
				663	cache pages that are currently not used) before allocating off node pages.
				664
				665	It may be beneficial to switch off zone reclaim if the system is
				666	used for a file server and all of memory should be used for caching files
				667	from disk. In that case the caching effect is more important than
				668	data locality.
				669
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore, but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.
				677
				678	Allowing regular swap effectively restricts allocations to the local
				679	node unless explicitly overridden by memory policies or cpuset
				680	configurations.
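
Since zone_reclaim_mode is a combination of the three bits listed earlier in
this section, a value can be decoded as in the following sketch. The table
and function names are illustrative.

```python
ZONE_RECLAIM_BITS = [
    (1, "zone reclaim on"),
    (2, "zone reclaim writes dirty pages out"),
    (4, "zone reclaim swaps pages"),
]


def decode_zone_reclaim_mode(mode):
    """List the behaviours enabled by a zone_reclaim_mode value."""
    return [label for bit, label in ZONE_RECLAIM_BITS if mode & bit]
```

A value of 0 decodes to an empty list (no zone reclaim), and a value of 5
enables zone reclaim together with swapping of pages.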
				681
				682	============ End of Document =================================