Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003

Attention! Database servers, especially those using "TCQ" disks, should
investigate performance with the 'deadline' IO scheduler. Any system with high
disk performance requirements should do so, in fact.

If you see unusual performance characteristics of your disk systems, or you
see big performance regressions versus the deadline scheduler, please email
me. Database users, don't bother unless you're willing to test a lot of patches
from me ;) it's a known issue.

Also, users with hardware RAID controllers doing striping may find
highly variable performance results when using as-iosched. The
as-iosched anticipatory implementation is based on the notion that a disk
device has only one physical seeking head. A striped RAID controller
actually has a head for each physical device in the logical RAID device.

However, setting antic_expire to zero (see tunable parameters below) produces
behavior very similar to that of the deadline IO scheduler.


Selecting IO schedulers
-----------------------
To choose an IO scheduler at boot time, use the argument 'elevator=deadline'.
'noop', 'as' and 'cfq' (the default) are also available. The boot argument
assigns the IO scheduler globally. It's also possible to change the IO
scheduler for a particular device on the fly, as described in
Documentation/block/switching-sched.txt.
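
As a rough illustration of the per-device case, the Python sketch below
parses the contents of a device's scheduler file. It assumes the format
described in switching-sched.txt, where the available schedulers are listed
on one line and the active one is wrapped in square brackets. The function
name and the sample string are illustrative only.

```python
def parse_scheduler_line(text):
    """Split a scheduler listing into (active, available).

    Assumed input format: space-separated scheduler names with the
    active one wrapped in square brackets, e.g. "noop deadline [cfq]".
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]   # strip the brackets
            active = token
        available.append(token)
    return active, available


if __name__ == "__main__":
    print(parse_scheduler_line("noop deadline [cfq]"))
```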


Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation applies several layers of policies
to determine when an IO request is dispatched to the disk controller.
The policies are outlined here, in order of application.

1. One-way elevator algorithm.

The elevator algorithm is similar to that used in the deadline scheduler, with
the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards). A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position, and
the other is in front of the elevator's position. If the seek distance to
the request behind the elevator is less than half the seek distance to
the request in front of the elevator, then the request behind can be chosen.
Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
This favors forward movement of the elevator, while allowing opportunistic
"short" backward seeks.
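
The backward-seek rule can be sketched in Python as follows. This is a
minimal model of the rule as stated here, not the kernel code: the function
name and arguments are hypothetical, and it covers only the case where one
request lies behind the elevator and one lies in front.

```python
MAXBACK = 1024 * 1024  # maximum backward seek in sectors (value from the text)


def prefer_backward(head, front, back):
    """Return True if the request behind the elevator should be chosen.

    head  -- current elevator position (sector)
    front -- sector of the candidate request in front of the elevator
    back  -- sector of the candidate request behind the elevator
    """
    front_dist = front - head
    back_dist = head - back
    if back_dist > MAXBACK:
        return False  # too far behind, keep sweeping forward
    # The backward request wins only if it is less than half as far away.
    return 2 * back_dist < front_dist


if __name__ == "__main__":
    print(prefer_backward(head=1000, front=2000, back=600))  # short hop back
    print(prefer_backward(head=1000, front=2000, back=500))  # exactly half: stay forward
```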

2. FIFO expiration times for reads and for writes.

This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists are tunable using the parameters read_expire
and write_expire discussed below. When a read or a write expires in this way,
the IO scheduler will interrupt its current elevator sweep or read anticipation
to service the expired request.
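
A FIFO with expiration times can be modelled with a few lines of Python.
This is only a sketch of the idea: the class and method names are
hypothetical, and timestamps are passed in explicitly (in milliseconds)
rather than read from a clock.

```python
from collections import deque


class ExpiryFifo:
    """Requests in arrival order, each stamped with a deadline."""

    def __init__(self, expire_ms):
        self.expire_ms = expire_ms
        self._queue = deque()  # (deadline_ms, request), oldest first

    def add(self, request, now_ms):
        self._queue.append((now_ms + self.expire_ms, request))

    def expired_head(self, now_ms):
        """Return the oldest request if its deadline has passed, else None."""
        if self._queue and self._queue[0][0] <= now_ms:
            return self._queue[0][1]
        return None


if __name__ == "__main__":
    reads = ExpiryFifo(expire_ms=50)
    reads.add("r1", now_ms=0)
    reads.add("r2", now_ms=10)
    print(reads.expired_head(now_ms=49), reads.expired_head(now_ms=50))
```

Because requests are appended in arrival order and share one expiration
time, only the head of the queue ever needs to be checked.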

3. Read and write request batching

A batch is a collection of read requests or a collection of write
requests. The as scheduler alternates dispatching read and write batches
to the driver. In the case of a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit, and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.

In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available, and the
write batch time limit has not been exceeded (write_batch_expire).
However, the length of write batches will be gradually shortened
when read batches frequently exceed their time limit.

When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch.

The read and write FIFO expiration times described in policy 2 above
are checked only when scheduling IO of a batch for the corresponding
(read/write) type. So for example, the read FIFO timeout values are
tested only during read batches. Likewise, the write FIFO timeout
values are tested only during write batches. For this reason,
it is generally not recommended for the read batch time
to be longer than the write expiration time, nor for the write batch
time to exceed the read expiration time (see tunable parameters below).
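
The recommendation above amounts to two comparisons between tunables. The
Python sketch below checks a set of values against it; the function name
and the message wording are hypothetical, and the values in the example are
arbitrary, not defaults.

```python
def check_batch_settings(read_expire, read_batch_expire,
                         write_expire, write_batch_expire):
    """Return a list of warnings for settings that go against the advice.

    All arguments are in milliseconds.
    """
    warnings = []
    if read_batch_expire > write_expire:
        # Write deadlines are not tested while a read batch is running.
        warnings.append("read_batch_expire is longer than write_expire")
    if write_batch_expire > read_expire:
        # Read deadlines are not tested while a write batch is running.
        warnings.append("write_batch_expire is longer than read_expire")
    return warnings


if __name__ == "__main__":
    for line in check_batch_settings(read_expire=50, read_batch_expire=500,
                                     write_expire=250, write_batch_expire=125):
        print("warning:", line)
```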

When the IO scheduler changes from a read to a write batch,
it begins the elevator from the request that is on the head of the
write expiration FIFO. Likewise, when changing from a write batch to
a read batch, the scheduler begins the elevator from the first entry
on the read expiration FIFO.

4. Read anticipation.

Read anticipation occurs only when scheduling a read batch.
This implementation of read anticipation allows only one read request
to be dispatched to the disk controller at a time. In
contrast, many write requests may be dispatched to the disk controller
at a time during a write batch. It is this characteristic that can make
the anticipatory scheduler perform anomalously with controllers supporting
TCQ, or with hardware striped RAID devices. Setting the antic_expire
queue parameter (see below) to zero disables this behavior, and the
anticipatory scheduler behaves essentially like the deadline scheduler.

When read anticipation is enabled (antic_expire is not zero), reads
are dispatched to the disk controller one at a time.
At the end of each read request, the IO scheduler examines its next
candidate read request from its sorted read list. If that next request
is from the same process as the request that just completed,
or if the next request in the queue is "very close" to the
just completed request, it is dispatched immediately. Otherwise,
statistics (average think time, average seek distance) on the process
that submitted the just completed request are examined. If it seems
likely that that process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up to antic_expire
milliseconds, hoping that process will submit a new request near the one
that just completed. If such a request is made, then it is dispatched
immediately. If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.
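
The decision made at the end of each read can be sketched in Python as
below. Several details are assumptions made for illustration and are not
taken from this document: the CLOSE_SECTORS threshold standing in for
"very close", the test of mean think time against antic_expire for "soon",
and the test of mean seek distance against the distance to the candidate
for "near". All names are hypothetical.

```python
from dataclasses import dataclass

CLOSE_SECTORS = 1024  # hypothetical threshold for "very close"


@dataclass
class ProcStats:
    mean_think_ms: float      # mean time between read requests
    mean_seek_sectors: float  # mean seek distance


def after_read_completes(done_pid, done_sector, next_pid, next_sector,
                         stats, antic_expire_ms):
    """Return "dispatch" or "wait" for the next candidate read.

    stats describes the process whose read has just completed.
    """
    if antic_expire_ms == 0:
        return "dispatch"  # anticipation disabled
    if next_pid == done_pid:
        return "dispatch"  # same process: no reason to wait
    distance = abs(next_sector - done_sector)
    if distance <= CLOSE_SECTORS:
        return "dispatch"  # candidate is already very close
    likely_soon = stats.mean_think_ms < antic_expire_ms
    likely_near = stats.mean_seek_sectors < distance
    return "wait" if (likely_soon and likely_near) else "dispatch"


if __name__ == "__main__":
    quick = ProcStats(mean_think_ms=2.0, mean_seek_sectors=100.0)
    print(after_read_completes(1, 5000, 2, 1_000_000, quick, antic_expire_ms=6))
```

A "wait" result lasts at most antic_expire milliseconds; after that the
next read from the sorted queue is dispatched anyway.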

To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute
mean "think time" (the time between read requests), and mean seek
distance for that process. Note that these statistics are kept per
process, not per IO device. So for example, if a process is doing IO
on several file systems on separate devices, the statistics will be
a combination of IO behavior from all those devices.
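
The effect of keeping statistics per process rather than per device can be
seen in a small Python model. It uses a plain arithmetic mean, which is an
assumption made for simplicity (this document does not say how the mean is
computed), and the class name is hypothetical.

```python
from collections import defaultdict


class ThinkTimeStats:
    """Mean think time keyed by process only."""

    def __init__(self):
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def record(self, pid, device, think_ms):
        # The device is deliberately ignored: samples from every device
        # a process touches end up in the same per-process bucket.
        self._total[pid] += think_ms
        self._count[pid] += 1

    def mean(self, pid):
        count = self._count.get(pid, 0)
        return self._total[pid] / count if count else None


if __name__ == "__main__":
    stats = ThinkTimeStats()
    stats.record(42, "sda", 2.0)   # fast, sequential reader on one disk
    stats.record(42, "sdb", 10.0)  # slower pattern on another disk
    print(stats.mean(42))          # blended value across both devices
```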


Tuning the anticipatory IO scheduler
------------------------------------
When using 'as', the anticipatory IO scheduler, there are 5 parameters under
/sys/block/*/queue/iosched/. All are in units of milliseconds.
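
A minimal Python sketch for reading and setting these files follows. The
helper names are hypothetical; it assumes each file holds a single integer,
that 'as' is the active scheduler for the device, and that writes are done
with sufficient privileges. The sys_block argument exists only so the code
can be tried against a scratch directory.

```python
from pathlib import Path

AS_TUNABLES = ("read_expire", "read_batch_expire",
               "write_expire", "write_batch_expire", "antic_expire")


def read_as_tunables(device, sys_block="/sys/block"):
    """Return {name: milliseconds} for the tunables that are present."""
    iosched = Path(sys_block) / device / "queue" / "iosched"
    values = {}
    for name in AS_TUNABLES:
        path = iosched / name
        if path.exists():
            values[name] = int(path.read_text().strip())
    return values


def set_as_tunable(device, name, value_ms, sys_block="/sys/block"):
    """Write one tunable, in milliseconds."""
    if name not in AS_TUNABLES:
        raise ValueError(f"unknown tunable: {name}")
    path = Path(sys_block) / device / "queue" / "iosched" / name
    path.write_text(f"{int(value_ms)}\n")


if __name__ == "__main__":
    # "sda" is only an example device name.
    print(read_as_tunables("sda"))
```

For example, set_as_tunable("sda", "antic_expire", 0) would disable read
anticipation on that device.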

The parameters are:
* read_expire
Controls how long until a read request becomes "expired". It also controls the
interval between which expired requests are served, so with read_expire set to
50, a request might take up to 100ms to be serviced _if_ it is the next on the
expired list. Obviously request expiration strategies won't make the disk
go faster. The result basically equates to the timeslice a single reader
gets in the presence of other IO. 100/((seek time / read_expire) + 1) is
very roughly the % streaming read efficiency your disk should get with
multiple readers.

* read_batch_expire
Controls how much time a batch of reads is given before pending writes are
served. A higher value is more efficient. This might be set below read_expire
if writes are to be given higher priority than reads, but reads are to be
as efficient as possible when there are no writes. Generally though, it
should be some multiple of read_expire.

* write_expire, and
* write_batch_expire are equivalent to the above, for writes.

* antic_expire
Controls the maximum amount of time we can anticipate a good read (one
with a short seek distance from the most recently completed request) before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. Should be a bit higher
for devices with long seek times, though the correspondence is not linear:
most processes have only a few ms of think time.
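
The streaming read efficiency mentioned under read_expire can be worked
through numerically. The Python sketch below assumes a simple model: each
reader streams for read_expire milliseconds and then pays one seek, so the
fraction of time spent transferring data is
read_expire / (read_expire + seek time), which equals
100/((seek time / read_expire) + 1) percent. The function name and the
sample numbers are illustrative only.

```python
def streaming_read_efficiency(seek_ms, read_expire_ms):
    """Very rough % of time spent streaming with multiple readers."""
    return 100.0 / (seek_ms / read_expire_ms + 1.0)


if __name__ == "__main__":
    # Longer timeslices amortize the seek better.
    for read_expire_ms in (10, 40, 90):
        pct = streaming_read_efficiency(seek_ms=10, read_expire_ms=read_expire_ms)
        print(f"read_expire={read_expire_ms}ms -> {pct:.0f}%")
```

Raising read_expire improves this figure but lengthens the time other
readers may have to wait, which is the trade-off the parameter controls.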