Blame - Documentation/block/bfq-iosched.txt - kernel/msm-4.19

blob: 05e2822a80b34d0f31649cfb486ade2b4679f9e8 [file] [log] [blame]

Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	1	BFQ (Budget Fair Queueing)
				2	==========================
				3
				4	BFQ is a proportional-share I/O scheduler, with some extra
				5	low-latency capabilities. In addition to cgroups support (blkio or io
				6	controllers), BFQ's main features are:
				7	- BFQ guarantees a high system and application responsiveness, and a
				8	low latency for time-sensitive applications, such as audio or video
				9	players;
				10	- BFQ distributes bandwidth, and not just time, among processes or
				11	groups (switching back to time distribution when needed to keep
				12	throughput high).
				13
Paolo Valente	43c1b3d	2017-05-09 12:54:23 +0200	[diff] [blame^]	14	In its default configuration, BFQ privileges latency over
				15	throughput. So, when needed for achieving a lower latency, BFQ builds
				16	schedules that may lead to a lower throughput. If your main or only
				17	goal, for a given device, is to achieve the maximum-possible
				18	throughput at all times, then do switch off all low-latency heuristics
				19	for that device, by setting low_latency to 0. Full details in Section 3.
				20
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	21	On average CPUs, the current version of BFQ can handle devices
				22	performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
				23	reference, 30-50 KIOPS correspond to very high bandwidths with
				24	sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
				25	to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
				26	multi-queue devices.
				27
				28	The table of contents follow. Impatients can just jump to Section 3.
				29
				30	CONTENTS
				31
				32	1. When may BFQ be useful?
				33	1-1 Personal systems
				34	1-2 Server systems
				35	2. How does BFQ work?
				36	3. What are BFQ's tunable?
				37	4. BFQ group scheduling
				38	4-1 Service guarantees provided
				39	4-2 Interface
				40
				41	1. When may BFQ be useful?
				42	==========================
				43
				44	BFQ provides the following benefits on personal and server systems.
				45
				46	1-1 Personal systems
				47	--------------------
				48
				49	Low latency for interactive applications
				50
				51	Regardless of the actual background workload, BFQ guarantees that, for
				52	interactive tasks, the storage device is virtually as responsive as if
				53	it was idle. For example, even if one or more of the following
				54	background workloads are being executed:
				55	- one or more large files are being read, written or copied,
				56	- a tree of source files is being compiled,
				57	- one or more virtual machines are performing I/O,
				58	- a software update is in progress,
				59	- indexing daemons are scanning filesystems and updating their
				60	databases,
				61	starting an application or loading a file from within an application
				62	takes about the same time as if the storage device was idle. As a
				63	comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
				64	applications experience high latencies, or even become unresponsive
				65	until the background workload terminates (also on SSDs).
				66
				67	Low latency for soft real-time applications
				68
				69	Also soft real-time applications, such as audio and video
				70	players/streamers, enjoy a low latency and a low drop rate, regardless
				71	of the background I/O workload. As a consequence, these applications
				72	do not suffer from almost any glitch due to the background workload.
				73
				74	Higher speed for code-development tasks
				75
				76	If some additional workload happens to be executed in parallel, then
				77	BFQ executes the I/O-related components of typical code-development
				78	tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
				79	NOOP or DEADLINE.
				80
				81	High throughput
				82
				83	On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
				84	up to 150% higher throughput than DEADLINE and NOOP, with all the
				85	sequential workloads considered in our tests. With random workloads,
				86	and with all the workloads on flash-based devices, BFQ achieves,
				87	instead, about the same throughput as the other schedulers.
				88
				89	Strong fairness, bandwidth and delay guarantees
				90
				91	BFQ distributes the device throughput, and not just the device time,
				92	among I/O-bound applications in proportion their weights, with any
				93	workload and regardless of the device parameters. From these bandwidth
				94	guarantees, it is possible to compute tight per-I/O-request delay
				95	guarantees by a simple formula. If not configured for strict service
				96	guarantees, BFQ switches to time-based resource sharing (only) for
				97	applications that would otherwise cause a throughput loss.
				98
				99	1-2 Server systems
				100	------------------
				101
				102	Most benefits for server systems follow from the same service
				103	properties as above. In particular, regardless of whether additional,
				104	possibly heavy workloads are being served, BFQ guarantees:
				105
				106	. audio and video-streaming with zero or very low jitter and drop
				107	rate;
				108
				109	. fast retrieval of WEB pages and embedded objects;
				110
				111	. real-time recording of data in live-dumping applications (e.g.,
				112	packet logging);
				113
				114	. responsiveness in local and remote access to a server.
				115
				116
				117	2. How does BFQ work?
				118	=====================
				119
				120	BFQ is a proportional-share I/O scheduler, whose general structure,
				121	plus a lot of code, are borrowed from CFQ.
				122
				123	- Each process doing I/O on a device is associated with a weight and a
				124	(bfq_)queue.
				125
				126	- BFQ grants exclusive access to the device, for a while, to one queue
				127	(process) at a time, and implements this service model by
				128	associating every queue with a budget, measured in number of
				129	sectors.
				130
				131	- After a queue is granted access to the device, the budget of the
				132	queue is decremented, on each request dispatch, by the size of the
				133	request.
				134
				135	- The in-service queue is expired, i.e., its service is suspended,
				136	only if one of the following events occurs: 1) the queue finishes
				137	its budget, 2) the queue empties, 3) a "budget timeout" fires.
				138
				139	- The budget timeout prevents processes doing random I/O from
				140	holding the device for too long and dramatically reducing
				141	throughput.
				142
				143	- Actually, as in CFQ, a queue associated with a process issuing
				144	sync requests may not be expired immediately when it empties. In
				145	contrast, BFQ may idle the device for a short time interval,
				146	giving the process the chance to go on being served if it issues
				147	a new request in time. Device idling typically boosts the
				148	throughput on rotational devices, if processes do synchronous
				149	and sequential I/O. In addition, under BFQ, device idling is
				150	also instrumental in guaranteeing the desired throughput
				151	fraction to processes issuing sync requests (see the description
				152	of the slice_idle tunable in this document, or [1, 2], for more
				153	details).
				154
				155	- With respect to idling for service guarantees, if several
				156	processes are competing for the device at the same time, but
				157	all processes (and groups, after the following commit) have
				158	the same weight, then BFQ guarantees the expected throughput
				159	distribution without ever idling the device. Throughput is
				160	thus as high as possible in this common scenario.
				161
				162	- If low-latency mode is enabled (default configuration), BFQ
				163	executes some special heuristics to detect interactive and soft
				164	real-time applications (e.g., video or audio players/streamers),
				165	and to reduce their latency. The most important action taken to
				166	achieve this goal is to give to the queues associated with these
				167	applications more than their fair share of the device
				168	throughput. For brevity, we call just "weight-raising" the whole
				169	sets of actions taken by BFQ to privilege these queues. In
				170	particular, BFQ provides a milder form of weight-raising for
				171	interactive applications, and a stronger form for soft real-time
				172	applications.
				173
				174	- BFQ automatically deactivates idling for queues born in a burst of
				175	queue creations. In fact, these queues are usually associated with
				176	the processes of applications and services that benefit mostly
				177	from a high throughput. Examples are systemd during boot, or git
				178	grep.
				179
				180	- As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
				181	performing random I/O that becomes mostly sequential if
				182	merged. Differently from CFQ, BFQ achieves this goal with a more
				183	reactive mechanism, called Early Queue Merge (EQM). EQM is so
				184	responsive in detecting interleaved I/O (cooperating processes),
				185	that it enables BFQ to achieve a high throughput, by queue
				186	merging, even for queues for which CFQ needs a different
				187	mechanism, preemption, to get a high throughput. As such EQM is a
				188	unified mechanism to achieve a high throughput with interleaved
				189	I/O.
				190
				191	- Queues are scheduled according to a variant of WF2Q+, named
				192	B-WF2Q+, and implemented using an augmented rb-tree to preserve an
				193	O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
				194	also ready for hierarchical scheduling. However, for a cleaner
				195	logical breakdown, the code that enables and completes
				196	hierarchical support is provided in the next commit, which focuses
				197	exactly on this feature.
				198
				199	- B-WF2Q+ guarantees a tight deviation with respect to an ideal,
				200	perfectly fair, and smooth service. In particular, B-WF2Q+
				201	guarantees that each queue receives a fraction of the device
				202	throughput proportional to its weight, even if the throughput
				203	fluctuates, and regardless of: the device parameters, the current
				204	workload and the budgets assigned to the queue.
				205
				206	- The last, budget-independence, property (although probably
				207	counterintuitive in the first place) is definitely beneficial, for
				208	the following reasons:
				209
				210	- First, with any proportional-share scheduler, the maximum
				211	deviation with respect to an ideal service is proportional to
				212	the maximum budget (slice) assigned to queues. As a consequence,
				213	BFQ can keep this deviation tight not only because of the
				214	accurate service of B-WF2Q+, but also because BFQ does not
				215	need to assign a larger budget to a queue to let the queue
				216	receive a higher fraction of the device throughput.
				217
				218	- Second, BFQ is free to choose, for every process (queue), the
				219	budget that best fits the needs of the process, or best
				220	leverages the I/O pattern of the process. In particular, BFQ
				221	updates queue budgets with a simple feedback-loop algorithm that
				222	allows a high throughput to be achieved, while still providing
				223	tight latency guarantees to time-sensitive applications. When
				224	the in-service queue expires, this algorithm computes the next
				225	budget of the queue so as to:
				226
				227	- Let large budgets be eventually assigned to the queues
				228	associated with I/O-bound applications performing sequential
				229	I/O: in fact, the longer these applications are served once
				230	got access to the device, the higher the throughput is.
				231
				232	- Let small budgets be eventually assigned to the queues
				233	associated with time-sensitive applications (which typically
				234	perform sporadic and short I/O), because, the smaller the
				235	budget assigned to a queue waiting for service is, the sooner
				236	B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
				237
				238	- If several processes are competing for the device at the same time,
				239	but all processes and groups have the same weight, then BFQ
				240	guarantees the expected throughput distribution without ever idling
				241	the device. It uses preemption instead. Throughput is then much
				242	higher in this common scenario.
				243
				244	- ioprio classes are served in strict priority order, i.e.,
				245	lower-priority queues are not served as long as there are
				246	higher-priority queues. Among queues in the same class, the
				247	bandwidth is distributed in proportion to the weight of each
				248	queue. A very thin extra bandwidth is however guaranteed to
				249	the Idle class, to prevent it from starving.
				250
				251
				252	3. What are BFQ's tunable?
				253	==========================
				254
				255	The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
				256	fifo_expire_sync below are the same as in CFQ. Their description is
				257	just copied from that for CFQ. Some considerations in the description
				258	of slice_idle are copied from CFQ too.
				259
				260	per-process ioprio and weight
				261	-----------------------------
				262
Arianna Avanzini	e21b7a0	2017-04-12 18:23:08 +0200	[diff] [blame]	263	Unless the cgroups interface is used (see "4. BFQ group scheduling"),
				264	weights can be assigned to processes only indirectly, through I/O
				265	priorities, and according to the relation:
				266	weight = (IOPRIO_BE_NR - ioprio) * 10.
				267
				268	Beware that, if low-latency is set, then BFQ automatically raises the
				269	weight of the queues associated with interactive and soft real-time
				270	applications. Unset this tunable if you need/want to control weights.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	271
				272	slice_idle
				273	----------
				274
				275	This parameter specifies how long BFQ should idle for next I/O
				276	request, when certain sync BFQ queues become empty. By default
				277	slice_idle is a non-zero value. Idling has a double purpose: boosting
				278	throughput and making sure that the desired throughput distribution is
				279	respected (see the description of how BFQ works, and, if needed, the
				280	papers referred there).
				281
				282	As for throughput, idling can be very helpful on highly seeky media
				283	like single spindle SATA/SAS disks where we can cut down on overall
				284	number of seeks and see improved throughput.
				285
				286	Setting slice_idle to 0 will remove all the idling on queues and one
				287	should see an overall improved throughput on faster storage devices
				288	like multiple SATA/SAS disks in hardware RAID configuration.
				289
				290	So depending on storage and workload, it might be useful to set
				291	slice_idle=0. In general for SATA/SAS disks and software RAID of
				292	SATA/SAS disks keeping slice_idle enabled should be useful. For any
				293	configurations where there are multiple spindles behind single LUN
				294	(Host based hardware RAID controller or for storage arrays), setting
				295	slice_idle=0 might end up in better throughput and acceptable
				296	latencies.
				297
				298	Idling is however necessary to have service guarantees enforced in
				299	case of differentiated weights or differentiated I/O-request lengths.
				300	To see why, suppose that a given BFQ queue A must get several I/O
				301	requests served for each request served for another queue B. Idling
				302	ensures that, if A makes a new I/O request slightly after becoming
				303	empty, then no request of B is dispatched in the middle, and thus A
				304	does not lose the possibility to get more than one request dispatched
				305	before the next request of B is dispatched. Note that idling
				306	guarantees the desired differentiated treatment of queues only in
				307	terms of I/O-request dispatches. To guarantee that the actual service
				308	order then corresponds to the dispatch order, the strict_guarantees
				309	tunable must be set too.
				310
				311	There is an important flipside for idling: apart from the above cases
				312	where it is beneficial also for throughput, idling can severely impact
				313	throughput. One important case is random workload. Because of this
				314	issue, BFQ tends to avoid idling as much as possible, when it is not
				315	beneficial also for throughput. As a consequence of this behavior, and
				316	of further issues described for the strict_guarantees tunable,
				317	short-term service guarantees may be occasionally violated. And, in
				318	some cases, these guarantees may be more important than guaranteeing
				319	maximum throughput. For example, in video playing/streaming, a very
				320	low drop rate may be more important than maximum throughput. In these
				321	cases, consider setting the strict_guarantees parameter.
				322
				323	strict_guarantees
				324	-----------------
				325
				326	If this parameter is set (default: unset), then BFQ
				327
				328	- always performs idling when the in-service queue becomes empty;
				329
				330	- forces the device to serve one I/O request at a time, by dispatching a
				331	new request only if there is no outstanding request.
				332
				333	In the presence of differentiated weights or I/O-request sizes, both
				334	the above conditions are needed to guarantee that every BFQ queue
				335	receives its allotted share of the bandwidth. The first condition is
				336	needed for the reasons explained in the description of the slice_idle
				337	tunable. The second condition is needed because all modern storage
				338	devices reorder internally-queued requests, which may trivially break
				339	the service guarantees enforced by the I/O scheduler.
				340
				341	Setting strict_guarantees may evidently affect throughput.
				342
				343	back_seek_max
				344	-------------
				345
				346	This specifies, given in Kbytes, the maximum "distance" for backward seeking.
				347	The distance is the amount of space from the current head location to the
				348	sectors that are backward in terms of distance.
				349
				350	This parameter allows the scheduler to anticipate requests in the "backward"
				351	direction and consider them as being the "next" if they are within this
				352	distance from the current head location.
				353
				354	back_seek_penalty
				355	-----------------
				356
				357	This parameter is used to compute the cost of backward seeking. If the
				358	backward distance of request is just 1/back_seek_penalty from a "front"
				359	request, then the seeking cost of two requests is considered equivalent.
				360
				361	So scheduler will not bias toward one or the other request (otherwise scheduler
				362	will bias toward front request). Default value of back_seek_penalty is 2.
				363
				364	fifo_expire_async
				365	-----------------
				366
				367	This parameter is used to set the timeout of asynchronous requests. Default
				368	value of this is 248ms.
				369
				370	fifo_expire_sync
				371	----------------
				372
				373	This parameter is used to set the timeout of synchronous requests. Default
				374	value of this is 124ms. In case to favor synchronous requests over asynchronous
				375	one, this value should be decreased relative to fifo_expire_async.
				376
				377	low_latency
				378	-----------
				379
				380	This parameter is used to enable/disable BFQ's low latency mode. By
				381	default, low latency mode is enabled. If enabled, interactive and soft
				382	real-time applications are privileged and experience a lower latency,
				383	as explained in more detail in the description of how BFQ works.
				384
Paolo Valente	43c1b3d	2017-05-09 12:54:23 +0200	[diff] [blame^]	385	DISABLE this mode if you need full control on bandwidth
Paolo Valente	44e44a1	2017-04-12 18:23:12 +0200	[diff] [blame]	386	distribution. In fact, if it is enabled, then BFQ automatically
				387	increases the bandwidth share of privileged applications, as the main
				388	means to guarantee a lower latency to them.
				389
Paolo Valente	43c1b3d	2017-05-09 12:54:23 +0200	[diff] [blame^]	390	In addition, as already highlighted at the beginning of this document,
				391	DISABLE this mode if your only goal is to achieve a high throughput.
				392	In fact, privileging the I/O of some application over the rest may
				393	entail a lower throughput. To achieve the highest-possible throughput
				394	on a non-rotational device, setting slice_idle to 0 may be needed too
				395	(at the cost of giving up any strong guarantee on fairness and low
				396	latency).
				397
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	398	timeout_sync
				399	------------
				400
				401	Maximum amount of device time that can be given to a task (queue) once
				402	it has been selected for service. On devices with costly seeks,
				403	increasing this time usually increases maximum throughput. On the
				404	opposite end, increasing this time coarsens the granularity of the
				405	short-term bandwidth and latency guarantees, especially if the
				406	following parameter is set to zero.
				407
				408	max_budget
				409	----------
				410
				411	Maximum amount of service, measured in sectors, that can be provided
				412	to a BFQ queue once it is set in service (of course within the limits
				413	of the above timeout). According to what said in the description of
				414	the algorithm, larger values increase the throughput in proportion to
				415	the percentage of sequential I/O requests issued. The price of larger
				416	values is that they coarsen the granularity of short-term bandwidth
				417	and latency guarantees.
				418
				419	The default value is 0, which enables auto-tuning: BFQ sets max_budget
				420	to the maximum number of sectors that can be served during
				421	timeout_sync, according to the estimated peak rate.
				422
				423	weights
				424	-------
				425
				426	Read-only parameter, used to show the weights of the currently active
				427	BFQ queues.
				428
				429
				430	wr_ tunables
				431	------------
				432
				433	BFQ exports a few parameters to control/tune the behavior of
				434	low-latency heuristics.
				435
				436	wr_coeff
				437
				438	Factor by which the weight of a weight-raised queue is multiplied. If
				439	the queue is deemed soft real-time, then the weight is further
				440	multiplied by an additional, constant factor.
				441
				442	wr_max_time
				443
				444	Maximum duration of a weight-raising period for an interactive task
				445	(ms). If set to zero (default value), then this value is computed
				446	automatically, as a function of the peak rate of the device. In any
				447	case, when the value of this parameter is read, it always reports the
				448	current duration, regardless of whether it has been set manually or
				449	computed automatically.
				450
				451	wr_max_softrt_rate
				452
				453	Maximum service rate below which a queue is deemed to be associated
				454	with a soft real-time application, and is then weight-raised
				455	accordingly (sectors/sec).
				456
				457	wr_min_idle_time
				458
				459	Minimum idle period after which interactive weight-raising may be
				460	reactivated for a queue (in ms).
				461
				462	wr_rt_max_time
				463
				464	Maximum weight-raising duration for soft real-time queues (in ms). The
				465	start time from which this duration is considered is automatically
				466	moved forward if the queue is detected to be still soft real-time
				467	before the current soft real-time weight-raising period finishes.
				468
				469	wr_min_inter_arr_async
				470
				471	Minimum period between I/O request arrivals after which weight-raising
				472	may be reactivated for an already busy async queue (in ms).
				473
				474
				475	4. Group scheduling with BFQ
				476	============================
				477
Arianna Avanzini	e21b7a0	2017-04-12 18:23:08 +0200	[diff] [blame]	478	BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
				479	blkio and io. In particular, BFQ supports weight-based proportional
				480	share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	481
				482	4-1 Service guarantees provided
				483	-------------------------------
				484
				485	With BFQ, proportional share means true proportional share of the
				486	device bandwidth, according to group weights. For example, a group
				487	with weight 200 gets twice the bandwidth, and not just twice the time,
				488	of a group with weight 100.
				489
				490	BFQ supports hierarchies (group trees) of any depth. Bandwidth is
				491	distributed among groups and processes in the expected way: for each
				492	group, the children of the group share the whole bandwidth of the
				493	group in proportion to their weights. In particular, this implies
				494	that, for each leaf group, every process of the group receives the
				495	same share of the whole group bandwidth, unless the ioprio of the
				496	process is modified.
				497
				498	The resource-sharing guarantee for a group may partially or totally
				499	switch from bandwidth to time, if providing bandwidth guarantees to
				500	the group lowers the throughput too much. This switch occurs on a
				501	per-process basis: if a process of a leaf group causes throughput loss
				502	if served in such a way to receive its share of the bandwidth, then
				503	BFQ switches back to just time-based proportional share for that
				504	process.
				505
				506	4-2 Interface
				507	-------------
				508
				509	To get proportional sharing of bandwidth with BFQ for a given device,
				510	BFQ must of course be the active scheduler for that device.
				511
				512	Within each group directory, the names of the files associated with
				513	BFQ-specific cgroup parameters and stats begin with the "bfq."
				514	prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
				515	BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
				516	parameter to set the weight of a group with BFQ is blkio.bfq.weight
				517	or io.bfq.weight.
				518
				519	Parameters to set
				520	-----------------
				521
				522	For each group, there is only the following parameter to set.
				523
				524	weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
				525	group inside its parent. Available values: 1..10000 (default 100). The
				526	linear mapping between ioprio and weights, described at the beginning
				527	of the tunable section, is still valid, but all weights higher than
				528	IOPRIO_BE_NR*10 are mapped to ioprio 0.
				529
Paolo Valente	44e44a1	2017-04-12 18:23:12 +0200	[diff] [blame]	530	Recall that, if low-latency is set, then BFQ automatically raises the
				531	weight of the queues associated with interactive and soft real-time
				532	applications. Unset this tunable if you need/want to control weights.
				533
Paolo Valente	aee69d7	2017-04-19 08:29:02 -0600	[diff] [blame]	534
				535	[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
				536	Scheduler", Proceedings of the First Workshop on Mobile System
				537	Technologies (MST-2015), May 2015.
				538	http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
				539
				540	[2] P. Valente and M. Andreolini, "Improving Application
				541	Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
				542	the 5th Annual International Systems and Storage Conference
				543	(SYSTOR '12), June 2012.
				544	Slightly extended version:
				545	http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
				546	results.pdf