blob: 05e2822a80b34d0f31649cfb486ade2b4679f9e8 [file] [log] [blame]
Paolo Valenteaee69d72017-04-19 08:29:02 -06001BFQ (Budget Fair Queueing)
2==========================
3
4BFQ is a proportional-share I/O scheduler, with some extra
5low-latency capabilities. In addition to cgroups support (blkio or io
6controllers), BFQ's main features are:
7- BFQ guarantees a high system and application responsiveness, and a
8 low latency for time-sensitive applications, such as audio or video
9 players;
10- BFQ distributes bandwidth, and not just time, among processes or
11 groups (switching back to time distribution when needed to keep
12 throughput high).
13
Paolo Valente43c1b3d2017-05-09 12:54:23 +020014In its default configuration, BFQ privileges latency over
15throughput. So, when needed for achieving a lower latency, BFQ builds
16schedules that may lead to a lower throughput. If your main or only
17goal, for a given device, is to achieve the maximum-possible
18throughput at all times, then do switch off all low-latency heuristics
19for that device, by setting low_latency to 0. Full details in Section 3.
20
Paolo Valenteaee69d72017-04-19 08:29:02 -060021On average CPUs, the current version of BFQ can handle devices
22performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
23reference, 30-50 KIOPS correspond to very high bandwidths with
24sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
25to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
26multi-queue devices.
27
28The table of contents follow. Impatients can just jump to Section 3.
29
30CONTENTS
31
321. When may BFQ be useful?
33 1-1 Personal systems
34 1-2 Server systems
352. How does BFQ work?
363. What are BFQ's tunable?
374. BFQ group scheduling
38 4-1 Service guarantees provided
39 4-2 Interface
40
411. When may BFQ be useful?
42==========================
43
44BFQ provides the following benefits on personal and server systems.
45
461-1 Personal systems
47--------------------
48
49Low latency for interactive applications
50
51Regardless of the actual background workload, BFQ guarantees that, for
52interactive tasks, the storage device is virtually as responsive as if
53it was idle. For example, even if one or more of the following
54background workloads are being executed:
55- one or more large files are being read, written or copied,
56- a tree of source files is being compiled,
57- one or more virtual machines are performing I/O,
58- a software update is in progress,
59- indexing daemons are scanning filesystems and updating their
60 databases,
61starting an application or loading a file from within an application
62takes about the same time as if the storage device was idle. As a
63comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
64applications experience high latencies, or even become unresponsive
65until the background workload terminates (also on SSDs).
66
67Low latency for soft real-time applications
68
69Also soft real-time applications, such as audio and video
70players/streamers, enjoy a low latency and a low drop rate, regardless
71of the background I/O workload. As a consequence, these applications
72do not suffer from almost any glitch due to the background workload.
73
74Higher speed for code-development tasks
75
76If some additional workload happens to be executed in parallel, then
77BFQ executes the I/O-related components of typical code-development
78tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
79NOOP or DEADLINE.
80
81High throughput
82
83On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
84up to 150% higher throughput than DEADLINE and NOOP, with all the
85sequential workloads considered in our tests. With random workloads,
86and with all the workloads on flash-based devices, BFQ achieves,
87instead, about the same throughput as the other schedulers.
88
89Strong fairness, bandwidth and delay guarantees
90
91BFQ distributes the device throughput, and not just the device time,
92among I/O-bound applications in proportion their weights, with any
93workload and regardless of the device parameters. From these bandwidth
94guarantees, it is possible to compute tight per-I/O-request delay
95guarantees by a simple formula. If not configured for strict service
96guarantees, BFQ switches to time-based resource sharing (only) for
97applications that would otherwise cause a throughput loss.
98
991-2 Server systems
100------------------
101
102Most benefits for server systems follow from the same service
103properties as above. In particular, regardless of whether additional,
104possibly heavy workloads are being served, BFQ guarantees:
105
106. audio and video-streaming with zero or very low jitter and drop
107 rate;
108
109. fast retrieval of WEB pages and embedded objects;
110
111. real-time recording of data in live-dumping applications (e.g.,
112 packet logging);
113
114. responsiveness in local and remote access to a server.
115
116
1172. How does BFQ work?
118=====================
119
120BFQ is a proportional-share I/O scheduler, whose general structure,
121plus a lot of code, are borrowed from CFQ.
122
123- Each process doing I/O on a device is associated with a weight and a
124 (bfq_)queue.
125
126- BFQ grants exclusive access to the device, for a while, to one queue
127 (process) at a time, and implements this service model by
128 associating every queue with a budget, measured in number of
129 sectors.
130
131 - After a queue is granted access to the device, the budget of the
132 queue is decremented, on each request dispatch, by the size of the
133 request.
134
135 - The in-service queue is expired, i.e., its service is suspended,
136 only if one of the following events occurs: 1) the queue finishes
137 its budget, 2) the queue empties, 3) a "budget timeout" fires.
138
139 - The budget timeout prevents processes doing random I/O from
140 holding the device for too long and dramatically reducing
141 throughput.
142
143 - Actually, as in CFQ, a queue associated with a process issuing
144 sync requests may not be expired immediately when it empties. In
145 contrast, BFQ may idle the device for a short time interval,
146 giving the process the chance to go on being served if it issues
147 a new request in time. Device idling typically boosts the
148 throughput on rotational devices, if processes do synchronous
149 and sequential I/O. In addition, under BFQ, device idling is
150 also instrumental in guaranteeing the desired throughput
151 fraction to processes issuing sync requests (see the description
152 of the slice_idle tunable in this document, or [1, 2], for more
153 details).
154
155 - With respect to idling for service guarantees, if several
156 processes are competing for the device at the same time, but
157 all processes (and groups, after the following commit) have
158 the same weight, then BFQ guarantees the expected throughput
159 distribution without ever idling the device. Throughput is
160 thus as high as possible in this common scenario.
161
162 - If low-latency mode is enabled (default configuration), BFQ
163 executes some special heuristics to detect interactive and soft
164 real-time applications (e.g., video or audio players/streamers),
165 and to reduce their latency. The most important action taken to
166 achieve this goal is to give to the queues associated with these
167 applications more than their fair share of the device
168 throughput. For brevity, we call just "weight-raising" the whole
169 sets of actions taken by BFQ to privilege these queues. In
170 particular, BFQ provides a milder form of weight-raising for
171 interactive applications, and a stronger form for soft real-time
172 applications.
173
174 - BFQ automatically deactivates idling for queues born in a burst of
175 queue creations. In fact, these queues are usually associated with
176 the processes of applications and services that benefit mostly
177 from a high throughput. Examples are systemd during boot, or git
178 grep.
179
180 - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
181 performing random I/O that becomes mostly sequential if
182 merged. Differently from CFQ, BFQ achieves this goal with a more
183 reactive mechanism, called Early Queue Merge (EQM). EQM is so
184 responsive in detecting interleaved I/O (cooperating processes),
185 that it enables BFQ to achieve a high throughput, by queue
186 merging, even for queues for which CFQ needs a different
187 mechanism, preemption, to get a high throughput. As such EQM is a
188 unified mechanism to achieve a high throughput with interleaved
189 I/O.
190
191 - Queues are scheduled according to a variant of WF2Q+, named
192 B-WF2Q+, and implemented using an augmented rb-tree to preserve an
193 O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
194 also ready for hierarchical scheduling. However, for a cleaner
195 logical breakdown, the code that enables and completes
196 hierarchical support is provided in the next commit, which focuses
197 exactly on this feature.
198
199 - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
200 perfectly fair, and smooth service. In particular, B-WF2Q+
201 guarantees that each queue receives a fraction of the device
202 throughput proportional to its weight, even if the throughput
203 fluctuates, and regardless of: the device parameters, the current
204 workload and the budgets assigned to the queue.
205
206 - The last, budget-independence, property (although probably
207 counterintuitive in the first place) is definitely beneficial, for
208 the following reasons:
209
210 - First, with any proportional-share scheduler, the maximum
211 deviation with respect to an ideal service is proportional to
212 the maximum budget (slice) assigned to queues. As a consequence,
213 BFQ can keep this deviation tight not only because of the
214 accurate service of B-WF2Q+, but also because BFQ *does not*
215 need to assign a larger budget to a queue to let the queue
216 receive a higher fraction of the device throughput.
217
218 - Second, BFQ is free to choose, for every process (queue), the
219 budget that best fits the needs of the process, or best
220 leverages the I/O pattern of the process. In particular, BFQ
221 updates queue budgets with a simple feedback-loop algorithm that
222 allows a high throughput to be achieved, while still providing
223 tight latency guarantees to time-sensitive applications. When
224 the in-service queue expires, this algorithm computes the next
225 budget of the queue so as to:
226
227 - Let large budgets be eventually assigned to the queues
228 associated with I/O-bound applications performing sequential
229 I/O: in fact, the longer these applications are served once
230 got access to the device, the higher the throughput is.
231
232 - Let small budgets be eventually assigned to the queues
233 associated with time-sensitive applications (which typically
234 perform sporadic and short I/O), because, the smaller the
235 budget assigned to a queue waiting for service is, the sooner
236 B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
237
238- If several processes are competing for the device at the same time,
239 but all processes and groups have the same weight, then BFQ
240 guarantees the expected throughput distribution without ever idling
241 the device. It uses preemption instead. Throughput is then much
242 higher in this common scenario.
243
244- ioprio classes are served in strict priority order, i.e.,
245 lower-priority queues are not served as long as there are
246 higher-priority queues. Among queues in the same class, the
247 bandwidth is distributed in proportion to the weight of each
248 queue. A very thin extra bandwidth is however guaranteed to
249 the Idle class, to prevent it from starving.
250
251
2523. What are BFQ's tunable?
253==========================
254
255The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
256fifo_expire_sync below are the same as in CFQ. Their description is
257just copied from that for CFQ. Some considerations in the description
258of slice_idle are copied from CFQ too.
259
260per-process ioprio and weight
261-----------------------------
262
Arianna Avanzinie21b7a02017-04-12 18:23:08 +0200263Unless the cgroups interface is used (see "4. BFQ group scheduling"),
264weights can be assigned to processes only indirectly, through I/O
265priorities, and according to the relation:
266weight = (IOPRIO_BE_NR - ioprio) * 10.
267
268Beware that, if low-latency is set, then BFQ automatically raises the
269weight of the queues associated with interactive and soft real-time
270applications. Unset this tunable if you need/want to control weights.
Paolo Valenteaee69d72017-04-19 08:29:02 -0600271
272slice_idle
273----------
274
275This parameter specifies how long BFQ should idle for next I/O
276request, when certain sync BFQ queues become empty. By default
277slice_idle is a non-zero value. Idling has a double purpose: boosting
278throughput and making sure that the desired throughput distribution is
279respected (see the description of how BFQ works, and, if needed, the
280papers referred there).
281
282As for throughput, idling can be very helpful on highly seeky media
283like single spindle SATA/SAS disks where we can cut down on overall
284number of seeks and see improved throughput.
285
286Setting slice_idle to 0 will remove all the idling on queues and one
287should see an overall improved throughput on faster storage devices
288like multiple SATA/SAS disks in hardware RAID configuration.
289
290So depending on storage and workload, it might be useful to set
291slice_idle=0. In general for SATA/SAS disks and software RAID of
292SATA/SAS disks keeping slice_idle enabled should be useful. For any
293configurations where there are multiple spindles behind single LUN
294(Host based hardware RAID controller or for storage arrays), setting
295slice_idle=0 might end up in better throughput and acceptable
296latencies.
297
298Idling is however necessary to have service guarantees enforced in
299case of differentiated weights or differentiated I/O-request lengths.
300To see why, suppose that a given BFQ queue A must get several I/O
301requests served for each request served for another queue B. Idling
302ensures that, if A makes a new I/O request slightly after becoming
303empty, then no request of B is dispatched in the middle, and thus A
304does not lose the possibility to get more than one request dispatched
305before the next request of B is dispatched. Note that idling
306guarantees the desired differentiated treatment of queues only in
307terms of I/O-request dispatches. To guarantee that the actual service
308order then corresponds to the dispatch order, the strict_guarantees
309tunable must be set too.
310
311There is an important flipside for idling: apart from the above cases
312where it is beneficial also for throughput, idling can severely impact
313throughput. One important case is random workload. Because of this
314issue, BFQ tends to avoid idling as much as possible, when it is not
315beneficial also for throughput. As a consequence of this behavior, and
316of further issues described for the strict_guarantees tunable,
317short-term service guarantees may be occasionally violated. And, in
318some cases, these guarantees may be more important than guaranteeing
319maximum throughput. For example, in video playing/streaming, a very
320low drop rate may be more important than maximum throughput. In these
321cases, consider setting the strict_guarantees parameter.
322
323strict_guarantees
324-----------------
325
326If this parameter is set (default: unset), then BFQ
327
328- always performs idling when the in-service queue becomes empty;
329
330- forces the device to serve one I/O request at a time, by dispatching a
331 new request only if there is no outstanding request.
332
333In the presence of differentiated weights or I/O-request sizes, both
334the above conditions are needed to guarantee that every BFQ queue
335receives its allotted share of the bandwidth. The first condition is
336needed for the reasons explained in the description of the slice_idle
337tunable. The second condition is needed because all modern storage
338devices reorder internally-queued requests, which may trivially break
339the service guarantees enforced by the I/O scheduler.
340
341Setting strict_guarantees may evidently affect throughput.
342
343back_seek_max
344-------------
345
346This specifies, given in Kbytes, the maximum "distance" for backward seeking.
347The distance is the amount of space from the current head location to the
348sectors that are backward in terms of distance.
349
350This parameter allows the scheduler to anticipate requests in the "backward"
351direction and consider them as being the "next" if they are within this
352distance from the current head location.
353
354back_seek_penalty
355-----------------
356
357This parameter is used to compute the cost of backward seeking. If the
358backward distance of request is just 1/back_seek_penalty from a "front"
359request, then the seeking cost of two requests is considered equivalent.
360
361So scheduler will not bias toward one or the other request (otherwise scheduler
362will bias toward front request). Default value of back_seek_penalty is 2.
363
364fifo_expire_async
365-----------------
366
367This parameter is used to set the timeout of asynchronous requests. Default
368value of this is 248ms.
369
370fifo_expire_sync
371----------------
372
373This parameter is used to set the timeout of synchronous requests. Default
374value of this is 124ms. In case to favor synchronous requests over asynchronous
375one, this value should be decreased relative to fifo_expire_async.
376
377low_latency
378-----------
379
380This parameter is used to enable/disable BFQ's low latency mode. By
381default, low latency mode is enabled. If enabled, interactive and soft
382real-time applications are privileged and experience a lower latency,
383as explained in more detail in the description of how BFQ works.
384
Paolo Valente43c1b3d2017-05-09 12:54:23 +0200385DISABLE this mode if you need full control on bandwidth
Paolo Valente44e44a12017-04-12 18:23:12 +0200386distribution. In fact, if it is enabled, then BFQ automatically
387increases the bandwidth share of privileged applications, as the main
388means to guarantee a lower latency to them.
389
Paolo Valente43c1b3d2017-05-09 12:54:23 +0200390In addition, as already highlighted at the beginning of this document,
391DISABLE this mode if your only goal is to achieve a high throughput.
392In fact, privileging the I/O of some application over the rest may
393entail a lower throughput. To achieve the highest-possible throughput
394on a non-rotational device, setting slice_idle to 0 may be needed too
395(at the cost of giving up any strong guarantee on fairness and low
396latency).
397
Paolo Valenteaee69d72017-04-19 08:29:02 -0600398timeout_sync
399------------
400
401Maximum amount of device time that can be given to a task (queue) once
402it has been selected for service. On devices with costly seeks,
403increasing this time usually increases maximum throughput. On the
404opposite end, increasing this time coarsens the granularity of the
405short-term bandwidth and latency guarantees, especially if the
406following parameter is set to zero.
407
408max_budget
409----------
410
411Maximum amount of service, measured in sectors, that can be provided
412to a BFQ queue once it is set in service (of course within the limits
413of the above timeout). According to what said in the description of
414the algorithm, larger values increase the throughput in proportion to
415the percentage of sequential I/O requests issued. The price of larger
416values is that they coarsen the granularity of short-term bandwidth
417and latency guarantees.
418
419The default value is 0, which enables auto-tuning: BFQ sets max_budget
420to the maximum number of sectors that can be served during
421timeout_sync, according to the estimated peak rate.
422
423weights
424-------
425
426Read-only parameter, used to show the weights of the currently active
427BFQ queues.
428
429
430wr_ tunables
431------------
432
433BFQ exports a few parameters to control/tune the behavior of
434low-latency heuristics.
435
436wr_coeff
437
438Factor by which the weight of a weight-raised queue is multiplied. If
439the queue is deemed soft real-time, then the weight is further
440multiplied by an additional, constant factor.
441
442wr_max_time
443
444Maximum duration of a weight-raising period for an interactive task
445(ms). If set to zero (default value), then this value is computed
446automatically, as a function of the peak rate of the device. In any
447case, when the value of this parameter is read, it always reports the
448current duration, regardless of whether it has been set manually or
449computed automatically.
450
451wr_max_softrt_rate
452
453Maximum service rate below which a queue is deemed to be associated
454with a soft real-time application, and is then weight-raised
455accordingly (sectors/sec).
456
457wr_min_idle_time
458
459Minimum idle period after which interactive weight-raising may be
460reactivated for a queue (in ms).
461
462wr_rt_max_time
463
464Maximum weight-raising duration for soft real-time queues (in ms). The
465start time from which this duration is considered is automatically
466moved forward if the queue is detected to be still soft real-time
467before the current soft real-time weight-raising period finishes.
468
469wr_min_inter_arr_async
470
471Minimum period between I/O request arrivals after which weight-raising
472may be reactivated for an already busy async queue (in ms).
473
474
4754. Group scheduling with BFQ
476============================
477
Arianna Avanzinie21b7a02017-04-12 18:23:08 +0200478BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
479blkio and io. In particular, BFQ supports weight-based proportional
480share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
Paolo Valenteaee69d72017-04-19 08:29:02 -0600481
4824-1 Service guarantees provided
483-------------------------------
484
485With BFQ, proportional share means true proportional share of the
486device bandwidth, according to group weights. For example, a group
487with weight 200 gets twice the bandwidth, and not just twice the time,
488of a group with weight 100.
489
490BFQ supports hierarchies (group trees) of any depth. Bandwidth is
491distributed among groups and processes in the expected way: for each
492group, the children of the group share the whole bandwidth of the
493group in proportion to their weights. In particular, this implies
494that, for each leaf group, every process of the group receives the
495same share of the whole group bandwidth, unless the ioprio of the
496process is modified.
497
498The resource-sharing guarantee for a group may partially or totally
499switch from bandwidth to time, if providing bandwidth guarantees to
500the group lowers the throughput too much. This switch occurs on a
501per-process basis: if a process of a leaf group causes throughput loss
502if served in such a way to receive its share of the bandwidth, then
503BFQ switches back to just time-based proportional share for that
504process.
505
5064-2 Interface
507-------------
508
509To get proportional sharing of bandwidth with BFQ for a given device,
510BFQ must of course be the active scheduler for that device.
511
512Within each group directory, the names of the files associated with
513BFQ-specific cgroup parameters and stats begin with the "bfq."
514prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
515BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
516parameter to set the weight of a group with BFQ is blkio.bfq.weight
517or io.bfq.weight.
518
519Parameters to set
520-----------------
521
522For each group, there is only the following parameter to set.
523
524weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
525group inside its parent. Available values: 1..10000 (default 100). The
526linear mapping between ioprio and weights, described at the beginning
527of the tunable section, is still valid, but all weights higher than
528IOPRIO_BE_NR*10 are mapped to ioprio 0.
529
Paolo Valente44e44a12017-04-12 18:23:12 +0200530Recall that, if low-latency is set, then BFQ automatically raises the
531weight of the queues associated with interactive and soft real-time
532applications. Unset this tunable if you need/want to control weights.
533
Paolo Valenteaee69d72017-04-19 08:29:02 -0600534
535[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
536 Scheduler", Proceedings of the First Workshop on Mobile System
537 Technologies (MST-2015), May 2015.
538 http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
539
540[2] P. Valente and M. Andreolini, "Improving Application
541 Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
542 the 5th Annual International Systems and Storage Conference
543 (SYSTOR '12), June 2012.
544 Slightly extended version:
545 http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
546 results.pdf