| CFQ (Complete Fairness Queueing) |
| =============================== |
| |
| The main aim of CFQ scheduler is to provide a fair allocation of the disk |
| I/O bandwidth for all the processes which requests an I/O operation. |
| |
| CFQ maintains the per process queue for the processes which request I/O |
| operation(syncronous requests). In case of asynchronous requests, all the |
| requests from all the processes are batched together according to their |
| process's I/O priority. |
| |
| CFQ ioscheduler tunables |
| ======================== |
| |
| slice_idle |
| ---------- |
| This specifies how long CFQ should idle for next request on certain cfq queues |
| (for sequential workloads) and service trees (for random workloads) before |
| queue is expired and CFQ selects next queue to dispatch from. |
| |
| By default slice_idle is a non-zero value. That means by default we idle on |
| queues/service trees. This can be very helpful on highly seeky media like |
| single spindle SATA/SAS disks where we can cut down on overall number of |
| seeks and see improved throughput. |
| |
| Setting slice_idle to 0 will remove all the idling on queues/service tree |
| level and one should see an overall improved throughput on faster storage |
| devices like multiple SATA/SAS disks in hardware RAID configuration. The down |
| side is that isolation provided from WRITES also goes down and notion of |
| IO priority becomes weaker. |
| |
| So depending on storage and workload, it might be useful to set slice_idle=0. |
| In general I think for SATA/SAS disks and software RAID of SATA/SAS disks |
| keeping slice_idle enabled should be useful. For any configurations where |
| there are multiple spindles behind single LUN (Host based hardware RAID |
| controller or for storage arrays), setting slice_idle=0 might end up in better |
| throughput and acceptable latencies. |
| |
| back_seek_max |
| ------------- |
| This specifies, given in Kbytes, the maximum "distance" for backward seeking. |
| The distance is the amount of space from the current head location to the |
| sectors that are backward in terms of distance. |
| |
| This parameter allows the scheduler to anticipate requests in the "backward" |
| direction and consider them as being the "next" if they are within this |
| distance from the current head location. |
| |
| back_seek_penalty |
| ----------------- |
| This parameter is used to compute the cost of backward seeking. If the |
| backward distance of request is just 1/back_seek_penalty from a "front" |
| request, then the seeking cost of two requests is considered equivalent. |
| |
| So scheduler will not bias toward one or the other request (otherwise scheduler |
| will bias toward front request). Default value of back_seek_penalty is 2. |
| |
| fifo_expire_async |
| ----------------- |
| This parameter is used to set the timeout of asynchronous requests. Default |
| value of this is 248ms. |
| |
| fifo_expire_sync |
| ---------------- |
| This parameter is used to set the timeout of synchronous requests. Default |
| value of this is 124ms. In case to favor synchronous requests over asynchronous |
| one, this value should be decreased relative to fifo_expire_async. |
| |
| slice_async |
| ----------- |
| This parameter is same as of slice_sync but for asynchronous queue. The |
| default value is 40ms. |
| |
| slice_async_rq |
| -------------- |
| This parameter is used to limit the dispatching of asynchronous request to |
| device request queue in queue's slice time. The maximum number of request that |
| are allowed to be dispatched also depends upon the io priority. Default value |
| for this is 2. |
| |
| slice_sync |
| ---------- |
| When a queue is selected for execution, the queues IO requests are only |
| executed for a certain amount of time(time_slice) before switching to another |
| queue. This parameter is used to calculate the time slice of synchronous |
| queue. |
| |
| time_slice is computed using the below equation:- |
| time_slice = slice_sync + (slice_sync/5 * (4 - prio)). To increase the |
| time_slice of synchronous queue, increase the value of slice_sync. Default |
| value is 100ms. |
| |
| quantum |
| ------- |
| This specifies the number of request dispatched to the device queue. In a |
| queue's time slice, a request will not be dispatched if the number of request |
| in the device exceeds this parameter. This parameter is used for synchronous |
| request. |
| |
| In case of storage with several disk, this setting can limit the parallel |
| processing of request. Therefore, increasing the value can imporve the |
| performace although this can cause the latency of some I/O to increase due |
| to more number of requests. |
| |
| CFQ Group scheduling |
| ==================== |
| |
| CFQ supports blkio cgroup and has "blkio." prefixed files in each |
| blkio cgroup directory. It is weight-based and there are four knobs |
| for configuration - weight[_device] and leaf_weight[_device]. |
| Internal cgroup nodes (the ones with children) can also have tasks in |
| them, so the former two configure how much proportion the cgroup as a |
| whole is entitled to at its parent's level while the latter two |
| configure how much proportion the tasks in the cgroup have compared to |
| its direct children. |
| |
| Another way to think about it is assuming that each internal node has |
| an implicit leaf child node which hosts all the tasks whose weight is |
| configured by leaf_weight[_device]. Let's assume a blkio hierarchy |
| composed of five cgroups - root, A, B, AA and AB - with the following |
| weights where the names represent the hierarchy. |
| |
| weight leaf_weight |
| root : 125 125 |
| A : 500 750 |
| B : 250 500 |
| AA : 500 500 |
| AB : 1000 500 |
| |
| root never has a parent making its weight is meaningless. For backward |
| compatibility, weight is always kept in sync with leaf_weight. B, AA |
| and AB have no child and thus its tasks have no children cgroup to |
| compete with. They always get 100% of what the cgroup won at the |
| parent level. Considering only the weights which matter, the hierarchy |
| looks like the following. |
| |
| root |
| / | \ |
| A B leaf |
| 500 250 125 |
| / | \ |
| AA AB leaf |
| 500 1000 750 |
| |
| If all cgroups have active IOs and competing with each other, disk |
| time will be distributed like the following. |
| |
| Distribution below root. The total active weight at this level is |
| A:500 + B:250 + C:125 = 875. |
| |
| root-leaf : 125 / 875 =~ 14% |
| A : 500 / 875 =~ 57% |
| B(-leaf) : 250 / 875 =~ 28% |
| |
| A has children and further distributes its 57% among the children and |
| the implicit leaf node. The total active weight at this level is |
| AA:500 + AB:1000 + A-leaf:750 = 2250. |
| |
| A-leaf : ( 750 / 2250) * A =~ 19% |
| AA(-leaf) : ( 500 / 2250) * A =~ 12% |
| AB(-leaf) : (1000 / 2250) * A =~ 25% |
| |
| CFQ IOPS Mode for group scheduling |
| =================================== |
| Basic CFQ design is to provide priority based time slices. Higher priority |
| process gets bigger time slice and lower priority process gets smaller time |
| slice. Measuring time becomes harder if storage is fast and supports NCQ and |
| it would be better to dispatch multiple requests from multiple cfq queues in |
| request queue at a time. In such scenario, it is not possible to measure time |
| consumed by single queue accurately. |
| |
| What is possible though is to measure number of requests dispatched from a |
| single queue and also allow dispatch from multiple cfq queue at the same time. |
| This effectively becomes the fairness in terms of IOPS (IO operations per |
| second). |
| |
| If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches |
| to IOPS mode and starts providing fairness in terms of number of requests |
| dispatched. Note that this mode switching takes effect only for group |
| scheduling. For non-cgroup users nothing should change. |
| |
| CFQ IO scheduler Idling Theory |
| =============================== |
| Idling on a queue is primarily about waiting for the next request to come |
| on same queue after completion of a request. In this process CFQ will not |
| dispatch requests from other cfq queues even if requests are pending there. |
| |
| The rationale behind idling is that it can cut down on number of seeks |
| on rotational media. For example, if a process is doing dependent |
| sequential reads (next read will come on only after completion of previous |
| one), then not dispatching request from other queue should help as we |
| did not move the disk head and kept on dispatching sequential IO from |
| one queue. |
| |
| CFQ has following service trees and various queues are put on these trees. |
| |
| sync-idle sync-noidle async |
| |
| All cfq queues doing synchronous sequential IO go on to sync-idle tree. |
| On this tree we idle on each queue individually. |
| |
| All synchronous non-sequential queues go on sync-noidle tree. Also any |
| request which are marked with REQ_NOIDLE go on this service tree. On this |
| tree we do not idle on individual queues instead idle on the whole group |
| of queues or the tree. So if there are 4 queues waiting for IO to dispatch |
| we will idle only once last queue has dispatched the IO and there is |
| no more IO on this service tree. |
| |
| All async writes go on async service tree. There is no idling on async |
| queues. |
| |
| CFQ has some optimizations for SSDs and if it detects a non-rotational |
| media which can support higher queue depth (multiple requests at in |
| flight at a time), then it cuts down on idling of individual queues and |
| all the queues move to sync-noidle tree and only tree idle remains. This |
| tree idling provides isolation with buffered write queues on async tree. |
| |
| FAQ |
| === |
| Q1. Why to idle at all on queues marked with REQ_NOIDLE. |
| |
| A1. We only do tree idle (all queues on sync-noidle tree) on queues marked |
| with REQ_NOIDLE. This helps in providing isolation with all the sync-idle |
| queues. Otherwise in presence of many sequential readers, other |
| synchronous IO might not get fair share of disk. |
| |
| For example, if there are 10 sequential readers doing IO and they get |
| 100ms each. If a REQ_NOIDLE request comes in, it will be scheduled |
| roughly after 1 second. If after completion of REQ_NOIDLE request we |
| do not idle, and after a couple of milli seconds a another REQ_NOIDLE |
| request comes in, again it will be scheduled after 1second. Repeat it |
| and notice how a workload can lose its disk share and suffer due to |
| multiple sequential readers. |
| |
| fsync can generate dependent IO where bunch of data is written in the |
| context of fsync, and later some journaling data is written. Journaling |
| data comes in only after fsync has finished its IO (atleast for ext4 |
| that seemed to be the case). Now if one decides not to idle on fsync |
| thread due to REQ_NOIDLE, then next journaling write will not get |
| scheduled for another second. A process doing small fsync, will suffer |
| badly in presence of multiple sequential readers. |
| |
| Hence doing tree idling on threads using REQ_NOIDLE flag on requests |
| provides isolation from multiple sequential readers and at the same |
| time we do not idle on individual threads. |
| |
| Q2. When to specify REQ_NOIDLE |
| A2. I would think whenever one is doing synchronous write and not expecting |
| more writes to be dispatched from same context soon, should be able |
| to specify REQ_NOIDLE on writes and that probably should work well for |
| most of the cases. |