Vivek Goyal | 6d6ac1c | 2010-08-23 12:25:29 +0200
CFQ ioscheduler tunables
========================

slice_idle
----------
This specifies how long CFQ should idle, waiting for the next request, on
certain cfq queues (for sequential workloads) and service trees (for random
workloads) before the queue is expired and CFQ selects the next queue to
dispatch from.

By default slice_idle is a non-zero value, which means that by default we
idle on queues/service trees. This can be very helpful on highly seeky media
like single spindle SATA/SAS disks, where we can cut down on the overall
number of seeks and see improved throughput.

Setting slice_idle to 0 removes all idling at the queue/service tree level,
and one should see overall improved throughput on faster storage devices
like multiple SATA/SAS disks in a hardware RAID configuration. The downside
is that the isolation provided from WRITES also goes down and the notion of
IO priority becomes weaker.

So depending on storage and workload, it might be useful to set slice_idle=0.
In general I think that for SATA/SAS disks and software RAID of SATA/SAS
disks, keeping slice_idle enabled should be useful. For any configuration
where there are multiple spindles behind a single LUN (host based hardware
RAID controller or storage arrays), setting slice_idle=0 might result in
better throughput and acceptable latencies.
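
As a rough illustration of how the tunable can be inspected and changed, here
is a minimal sketch. It assumes a kernel where CFQ is the active scheduler for
the device, in which case the value is exposed under
/sys/block/<device>/queue/iosched/slice_idle. The helper names and the
sysfs_root parameter are hypothetical and exist only so the sketch can be
tried against a scratch directory.

```python
from pathlib import Path


def slice_idle_path(device, sysfs_root="/sys"):
    """Return the sysfs file holding slice_idle for a block device."""
    return Path(sysfs_root) / "block" / device / "queue" / "iosched" / "slice_idle"


def get_slice_idle(device, sysfs_root="/sys"):
    """Read the current slice_idle value as an integer."""
    return int(slice_idle_path(device, sysfs_root).read_text().strip())


def set_slice_idle(device, value, sysfs_root="/sys"):
    """Write a new slice_idle value; 0 disables idling.

    On a real system this needs root privileges.
    """
    if value < 0:
        raise ValueError("slice_idle must be non-negative")
    slice_idle_path(device, sysfs_root).write_text(f"{value}\n")
```

For example, set_slice_idle("sda", 0) would disable idling on sda, where "sda"
is only a placeholder device name.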

CFQ IOPS Mode for group scheduling
==================================
The basic CFQ design is to provide priority based time slices. A higher
priority process gets a bigger time slice and a lower priority process gets a
smaller time slice. Measuring time becomes harder if the storage is fast and
supports NCQ, where it is better to dispatch multiple requests from multiple
cfq queues in the request queue at a time. In such a scenario, it is not
possible to accurately measure the time consumed by a single queue.

What is possible, though, is to measure the number of requests dispatched
from a single queue while also allowing dispatch from multiple cfq queues at
the same time. This effectively becomes fairness in terms of IOPS (IO
operations per second).

If one sets slice_idle=0 and the storage supports NCQ, CFQ internally
switches to IOPS mode and starts providing fairness in terms of the number of
requests dispatched. Note that this mode switching takes effect only for
group scheduling. For non-cgroup users nothing should change.
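
The switching rule above can be written down as a small toy model. This is
only a sketch of the rule as described in this section, not the kernel code.
The function names, the "iops"/"time" labels and the charge() helper are
hypothetical.

```python
def fairness_mode(slice_idle, supports_ncq, group_scheduling):
    """Return "iops" or "time" following the rule described above."""
    # IOPS mode needs all three conditions: group scheduling in use,
    # idling disabled, and storage that supports NCQ.
    if group_scheduling and slice_idle == 0 and supports_ncq:
        return "iops"
    return "time"


def charge(mode, time_used_ms, requests_dispatched):
    """Amount charged to a group: request count in IOPS mode, else time."""
    return requests_dispatched if mode == "iops" else time_used_ms
```

With slice_idle=0 on NCQ capable storage, a group that dispatched 5 requests
would be charged 5 units in this model, however long those requests took.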

Vivek Goyal | 4931402 | 2011-08-05 09:42:20 +0200
CFQ IO scheduler Idling Theory
==============================
Idling on a queue is primarily about waiting for the next request to arrive
on the same queue after completion of a request. While idling, CFQ will not
dispatch requests from other cfq queues even if requests are pending there.

The rationale behind idling is that it can cut down on the number of seeks
on rotational media. For example, if a process is doing dependent
sequential reads (the next read is issued only after completion of the
previous one), then not dispatching requests from other queues should help,
as we do not move the disk head and keep dispatching sequential IO from
one queue.
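
To make the seek argument concrete, here is a toy count of head movements for
two sequential readers. It is a simplified sketch: sector numbers are made up,
and a "seek" is assumed to be any request that does not start right after the
previous one.

```python
def count_seeks(sectors):
    """Count requests that do not directly follow the previous sector."""
    return sum(1 for prev, cur in zip(sectors, sectors[1:]) if cur != prev + 1)


# Two processes, each reading 4 consecutive sectors in different disk areas.
reader_a = [0, 1, 2, 3]
reader_b = [1000, 1001, 1002, 1003]

# Without idling, requests from both queues get interleaved.
interleaved = [s for pair in zip(reader_a, reader_b) for s in pair]

# With idling, one queue is served to completion before the other.
batched = reader_a + reader_b

print(count_seeks(interleaved))  # 7
print(count_seeks(batched))      # 1
```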

CFQ has the following service trees, and the various queues are put on
these trees.

    sync-idle    sync-noidle    async

All cfq queues doing synchronous sequential IO go on the sync-idle tree.
On this tree we idle on each queue individually.

All synchronous non-sequential queues go on the sync-noidle tree. Also, any
requests which are marked with REQ_NOIDLE go on this service tree. On this
tree we do not idle on individual queues; instead we idle on the whole group
of queues, i.e. the tree. So if there are 4 queues waiting for IO to
dispatch, we will idle only once the last queue has dispatched its IO and
there is no more IO on this service tree.

All async writes go on the async service tree. There is no idling on async
queues.

CFQ has some optimizations for SSDs. If it detects non-rotational media
which can support a higher queue depth (multiple requests in flight at a
time), then it cuts down on idling of individual queues, all the queues
move to the sync-noidle tree, and only tree idle remains. This tree idling
provides isolation from buffered write queues on the async tree.
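
The placement and idling rules above can be summarized in a short sketch. It
is a toy model of the description in this section, not the kernel
implementation. The function and parameter names are hypothetical, and it
assumes that "all the queues" in the SSD case means the synchronous queues,
with async writes staying on the async tree.

```python
def service_tree(is_sync, is_sequential, req_noidle=False,
                 ssd_with_queue_depth=False):
    """Pick the service tree for a cfq queue, per the rules described above."""
    if not is_sync:
        return "async"
    if ssd_with_queue_depth or req_noidle or not is_sequential:
        return "sync-noidle"
    return "sync-idle"


def idles_after_dispatch(tree, queues_left_on_tree):
    """Whether CFQ idles after a queue on `tree` has dispatched its IO."""
    if tree == "sync-idle":
        return True                      # idle on every queue individually
    if tree == "sync-noidle":
        return queues_left_on_tree == 0  # idle once, on the tree as a whole
    return False                         # no idling on async queues
```

With 4 queues on the sync-noidle tree, idles_after_dispatch() is false for the
first three and true only after the last one has dispatched.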

FAQ
===
Q1. Why idle at all on queues marked with REQ_NOIDLE?

A1. We only do tree idle (all queues on the sync-noidle tree) on queues
marked with REQ_NOIDLE. This helps in providing isolation from all the
sync-idle queues. Otherwise, in the presence of many sequential readers,
other synchronous IO might not get its fair share of disk.

For example, suppose there are 10 sequential readers doing IO and they get
100ms each. If a REQ_NOIDLE request comes in, it will be scheduled roughly
after 1 second. If we do not idle after completion of the REQ_NOIDLE
request, and another REQ_NOIDLE request comes in a couple of milliseconds
later, it will again be scheduled after 1 second. Repeat this and notice
how a workload can lose its disk share and suffer due to multiple
sequential readers.
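
The arithmetic behind this example can be spelled out as below. This is only a
back-of-the-envelope sketch: it assumes every sequential reader uses its full
slice and ignores the service time of the REQ_NOIDLE requests themselves. The
function names are hypothetical.

```python
def wait_for_next_turn_ms(num_sequential_readers, slice_ms):
    """Wait until all sequential readers have used one full slice each."""
    return num_sequential_readers * slice_ms


def time_for_dependent_requests_ms(n_requests, num_readers, slice_ms):
    """Total wait when each dependent request has to sit out a full round."""
    return n_requests * wait_for_next_turn_ms(num_readers, slice_ms)


print(wait_for_next_turn_ms(10, 100))              # 1000
print(time_for_dependent_requests_ms(3, 10, 100))  # 3000
```

So without tree idling, three dependent REQ_NOIDLE requests would take about
3 seconds in this model, although each of them is small.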

fsync can generate dependent IO, where a bunch of data is written in the
context of fsync and some journaling data is written later. The journaling
data comes in only after fsync has finished its IO (at least for ext4
that seemed to be the case). Now if one decides not to idle on the fsync
thread due to REQ_NOIDLE, then the next journaling write will not get
scheduled for another second. A process doing small fsyncs will suffer
badly in the presence of multiple sequential readers.

Hence doing tree idling on threads using the REQ_NOIDLE flag on requests
provides isolation from multiple sequential readers, while at the same
time we do not idle on individual threads.

Q2. When to specify REQ_NOIDLE?
A2. I would think that whenever one is doing a synchronous write and is not
expecting more writes to be dispatched from the same context soon, one
should be able to specify REQ_NOIDLE on writes, and that should probably
work well for most cases.