Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au>    13 Sep 2003

Attention! Database servers, especially those using "TCQ" disks, should
investigate performance with the 'deadline' IO scheduler. In fact, any
system with high disk performance requirements should do so.

If you see unusual performance characteristics of your disk systems, or big
performance regressions versus the deadline scheduler, please email me.
Database users needn't bother unless you're willing to test a lot of patches
from me ;) it's a known issue.

Also, users with hardware RAID controllers doing striping may find highly
variable performance results when using the as-iosched. The as-iosched
anticipation logic is based on the notion that a disk device has only one
physical seeking head, whereas a striped RAID controller actually has a
head for each physical device in the logical RAID device.

However, setting antic_expire (see tunable parameters below) to zero
produces behavior very similar to the deadline IO scheduler.

Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an IO scheduler on a per-device basis.

Anticipatory IO scheduler policies
----------------------------------
The as-iosched implementation applies several layers of policy to determine
when an IO request is dispatched to the disk controller. The policies are
outlined below, in order of application.

1. One-way elevator algorithm.

The elevator algorithm is similar to that used in the deadline scheduler,
with the addition that it allows limited backward movement of the elevator
(i.e. backward seeks). A backward seek can occur when choosing between two
IO requests, where one is behind the elevator's current position and the
other is in front of it. If the seek distance to the request behind the
elevator is less than half the seek distance to the request in front of the
elevator, then the request behind can be chosen. Backward seeks are also
limited to a maximum of MAXBACK (1024*1024) sectors. This favors forward
movement of the elevator, while allowing opportunistic "short" backward
seeks.
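
As a rough illustration of this rule, here is a minimal user-space sketch
of the backward-seek decision. The function name choose_req and the
simplified request structure are assumptions for illustration; only MAXBACK
comes from the description above.

    #include <stdio.h>

    #define MAXBACK (1024 * 1024)  /* maximum backward seek, in sectors */

    struct request { long sector; };

    /* Decide between the nearest request behind the head and the
     * nearest request in front of it. The request behind wins only
     * if it is closer than half the forward distance and within
     * MAXBACK sectors of the current head position. */
    static struct request *choose_req(long head, struct request *back,
                                      struct request *front)
    {
        long back_dist  = head - back->sector;
        long front_dist = front->sector - head;

        if (back_dist <= MAXBACK && back_dist * 2 < front_dist)
            return back;
        return front;
    }

    int main(void)
    {
        struct request b = { 900 }, f = { 1500 };

        /* back distance 100 is within MAXBACK and less than half of
         * the front distance 500, so the backward seek is taken */
        printf("dispatch sector %ld\n", choose_req(1000, &b, &f)->sector);
        return 0;
    }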

2. FIFO expiration times for reads and for writes.

This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists are tunable using the parameters
read_expire and write_expire discussed below. When a read or a write
expires in this way, the IO scheduler will interrupt its current elevator
sweep or read anticipation to service the expired request.
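
A minimal sketch of the expiry test this implies, assuming millisecond
timestamps and a singly linked FIFO; the names and types here are
illustrative, not the kernel's.

    #include <stdbool.h>
    #include <stdio.h>

    struct fifo_req {
        long expires;           /* enqueue time + {read,write}_expire, ms */
        struct fifo_req *next;
    };

    /* The request at the head of a FIFO is the oldest one, so only it
     * needs checking. Returns true if the current elevator sweep (or
     * read anticipation) should be interrupted to service it. */
    static bool fifo_head_expired(const struct fifo_req *head, long now)
    {
        return head && now >= head->expires;
    }

    int main(void)
    {
        struct fifo_req r = { .expires = 100, .next = 0 };

        printf("expired at t=150: %d\n", fifo_head_expired(&r, 150));
        return 0;
    }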

3. Read and write request batching.

A batch is a collection of read requests or a collection of write requests.
The as scheduler alternates dispatching read and write batches to the
driver. In the case of a read batch, the scheduler submits read requests to
the driver as long as there are read requests to submit, and the read batch
time limit has not been exceeded (read_batch_expire). The read batch time
limit begins counting down only when there are competing write requests
pending.

In the case of a write batch, the scheduler submits write requests to the
driver as long as there are write requests available, and the write batch
time limit has not been exceeded (write_batch_expire). However, the length
of write batches will be gradually shortened when read batches frequently
exceed their time limit.

When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the next
batch.

The read and write FIFO expiration times described in policy 2 above are
checked only when scheduling IO of a batch of the corresponding (read/write)
type. So, for example, the read FIFO timeout values are tested only during
read batches. Likewise, the write FIFO timeout values are tested only
during write batches. For this reason, it is generally not recommended for
the read batch time to be longer than the write expiration time, nor for
the write batch time to exceed the read expiration time (see tunable
parameters below).

When the IO scheduler changes from a read to a write batch, it begins the
elevator from the request that is on the head of the write expiration FIFO.
Likewise, when changing from a write batch to a read batch, the scheduler
begins the elevator from the first entry on the read expiration FIFO.
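
The batching rules above can be condensed into a small sketch. The state
layout and the names batch_state and should_switch are assumptions for
illustration, not the scheduler's actual data structures.

    #include <stdbool.h>
    #include <stdio.h>

    enum batch_dir { BATCH_READ, BATCH_WRITE };

    struct batch_state {
        enum batch_dir dir;
        long now, batch_start;            /* times in ms */
        long read_batch_expire, write_batch_expire;
        int  in_flight;                   /* dispatched, not yet completed */
        bool reads_pending, writes_pending;
    };

    static bool batch_expired(const struct batch_state *s)
    {
        long limit = s->dir == BATCH_READ ? s->read_batch_expire
                                          : s->write_batch_expire;
        return s->now - s->batch_start > limit;
    }

    /* A batch ends when its direction runs out of requests or its time
     * limit is exceeded; the switch itself waits until every request
     * from the current batch has completed. */
    static bool should_switch(const struct batch_state *s)
    {
        bool exhausted = s->dir == BATCH_READ ? !s->reads_pending
                                              : !s->writes_pending;

        if (!exhausted && !batch_expired(s))
            return false;                 /* keep dispatching this batch */
        return s->in_flight == 0;         /* drain completions first */
    }

    int main(void)
    {
        struct batch_state s = {
            .dir = BATCH_READ, .now = 600, .batch_start = 0,
            .read_batch_expire = 500, .write_batch_expire = 250,
            .in_flight = 0, .reads_pending = true, .writes_pending = true,
        };

        /* read batch over its 500 ms limit with nothing in flight */
        printf("switch to writes: %d\n", should_switch(&s));
        return 0;
    }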

4. Read anticipation.

Read anticipation occurs only when scheduling a read batch. This
implementation of read anticipation allows only one read request to be
dispatched to the disk controller at a time. In contrast, many write
requests may be dispatched to the disk controller at a time during a write
batch. It is this characteristic that can make the anticipatory scheduler
perform anomalously with controllers supporting TCQ, or with hardware
striped RAID devices. Setting the antic_expire queue parameter (see below)
to zero disables this behavior, and the anticipatory scheduler behaves
essentially like the deadline scheduler.

When read anticipation is enabled (antic_expire is not zero), reads are
dispatched to the disk controller one at a time. At the end of each read
request, the IO scheduler examines its next candidate read request from its
sorted read list. If that next request is from the same process as the
request that just completed, or if the next request in the queue is "very
close" to the just completed request, it is dispatched immediately.
Otherwise, statistics (average think time, average seek distance) on the
process that submitted the just completed request are examined. If it seems
likely that the process will submit another request soon, and that request
is likely to be near the just completed request, then the IO scheduler will
stop dispatching further read requests for up to antic_expire milliseconds,
hoping that the process will submit a new request near the one that just
completed. If such a request is made, it is dispatched immediately. If the
antic_expire wait time expires, the IO scheduler will dispatch the next
read request from the sorted read queue.
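
A condensed sketch of that decision follows; the names and the "closeness"
threshold are assumptions for illustration, and the real scheduler's
heuristics are more involved.

    #include <stdbool.h>
    #include <stdio.h>

    struct proc_stats {
        long mean_think_ms;     /* mean gap between this process's reads */
        long mean_seek;         /* mean seek distance, in sectors */
    };

    enum antic_decision { DISPATCH_NOW, WAIT_FOR_READ };

    static enum antic_decision antic_decide(bool same_process, bool very_close,
                                            const struct proc_stats *st,
                                            long antic_expire_ms,
                                            long close_threshold)
    {
        if (antic_expire_ms == 0)         /* anticipation disabled */
            return DISPATCH_NOW;
        if (same_process || very_close)   /* next candidate is already good */
            return DISPATCH_NOW;

        /* Wait only if the last process is likely to submit a nearby
         * read before the anticipation window would expire. */
        if (st->mean_think_ms < antic_expire_ms &&
            st->mean_seek < close_threshold)
            return WAIT_FOR_READ;
        return DISPATCH_NOW;
    }

    int main(void)
    {
        struct proc_stats st = { .mean_think_ms = 2, .mean_seek = 64 };

        /* a quick thinker with short seeks: worth waiting for */
        printf("%s\n", antic_decide(false, false, &st, 6, 1024)
                       == WAIT_FOR_READ ? "wait" : "dispatch");
        return 0;
    }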

To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute the mean
"think time" (the time between read requests) and the mean seek distance
for that process. Note that these statistics are associated with each
process, not with a specific IO device. So, for example, if a process is
doing IO on several file systems on separate devices, the statistics will
be a combination of IO behavior from all those devices.
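
A sketch of how such per-process statistics might be maintained; the 7/8
exponential weighting here is an assumption for illustration, not the
scheduler's actual fixed-point scheme.

    #include <stdio.h>
    #include <stdlib.h>

    struct as_stats {
        long mean_think_ms;
        long mean_seek;
        long last_request_ms;
        long last_sector;
    };

    /* Think time is the gap between a process's consecutive reads;
     * seek distance is measured from its last request's position. */
    static void stats_update(struct as_stats *s, long now_ms, long sector)
    {
        long think = now_ms - s->last_request_ms;
        long seek  = labs(sector - s->last_sector);

        /* exponentially weighted means: 7/8 old, 1/8 new sample */
        s->mean_think_ms = (7 * s->mean_think_ms + think) / 8;
        s->mean_seek     = (7 * s->mean_seek + seek) / 8;
        s->last_request_ms = now_ms;
        s->last_sector     = sector;
    }

    int main(void)
    {
        struct as_stats st = { 10, 100, 0, 1000 };

        stats_update(&st, 4, 1100);   /* a quick, nearby read */
        printf("think=%ldms seek=%ld\n", st.mean_think_ms, st.mean_seek);
        return 0;
    }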


Tuning the anticipatory IO scheduler
------------------------------------
When using the 'as' (anticipatory) IO scheduler, there are five parameters
under /sys/block/*/queue/iosched/. All are specified in milliseconds. A
small user-space sketch for inspecting and setting them follows the
parameter descriptions below.

The parameters are:
* read_expire
    Controls how long until a read request becomes "expired". It also
    controls the interval at which expired requests are served, so with a
    value of 50, a request might take anywhere up to 100 ms to be serviced
    _if_ it is the next on the expired list. Obviously request expiration
    strategies won't make the disk go faster. The result basically equates
    to the timeslice a single reader gets in the presence of other IO.
    100/((seek time / read_expire) + 1) is very roughly the % streaming
    read efficiency your disk should get with multiple readers; for
    example, with an average seek time of 5 ms and read_expire of 50, this
    gives 100/(0.1 + 1), i.e. roughly 91%.

* read_batch_expire
    Controls how much time a batch of reads is given before pending writes
    are served. A higher value is more efficient. This might be set below
    read_expire if writes are to be given higher priority than reads, but
    reads are to be as efficient as possible when there are no writes.
    Generally though, it should be some multiple of read_expire.

* write_expire, and
* write_batch_expire are equivalent to the above, for writes.

* antic_expire
    Controls the maximum amount of time we can anticipate a good read (one
    with a short seek distance from the most recently completed request)
    before giving up. Many other factors may cause anticipation to be
    stopped early, and some processes will not be "anticipated" at all.
    This should be a bit higher for devices with large seek times, though
    not in linear proportion - most processes have only a few ms of think
    time.
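
For example, the tunables can be inspected and set from user space as in
the sketch below. The device name "sda" is just an example; this is
equivalent to using cat and echo on the sysfs files, and writing requires
root.

    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/block/sda/queue/iosched/antic_expire";
        char buf[32];
        FILE *f;

        f = fopen(path, "r");            /* read the current value */
        if (f && fgets(buf, sizeof(buf), f))
            printf("antic_expire = %s", buf);
        if (f)
            fclose(f);

        f = fopen(path, "w");            /* 0 disables anticipation */
        if (f) {
            fputs("0\n", f);
            fclose(f);
        }
        return 0;
    }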

In addition to the tunables above, there is a read-only file named est_time
which, when read, will show:

- The probability of a task exiting without a cooperating task
  submitting an anticipated IO.

- The current mean think time.

- The seek distance used to determine if an incoming IO is better.