Thomas Gleixner | df78488 | 2006-01-09 20:52:33 -0800 | [diff] [blame] | 1 | |
| 2 | hrtimers - subsystem for high-resolution kernel timers |
| 3 | ---------------------------------------------------- |
| 4 | |
| 5 | This patch introduces a new subsystem for high-resolution kernel timers. |
| 6 | |
| 7 | One might ask the question: we already have a timer subsystem |
| 8 | (kernel/timers.c), why do we need two timer subsystems? After a lot of |
| 9 | back and forth trying to integrate high-resolution and high-precision |
| 10 | features into the existing timer framework, and after testing various |
| 11 | such high-resolution timer implementations in practice, we came to the |
| 12 | conclusion that the timer wheel code is fundamentally not suitable for |
| 13 | such an approach. We initially didnt believe this ('there must be a way |
| 14 | to solve this'), and spent a considerable effort trying to integrate |
| 15 | things into the timer wheel, but we failed. In hindsight, there are |
| 16 | several reasons why such integration is hard/impossible: |
| 17 | |
| 18 | - the forced handling of low-resolution and high-resolution timers in |
| 19 | the same way leads to a lot of compromises, macro magic and #ifdef |
| 20 | mess. The timers.c code is very "tightly coded" around jiffies and |
| 21 | 32-bitness assumptions, and has been honed and micro-optimized for a |
| 22 | relatively narrow use case (jiffies in a relatively narrow HZ range) |
| 23 | for many years - and thus even small extensions to it easily break |
| 24 | the wheel concept, leading to even worse compromises. The timer wheel |
| 25 | code is very good and tight code, there's zero problems with it in its |
| 26 | current usage - but it is simply not suitable to be extended for |
| 27 | high-res timers. |
| 28 | |
| 29 | - the unpredictable [O(N)] overhead of cascading leads to delays which |
| 30 | necessiate a more complex handling of high resolution timers, which |
| 31 | in turn decreases robustness. Such a design still led to rather large |
| 32 | timing inaccuracies. Cascading is a fundamental property of the timer |
| 33 | wheel concept, it cannot be 'designed out' without unevitably |
| 34 | degrading other portions of the timers.c code in an unacceptable way. |
| 35 | |
| 36 | - the implementation of the current posix-timer subsystem on top of |
| 37 | the timer wheel has already introduced a quite complex handling of |
| 38 | the required readjusting of absolute CLOCK_REALTIME timers at |
| 39 | settimeofday or NTP time - further underlying our experience by |
| 40 | example: that the timer wheel data structure is too rigid for high-res |
| 41 | timers. |
| 42 | |
| 43 | - the timer wheel code is most optimal for use cases which can be |
| 44 | identified as "timeouts". Such timeouts are usually set up to cover |
| 45 | error conditions in various I/O paths, such as networking and block |
| 46 | I/O. The vast majority of those timers never expire and are rarely |
| 47 | recascaded because the expected correct event arrives in time so they |
| 48 | can be removed from the timer wheel before any further processing of |
| 49 | them becomes necessary. Thus the users of these timeouts can accept |
| 50 | the granularity and precision tradeoffs of the timer wheel, and |
| 51 | largely expect the timer subsystem to have near-zero overhead. |
| 52 | Accurate timing for them is not a core purpose - in fact most of the |
| 53 | timeout values used are ad-hoc. For them it is at most a necessary |
| 54 | evil to guarantee the processing of actual timeout completions |
| 55 | (because most of the timeouts are deleted before completion), which |
| 56 | should thus be as cheap and unintrusive as possible. |
| 57 | |
| 58 | The primary users of precision timers are user-space applications that |
| 59 | utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel |
| 60 | users like drivers and subsystems which require precise timed events |
| 61 | (e.g. multimedia) can benefit from the availability of a seperate |
| 62 | high-resolution timer subsystem as well. |
| 63 | |
| 64 | While this subsystem does not offer high-resolution clock sources just |
| 65 | yet, the hrtimer subsystem can be easily extended with high-resolution |
| 66 | clock capabilities, and patches for that exist and are maturing quickly. |
| 67 | The increasing demand for realtime and multimedia applications along |
| 68 | with other potential users for precise timers gives another reason to |
| 69 | separate the "timeout" and "precise timer" subsystems. |
| 70 | |
| 71 | Another potential benefit is that such a seperation allows even more |
| 72 | special-purpose optimization of the existing timer wheel for the low |
| 73 | resolution and low precision use cases - once the precision-sensitive |
| 74 | APIs are separated from the timer wheel and are migrated over to |
| 75 | hrtimers. E.g. we could decrease the frequency of the timeout subsystem |
| 76 | from 250 Hz to 100 HZ (or even smaller). |
| 77 | |
| 78 | hrtimer subsystem implementation details |
| 79 | ---------------------------------------- |
| 80 | |
| 81 | the basic design considerations were: |
| 82 | |
| 83 | - simplicity |
| 84 | |
| 85 | - data structure not bound to jiffies or any other granularity. All the |
| 86 | kernel logic works at 64-bit nanoseconds resolution - no compromises. |
| 87 | |
| 88 | - simplification of existing, timing related kernel code |
| 89 | |
| 90 | another basic requirement was the immediate enqueueing and ordering of |
| 91 | timers at activation time. After looking at several possible solutions |
| 92 | such as radix trees and hashes, we chose the red black tree as the basic |
| 93 | data structure. Rbtrees are available as a library in the kernel and are |
| 94 | used in various performance-critical areas of e.g. memory management and |
| 95 | file systems. The rbtree is solely used for time sorted ordering, while |
| 96 | a separate list is used to give the expiry code fast access to the |
| 97 | queued timers, without having to walk the rbtree. |
| 98 | |
| 99 | (This seperate list is also useful for later when we'll introduce |
| 100 | high-resolution clocks, where we need seperate pending and expired |
| 101 | queues while keeping the time-order intact.) |
| 102 | |
| 103 | Time-ordered enqueueing is not purely for the purposes of |
| 104 | high-resolution clocks though, it also simplifies the handling of |
| 105 | absolute timers based on a low-resolution CLOCK_REALTIME. The existing |
| 106 | implementation needed to keep an extra list of all armed absolute |
| 107 | CLOCK_REALTIME timers along with complex locking. In case of |
| 108 | settimeofday and NTP, all the timers (!) had to be dequeued, the |
| 109 | time-changing code had to fix them up one by one, and all of them had to |
| 110 | be enqueued again. The time-ordered enqueueing and the storage of the |
| 111 | expiry time in absolute time units removes all this complex and poorly |
| 112 | scaling code from the posix-timer implementation - the clock can simply |
| 113 | be set without having to touch the rbtree. This also makes the handling |
| 114 | of posix-timers simpler in general. |
| 115 | |
| 116 | The locking and per-CPU behavior of hrtimers was mostly taken from the |
| 117 | existing timer wheel code, as it is mature and well suited. Sharing code |
| 118 | was not really a win, due to the different data structures. Also, the |
| 119 | hrtimer functions now have clearer behavior and clearer names - such as |
| 120 | hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly |
| 121 | equivalent to del_timer() and del_timer_sync()] - so there's no direct |
| 122 | 1:1 mapping between them on the algorithmical level, and thus no real |
| 123 | potential for code sharing either. |
| 124 | |
| 125 | Basic data types: every time value, absolute or relative, is in a |
| 126 | special nanosecond-resolution type: ktime_t. The kernel-internal |
| 127 | representation of ktime_t values and operations is implemented via |
| 128 | macros and inline functions, and can be switched between a "hybrid |
| 129 | union" type and a plain "scalar" 64bit nanoseconds representation (at |
| 130 | compile time). The hybrid union type optimizes time conversions on 32bit |
| 131 | CPUs. This build-time-selectable ktime_t storage format was implemented |
| 132 | to avoid the performance impact of 64-bit multiplications and divisions |
| 133 | on 32bit CPUs. Such operations are frequently necessary to convert |
| 134 | between the storage formats provided by kernel and userspace interfaces |
| 135 | and the internal time format. (See include/linux/ktime.h for further |
| 136 | details.) |
| 137 | |
| 138 | hrtimers - rounding of timer values |
| 139 | ----------------------------------- |
| 140 | |
| 141 | the hrtimer code will round timer events to lower-resolution clocks |
| 142 | because it has to. Otherwise it will do no artificial rounding at all. |
| 143 | |
| 144 | one question is, what resolution value should be returned to the user by |
| 145 | the clock_getres() interface. This will return whatever real resolution |
| 146 | a given clock has - be it low-res, high-res, or artificially-low-res. |
| 147 | |
| 148 | hrtimers - testing and verification |
| 149 | ---------------------------------- |
| 150 | |
| 151 | We used the high-resolution clock subsystem ontop of hrtimers to verify |
| 152 | the hrtimer implementation details in praxis, and we also ran the posix |
| 153 | timer tests in order to ensure specification compliance. We also ran |
| 154 | tests on low-resolution clocks. |
| 155 | |
| 156 | The hrtimer patch converts the following kernel functionality to use |
| 157 | hrtimers: |
| 158 | |
| 159 | - nanosleep |
| 160 | - itimers |
| 161 | - posix-timers |
| 162 | |
| 163 | The conversion of nanosleep and posix-timers enabled the unification of |
| 164 | nanosleep and clock_nanosleep. |
| 165 | |
| 166 | The code was successfully compiled for the following platforms: |
| 167 | |
| 168 | i386, x86_64, ARM, PPC, PPC64, IA64 |
| 169 | |
| 170 | The code was run-tested on the following platforms: |
| 171 | |
| 172 | i386(UP/SMP), x86_64(UP/SMP), ARM, PPC |
| 173 | |
| 174 | hrtimers were also integrated into the -rt tree, along with a |
| 175 | hrtimers-based high-resolution clock implementation, so the hrtimers |
| 176 | code got a healthy amount of testing and use in practice. |
| 177 | |
| 178 | Thomas Gleixner, Ingo Molnar |