Linus Walleij | 7806f60 | 2014-07-10 09:52:27 +0200 | 1 | Clock sources, Clock events, sched_clock() and delay timers |
| 2 | ----------------------------------------------------------- |
| 3 | |
| 4 | This document tries to briefly explain some basic kernel timekeeping |
| 5 | abstractions. It partly pertains to the drivers usually found in |
| 6 | drivers/clocksource in the kernel tree, but the code may be spread out |
| 7 | across the kernel. |
| 8 | |
| 9 | If you grep through the kernel source you will find a number of architecture- |
| 10 | specific implementations of clock sources, clockevents and several likewise |
| 11 | architecture-specific overrides of the sched_clock() function and some |
| 12 | delay timers. |
| 13 | |
| 14 | In the timekeeping for your platform, the clock source provides |
| 15 | the basic timeline, whereas clock events fire interrupts at certain points |
| 16 | on this timeline, providing facilities such as high-resolution timers. |
| 17 | sched_clock() is used for scheduling and timestamping, and delay timers |
| 18 | provide an accurate delay source using hardware counters. |
| 19 | |
| 20 | |
| 21 | Clock sources |
| 22 | ------------- |
| 23 | |
| 24 | The purpose of the clock source is to provide a timeline for the system that |
| 25 | tells you where you are in time. For example, issuing the command 'date' on |
| 26 | a Linux system will eventually read the clock source to determine exactly |
| 27 | what time it is. |
| 28 | |
| 29 | Typically the clock source is a monotonic, atomic counter which will provide |
| 30 | n bits which count from 0 to (2^n)-1 and then wrap around to 0 and start over. |
| 31 | It will ideally NEVER stop ticking as long as the system is running. It |
| 32 | may stop during system suspend. |
| 33 | |
| 34 | The clock source shall have as high resolution as possible, and the frequency |
| 35 | shall be as stable and correct as possible as compared to a real-world wall |
| 36 | clock. It should not move unpredictably back and forth in time or miss a few |
| 37 | cycles here and there. |
| 38 | |
| 39 | It must be immune to the kind of effects that occur in hardware where, e.g., |
| 40 | the counter register is read in two phases on the bus, lowest 16 bits first |
| 41 | and the higher 16 bits in a second bus cycle, with the counter bits |
| 42 | potentially being updated in between, leading to the risk of very strange |
| 43 | values from the counter. |
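
One common software guard against such a torn read, sketched here in plain
userspace C, is to read the high half, then the low half, then the high half
again, and retry if the high half changed. The names fake_timer and
read_counter32() are hypothetical and only serve the illustration. A real
driver would access I/O registers through the proper accessors, and some
hardware latches the value so that no retry loop is needed.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit counter exposed as two 16-bit registers. */
struct fake_timer {
	volatile uint16_t lo;
	volatile uint16_t hi;
};

/*
 * Read the high half, the low half, then the high half again. If the
 * high half changed, the low half wrapped between the two bus accesses
 * and the pair is inconsistent, so try again.
 */
static uint32_t read_counter32(const struct fake_timer *t)
{
	uint16_t hi, lo;

	do {
		hi = t->hi;
		lo = t->lo;
	} while (hi != t->hi);

	return ((uint32_t)hi << 16) | lo;
}
```

The loop normally runs once. It only repeats in the rare case where the low
half rolls over between the two reads of the high half.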
| 44 | |
| 45 | When the wall-clock accuracy of the clock source isn't satisfactory, there |
| 46 | are various quirks and layers in the timekeeping code for e.g. synchronizing |
| 47 | the user-visible time to RTC clocks in the system or against networked time |
| 48 | servers using NTP, but all they do basically is update an offset against |
| 49 | the clock source, which provides the fundamental timeline for the system. |
| 50 | These measures do not affect the clock source per se, they only adapt the |
| 51 | system to the shortcomings of it. |
| 52 | |
| 53 | The clock source struct shall provide means to translate the provided counter |
| 54 | into a nanosecond value as an unsigned long long (unsigned 64 bit) number. |
| 55 | Since this operation may be invoked very often, doing this in a strict |
| 56 | mathematical sense is not desirable: instead the number is taken as close as |
| 57 | possible to a nanosecond value using only the arithmetic operations |
| 58 | multiply and shift, so in clocksource_cyc2ns() you find: |
| 59 | |
| 60 | ns ~= (clocksource * mult) >> shift |
| 61 | |
| 62 | You will find a number of helper functions in the clock source code intended |
| 63 | to aid in providing these mult and shift values, such as |
| 64 | clocksource_khz2mult() and clocksource_hz2mult(), which help determine the |
| 65 | mult factor from a fixed shift, and clocksource_register_hz() and |
| 66 | clocksource_register_khz(), which help assign both shift and mult |
| 67 | factors using the frequency of the clock source as the only input. |
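
To make the mult/shift arithmetic concrete, here is a minimal userspace
sketch. The helpers hz_to_mult() and cyc_to_ns() are hypothetical names that
only mirror the idea behind the kernel helpers. They are not the kernel's
implementation, and the shift values used are arbitrary choices for the
example.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * For a fixed shift, pick mult so that (cycles * mult) >> shift is
 * approximately the elapsed nanoseconds. Adding hz / 2 before the
 * division rounds to the nearest integer.
 */
static uint32_t hz_to_mult(uint32_t hz, uint32_t shift)
{
	uint64_t tmp = NSEC_PER_SEC << shift;

	tmp += hz / 2;
	return (uint32_t)(tmp / hz);
}

/* Convert a cycle count to nanoseconds using only multiply and shift. */
static uint64_t cyc_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
	return (cycles * mult) >> shift;
}
```

With a 100 MHz counter and a shift of 24, hz_to_mult() yields 167772160, and
cyc_to_ns(100, 167772160, 24) gives 1000 ns. Note that the product
cycles * mult can overflow 64 bits for large cycle counts, so a conversion
like this is meant for reasonably short deltas.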
| 68 | |
| 69 | For really simple clock sources accessed from a single I/O memory location |
| 70 | there is nowadays even clocksource_mmio_init() which will take a memory |
| 71 | location, bit width, a parameter telling whether the counter in the |
| 72 | register counts up or down, and the timer clock rate, and then conjure all |
| 73 | necessary parameters. |
| 74 | |
| 75 | Since a 32-bit counter at say 100 MHz will wrap around to zero after some 43 |
| 76 | seconds, the code handling the clock source will have to compensate for this. |
| 77 | That is the reason why the clock source struct also contains a 'mask' |
| 78 | member telling how many bits of the source are valid. This way the timekeeping |
| 79 | code knows when the counter will wrap around and can insert the necessary |
| 80 | compensation code on both sides of the wrap point so that the system timeline |
| 81 | remains monotonic. |
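
The role of the mask can be sketched as follows: the difference between two
counter reads is taken modulo the counter width, so a read taken after the
wrap still yields the right number of elapsed cycles. The function names
below are hypothetical. In the kernel, a mask is typically built with the
CLOCKSOURCE_MASK() macro.

```c
#include <stdint.h>

/* Build a mask with the lowest 'bits' bits set (valid for 1..64). */
static uint64_t counter_mask(unsigned int bits)
{
	return bits < 64 ? (1ULL << bits) - 1 : ~0ULL;
}

/*
 * Cycles elapsed between two reads of a counter. Unsigned subtraction
 * followed by masking gives the right answer across a single wrap.
 */
static uint64_t cycles_delta(uint64_t now, uint64_t last, uint64_t mask)
{
	return (now - last) & mask;
}
```

For a 32-bit counter, a previous read of 0xFFFFFFF0 and a current read of
0x10 give a delta of 0x20 cycles. This only works if the counter is sampled
at least once per wrap period.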
| 82 | |
| 83 | |
| 84 | Clock events |
| 85 | ------------ |
| 86 | |
| 87 | Clock events are the conceptual reverse of clock sources: they take a |
| 88 | desired time specification value and calculate the values to poke into |
| 89 | hardware timer registers. |
| 90 | |
| 91 | Clock events are orthogonal to clock sources. The same hardware |
| 92 | and register range may be used for the clock event, but it is essentially |
| 93 | a different thing. The hardware driving clock events has to be able to |
| 94 | fire interrupts, so as to trigger events on the system timeline. On an SMP |
| 95 | system, it is ideal (and customary) to have one such event-driving timer per |
| 96 | CPU core, so that each core can trigger events independently of any other |
| 97 | core. |
| 98 | |
| 99 | You will notice that the clock event device code is based on the same basic |
| 100 | idea about translating counters to nanoseconds using mult and shift |
| 101 | arithmetic, and you find the same family of helper functions again for |
| 102 | assigning these values. The clock event driver does not need a 'mask' |
| 103 | attribute however: the system will not try to plan events beyond the time |
| 104 | horizon of the clock event. |
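
The reverse direction can be sketched in the same way: a requested delay in
nanoseconds is clamped to what the device can handle and then converted to
timer cycles with multiply and shift. The struct and function names below are
hypothetical and only loosely modeled on the clock event device. The sketch
runs in userspace and does not touch any hardware.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical, simplified stand-in for a clock event device. */
struct fake_clock_event {
	uint32_t mult;
	uint32_t shift;
	uint64_t min_delta_ns;	/* shortest programmable delay */
	uint64_t max_delta_ns;	/* time horizon of the device */
};

/* For a fixed shift, pick mult so that (ns * mult) >> shift ~= cycles. */
static uint32_t hz_to_evt_mult(uint32_t hz, uint32_t shift)
{
	return (uint32_t)(((uint64_t)hz << shift) / NSEC_PER_SEC);
}

/* Clamp the request to the device limits, then convert ns to cycles. */
static uint64_t ns_to_timer_cycles(const struct fake_clock_event *ev,
				   uint64_t delta_ns)
{
	if (delta_ns < ev->min_delta_ns)
		delta_ns = ev->min_delta_ns;
	if (delta_ns > ev->max_delta_ns)
		delta_ns = ev->max_delta_ns;

	return (delta_ns * ev->mult) >> ev->shift;
}
```

Because requests are clamped to max_delta_ns, the value written to the
hardware never exceeds the counter range, which is why no mask is needed
here.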
| 105 | |
| 106 | |
| 107 | sched_clock() |
| 108 | ------------- |
| 109 | |
| 110 | In addition to the clock sources and clock events there is a special weak |
| 111 | function in the kernel called sched_clock(). This function shall return the |
| 112 | number of nanoseconds since the system was started. An architecture may or |
| 113 | may not provide an implementation of sched_clock() on its own. If a local |
| 114 | implementation is not provided, the system jiffy counter will be used as |
| 115 | sched_clock(). |
| 116 | |
| 117 | As the name suggests, sched_clock() is used for scheduling the system, |
| 118 | determining the absolute timeslice for a certain process in the CFS scheduler |
| 119 | for example. It is also used for printk timestamps when you have selected to |
| 120 | include time information in printk for things like bootcharts. |
| 121 | |
| 122 | Compared to clock sources, sched_clock() has to be very fast: it is called |
| 123 | much more often, especially by the scheduler. If you have to trade off |
| 124 | accuracy against speed, you may sacrifice accuracy relative to the clock |
| 125 | source for speed in sched_clock(). It does, however, require some of the same |
| 126 | basic characteristics as the clock source, i.e. it should be monotonic. |
| 127 | |
| 128 | The sched_clock() function may wrap only on unsigned long long boundaries, |
| 129 | i.e. after 64 bits. Since this is a nanosecond value this will mean it wraps |
| 130 | after circa 585 years. (For most practical systems this means "never".) |
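
The wrap-around figures quoted in this document can be checked with a few
lines of integer arithmetic. The helper names are hypothetical, and a year is
taken as 365 days for simplicity.

```c
#include <stdint.h>

/* Whole seconds until an n-bit counter ticking at 'hz' wraps around. */
static uint64_t wrap_seconds(unsigned int bits, uint64_t hz)
{
	uint64_t max = bits < 64 ? (1ULL << bits) - 1 : ~0ULL;

	return max / hz;
}

/* Whole 365-day years contained in a number of seconds. */
static uint64_t seconds_to_years(uint64_t seconds)
{
	return seconds / (365ULL * 24 * 60 * 60);
}
```

A 64-bit nanosecond value (1 GHz tick rate) yields 584 whole years, i.e. just
under 585, and a 32-bit counter at 100 MHz yields 42 whole seconds, i.e. just
under 43.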
| 131 | |
| 132 | If an architecture does not provide its own implementation of this function, |
| 133 | it will fall back to using jiffies, limiting its resolution to 1/HZ seconds, |
| 134 | where HZ is the jiffy frequency of the architecture. This will affect |
| 135 | scheduling accuracy and will likely show up in system benchmarks. |
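
The effect of the jiffies fallback on resolution can be sketched like this.
The HZ value of 100 is an assumption for the example, and
jiffies_to_sched_ns() is a hypothetical name. The sketch only illustrates the
arithmetic of scaling a tick count to nanoseconds.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define HZ 100	/* assumed tick rate for this example */

/*
 * A jiffies-based clock can only advance in steps of one tick,
 * i.e. NSEC_PER_SEC / HZ nanoseconds at a time.
 */
static uint64_t jiffies_to_sched_ns(uint64_t jiffies_since_boot)
{
	return jiffies_since_boot * (NSEC_PER_SEC / HZ);
}
```

With HZ set to 100, the returned value moves in steps of 10 ms, so two events
that are 3 ms apart may well get the same timestamp.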
| 136 | |
| 137 | The clock driving sched_clock() may stop or reset to zero during system |
| 138 | suspend/sleep. This does not matter for its purpose of scheduling |
| 139 | events on the system. However, it may result in interesting timestamps in |
| 140 | printk(). |
| 141 | |
| 142 | The sched_clock() function should be callable in any context, including IRQ |
| 143 | and NMI context, and should always return a sane value. |
| 144 | |
| 145 | Some architectures may have a limited set of time sources and lack a nice |
| 146 | counter to derive a 64-bit nanosecond value, so for example on the ARM |
| 147 | architecture, special helper functions have been created to provide a |
| 148 | sched_clock() nanosecond base from a 16- or 32-bit counter. Sometimes the |
| 149 | same counter that is also used as clock source is used for this purpose. |
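
The general idea behind such helpers can be sketched as follows: keep a
64-bit nanosecond epoch together with the raw counter value at that epoch,
and add the masked, scaled delta on every read. The struct and function names
are hypothetical, and the sketch leaves out the locking that a real
implementation needs to stay safe against concurrent updates.

```c
#include <stdint.h>

/* Hypothetical state for extending a narrow counter to 64-bit ns. */
struct fake_sched_clock {
	uint64_t epoch_ns;	/* nanoseconds accumulated at last update */
	uint64_t epoch_cyc;	/* raw counter value at last update */
	uint64_t mask;		/* valid bits of the hardware counter */
	uint32_t mult;
	uint32_t shift;
};

/* Nanoseconds since start for raw counter value 'cyc'. */
static uint64_t fake_sched_clock_read(const struct fake_sched_clock *sc,
				      uint64_t cyc)
{
	uint64_t delta = (cyc - sc->epoch_cyc) & sc->mask;

	return sc->epoch_ns + ((delta * sc->mult) >> sc->shift);
}

/* Move the epoch forward; must run more often than the counter wraps. */
static void fake_sched_clock_update(struct fake_sched_clock *sc, uint64_t cyc)
{
	sc->epoch_ns = fake_sched_clock_read(sc, cyc);
	sc->epoch_cyc = cyc & sc->mask;
}
```

As long as the update function runs at least once per wrap period of the
hardware counter, the returned nanosecond value keeps increasing even though
the 16- or 32-bit counter itself wraps many times.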
| 150 | |
| 151 | On SMP systems, it is crucial for performance that sched_clock() can be called |
| 152 | independently on each CPU without any synchronization performance hits. |
| 153 | Some hardware (such as the x86 TSC) will cause the sched_clock() function to |
| 154 | drift between the CPUs on the system. The kernel can work around this by |
| 155 | enabling the CONFIG_HAVE_UNSTABLE_SCHED_CLOCK option. This is another aspect |
| 156 | that makes sched_clock() different from the ordinary clock source. |
| 157 | |
| 158 | |
| 159 | Delay timers (some architectures only) |
| 160 | -------------------------------------- |
| 161 | |
| 162 | On systems with variable CPU frequency, the various kernel delay() functions |
| 163 | will sometimes behave strangely. Basically these delays usually use a hard |
| 164 | loop to delay a certain number of jiffy fractions using a "lpj" (loops per |
| 165 | jiffy) value, calibrated on boot. |
| 166 | |
| 167 | Let's hope that your system is running on maximum frequency when this value |
| 168 | is calibrated: as a consequence, when the frequency is geared down to half the |
| 169 | full frequency, any delay() will take twice as long. Usually this does not |
| 170 | hurt, as you're commonly requesting that amount of delay *or more*. But |
| 171 | basically the semantics are quite unpredictable on such systems. |
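
The doubling can be worked through with a bit of arithmetic. The helpers
below are hypothetical and model only the loop counting, not the actual
calibration or delay code. The lpj and HZ values in the note after the code
are made-up example numbers.

```c
#include <stdint.h>

/* Loops needed for a delay of 'usecs', given the lpj calibrated at boot. */
static uint64_t delay_loops(uint64_t usecs, uint64_t lpj, uint64_t hz)
{
	return usecs * lpj * hz / 1000000ULL;
}

/*
 * Microseconds those loops really take if the CPU currently manages
 * 'cur_lpj' loops per jiffy.
 */
static uint64_t actual_usecs(uint64_t loops, uint64_t cur_lpj, uint64_t hz)
{
	return loops * 1000000ULL / (cur_lpj * hz);
}
```

With lpj calibrated to 5000000 at HZ = 100, a 10 us delay needs 5000 loops.
At full speed those loops take 10 us, but with the CPU at half frequency
(2500000 loops per jiffy) the same 5000 loops take 20 us.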
| 172 | |
| 173 | Enter timer-based delays. Using these, a timer read may be used instead of |
| 174 | a hard-coded loop for providing the desired delay. |
| 175 | |
| 176 | This is done by declaring a struct delay_timer and assigning the appropriate |
| 177 | function pointers and rate settings for this delay timer. |
| 178 | |
| 179 | This is available on some architectures like OpenRISC or ARM. |
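
A timer-based delay can be sketched as below. The struct fake_delay_timer is
a hypothetical userspace stand-in, loosely modeled on the idea of a read
function pointer plus a rate, and fake_read() simulates a free-running
counter. Treat all names and fields here as assumptions made for the
example.

```c
#include <stdint.h>

/* Hypothetical stand-in: a counter read function plus its rate in Hz. */
struct fake_delay_timer {
	unsigned long (*read_current_timer)(void);
	unsigned long freq;
};

/* Simulated free-running counter that advances 3 ticks per read. */
static unsigned long fake_ticks;

static unsigned long fake_read(void)
{
	fake_ticks += 3;
	return fake_ticks;
}

/*
 * Busy-wait until at least 'usecs' worth of timer cycles have passed.
 * Returns the number of cycles actually waited. The unsigned
 * subtraction keeps working if the counter wraps during the wait.
 */
static unsigned long timer_udelay(const struct fake_delay_timer *dt,
				  unsigned long usecs)
{
	unsigned long cycles =
		(unsigned long)((uint64_t)usecs * dt->freq / 1000000ULL);
	unsigned long start = dt->read_current_timer();
	unsigned long now;

	do {
		now = dt->read_current_timer();
	} while (now - start < cycles);

	return now - start;
}
```

Since the wait is measured against a counter with a fixed rate, the delay no
longer depends on the current CPU frequency.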