| 2 | Timekeeping Virtualization for X86-Based Architectures |
| 3 | |
| 4 | Zachary Amsden <zamsden@redhat.com> |
| 5 | Copyright (c) 2010, Red Hat. All rights reserved. |
| 6 | |
| 7 | 1) Overview |
| 8 | 2) Timing Devices |
| 9 | 3) TSC Hardware |
| 10 | 4) Virtualization Problems |
| 11 | |
| 12 | ========================================================================= |
| 13 | |
| 14 | 1) Overview |
| 15 | |
One of the most complicated parts of the X86 platform, and of virtualizing
this platform in particular, is the plethora of timing devices available and
the complexity of emulating those devices.  In addition, virtualization of
time introduces a new set of challenges because it imposes a multiplexed
division of time beyond the control of the guest CPU.
| 21 | |
| 22 | First, we will describe the various timekeeping hardware available, then |
| 23 | present some of the problems which arise and solutions available, giving |
| 24 | specific recommendations for certain classes of KVM guests. |
| 25 | |
| 26 | The purpose of this document is to collect data and information relevant to |
| 27 | timekeeping which may be difficult to find elsewhere, specifically, |
| 28 | information relevant to KVM and hardware-based virtualization. |
| 29 | |
| 30 | ========================================================================= |
| 31 | |
| 32 | 2) Timing Devices |
| 33 | |
| 34 | First we discuss the basic hardware devices available. TSC and the related |
| 35 | KVM clock are special enough to warrant a full exposition and are described in |
| 36 | the following section. |
| 37 | |
| 38 | 2.1) i8254 - PIT |
| 39 | |
One of the first timer devices available is the programmable interval timer,
or PIT.  The PIT has a fixed frequency 1.193182 MHz base clock and three
channels which can be programmed to deliver periodic or one-shot interrupts.
These three channels can be configured in different modes and have individual
counters.  Channels 1 and 2 were not available for general use in the original
IBM PC, and historically were connected to control RAM refresh and the PC
speaker.  Now the PIT is typically integrated as part of an emulated chipset
and a separate physical PIT is not used.
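
As a minimal sketch of how the base clock relates to a periodic interrupt
rate, the following C helper computes a 16-bit reload value for a requested
frequency.  The function name is hypothetical, and the sketch assumes the
usual 8254 convention that a reload value of 0 means 65536 counts.

```c
#include <stdint.h>

#define PIT_BASE_HZ 1193182u    /* fixed PIT input clock, from the text */

/*
 * Return the 16-bit reload value that makes a PIT channel fire at
 * roughly `hz` interrupts per second.  A return value of 0 stands for
 * 65536 counts (assumed 8254 convention), the slowest possible rate.
 */
uint16_t pit_reload_for_hz(uint32_t hz)
{
    uint32_t div;

    if (hz == 0)
        return 0;                       /* slowest rate: 65536 counts */
    div = (PIT_BASE_HZ + hz / 2) / hz;  /* round to nearest */
    if (div > 65535)
        return 0;                       /* clamp to 65536 counts */
    if (div == 0)
        div = 1;                        /* faster than the base clock */
    return (uint16_t)div;
}
```

For example, a request for 1000 Hz gives a reload value of 1193, which
corresponds to an actual rate slightly above 1000 Hz because the base clock is
not an integer multiple of the requested rate.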
| 48 | |
| 49 | The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done |
| 50 | using single or multiple byte access to the I/O ports. There are 6 modes |
| 51 | available, but not all modes are available to all timers, as only timer 2 |
| 52 | has a connected gate input, required for modes 1 and 5. The gate line is |
| 53 | controlled by port 61h, bit 0, as illustrated in the following diagram. |
| 54 | |
 --------------             ----------------
|              |           |                |
|  1.1932 MHz  |---------->| CLOCK      OUT | ---------> IRQ 0
|    Clock     |   |       |                |
 --------------    |    +->| GATE  TIMER 0  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> 66.3 KHZ DRAM
                   |       |                |            (aka /dev/null)
                   |    +->| GATE  TIMER 1  |
                   |        ----------------
                   |
                   |        ----------------
                   |       |                |
                   |------>| CLOCK      OUT | ---------> Port 61h, bit 5
                           |                |      |
Port 61h, bit 0 ---------->| GATE  TIMER 2  |       \_.----   ____
                            ----------------         _|    )--|LPF|---Speaker
                                                    / *----   \___/
Port 61h, bit 1 -----------------------------------/
| 77 | |
| 78 | The timer modes are now described. |
| 79 | |
| 80 | Mode 0: Single Timeout. This is a one-shot software timeout that counts down |
| 81 | when the gate is high (always true for timers 0 and 1). When the count |
| 82 | reaches zero, the output goes high. |
| 83 | |
Mode 1: Triggered One-shot.  The output is initially set high.  When the gate
line is set high, a countdown is initiated (which does not stop if the gate is
lowered), during which the output is set low.  When the count reaches zero,
the output goes high.
| 88 | |
| 89 | Mode 2: Rate Generator. The output is initially set high. When the countdown |
| 90 | reaches 1, the output goes low for one count and then returns high. The value |
| 91 | is reloaded and the countdown automatically resumes. If the gate line goes |
| 92 | low, the count is halted. If the output is low when the gate is lowered, the |
| 93 | output automatically goes high (this only affects timer 2). |
| 94 | |
Mode 3: Square Wave.  This generates a high / low square wave.  The count
determines the length of the pulse, which alternates between high and low
when zero is reached.  The count only proceeds when gate is high and is
automatically reloaded on reaching zero.  The count is decremented twice at
each clock to generate a full high / low cycle at the full periodic rate.
If the count is even, the output remains high for N/2 counts and low for N/2
counts; if the count is odd, the output is high for (N+1)/2 counts and low
for (N-1)/2 counts.  Only even values are latched by the counter, so odd
values are not observed when reading.  This is the intended mode for timer 2,
which generates sine-like tones by low-pass filtering the square wave output.
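
The even / odd split described for mode 3 can be sketched as follows.  The
structure and function names are illustrative only, and a reload value of 0
is assumed to mean 65536 counts.

```c
#include <stdint.h>

/* Length of the two phases of one mode 3 period, in input clock counts. */
struct pit_square_wave {
    uint32_t high_counts;
    uint32_t low_counts;
};

/*
 * Split a reload value N into its high and low phases:
 *   even N: high = N/2,     low = N/2
 *   odd N:  high = (N+1)/2, low = (N-1)/2
 */
struct pit_square_wave pit_mode3_phases(uint16_t reload)
{
    uint32_t n = reload ? reload : 65536u;  /* 0 is assumed to mean 65536 */
    struct pit_square_wave w;

    w.high_counts = (n + 1) / 2;    /* rounds up for odd N */
    w.low_counts = n / 2;           /* rounds down for odd N */
    return w;
}
```

The two phases always add up to N, so the period of the square wave is N
input clocks regardless of whether N is even or odd.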
| 105 | |
| 106 | Mode 4: Software Strobe. After programming this mode and loading the counter, |
| 107 | the output remains high until the counter reaches zero. Then the output |
| 108 | goes low for 1 clock cycle and returns high. The counter is not reloaded. |
| 109 | Counting only occurs when gate is high. |
| 110 | |
| 111 | Mode 5: Hardware Strobe. After programming and loading the counter, the |
| 112 | output remains high. When the gate is raised, a countdown is initiated |
| 113 | (which does not stop if the gate is lowered). When the counter reaches zero, |
| 114 | the output goes low for 1 clock cycle and then returns high. The counter is |
| 115 | not reloaded. |
| 116 | |
| 117 | In addition to normal binary counting, the PIT supports BCD counting. The |
| 118 | command port, 0x43 is used to set the counter and mode for each of the three |
| 119 | timers. |
| 120 | |
| 121 | PIT commands, issued to port 0x43, using the following bit encoding: |
| 122 | |
Bit 7-4: Command (See table below)
Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5; 110 and 111 alias Modes 2 and 3)
Bit 0  : Binary (0) / BCD (1)
| 126 | |
| 127 | Command table: |
| 128 | |
| 129 | 0000 - Latch Timer 0 count for port 0x40 |
| 130 | sample and hold the count to be read in port 0x40; |
| 131 | additional commands ignored until counter is read; |
| 132 | mode bits ignored. |
| 133 | |
| 134 | 0001 - Set Timer 0 LSB mode for port 0x40 |
| 135 | set timer to read LSB only and force MSB to zero; |
| 136 | mode bits set timer mode |
| 137 | |
| 138 | 0010 - Set Timer 0 MSB mode for port 0x40 |
| 139 | set timer to read MSB only and force LSB to zero; |
| 140 | mode bits set timer mode |
| 141 | |
| 142 | 0011 - Set Timer 0 16-bit mode for port 0x40 |
| 143 | set timer to read / write LSB first, then MSB; |
| 144 | mode bits set timer mode |
| 145 | |
| 146 | 0100 - Latch Timer 1 count for port 0x41 - as described above |
| 147 | 0101 - Set Timer 1 LSB mode for port 0x41 - as described above |
| 148 | 0110 - Set Timer 1 MSB mode for port 0x41 - as described above |
| 149 | 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above |
| 150 | |
1000 - Latch Timer 2 count for port 0x42 - as described above
1001 - Set Timer 2 LSB mode for port 0x42 - as described above
1010 - Set Timer 2 MSB mode for port 0x42 - as described above
1011 - Set Timer 2 16-bit mode for port 0x42 - as described above
| 155 | |
| 156 | 1101 - General counter latch |
| 157 | Latch combination of counters into corresponding ports |
| 158 | Bit 3 = Counter 2 |
| 159 | Bit 2 = Counter 1 |
| 160 | Bit 1 = Counter 0 |
| 161 | Bit 0 = Unused |
| 162 | |
| 163 | 1110 - Latch timer status |
| 164 | Latch combination of counter mode into corresponding ports |
| 165 | Bit 3 = Counter 2 |
| 166 | Bit 2 = Counter 1 |
| 167 | Bit 1 = Counter 0 |
| 168 | |
| 169 | The output of ports 0x40-0x42 following this command will be: |
| 170 | |
Bit 7 = Output pin
Bit 6 = Count loaded (0 if timer has expired)
Bit 5-4 = Read / Write mode
    01 = LSB only
    10 = MSB only
    11 = LSB / MSB (16-bit)
Bit 3-1 = Mode
Bit 0 = Binary (0) / BCD mode (1)
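
The bit layout of the command byte and of the latched status byte can be
sketched in C as below.  The type and function names are hypothetical; the
encoding follows the command table above, where the upper two bits of the
command nibble select the timer and the lower two select latch, LSB, MSB or
16-bit access.  No port I/O is performed, so the sketch runs anywhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Lower two bits of the command nibble, as listed in the command table. */
enum pit_access {
    PIT_LATCH   = 0,    /* xx00: latch count */
    PIT_LSB     = 1,    /* xx01: LSB only */
    PIT_MSB     = 2,    /* xx10: MSB only */
    PIT_LSB_MSB = 3     /* xx11: LSB first, then MSB */
};

/* Build the byte written to port 0x43 for timer 0, 1 or 2. */
uint8_t pit_command(unsigned timer, enum pit_access access,
                    unsigned mode, bool bcd)
{
    return (uint8_t)(((timer & 3u) << 6) |
                     (((unsigned)access & 3u) << 4) |
                     ((mode & 7u) << 1) |
                     (bcd ? 1u : 0u));
}

/* Fields of the status byte read back after a "latch timer status". */
struct pit_status {
    bool output_high;   /* bit 7: state of the output pin */
    bool bit6;          /* bit 6: count flag, see the table above */
    unsigned access;    /* bits 5-4: read / write mode */
    unsigned mode;      /* bits 3-1: timer mode */
    bool bcd;           /* bit 0: BCD counting */
};

struct pit_status pit_decode_status(uint8_t s)
{
    struct pit_status st;

    st.output_high = (s & 0x80) != 0;
    st.bit6 = (s & 0x40) != 0;
    st.access = (s >> 4) & 3u;
    st.mode = (s >> 1) & 7u;
    st.bcd = (s & 0x01) != 0;
    return st;
}
```

With this encoding, programming timer 0 as a 16-bit rate generator gives the
command byte 0x34, and programming timer 2 as a 16-bit square wave gives 0xB6.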
| 179 | |
| 180 | 2.2) RTC |
| 181 | |
| 182 | The second device which was available in the original PC was the MC146818 real |
| 183 | time clock. The original device is now obsolete, and usually emulated by the |
| 184 | system chipset, sometimes by an HPET and some frankenstein IRQ routing. |
| 185 | |
The RTC is accessed through CMOS variables, which use an index register to
control which bytes are read.  Since there is only one index register, reads
of the CMOS and reads of the RTC require lock protection (in addition, it is
dangerous to allow userspace utilities such as hwclock to have direct RTC
access, as they could corrupt kernel reads and writes of CMOS memory).
| 191 | |
| 192 | The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt |
| 193 | can function as a periodic timer, an additional once a day alarm, and can issue |
| 194 | interrupts after an update of the CMOS registers by the MC146818 is complete. |
| 195 | The type of interrupt is signalled in the RTC status registers. |
| 196 | |
| 197 | The RTC will update the current time fields by battery power even while the |
| 198 | system is off. The current time fields should not be read while an update is |
| 199 | in progress, as indicated in the status register. |
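
A minimal sketch of a read that respects the update-in-progress flag is
shown below.  It is not the kernel's implementation.  The accessor type
cmos_read_fn and the in-memory fake_cmos array are assumptions made so that
the example runs without hardware; a real accessor would write the index
register and read the data port while holding the lock discussed above.

```c
#include <stdint.h>

#define RTC_SECONDS 0x00
#define RTC_MINUTES 0x02
#define RTC_HOURS   0x04
#define RTC_REG_A   0x0A
#define RTC_UIP     0x80    /* register A, bit 7: update in progress */

/* Hypothetical accessor returning the CMOS byte at index `reg`. */
typedef uint8_t (*cmos_read_fn)(uint8_t reg);

struct rtc_raw_time { uint8_t sec, min, hour; };   /* raw register values */

/* Wait until no update is in progress, then sample the time fields. */
static void rtc_sample(cmos_read_fn cmos_read, struct rtc_raw_time *t)
{
    while (cmos_read(RTC_REG_A) & RTC_UIP)
        ;   /* busy-wait for the update to finish */
    t->sec = cmos_read(RTC_SECONDS);
    t->min = cmos_read(RTC_MINUTES);
    t->hour = cmos_read(RTC_HOURS);
}

/* Sample repeatedly until two consecutive samples agree. */
struct rtc_raw_time rtc_read_time(cmos_read_fn cmos_read)
{
    struct rtc_raw_time a, b;

    rtc_sample(cmos_read, &a);
    for (;;) {
        rtc_sample(cmos_read, &b);
        if (a.sec == b.sec && a.min == b.min && a.hour == b.hour)
            return b;
        a = b;
    }
}

/* In-memory stand-in for the CMOS, for illustration only. */
uint8_t fake_cmos[128];
unsigned fake_uip_polls;    /* reads of register A that still report UIP */

uint8_t fake_cmos_read(uint8_t reg)
{
    if (reg == RTC_REG_A && fake_uip_polls > 0) {
        fake_uip_polls--;
        return fake_cmos[RTC_REG_A] | RTC_UIP;
    }
    return fake_cmos[reg & 0x7f];
}
```

Reading twice and comparing guards against an update that begins between the
check of the flag and the reads of the individual fields.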
| 200 | |
| 201 | The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be |
| 202 | programmed to a 32kHz divider if the RTC is to count seconds. |
| 203 | |
| 204 | This is the RAM map originally used for the RTC/CMOS: |
| 205 | |
Location    Size    Description
------------------------------------------
00h         byte    Current second (BCD)
01h         byte    Seconds alarm (BCD)
02h         byte    Current minute (BCD)
03h         byte    Minutes alarm (BCD)
04h         byte    Current hour (BCD)
05h         byte    Hours alarm (BCD)
06h         byte    Current day of week (BCD)
07h         byte    Current day of month (BCD)
08h         byte    Current month (BCD)
09h         byte    Current year (BCD)
0Ah         byte    Register A
                       bit 7   = Update in progress
                       bit 6-4 = Divider for clock
                                  000 = 4.194 MHz
                                  001 = 1.049 MHz
                                  010 = 32 kHz
                                  10X = test modes
                                  110 = reset / disable
                                  111 = reset / disable
                       bit 3-0 = Rate selection for periodic interrupt
                                  0000 = periodic timer disabled
                                  0001 = 3.90625 ms
                                  0010 = 7.8125 ms
                                  0011 = .122070 ms
                                  0100 = .244141 ms
                                     ...
                                  1101 = 125 ms
                                  1110 = 250 ms
                                  1111 = 500 ms
0Bh         byte    Register B
                       bit 7   = Run (0) / Halt (1)
                       bit 6   = Periodic interrupt enable
                       bit 5   = Alarm interrupt enable
                       bit 4   = Update-ended interrupt enable
                       bit 3   = Square wave output enable
                       bit 2   = BCD calendar (0) / Binary (1)
                       bit 1   = 12-hour mode (0) / 24-hour mode (1)
                       bit 0   = 0 (DST off) / 1 (DST enabled)
0Ch         byte    Register C (read only)
                       bit 7   = interrupt request flag (IRQF)
                       bit 6   = periodic interrupt flag (PF)
                       bit 5   = alarm interrupt flag (AF)
                       bit 4   = update interrupt flag (UF)
                       bit 3-0 = reserved
0Dh         byte    Register D (read only)
                       bit 7   = RTC has power
                       bit 6-0 = reserved
32h         byte    Current century BCD (*)
  (*) location vendor specific and now determined from ACPI global tables
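
As an illustration of how the fields in this map are interpreted, the sketch
below converts BCD values and derives the periodic interrupt frequency from
the rate selection bits of register A.  The function names are hypothetical,
and the rate formula assumes the 32.768 kHz time base, following the
MC146818 convention that selections 0001 and 0010 yield 256 Hz and 128 Hz.

```c
#include <stdint.h>

/* Convert one packed BCD byte (e.g. 0x59) to binary (59). */
unsigned bcd_to_bin(uint8_t bcd)
{
    return (unsigned)(bcd >> 4) * 10u + (bcd & 0x0fu);
}

/* Convert a value in the range 0..99 to packed BCD. */
uint8_t bin_to_bcd(unsigned v)
{
    return (uint8_t)(((v / 10u) << 4) | (v % 10u));
}

/* Four-digit year from the BCD century and year bytes. */
unsigned rtc_full_year(uint8_t century_bcd, uint8_t year_bcd)
{
    return bcd_to_bin(century_bcd) * 100u + bcd_to_bin(year_bcd);
}

/*
 * Periodic interrupt frequency in Hz selected by bits 3-0 of register A,
 * assuming a 32.768 kHz time base.  Returns 0 if the timer is disabled.
 */
unsigned rtc_periodic_hz(uint8_t reg_a)
{
    unsigned rs = reg_a & 0x0fu;

    if (rs == 0)
        return 0;                   /* periodic timer disabled */
    if (rs <= 2)
        return 256u >> (rs - 1);    /* 0001 -> 256 Hz, 0010 -> 128 Hz */
    return 32768u >> (rs - 1);      /* 0011 -> 8192 Hz ... 1111 -> 2 Hz */
}
```

A register A value of 0x26, for instance, selects the 32 kHz divider and a
1024 Hz periodic interrupt.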
| 257 | |
| 258 | 2.3) APIC |
| 259 | |
| 260 | On Pentium and later processors, an on-board timer is available to each CPU |
| 261 | as part of the Advanced Programmable Interrupt Controller. The APIC is |
| 262 | accessed through memory-mapped registers and provides interrupt service to each |
| 263 | CPU, used for IPIs and local timer interrupts. |
| 264 | |
| 265 | Although in theory the APIC is a safe and stable source for local interrupts, |
| 266 | in practice, many bugs and glitches have occurred due to the special nature of |
| 267 | the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect |
| 268 | the use of the APIC and that workarounds may be required. In addition, some of |
| 269 | these workarounds pose unique constraints for virtualization - requiring either |
| 270 | extra overhead incurred from extra reads of memory-mapped I/O or additional |
| 271 | functionality that may be more computationally expensive to implement. |
| 272 | |
Since the APIC is documented quite well in the Intel and AMD manuals, we will
avoid repetition of the detail here.  It should be pointed out that the APIC
timer is programmed through the LVT (local vector table) timer register, is
capable of one-shot or periodic operation, and is based on the bus clock
divided down by the programmable divider register.
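
The relationship between bus clock, divider and timer count can be sketched
as follows.  The function names are hypothetical, the bus frequency must be
obtained by calibration in practice, and the divide configuration encoding
shown is the one we understand from the Intel manual (divisor selected by
bits 0, 1 and 3); treat it as an assumption to be checked against the manual.

```c
#include <stdint.h>

/*
 * Encode a power-of-two divider (1, 2, 4, ... 128) as a divide
 * configuration register value.  Returns -1 for unsupported dividers.
 */
int apic_divide_config(unsigned divider)
{
    unsigned k = 0;
    unsigned v;

    if (divider == 0 || divider > 128 || (divider & (divider - 1)))
        return -1;
    while ((1u << k) != divider)
        k++;
    v = (k + 7) & 7;    /* 1 -> 7, 2 -> 0, 4 -> 1, ... 128 -> 6 */
    return (int)((v & 3u) | ((v & 4u) << 1));   /* bit 2 is skipped */
}

/*
 * Initial count for a timer period of `period_ns`, given the bus clock
 * in Hz and the divider.  The result is clamped to 32 bits.  The
 * intermediate product is assumed to fit in 64 bits, which holds for
 * periods of a few seconds at GHz-range bus clocks.
 */
uint32_t apic_initial_count(uint64_t bus_hz, unsigned divider,
                            uint64_t period_ns)
{
    uint64_t ticks;

    if (divider == 0)
        return 0;
    ticks = (bus_hz / divider) * period_ns / 1000000000ull;
    return ticks > UINT32_MAX ? UINT32_MAX : (uint32_t)ticks;
}
```

For a 100 MHz bus clock divided by 16, a 1 ms period corresponds to an
initial count of 6250.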
| 278 | |
| 279 | 2.4) HPET |
| 280 | |
| 281 | HPET is quite complex, and was originally intended to replace the PIT / RTC |
| 282 | support of the X86 PC. It remains to be seen whether that will be the case, as |
| 283 | the de facto standard of PC hardware is to emulate these older devices. Some |
| 284 | systems designated as legacy free may support only the HPET as a hardware timer |
| 285 | device. |
| 286 | |
| 287 | The HPET spec is rather loose and vague, requiring at least 3 hardware timers, |
| 288 | but allowing implementation freedom to support many more. It also imposes no |
| 289 | fixed rate on the timer frequency, but does impose some extremal values on |
| 290 | frequency, error and slew. |
| 291 | |
In general, the HPET is recommended as a high precision (compared to PIT / RTC)
time source which is independent of local variation (as there is only one HPET
in any given system).  The HPET is also memory-mapped, and its presence is
indicated through ACPI tables by the BIOS.
| 296 | |
| 297 | Detailed specification of the HPET is beyond the current scope of this |
| 298 | document, as it is also very well documented elsewhere. |
| 299 | |
| 300 | 2.5) Offboard Timers |
| 301 | |
| 302 | Several cards, both proprietary (watchdog boards) and commonplace (e1000) have |
| 303 | timing chips built into the cards which may have registers which are accessible |
| 304 | to kernel or user drivers. To the author's knowledge, using these to generate |
| 305 | a clocksource for a Linux or other kernel has not yet been attempted and is in |
| 306 | general frowned upon as not playing by the agreed rules of the game. Such a |
| 307 | timer device would require additional support to be virtualized properly and is |
| 308 | not considered important at this time as no known operating system does this. |
| 309 | |
| 310 | ========================================================================= |
| 311 | |
| 312 | 3) TSC Hardware |
| 313 | |
The TSC or time stamp counter is relatively simple in theory; it counts
processor clock cycles, which can be used as a measure of time.  In practice,
due to a number of problems, it is the most complicated timekeeping device to
use.
| 318 | |
The TSC is represented internally as a 64-bit MSR which can be read with the
RDMSR, RDTSC, or RDTSCP (when available) instructions.  The TSC can also be
written, but on older hardware it was generally only possible to write the low
32 bits of the 64-bit counter, and the upper 32 bits of the counter were
cleared.  On Intel processors family 0Fh, models 3, 4 and 6, and family 06h,
models e and f, this restriction has been lifted and all 64 bits are writable.
On AMD systems, the ability to write the TSC MSR is not an architectural
guarantee.
| 327 | |
| 328 | The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by |
| 329 | means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access. |
| 330 | |
| 331 | Some vendors have implemented an additional instruction, RDTSCP, which returns |
| 332 | atomically not just the TSC, but an indicator which corresponds to the |
| 333 | processor number. This can be used to index into an array of TSC variables to |
| 334 | determine offset information in SMP systems where TSCs are not synchronized. |
| 335 | The presence of this instruction must be determined by consulting CPUID feature |
| 336 | bits. |
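
A sketch of the array lookup described above is given here.  The structure
and function names are hypothetical, and the (TSC, processor number) pair is
passed in as plain data rather than read with RDTSCP so that the example is
independent of the CPU it runs on.  How the per-CPU offsets are measured is
outside the scope of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* One atomic RDTSCP-style sample: counter value plus processor number. */
struct tsc_sample {
    uint64_t tsc;
    uint32_t cpu;
};

/*
 * Apply the offset recorded for the sampled CPU so that values from
 * different CPUs can be compared.  Returns 0 on success, or -1 if the
 * processor number is outside the offset array.
 */
int tsc_normalize(struct tsc_sample s, const int64_t *offsets,
                  size_t ncpus, uint64_t *out)
{
    if (s.cpu >= ncpus)
        return -1;
    *out = s.tsc + (uint64_t)offsets[s.cpu];    /* wraps modulo 2^64 */
    return 0;
}
```

Because the counter and the processor number are returned together, the
offset applied always belongs to the CPU on which the counter was read, even
if the thread migrates immediately afterwards.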
| 337 | |
Both VMX and SVM provide extension fields in the virtualization hardware which
allow the guest visible TSC to be offset by a constant.  Newer implementations
promise to allow the TSC to additionally be scaled, but this hardware is not
yet widely available.
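
The arithmetic behind offsetting and scaling can be sketched as below.  The
function names are hypothetical, and the fixed-point ratio format used for
scaling is an assumption chosen for illustration, not a description of any
particular hardware.  The scaling helper relies on the unsigned __int128
extension available in GCC and Clang on 64-bit targets.

```c
#include <stdint.h>

/* Guest-visible TSC when the hardware adds a constant offset. */
uint64_t guest_tsc(uint64_t host_tsc, uint64_t offset)
{
    return host_tsc + offset;       /* wraps modulo 2^64 */
}

/* Offset that makes the guest observe `desired` at host time `host_now`. */
uint64_t tsc_offset_for(uint64_t desired, uint64_t host_now)
{
    return desired - host_now;      /* may wrap; the addition undoes it */
}

/*
 * Scale a TSC value by a fixed-point ratio with `frac_bits` fractional
 * bits (hypothetical format).
 */
uint64_t scale_tsc(uint64_t tsc, uint64_t ratio, unsigned frac_bits)
{
    return (uint64_t)(((unsigned __int128)tsc * ratio) >> frac_bits);
}
```

Since the offset is applied with wrapping 64-bit arithmetic, a guest TSC that
is behind the host TSC is expressed simply as a very large unsigned offset.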
| 342 | |
| 343 | 3.1) TSC synchronization |
| 344 | |
The TSC is a CPU-local clock in most implementations.  This means, on SMP
platforms, the TSCs of different CPUs may start at different times depending
on when the CPUs are powered on.  Generally, CPUs on the same die will share
the same clock; however, this is not always the case.
| 349 | |
The BIOS may attempt to resynchronize the TSCs during the poweron process and
the operating system or other system software may attempt to do this as well.
Several hardware limitations make the problem worse - if it is not possible to
write the full 64-bits of the TSC, it may be impossible to match the TSC in
newly arriving CPUs to that of the rest of the system, resulting in
unsynchronized TSCs.  Whether resynchronization is attempted by the BIOS or by
system software, in practice a perfectly synchronized TSC will not be possible
unless all values are read from the same clock, which generally is only
possible on single socket systems or those with special hardware support.
| 359 | |
| 360 | 3.2) TSC and CPU hotplug |
| 361 | |
| 362 | As touched on already, CPUs which arrive later than the boot time of the system |
| 363 | may not have a TSC value that is synchronized with the rest of the system. |
| 364 | Either system software, BIOS, or SMM code may actually try to establish the TSC |
| 365 | to a value matching the rest of the system, but a perfect match is usually not |
| 366 | a guarantee. This can have the effect of bringing a system from a state where |
| 367 | TSC is synchronized back to a state where TSC synchronization flaws, however |
| 368 | small, may be exposed to the OS and any virtualization environment. |
| 369 | |
| 370 | 3.3) TSC and multi-socket / NUMA |
| 371 | |
| 372 | Multi-socket systems, especially large multi-socket systems are likely to have |
| 373 | individual clocksources rather than a single, universally distributed clock. |
| 374 | Since these clocks are driven by different crystals, they will not have |
| 375 | perfectly matched frequency, and temperature and electrical variations will |
| 376 | cause the CPU clocks, and thus the TSCs to drift over time. Depending on the |
| 377 | exact clock and bus design, the drift may or may not be fixed in absolute |
| 378 | error, and may accumulate over time. |
| 379 | |
| 380 | In addition, very large systems may deliberately slew the clocks of individual |
| 381 | cores. This technique, known as spread-spectrum clocking, reduces EMI at the |
| 382 | clock frequency and harmonics of it, which may be required to pass FCC |
| 383 | standards for telecommunications and computer equipment. |
| 384 | |
| 385 | It is recommended not to trust the TSCs to remain synchronized on NUMA or |
| 386 | multiple socket systems for these reasons. |
| 387 | |
| 388 | 3.4) TSC and C-states |
| 389 | |
| 390 | C-states, or idling states of the processor, especially C1E and deeper sleep |
| 391 | states may be problematic for TSC as well. The TSC may stop advancing in such |
| 392 | a state, resulting in a TSC which is behind that of other CPUs when execution |
| 393 | is resumed. Such CPUs must be detected and flagged by the operating system |
| 394 | based on CPU and chipset identifications. |
| 395 | |
| 396 | The TSC in such a case may be corrected by catching it up to a known external |
| 397 | clocksource. |
| 398 | |
| 399 | 3.5) TSC frequency change / P-states |
| 400 | |
| 401 | To make things slightly more interesting, some CPUs may change frequency. They |
| 402 | may or may not run the TSC at the same rate, and because the frequency change |
| 403 | may be staggered or slewed, at some points in time, the TSC rate may not be |
| 404 | known other than falling within a range of values. In this case, the TSC will |
| 405 | not be a stable time source, and must be calibrated against a known, stable, |
| 406 | external clock to be a usable source of time. |
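
A minimal sketch of such a calibration follows.  The function name is
hypothetical; the caller is assumed to have sampled both the TSC and a
reference counter of known frequency at the start and end of an interval.
The multiplication uses the unsigned __int128 extension of GCC and Clang to
avoid overflow.

```c
#include <stdint.h>

/*
 * Estimate the TSC frequency in Hz from two paired samples of the TSC
 * and of a reference counter running at `ref_hz`.  Returns 0 if the
 * reference counter did not advance.
 */
uint64_t tsc_calibrate_hz(uint64_t tsc_start, uint64_t tsc_end,
                          uint64_t ref_start, uint64_t ref_end,
                          uint64_t ref_hz)
{
    uint64_t dtsc = tsc_end - tsc_start;
    uint64_t dref = ref_end - ref_start;

    if (dref == 0)
        return 0;
    return (uint64_t)(((unsigned __int128)dtsc * ref_hz) / dref);
}
```

The result is only as good as the pairing of the samples; any delay between
reading the reference counter and reading the TSC shows up directly as
calibration error, which is why longer intervals give better estimates.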
| 407 | |
| 408 | Whether the TSC runs at a constant rate or scales with the P-state is model |
| 409 | dependent and must be determined by inspecting CPUID, chipset or vendor |
| 410 | specific MSR fields. |
| 411 | |
| 412 | In addition, some vendors have known bugs where the P-state is actually |
| 413 | compensated for properly during normal operation, but when the processor is |
| 414 | inactive, the P-state may be raised temporarily to service cache misses from |
| 415 | other processors. In such cases, the TSC on halted CPUs could advance faster |
| 416 | than that of non-halted processors. AMD Turion processors are known to have |
| 417 | this problem. |
| 418 | |
| 419 | 3.6) TSC and STPCLK / T-states |
| 420 | |
| 421 | External signals given to the processor may also have the effect of stopping |
| 422 | the TSC. This is typically done for thermal emergency power control to prevent |
| 423 | an overheating condition, and typically, there is no way to detect that this |
| 424 | condition has happened. |
| 425 | |
| 426 | 3.7) TSC virtualization - VMX |
| 427 | |
| 428 | VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP |
| 429 | instructions, which is enough for full virtualization of TSC in any manner. In |
| 430 | addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET |
| 431 | field specified in the VMCS. Special instructions must be used to read and |
| 432 | write the VMCS field. |
| 433 | |
| 434 | 3.8) TSC virtualization - SVM |
| 435 | |
| 436 | SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP |
| 437 | instructions, which is enough for full virtualization of TSC in any manner. In |
| 438 | addition, SVM allows passing through the host TSC plus an additional offset |
| 439 | field specified in the SVM control block. |
| 440 | |
| 441 | 3.9) TSC feature bits in Linux |
| 442 | |
| 443 | In summary, there is no way to guarantee the TSC remains in perfect |
| 444 | synchronization unless it is explicitly guaranteed by the architecture. Even |
| 445 | if so, the TSCs in multi-sockets or NUMA systems may still run independently |
| 446 | despite being locally consistent. |
| 447 | |
| 448 | The following feature bits are used by Linux to signal various TSC attributes, |
| 449 | but they can only be taken to be meaningful for UP or single node systems. |
| 450 | |
| 451 | X86_FEATURE_TSC : The TSC is available in hardware |
| 452 | X86_FEATURE_RDTSCP : The RDTSCP instruction is available |
| 453 | X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states |
| 454 | X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states |
| 455 | X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware) |
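
As a rough sketch, the hardware side of some of these attributes can be
decoded from CPUID register values as shown below.  The structure and
function names are hypothetical, and the register values are passed in as
arguments so that the example does not depend on the CPU it runs on.  Linux
itself combines such bits with vendor and model checks, so this is a
simplification and not the kernel's logic.

```c
#include <stdbool.h>
#include <stdint.h>

struct tsc_cpuid_bits {
    bool tsc;           /* TSC present */
    bool rdtscp;        /* RDTSCP instruction present */
    bool invariant;     /* TSC rate independent of P-, C- and T-states */
};

/*
 * Decode TSC-related CPUID bits:
 *   leaf 1,          EDX bit 4:  TSC
 *   leaf 80000001h,  EDX bit 27: RDTSCP
 *   leaf 80000007h,  EDX bit 8:  invariant TSC
 */
struct tsc_cpuid_bits tsc_decode_cpuid(uint32_t leaf1_edx,
                                       uint32_t leaf80000001_edx,
                                       uint32_t leaf80000007_edx)
{
    struct tsc_cpuid_bits b;

    b.tsc = (leaf1_edx & (1u << 4)) != 0;
    b.rdtscp = (leaf80000001_edx & (1u << 27)) != 0;
    b.invariant = (leaf80000007_edx & (1u << 8)) != 0;
    return b;
}
```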

=========================================================================

| 457 | 4) Virtualization Problems |
| 458 | |
Timekeeping is especially problematic for virtualization because a number of
challenges arise.  The most obvious problem is that time is now shared between
the host and, potentially, a number of virtual machines.  Thus the virtual
operating system does not run with 100% usage of the CPU, despite the fact that
it may very well make that assumption.  It may expect it to remain true to very
exacting bounds when interrupt sources are disabled, but in reality only its
virtual interrupt sources are disabled, and the machine may still be preempted
at any time.  This causes problems as the guest's view of elapsed time, the
injection of machine interrupts and the associated clock sources are no longer
completely synchronized with real time.
| 469 | |
This same problem can occur on native hardware to a degree, as SMM mode may
steal cycles from the operating system on X86 systems when SMM mode is used by
the BIOS, but not in such an extreme fashion.  However, the fact that SMM mode
may cause similar problems to virtualization makes it a good justification for
solving many of these problems on bare metal.
| 475 | |
| 476 | 4.1) Interrupt clocking |
| 477 | |
One of the most immediate problems that occurs with legacy operating systems
is that the system timekeeping routines are often designed to keep track of
time by counting periodic interrupts.  These interrupts may come from the PIT
or the RTC, but the problem is the same: the host virtualization engine may not
be able to deliver the proper number of interrupts per second, and so guest
time may fall behind.  This is especially problematic if a high interrupt rate
is selected, such as 1000 Hz, which is unfortunately the default for many Linux
guests.
| 486 | |
There are three approaches to solving this problem.  First, it may be possible
to simply ignore it.  Guests which have a separate time source for tracking
'wall clock' or 'real time' may not need any adjustment of their interrupts to
maintain proper time.  If this is not sufficient, it may be necessary to inject
additional interrupts into the guest in order to increase the effective
interrupt rate.  This approach leads to complications in extreme conditions,
where host load or guest lag is too much to compensate for, and thus another
solution to the problem has arisen: the guest may need to become aware of lost
ticks and compensate for them internally.  Although promising in theory, the
implementation of this policy in Linux has been extremely error prone, and a
number of buggy variants of lost tick compensation are distributed across
commonly used Linux systems.
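
The bookkeeping behind interrupt reinjection can be sketched as follows.
This is not the policy of any particular hypervisor; the function names and
the burst limit are assumptions for illustration.  The multiplication uses
the unsigned __int128 extension of GCC and Clang.

```c
#include <stdint.h>

/*
 * Number of periodic ticks still owed to a guest: the ticks that should
 * have occurred in `elapsed_ns` at `hz`, minus those already delivered.
 */
uint64_t ticks_owed(uint64_t elapsed_ns, uint32_t hz,
                    uint64_t ticks_delivered)
{
    uint64_t expected =
        (uint64_t)(((unsigned __int128)elapsed_ns * hz) / 1000000000ull);

    return expected > ticks_delivered ? expected - ticks_delivered : 0;
}

/*
 * Limit how many owed ticks are injected at once (hypothetical policy
 * knob), so that a long stall does not turn into an interrupt storm.
 */
uint64_t ticks_to_inject(uint64_t owed, uint64_t max_burst)
{
    return owed < max_burst ? owed : max_burst;
}
```

A guest configured for 1000 Hz that received 950 ticks during one second of
real time is owed 50 ticks; with a burst limit of 10, the backlog would be
worked off over several injection opportunities.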
| 499 | |
| 500 | Windows uses periodic RTC clocking as a means of keeping time internally, and |
| 501 | thus requires interrupt slewing to keep proper time. It does use a low enough |
| 502 | rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in |
| 503 | practice. |
| 504 | |
| 505 | 4.2) TSC sampling and serialization |
| 506 | |
As the highest precision time source available, the cycle counter of the CPU
has aroused much interest from developers.  As explained above, this timer has
many problems unique to its nature as a local, potentially unstable and
potentially unsynchronized source.  One issue which is not unique to the TSC,
but is highlighted because of its very precise nature, is sampling delay.  By
definition, the counter, once read, is already old.  However, it is also
possible for the counter to be read ahead of the actual use of the result.
This is a consequence of the superscalar execution of the instruction stream,
which may execute instructions out of order.  Such execution is called
non-serialized.  Forcing serialized execution is necessary for precise
measurement with the TSC, and requires a serializing instruction, such as CPUID
or an MSR write.
| 519 | |
| 520 | Since CPUID may actually be virtualized by a trap and emulate mechanism, this |
| 521 | serialization can pose a performance issue for hardware virtualization. An |
| 522 | accurate time stamp counter reading may therefore not always be available, and |
| 523 | it may be necessary for an implementation to guard against "backwards" reads of |
| 524 | the TSC as seen from other CPUs, even in an otherwise perfectly synchronized |
| 525 | system. |
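
One way to guard against backwards reads is to remember the largest value
handed out so far and never return less.  The sketch below uses C11 atomics
and a hypothetical function name; it illustrates the idea and is not the
code used by any particular kernel.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Largest value returned so far, shared by all CPUs. */
static _Atomic uint64_t last_tsc;

/*
 * Return `sample` if it is newer than anything seen so far, otherwise
 * return the newest value already handed out.
 */
uint64_t monotonic_tsc(uint64_t sample)
{
    uint64_t last = atomic_load(&last_tsc);

    while (sample > last) {
        /* On failure, `last` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&last_tsc, &last, sample))
            return sample;
    }
    return last;
}
```

The cost of this approach is a shared cache line that every reader touches,
which is the price paid for hiding small ordering errors between CPUs.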
| 526 | |
| 527 | 4.3) Timespec aliasing |
| 528 | |
| 529 | Additionally, this lack of serialization from the TSC poses another challenge |
| 530 | when using results of the TSC when measured against another time source. As |
| 531 | the TSC is much higher precision, many possible values of the TSC may be read |
| 532 | while another clock is still expressing the same value. |
| 533 | |
| 534 | That is, you may read (T,T+10) while external clock C maintains the same value. |
| 535 | Due to non-serialized reads, you may actually end up with a range which |
| 536 | fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but |
| 537 | calibrated against an external value may have a range of valid values. |
| 538 | Re-calibrating this computation may actually cause time, as computed after the |
| 539 | calibration, to go backwards, compared with time computed before the |
| 540 | calibration. |
| 541 | |
| 542 | This problem is particularly pronounced with an internal time source in Linux, |
| 543 | the kernel time, which is expressed in the theoretically high resolution |
| 544 | timespec - but which advances in much larger granularity intervals, sometimes |
| 545 | at the rate of jiffies, and possibly in catchup modes, at a much larger step. |
| 546 | |
| 547 | This aliasing requires care in the computation and recalibration of kvmclock |
| 548 | and any other values derived from TSC computation (such as TSC virtualization |
| 549 | itself). |
| 550 | |
| 551 | 4.4) Migration |
| 552 | |
| 553 | Migration of a virtual machine raises problems for timekeeping in two ways. |
| 554 | First, the migration itself may take time, during which interrupts cannot be |
| 555 | delivered, and after which, the guest time may need to be caught up. NTP may |
| 556 | be able to help to some degree here, as the clock correction required is |
| 557 | typically small enough to fall in the NTP-correctable window. |
| 558 | |
| 559 | An additional concern is that timers based off the TSC (or HPET, if the raw bus |
| 560 | clock is exposed) may now be running at different rates, requiring compensation |
| 561 | in some way in the hypervisor by virtualizing these timers. In addition, |
| 562 | migrating to a faster machine may preclude the use of a passthrough TSC, as a |
| 563 | faster clock cannot be made visible to a guest without the potential of time |
| 564 | advancing faster than usual. A slower clock is less of a problem, as it can |
| 565 | always be caught up to the original rate. KVM clock avoids these problems by |
| 566 | simply storing multipliers and offsets against the TSC for the guest to convert |
| 567 | back into nanosecond resolution values. |
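
The conversion a guest performs with such multipliers and offsets can be
sketched as below.  The structure layout and names are illustrative and only
loosely modeled on a paravirtual clock; they are not the actual KVM clock
ABI.  The multiplier is assumed to be a 32.32 fixed-point number of
nanoseconds per (shifted) TSC cycle, and the multiplication uses the
unsigned __int128 extension of GCC and Clang.

```c
#include <stdint.h>

/* Parameters published by the host (hypothetical layout). */
struct clock_params {
    uint64_t tsc_timestamp;     /* TSC when system_time_ns was sampled */
    uint64_t system_time_ns;    /* guest time in ns at tsc_timestamp */
    uint32_t tsc_to_ns_mul;     /* 32.32 fixed-point ns per cycle */
    int8_t tsc_shift;           /* shift applied to the TSC delta */
};

/* Convert a TSC reading to nanoseconds using the published parameters. */
uint64_t clock_read_ns(const struct clock_params *p, uint64_t tsc_now)
{
    uint64_t delta = tsc_now - p->tsc_timestamp;

    if (p->tsc_shift >= 0)
        delta <<= p->tsc_shift;
    else
        delta >>= -p->tsc_shift;
    return p->system_time_ns +
           (uint64_t)(((unsigned __int128)delta * p->tsc_to_ns_mul) >> 32);
}
```

After a migration to a host with a different TSC rate, only the published
parameters need to change; the guest keeps using the same conversion.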
| 568 | |
| 569 | 4.5) Scheduling |
| 570 | |
Since scheduling may be based on precise timing and firing of interrupts, the
scheduling algorithms of an operating system may be adversely affected by
virtualization.  In theory, the effect is random and should be uniformly
distributed, but in contrived as well as real scenarios (guest device access,
causes of virtualization exits, possible context switch), this may not always
be the case.  The effect of this has not been well studied.
| 577 | |
| 578 | In an attempt to work around this, several implementations have provided a |
| 579 | paravirtualized scheduler clock, which reveals the true amount of CPU time for |
| 580 | which a virtual machine has been running. |
| 581 | |
| 582 | 4.6) Watchdogs |
| 583 | |
Watchdog timers, such as the lockup detector in Linux, may fire accidentally
when running under hardware virtualization due to timer interrupts being
delayed or misinterpretation of the passage of real time.  Usually, these
warnings are spurious and can be ignored, but in some circumstances it may be
necessary to disable such detection.
| 589 | |
| 590 | 4.7) Delays and precision timing |
| 591 | |
| 592 | Precise timing and delays may not be possible in a virtualized system. This |
| 593 | can happen if the system is controlling physical hardware, or issues delays to |
| 594 | compensate for slower I/O to and from devices. The first issue is not solvable |
| 595 | in general for a virtualized system; hardware control software can't be |
| 596 | adequately virtualized without a full real-time operating system, which would |
| 597 | require an RT aware virtualization platform. |
| 598 | |
| 599 | The second issue may cause performance problems, but this is unlikely to be a |
| 600 | significant issue. In many cases these delays may be eliminated through |
| 601 | configuration or paravirtualization. |
| 602 | |
| 603 | 4.8) Covert channels and leaks |
| 604 | |
| 605 | In addition to the above problems, time information will inevitably leak to the |
| 606 | guest about the host in anything but a perfect implementation of virtualized |
| 607 | time. This may allow the guest to infer the presence of a hypervisor (as in a |
| 608 | red-pill type detection), and it may allow information to leak between guests |
| 609 | by using CPU utilization itself as a signalling channel. Preventing such |
| 610 | problems would require completely isolated virtual time which may not track |
| 611 | real time any longer. This may be useful in certain security or QA contexts, |
| 612 | but in general isn't recommended for real-world deployment scenarios. |