Commit | Line | Data |
---|---|---|
f392eb25 ZA |
1 | |
2 | Timekeeping Virtualization for X86-Based Architectures | |
3 | ||
4 | Zachary Amsden <zamsden@redhat.com> | |
5 | Copyright (c) 2010, Red Hat. All rights reserved. | |
6 | ||
7 | 1) Overview | |
8 | 2) Timing Devices | |
9 | 3) TSC Hardware | |
10 | 4) Virtualization Problems | |
11 | ||
12 | ========================================================================= | |
13 | ||
14 | 1) Overview | |
15 | ||
16 | One of the most complicated parts of the X86 platform, and specifically of | |
17 | virtualizing this platform, is the plethora of timing devices available | |
18 | and the complexity of emulating those devices. In addition, virtualization of | |
19 | time brings a new set of challenges because it introduces a multiplexed | |
20 | division of time beyond the control of the guest CPU. | |
21 | ||
22 | First, we will describe the various timekeeping hardware available, then | |
23 | present some of the problems which arise and solutions available, giving | |
24 | specific recommendations for certain classes of KVM guests. | |
25 | ||
26 | The purpose of this document is to collect data and information relevant to | |
27 | timekeeping which may be difficult to find elsewhere, specifically, | |
28 | information relevant to KVM and hardware-based virtualization. | |
29 | ||
30 | ========================================================================= | |
31 | ||
32 | 2) Timing Devices | |
33 | ||
34 | First we discuss the basic hardware devices available. TSC and the related | |
35 | KVM clock are special enough to warrant a full exposition and are described in | |
36 | the following section. | |
37 | ||
38 | 2.1) i8254 - PIT | |
39 | ||
40 | One of the first timer devices available is the programmable interval timer, | |
41 | or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three | |
42 | channels which can be programmed to deliver periodic or one-shot interrupts. | |
43 | These three channels can be configured in different modes and have individual | |
44 | counters. Channels 1 and 2 were not available for general use in the original | |
45 | IBM PC, and historically were connected to control RAM refresh and the PC | |
46 | speaker. Now the PIT is typically integrated as part of an emulated chipset | |
47 | and a separate physical PIT is not used. | |
48 | ||
49 | The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done | |
50 | using single or multiple byte access to the I/O ports. There are 6 modes | |
51 | available, but not all modes are available to all timers, as only timer 2 | |
52 | has a connected gate input, required for modes 1 and 5. The gate line is | |
53 | controlled by port 61h, bit 0, as illustrated in the following diagram. | |
54 | ||
55 | -------------- ---------------- | |
56 | | | | | | |
57 | | 1.1932 MHz |---------->| CLOCK OUT | ---------> IRQ 0 | |
58 | | Clock | | | | | |
59 | -------------- | +->| GATE TIMER 0 | | |
60 | | ---------------- | |
61 | | | |
62 | | ---------------- | |
63 | | | | | |
64 | |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM | |
65 | | | | (aka /dev/null) | |
66 | | +->| GATE TIMER 1 | | |
67 | | ---------------- | |
68 | | | |
69 | | ---------------- | |
70 | | | | | |
71 | |------>| CLOCK OUT | ---------> Port 61h, bit 5 | |
72 | | | | | |
73 | Port 61h, bit 0 ---------->| GATE TIMER 2 | \_.---- ____ | |
74 | ---------------- _| )--|LPF|---Speaker | |
75 | / *---- \___/ | |
76 | Port 61h, bit 1 -----------------------------------/ | |
77 | ||
78 | The timer modes are now described. | |
79 | ||
80 | Mode 0: Single Timeout. This is a one-shot software timeout that counts down | |
81 | when the gate is high (always true for timers 0 and 1). When the count | |
82 | reaches zero, the output goes high. | |
83 | ||
84 | Mode 1: Triggered One-shot. The output is initially set high. When the gate | |
85 | line is set high, a countdown is initiated (which does not stop if the gate is | |
86 | lowered), during which the output is set low. When the count reaches zero, | |
87 | the output goes high. | |
88 | ||
89 | Mode 2: Rate Generator. The output is initially set high. When the countdown | |
90 | reaches 1, the output goes low for one count and then returns high. The value | |
91 | is reloaded and the countdown automatically resumes. If the gate line goes | |
92 | low, the count is halted. If the output is low when the gate is lowered, the | |
93 | output automatically goes high (this only affects timer 2). | |
94 | ||
95 | Mode 3: Square Wave. This generates a high / low square wave. The count | |
96 | determines the length of the pulse, which alternates between high and low | |
97 | when zero is reached. The count only proceeds when gate is high and is | |
98 | automatically reloaded on reaching zero. The count is decremented twice at | |
99 | each clock to generate a full high / low cycle at the full periodic rate. | |
100 | If the count is even, the clock remains high for N/2 counts and low for N/2 | |
101 | counts; if the count is odd, the clock is high for (N+1)/2 counts and low | |
102 | for (N-1)/2 counts. Only even values are latched by the counter, so odd | |
103 | values are not observed when reading. This is the intended mode for timer 2, | |
104 | which generates sine-like tones by low-pass filtering the square wave output. | |
105 | ||
106 | Mode 4: Software Strobe. After programming this mode and loading the counter, | |
107 | the output remains high until the counter reaches zero. Then the output | |
108 | goes low for 1 clock cycle and returns high. The counter is not reloaded. | |
109 | Counting only occurs when gate is high. | |
110 | ||
111 | Mode 5: Hardware Strobe. After programming and loading the counter, the | |
112 | output remains high. When the gate is raised, a countdown is initiated | |
113 | (which does not stop if the gate is lowered). When the counter reaches zero, | |
114 | the output goes low for 1 clock cycle and then returns high. The counter is | |
115 | not reloaded. | |
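
The mode 3 duty-cycle rule above can be written out as a small calculation.
This is only a sketch: the helper and struct names are hypothetical, and n is
taken to be the effective count (a programmed value of 0 is not handled).

```c
#include <stdint.h>

/* Hypothetical helper type: how many input clocks the mode 3 output
 * spends high and low for one full cycle. */
struct pit_sqwave {
    uint32_t high_clocks;
    uint32_t low_clocks;
};

/* Split an effective mode 3 count N according to the rule in the text. */
static struct pit_sqwave pit_mode3_split(uint32_t n)
{
    struct pit_sqwave w;

    w.high_clocks = (n + 1) / 2;   /* N/2 when even, (N+1)/2 when odd */
    w.low_clocks  = n / 2;         /* N/2 when even, (N-1)/2 when odd */
    return w;
}
```

For any N the two halves sum to N, so the output period is always N input
clocks; only the duty cycle differs between even and odd counts.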
116 | ||
117 | In addition to normal binary counting, the PIT supports BCD counting. The | |
118 | command port, 0x43, is used to set the counter and mode for each of the three | |
119 | timers. | |
120 | ||
121 | PIT commands are issued to port 0x43 using the following bit encoding: | |
122 | ||
123 | Bit 7-4: Command (See table below) | |
124 | Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined) | |
125 | Bit 0 : Binary (0) / BCD (1) | |
126 | ||
127 | Command table: | |
128 | ||
129 | 0000 - Latch Timer 0 count for port 0x40 | |
130 | sample and hold the count to be read in port 0x40; | |
131 | additional commands ignored until counter is read; | |
132 | mode bits ignored. | |
133 | ||
134 | 0001 - Set Timer 0 LSB mode for port 0x40 | |
135 | set timer to read LSB only and force MSB to zero; | |
136 | mode bits set timer mode | |
137 | ||
138 | 0010 - Set Timer 0 MSB mode for port 0x40 | |
139 | set timer to read MSB only and force LSB to zero; | |
140 | mode bits set timer mode | |
141 | ||
142 | 0011 - Set Timer 0 16-bit mode for port 0x40 | |
143 | set timer to read / write LSB first, then MSB; | |
144 | mode bits set timer mode | |
145 | ||
146 | 0100 - Latch Timer 1 count for port 0x41 - as described above | |
147 | 0101 - Set Timer 1 LSB mode for port 0x41 - as described above | |
148 | 0110 - Set Timer 1 MSB mode for port 0x41 - as described above | |
149 | 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above | |
150 | ||
151 | 1000 - Latch Timer 2 count for port 0x42 - as described above | |
152 | 1001 - Set Timer 2 LSB mode for port 0x42 - as described above | |
153 | 1010 - Set Timer 2 MSB mode for port 0x42 - as described above | |
154 | 1011 - Set Timer 2 16-bit mode for port 0x42 - as described above | |
155 | ||
156 | 1101 - General counter latch | |
157 | Latch combination of counters into corresponding ports | |
158 | Bit 3 = Counter 2 | |
159 | Bit 2 = Counter 1 | |
160 | Bit 1 = Counter 0 | |
161 | Bit 0 = Unused | |
162 | ||
163 | 1110 - Latch timer status | |
164 | Latch combination of counter mode into corresponding ports | |
165 | Bit 3 = Counter 2 | |
166 | Bit 2 = Counter 1 | |
167 | Bit 1 = Counter 0 | |
168 | ||
169 | The output of ports 0x40-0x42 following this command will be: | |
170 | ||
171 | Bit 7 = Output pin | |
172 | Bit 6 = Count loaded (0 if timer has expired) | |
173 | Bit 5-4 = Read / Write mode | |
174 | 01 = LSB only | |
175 | 10 = MSB only | |
176 | 11 = LSB / MSB (16-bit) | |
177 | Bit 3-1 = Mode | |
178 | Bit 0 = Binary (0) / BCD mode (1) | |
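
As an illustration of the encoding above, the following sketch builds a
command byte and picks a reload value for a target interrupt rate. The
function and enum names are hypothetical. It treats the upper command nibble
as a 2-bit timer select (bits 7-6) followed by a 2-bit access field
(bits 5-4), which is how the command table above decomposes. No port I/O is
performed; the values would still have to be written to ports 0x43 and
0x40-0x42 by privileged code.

```c
#include <stdint.h>

#define PIT_BASE_HZ 1193182u   /* base clock from the text, in Hz */

/* Access field of the command nibble (bits 5-4 of the command byte). */
enum pit_access { PIT_LATCH = 0, PIT_LSB = 1, PIT_MSB = 2, PIT_LSB_MSB = 3 };

/* Build a command byte for port 0x43: timer in bits 7-6, access in
 * bits 5-4, mode in bits 3-1, BCD flag in bit 0. */
static uint8_t pit_command(unsigned timer, enum pit_access access,
                           unsigned mode, int bcd)
{
    return (uint8_t)(((timer & 3u) << 6) | ((unsigned)access << 4) |
                     ((mode & 7u) << 1) | (bcd ? 1u : 0u));
}

/* Reload value giving approximately the requested rate, rounded to the
 * nearest count and clamped to the 16-bit counter range. In binary mode
 * a programmed value of 0 is counted by the device as 65536. */
static uint32_t pit_reload_for_hz(uint32_t hz)
{
    uint32_t n;

    if (hz == 0)
        return 65536;          /* slowest possible rate */
    n = (PIT_BASE_HZ + hz / 2) / hz;
    if (n < 1)
        n = 1;
    if (n > 65536)
        n = 65536;
    return n;
}
```

For example, pit_command(0, PIT_LSB_MSB, 2, 0) selects timer 0 in 16-bit
access, mode 2, binary counting, and pit_reload_for_hz(100) gives the count
for a 100 Hz tick.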
179 | ||
180 | 2.2) RTC | |
181 | ||
182 | The second device which was available in the original PC was the MC146818 real | |
183 | time clock. The original device is now obsolete, and usually emulated by the | |
184 | system chipset, sometimes by an HPET and some frankenstein IRQ routing. | |
185 | ||
186 | The RTC is accessed through CMOS variables, which use an index register to | |
187 | control which bytes are read. Since there is only one index register, read | |
188 | of the CMOS and read of the RTC require lock protection (in addition, it is | |
189 | dangerous to allow userspace utilities such as hwclock to have direct RTC | |
190 | access, as they could corrupt kernel reads and writes of CMOS memory). | |
191 | ||
192 | The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt | |
193 | can function as a periodic timer, an additional once a day alarm, and can issue | |
194 | interrupts after an update of the CMOS registers by the MC146818 is complete. | |
195 | The type of interrupt is signalled in the RTC status registers. | |
196 | ||
197 | The RTC will update the current time fields by battery power even while the | |
198 | system is off. The current time fields should not be read while an update is | |
199 | in progress, as indicated in the status register. | |
200 | ||
201 | The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be | |
202 | programmed to a 32kHz divider if the RTC is to count seconds. | |
203 | ||
204 | This is the RAM map originally used for the RTC/CMOS: | |
205 | ||
206 | Location Size Description | |
207 | ------------------------------------------ | |
208 | 00h byte Current second (BCD) | |
209 | 01h byte Seconds alarm (BCD) | |
210 | 02h byte Current minute (BCD) | |
211 | 03h byte Minutes alarm (BCD) | |
212 | 04h byte Current hour (BCD) | |
213 | 05h byte Hours alarm (BCD) | |
214 | 06h byte Current day of week (BCD) | |
215 | 07h byte Current day of month (BCD) | |
216 | 08h byte Current month (BCD) | |
217 | 09h byte Current year (BCD) | |
218 | 0Ah byte Register A | |
219 | bit 7 = Update in progress | |
220 | bit 6-4 = Divider for clock | |
221 | 000 = 4.194 MHz | |
222 | 001 = 1.049 MHz | |
223 | 010 = 32 kHz | |
224 | 10X = test modes | |
225 | 110 = reset / disable | |
226 | 111 = reset / disable | |
227 | bit 3-0 = Rate selection for periodic interrupt | |
228 | 0000 = periodic timer disabled | |
229 | 0001 = 3.90625 mS | |
230 | 0010 = 7.8125 mS | |
231 | 0011 = .122070 mS | |
232 | 0100 = .244141 mS | |
233 | ... | |
234 | 1101 = 125 mS | |
235 | 1110 = 250 mS | |
236 | 1111 = 500 mS | |
237 | 0Bh byte Register B | |
238 | bit 7 = Run (0) / Halt (1) | |
239 | bit 6 = Periodic interrupt enable | |
240 | bit 5 = Alarm interrupt enable | |
241 | bit 4 = Update-ended interrupt enable | |
242 | bit 3 = Square wave output enable | |
243 | bit 2 = BCD calendar (0) / Binary (1) | |
244 | bit 1 = 12-hour mode (0) / 24-hour mode (1) | |
245 | bit 0 = 0 (DST off) / 1 (DST enabled) | |
246 | 0Ch byte Register C (read only) | |
247 | bit 7 = interrupt request flag (IRQF) | |
248 | bit 6 = periodic interrupt flag (PF) | |
249 | bit 5 = alarm interrupt flag (AF) | |
250 | bit 4 = update interrupt flag (UF) | |
251 | bit 3-0 = reserved | |
252 | 0Dh byte Register D (read only) | |
253 | bit 7 = RTC has power | |
254 | bit 6-0 = reserved | |
255 | 32h byte Current century BCD (*) | |
256 | (*) location vendor specific and now determined from ACPI global tables | |
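
Two small calculations follow from the map above: decoding the BCD time
fields, and turning the Register A rate selection into a frequency. The
sketch below assumes the 32.768 kHz time base and MC146818 behaviour, where
rate selections 1 and 2 give 256 Hz and 128 Hz; the function names are
hypothetical and no CMOS access is performed.

```c
#include <stdint.h>

/* Convert one packed BCD byte (e.g. 0x59) to binary (59). */
static unsigned bcd_to_bin(uint8_t v)
{
    return (v >> 4) * 10u + (v & 0x0fu);
}

/* Periodic interrupt frequency in Hz for the Register A rate-select
 * field (bits 3-0), assuming the 32.768 kHz time base.
 * Returns 0 when the periodic timer is disabled. */
static unsigned rtc_periodic_hz(uint8_t reg_a)
{
    unsigned rs = reg_a & 0x0fu;

    if (rs == 0)
        return 0;
    if (rs <= 2)
        rs += 7;                /* selections 1 and 2 alias 8 and 9 */
    return 32768u >> (rs - 1);
}
```

A Register A value of 0x26 (32 kHz divider, rate selection 6) corresponds to
a 1024 Hz periodic interrupt, and the period in the table is simply the
reciprocal of the returned frequency.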
257 | ||
258 | 2.3) APIC | |
259 | ||
260 | On Pentium and later processors, an on-board timer is available to each CPU | |
261 | as part of the Advanced Programmable Interrupt Controller. The APIC is | |
262 | accessed through memory-mapped registers and provides interrupt service to each | |
263 | CPU, used for IPIs and local timer interrupts. | |
264 | ||
265 | Although in theory the APIC is a safe and stable source for local interrupts, | |
266 | in practice, many bugs and glitches have occurred due to the special nature of | |
267 | the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect | |
268 | the use of the APIC and that workarounds may be required. In addition, some of | |
269 | these workarounds pose unique constraints for virtualization - requiring either | |
270 | extra overhead incurred from extra reads of memory-mapped I/O or additional | |
271 | functionality that may be more computationally expensive to implement. | |
272 | ||
273 | Since the APIC is documented quite well in the Intel and AMD manuals, we will | |
274 | avoid repetition of the detail here. It should be pointed out that the APIC | |
275 | timer is programmed through the LVT (local vector table) timer register, is capable | |
276 | of one-shot or periodic operation, and is based on the bus clock divided down | |
277 | by the programmable divider register. | |
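
Because the timer counts the bus clock divided by the programmed divider, the
count to load for a given period is a straightforward ratio. The sketch below
shows only that arithmetic. The function name is hypothetical, and bus_hz is
an assumed input that real software would have to measure or calibrate
against another clock.

```c
#include <stdint.h>

/* Initial-count value for a desired period.
 * bus_hz:    timer input clock in Hz (assumed known, see above).
 * divider:   divide value selected in the divider register
 *            (1, 2, 4, ..., 128).
 * period_ns: desired period in nanoseconds.
 * Returns 0 if the divider is invalid or the result does not fit the
 * 32-bit count register. The intermediate product can overflow for
 * periods of many seconds; this sketch does not guard against that. */
static uint32_t apic_timer_count(uint64_t bus_hz, unsigned divider,
                                 uint64_t period_ns)
{
    uint64_t ticks;

    if (divider == 0)
        return 0;
    ticks = (bus_hz / divider) * period_ns / 1000000000ull;
    return ticks > UINT32_MAX ? 0 : (uint32_t)ticks;
}
```

With a 100 MHz bus clock and a divider of 16, a 1 ms period corresponds to a
count of 6250.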
278 | ||
279 | 2.4) HPET | |
280 | ||
281 | HPET is quite complex, and was originally intended to replace the PIT / RTC | |
282 | support of the X86 PC. It remains to be seen whether that will be the case, as | |
283 | the de facto standard of PC hardware is to emulate these older devices. Some | |
284 | systems designated as legacy free may support only the HPET as a hardware timer | |
285 | device. | |
286 | ||
287 | The HPET spec is rather loose and vague, requiring at least 3 hardware timers, | |
288 | but allowing implementation freedom to support many more. It also imposes no | |
289 | fixed rate on the timer frequency, but does impose some extremal values on | |
290 | frequency, error and slew. | |
291 | ||
292 | In general, the HPET is recommended as a high precision (compared to PIT / RTC) | |
293 | time source which is independent of local variation (as there is only one HPET | |
294 | in any given system). The HPET is also memory-mapped, and its presence is | |
295 | indicated through ACPI tables by the BIOS. | |
296 | ||
297 | Detailed specification of the HPET is beyond the current scope of this | |
298 | document, as it is also very well documented elsewhere. | |
299 | ||
300 | 2.5) Offboard Timers | |
301 | ||
302 | Several cards, both proprietary (watchdog boards) and commonplace (e1000), have | |
303 | timing chips built into the cards which may have registers which are accessible | |
304 | to kernel or user drivers. To the author's knowledge, using these to generate | |
305 | a clocksource for a Linux or other kernel has not yet been attempted and is in | |
306 | general frowned upon as not playing by the agreed rules of the game. Such a | |
307 | timer device would require additional support to be virtualized properly and is | |
308 | not considered important at this time as no known operating system does this. | |
309 | ||
310 | ========================================================================= | |
311 | ||
312 | 3) TSC Hardware | |
313 | ||
314 | The TSC or time stamp counter is relatively simple in theory; it counts | |
315 | processor clock cycles, which can be used as a measure of | |
316 | time. In practice, due to a number of problems, it is the most complicated | |
317 | timekeeping device to use. | |
318 | ||
319 | The TSC is represented internally as a 64-bit MSR which can be read with the | |
320 | RDMSR, RDTSC, or RDTSCP (when available) instructions. The TSC can also be | |
321 | written with WRMSR, but because of hardware limitations, on old hardware it | |
322 | was generally only possible to write the low 32 bits of the 64-bit counter, and the upper | |
323 | 32 bits of the counter were cleared. Now, however, on Intel processors family | |
324 | 0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction | |
325 | has been lifted and all 64-bits are writable. On AMD systems, the ability to | |
326 | write the TSC MSR is not an architectural guarantee. | |
327 | ||
328 | The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by | |
329 | means of the CR4.TSD bit, which, when set, disables CPL > 0 TSC access. | |
330 | ||
331 | Some vendors have implemented an additional instruction, RDTSCP, which returns | |
332 | atomically not just the TSC, but an indicator which corresponds to the | |
333 | processor number. This can be used to index into an array of TSC variables to | |
334 | determine offset information in SMP systems where TSCs are not synchronized. | |
335 | The presence of this instruction must be determined by consulting CPUID feature | |
336 | bits. | |
337 | ||
338 | Both VMX and SVM provide extension fields in the virtualization hardware which | |
339 | allow the guest visible TSC to be offset by a constant. Newer implementations | |
340 | promise to allow the TSC to additionally be scaled, but this hardware is not | |
341 | yet widely available. | |
342 | ||
343 | 3.1) TSC synchronization | |
344 | ||
345 | The TSC is a CPU-local clock in most implementations. This means, on SMP | |
346 | platforms, the TSCs of different CPUs may start at different times depending | |
347 | on when the CPUs are powered on. Generally, CPUs on the same die will share | |
348 | the same clock, however, this is not always the case. | |
349 | ||
350 | The BIOS may attempt to resynchronize the TSCs during the power-on process and | |
351 | the operating system or other system software may attempt to do this as well. | |
352 | Several hardware limitations make the problem worse - if it is not possible to | |
353 | write the full 64-bits of the TSC, it may be impossible to match the TSC in | |
354 | newly arriving CPUs to that of the rest of the system, resulting in | |
355 | unsynchronized TSCs. Resynchronization may be attempted by BIOS or system software, but in | |
356 | practice, getting a perfectly synchronized TSC will not be possible unless all | |
357 | values are read from the same clock, which generally only is possible on single | |
358 | socket systems or those with special hardware support. | |
359 | ||
360 | 3.2) TSC and CPU hotplug | |
361 | ||
362 | As touched on already, CPUs which arrive later than the boot time of the system | |
363 | may not have a TSC value that is synchronized with the rest of the system. | |
364 | Either system software, BIOS, or SMM code may actually try to establish the TSC | |
365 | to a value matching the rest of the system, but a perfect match is usually not | |
366 | a guarantee. This can have the effect of bringing a system from a state where | |
367 | TSC is synchronized back to a state where TSC synchronization flaws, however | |
368 | small, may be exposed to the OS and any virtualization environment. | |
369 | ||
370 | 3.3) TSC and multi-socket / NUMA | |
371 | ||
372 | Multi-socket systems, especially large multi-socket systems, are likely to have | |
373 | individual clocksources rather than a single, universally distributed clock. | |
374 | Since these clocks are driven by different crystals, they will not have | |
375 | perfectly matched frequency, and temperature and electrical variations will | |
376 | cause the CPU clocks, and thus the TSCs to drift over time. Depending on the | |
377 | exact clock and bus design, the drift may or may not be fixed in absolute | |
378 | error, and may accumulate over time. | |
379 | ||
380 | In addition, very large systems may deliberately slew the clocks of individual | |
381 | cores. This technique, known as spread-spectrum clocking, reduces EMI at the | |
382 | clock frequency and harmonics of it, which may be required to pass FCC | |
383 | standards for telecommunications and computer equipment. | |
384 | ||
385 | It is recommended not to trust the TSCs to remain synchronized on NUMA or | |
386 | multiple socket systems for these reasons. | |
387 | ||
388 | 3.4) TSC and C-states | |
389 | ||
390 | C-states, or idling states of the processor, especially C1E and deeper sleep | |
391 | states may be problematic for TSC as well. The TSC may stop advancing in such | |
392 | a state, resulting in a TSC which is behind that of other CPUs when execution | |
393 | is resumed. Such CPUs must be detected and flagged by the operating system | |
394 | based on CPU and chipset identifications. | |
395 | ||
396 | The TSC in such a case may be corrected by catching it up to a known external | |
397 | clocksource. | |
398 | ||
399 | 3.5) TSC frequency change / P-states | |
400 | ||
401 | To make things slightly more interesting, some CPUs may change frequency. They | |
402 | may or may not run the TSC at the same rate, and because the frequency change | |
403 | may be staggered or slewed, at some points in time, the TSC rate may not be | |
404 | known other than falling within a range of values. In this case, the TSC will | |
405 | not be a stable time source, and must be calibrated against a known, stable, | |
406 | external clock to be a usable source of time. | |
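
Calibration against an external clock amounts to counting TSC cycles across a
known reference interval. The following sketch shows that arithmetic only.
The function name is hypothetical, and the caller is assumed to have sampled
the TSC immediately before and after an interval of ref_ns nanoseconds
measured by a stable clock such as the PIT or HPET.

```c
#include <stdint.h>

/* Estimate the TSC rate in kHz from two TSC samples taken around a
 * known interval of a stable reference clock.
 * ref_ns is the reference interval in nanoseconds.
 * Intended for short calibration windows: the multiplication below
 * overflows once the cycle delta exceeds roughly 1.8e13. */
static uint64_t tsc_khz_estimate(uint64_t tsc_start, uint64_t tsc_end,
                                 uint64_t ref_ns)
{
    uint64_t delta = tsc_end - tsc_start;   /* unsigned, wraps correctly */

    if (ref_ns == 0)
        return 0;
    /* cycles per ns times 1e6 is kHz; multiply first to keep precision */
    return delta * 1000000ull / ref_ns;
}
```

If the rate changes during the window, the result is only an average over
that window, which is why a TSC that scales with the P-state needs repeated
recalibration.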
407 | ||
408 | Whether the TSC runs at a constant rate or scales with the P-state is model | |
409 | dependent and must be determined by inspecting CPUID, chipset or vendor | |
410 | specific MSR fields. | |
411 | ||
412 | In addition, some vendors have known bugs where the P-state is actually | |
413 | compensated for properly during normal operation, but when the processor is | |
414 | inactive, the P-state may be raised temporarily to service cache misses from | |
415 | other processors. In such cases, the TSC on halted CPUs could advance faster | |
416 | than that of non-halted processors. AMD Turion processors are known to have | |
417 | this problem. | |
418 | ||
419 | 3.6) TSC and STPCLK / T-states | |
420 | ||
421 | External signals given to the processor may also have the effect of stopping | |
422 | the TSC. This is typically done for thermal emergency power control to prevent | |
423 | an overheating condition, and typically, there is no way to detect that this | |
424 | condition has happened. | |
425 | ||
426 | 3.7) TSC virtualization - VMX | |
427 | ||
428 | VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | |
429 | instructions, which is enough for full virtualization of TSC in any manner. In | |
430 | addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET | |
431 | field specified in the VMCS. Special instructions must be used to read and | |
432 | write the VMCS field. | |
433 | ||
434 | 3.8) TSC virtualization - SVM | |
435 | ||
436 | SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP | |
437 | instructions, which is enough for full virtualization of TSC in any manner. In | |
438 | addition, SVM allows passing through the host TSC plus an additional offset | |
439 | field specified in the SVM control block. | |
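
For both VMX and SVM, the passthrough case reduces to modular 64-bit
addition. The sketch below models that relationship in plain C. The function
names are hypothetical and nothing here touches the VMCS or the SVM control
block.

```c
#include <stdint.h>

/* Value the guest observes when the hardware adds the offset field. */
static uint64_t guest_tsc(uint64_t host_tsc, uint64_t tsc_offset)
{
    return host_tsc + tsc_offset;   /* modulo 2^64 */
}

/* Offset to program so that the guest reads `wanted` at the moment
 * the host TSC reads `host_tsc`. */
static uint64_t tsc_offset_for(uint64_t host_tsc, uint64_t wanted)
{
    return wanted - host_tsc;       /* may be "negative" modulo 2^64 */
}
```

Making a freshly started guest see a TSC of zero, for instance, means
programming tsc_offset_for(host_now, 0); the guest then observes the number
of host cycles elapsed since that moment.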
440 | ||
441 | 3.9) TSC feature bits in Linux | |
442 | ||
443 | In summary, there is no way to guarantee the TSC remains in perfect | |
444 | synchronization unless it is explicitly guaranteed by the architecture. Even | |
445 | if so, the TSCs in multi-socket or NUMA systems may still run independently | |
446 | despite being locally consistent. | |
447 | ||
448 | The following feature bits are used by Linux to signal various TSC attributes, | |
449 | but they can only be taken to be meaningful for UP or single node systems. | |
450 | ||
451 | X86_FEATURE_TSC : The TSC is available in hardware | |
452 | X86_FEATURE_RDTSCP : The RDTSCP instruction is available | |
453 | X86_FEATURE_CONSTANT_TSC : The TSC rate is unchanged with P-states | |
454 | X86_FEATURE_NONSTOP_TSC : The TSC does not stop in C-states | |
455 | X86_FEATURE_TSC_RELIABLE : TSC sync checks are skipped (VMware) | |
456 | ||
457 | 4) Virtualization Problems | |
458 | ||
459 | Timekeeping is especially problematic for virtualization because a number of | |
460 | challenges arise. The most obvious problem is that time is now shared between | |
461 | the host and, potentially, a number of virtual machines. Thus the virtual | |
462 | operating system does not run with 100% usage of the CPU, despite the fact that | |
463 | it may very well make that assumption. It may expect it to remain true to very | |
464 | exacting bounds when interrupt sources are disabled, but in reality only its | |
465 | virtual interrupt sources are disabled, and the machine may still be preempted | |
466 | at any time. This causes problems as the passage of real time, the injection | |
467 | of machine interrupts and the associated clock sources are no longer completely | |
468 | synchronized with real time. | |
469 | ||
470 | This same problem can occur on native hardware to a degree, as SMM mode may | |
471 | steal cycles from the operating system on X86 systems when SMM mode is used by the | |
472 | BIOS, but not in such an extreme fashion. However, the fact that SMM mode may | |
473 | cause similar problems to virtualization makes it a good justification for | |
474 | solving many of these problems on bare metal. | |
475 | ||
476 | 4.1) Interrupt clocking | |
477 | ||
478 | One of the most immediate problems that occurs with legacy operating systems | |
479 | is that the system timekeeping routines are often designed to keep track of | |
480 | time by counting periodic interrupts. These interrupts may come from the PIT | |
481 | or the RTC, but the problem is the same: the host virtualization engine may not | |
482 | be able to deliver the proper number of interrupts per second, and so guest | |
483 | time may fall behind. This is especially problematic if a high interrupt rate | |
484 | is selected, such as 1000 Hz, which is unfortunately the default for many Linux | |
485 | guests. | |
486 | ||
487 | There are three approaches to solving this problem; first, it may be possible | |
488 | to simply ignore it. Guests which have a separate time source for tracking | |
489 | 'wall clock' or 'real time' may not need any adjustment of their interrupts to | |
490 | maintain proper time. If this is not sufficient, it may be necessary to inject | |
491 | additional interrupts into the guest in order to increase the effective | |
492 | interrupt rate. This approach leads to complications in extreme conditions, | |
493 | where host load or guest lag is too much to compensate for, and thus another | |
494 | solution to the problem has arisen: the guest may need to become aware of lost | |
495 | ticks and compensate for them internally. Although promising in theory, the | |
496 | implementation of this policy in Linux has been extremely error prone, and a | |
497 | number of buggy variants of lost tick compensation are distributed across | |
498 | commonly used Linux systems. | |
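
The third approach, guest-side lost tick compensation, can be sketched as
follows. This is an illustration of the idea rather than the code of any
particular kernel: the struct and function names are hypothetical, and now_ns
is assumed to come from a separate, trustworthy reference clock.

```c
#include <stdint.h>

struct tick_state {
    uint64_t last_ns;     /* reference-clock time accounted for so far */
    uint64_t tick_ns;     /* nominal tick period, e.g. 1000000 for 1000 Hz;
                             must be nonzero */
    uint64_t jiffies;     /* ticks accounted so far */
};

/* Called from the tick handler with the current reference time.
 * Accounts for every whole tick period that has elapsed, not just one,
 * and returns the number of ticks that were lost (0 if none). */
static uint64_t account_ticks(struct tick_state *s, uint64_t now_ns)
{
    uint64_t elapsed = now_ns - s->last_ns;
    uint64_t ticks = elapsed / s->tick_ns;

    s->jiffies += ticks;
    s->last_ns += ticks * s->tick_ns;   /* keep the remainder for later */
    return ticks > 1 ? ticks - 1 : 0;
}
```

Carrying the remainder forward instead of resetting last_ns to now_ns avoids
accumulating a rounding error on every tick, which is one of the ways such
schemes tend to drift.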
499 | ||
500 | Windows uses periodic RTC clocking as a means of keeping time internally, and | |
501 | thus requires interrupt slewing to keep proper time. It does use a low enough | |
502 | rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in | |
503 | practice. | |
504 | ||
505 | 4.2) TSC sampling and serialization | |
506 | ||
507 | As the highest precision time source available, the cycle counter of the CPU | |
508 | has aroused much interest from developers. As explained above, this timer has | |
509 | many problems unique to its nature as a local, potentially unstable and | |
510 | potentially unsynchronized source. One issue which is not unique to the TSC, | |
511 | but is highlighted because of its very precise nature is sampling delay. By | |
512 | definition, the counter, once read is already old. However, it is also | |
513 | possible for the counter to be read ahead of the actual use of the result. | |
514 | This is a consequence of the superscalar execution of the instruction stream, | |
515 | which may execute instructions out of order. Such execution is called | |
516 | non-serialized. Forcing serialized execution is necessary for precise | |
517 | measurement with the TSC, and requires a serializing instruction, such as CPUID | |
518 | or an MSR write. | |
519 | ||
520 | Since CPUID may actually be virtualized by a trap and emulate mechanism, this | |
521 | serialization can pose a performance issue for hardware virtualization. An | |
522 | accurate time stamp counter reading may therefore not always be available, and | |
523 | it may be necessary for an implementation to guard against "backwards" reads of | |
524 | the TSC as seen from other CPUs, even in an otherwise perfectly synchronized | |
525 | system. | |
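
One way to guard against such backwards reads is to remember the largest
value handed out so far and never return anything smaller. The sketch below
uses C11 atomics for this. The names are hypothetical, and the caller is
assumed to pass in a TSC sample it has already read.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Largest value handed out to any caller so far; shared across CPUs. */
static _Atomic uint64_t last_tsc;

/* Return `sample` unless an equal or later value was already returned,
 * in which case return that later value instead. */
static uint64_t tsc_monotonic(uint64_t sample)
{
    uint64_t last = atomic_load(&last_tsc);

    while (sample > last) {
        /* On failure `last` is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&last_tsc, &last, sample))
            return sample;
    }
    return last;
}
```

The price is a shared cache line that every reader may write, so this trades
scalability for monotonicity and is worth applying only where backwards
steps are actually observable.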
526 | ||
527 | 4.3) Timespec aliasing | |
528 | ||
529 | Additionally, this lack of serialization from the TSC poses another challenge | |
530 | when using results of the TSC when measured against another time source. As | |
531 | the TSC is much higher precision, many possible values of the TSC may be read | |
532 | while another clock is still expressing the same value. | |
533 | ||
534 | That is, you may read (T .. T+10) while external clock C maintains the same value. | |
535 | Due to non-serialized reads, you may actually end up with a range which | |
536 | fluctuates - from (T-1 .. T+10). Thus, any time calculated from a TSC, but | |
537 | calibrated against an external value may have a range of valid values. | |
538 | Re-calibrating this computation may actually cause time, as computed after the | |
539 | calibration, to go backwards, compared with time computed before the | |
540 | calibration. | |
541 | ||
542 | This problem is particularly pronounced with an internal time source in Linux, | |
543 | the kernel time, which is expressed in the theoretically high resolution | |
544 | timespec - but which advances in much larger granularity intervals, sometimes | |
545 | at the rate of jiffies, and possibly in catchup modes, at a much larger step. | |
546 | ||
547 | This aliasing requires care in the computation and recalibration of kvmclock | |
548 | and any other values derived from TSC computation (such as TSC virtualization | |
549 | itself). | |
550 | ||
551 | 4.4) Migration | |
552 | ||
553 | Migration of a virtual machine raises problems for timekeeping in two ways. | |
554 | First, the migration itself may take time, during which interrupts cannot be | |
555 | delivered, and after which, the guest time may need to be caught up. NTP may | |
556 | be able to help to some degree here, as the clock correction required is | |
557 | typically small enough to fall in the NTP-correctable window. | |
558 | ||
559 | An additional concern is that timers based off the TSC (or HPET, if the raw bus | |
560 | clock is exposed) may now be running at different rates, requiring compensation | |
561 | in some way in the hypervisor by virtualizing these timers. In addition, | |
562 | migrating to a faster machine may preclude the use of a passthrough TSC, as a | |
563 | faster clock cannot be made visible to a guest without the potential of time | |
564 | advancing faster than usual. A slower clock is less of a problem, as it can | |
565 | always be caught up to the original rate. KVM clock avoids these problems by | |
566 | simply storing multipliers and offsets against the TSC for the guest to convert | |
567 | back into nanosecond resolution values. | |
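
The multiplier-and-offset conversion can be illustrated with fixed-point
arithmetic. The record layout and field names below are assumptions made for
this sketch, not a definition of the KVM clock ABI: a 32.32 fixed-point
multiplier combined with a power-of-two pre-shift.

```c
#include <stdint.h>

/* Hypothetical per-guest parameters; field names are assumptions. */
struct clock_params {
    uint64_t tsc_base;    /* TSC value when system_ns was sampled */
    uint64_t system_ns;   /* nanoseconds at tsc_base */
    uint32_t mul;         /* 32.32 fixed-point multiplier */
    int8_t   shift;       /* pre-scale: left if > 0, right if < 0 */
};

/* Convert a TSC reading to nanoseconds using the stored parameters. */
static uint64_t tsc_to_ns(const struct clock_params *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_base;
    uint64_t hi, lo;

    if (p->shift < 0)
        delta >>= -p->shift;
    else
        delta <<= p->shift;

    /* (delta * mul) >> 32 without a 128-bit intermediate. */
    hi = delta >> 32;
    lo = delta & 0xffffffffu;
    return p->system_ns + hi * p->mul + ((lo * p->mul) >> 32);
}
```

For a 2 GHz TSC, mul = 0x80000000 with shift = 0 yields half a nanosecond per
cycle. After migration to a host with a different TSC rate, only the
parameters need to change, and the guest-side conversion stays the same.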
568 | ||
569 | 4.5) Scheduling | |
570 | ||
571 | Since scheduling may be based on precise timing and firing of interrupts, the | |
572 | scheduling algorithms of an operating system may be adversely affected by | |
573 | virtualization. In theory, the effect is random and should be uniformly | |
574 | distributed, but in contrived as well as real scenarios (guest device access, | |
575 | causes of virtualization exits, possible context switch), this may not always | |
576 | be the case. The effect of this has not been well studied. | |
577 | ||
578 | In an attempt to work around this, several implementations have provided a | |
579 | paravirtualized scheduler clock, which reveals the true amount of CPU time for | |
580 | which a virtual machine has been running. | |
581 | ||
582 | 4.6) Watchdogs | |
583 | ||
584 | Watchdog timers, such as the lockup detector in Linux, may fire accidentally when | |
585 | running under hardware virtualization due to timer interrupts being delayed or | |
586 | misinterpretation of the passage of real time. Usually, these warnings are | |
587 | spurious and can be ignored, but in some circumstances it may be necessary to | |
588 | disable such detection. | |
589 | ||
590 | 4.7) Delays and precision timing | |
591 | ||
592 | Precise timing and delays may not be possible in a virtualized system. This | |
593 | can happen if the system is controlling physical hardware, or issues delays to | |
594 | compensate for slower I/O to and from devices. The first issue is not solvable | |
595 | in general for a virtualized system; hardware control software can't be | |
596 | adequately virtualized without a full real-time operating system, which would | |
597 | require an RT aware virtualization platform. | |
598 | ||
599 | The second issue may cause performance problems, but this is unlikely to be a | |
600 | significant issue. In many cases these delays may be eliminated through | |
601 | configuration or paravirtualization. | |
602 | ||
603 | 4.8) Covert channels and leaks | |
604 | ||
605 | In addition to the above problems, time information will inevitably leak to the | |
606 | guest about the host in anything but a perfect implementation of virtualized | |
607 | time. This may allow the guest to infer the presence of a hypervisor (as in a | |
608 | red-pill type detection), and it may allow information to leak between guests | |
609 | by using CPU utilization itself as a signalling channel. Preventing such | |
610 | problems would require completely isolated virtual time which may not track | |
611 | real time any longer. This may be useful in certain security or QA contexts, | |
612 | but in general isn't recommended for real-world deployment scenarios. |