kvmclock - A quick introduction
Background
Introduction to the classic problem
Timekeeping in guests is complicated. A guest can easily get confused: it may assume that all interrupt sources are disabled when, in fact, only the virtual interrupt sources are. The (virtual) system may still be preempted, and while it is preempted it makes no progress even though real time keeps moving forward. In fact, the same thing can happen to an operating system running on bare metal when the system BIOS invokes System Management Mode (SMM). So timekeeping is a universal problem, not one specific to virtualization.
Motivation for paravirtualized clock
Typically, system time in the guest is measured by counting interrupts, in most cases timer interrupts. But owing to architectural constraints (SVM/VMX), the time at which an interrupt occurred may not be the time at which it is delivered to the guest.
The other approach typically used is an emulated device such as the High Precision Event Timer (HPET). The problem with it is that every read of the emulated timer results in an exit to the hypervisor.
kvmclock
Basic Working
The guest sets aside a page where it wants time information to be written and registers the address of that page with the hypervisor. The hypervisor keeps updating the page until it is told to stop, so the guest can simply read the page whenever it wants time information. Since the page is shared between the guest and the host, no vmexits are required.
The clock structure also carries a multiplier; the guest multiplies the TSC delta by it to obtain the elapsed time in nanoseconds. The host can change the multiplier at any time, for example when the TSC frequency changes because the vCPU moved to a different physical CPU (pCPU) or the guest was live-migrated.
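The conversion can be sketched as follows. This is a minimal illustration, not the kernel's code: the helper names tsc_delta_to_ns() and guest_time_ns() are hypothetical, and the formula (apply tsc_shift, multiply by tsc_to_system_mul, drop the low 32 bits, add system_time) is my reading of how the Linux pvclock code uses these fields.

```c
#include <stdint.h>

/*
 * Hypothetical helper: scale a TSC delta to nanoseconds.
 * 'mul' is treated as a 32.32 fixed-point fraction and 'shift' as a
 * power-of-two pre-scale (negative means shift right).
 */
static inline uint64_t tsc_delta_to_ns(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;

    /* floor(delta * mul / 2^32), computed without a 128-bit type. */
    uint64_t hi = delta >> 32;
    uint64_t lo = delta & 0xffffffffu;
    return hi * mul + ((lo * mul) >> 32);
}

/*
 * Hypothetical helper: time since the host's last update of the page,
 * added to the system_time the host recorded at that update.
 */
static inline uint64_t guest_time_ns(uint64_t tsc_now, uint64_t tsc_timestamp,
                                     uint64_t system_time,
                                     uint32_t mul, int8_t shift)
{
    return system_time + tsc_delta_to_ns(tsc_now - tsc_timestamp, mul, shift);
}
```

For instance, with mul = 0x80000000 (one half in 32.32 fixed point) and shift = 1, one TSC tick maps to one nanosecond, which is what a 1 GHz TSC would need.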
During a migration, the source host can issue an ioctl to get the last valid timestamp. The destination then uses this information to set the time for the newly migrated guest.
A few more details
Initially, the guest detects whether the hypervisor supports kvmclock. KVM features are presented to the guest in CPUID leaf 0x40000001.
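The detection logic can be sketched as two small checks on CPUID register values. The function names below are hypothetical; the signature constants (leaf 0x40000000 returning "KVMKVMKVM" in EBX/ECX/EDX) and the two clocksource feature bits (bit 0 and bit 3 of EAX in leaf 0x40000001) are taken from the KVM CPUID documentation as I understand it. The functions take the register values as arguments, so actually executing CPUID is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define KVM_CPUID_SIGNATURE      0x40000000u
#define KVM_CPUID_FEATURES       0x40000001u
#define KVM_FEATURE_CLOCKSOURCE  0  /* kvmclock via the original MSRs */
#define KVM_FEATURE_CLOCKSOURCE2 3  /* kvmclock via the newer MSRs */

/* Hypothetical helper: does leaf 0x40000000 carry the KVM signature? */
static inline bool is_kvm_signature(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    /* "KVMK", "VMKV", "M\0\0\0" as little-endian 32-bit values. */
    return ebx == 0x4b4d564bu && ecx == 0x564b4d56u && edx == 0x0000004du;
}

/* Hypothetical helper: does EAX of leaf 0x40000001 advertise kvmclock? */
static inline bool has_kvmclock(uint32_t features_eax)
{
    uint32_t mask = (1u << KVM_FEATURE_CLOCKSOURCE) |
                    (1u << KVM_FEATURE_CLOCKSOURCE2);
    return (features_eax & mask) != 0;
}
```

Keeping the checks free of the CPUID instruction itself makes them easy to exercise with canned register values.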
There are two kvmclock-related structures:
    struct pvclock_wall_clock {
            u32 version;
            u32 sec;
            u32 nsec;
    } __attribute__((__packed__));
This structure represents (as its name says) the wall clock. The guest uses it only at boot time or on suspend/resume. Once the data has been read, the guest can reuse this memory for something else; in other words, this structure is not on the hot path.
    struct pvclock_vcpu_time_info {
            u32 version;
            u32 pad0;
            u64 tsc_timestamp;
            u64 system_time;
            u32 tsc_to_system_mul;
            s8  tsc_shift;
            u8  flags;
            u8  pad[2];
    } __attribute__((__packed__)); /* 32 bytes */
This structure is continuously updated by the host and is used by the guest to obtain time information.
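Because the host can rewrite the structure at any moment, the guest needs a consistent snapshot of it. The sketch below assumes the version protocol described in the KVM MSR documentation: the host bumps version before and after each update, so an odd value, or a value that changed during the read, means the guest should retry. The snapshot type and function name are hypothetical, and the C11 fences stand in for the kernel's own read barriers.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Userspace mirror of the structure, using fixed-width types (32 bytes). */
struct pvclock_vcpu_time_info {
    uint32_t version;
    uint32_t pad0;
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
    uint8_t  flags;
    uint8_t  pad[2];
} __attribute__((__packed__));

/* Hypothetical type: the fields the guest needs to compute time. */
struct pvclock_snapshot {
    uint64_t tsc_timestamp;
    uint64_t system_time;
    uint32_t tsc_to_system_mul;
    int8_t   tsc_shift;
};

/* Hypothetical helper: copy the fields, retrying if the host was mid-update. */
static inline struct pvclock_snapshot
pvclock_read_snapshot(const volatile struct pvclock_vcpu_time_info *src)
{
    struct pvclock_snapshot snap;
    uint32_t version;

    do {
        version = src->version;
        atomic_thread_fence(memory_order_acquire);
        snap.tsc_timestamp     = src->tsc_timestamp;
        snap.system_time       = src->system_time;
        snap.tsc_to_system_mul = src->tsc_to_system_mul;
        snap.tsc_shift         = src->tsc_shift;
        atomic_thread_fence(memory_order_acquire);
        /* Odd version: update in progress. Changed version: torn read. */
    } while ((version & 1) || version != src->version);

    return snap;
}
```

The loop only terminates once the host has left the structure in a stable (even-versioned) state, so the guest never computes time from a half-written update.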
One of the first functions called during Linux boot is setup_arch(). When the guest boots, setup_arch() calls kvmclock_init(), which performs the detection of kvmclock support described above (using CPUID). It then allocates a page that will act as the kvmclock page and calls kvm_register_clock() on it. This function does a wrmsr to MSR_KVM_SYSTEM_TIME; from the hypervisor's perspective, the value written is the address, in the guest address space, of the kvmclock page where it must write the system time fields. The structure (struct pvclock_vcpu_time_info) appears to be updated in vcpu_enter_guest(), when the hypervisor re-enters the guest after a vmexit has happened for some reason:
    if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
            r = kvm_guest_time_update(vcpu);
            if (unlikely(r))
                    goto out;
    }
Shutting it down
Finally, the lowest bit (bit 0) of the value written to the MSR indicates whether the hypervisor should update kvmclock. If the guest writes a value to this MSR with bit 0 clear, the hypervisor takes that as a request to stop updating the page.
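The encoding of the MSR value can be sketched as follows. The helper names are hypothetical, and the code only builds and decodes the 64-bit value; the actual wrmsr is privileged and is left out. It assumes, as described above, that bit 0 is the enable flag and the remaining bits carry the guest-physical address of the page.

```c
#include <stdbool.h>
#include <stdint.h>

#define KVMCLOCK_ENABLE 1ull  /* hypothetical name for bit 0 */

/* Hypothetical helper: value the guest would write to the MSR. */
static inline uint64_t kvmclock_msr_value(uint64_t guest_phys_addr, bool enable)
{
    uint64_t value = guest_phys_addr & ~KVMCLOCK_ENABLE;
    return enable ? (value | KVMCLOCK_ENABLE) : value;
}

/* Hypothetical helpers: how a hypervisor could decode that value. */
static inline bool kvmclock_msr_enabled(uint64_t value)
{
    return (value & KVMCLOCK_ENABLE) != 0;
}

static inline uint64_t kvmclock_msr_gpa(uint64_t value)
{
    return value & ~KVMCLOCK_ENABLE;
}
```

With this encoding, writing 0 to the MSR is the simplest way for the guest to ask the hypervisor to stop updating the page, for example before a reboot or kexec.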
References
- https://lkml.org/lkml/2010/4/15/355
- Documentation/virtual/kvm/timekeeping.txt
- arch/x86/kvm/x86.c
- arch/x86/kernel/kvmclock.c
- arch/x86/kernel/pvclock.c
- arch/x86/kernel/setup.c