Bug 507834 - Clock drift in RHEL5.3 KVM Guests on Heavily Loaded Servers
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: qemu
Version: 11
Hardware: x86_64 Linux
Priority: medium, Severity: medium
Assigned To: Glauber Costa
QA Contact: Fedora Extras Quality Assurance
Blocks: F11VirtTarget
Reported: 2009-06-24 09:24 EDT by Allen Payne
Modified: 2010-04-20 09:02 EDT

Doc Type: Bug Fix
Last Closed: 2009-10-02 04:08:50 EDT

Attachments:
kvm logfile (516 bytes, text/plain), 2009-06-24 14:13 EDT, Allen Payne

Description Allen Payne 2009-06-24 09:24:36 EDT
Description of problem:
We're seeing the clock in a KVM guest, on a heavily loaded Fedora 11 server, gain time very quickly: up to 20 seconds in a 60-second period. The date/time on the server is correct and does not drift.

Version-Release number of selected component (if applicable):
Fedora 11, kernel 2.6.29.4-167.fc11.x86_64, qemu-kvm-0.10.5-3.fc11.x86_64,
Dell R710 server with Nehalem based CPUs, multiple 1 CPU KVM Guests all running RHEL-5.3, ntp enabled on both server and guests

How reproducible:
Easily

Steps to Reproduce:
1. Create lots of guests
2. Run heavy CPU based workloads in all the guests
3. Watch the time drift.
  
Actual results:
Clock drift up to 20 seconds in a 60 second period.

Expected results:
No clock drift

Additional info:
Adding divider=10 to the guest's kernel command line reduces the clock drift to about a second every 60 seconds.
The server is using the TSC clock source.
Comment 1 Mark McLoughlin 2009-06-24 13:10:47 EDT
Thanks for the report

What is the qemu-kvm command line you are using? If you're launching the guest using libvirt, please include the guest's log from /var/log/libvirt/qemu

What clocksource is the guest using - look at /sys/devices/system/clocksource/clocksource0/current_clocksource

I think this may be a well-known issue, and AFAIR the paravirt kvm clocksource is included in RHEL5.4 and will fix this issue. I'm very unclear on the details, though. Marcelo, Glauber?
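
A minimal sketch of the clocksource check asked for above, written as a small shell function. The `show_clocksource` name and its optional directory argument are assumptions added for illustration; only the sysfs path comes from this comment, and `available_clocksource` is the sibling file in the same directory.

```shell
# Print the current and available clocksources from sysfs.
# The optional directory argument is hypothetical (it makes the function
# testable); it defaults to the path named in the comment above.
show_clocksource() {
    dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    for f in current_clocksource available_clocksource; do
        if [ -r "$dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
        else
            printf '%s: (not readable)\n' "$f"
        fi
    done
}

show_clocksource
```

Run inside the guest, this should report the same value that a plain `cat` of `current_clocksource` would show.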
Comment 2 Allen Payne 2009-06-24 14:13:17 EDT
Created attachment 349287 [details]
kvm logfile
Comment 3 Allen Payne 2009-06-24 14:15:21 EDT
The guest's current_clocksource reports "jiffies".
Comment 5 Chris Lalancette 2009-06-25 02:52:13 EDT
(In reply to comment #1)
> Thanks for the report
> 
> What is the qemu-kvm command line you are using? If you're launching the guest
> using libvirt, please include the guest's log from /var/log/libvirt/qemu
> 
> What clocksource is the guest using - look at
> /sys/devices/system/clocksource/clocksource0/current_clocksource
> 
> I think this may be a well known issue, and AFAIR the paravirt kvm clocksource
> is included in RHEL5.4 and will fix this issue. I'm very unclear on the
> details, though. Marcelo, Glauber?  

No, 5.4 does *not* have the paravirt kvm clocksource.  What it does have is the hypercall to fetch the lpj from the host during boot; that helps to properly configure the clocksources initially, but if there are further drifting problems, they will still be present.

In terms of testing, it would be worthwhile to test out a 5.4 guest and see if the lpj stuff helps.  Possibly with a 5.4 kernel + the tick divider, you might be able to get the drift under control.

Another thing to try would be to run an F-11 guest, and see what it looks like.  F-11 does have the full paravirt clocksource, so confirming that the paravirt clocksource makes a difference here would be another good data point to have.

Chris Lalancette
Comment 6 Allen Payne 2009-06-29 06:43:52 EDT
I've tried a Fedora 11 guest (2.6.29.4-167.fc11.x86_64) and the time keeping is much better with that kernel. 

The current_clocksource file reports the kernel is using the "kvm-clock" kernel timer. 

Are there any plans to include this clock source in the 5.3/5.4 kernels?
Comment 9 Mark McLoughlin 2009-07-03 02:07:18 EDT
Allen: could you try the RHEL5 guest with "notsc divider=10" ? We've seen reports that this helps

Also, could you include the dmesg from the guest? I'm looking for messages like "time.c: Using ... timer."

(There is a "time drift fix" -tdf option for qemu-kvm, but that affects the missed PIT interrupts, so I'm not sure that'll help)
Comment 10 Allen Payne 2009-07-03 03:18:51 EDT
We have tested a guest running RHEL-5.3 (2.6.18-128.1.1.el5) with the "divider=10 notsc" options and, whilst the clock drift was reduced, it was still losing about a second every minute when the host and guest were heavily loaded.

$ cat /proc/cmdline
ro root=LABEL=/ panic=20 log_buf_len=131072 crashkernel=128M@16M divider=10 notsc iommu=soft elevator=noop

$ dmesg |grep time
Calibrating delay using timer specific routine.. 5328.84 BogoMIPS (lpj=2664424)
Using local APIC timer interrupts.
Detected 62.500 MHz APIC timer.
Disabling vsyscall due to use of PM timer
time.c: Using 3.579545 MHz WALL PM GTOD PM timer.
time.c: Detected 2659.908 MHz processor.
PCI: Setting latency timer of device 0000:00:01.1 to 64
PCI: Setting latency timer of device 0000:00:01.2 to 64
SELinux:  Disabled at runtime.

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
jiffies
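
The checks in this comment can be wrapped so that it is easy to confirm which workaround options a guest actually booted with. This is only a sketch: `has_kernel_param` and its optional file argument are hypothetical names, not part of any RHEL tool.

```shell
# Succeed if the exact token PARAM appears on the kernel command line.
# The second argument (file to read) is an assumption added for testing;
# it defaults to /proc/cmdline.
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # One token per line, then a fixed-string, whole-line match.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}

for p in divider=10 notsc clock=tsc; do
    if has_kernel_param "$p"; then
        echo "$p: present"
    else
        echo "$p: absent"
    fi
done
```

Matching whole tokens avoids false positives such as `divider=100` being mistaken for `divider=10`.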
Comment 11 Marcelo Tosatti 2009-07-03 06:18:46 EDT
Allen,

Two options:

1. pass clock=tsc on the RHEL5 guest, where the system clock will lose time (but depending on the load, the drift might be acceptable for ntp to adjust frequency).

2. clock=jiffies with -no-kvm-pit-reinjection option to qemu-kvm (without divider option).

Make sure to delete /var/lib/ntp/drift (and reboot the guest) between tests.

A better solution is being worked on.
Comment 12 Mark McLoughlin 2009-07-03 08:11:21 EDT
Allen, if you're using libvirt and want to try -no-kvm-pit-reinjection, then create e.g. a /usr/bin/qemu-kvm-no-pit-reinjection script:

  #!/bin/bash
  # Quote "$@" so arguments containing spaces are passed through intact.
  exec /usr/bin/qemu-kvm -no-kvm-pit-reinjection "$@"

and change the <emulator> element in the guest XML config to point to it.

(I needed to put selinux into enforcing mode to make that work)

Also, you need a 2.6.30 kernel for -no-kvm-pit-reinjection

(In reply to comment #11)

> A better solution is being worked on.  

Care to elaborate a little Marcelo?
Comment 13 Marcelo Tosatti 2009-07-03 09:57:58 EDT
> > A better solution is being worked on.  
> 
> Care to elaborate a little Marcelo?

The cause of the drift is that the guest time code expects a correlation between timer interrupts and TSC, which is very imprecise in KVM. clock=jiffies attempts to correct for lost ticks, and since KVM reinjects lost interrupts the end result is the time gain mentioned in comment #1.

-no-kvm-pit-reinjection stops that, but the guest is still susceptible to time loss (negative drift) depending on the load of the system. clock=tsc does not attempt to correct for lost ticks.

So the suggestions in comment #11 alleviate the drift problem, in the hope that the remaining drift is within the acceptable range for ntp jitter correction. From ntpd(8):

"The maximum slew rate possible is limited to 500 parts-per-million (PPM) as a consequence of the correctness principles on which the NTP protocol and algorithm design are based."

Note that even if the drift (or jitter in ntpd) is larger than 500 PPM, the clock will still be corrected via offset (offset correction means time jumps can be seen).

A better solution is planned which will improve the current situation.
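
For scale, the drift figures reported earlier in this bug can be converted to PPM and set beside the 500 PPM slew limit quoted above. A rough sketch using integer shell arithmetic; `drift_ppm` is a hypothetical helper, and the inputs are the numbers from the Description.

```shell
# drift_ppm DRIFT_MS INTERVAL_S
# Prints the drift in parts per million, truncated to an integer.
# (drift_ms / 1000) seconds of drift per interval_s seconds, times 1e6,
# simplifies to drift_ms * 1000 / interval_s.
drift_ppm() {
    echo $(( $1 * 1000 / $2 ))
}

drift_ppm 20000 60   # 20 s gained per 60 s (Description): prints 333333
drift_ppm 1000 60    # about 1 s per 60 s with divider=10: prints 16666
drift_ppm 30 60      # 30 ms per 60 s is exactly the 500 PPM limit: prints 500
```

Both reported rates are far above 500 PPM, so ntpd cannot absorb them by slewing alone and has to fall back on the offset corrections mentioned above.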
Comment 14 Roel Gloudemans 2009-07-13 02:53:23 EDT
With RHEL 5.3 on 5.4beta I'm seeing this too, without the heavy load. It is so bad that even the workarounds posted here don't help. I see time jumps of a few seconds every hour when the system is not doing much, and a few minutes when it is.
Comment 15 Marcelo Tosatti 2009-07-13 16:25:07 EDT
Roel,

Are you running ntpd on the guest?
Comment 16 Roel Gloudemans 2009-07-14 01:13:39 EDT
Yes (and removed drift files before start).
I switched VMs to "divider=10 notsc". During the night, the clocks were actually synchronised. So that is looking good. Will put some load on the systems this afternoon and see if it stays that way.
Comment 17 Roel Gloudemans 2009-07-15 03:06:47 EDT
OK, the results of the last 24 hours (4am - 4am) of operation.

The good: it looks like it is under control.
The bad: positive and negative drift. The mailserver has an average utilization of only 12%; who knows what will happen when the average is 50% or so.

The desktop VM:
Jul 14 04:08:02 ntpd[3474]: time reset -0.295500 s
Jul 14 04:12:20 ntpd[3474]: synchronized to LOCAL(0), stratum 10
Jul 14 04:13:26 ntpd[3474]: synchronized to a.b.c.d, stratum 3

The Mailserver VM (the busiest one)
Jul 14 07:50:11 ntpd[9434]: time reset +0.137525 s
Jul 14 07:54:32 ntpd[9434]: synchronized to LOCAL(0), stratum 10
Jul 14 07:55:37 ntpd[9434]: synchronized to a.b.c.d, stratum 3
Jul 14 08:42:09 ntpd[9434]: time reset -0.268744 s
Jul 14 08:46:09 ntpd[9434]: synchronized to LOCAL(0), stratum 10
Jul 14 08:48:18 ntpd[9434]: synchronized to a.b.c.d, stratum 3
Jul 14 13:38:42 ntpd[9434]: time reset -0.130925 s
Jul 14 13:43:01 ntpd[9434]: synchronized to LOCAL(0), stratum 10
Jul 14 13:44:05 ntpd[9434]: synchronized to a.b.c.d, stratum 3
Jul 14 16:55:47 ntpd[9434]: time reset -0.389383 s
Jul 14 17:00:05 ntpd[9434]: synchronized to LOCAL(0), stratum 10
Jul 14 17:01:11 ntpd[9434]: synchronized to a.b.c.d, stratum 3
Jul 14 17:29:11 ntpd[9434]: time reset -0.140383 s
Jul 14 17:33:17 ntpd[9434]: synchronized to LOCAL(0), stratum 10
Jul 14 17:33:32 ntpd[9434]: synchronized to a.b.c.d, stratum 3
Jul 14 22:52:22 ntpd[9434]: time reset -0.335109 s
Jul 14 22:56:44 ntpd[9434]: synchronized to LOCAL(0), stratum 10
Jul 14 22:57:32 ntpd[9434]: synchronized to a.b.c.d, stratum 3
Jul 15 03:37:50 ntpd[9434]: time reset -0.228269 s
Jul 15 03:41:31 ntpd[9434]: synchronized to a.b.c.d, stratum 3

The Webserver VM:
Jul 14 08:35:27 ntpd[5382]: time reset -0.155370 s
Jul 14 08:39:47 ntpd[5382]: synchronized to LOCAL(0), stratum 10
Jul 14 08:40:53 ntpd[5382]: synchronized to a.b.c.d, stratum 3
Jul 14 14:38:47 ntpd[5382]: time reset -0.168018 s
Jul 14 14:43:02 ntpd[5382]: synchronized to LOCAL(0), stratum 10
Jul 14 14:44:08 ntpd[5382]: synchronized to a.b.c.d, stratum 3
Jul 14 17:46:41 ntpd[5382]: time reset -0.260242 s
Jul 14 17:51:01 ntpd[5382]: synchronized to LOCAL(0), stratum 10
Jul 14 17:52:06 ntpd[5382]: synchronized to a.b.c.d, stratum 3
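
Logs like the ones above can be summarized per VM instead of compared by eye. A sketch that assumes the ntpd syslog format shown in this comment (signed offset as the second-to-last field of each "time reset" line); `summarize_resets` is a hypothetical name.

```shell
# Read ntpd log lines on stdin and summarize the "time reset" entries:
# how many steps there were, their net sum, and the largest absolute step.
summarize_resets() {
    awk '/time reset/ {
             off = $(NF-1) + 0            # e.g. "-0.295500" from "... time reset -0.295500 s"
             n++
             sum += off
             a = (off < 0) ? -off : off
             if (a > max) max = a
         }
         END {
             if (n == 0) { print "resets=0"; exit }
             printf "resets=%d net=%+.6f s max_abs=%.6f s\n", n, sum, max
         }'
}

# Usage (the log path is an assumption; adjust to where ntpd logs on the guest):
#   grep ntpd /var/log/messages | summarize_resets
```

Fed the Mailserver VM lines above, it reports 7 resets with a net correction of about -1.36 s over the 24 hours.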
Comment 18 Allen Payne 2009-07-17 11:47:03 EDT
After more testing, the combination of "divider=10 notsc" and the 'inject timer interrupts that got lost' option, -tdf, seems to control the worst of the clock drift issues with guests running 5.3.

The -tdf option is not sufficient to control the drift without the "divider=10 notsc" options.
Comment 19 Marcelo Tosatti 2009-07-17 11:52:42 EDT
Allen,

Unless you are using -no-kvm-irqchip, the -tdf option has no effect.
Comment 20 Mark McLoughlin 2009-08-07 11:15:23 EDT
Okay, it sounds to me like the conclusion here is that 5.3 guests need 'notsc divider=10' in order to avoid drift?

Are people happy for this bug to be closed?
Comment 21 Chris Van Hoof 2009-08-07 14:17:04 EDT
(In reply to comment #20)
> Okay, it sounds to me like the conclusion here is that 5.3 guests need 'notsc
> divider=10' in order to avoid drift?
> 
> Are people happy for this bug to be closed?  

I'll let Allen chime in here, but what I have seen while working with Allen is that, regardless of -no-kvm-irqchip, -tdf, notsc, and divider=10, we still see skew.  Quite a bit less, but still very evident.

The only thing we've used thus far where skew was eliminated was kvm-clock.

--chris
Comment 22 Marcelo Tosatti 2009-08-07 16:03:27 EDT
We should backport the -no-kvm-pit-reinjection support to FC11's 2.6.29 kernel. Unfortunately I won't be able to do that until Aug 24th.

Perhaps testing a 2.6.30 kernel (or kvm-88 modules) in the meantime is desired.
Comment 23 Allen Payne 2009-08-10 04:23:53 EDT
Testing with the RHEV KVM version we still see clock drifts of a second or more, on guests running RHEL-5.3 or 5.4 using the "divider=10 notsc" kernel options.

The Fedora 11 guests (2.6.29.4-167.fc11), with the kvm-clock timer, seem to keep time much more accurately.

This is the sort of variation I'm seeing in the times reported by the guests (after about 30 minutes of load on the KVM server):

    07:45:50.304438114
    07:45:50.522347000
    07:45:51.032452000
    07:45:51.072752559
    07:45:51.546714000
    07:45:51.486993000
    07:45:53.768901000
    07:45:51.340134000
    07:45:53.584773000
    07:45:55.983851000


The Fedora 11 guest, using the kvm-clock timer, keeps time much more accurately:

    07:50:08.427770702
    07:50:08.492880000

What would it take to get the kvm-clock timer added to the RHEL kernels?
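
The spread between guests in the samples above can be computed rather than read by eye. A sketch assuming one HH:MM:SS.fraction timestamp per line on stdin, sampled at the same moment and not crossing midnight; `clock_spread` is a hypothetical name.

```shell
# Print the difference in seconds between the latest and earliest
# timestamp read from stdin (format HH:MM:SS.fraction, one per line).
clock_spread() {
    awk -F'[:.]' '
        NF == 4 {
            # Seconds since midnight; the fraction is rebuilt from the 4th field.
            t = $1 * 3600 + $2 * 60 + $3 + ("0." $4)
            if (n == 0 || t < min) min = t
            if (n == 0 || t > max) max = t
            n++
        }
        END { if (n > 0) printf "%.3f\n", max - min }'
}

# Usage: paste the collected timestamps into a file and run
#   clock_spread < guest-times.txt
```

On the ten RHEL guest samples listed above this gives 5.679, and on the two Fedora 11 samples 0.065.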
Comment 24 Roel Gloudemans 2009-10-02 01:40:01 EDT
RHEL 5.4 kernel 164.2.1 seems to have resolved the problem. I've seen no more time jumps in the last 24 hours. (No kernel parameters relating to clock set)
Comment 25 Mark McLoughlin 2009-10-02 04:08:50 EDT
Many thanks for testing, Roel. I'll close the bug then.
Comment 26 Gabor Balogh 2010-01-28 05:44:03 EST
Hi,

I have RHEL 5.4 with kvm virtualization. The time on the guests drifts, but the time on the server is correct.

Kernel version:

uname -a
Linux mrbbo-admin.pgsm.hu 2.6.18-164.11.1.el5 #1 SMP Wed Jan 6 13:26:04 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

Requested information:

cat /proc/cmdline
ro root=/dev/vg00/lvol0 rhgb quiet

dmesg |grep time
time.c: Using tsc for timekeeping HZ 1000
Calibrating delay loop (skipped), value calculated using timer frequency.. 6000.20 BogoMIPS (lpj=3000104)
Using local APIC timer interrupts.
WARNING calibrate_APIC_clock: the APIC timer calibration may be wrong.
Detected 62.500 MHz APIC timer.
Calibrating delay using timer specific routine.. 5992.21 BogoMIPS (lpj=2996106)
Calibrating delay using timer specific routine.. 5993.51 BogoMIPS (lpj=2996757)
Calibrating delay using timer specific routine.. 5992.45 BogoMIPS (lpj=2996228)
time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.
time.c: Detected 3000.104 MHz processor.
PCI: Setting latency timer of device 0000:00:01.1 to 64
PCI: Setting latency timer of device 0000:00:01.2 to 64
SELinux:  Disabled at runtime.
PCI: Setting latency timer of device 0000:00:03.0 to 64
time.c: can't update CMOS clock from 0 to 59

ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 time           192.168.96.68    4 u   57   64  377    0.205  99181.1 35920.0

How can I correct the NTP synchronization?

Thanks,
Gabor
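
In `ntpq -p` output the offset column is in milliseconds, so the 99181.1 above is roughly 99 seconds. A sketch that flags peers whose absolute offset exceeds a limit; `large_offsets` is a hypothetical name, and the 128 ms default is chosen for illustration because it matches ntpd's default step threshold.

```shell
# Read "ntpq -p" output on stdin and print peers whose absolute offset
# (second-to-last column, in milliseconds) is larger than LIMIT_MS.
large_offsets() {
    limit_ms="${1:-128}"
    awk -v limit="$limit_ms" '
        NF >= 10 {
            # On the header line the offset field is the word "offset",
            # which converts to 0 and is therefore never reported.
            off = $(NF-1) + 0
            a = (off < 0) ? -off : off
            if (a > limit) printf "%s offset=%s ms\n", $1, $(NF-1)
        }'
}

# Usage:
#   ntpq -p | large_offsets        # default limit of 128 ms
#   ntpq -p | large_offsets 1000   # only report offsets above one second
```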
Comment 27 Glauber Costa 2010-01-28 06:45:44 EST
2.6.18-164.11.1.el5 should already have all of kvmclock included. Are you seeing this need for synchronization at boot time only?
Comment 28 Gabor Balogh 2010-01-28 08:41:10 EST
How can I check whether this kernel contains kvmclock? It is not only at boot time; I would like to keep the time synchronized always.
Comment 29 Glauber Costa 2010-01-28 10:16:03 EST
time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer. <=== this means you are using kvmclock.

In x86_64 RHEL5, you cannot change clocksources at runtime. So if you booted with it, you'll be using it.
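
The dmesg check described above can be scripted. A minimal sketch assuming the RHEL5 x86_64 message format quoted in this comment; `uses_kvmclock` and `boot_timer` are hypothetical helper names.

```shell
# Succeed if the dmesg text on stdin contains the
# "time.c: Using ... KVM GTOD KVM timer." line, which per the comment
# above means the guest is running on kvmclock.
uses_kvmclock() {
    grep -q 'time\.c: Using .* KVM GTOD KVM timer'
}

# Print the timer description from any "time.c: Using ... timer." line.
boot_timer() {
    sed -n 's/^time\.c: Using \(.*\) timer\.$/\1/p'
}

# Usage (on the guest):
#   dmesg | uses_kvmclock && echo "kvmclock in use"
#   dmesg | boot_timer
```

Applied to the comment 10 guest, `boot_timer` would print the PM timer description instead, and `uses_kvmclock` would fail.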
Comment 30 Gabor Balogh 2010-01-28 10:43:30 EST
How can I change the clocksource? Does the time in a KVM guest drift only in jiffies mode?
In guest OS: time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.

In host OS: time.c: Using 14.318180 MHz WALL HPET GTOD HPET/TSC timer.

In host OS:
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
jiffies

In the host the clocksource is jiffies, and the time is correct:
[root@mrbbo-admin5 ~]# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*mrbbo-admin2.pg 192.168.96.69    4 u   19  128  377    0.188   -1.469   0.108
+mrbbo-admin1.pg 192.168.96.69    4 u   64  128  377    0.200   -0.146   0.119
 LOCAL(0)        .LOCL.          10 l   34   64  377    0.000    0.000   0.001
