Bug 307201

Summary:	Time flapping / drifting on guest, getting worse over time
Product:	Red Hat Enterprise Linux 5	Reporter:	tn
Component:	kernel-xen	Assignee:	Rik van Riel <riel>
Status:	CLOSED DUPLICATE	QA Contact:	Martin Jenner <mjenner>
Severity:	urgent	Docs Contact:
Priority:	low
Version:	5.0	CC:	bryan_sauser, clalance, jburke, joukio, kdmasary, k.georgiou, ralston, riel, rodney.mckee, sputhenp, xen-maint
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-07-13 20:49:14 UTC	Type:	---
Bug Depends On:
Bug Blocks:	492570

Description tn 2007-09-26 14:58:02 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7

Description of problem:
I am running a Xen guest (HAV mode), also with RHEL5 as its OS.
After a while I noticed that the system date is "jumping" around.

Every second the date jumps between two values. At first the difference is
only a few seconds, but over time it grows to several minutes and so on.

# date; date;
Wed Sep 26 16:50:20 CEST 2007
Wed Sep 26 16:50:16 CEST 2007

A few seconds later:

# date; date;
Wed Sep 26 16:50:48 CEST 2007
Wed Sep 26 16:50:44 CEST 2007

Here we only have a drift of 4 seconds... but it's getting worse.

The whole system starts to behave strangely after a while, because the
time is messed up.

Some more info:

Guest: Linux xxx 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

Host: Linux xen 2.6.18-8.1.8.el5xen #1 SMP Mon Jun 25 17:19:38 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
jiffies

# cat /proc/sys/xen/independent_wallclock
1

There is no ntpd running, but I do sync the time every few hours
with a timeserver. (On both host and guest).

There are no errors reported.
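
For anyone collecting the same information on several guests, the clocksource check above can be scripted. This is only a minimal sketch: it assumes the kernel exposes both current_clocksource and available_clocksource under the sysfs directory quoted above, and the function name read_clocksource is made up for illustration.

```python
from pathlib import Path

# Directory quoted in the report; available_clocksource is assumed to sit next to
# current_clocksource, as it does on kernels that expose the clocksource sysfs files.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources, or (None, []) if the files are missing."""
    base = Path(base)
    current_file = base / "current_clocksource"
    available_file = base / "available_clocksource"
    current = current_file.read_text().strip() if current_file.exists() else None
    available = available_file.read_text().split() if available_file.exists() else []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksource()
    print("current:", current)
    print("available:", " ".join(available))
```

Comparing the "current" value across guests makes it easy to see which ones fell back to jiffies.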

Version-Release number of selected component (if applicable):
xen-3.0.3-25.0.3.el5

How reproducible:
Always

Steps to Reproduce:
1. Install Guest like any other RHEL5 system in HAV mode
2. run guest
3. wait a few hours, run "date; date;"

Actual Results:
Dates differ by several seconds/minutes.

Expected Results:
Same date
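
The "date; date;" check can also be automated. The following is a rough sketch, not part of the original report: it samples the wall clock in quick succession and lists every reading that is earlier than the one before it. The function names, the sample count and the pause are arbitrary choices for illustration.

```python
import time


def find_backward_jumps(samples):
    """Return (index, delta_seconds) for every sample earlier than its predecessor."""
    jumps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < 0:
            jumps.append((i, delta))
    return jumps


def sample_wall_clock(count=200, pause=0.001):
    """Read the wall clock `count` times, sleeping `pause` seconds between reads."""
    samples = []
    for _ in range(count):
        samples.append(time.time())
        time.sleep(pause)
    return samples


if __name__ == "__main__":
    for index, delta in find_backward_jumps(sample_wall_clock()):
        print("sample %d: clock went back %.3f s" % (index, -delta))
```

On a healthy system this should print nothing. On an affected guest, the readings from the first example above (16:50:20 followed by 16:50:16) would show up as a 4 second step backwards.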

Additional info:

Comment 1 tn 2007-10-08 09:15:43 UTC

Any news here? We are running into serious problems, because we are using the
Xen-enabled hosts in an HA-Linux cluster environment and all the cluster
failover mechanisms are broken because of the flapping time.

These clusters are unfortunately in production already, which makes this really
critical for us.

Comment 2 Chris Lalancette 2007-10-08 14:26:01 UTC

Why do you have /proc/sys/xen/independent_wallclock set to 1?  If these are
paravirt guests (which it seems like they are), you really want to have
independent_wallclock set to 0 so that it will keep the time in sync with the
HV.  Otherwise it is a losing battle, since the guest cannot account for time it
wasn't on the processor at all.

Chris Lalancette

Comment 3 tn 2007-10-08 14:41:50 UTC

At first /proc/sys/xen/independent_wallclock was set to 0 on the Xen host.
But we had the problem with the guest then, too. We tried to fix the problem
by setting this to 1 and enabling an NTP sync on the guest. But even with the
NTP sync the times keep drifting apart.

(Currently, the uptime of the guest is 6 days, and two "/bin/date" calls in the
shell right after each other print a difference of 10 minutes.)

I am not sure if the guest is paravirtual. It is completely hardware emulated;
there is nothing Xen-related installed on it. (It's a plain RHEL5 install.)

If you need more information I am more than happy to provide it.

Comment 4 Chris Lalancette 2007-10-10 21:06:50 UTC

Oh, OK, I was confused.  You are running a fully-virtualized guest; in that
case, independent_wallclock really does nothing for you.  It's an optimization
for paravirtualized guests.  You should really leave it enabled (set to 1) in
the dom0.

As far as the guest drift is concerned, sometimes it has to do with the actual
physical clock drifting, and sometimes it has to do with the guest kernel
requesting too many interrupts a second, which the host can't provide.  If you
run NTP inside the guest, does that help keep it in sync?  Is either the dom0 or
the guest very heavily loaded?  Are you perhaps running a number of guests at
the same time?

Chris Lalancette

Comment 5 tn 2007-10-11 12:27:46 UTC

An NTP sync within the guest does indeed correct the time for one of the two dates.
The other date is corrected by the same offset, but is still wrong.
It is the only guest on the Xen host, and we have very low load.

However, we have some new information now: when we reduce the virtual CPUs from
two to one, the time is OK. So it seems to be a Xen bug that only happens
when the "vcpus" setting is bigger than 1.

It's not a good workaround, because the host is a quad-core machine and we are
basically just using one core now....

Comment 6 AJ Hettema 2007-12-17 14:52:59 UTC

I have the same problem with RHEL5 hosts and RHEL4 and RHEL5 guests on them.
What's the status of this bug? It's giving me a lot of problems.

Comment 7 tn 2007-12-17 15:09:24 UTC

Hello AJ,

now that Red Hat 5.1 is available with Xen 3.1 included, we have just stopped using
fully virtualized hosts. Paravirtual now works MUCH better and the virtual
machines are several times faster than with the "old" method.

However, we also tested the above setup (fully virtualized, more than one
virtual CPU assigned) and are seeing the exact same bug. It's even much worse; the
drifting of time is more extreme.

We just reinstalled all our Xen machines as paravirtual and things are much
better there.

Comment 8 AJ Hettema 2007-12-17 15:19:44 UTC

hi all,

partly in reply to comment 7, partly some more information.

We are running RHEL 5.1 at the latest patch level, 64-bit, on Intel quad-core CPUs with
16GB of memory, with 1, sometimes 2, guests on 1 host. The hosts keep time
quite well, but the fully virtualized guests do not. We cannot make all
guests para because we need some 32-bit guests as well, and they need to be
fully virtualized. Within an hour, the time drifts a couple of seconds. So this
is getting really frustrating.

Comment 9 Chris Lalancette 2008-02-27 03:46:34 UTC

OK.  There are a couple of things to try here to keep your fully virtualized
guest clocks in sync:

1.  Pass "clocksource=pit" in the guest kernel command-line; this should force
it to use the emulated PIT timesource, which seems to be a little more accurate
in the virtual environment than the emulated HPET.

2.  Using RHEL-5.1 or later, use the tick divider by using "divider=10" on the
guest kernel command-line.  This will effectively make the guest use a 100HZ
clock instead of a 1000HZ one, reducing the load on the HV and possibly keeping
your clocks in better sync.  Note a couple of things here, however; you need at
least 2.6.18-53.1.4.el5 (since there were bugs before that), and you can't
currently use the divider option in conjunction with the recommendation in step
1 (because of BZ 427588).

3.  Make sure to run NTP inside the guests to keep the clocks in sync.

Please let us know if some combination of the above improves the situation for you.
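
To illustrate the arithmetic in step 2 and the restriction on combining it with step 1, here is a small sketch that inspects a kernel command line. It is an illustration only and not how the kernel itself parses its options; the helper names are invented, and the base rate of 1000 HZ is taken from the description above.

```python
import os


def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags such as 'ro' map to None."""
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


def effective_tick_rate(cmdline, base_hz=1000):
    """Tick rate after applying divider=N, e.g. 1000 HZ with divider=10 gives 100 HZ."""
    divider = int(parse_cmdline(cmdline).get("divider") or 1)
    return base_hz // divider


def conflicting_workarounds(cmdline):
    """True if clocksource=pit and divider= are both present (see BZ 427588 above)."""
    options = parse_cmdline(cmdline)
    return options.get("clocksource") == "pit" and "divider" in options


if __name__ == "__main__" and os.path.exists("/proc/cmdline"):
    with open("/proc/cmdline") as f:
        running = f.read()
    print("effective tick rate:", effective_tick_rate(running), "HZ")
    print("conflicting options:", conflicting_workarounds(running))
```

Run inside the guest, this reports the tick rate implied by the current boot options and flags the combination that does not currently work.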

Chris Lalancette

Comment 10 tn 2008-02-27 09:02:18 UTC

Hello Chris,

unfortunately I no longer have any fully virtualized systems around. In order to
fix this problem we reinstalled all our systems in a way that let us use a
paravirtual setup.

Sorry, so I am in no position to help you with testing here anymore.

Thank you,
Thomas

Comment 11 James Ralston 2008-02-27 22:44:29 UTC

We are also experiencing problems with clock drift in fully-virtualized guests.

From our experience, running a paravirtualized guest is the best solution; not
only is the performance near-native, but clock drift isn't an issue.

Unfortunately, as with AJ in comment 8, we can't make all of our guests
paravirtualized, because some of them are 32-bit guests.

I'll try the workarounds in comment 9 and report our experiences.

Comment 12 Chris Lalancette 2009-01-22 11:30:08 UTC

Waiting on feedback from the reporters, so I'll put this in NEEDINFO for now.

Chris Lalancette

Comment 13 Rik van Riel 2009-01-26 23:48:08 UTC

This looks related to bug 449346, which is next on my todo list.

Comment 14 fishermania 2009-01-30 19:22:23 UTC

I am running RHEL 5.2 on my base Xen server. On every fully virtualized RHEL 5.2 install I have, I see this bug when more than 1 VCPU is used. I have a test system set up to track this down. I am willing to gather any information needed or test any suggested fixes on this test system. When under load you can literally see the minutes roll by. Please let me know; I am willing to do any testing requested.

Comment 15 Rik van Riel 2009-01-30 19:28:39 UTC

I am backporting a dozen or so HVM timer fixes from newer upstream Xen releases into RHEL 5.  I will upload test RPMs once I have finished the backport and done some initial sanity testing on my systems.

Comment 16 Rik van Riel 2009-04-09 15:06:17 UTC

On further reflection, one VCPU being behind the other may also be caused by the disk emulation code in qemu-dm, which can block timer interrupts going to the virtual CPU that handles the disk in the guest.

I have done a backport of the upstream qemu AIO code, which should alleviate that part of the problem to the point of it disappearing.  I have some (experimental!) test RPMs of the AIO backport available at http://people.redhat.com/riel/.xen-aio/

If you are willing to experiment, trying out those RPMs could give us an important data point.

Comment 17 Rik van Riel 2009-04-10 17:06:04 UTC

This bug can be caused by a combination of two main factors:
- while doing disk IO, one VCPU of an HVM guest can miss timer ticks
- Xen did not re-deliver those missed timer ticks later on, causing clock skew between VCPUs inside an HVM guest

Both of these issues should be resolved with the backport of the AIO disk handling code and upstream Xen 'no missed-tick accounting' timer code. Please test the test RPMs from http://people.redhat.com/riel/.xenaiotime/ and let us know if those (experimental!) test packages resolve the issue.
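
The interaction of the two factors can be shown with a toy model. This is a sketch for illustration only and bears no relation to the actual Xen or qemu-dm code: two VCPUs each count timer ticks, the second one cannot take its tick while it is "blocked" (standing in for emulated disk IO), and the redeliver flag decides whether missed ticks are handed over later. All names and numbers are assumptions.

```python
def simulate(ticks, blocked, redeliver, hz=1000):
    """Return how many seconds the second VCPU lags the first after `ticks` timer ticks.

    `blocked` is the set of tick numbers during which the second VCPU cannot
    take its timer interrupt.
    """
    vcpu0_ticks = 0
    vcpu1_ticks = 0
    pending = 0  # ticks owed to the second VCPU
    for tick in range(ticks):
        vcpu0_ticks += 1  # the first VCPU never misses a tick in this model
        if tick in blocked:
            if redeliver:
                pending += 1  # remember the missed tick for later
        else:
            vcpu1_ticks += 1 + pending  # current tick plus anything owed
            pending = 0
    return (vcpu0_ticks - vcpu1_ticks) / hz


if __name__ == "__main__":
    # Ten seconds at 1000 HZ, with the second VCPU blocked 40% of the time.
    busy = {t for t in range(10000) if t % 100 < 40}
    print("skew without re-delivery:", simulate(10000, busy, redeliver=False), "s")
    print("skew with re-delivery:   ", simulate(10000, busy, redeliver=True), "s")
```

In this model the skew grows steadily as long as missed ticks are dropped, and stays at zero once they are re-delivered, which is consistent with the drift getting worse over time and only appearing with more than one VCPU.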

Comment 19 Jeff Burke 2009-07-13 20:49:14 UTC

I believe this issue is resolved with the changes from Comment #17.

The fix is in the Beta release that addresses this particular request.  Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

*** This bug has been marked as a duplicate of bug 449346 ***