From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7

Description of problem:
I am running a Xen guest (HVM mode) with RHEL5 as the guest OS. After a while I notice that the system date "jumps" around: every second the date alternates between two values. At first the difference is only a few seconds, but over time it grows to several minutes and beyond.

# date; date;
Wed Sep 26 16:50:20 CEST 2007
Wed Sep 26 16:50:16 CEST 2007

A few seconds later:

# date; date;
Wed Sep 26 16:50:48 CEST 2007
Wed Sep 26 16:50:44 CEST 2007

Here the drift is only 4 seconds, but it keeps getting worse. After a while the whole system starts to behave strangely because the time is inconsistent.

Some more info:
Guest: Linux xxx 2.6.18-8.el5 #1 SMP Fri Jan 26 14:15:14 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
Host: Linux xen 2.6.18-8.1.8.el5xen #1 SMP Mon Jun 25 17:19:38 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
jiffies
# cat /proc/sys/xen/independent_wallclock
1

There is no ntpd running, but I do sync the time every few hours against a time server (on both host and guest). No errors are reported.

Version-Release number of selected component (if applicable):
xen-3.0.3-25.0.3.el5

How reproducible:
Always

Steps to Reproduce:
1. Install the guest like any other RHEL5 system, in HVM mode
2. Run the guest
3. Wait a few hours, then run "date; date;"

Actual Results: The two dates differ by several seconds/minutes.

Expected Results: The same date.

Additional info:
Any news here? We are running into serious problems, because we use the Xen-enabled hosts in an HA-Linux cluster environment, and all the cluster failover mechanisms are broken because of the flapping time. These clusters are unfortunately already in production, which makes this really critical for us.
Why do you have /proc/sys/xen/independent_wallclock set to 1? If these are paravirt guests (which it seems like they are), you really want independent_wallclock set to 0 so that the guest keeps its time in sync with the HV. Otherwise it is a losing battle, since the guest cannot account for time during which it wasn't on the processor at all. Chris Lalancette
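For reference, on a paravirtualized guest the setting can be checked and changed like this (a sketch; the /proc/sys/xen/ entry exists only in Xen-aware kernels, so this cannot be run on an ordinary system):

```shell
# Check the current value (Xen paravirt guests only)
cat /proc/sys/xen/independent_wallclock

# 0 = follow the hypervisor's wallclock (recommended for paravirt guests)
echo 0 > /proc/sys/xen/independent_wallclock

# To persist across reboots, add this line to /etc/sysctl.conf:
#   xen.independent_wallclock = 0
```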
At first /proc/sys/xen/independent_wallclock was set to 0 on the Xen host, but we had the problem with the guest then as well. We tried to fix the problem by setting it to 1 and enabling an NTP sync on the guest. But even with the NTP sync the times keep drifting apart. (Current uptime of the guest is 6 days, and two "/bin/date" calls in the shell right after each other print a difference of 10 minutes.) I am not sure whether the guest is paravirtual; it is completely hardware-emulated, and there is nothing Xen-related installed on it. (It's a plain RHEL5 install.) If you need more information I am more than happy to provide it.
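To put those figures in perspective, 10 minutes of skew over 6 days of uptime works out to roughly 4 seconds of drift per hour, a quick back-of-the-envelope check:

```shell
# Back-of-the-envelope drift rate from the figures above:
# 10 minutes of skew accumulated over 6 days of uptime
skew_seconds=$((10 * 60))    # 600 s total skew
uptime_hours=$((6 * 24))     # 144 h of uptime
echo "$((skew_seconds / uptime_hours)) seconds of drift per hour"
```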
Oh, OK, I was confused. You are running a fully virtualized guest; in that case, independent_wallclock really does nothing for you. It's an optimization for paravirtualized guests. You should really leave it enabled (set to 1) in the dom0. As far as the guest drift is concerned, sometimes it has to do with the actual physical clock drifting, and sometimes it has to do with the guest kernel requesting more timer interrupts per second than the host can deliver. If you run NTP inside the guest, does that help keep it in sync? Is either the dom0 or the guest very heavily loaded? Are you perhaps running a number of guests at the same time? Chris Lalancette
An NTP sync within the guest does indeed correct the time for one of the two dates. The other date corrects itself by the same offset, but is still wrong. It is the only guest on the Xen host, and we have very low load. However, we have some new information now: when we reduce the number of virtual CPUs from two to one, everything is fine. So it seems to be a Xen bug that only occurs when the "vcpus" setting is greater than 1. It's not a good workaround, because the host is a quad-core machine and we are now basically using just one core...
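For anyone applying the same workaround: the VCPU count of a guest is set in its Xen config file under /etc/xen/ (a sketch; the guest name "myguest" below is hypothetical):

```shell
# /etc/xen/myguest  -- "myguest" is a hypothetical guest name
# Change the VCPU count from 2 to 1:
vcpus = 1

# Then restart the guest for the change to take effect:
#   xm shutdown myguest
#   xm create /etc/xen/myguest
```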
I have the same problem with RHEL5 hosts and RHEL4 and RHEL5 guests on them. What's the status of this bug? It's giving me a lot of problems.
Hello AJ, now that Red Hat 5.1 is available with Xen 3.1 included, we have simply stopped using fully virtualized guests. Paravirtual now works MUCH better, and the virtual machines are several times faster than with the "old" method. However, we also tested the setup above (fully virtualized, more than one virtual CPU assigned) and hit exactly the same bug. It's even much worse: the time drift is more extreme. We have reinstalled all our Xen machines as paravirtual, and things are much better there.
Hi all, partly in reply to comment 7, partly some more information. We are running RHEL5.1 at the latest patch level, 64-bit, on Intel quad-core CPUs with 16 GB of memory, with 1, sometimes 2, guests per host. The hosts keep time quite well, but the fully virtualized guests do not. We cannot make all guests paravirtualized because we also need some 32-bit guests, and those have to be fully virtualized. Within an hour, the time drifts by a couple of seconds. So this is getting really frustrating.
OK. There are a couple of things to try here to keep your fully virtualized guest clocks in sync:

1. Pass "clocksource=pit" on the guest kernel command line; this should force the guest to use the emulated PIT timesource, which seems to be a little more accurate in the virtual environment than the emulated HPET.

2. On RHEL-5.1 or later, use the tick divider by passing "divider=10" on the guest kernel command line. This will effectively make the guest use a 100 HZ clock instead of a 1000 HZ one, reducing the load on the HV and possibly keeping your clocks in better sync. Note a couple of things here, however: you need at least kernel 2.6.18-53.1.4.el5 (there were bugs before that), and you currently cannot use the divider option in conjunction with the recommendation in step 1 (because of BZ 427588).

3. Make sure to run NTP inside the guests to keep the clocks in sync.

Please let us know if some combination of the above improves the situation for you. Chris Lalancette
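For reference, a sketch of what the corresponding grub.conf entry inside the guest might look like (the kernel version, root device, and volume names are illustrative; do not combine "divider=" with "clocksource=pit" because of BZ 427588):

```shell
# /boot/grub/grub.conf inside the HVM guest (illustrative entry)
title Red Hat Enterprise Linux Server (2.6.18-53.1.4.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-53.1.4.el5 ro root=/dev/VolGroup00/LogVol00 divider=10
        initrd /initrd-2.6.18-53.1.4.el5.img
```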
Hello Chris, unfortunately I no longer have any fully virtualized systems around. In order to fix this problem, we reinstalled all our systems in a way that let us use a paravirtual setup. Sorry, so I am no longer in a position to help with testing here. Thank you, Thomas
We are also experiencing problems with clock drift in fully-virtualized guests. From our experience, running a paravirtualized guest is the best solution; not only is the performance near-native, but clock drift isn't an issue. Unfortunately, as with AJ in comment 8, we can't make all of our guests paravirtualized, because some of them are 32-bit guests. I'll try the workarounds in comment 9 and report our experiences.
Waiting on feedback from the reporters, so I'll put this in NEEDINFO for now. Chris Lalancette
This looks related to bug 449346, which is next on my todo list.
I am running RHEL 5.2 on my base Xen server. On any fully virtualized RHEL 5.2 install, I see this bug when more than 1 VCPU is used. I have a test system set up to track this down. I am willing to gather any information needed or test any suggested fixes on this test system. When under load you can literally see the minutes roll by. Please let me know; I am willing to do any testing requested.
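One quick way to confirm that the skew sits between the guest's VCPUs (a diagnostic sketch, assuming the taskset utility from util-linux is installed in the guest): pin the date command to each virtual CPU in turn and compare the readings.

```shell
# Pin 'date' to each VCPU in turn; on an affected 2-VCPU guest the
# two readings can differ by the accumulated skew
taskset -c 0 date
taskset -c 1 date
```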
I am backporting a dozen or so HVM timer fixes from newer upstream Xen releases into RHEL 5. I will upload test RPMs once I have finished the backport and done some initial sanity testing on my systems.
On further reflection, one VCPU being behind the other may also be caused by the disk emulation code in qemu-dm, which can block timer interrupts going to the virtual CPU that handles the disk in the guest. I have done a backport of the upstream qemu AIO code, which should alleviate that part of the problem to the point of it disappearing. I have some (experimental!) test RPMs of the AIO backport available at http://people.redhat.com/riel/.xen-aio/ If you are willing to experiment, trying out those RPMs could give us an important data point.
This bug can be caused by a combination of two main factors:
- while doing disk I/O, one VCPU of an HVM guest can miss timer ticks
- Xen did not re-deliver those missed timer ticks later on, causing clock skew between the VCPUs inside an HVM guest

Both of these issues should be resolved by the backport of the AIO disk handling code and the upstream Xen "no missed-tick accounting" timer code. Please test the test RPMs from http://people.redhat.com/riel/.xenaiotime/ and let us know whether those (experimental!) test packages resolve the issue.
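For testers, a sketch of fetching and installing the experimental packages (the RPM file names below are hypothetical; check the directory listing for the actual files matching your architecture):

```shell
# Download the test packages (file names below are hypothetical)
wget -r -np -nd -P /tmp/xenaiotime http://people.redhat.com/riel/.xenaiotime/
cd /tmp/xenaiotime

# Dry run first, then install
rpm -Uvh --test xen-*.rpm
rpm -Uvh xen-*.rpm
```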
I believe this issue is resolved with the changes from Comment #17. The fix is in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! *** This bug has been marked as a duplicate of bug 449346 ***