Description of problem:
After doing a save/restore of a Xen guest, e.g. to reboot dom0, crond no longer starts cron jobs. Restarting crond after a save/restore works around the issue.

Version-Release number of selected component (if applicable):
vixie-cron-4.1-70.el5

Steps to Reproduce:
1. Have a dom0 with various (paravirt) Xen guests.
2. Set up a regular cron job in one of the guests.
3. Type "init 6" in dom0.
4. Watch the guests get saved to disk.
5. After the reboot, watch the guests get loaded.
6. Watch /var/log/cron in the guest with the test cron job.
7. Cron jobs do not get started after the guest restore.

Expected results:
Cron jobs continue to be started after a save/restore.

Additional info:
This seems to happen in both RHEL4 and RHEL5 paravirt guests.
I am not yet sure how to reliably reproduce this, or whether it also affects other programs. However, I did just notice that I had two-week-old email sitting in the queue, and kicking exim dislodged it. I have only seen that once, though, so it could be another fluke. The problem of crond not starting any cron jobs after a save/restore is something I have seen multiple times on 3 virtual machines, so that much is confirmed...
I just found other stuck processes, so it's not just crond.

# strace -p 24859
Process 24859 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>

Unfortunately, I have no idea which syscall it is restarting. The good news is that the only syscalls that should take a long time in this process are nanosleep and wait4 - both of which are potential candidates for the crond trouble, too.
I have the same issue with migration at a customer site, with dom0s and domUs running RHEL5u1 x86_64. The first time a domU migrates to another dom0 it seems to be okay. The second migration succeeds according to xen migrate, but not all services/applications come up again. In this case the clock (date) is no longer running, and ssh access isn't there anymore either.
Here is some time information about the migration.
- server3 and server4 are the dom0s
- student08 is the domU migrated from server3 to server4

#Before migration
[root@satellite ~]# for i in server3 server4 student12 student08; do echo $i: `ssh $i date`; done
server3: Fri Feb 8 12:41:14 CET 2008
server4: Fri Feb 8 12:41:14 CET 2008
student12: Fri Feb 8 12:41:14 CET 2008
student08: Fri Feb 8 12:41:14 CET 2008

#After migration
[root@satellite ~]# for i in server3 server4 student12 student08; do echo $i: `ssh $i date`; done
server3: Fri Feb 8 12:41:22 CET 2008
server4: Fri Feb 8 12:41:22 CET 2008
student12: Fri Feb 8 12:41:22 CET 2008
student08: Fri Feb 8 12:48:16 CET 2008

#HW clock, checked at the same time.
[root@server3 virt]# hwclock --show
Fri 08 Feb 2008 12:51:50 PM CET  -0.739546 seconds
[root@server4 ~]# hwclock --show
Fri 08 Feb 2008 12:45:49 PM CET  -0.291921 seconds

################################
Fixing the HW clock: hwclock -w

#Before migration
[root@satellite ~]# for i in server3 server4 student12 student08; do echo $i: `ssh $i date`; done
server3: Fri Feb 8 13:06:00 CET 2008
server4: Fri Feb 8 13:06:00 CET 2008
student12: Fri Feb 8 13:06:00 CET 2008
student08: Fri Feb 8 13:06:00 CET 2008

#After migration
[root@satellite ~]# for i in server3 server4 student12 student08; do echo $i: `ssh $i date`; done
server3: Fri Feb 8 13:09:22 CET 2008
server4: Fri Feb 8 13:09:22 CET 2008
student12: Fri Feb 8 13:09:23 CET 2008
student08: Fri Feb 8 13:16:16 CET 2008

#HW clock
[root@satellite ~]# for i in server3 server4; do echo $i date: `ssh $i date`; echo $i hwclock: `ssh $i hwclock --show`; done
server3 date: Fri Feb 8 13:10:31 CET 2008
server3 hwclock: Fri 08 Feb 2008 01:10:33 PM CET -0.858803 seconds
server4 date: Fri Feb 8 13:10:32 CET 2008
server4 hwclock: Fri 08 Feb 2008 01:10:34 PM CET -0.707565 seconds
The same time is reported by multiple date commands over several seconds:

#Before migration
[root@satellite ~]# for i in server3 server4 student12 student08; do echo $i: `ssh $i date`; done
server3: Fri Feb 8 14:18:54 CET 2008
server4: Fri Feb 8 14:18:54 CET 2008
student12: Fri Feb 8 14:18:54 CET 2008
student08: Fri Feb 8 14:18:54 CET 2008

#After migration of student08 from server4 to server3, the date command:
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008
[root@student08 ~]# date
Fri Feb 8 14:18:59 CET 2008

Time diffs:
[root@satellite ~]# for i in server3 server4 student12 student08; do echo $i: `ssh $i date`; done
server3: Fri Feb 8 14:19:34 CET 2008
server4: Fri Feb 8 14:19:34 CET 2008
student12: Fri Feb 8 14:19:34 CET 2008
student08: Fri Feb 8 14:18:59 CET 2008
*** Bug 430245 has been marked as a duplicate of this bug. ***
I have experienced the same issue doing live migrations. Normally this is fixed by rebooting all the nodes in the cluster. Time on the guest is fine before the migration; once the migration completes, the clock on the guest will either be stuck (time will not change) or it will be set to some arbitrary time/date which I have not been able to correlate to anything of note. Using the date command has worked once or twice to get a stuck clock going again; however, if the clock is still running but at an incorrect value, nothing helps aside from shutting the guest down and starting it again.

Once the guest is broken I get this on the console, which may or may not be related:

BUG: soft lockup detected on CPU#1!

Call Trace:
 <IRQ> [<ffffffff802aaa83>] softlockup_tick+0xd5/0xe7
 [<ffffffff8026cb4a>] timer_interrupt+0x396/0x3f2
 [<ffffffff80210afe>] handle_IRQ_event+0x2d/0x60
 [<ffffffff802aae0b>] __do_IRQ+0xa4/0x105
 [<ffffffff80288712>] _local_bh_enable+0x61/0xc5
 [<ffffffff8026a90e>] do_IRQ+0xe7/0xf5
 [<ffffffff8025db2b>] child_rip+0x11/0x12
 [<ffffffff8039664c>] evtchn_do_upcall+0x86/0xe0
 [<ffffffff8025d8ce>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8026be87>] raw_safe_halt+0x84/0xa8
 [<ffffffff80269453>] xen_idle+0x38/0x4a
 [<ffffffff80247b8e>] cpu_idle+0x97/0xba
No, that softlockup is probably not related, and, in fact, that message is probably fixed in 5.2 by the patch in BZ 250994. The time stopping is definitely a problem, though. Chris Lalancette
I have just upgraded my home system to 5.2 beta and rebooted dom0. Inside the guests, crond got stuck and I took a crashdump of one of the guests. I'll try to get more info on what is going wrong soon.
We commonly experience this bug in GLS. Let me know if any resources from the training organization might be helpful in dealing with this issue. -Zak Brown
Hi Zak, one of the things you can do is take a crash dump of the domain after migration and analyze the domain to see what's going on. This is the first time I have used a crash dump to analyze a bug (I started kernel hacking before tools were available, and I still don't use them :)), so maybe you'll find something that I miss...
I experienced the same issue:
- Red Hat Enterprise Linux 5.1 Xen host
- Xen guest running Red Hat Enterprise Linux 4 U6, 64-bit

As a workaround for the varying time, I isolated the Xen guest from its host clock:

echo "1" > /proc/sys/xen/independent_wallclock

For persistence, add:

xen.independent_wallclock = 1

to /etc/sysctl.conf and activate with sysctl -p.

When migrating live Xen guests within a Cluster Suite environment, this solved the wandering time issue I was experiencing, with only a short pause on the guest. This does mean, however, that you will need to rely on NTP for your Xen guests to maintain time sync. Hope this is useful.
This sysctl.conf parameter worked fine for me on RHEL 5.1 guests on RHEL 5.1 hosts. No more clock issues with live migration.
I was at the customer site today and tested the independent_wallclock parameter with success. So at this customer, too, live migration works smoothly with RHEL 5.1 dom0s and domUs.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
One report of this problem mentions that in their case, time goes forward or stalls when migrating to a system with greater uptime than the originating host, and shifts backward when migrating to a system with shorter uptime. Not sure if this is related/relevant or already known, but I didn't see it mentioned in this bugzilla.
Thanks for info.
For your information, I've got the same problem with Xen 3.2.1 from xen.org. I have only tested it with live migration: I move a VM from A to B and then back to A again, and the clock stops.
http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1282
The echo "1" > /proc/sys/xen/independent_wallclock workaround works for me. I set this on both dom0s and the domU.
The independent_wallclock setting works fairly well for live migration, but has the issue that the wall clock time does not advance across a dom0 reboot (where the guests are saved and restored), which puts the guest several minutes behind for me. I suspect we should just reset the monotonic_tv values on restore and live migrate. This can cause time to go backwards in guests (if the dom0 clocks are out of sync), but at least things will continue to run. I will run some experiments with this.
Created attachment 311400 [details] patch to reset monotonic_tv.* on backwards time jump If the monotonic_tv time is more than 1/8th of a second (maximum drift ntpd allows) ahead of the hypervisor time, reset monotonic_tv. This patch is still untested.
Created attachment 311410 [details] add printk, timespec uses ns, timeval uses usec
Comment on attachment 311410 [details] add printk, timespec uses ns, timeval uses usec Since gettimeofday should never go backwards, I believe it is better to have it return the same time over and over again than make a backwards jump.
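The intent of the two attachments above can be sketched as follows. This is an illustration only, not the actual patch: the function and variable names (xtime_sample, monotonic_ns, hv_time_ns) and the flat-nanosecond representation are mine, but the two rules match the comments above: reset the stored monotonic value if it runs more than 1/8 s (the maximum drift ntpd allows) ahead of the hypervisor time, and otherwise never report a time earlier than the last one handed out.

```c
#include <stdint.h>

#define MAX_DRIFT_NS 125000000LL   /* 1/8 s, maximum drift ntpd allows */

/* Return the time gettimeofday should report, given the current
 * hypervisor time and the last value handed out (monotonic floor). */
int64_t xtime_sample(int64_t hv_time_ns, int64_t *monotonic_ns)
{
    if (*monotonic_ns - hv_time_ns > MAX_DRIFT_NS) {
        /* Stored value is impossibly far ahead (e.g. after a restore
         * onto a host with a smaller clock): reset it rather than keep
         * reporting a bogus future time forever. */
        *monotonic_ns = hv_time_ns;
        return hv_time_ns;
    }
    if (hv_time_ns < *monotonic_ns)
        return *monotonic_ns;       /* never step backwards */
    *monotonic_ns = hv_time_ns;
    return hv_time_ns;
}
```

With small backwards noise the caller keeps seeing the same time over and over (as comment above prefers), while a large jump triggers the reset.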
NTP syncing is no fix for this problem. Jiffies and realtime in the guest seem to advance normally after a live migration between two NTP synced dom0s, but the wakeup event still gets lost! I guess the main problem is that wakeups can get lost after a live migrate. Not having the time the same between two hosts can be fixed by NTP syncing the dom0s involved.
OK, it turns out NTP syncing does not always fix this problem. However, an observed time jump correlates very nicely with the difference in uptime between the host systems!

On the guest:

# while sleep 1 ; do date ; done
Wed Aug 6 14:45:48 EDT 2008
Sun Aug 10 20:38:32 EDT 2008

The dom0s in question:

[root@tethys ~]# uptime
 15:23:52 up 6 days, 6:25, 1 user, load average: 0.11, 0.07, 0.01
[root@kenny xen]# uptime
 15:23:54 up 2 days, 33 min, 1 user, load average: 0.00, 0.00, 0.00
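The correlation can be checked arithmetically from the transcript above; this is just a throwaway calculation, not part of any fix:

```c
/* Convert an uptime reported as "N days, HH:MM" to seconds. */
long uptime_seconds(long days, long hours, long minutes)
{
    return days * 86400 + hours * 3600 + minutes * 60;
}

/* Difference in host uptimes: tethys up 6 days 6:25, kenny up 2 days 0:33. */
long host_uptime_diff(void)
{
    return uptime_seconds(6, 6, 25) - uptime_seconds(2, 0, 33);
}

/* Observed guest clock jump: Aug 6 14:45:48 -> Aug 10 20:38:32,
 * i.e. 4 days, 5 hours, 52 minutes, 44 seconds. */
long observed_jump(void)
{
    return uptime_seconds(4, 5, 52) + 44;
}
```

The jump (366764 s) matches the uptime difference (366720 s) to within the one-minute precision of the uptime output.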
OK, found a problem. On the kernel side, get_time_values_from_xen() gets its time by looking at the vcpu_time_info struct that is exported by the hypervisor. This contains the field "system_time", which is the time in nanoseconds since system bootup - host system bootup! This means the guest keeps its clock by looking at how much time has elapsed since the host booted up. This totally breaks down in a live migrate, when the guest moves from a host with one uptime to a host with another uptime. I'll start a discussion upstream to see what kind of fix would be best.
Fixed with upstream changeset xen-unstable.hg:15706. It turns out it was a userspace bug, with the restore code overwriting some time data of the current hypervisor with data from the old hypervisor. The problem is that Xen time is calculated as (HV boot time + HV uptime), but on a restore the userland code would clobber the HV boot time with the boot time of the HV on which the guest previously ran. This screws up timekeeping. Avoiding that clobber avoids the problem.
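A toy model of the arithmetic described in the last two comments makes the failure mode easy to see. The struct and field names here are illustrative stand-ins, not the real Xen structures: guest wallclock is HV boot time plus HV uptime, and the restore bug was that the old HV's boot time got combined with the new HV's uptime.

```c
#include <stdint.h>

/* Illustrative stand-in for the time fields a PV guest reads from
 * the hypervisor (not the real Xen shared-info layout). */
struct hv_time {
    int64_t wc_boot_ns;    /* wallclock at hypervisor boot */
    int64_t system_time;   /* ns elapsed since hypervisor boot */
};

/* Guest wallclock = HV boot time + HV uptime. */
int64_t guest_wallclock(const struct hv_time *t)
{
    return t->wc_boot_ns + t->system_time;
}
```

With both hosts agreeing on the current wallclock, each host's pair is internally consistent, but mixing the old host's boot time with the new host's uptime shifts the guest clock by exactly the difference in host uptimes, which is precisely what comment #27's measurements showed.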
Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.
it looks like this is slated for 5.3. any chance of an interim fix until then?
Making hotfix RPMs available is something only support can do. However, as an engineer I can provide *test* RPMs. Note that these contain all kinds of other commits (they are a snapshot of RHEL 5.3 development), some of which may end up being reverted before 5.3 comes out. If you still want to test them, you can get a test RPM from http://people.redhat.com/riel/.bz426861/ If you feel adventurous, feel free to try it out. Please let us know whether or not the RPM works for you. With NTP-synced dom0s and the test RPM, live migration of paravirtualized guests should work correctly.
is NTP syncing of the clocks necessary? or just the updated userland restore tool?
Joe, a paravirtualized Xen guest derives its clock from the hypervisor clock. As a consequence, if you have two host systems with their clocks out of sync, a live migration will cause a clock jump in the guest. If the clocks of both hosts are synced (preferably using ntp), everything will work fine.
*** Bug 459384 has been marked as a duplicate of this bug. ***
Rik, there is no i386 RPM in your test download, and I had trouble building one:

+ chmod +x tools/check/check_libvncserver
+ CFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables'
+ /usr/bin/make XENFB_TOOLS=n XEN_PYTHON_NATIVE_INSTALL=1 DESTDIR=/var/tmp/xen-3.0.3-69.0-root tools docs
/usr/bin/make -C tools install
make[1]: Entering directory `/usr/src/redhat/BUILD/xen-3.1.0-src/tools'
/usr/bin/make -C check
make[2]: Entering directory `/usr/src/redhat/BUILD/xen-3.1.0-src/tools/check'
make[2]: *** ../../.config: Is a directory.  Stop.
make[2]: Leaving directory `/usr/src/redhat/BUILD/xen-3.1.0-src/tools/check'
make[1]: *** [check] Error 2
make[1]: Leaving directory `/usr/src/redhat/BUILD/xen-3.1.0-src/tools'
make: *** [install-tools] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.4331 (%build)

RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.4331 (%build)

Any chance of getting an i386 RPM?
i grabbed the srpm that rik provided and rebuilt on my system and it has fixed the problem. so hopefully this will get pushed into a real release sometime soon.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,3 +1 @@
-In order for the time of a paravirtualized Xen guest to stay constant during a live migration, the dom0s should have their clocks in sync. Using NTP to sync the clocks of the hosts is recommended.
+In live migrations of paravirtualized guests, time-dependent guest processes may function improperly if the corresponding hosts' (dom0) times are not synchronized. Use NTP to synchronize system times for all corresponding hosts before migration.
-
-(please wordsmith this into something readable :))
Built into xen-3.0.3-71.el5
Re Comment #54. You need to get the hotfix flag set to ? I expect. I cannot do it...
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1,3 @@
+The following release note is no longer required.
+
 In live migrations of paravirtualized guests, time-dependent guest processes may function improperly if the corresponding hosts' (dom0) times are not synchronized. Use NTP to synchronize system times for all corresponding hosts before migration.
Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,3 +1 @@
-The following release note is no longer required.
-
 In live migrations of paravirtualized guests, time-dependent guest processes may function improperly if the corresponding hosts' (dom0) times are not synchronized. Use NTP to synchronize system times for all corresponding hosts before migration.
I can confirm that xen-3.0.3-69.0 recompiled for 32 bits fixes this issue, and this is the first time since RHEL 5.0 (one year ago) that I am able to live migrate all my VMs several times between two hosts without any glitch! It is a pity having to wait for the 5.3 release for such a blocking bug! Good job correcting it. Regards,
I can also confirm that this fixed my live migrations as well. Please don't make us wait for 5.3 to get this issue fixed.
Created attachment 319209 [details] Don't clobber wallclock on restore The final patch applied to the RPM for RHEL-5.3
now that there is a new version of xen (3.0.3-64.el5_2.3), for those of us who have installed the 3.0.3-69.0 version, what should we do? is the errata RHSA-2008:0892-10 an issue for 3.0.3-69?
Seems -69.0 already contains those security updates, so I assume you can continue using it. regards, Florian La Roche
Further, I'm not quite sure what running -69.0 is buying you; the patch to fix this issue wasn't committed until -71. That being said, it is all beta stuff, and not supported at all yet, so I would usually recommend to stay with the supported stuff. There is an updated xen package coming out soon (if it's not already out) with this fix in it. Chris Lalancette
(In reply to comment #71)
> now that there is a new version of xen (3.0.3-64.el5_2.3), for those of us who
> have installed the 3.0.3-69.0 version, what should we do? is the errata
> RHSA-2008:0892-10 an issue for 3.0.3-69?

According to bug 464455 the fix will be in 3.0.3-64.el5_2.4.

(In reply to comment #73)
> Further, I'm not quite sure what running -69.0 is buying you; the patch to fix
> this issue wasn't committed until -71.

I don't know which issue you are talking about, but with the existing -64, live migrate will freeze the domU timer about 80% of the time. With -69 I've yet to have the clock freeze. So obviously there is something between -64 and -69 that fixes this issue.
Huh, very odd. Like I said, the patch we specifically proposed and integrated for this BZ was committed in -71 (I know, I did it!). If something else between -64 and -69 helped the situation, then great, but we also need this patch (and the fix that's in this BZ is also the one that is going to be committed to 3.0.3-64.el5_2.4). Chris Lalancette
Chris, I believe the -69 they are referring to is the test RPM I put on my people.redhat.com page, not the -69 directly from our CVS tree. Joe, Shad, xen 3.0.3-64.el5_2.4 will have the migrate fix that you need.
yes, i am talking about the -69 rpm that rik provided. when is the 3.0.3-64.el5_2.4 version going to be released? and i've always wondered what the best way is to go backwards on an rpm. is rpm -Uvh --force the best way?
Joe, you'll want to use rpm -Uvh --oldpackage
I'm also refering to the one in Rik's page.
so rik, do you have an idea when the 2.4 rpm will be released?
*** Bug 467253 has been marked as a duplicate of this bug. ***
This is retested with -80. Verifying the bug.
Hey, many people are waiting for this fix, so please release it for 5.2! Nobody understands why this small but urgent fix needs months of QA.

Sincerely,
Klaus
Klaus, the fixed Xen package for RHEL 5.2 (xen-3.0.3-64.el5_2.4) was released on November 11th.
2.4 is not available via yum; 2.3 is the latest version I see:

# yum list xen
Loading "rhnplugin" plugin
rhel-i386-server-5        100% |=========================| 1.4 kB 00:00
rhel-i386-server-vt-5     100% |=========================| 1.4 kB 00:00
rhn-tools-rhel-i386-serve 100% |=========================| 1.2 kB 00:00
Available Packages
xen.i386    3.0.3-64.el5_2.3    rhel-i386-server
Rik, are you really sure the package has gone out? I can't find any errata regarding xen-3.0.3-64.el5_2.4, nor any source RPM on the Red Hat servers. Also, both CentOS and Scientific Linux don't have it, so I suspect it has not gone out.

Sincerely,
Klaus
You are right, it appears that xen-3.0.3-64.el5_2.4 is not on RHN. From what I understood it was supposed to have been released. I will try to figure out what happened.
Hi Rik,

did you find out what happened? Can we expect the release of xen-3.0.3-64.el5_2.4?

Sincerely,
Klaus
(In reply to comment #99)
> Hi Rik,
>
> did you find out what happened? Can we expect the release of
> xen-3.0.3-64.el5_2.4?

Klaus, it was just pushed out today (January 7th).

http://rhn.redhat.com/errata/RHSA-2009-0003.html

Thanks.
Klaus, the reason for the delay is that this fix was in an RPM update that also contained two security fixes. These required quite extensive QA before we could release them, to avoid the risk of regressions, which unfortunately delayed the release of this timer/migration bug fix. As Gurhan just mentioned, this should be on RHN today.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-0118.html
*** Bug 360741 has been marked as a duplicate of this bug. ***