Description of problem:
After upgrading a guest to a problematic kernel, I found that the guest crashed at boot time. Because 'enable-dump' was set to 'yes' in xend and 'on_crash' was set to 'restart' in the config file, a core dump was taken for the guest. As a result, the elapsed time measured in xend's restart() function is inflated so that it exceeds MINIMUM_RESTART_TIME (20 s). Thus xend does not destroy the guest even though it crashes early at boot time, and the guest drops into a restart-crash-dumpcore loop. My domain0 filled up with core dump files.

Version-Release number of selected component (if applicable):
xen-3.0.3-115.el5

How reproducible:
Always

Steps to Reproduce:
1. Edit the guest's grub file to make it crash at boot time
2. Set 'enable-dump' to 'yes' in xend and 'on_crash' to 'restart' for the guest
3. Start the guest

Actual results:
The guest drops into a restart-crash-dumpcore loop. Domain0 fills up with core dump files.

Expected results:
xend should prevent such loops.

Additional info:
I modified xend to print the elapsed time at key points, and got these results:

[2010-08-18 16:00:34 xend.XendDomainInfo 17839] WARNING (XendDomainInfo:1185) Domain has crashed: name=vm1 id=114.
[2010-08-18 16:00:34 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1187) elapse time when guest crashes 4.61353492737
[2010-08-18 16:00:34 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1195) Starting automatic crash dump
[2010-08-18 16:00:53 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1205) elapse time after core dump 24.5423090458
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2723) elapse time when compared with MINIMUM_RESTART_TIME: 24.8053679466
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:1035) Storing domain details: {'console/ring-ref': '1180241', 'console/port': '2', 'cpu/3/availability': 'online', 'name': 'vm1', 'console/limit': '1048576', 'cpu/2/availability': 'online', 'vm': '/vm/1efb30c3-86fd-9dd7-4934-9b72b6a833fc', 'domid': '114', 'cpu/0/availability': 'online', 'memory/target': '1536000', 'store/ring-ref': '1180242', 'cpu/1/availability': 'online', 'store/port': '1'}
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2262) XendDomainInfo.destroy: domid=114
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2187) UUID Created: True
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2188) Devices to release: [], domid = 114
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2200) Releasing PVFB backend devices ...
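The failure mode can be seen in miniature from the timestamps above: restart() measures elapsed time up to the moment it runs, so the ~20 s spent writing the core dump counts toward the guest's "uptime" and defeats the MINIMUM_RESTART_TIME check. The following is a simplified sketch of that check (not xend's actual code; `should_restart` is a hypothetical name, and the timestamps are taken from the log):

```python
MINIMUM_RESTART_TIME = 20  # seconds; the same constant xend compares against

def should_restart(previous_restart_time, now):
    # Simplified form of the check in XendDomainInfo.restart(): only
    # restart if enough time has passed since the previous restart.
    return now - previous_restart_time > MINIMUM_RESTART_TIME

# From the log: the guest crashed ~4.6 s after boot, but the core dump
# took ~20 s, so restart() runs at ~24.8 s of apparent elapsed time.
crash_elapsed = 4.61        # how long the guest actually ran
after_dump_elapsed = 24.81  # what restart() actually measures

print(should_restart(0.0, crash_elapsed))       # False - loop would be stopped
print(should_restart(0.0, after_dump_elapsed))  # True - xend restarts anyway
```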
Created attachment 439328 [details]
patch to prevent loops

I have worked out a patch to solve this problem: it records the time at which the guest crashes and uses that time to compute the elapsed time in the restart() function. I have tested this scenario with the patch applied; xend now destroys the guest when it crashes at boot time.
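The idea behind the patch can be sketched as follows (hypothetical class and method names, not the actual patch, which modifies XendDomainInfo directly): record the wall-clock time at the moment the crash is detected, and have the restart decision measure elapsed time against that timestamp instead of the current time, so the duration of the core dump no longer counts.

```python
MINIMUM_RESTART_TIME = 20  # seconds

class DomainRestartPolicy:
    """Sketch of the patched behaviour: a crash timestamp is recorded
    before the (slow) core dump is taken, and used by the restart check."""

    def __init__(self):
        self.previous_restart_time = None
        self.crash_time = None

    def on_crash_detected(self, now):
        # Recorded when the crash is first seen, before dumping the core.
        self.crash_time = now

    def may_restart(self, now):
        if self.previous_restart_time is None:
            return True
        # Measure against the crash timestamp, not the post-dump time.
        reference = self.crash_time if self.crash_time is not None else now
        return reference - self.previous_restart_time > MINIMUM_RESTART_TIME

policy = DomainRestartPolicy()
policy.previous_restart_time = 0.0
policy.on_crash_detected(4.61)    # crash ~4.6 s after boot
# the core dump takes ~20 s; restart() runs at t = 24.81
print(policy.may_restart(24.81))  # False: the loop is broken
```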
Thanks for the patch, Yufang. Can you please test this scenario with xen-3.0.3-114.el5? Is there any relevant difference in behavior?
Created attachment 439572 [details]
xend.log

Hi Miroslav, I have tested this scenario with xen-3.0.3-114.el5. I hit the same problem, with one small difference in behaviour: xend restarts the guest after its first boot-and-crash, because 'xend/previous_restart_time' is None at that point. After the guest reboots, the elapsed time is again inflated by the core dump, so the guest drops into the loop. You can find more detailed information in the attached xend.log.
With xen-116 and kernel-xen-233, I can reproduce this bug with the steps from comment#1 using a RHEL5.5-64bit PV guest whose elapsed time is larger than MINIMUM_RESTART_TIME. With xen-119 and kernel-xen-233, however, the bug still occurs if I run "xm list" continually during the process. Details follow:

Version-Release number of selected component (if applicable):
xen-3.0.3-119.el5
kernel-xen-2.6.18-233.el5

How reproducible:
Always

Actual steps:
1. Edit the guest's grub file to make it crash at boot time
2. Set 'enable-dump' to 'yes' in xend and 'on_crash' to 'restart' for the guest
3. Start the guest
4. At the same time as step 3, open another console and run "xm list" continually

Actual results:
1. The guest drops into a restart-crash-dumpcore loop. Domain0 fills up with core dump files.
2. Once "xm list" is stopped, the guest works well afterwards.
Created attachment 462276 [details] xend.log of Xen-119
I see the problem in the log. I do not know why, but in your case refreshShutdown is called more than once, and on each call the time of the crash is rewritten. This is probably due to xm list blocking xend from handling the crash dump immediately. I will rewrite the patch to cover this. Without xm list interfering, was this reproducible on 119?
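One way to make the timestamp robust against repeated refreshShutdown calls is to record it only the first time the crashed state is seen. A minimal sketch of that guard, assuming hypothetical names (the real fix lives inside XendDomainInfo.refreshShutdown):

```python
class CrashTimeRecorder:
    """Record the crash time only once per crash, so that repeated
    refreshShutdown() invocations (e.g. while `xm list` polls xend)
    cannot push the timestamp forward past MINIMUM_RESTART_TIME."""

    def __init__(self):
        self.crash_time = None

    def refresh_shutdown(self, crashed, now):
        # Only the first observation of the crashed state sets the time.
        if crashed and self.crash_time is None:
            self.crash_time = now

    def clear(self):
        # Reset once the restart decision has been made.
        self.crash_time = None

rec = CrashTimeRecorder()
rec.refresh_shutdown(True, 4.61)   # first detection of the crash
rec.refresh_shutdown(True, 24.81)  # later re-invocation must not overwrite
print(rec.crash_time)  # 4.61
```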
(In reply to comment #10)
> I see the problem in log. I do not know why, but in your case, refreshShutdown
> is call more than once - on each call time of crash is rewritten. This is
> probably due to xm list blocking xend to handle crashDump immediately. I will
> rewrite patch for this. Without xm list interfere, was this reproducible on
> 119?

Without xm list, I cannot reproduce it.
Fix built into xen-3.0.3-120.el5
Version-Release number of selected component (if applicable):
xen-3.0.3-120.el5
kernel-xen-2.6.18-233.el5

host: RHEL5.5-x86_64
guest: RHEL5.5-x86_64-PV

Actual steps:
Same as comment#8.

Actual results:
The guest works well. So I am changing the status to VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html