Bug 624959

Summary: xend should prevent restarting loops when guest crashes at boot time and dump-core is enabled
Product: Red Hat Enterprise Linux 5 Reporter: Yufang Zhang <yuzhang>
Component: xen Assignee: Miroslav Rezanina <mrezanin>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: low    
Version: 5.6 CC: gyang, leiwang, mrezanin, mshao, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: xen-3.0.3-120.el5 Doc Type: Bug Fix
Last Closed: 2011-01-13 22:23:42 UTC Type: ---
Bug Blocks: 514500    
Attachments:
Description               Flags
patch to prevent loops    none
xend.log                  none
xend.log of Xen-119       none

Description Yufang Zhang 2010-08-18 09:00:26 UTC
Description of problem:
After upgrading a guest to a problematic kernel, I found that the guest crashed at boot time. Because 'enable-dump' was set to 'yes' in xend and 'on_crash' was set to 'restart' in the guest config file, xend dumped core for the guest. By the time the restart() function of xend runs, the elapsed time has been inflated by the time spent dumping core, so it is larger than MINIMUM_RESTART_TIME (20s). As a result, xend does not destroy the guest even though it crashed early at boot time, and the guest falls into a restart-crash-dumpcore loop. My domain0 filled up with core dump files.

Version-Release number of selected component (if applicable):
xen-3.0.3-115.el5

How reproducible:
Always

Steps to Reproduce:
1. Edit the guest's grub file so that the guest crashes at boot time
2. Set 'enable-dump' to 'yes' in xend and 'on_crash' to 'restart' for the guest
3. Start the guest
  
Actual results:
The guest falls into a restart-crash-dumpcore loop. Domain0 fills up with core dump files.

Expected results:
xend should prevent such loops.

Additional info:
I modified xend to log the elapsed time at key points and got the following results:

[2010-08-18 16:00:34 xend.XendDomainInfo 17839] WARNING (XendDomainInfo:1185) Domain has crashed: name=vm1 id=114.
[2010-08-18 16:00:34 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1187) elapse time when guest crashes 4.61353492737
[2010-08-18 16:00:34 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1195) Starting automatic crash dump
[2010-08-18 16:00:53 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1205) elapse time after core dump 24.5423090458
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2723) elapse time when compared with MINIMUM_RESTART_TIME: 24.8053679466
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:1035) Storing domain details: {'console/ring-ref': '1180241', 'console/port': '2', 'cpu/3/availability': 'online', 'name': 'vm1', 'console/limit': '1048576', 'cpu/2/availability': 'online', 'vm': '/vm/1efb30c3-86fd-9dd7-4934-9b72b6a833fc', 'domid': '114', 'cpu/0/availability': 'online', 'memory/target': '1536000', 'store/ring-ref': '1180242', 'cpu/1/availability': 'online', 'store/port': '1'}
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2262) XendDomainInfo.destroy: domid=114
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2187) UUID Created: True
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2188) Devices to release: [], domid = 114
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2200) Releasing PVFB backend devices ...

Comment 1 Yufang Zhang 2010-08-18 09:02:03 UTC
Created attachment 439328 [details]
patch to prevent loops

I have written a patch for this problem. It records the time at which the guest crashes and uses that time to compute the elapsed time in the restart() function. I have tested the scenario with the patch applied: xend destroys the guest when it crashes at boot time.
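
The idea behind the patch can be sketched as follows. This is a minimal illustration and not the attached patch: should_restart() and the variable names are hypothetical, and the timing values are taken from the log in the description. It assumes the restart check compares the elapsed time since guest start with MINIMUM_RESTART_TIME.

```python
MINIMUM_RESTART_TIME = 20  # seconds, as stated in the description


def should_restart(start_time, reference_time):
    """Hypothetical check: restart only if the guest ran long enough."""
    return (reference_time - start_time) >= MINIMUM_RESTART_TIME


# Values from the xend.log excerpt (seconds since the guest started).
start = 0.0
crash_at = 4.61353492737           # "elapse time when guest crashes"
restart_check_at = 24.8053679466   # "elapse time when compared with MINIMUM_RESTART_TIME"

# Unpatched: elapsed time is measured when restart() runs, so the
# roughly 20 s spent dumping core is counted as guest uptime.
unpatched = should_restart(start, restart_check_at)

# Patched: elapsed time is measured at the moment of the crash.
patched = should_restart(start, crash_at)

print("unpatched restarts: %s, patched restarts: %s" % (unpatched, patched))
```

With the logged values, the unpatched comparison allows the restart and the patched comparison refuses it, which matches the behaviour described above.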

Comment 2 Miroslav Rezanina 2010-08-18 09:57:40 UTC
Thanks for the patch, Yufang. Can you please test this scenario with xen-3.0.3-114.el5? Is there any relevant difference in behavior?

Comment 3 Yufang Zhang 2010-08-19 03:27:03 UTC
Created attachment 439572 [details]
xend.log

Hi Miroslav, 

I have tested this scenario with xen-3.0.3-114.el5. I hit the same problem, with a small difference in behaviour: xend restarts the guest after its first boot-and-crash because 'xend/previous_restart_time' is None. After the guest reboots, the timeout value is again inflated by the core dump, so the guest falls into the loop. More detailed information is in the attached xend.log.

Comment 8 YangGuang 2010-11-23 09:33:31 UTC
With xen-116 and kernel-xen-233, I can reproduce this bug using the steps from comment #1 with a RHEL5.5 64-bit PV guest; the elapsed time is larger than MINIMUM_RESTART_TIME.

With xen-119 and kernel-xen-233, however, the bug still occurs if I run "xm list" continually during the process. The details are as follows:

Version-Release number of selected component (if applicable):
xen-3.0.3-119.el5
kernel-xen-2.6.18-233.el5

How reproducible:
Always

Actual steps:
1. Edit the guest's grub file so that the guest crashes at boot time
2. Set 'enable-dump' to 'yes' in xend and 'on_crash' to 'restart' for the guest
3. Start the guest
4. At the same time as step 3, open another console and run "xm list" continually.

Actual results:
1. The guest falls into a restart-crash-dumpcore loop. Domain0 fills up with core dump files.
2. After "xm list" is stopped, the guest works well.

Comment 9 YangGuang 2010-11-23 09:35:21 UTC
Created attachment 462276 [details]
xend.log of Xen-119

Comment 10 Miroslav Rezanina 2010-11-23 10:24:59 UTC
I see the problem in the log. I do not know why, but in your case refreshShutdown is called more than once, and each call overwrites the recorded crash time. This is probably because xm list blocks xend from handling crashDump immediately. I will rewrite the patch for this. Without xm list interfering, was this reproducible on 119?
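
One way to make the recorded crash time survive repeated calls is to set it only on the first observation. The sketch below is an illustration under that assumption and is not the fix built into a later package: CrashClock, note_crash() and reset() are hypothetical names, and the timestamps are made up.

```python
class CrashClock(object):
    """Hypothetical helper that remembers when a crash was first seen."""

    def __init__(self):
        self.crash_time = None

    def note_crash(self, now):
        # Only the first observation counts. Later refreshShutdown-style
        # calls for the same crash must not move the timestamp forward.
        if self.crash_time is None:
            self.crash_time = now
        return self.crash_time

    def reset(self):
        # To be called once the domain has been restarted or destroyed.
        self.crash_time = None


clock = CrashClock()
clock.note_crash(4.6)    # crash first detected
clock.note_crash(15.0)   # repeated call while the dump is still running
clock.note_crash(24.5)   # repeated call after the dump has finished
print("recorded crash time: %s" % clock.crash_time)
```

With such a guard, the time compared against MINIMUM_RESTART_TIME stays at the first crash detection no matter how often the shutdown state is refreshed.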

Comment 11 YangGuang 2010-11-24 02:08:11 UTC
(In reply to comment #10)
> I see the problem in the log. I do not know why, but in your case
> refreshShutdown is called more than once, and each call overwrites the
> recorded crash time. This is probably because xm list blocks xend from
> handling crashDump immediately. I will rewrite the patch for this. Without
> xm list interfering, was this reproducible on 119?

Without xm list, I cannot reproduce it.

Comment 12 Miroslav Rezanina 2010-11-24 14:21:53 UTC
Fix built into xen-3.0.3-120.el5

Comment 14 YangGuang 2010-11-29 07:19:56 UTC
Version-Release number of selected component (if applicable):
xen-3.0.3-120.el5
kernel-xen-2.6.18-233.el5
host: RHEL5.5-x86_64
guest: RHEL5.5-x86_64-PV

Actual steps:
Same as comment #8.

Actual results:
The guest works well.

So I am changing the status to VERIFIED.

Comment 16 errata-xmlrpc 2011-01-13 22:23:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html