Description of problem:
After upgrading a guest to a problematic kernel, I found that the guest crashed at boot time. Because 'enable-dump' was set to 'yes' in xend and 'on_crash' was set to 'restart' in the config file, a core dump was taken for the guest. As a result, the elapsed time measured in xend's restart() function is inflated so that it exceeds MINIMUM_RESTART_TIME (20 s). Thus xend does not destroy the guest even though it crashes early at boot time, and the guest drops into a restart-crash-dumpcore loop. My domain0 filled up with core dump files.

Version-Release number of selected component (if applicable):
xen-3.0.3-115.el5

How reproducible:
Always

Steps to Reproduce:
1. Edit the guest's grub file to make it crash at boot time
2. Set 'enable-dump' to 'yes' in xend and 'on_crash' to 'restart' for the guest
3. Start the guest

Actual results:
The guest drops into a restart-crash-dumpcore loop. Domain0 fills up with core dump files.

Expected results:
xend should prevent such loops.

Additional info:
I modified xend to print the elapsed time at key points, and got these results:

[2010-08-18 16:00:34 xend.XendDomainInfo 17839] WARNING (XendDomainInfo:1185) Domain has crashed: name=vm1 id=114.
[2010-08-18 16:00:34 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1187) elapse time when guest crashes 4.61353492737
[2010-08-18 16:00:34 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1195) Starting automatic crash dump
[2010-08-18 16:00:53 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1205) elapse time after core dump 24.5423090458
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2723) elapse time when compared with MINIMUM_RESTART_TIME: 24.8053679466
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:1035) Storing domain details: {'console/ring-ref': '1180241', 'console/port': '2', 'cpu/3/availability': 'online', 'name': 'vm1', 'console/limit': '1048576', 'cpu/2/availability': 'online', 'vm': '/vm/1efb30c3-86fd-9dd7-4934-9b72b6a833fc', 'domid': '114', 'cpu/0/availability': 'online', 'memory/target': '1536000', 'store/ring-ref': '1180242', 'cpu/1/availability': 'online', 'store/port': '1'}
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2262) XendDomainInfo.destroy: domid=114
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2187) UUID Created: True
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2188) Devices to release: [], domid = 114
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2200) Releasing PVFB backend devices ...
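The failure mode can be seen in miniature from the timestamps above: restart() measures elapsed time up to the moment it runs, so the ~20 s spent writing the core dump counts toward the guest's "uptime" and defeats the MINIMUM_RESTART_TIME check. The following is a simplified sketch of that check (not xend's actual code; `should_restart` is a hypothetical name, and the timestamps are taken from the log):

```python
MINIMUM_RESTART_TIME = 20  # seconds; the same constant xend compares against

def should_restart(previous_restart_time, now):
    # Simplified form of the check in XendDomainInfo.restart(): only
    # restart if enough time has passed since the previous restart.
    return now - previous_restart_time > MINIMUM_RESTART_TIME

# From the log: the guest crashed ~4.6 s after boot, but the core dump
# took ~20 s, so restart() runs at ~24.8 s of apparent elapsed time.
crash_elapsed = 4.61        # how long the guest actually ran
after_dump_elapsed = 24.81  # what restart() actually measures

print(should_restart(0.0, crash_elapsed))       # False - loop would be stopped
print(should_restart(0.0, after_dump_elapsed))  # True - xend restarts anyway
```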
Created attachment 439328 [details]
patch to prevent loops

I have worked out a patch to solve this problem: it records the time at which the guest crashes and uses that time to compute the elapsed time in the restart() function. I have tested this scenario with the patch applied; xend now destroys the guest when it crashes at boot time.
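The idea behind the patch can be sketched as follows (hypothetical class and method names, not the actual patch, which modifies XendDomainInfo directly): record the wall-clock time at the moment the crash is detected, and have the restart decision measure elapsed time against that timestamp instead of the current time, so the duration of the core dump no longer counts.

```python
MINIMUM_RESTART_TIME = 20  # seconds

class DomainRestartPolicy:
    """Sketch of the patched behaviour: a crash timestamp is recorded
    before the (slow) core dump is taken, and used by the restart check."""

    def __init__(self):
        self.previous_restart_time = None
        self.crash_time = None

    def on_crash_detected(self, now):
        # Recorded when the crash is first seen, before dumping the core.
        self.crash_time = now

    def may_restart(self, now):
        if self.previous_restart_time is None:
            return True
        # Measure against the crash timestamp, not the post-dump time.
        reference = self.crash_time if self.crash_time is not None else now
        return reference - self.previous_restart_time > MINIMUM_RESTART_TIME

policy = DomainRestartPolicy()
policy.previous_restart_time = 0.0
policy.on_crash_detected(4.61)    # crash ~4.6 s after boot
# the core dump takes ~20 s; restart() runs at t = 24.81
print(policy.may_restart(24.81))  # False: the loop is broken
```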
Thanks for the patch, Yufang. Can you please test this scenario with xen-3.0.3-114.el5? Is there any relevant difference in behavior?
Created attachment 439572 [details]
xend.log

Hi Miroslav, I have tested this scenario with xen-3.0.3-114.el5. I hit the same problem, with one small difference in behaviour: xend restarts the guest after its first boot-and-crash, because 'xend/previous_restart_time' is None at that point. After the guest reboots, the elapsed time is again inflated by the core dump, so the guest drops into the loop. You can find more detailed information in the attached xend.log.
With xen-116 and kernel-xen-233, I can reproduce this bug with the steps from comment#1 using a RHEL5.5-64bit PV guest whose elapsed time is larger than MINIMUM_RESTART_TIME. With xen-119 and kernel-xen-233, however, the bug still occurs if I run "xm list" continually during the process. Details follow:

Version-Release number of selected component (if applicable):
xen-3.0.3-119.el5
kernel-xen-2.6.18-233.el5

How reproducible:
Always

Actual steps:
1. Edit the guest's grub file to make it crash at boot time
2. Set 'enable-dump' to 'yes' in xend and 'on_crash' to 'restart' for the guest
3. Start the guest
4. At the same time as step 3, open another console and run "xm list" continually

Actual results:
1. The guest drops into a restart-crash-dumpcore loop. Domain0 fills up with core dump files.
2. Once "xm list" is stopped, the guest works well afterwards.
Created attachment 462276 [details] xend.log of Xen-119
I see the problem in the log. I do not know why, but in your case refreshShutdown is called more than once, and on each call the time of the crash is rewritten. This is probably due to xm list blocking xend from handling the crash dump immediately. I will rewrite the patch to cover this. Without xm list interfering, was this reproducible on 119?
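One way to make the timestamp robust against repeated refreshShutdown calls is to record it only the first time the crashed state is seen. A minimal sketch of that guard, assuming hypothetical names (the real fix lives inside XendDomainInfo.refreshShutdown):

```python
class CrashTimeRecorder:
    """Record the crash time only once per crash, so that repeated
    refreshShutdown() invocations (e.g. while `xm list` polls xend)
    cannot push the timestamp forward past MINIMUM_RESTART_TIME."""

    def __init__(self):
        self.crash_time = None

    def refresh_shutdown(self, crashed, now):
        # Only the first observation of the crashed state sets the time.
        if crashed and self.crash_time is None:
            self.crash_time = now

    def clear(self):
        # Reset once the restart decision has been made.
        self.crash_time = None

rec = CrashTimeRecorder()
rec.refresh_shutdown(True, 4.61)   # first detection of the crash
rec.refresh_shutdown(True, 24.81)  # later re-invocation must not overwrite
print(rec.crash_time)  # 4.61
```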
(In reply to comment #10)
> I see the problem in log. I do not know why, but in your case, refreshShutdown
> is call more than once - on each call time of crash is rewritten. This is
> probably due to xm list blocking xend to handle crashDump immediately. I will
> rewrite patch for this. Without xm list interfere, was this reproducible on
> 119?

Without xm list, I cannot reproduce it.
Fix built into xen-3.0.3-120.el5
Version-Release number of selected component (if applicable):
xen-3.0.3-120.el5
kernel-xen-2.6.18-233.el5

host: RHEL5.5-x86_64
guest: RHEL5.5-x86_64-PV

Actual steps:
Same as comment#8.

Actual results:
The guest works well. So I am changing the status to VERIFIED.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html