Bug 624959 - xend should prevent restarting loops when guest crashes at boot time and dump-core is enabled
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.6
Hardware: All Linux
Priority: low  Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Miroslav Rezanina
QA Contact: Virtualization Bugs
Depends On:
Blocks: 514500
Reported: 2010-08-18 05:00 EDT by Yufang Zhang
Modified: 2011-01-13 17:23 EST

See Also:
Fixed In Version: xen-3.0.3-120.el5
Doc Type: Bug Fix
Last Closed: 2011-01-13 17:23:42 EST


Attachments
patch to prevent loops (1.07 KB, patch)
2010-08-18 05:02 EDT, Yufang Zhang
xend.log (67.53 KB, text/plain)
2010-08-18 23:27 EDT, Yufang Zhang
xend.log of Xen-119 (58.24 KB, text/plain)
2010-11-23 04:35 EST, YangGuang


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2011:0031
Priority: normal
Status: SHIPPED_LIVE
Summary: xen bug fix and enhancement update
Last Updated: 2011-01-12 10:59:24 EST

Description Yufang Zhang 2010-08-18 05:00:26 EDT
Description of problem:
After upgrading a guest to a problematic kernel, I found that the guest crashed at boot time. Because 'enable-dump' was set to 'yes' in xend and 'on_crash' was set to 'restart' in the guest config file, xend dumped core for the guest. The core dump takes long enough that, by the time xend's restart() function runs, the elapsed time has grown beyond MINIMUM_RESTART_TIME (20s). As a result, xend does not destroy the guest even though it crashed early at boot time, and the guest falls into a restart-crash-dumpcore loop. My domain0 filled up with core dump files.

Version-Release number of selected component (if applicable):
xen-3.0.3-115.el5

How reproducible:
Always

Steps to Reproduce:
1. Edit the guest's grub configuration so that the guest crashes at boot time.
2. Set 'enable-dump' to 'yes' in xend and 'on_crash' to 'restart' for the guest.
3. Start the guest
  
Actual results:
The guest falls into a restart-crash-dumpcore loop. Domain0 fills up with core dump files.

Expected results:
xend should prevent such loops.

Additional info:
I modified xend to log the elapsed time at key points and got the following results:

[2010-08-18 16:00:34 xend.XendDomainInfo 17839] WARNING (XendDomainInfo:1185) Domain has crashed: name=vm1 id=114.
[2010-08-18 16:00:34 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1187) elapse time when guest crashes 4.61353492737
[2010-08-18 16:00:34 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1195) Starting automatic crash dump
[2010-08-18 16:00:53 xend.XendDomainInfo 17839] INFO (XendDomainInfo:1205) elapse time after core dump 24.5423090458
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2723) elapse time when compared with MINIMUM_RESTART_TIME: 24.8053679466
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:1035) Storing domain details: {'console/ring-ref': '1180241', 'console/port': '2', 'cpu/3/availability': 'online', 'name': 'vm1', 'console/limit': '1048576', 'cpu/2/availability': 'online', 'vm': '/vm/1efb30c3-86fd-9dd7-4934-9b72b6a833fc', 'domid': '114', 'cpu/0/availability': 'online', 'memory/target': '1536000', 'store/ring-ref': '1180242', 'cpu/1/availability': 'online', 'store/port': '1'}
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2262) XendDomainInfo.destroy: domid=114
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:54 xend.XendDomainInfo 17839] INFO (XendDomainInfo:2403) Dev 51712 still active, looping...
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2187) UUID Created: True
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2188) Devices to release: [], domid = 114
[2010-08-18 16:00:55 xend.XendDomainInfo 17839] DEBUG (XendDomainInfo:2200) Releasing PVFB backend devices ...
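
The timings in the log excerpt above can be run through a simplified model of the restart check. This is only a sketch: should_restart is a hypothetical helper, not a function in xend, and the exact comparison xend performs against MINIMUM_RESTART_TIME is assumed from the description. The two elapsed values are the rounded figures from the log.

```python
MINIMUM_RESTART_TIME = 20  # seconds, as stated in the description


def should_restart(elapsed_seconds):
    """Hypothetical model: restart only if the domain ran long enough."""
    return elapsed_seconds >= MINIMUM_RESTART_TIME


# Rounded figures from the log excerpt above.
elapsed_at_crash = 4.61           # "elapse time when guest crashes"
elapsed_at_restart_check = 24.81  # "elapse time when compared with MINIMUM_RESTART_TIME"

print(should_restart(elapsed_at_crash))          # False: the guest would be destroyed
print(should_restart(elapsed_at_restart_check))  # True: the guest is restarted
```

Under this model, the roughly 20 seconds spent on the core dump are what push the guest over the threshold.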
Comment 1 Yufang Zhang 2010-08-18 05:02:03 EDT
Created attachment 439328 [details]
patch to prevent loops

I have worked out a patch for this problem. It records the time at which the guest crashes and uses that time to compute the elapsed time in the restart() function. I tested the scenario with the patch applied: xend destroys the guest when it crashes at boot time.
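
The approach described above can be sketched as follows. This is not the attached patch: GuestModel, FakeClock, on_crash and restart_or_destroy are hypothetical names used for illustration, and the comparison against MINIMUM_RESTART_TIME is assumed from the description.

```python
import time

MINIMUM_RESTART_TIME = 20  # seconds, as stated in the description


class FakeClock:
    """Hypothetical controllable clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class GuestModel:
    """Hypothetical stand-in for a domain; not xend's XendDomainInfo."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.start_time = clock()
        self.crash_time = None

    def on_crash(self):
        # Record when the guest crashed, before any core dump is taken.
        self.crash_time = self._clock()

    def restart_or_destroy(self):
        # Measure up to the crash, not up to the end of the core dump.
        end = self.crash_time if self.crash_time is not None else self._clock()
        elapsed = end - self.start_time
        return "destroy" if elapsed < MINIMUM_RESTART_TIME else "restart"


clock = FakeClock()
guest = GuestModel(clock)
clock.advance(4.6)   # guest crashes early in boot
guest.on_crash()
clock.advance(20.0)  # core dump takes a long time
print(guest.restart_or_destroy())  # prints "destroy"
```

Without the recorded crash time, the same sequence would measure about 24.6 seconds and the model would choose "restart".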
Comment 2 Miroslav Rezanina 2010-08-18 05:57:40 EDT
Thanks for the patch, Yufang. Can you please test this scenario with xen-3.0.3-114.el5? Is there any relevant difference in behavior?
Comment 3 Yufang Zhang 2010-08-18 23:27:03 EDT
Created attachment 439572 [details]
xend.log

Hi Miroslav, 

I have tested this scenario with xen-3.0.3-114.el5. I hit the same problem, with a small difference in behaviour: xend restarts the guest after its first boot-and-crash because 'xend/previous_restart_time' is None. After the guest reboots, the timeout value is again inflated by the core dump, so the guest falls into the same loop. More detailed information is in the attached xend.log.
Comment 8 YangGuang 2010-11-23 04:33:31 EST
With xen-116 and kernel-xen-233, I can reproduce this bug using the steps from comment #1 on a RHEL5.5 64-bit PV guest; the elapsed time is larger than MINIMUM_RESTART_TIME.

With xen-119 and kernel-xen-233, however, the bug still occurs if I run "xm list" continually during the process. Details follow:

Version-Release number of selected component (if applicable):
xen-3.0.3-119.el5
kernel-xen-2.6.18-233.el5

How reproducible:
Always

Actual steps:
1. Edit a grub file of guest to make it crash at boot time
2. set 'enable-dump' as 'yes' in xend and 'on_crash' as 'restart' for the guest 
3. Start the guest
4. At the same time as step 3, open another console and run "xm list" continually.

Actual results:
1. The guest falls into a restart-crash-dumpcore loop. Domain0 fills up with core dump files.
2. Once "xm list" stops, the guest works well.
Comment 9 YangGuang 2010-11-23 04:35:21 EST
Created attachment 462276 [details]
xend.log of Xen-119
Comment 10 Miroslav Rezanina 2010-11-23 05:24:59 EST
I see the problem in the log. I do not know why, but in your case refreshShutdown is called more than once, and each call overwrites the recorded crash time. This is probably because xm list blocks xend from handling crashDump immediately. I will rewrite the patch to handle this. Without xm list interfering, was this reproducible on 119?
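
One way to keep repeated calls from overwriting the crash time is to record it only on the first call. The sketch below illustrates that idea only; it is not the fix that shipped in xen-3.0.3-120.el5, which is not shown in this report, and CrashRecord and note_crash are hypothetical names.

```python
class CrashRecord:
    """Hypothetical holder for the time at which a guest crashed."""

    def __init__(self):
        self.crash_time = None

    def note_crash(self, now):
        """Record the crash time only the first time the crash is seen."""
        if self.crash_time is None:
            self.crash_time = now
        return self.crash_time


record = CrashRecord()
record.note_crash(4.6)    # first notification: time is recorded
record.note_crash(15.0)   # repeated notification: ignored
record.note_crash(24.5)   # repeated notification: ignored
print(record.crash_time)  # prints 4.6
```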
Comment 11 YangGuang 2010-11-23 21:08:11 EST
(In reply to comment #10)
> I see the problem in the log. I do not know why, but in your case
> refreshShutdown is called more than once, and each call overwrites the
> recorded crash time. This is probably because xm list blocks xend from
> handling crashDump immediately. I will rewrite the patch to handle this.
> Without xm list interfering, was this reproducible on 119?

Without xm list, I cannot reproduce it.
Comment 12 Miroslav Rezanina 2010-11-24 09:21:53 EST
Fix built into xen-3.0.3-120.el5
Comment 14 YangGuang 2010-11-29 02:19:56 EST
Version-Release number of selected component (if applicable):
xen-3.0.3-120.el5
kernel-xen-2.6.18-233.el5
host: RHEL5.5-x86_64
guest: RHEL5.5-x86_64-PV

Actual steps:
same with comment#8

Actual results:
The guest works well.

So I am changing the status to VERIFIED.
Comment 16 errata-xmlrpc 2011-01-13 17:23:42 EST
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0031.html
