Bug 240413

Summary: Guests get stuck in paused state when booting under heavy load
Product: [Fedora] Fedora
Component: xen
Version: 9
Hardware: x86_64
OS: Linux
Severity: medium
Priority: medium
Reporter: Richard W.M. Jones <rjones>
Assignee: Richard W.M. Jones <rjones>
CC: katzj, triage
Whiteboard: bzcl34nup
Doc Type: Bug Fix
Last Closed: 2008-09-09 09:07:52 EDT

Attachments:
Output of xenstore-ls with 3 domains paused this way (flags: none)

Description Richard W.M. Jones 2007-05-17 08:55:03 EDT
Description of problem:

I am starting and stopping four FC6 PV guests from scripts under heavy load. 
Occasionally a guest will get stuck in the paused state just after it begins to
boot (just after xm start <domainname>).

Methodology of testing: http://et.redhat.com/~rjones/xen-stress-tests/

Version-Release number of selected component (if applicable):

xen-3.1.0-0.rc7.1.fc7 + patch to fix bug 240009

How reproducible:

Occurs very infrequently, but definitely reproducible if the tests are left to
run for a long time.

Steps to Reproduce:
1. Stress test under load, see: http://et.redhat.com/~rjones/xen-stress-tests/

Actual results:

Guests stay paused after booting.  In the xm list below, fc6-3 has this problem.

# /usr/sbin/xm list
Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0  2984     4     r-----  21370.4
centos5                                        256     1                 0.2
fc6                                      464   256     1     r-----     14.4
fc6-2                                    467   256     1     -b----      0.1
fc6-3                                    452   256     1     --p---      0.0
fc6-4                                    465   256     1     -b----     11.9
freebsd32                                      256     1                 0.0

If the guest is manually unpaused then the boot continues as normal.
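
For spotting stuck guests from a test script, here is a minimal sketch (not
part of the original stress-test scripts) that picks paused domains out of
xm list output like the listing above. The function name paused_domains is
hypothetical. It assumes the six-character State column shown above, and that
inactive managed domains omit the ID and State columns.

```python
import re

# Assumption: the State column is six flag characters drawn from
# r, b, p, s, c, d and '-', as in the listing above.
_STATE_RE = re.compile(r"[-rbpscd]{6}")


def paused_domains(xm_list_output):
    """Return the names of domains whose state flags include 'p' (paused)."""
    paused = []
    for line in xm_list_output.splitlines():
        fields = line.split()
        # Running domains have 6 columns; inactive managed domains lack
        # ID and State, so they (and blank lines) are skipped here.
        if len(fields) != 6:
            continue
        name, state = fields[0], fields[4]
        # The header row fails this match because "State" is not a flag string.
        if _STATE_RE.fullmatch(state) and "p" in state:
            paused.append(name)
    return paused
```

The returned names could then be passed to xm unpause as a workaround, though
that only hides the underlying problem.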

Expected results:

Guests should pause briefly while xend sets them up, then be resumed
automatically by xend.

Additional info:

I will attach xend.log and xend-debug.log in followups.
Comment 1 Richard W.M. Jones 2007-05-17 08:57:11 EDT
Created attachment 154911 [details]

This is xend.log, cut down so it starts just before the guest is booted.

Domain of interest is ID 452, name fc6-3.
Comment 2 Richard W.M. Jones 2007-05-17 08:57:47 EDT
Created attachment 154912 [details]

This is xend-debug.log, cut down so it starts just before the guest is booted.

Domain of interest is ID 452, name fc6-3.
Comment 3 Richard W.M. Jones 2007-05-17 09:22:45 EDT
(A reminder to capture xenstore-ls output next time this happens)
Comment 4 Richard W.M. Jones 2007-05-17 09:30:21 EDT
Created attachment 154917 [details]
Output of xenstore-ls with 3 domains paused this way

Now I seem to have a reliable way to reproduce this bug.

What I do is take a huge file (a 4GB disk image from one of the guests) and
copy it.  Three domains were cycling while this was happening, and all 3 are
now stuck paused.

# /usr/sbin/xm list
Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0  2984     4     r-----  23718.2
centos5                                        256     1                 0.2
fc6                                            256     1                55.9
fc6-2                                    492   256     1     --p---      0.0
fc6-3                                    493   256     1     --p---      0.0
fc6-4                                    494   256     1     --p---      0.0
freebsd32                                      256     1                 0.0

There is a message produced when this happens. It comes from the xm start
command itself, and it confirms the theory that the hotplug scripts are timing
out:

+ /usr/sbin/xm start fc6-3
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.
Usage: xm start <DomainName>

Start a Xend managed domain
  -p, --paused			 Do not unpause domain after starting it
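
A test harness could recognise this failure by matching the error line in the
output above. The following is a rough sketch only: the function name
parse_hotplug_error is made up for illustration, and the pattern is based
solely on the single message quoted here, so other xend versions may word it
differently.

```python
import re

# Assumption: the error line has exactly the wording quoted in this comment.
_HOTPLUG_RE = re.compile(
    r"^Error: Device (\d+) \((\w+)\) could not be connected\. "
    r"Hotplug scripts not working\.",
    re.MULTILINE,
)


def parse_hotplug_error(xm_output):
    """Return (device_number, device_type) if the hotplug error is present,
    otherwise None."""
    match = _HOTPLUG_RE.search(xm_output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Logging the device type alongside the domain name would show whether the
failures are always on the vif device, as in the case above.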
Comment 5 Bug Zapper 2008-04-03 20:44:43 EDT
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.
Comment 6 Richard W.M. Jones 2008-04-04 06:15:06 EDT
Assigning it to me to retest.
Comment 7 Bug Zapper 2008-05-13 22:54:30 EDT
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
Comment 8 Richard W.M. Jones 2008-09-09 09:07:52 EDT
I retested with my load testing scripts a while back and
didn't see anything like this, so I'm going to assume