240413 – Guests get stuck in paused state when booting under heavy load

Summary: Guests get stuck in paused state when booting under heavy load

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xen
Sub Component:
Version:	9
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Richard W.M. Jones
QA Contact:
Docs Contact:
URL:
Whiteboard:	bzcl34nup
Depends On:
Blocks:

Reported:	2007-05-17 12:55 UTC by Richard W.M. Jones
Modified:	2008-09-09 13:07 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-09-09 13:07:52 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
xend.log (34.30 KB, text/plain) 2007-05-17 12:57 UTC, Richard W.M. Jones
xend-debug.log (85.65 KB, text/plain) 2007-05-17 12:57 UTC, Richard W.M. Jones
Output of xenstore-ls with 3 domains paused this way (24.29 KB, text/plain) 2007-05-17 13:30 UTC, Richard W.M. Jones

Description Richard W.M. Jones 2007-05-17 12:55:03 UTC

Description of problem:

I am starting and stopping four FC6 PV guests from scripts under heavy load.
Occasionally a guest gets stuck in the paused state immediately after
xm start <domainname>, just as it begins to boot.

Methodology of testing: http://et.redhat.com/~rjones/xen-stress-tests/

Version-Release number of selected component (if applicable):

xen-3.1.0-0.rc7.1.fc7 + patch to fix bug 240009

How reproducible:

Occurs very infrequently, but definitely reproducible if the tests are left to
run for a long time.

Steps to Reproduce:
1. Stress test under load, see: http://et.redhat.com/~rjones/xen-stress-tests/
  
Actual results:

Guests stay paused after booting.  In the xm list below, fc6-3 has this problem.

# /usr/sbin/xm list
Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0  2984     4     r-----  21370.4
centos5                                        256     1                 0.2
fc6                                      464   256     1     r-----     14.4
fc6-2                                    467   256     1     -b----      0.1
fc6-3                                    452   256     1     --p---      0.0
fc6-4                                    465   256     1     -b----     11.9
freebsd32                                      256     1                 0.0

If the guest is manually unpaused then the boot continues as normal.
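As an illustration of how a test script might spot the stuck guests, here is a minimal sketch that scans `xm list` output for domains whose state flags include `p`. It assumes the column layout shown in the listing above (six columns for running domains, four for inactive managed ones). The helper names `paused_domains` and `unpause` are hypothetical and not part of the stress-test scripts; `unpause` simply shells out to `xm unpause`, mirroring the manual step described above.

```python
import subprocess


def paused_domains(xm_list_output):
    """Return (name, domid) pairs for domains whose state flags include 'p'."""
    paused = []
    for line in xm_list_output.splitlines():
        fields = line.split()
        # Running domains have 6 columns: Name ID Mem VCPUs State Time(s).
        # Inactive managed domains omit ID and State (4 columns); the header
        # row has a non-numeric ID column. Both are skipped here.
        if len(fields) != 6 or not fields[1].isdigit():
            continue
        name, domid, _mem, _vcpus, state, _time = fields
        if "p" in state:
            paused.append((name, int(domid)))
    return paused


def unpause(name, xm="/usr/sbin/xm"):
    """Assumed workaround: run 'xm unpause <name>' as one would by hand."""
    subprocess.run([xm, "unpause", name], check=True)
```

Applied to the listing above, `paused_domains` would report only fc6-3 (ID 452). Note that a domain is also briefly paused during normal startup, so a real script would need to check that the state persists before treating it as stuck.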

Expected results:

Guests should pause briefly while xend sets them up, and should then be
resumed automatically by xend.

Additional info:

I will attach xend.log and xend-debug.log in followups.

Comment 1 Richard W.M. Jones 2007-05-17 12:57:11 UTC

Created attachment 154911
xend.log

This is xend.log, cut down so it starts just before the guest is booted.

Domain of interest is ID 452, name fc6-3.

Comment 2 Richard W.M. Jones 2007-05-17 12:57:47 UTC

Created attachment 154912
xend-debug.log

This is xend-debug.log, cut down so it starts just before the guest is booted.

Domain of interest is ID 452, name fc6-3.

Comment 3 Richard W.M. Jones 2007-05-17 13:22:45 UTC

(A reminder to capture xenstore-ls output next time this happens)

Comment 4 Richard W.M. Jones 2007-05-17 13:30:21 UTC

Created attachment 154917
Output of xenstore-ls with 3 domains paused this way

Now I seem to have a reliable way to reproduce this bug.

What I do is take a huge file (a 4GB disk image from one of the guests) and
copy it.  Three domains were cycling while this was happening, and all 3 are
now stuck paused.
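The reproduction recipe above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `run_under_copy_load` and the `cycle_once` callback are hypothetical, and the callback stands in for whatever starts and stops a guest (for example, the stress-test scripts' own start/shutdown step). The sketch copies a large file in a background thread to generate I/O load and keeps cycling until the copy finishes.

```python
import shutil
import threading


def run_under_copy_load(src, dst, cycle_once):
    """Copy src to dst in the background while repeatedly calling cycle_once().

    Returns the number of cycles completed; at least one cycle always runs.
    """
    copier = threading.Thread(target=shutil.copyfile, args=(src, dst))
    copier.start()
    cycles = 0
    while True:
        cycle_once()  # hypothetical: start a guest, wait, shut it down
        cycles += 1
        if not copier.is_alive():
            break
    copier.join()
    return cycles
```

In the scenario described here, `src` would be the 4GB guest disk image and `dst` a scratch path on the same host; both paths are left to the caller.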

# /usr/sbin/xm list
Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0  2984     4     r-----  23718.2
centos5                                        256     1                 0.2
fc6                                            256     1                55.9
fc6-2                                    492   256     1     --p---      0.0
fc6-3                                    493   256     1     --p---      0.0
fc6-4                                    494   256     1     --p---      0.0
freebsd32                                      256     1                 0.0

There is a message produced when this happens; it comes from the xm start
command itself, and it confirms the theory that the hotplug scripts are timing
out:

+ /usr/sbin/xm start fc6-3
Error: Device 0 (vif) could not be connected. Hotplug scripts not working.
Usage: xm start <DomainName>

Start a Xend managed domain
  -p, --paused                   Do not unpause domain after starting it
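Since the only visible symptom is the error line printed by `xm start`, a test harness could capture that output and check for it. The following is a minimal sketch that matches the exact message text quoted above and extracts the device number and type. The names `HOTPLUG_ERROR` and `find_hotplug_failure` are hypothetical, and the pattern assumes the message wording does not vary between Xen versions.

```python
import re

# Matches the message quoted in this report, e.g.
# "Error: Device 0 (vif) could not be connected. Hotplug scripts not working."
HOTPLUG_ERROR = re.compile(
    r"Error: Device (?P<devid>\d+) \((?P<devtype>\w+)\) could not be connected\. "
    r"Hotplug scripts not working\."
)


def find_hotplug_failure(xm_output):
    """Return (device_id, device_type) if the hotplug error appears, else None."""
    match = HOTPLUG_ERROR.search(xm_output)
    if match is None:
        return None
    return int(match.group("devid")), match.group("devtype")
```

A script could log the returned device alongside the `xm list` state and timestamps, which might help correlate stuck guests with hotplug timeouts in xend.log.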

Comment 5 Bug Zapper 2008-04-04 00:44:43 UTC

Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 6 Richard W.M. Jones 2008-04-04 10:15:06 UTC

Assigning it to me to retest.

Comment 7 Bug Zapper 2008-05-14 02:54:30 UTC

Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 8 Richard W.M. Jones 2008-09-09 13:07:52 UTC

I retested with my load testing scripts a while back and
didn't see anything like this, so I'm going to assume
WORKSFORME for now.
