Bug 240651

Summary: After stress testing, xend becomes unresponsive to socket connections
Product: [Fedora] Fedora
Reporter: Richard W.M. Jones <rjones>
Component: xen
Assignee: Daniel Berrange <berrange>
Severity: medium
Priority: medium
Version: rawhide
CC: katzj, triage, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: bzcl34nup
Doc Type: Bug Fix
Last Closed: 2008-04-04 06:16:59 EDT

Description Richard W.M. Jones 2007-05-19 07:48:25 EDT
Description of problem:

After overnight stress testing, xend no longer responds to connections on its
control socket /var/run/xend/xmlrpc.sock.  This means that basic commands such
as 'xm list' hang.
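
The symptom above (a connection that is accepted but never answered) can be
checked without letting the client hang forever. The following is a minimal
sketch in modern Python 3, not the Python 2 code that xm or xend use. The
function names, the timeout value and the result strings are illustrative
assumptions; only the socket path and the POST to /RPC2 come from this report.

```python
import socket
import xmlrpc.client

XEND_SOCKET = "/var/run/xend/xmlrpc.sock"  # socket path quoted in the report


def build_request(method_name, params=()):
    """Frame an XML-RPC call as an HTTP/1.0 POST to /RPC2 (illustrative)."""
    body = xmlrpc.client.dumps(tuple(params), methodname=method_name).encode("utf-8")
    head = (
        "POST /RPC2 HTTP/1.0\r\n"
        "Host: \r\n"
        "Content-Type: text/xml\r\n"
        "Content-Length: %d\r\n\r\n" % len(body)
    ).encode("ascii")
    return head + body


def probe_unix_socket(path, request, timeout=5.0):
    """Send `request` to the Unix socket at `path` and wait briefly for a reply.

    Returns "ok" if any byte comes back, "closed" if the peer closes without
    replying, "hung" if nothing arrives within `timeout` seconds, and
    "unreachable" if connecting or sending fails.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        sock.sendall(request)
        data = sock.recv(1)  # a single byte is enough to tell "alive" from "hung"
        return "ok" if data else "closed"
    except socket.timeout:
        return "hung"
    except OSError:
        return "unreachable"
    finally:
        sock.close()
```

A call such as probe_unix_socket(XEND_SOCKET, build_request("xend.domains_with_state"))
would be expected to return "hung" in the state described here. The method name
is the one visible in the xend trace in this report; its parameters are not
fully visible there, so the empty parameter list is an assumption.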

Methodology of stress tests: http://et.redhat.com/~rjones/xen-stress-tests/

Version-Release number of selected component (if applicable):


How reproducible:

Happened at least twice on my test machine.  It is fairly reproducible if I run
the stress tests for an extended period of time.

Steps to Reproduce:
1. Run the stress tests with 8 guests.

Actual results:

virt-manager hangs (no updates, cannot be closed).

xm list hangs.

strace of xm list shows:

23848 connect(3, {sa_family=AF_FILE, path="/var/run/xend/xmlrpc.sock"}, 27) = 0
23848 sendto(3, "POST /RPC2 HTTP/1.0\r\nHost: \r\nUser-Agent: xmlrpclib.py/1.0.1 (by www.pythonware.com)\r\nContent-Type: text/xml\r\nContent-Length: 268\r\n\r\n", 132, 0, NULL, 0) = 132
23848 sendto(3, "<?xml version=\'1.0\'?>\n<methodCall>\n<methodName>xend.domains
lue><int>0</int></value>\n</param>\n</params>\n</methodCall>\n", 268, 0, NULL, 0) = 268
23848 recvfrom(3, 0x2aaaaab3a7d4, 1, 0, 0, 0) = ? ERESTARTSYS (To be restarted)

(the final recvfrom hangs - here I hit ^C).
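
As a sanity check on the trace above, the header block that xm sent can be
rebuilt from the strings shown and its length compared with the 132 bytes
reported by the first sendto. This is only a sketch: the helper name
build_headers is made up for illustration, while the header values are copied
from the strace output.

```python
def build_headers(content_length,
                  user_agent="xmlrpclib.py/1.0.1 (by www.pythonware.com)"):
    """Rebuild the HTTP/1.0 header block seen in the strace of 'xm list'."""
    lines = [
        "POST /RPC2 HTTP/1.0",
        "Host: ",                      # empty Host value, as in the trace
        "User-Agent: " + user_agent,
        "Content-Type: text/xml",
        "Content-Length: %d" % content_length,
        "",                            # the two empty items produce the
        "",                            # terminating blank line (\r\n\r\n)
    ]
    return "\r\n".join(lines).encode("ascii")


headers = build_headers(268)
print(len(headers))  # 132, matching the first sendto in the trace
```

So the client side looks healthy: the full header block and the 268-byte body
were both written before the read stalled.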

strace of xend shows:

23849 recvfrom(35, "POST /RPC2 HTTP/1.0\r\nHost: \r\nUser-Agent: xmlrpclib.py/1.0.1 (by www.pythonware.com)\r\nContent-Type: text/xml\r\nContent-Length: 268\r\n\r\n<?xml version=\'1.0\'?>\n<methodCall>\n<methodName>xend.domains_with_state</
nt></value>\n</param>\n</params>\n</methodCall>\n", 8192, 0, NULL, NULL) = 400
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9d10c0, FUTEX_WAIT, 0, NULL <unfinished ...>
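
Two things stand out in the xend trace. The recvfrom returned 400 bytes, which
equals the 132 header bytes plus the 268 body bytes that xm sent, so the
request appears to have arrived whole. After that the thread issues a run of
FUTEX_WAKE calls and then blocks in FUTEX_WAIT. That is consistent with a
thread waiting on a lock or condition that is never released, although the
trace alone does not prove it. The sketch below shows one way to summarise
such a trace; the function name and the regular expression are illustrative
assumptions, and the sample lines are copied from the trace above.

```python
import re
from collections import Counter

# Matches the "PID syscall(" prefix of a strace -f line.
CALL_RE = re.compile(r"^(?P<pid>\d+)\s+(?P<name>\w+)\(")


def summarize_strace(text):
    """Count syscalls by name and collect calls that never returned."""
    counts = Counter()
    unfinished = []
    for line in text.splitlines():
        match = CALL_RE.match(line)
        if not match:
            continue  # blank line or continuation of a wrapped line
        counts[match.group("name")] += 1
        if line.rstrip().endswith("<unfinished ...>"):
            unfinished.append(line.strip())
    return counts, unfinished


SAMPLE = """\
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9c5ba0, FUTEX_WAKE, 1)    = 0
23849 futex(0x9d10c0, FUTEX_WAIT, 0, NULL <unfinished ...>
"""

counts, unfinished = summarize_strace(SAMPLE)
print(counts["futex"], len(unfinished))  # 3 1
assert 132 + 268 == 400  # bytes sent by xm equal bytes read by xend
```

Note that the wake and wait calls use different futex addresses (0x9c5ba0 and
0x9d10c0), so the thread is waking waiters on one futex and then waiting on
another.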

Expected results:

xend should not hang.

Additional info:

Restarting xend fixes the problem.

Comment 1 Richard W.M. Jones 2007-11-19 10:34:32 EST
Setting this to NEEDINFO from me - I need to retest whether this
still happens with a more recent xend.

Comment 2 Bug Zapper 2008-04-03 20:47:38 EDT
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 3 Richard W.M. Jones 2008-04-04 06:16:59 EDT
I have not seen this one for a very long time.  If it recurs when
I do more stress testing, I will reopen.