Bug 240465

Summary: xen domains in dying state, not actually dying
Product: Fedora
Reporter: Richard W.M. Jones <rjones>
Component: xen
Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Version: 9
CC: triage, virt-maint
Hardware: x86_64
OS: Linux
Whiteboard: bzcl34nup
Last Closed: 2009-07-14 17:26:08 UTC
Attachments:
* Output of dmesg
* Output of strace -f -s 1500 virsh list
* xend.log from xend start, virsh list, xend stop
* xend-debug.log from xend start, virsh list, xend stop
* Output of xenstore-ls

Description Richard W.M. Jones 2007-05-17 17:56:41 UTC
Description of problem:

After running a machine under load for some hours, I have it in a confused state
where domains appear to be marked at the hypervisor level as "dying", but in
fact they're just stuck.

I captured as much information as possible, using a freshly restarted xend and
the tools virsh list and xm list.

# /usr/sbin/xm list
Name                                      ID   Mem VCPUs      State   Time(s)
Domain-0                                   0  1961     4     r-----  32395.2
centos5                                        256     1                 0.2
fc6                                            256     1                 0.0
fc6-2                                          256     1                 0.0
fc6-3                                          256     1                 0.0
fc6-4                                          256     1                 0.0
fc6-5                                          256     1                 0.0
fc6-6                                          256     1                 0.0
fc6-7                                    820   256     1     -b----     14.9
fc6-8                                          256     1                 0.0
freebsd32                                      256     1                 0.0

# virsh list
 Id Name                 State
----------------------------------
  0 Domain-0             running
libvir: Xen Daemon error : GET operation failed: 
libvir: Xen Daemon error : GET operation failed: 
libvir: Xen Daemon error : GET operation failed: 
libvir: Xen Daemon error : GET operation failed: 
libvir: Xen Daemon error : GET operation failed: 
libvir: Xen Daemon error : GET operation failed: 
820 fc6-7                blocked
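
For reference, the per-domain query that virsh list performs can be reproduced
with a few libvirt C API calls; the "GET operation failed" errors above are
raised by libvirt's xend backend while servicing these lookups.  What follows
is only a minimal sketch, not taken from this report: the xen:/// connection
URI and the cap of 64 active domains are assumptions.

/* Minimal sketch: list active domains and ask libvirt for each one's state,
 * roughly what `virsh list` does.  Build with: gcc -std=c99 ... -lvirt */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly("xen:///"); /* URI assumed */
    if (!conn) {
        fprintf(stderr, "failed to connect to the Xen driver\n");
        return 1;
    }

    int ids[64];
    int n = virConnectListDomains(conn, ids, 64);   /* active domains only */
    for (int i = 0; i < n; i++) {
        virDomainPtr dom = virDomainLookupByID(conn, ids[i]);
        if (!dom)
            continue;
        virDomainInfo info;
        if (virDomainGetInfo(dom, &info) == 0)      /* state, memory, cpuTime */
            printf("id=%d state=%d cpuTime=%llu\n",
                   ids[i], info.state, info.cpuTime);
        virDomainFree(dom);
    }

    virConnectClose(conn);
    return 0;
}

If the hypervisor-level observation in comment 6 below is right, the stuck
domains would be expected to report a cpuTime that never changes between runs
of a query like this.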

Version-Release number of selected component (if applicable):

xen-3.1.0-0.rc7.1.fc7
+ patch to fix bug 240009
+ http://www.gnome.org/~markmc/code/xen-libvncserver-threading.patch
+ http://www.gnome.org/~markmc/code/xen-libvncserver-threading2.patch

How reproducible:

Rare - this is the second time I have seen this.

Steps to Reproduce:
1. Run stress tests with eight domains:
http://et.redhat.com/~rjones/xen-stress-tests/
  
Actual results:

After a while, domains go into "paused" state (as shown in virt-manager) but
cannot be resumed.  There is some evidence that their real state is "dying" and
that they are stuck.

Expected results:

Domains should not get stuck.

Additional info:

I'm going to attach the following files:
* dmesg.txt - output of dmesg
* virsh.strace.txt - output of strace -f -s 1500 virsh list
* xend-debug.log & xend.log - log files from "xend start, virsh list, xend stop"
* xenstore-ls.txt - output of xenstore-ls

Load testing methodology: http://et.redhat.com/~rjones/xen-stress-tests/

Comment 1 Richard W.M. Jones 2007-05-17 17:57:26 UTC
Created attachment 154941 [details]
Output of dmesg

Comment 2 Richard W.M. Jones 2007-05-17 17:58:15 UTC
Created attachment 154942 [details]
Output of strace -f -s 1500 virsh list

Comment 3 Richard W.M. Jones 2007-05-17 17:58:43 UTC
Created attachment 154943 [details]
xend.log from xend start, virsh list, xend stop

Comment 4 Richard W.M. Jones 2007-05-17 17:59:07 UTC
Created attachment 154944 [details]
xend-debug.log from xend start, virsh list, xend stop

Comment 5 Richard W.M. Jones 2007-05-17 17:59:30 UTC
Created attachment 154945 [details]
Output of xenstore-ls

Comment 6 Richard W.M. Jones 2007-05-17 18:30:14 UTC
I hacked a copy of libvirt so that we could see what the hypervisor actually
thinks is running on this machine.  The results are below.  An example of a
domain in the strange state is domain 718 (second one down in this list).

domain = 0
flags = 32
tot_pages = 502080
max_pages = 4294967295
shared_info_frame = 464
cpu_time = 32529805311753
nr_online_vcpus = 4
max_vcpu_id = 3
ssidref = 0
handle = 0x618d94

domain = 718
flags = 5
tot_pages = 472
max_pages = 65536
shared_info_frame = 3978
cpu_time = 72096732848
nr_online_vcpus = 1
max_vcpu_id = 0
ssidref = 0
handle = 0x618ddc

domain = 719
flags = 5
tot_pages = 472
max_pages = 65536
shared_info_frame = 4020
cpu_time = 50524523399
nr_online_vcpus = 1
max_vcpu_id = 0
ssidref = 0
handle = 0x618e24

domain = 742
flags = 5
tot_pages = 469
max_pages = 65536
shared_info_frame = 4032
cpu_time = 57615609754
nr_online_vcpus = 1
max_vcpu_id = 0
ssidref = 0
handle = 0x618e6c

domain = 746
flags = 5
tot_pages = 469
max_pages = 65536
shared_info_frame = 3990
cpu_time = 37173524272
nr_online_vcpus = 1
max_vcpu_id = 0
ssidref = 0
handle = 0x618eb4

domain = 763
flags = 5
tot_pages = 469
max_pages = 65536
shared_info_frame = 4058
cpu_time = 42146649115
nr_online_vcpus = 1
max_vcpu_id = 0
ssidref = 0
handle = 0x618efc

domain = 765
flags = 5
tot_pages = 469
max_pages = 65536
shared_info_frame = 3892
cpu_time = 41052249030
nr_online_vcpus = 1
max_vcpu_id = 0
ssidref = 0
handle = 0x618f44

domain = 820
flags = 16
tot_pages = 65536
max_pages = 65536
shared_info_frame = 3938
cpu_time = 40549375542
nr_online_vcpus = 1
max_vcpu_id = 0
ssidref = 0
handle = 0x618f8c

Some other observations:

The HV information for the stuck domains does not appear to change between
calls.  In particular, cpu_time stays the same.

If my understanding is right, flags = 5 means XEN_DOMINF_dying |
XEN_DOMINF_shutdown, which would indicate that in the HV these domains have
both d->is_dying and d->is_shut_down set in the domain structure (see
xen/include/xen/sched.h).
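
For illustration, those flag bits can be decoded with a small self-contained
program.  The bit values below are not taken from this report; they are
assumptions based on the XEN_DOMINF_* definitions in Xen's public interface
header (xen/include/public/domctl.h in the 3.x series) and should be checked
against the headers of the exact build.  With those values, flags = 5 decodes
to dying + shutdown, 16 to blocked and 32 to running, which matches the stuck
domains, fc6-7 (id 820) and Domain-0 above.

/* Sketch: decode the getdomaininfo flags field.  Bit positions are assumed
 * from xen/include/public/domctl.h (Xen 3.x); verify before relying on them. */
#include <stdio.h>

#define DOMINF_dying    (1u << 0)
#define DOMINF_hvm      (1u << 1)
#define DOMINF_shutdown (1u << 2)
#define DOMINF_paused   (1u << 3)
#define DOMINF_blocked  (1u << 4)
#define DOMINF_running  (1u << 5)

static void decode(unsigned int flags)
{
    printf("flags = %-2u:", flags);
    if (flags & DOMINF_dying)    printf(" dying");
    if (flags & DOMINF_hvm)      printf(" hvm");
    if (flags & DOMINF_shutdown) printf(" shutdown");
    if (flags & DOMINF_paused)   printf(" paused");
    if (flags & DOMINF_blocked)  printf(" blocked");
    if (flags & DOMINF_running)  printf(" running");
    printf("\n");
}

int main(void)
{
    decode(32);  /* Domain-0 above: running          */
    decode(5);   /* stuck domains:  dying + shutdown */
    decode(16);  /* fc6-7 (id 820): blocked          */
    return 0;
}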

Comment 7 Richard W.M. Jones 2007-05-19 13:26:17 UTC
Reported on xen-devel:
http://lists.xensource.com/archives/html/xen-devel/2007-05/threads.html#00732

Comment 8 Bug Zapper 2008-04-04 00:45:25 UTC
Based on the date this bug was created, it appears to have been reported
against rawhide during the development of a Fedora release that is no
longer maintained. In order to refocus our efforts as a project we are
flagging all of the open bugs for releases which are no longer
maintained. If this bug remains in NEEDINFO thirty (30) days from now,
we will automatically close it.

If you can reproduce this bug in a maintained Fedora version (7, 8, or
rawhide), please change this bug to the respective version and change
the status to ASSIGNED. (If you're unable to change the bug's version
or status, add a comment to the bug and someone will change it for you.)

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we're following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

Comment 9 Richard W.M. Jones 2008-04-04 10:16:18 UTC
Assigning this bug back to me to retest.

Comment 10 Bug Zapper 2008-05-14 02:54:43 UTC
Changing version to '9' as part of upcoming Fedora 9 GA.
More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 11 Bug Zapper 2009-06-09 22:36:37 UTC
This message is a reminder that Fedora 9 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 9.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '9'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 9's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 9 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 12 Bug Zapper 2009-07-14 17:26:08 UTC
Fedora 9 changed to end-of-life (EOL) status on 2009-07-10. Fedora 9 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.