Bug 249564 - stopping xend with running domUs destroys xen system
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: xen
Version: 5.1
Hardware: All Linux
Priority: low
Severity: medium
Assigned To: Daniel Berrange
QA Contact: Virtualization Bugs
Keywords: OtherQA, Regression
Reported: 2007-07-25 11:31 EDT by Chuck Morrison
Modified: 2009-12-14 16:15 EST
Fixed In Version: RHEA-2007-0635
Doc Type: Bug Fix
Last Closed: 2007-11-07 12:11:14 EST

Attachments
xend.log (186.63 KB, application/octet-stream)
2007-07-25 11:37 EDT, Chuck Morrison
Normalize UUIDs when loading VMs (664 bytes, patch)
2007-07-25 15:58 EDT, Daniel Berrange
Keep track of QEMU pids (2.52 KB, patch)
2007-09-11 17:34 EDT, Daniel Berrange
Description Chuck Morrison 2007-07-25 11:31:55 EDT
Description of problem:

When running several fully virtualized domains I stopped xend and restarted it.
At first the virt-manager showed the domains in a loop of starting and stopping
with different names (like domain-11, not its assigned name), eventually "xm
list" showed them as zombie-domainxx. I deleted them in virt-manager and
reinstalled under a different dom name, but using the same partition for the
guest OS. Starting the new domain would not get me to a working terminal, but
would show as running with 0% cpu.

Version-Release number of selected component (if applicable):

rhel5.1b0

How reproducible:

very

Steps to Reproduce:
1. install rhel5.1 w/ xen
2. install fully virtualized guest domain(s) and leave them running
3. stop xend and start xend
4. reboot and try to access or start any of the guest domains  

Actual results:

xend stopped leaving the virtual machines in an indeterminate state. When it
came back up it couldn't figure out what to do with them and they were lost
forever. Reinstalling of new virtual machines always results in a non-bootable
domain. Error messages (see below) refer to missing domain-ids. 

Expected results:

I would expect xend, when issued a "stop", would at least send a "destroy" to
the virtual machines before it stopped. Or perhaps clean up hanging info when
starting up. Something to keep from losing virtual machines.

Additional info:

[root@max1 ~]# libvir: Xen Daemon error : internal error domain information
incomplete, missing id
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/console.py", line 418, in
update_widget_states
    if vm.is_serial_console_tty_accessible():
  File "/usr/share/virt-manager/virtManager/domain.py", line 427, in
is_serial_console_tty_accessible
    tty = self.get_serial_console_tty()
  File "/usr/share/virt-manager/virtManager/domain.py", line 424, in
get_serial_console_tty
    return self.get_xml_string("/domain/devices/console/@tty")
  File "/usr/share/virt-manager/virtManager/domain.py", line 403, in get_xml_string
    xml = self.get_xml()
  File "/usr/share/virt-manager/virtManager/domain.py", line 51, in get_xml
    self.xml = self.vm.XMLDesc(0)
  File "/usr/lib/python2.4/site-packages/libvirt.py", line 196, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirt.libvirtError: virDomainGetXMLDesc() failed internal error domain
information incomplete, missing id
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/console.py", line 207, in retry_login
    self.try_login()
  File "/usr/share/virt-manager/virtManager/console.py", line 217, in try_login
    protocol, host, port = self.vm.get_graphics_console()
  File "/usr/share/virt-manager/virtManager/domain.py", line 433, in
get_graphics_console
    type = self.get_xml_string("/domain/devices/graphics/@type")
  File "/usr/share/virt-manager/virtManager/domain.py", line 403, in get_xml_string
    xml = self.get_xml()
  File "/usr/share/virt-manager/virtManager/domain.py", line 51, in get_xml
    self.xml = self.vm.XMLDesc(0)
  File "/usr/lib/python2.4/site-packages/libvirt.py", line 196, in XMLDesc
    if ret is None: raise libvirtError ('virDomainGetXMLDesc() failed', dom=self)
libvirt.libvirtError: virDomainGetXMLDesc() failed internal error domain
information incomplete, missing id
Comment 1 Chuck Morrison 2007-07-25 11:37:13 EDT
Created attachment 159946
xend.log

This log is from after the damage was done and records the attempts to recreate
the failed domains.
Comment 2 Hugh Brock 2007-07-25 11:46:32 EDT
Was virt-manager running when you restarted xend?
Comment 3 Hugh Brock 2007-07-25 11:54:08 EDT
No problem doing this on various x86 test hardware, trying it on ia64 now.
Comment 4 Chuck Morrison 2007-07-25 12:08:41 EDT
Re: comment #2: No, virt-manager was not running.
Comment 5 Hugh Brock 2007-07-25 15:09:05 EDT
Interesting, this appears to be a genuine regression in the rhel 5.1 xend
version. Reassigning it to the appropriate person.
Comment 6 Daniel Berrange 2007-07-25 15:51:35 EDT
XenD uses xenstore to maintain state about VMs such as their name. When
restarting it looks up this info based on UUID. Unfortunately XenD is not
normalizing the different formats for UUIDs,

e.g.
  8f07fe28753f2729d76dbdbd892f949a
vs
  8f07fe28-753f-2729-d76d-bdbd892f949a

So if xenstore has it in one format, and xend tries to read based on the other
format it fails to find the data, and gets very confused.
Comment 7 Daniel Berrange 2007-07-25 15:53:47 EDT
This is a regression from GA. The tools in RHEL-5.1 often use the form of UUIDs
without '-', but XenD (mostly) uses form with '-'. In GA, everything used the
format with '-'. Rather than fixing the tools to use '-', though, we should fix
XenD to normalize UUIDs.
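
For illustration, the normalization described in comments 6 and 7 can be sketched as below. This is a minimal sketch in modern Python and is not the attached patch: it assumes the standard-library uuid module (which the Python 2.4 shipped with RHEL 5 does not include), normalize_uuid and lookup_vm are hypothetical names, and a plain dict stands in for xenstore.

```python
import uuid


def normalize_uuid(value):
    """Return the canonical lowercase, dashed form of a UUID string."""
    # uuid.UUID accepts both the bare 32-hex-digit form and the dashed form.
    return str(uuid.UUID(value))


def lookup_vm(store, vm_uuid):
    """Look up VM state in a dict-like store keyed by normalized UUID.

    `store` is a hypothetical stand-in for xenstore, assumed to be keyed
    by UUIDs that were normalized when they were written.
    """
    return store.get(normalize_uuid(vm_uuid))


if __name__ == "__main__":
    # Written with the undashed form, read back with the dashed form.
    store = {normalize_uuid("8f07fe28753f2729d76dbdbd892f949a"): {"name": "fvdom1"}}
    print(lookup_vm(store, "8f07fe28-753f-2729-d76d-bdbd892f949a"))
```

Normalizing on both the write and the read path means it no longer matters which form the tools hand over, which is the reason given above for fixing XenD rather than the tools.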
Comment 8 RHEL Product and Program Management 2007-07-25 15:57:44 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 9 Daniel Berrange 2007-07-25 15:58:35 EDT
Created attachment 159977
Normalize UUIDs when loading VMs
Comment 10 RHEL Product and Program Management 2007-07-25 16:02:51 EDT
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 12 Daniel Berrange 2007-07-30 18:43:28 EDT
* Thu Jul 19 2007 Daniel P. Berrange <berrange@redhat.com> - 3.0.3-34.el5
- Normalize UUID to avoid loosing guest name upon restarts (rhbz #249564)


$ brew latest-pkg dist-5E-qu-candidate xen
Build                                     Tag                   Built by
----------------------------------------  --------------------  ----------------
xen-3.0.3-34.el5                          dist-5E-qu-candidate  berrange

Comment 14 Doug Chapman 2007-08-03 15:37:35 EDT
would it be possible to make this latest version available outside of Red Hat
for partners to test?

thanks,

- Doug
Comment 15 Doug Chapman 2007-08-03 17:00:49 EDT
I just saw that Red Hat will be sending HP weekly snapshots between now and GA
so that satisfies my request from comment #14.

- Doug
Comment 16 John Poelstra 2007-08-30 20:25:41 EDT
A fix for this issue should have been included in the packages contained in the
RHEL5.1-Snapshot4 on partners.redhat.com.  

Requested action: Please verify that your issue is fixed *as soon as possible*
to ensure that it is included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.
Comment 17 Chuck Morrison 2007-09-07 12:14:56 EDT
I do not see a major change under snapshot 4.

If I stop xend with an hvm active, that hvm continues to function, but
virt-manager and xm will not run (all is as it should be so far). However, when
xend is restarted and virt-manager is restarted, virt-manager has limited
functionality and some is broken. Specifically, if you use virt-manager to
shutdown the hvm(s), you will see the hvm bouncing between shutdown and running.
This happens indefinitely as far as I can tell. xm list shows the following:

[root@max1 ~]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0      433     1 r-----   9795.7
Zombie-fvdom1                              2     1031     1 ---s-d     35.5

On reboot of dom0 all appears to work correctly again.
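
A small sketch for spotting this state from a script, assuming the column layout of the xm list output shown above (name in the first whitespace-separated column). find_zombies is a hypothetical helper, not part of xm or xend, and it only inspects text that has already been captured.

```python
def find_zombies(xm_list_output):
    """Return the names of domains that xm list reports with a Zombie- prefix."""
    zombies = []
    for line in xm_list_output.splitlines():
        fields = line.split()
        # Skip blank lines and the column header row.
        if not fields or fields[0] == "Name":
            continue
        name = fields[0]
        # The report shows both "Zombie-..." and "zombie-..." spellings.
        if name.lower().startswith("zombie-"):
            zombies.append(name)
    return zombies


if __name__ == "__main__":
    sample = (
        "Name                                      ID Mem(MiB) VCPUs State   Time(s)\n"
        "Domain-0                                   0      433     1 r-----   9795.7\n"
        "Zombie-fvdom1                              2     1031     1 ---s-d     35.5\n"
    )
    print(find_zombies(sample))
```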
Comment 18 John Poelstra 2007-09-07 12:23:10 EDT
Re: comment #17 are you reporting that the reported issue is not fixed or 
raising a new issue?

Thank you
Comment 19 Daniel Berrange 2007-09-07 12:32:19 EDT
Comment #17 has a different series of steps to produce a problem than the
original first comment in this ticket. The problem identified in the initial
post was addressed. 

This sounds like a different problem, with a similar outward appearance of a
guest going Zombie but subtly different behaviour, so IMHO we need a new BZ
for it.
Comment 20 Chuck Morrison 2007-09-07 13:12:57 EDT
I'm not sure what you are thinking the reported issue is. 

From the original report:

"When running several fully virtualized domains I stopped xend and restarted it.
At first the virt-manager showed the domains in a loop of starting and stopping
with different names (like domain-11, not its assigned name), eventually "xm
list" showed them as zombie-domainxx."

From what I can tell, the main difference (from my comment #17) is that I tried
to shutdown the hvm in virt-manager before the cycling occurred. I don't
remember if I had done this in the original instance or not. The end result is
the same, hvm(s) cycling on and off, labeled as zombies and a reboot required to
clean things up.

Comment 21 Daniel Berrange 2007-09-11 16:58:59 EDT
There are two underlying problems in this ticket, both of which result in the
same user visible behaviour of a domain toggling between running & inactive.

The first is that XenD does not normalize the UUIDs for VMs. This means that
when XenD restarts it fails to correctly re-associate a running VM with its
configuration data. This affects both HVM & paravirt guests, and indeed every
guest created with the RHEL-5.1 version of virt-install/virt-manager.
This is the issue that was identified & fixed in xen-3.0.3-34.el5.

After further investigation it appears there is a flaw which can impact HVM
guests. Normally XenD is responsible for killing the QEMU device model when an
HVM guest shuts down. To do this, it requires knowledge of the PID of the QEMU
process. If XenD is restarted, though, it will lose this info. So if you destroy
a guest, and the QEMU process does not automatically detect this shutdown, it
will hang around & again cause the zombie domains. This is most likely the
problem still being reported in comment #17. There are more upstream patches to
deal with this problem; we'll need to see if backporting is practical.
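
To make the second problem concrete, here is a minimal sketch of the general idea: record the device-model PID somewhere that survives a daemon restart, and reload it on startup. It is not the upstream changeset or the RHEL patch. DeviceModelTracker, its methods, and the key layout are all hypothetical, and a plain dict stands in for xenstore.

```python
import os
import signal


class DeviceModelTracker:
    """Remembers device-model PIDs in a persistent store (hypothetical helper)."""

    def __init__(self, store, kill=os.kill):
        self.store = store  # dict-like stand-in for xenstore
        self.kill = kill    # injectable so the sketch can be exercised safely
        self.pids = {}      # in-memory cache, lost when the daemon restarts

    def _key(self, domid):
        # Hypothetical key layout, not the real xenstore path.
        return "domain/%d/device-model-pid" % domid

    def register(self, domid, pid):
        """Record the PID both in memory and in the persistent store."""
        self.pids[domid] = pid
        self.store[self._key(domid)] = str(pid)

    def recreate(self, domid):
        """Called after a daemon restart: reload the PID from the store."""
        raw = self.store.get(self._key(domid))
        if raw is not None:
            self.pids[domid] = int(raw)
        return self.pids.get(domid)

    def destroy(self, domid):
        """Terminate the device model for a domain, if its PID is known."""
        pid = self.pids.pop(domid, None)
        self.store.pop(self._key(domid), None)
        if pid is None:
            return False
        self.kill(pid, signal.SIGTERM)
        return True
```

A fresh tracker built over the same store plays the role of the restarted daemon: without the recreate step its in-memory table is empty and destroy has nothing to signal, which corresponds to the lingering QEMU process described above.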
Comment 22 Daniel Berrange 2007-09-11 17:26:24 EDT
The original upstream fix for this was in:

changeset:   12657:cefb1f761f0b
user:        Ewan Mellor <ewan@xensource.com>
date:        Thu Nov 30 18:08:34 2006 +0000
files:       tools/python/xen/xend/XendConstants.py
tools/python/xen/xend/XendDomain.py tools/python/xen/xend/XendDomainInfo.py
tools/python/xen/xend/image.py
description:

Fix HVM shutdown when xend is restarted.

Added a recreate call to ImageHandler, allowing the subclasses of that to
hook into the code that runs when xend restarts.  This allows us in particular
to reregister the watches for HVM shutdown, and read the PID of qemu-dm from
the store.

Signed-off-by: Ewan Mellor <ewan@xensource.com>


Between that time and 3.1 though the code changed considerably further, so we
can't simply apply that changeset. The changeset also contains a lot of crazy &
pointless refactoring. I need to figure out what the minimal fix is.
Comment 23 Daniel Berrange 2007-09-11 17:34:53 EDT
Created attachment 193041
Keep track of QEMU pids

This patch backports the important part of changeset 12657:cefb1f761f0b to the
RHEL-5.1 xen stack. The original UUID normalization patch is still needed. This
patch fixes an additional HVM-specific flaw which hits after the UUID flaw is
resolved.
Comment 27 John Poelstra 2007-09-20 00:47:23 EDT
A fix for this issue should have been included in the packages contained in the
RHEL5.1-Snapshot7 on partners.redhat.com.  

Requested action: Please verify that your issue is fixed ASAP to confirm that it
will be included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.
Comment 28 John Poelstra 2007-09-26 19:44:51 EDT
A fix for this issue should be included in the packages contained in
RHEL5.1-Snapshot8--available now on partners.redhat.com.  

IMPORTANT: This is the last opportunity to confirm that your issue is fixed in
the RHEL5.1 update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.
Comment 30 errata-xmlrpc 2007-11-07 12:11:14 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2007-0635.html
