Bug 920704 - Guests fail to resume after host reboots
Summary: Guests fail to resume after host reboots
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-packstack
Version: 2.0 (Folsom)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 2.1
Assignee: Martin Magr
QA Contact: Nir Magnezi
URL:
Whiteboard:
Depends On: 890512 912284
Blocks:
 
Reported: 2013-03-12 15:13 UTC by Brent Eagles
Modified: 2023-09-18 09:58 UTC (History)
11 users

Fixed In Version: openstack-packstack-2012.2.3-0.12.dev495
Doc Type: Bug Fix
Doc Text:
It was discovered that a race condition existed when rebooting Compute (Nova) nodes. The libvirt-guests script would sometimes start before the Compute (Nova) service itself. As a result, both services would attempt to restart the virtual machine instances that were running on the host prior to shutdown, and the instances would ultimately terminate in error. PackStack has been updated to disable the libvirt-guests script on Compute nodes, ensuring that the Compute service has full control of restarting virtual machine instances.
Clone Of: 912284
Environment:
Last Closed: 2013-07-16 17:11:22 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 33816 0 None MERGED Disable starting VMs by libvirt 2019-11-11 09:44:00 UTC
OpenStack gerrit 34200 0 None MERGED Disable starting VMs by libvirt 2019-11-11 09:44:00 UTC
OpenStack gerrit 34201 0 None MERGED Disable starting VMs by libvirt 2019-11-11 09:44:00 UTC
Red Hat Product Errata RHBA-2013:1082 0 normal SHIPPED_LIVE Red Hat OpenStack 2.1 Preview bug fix advisory 2013-07-16 21:10:56 UTC

Comment 4 Brent Eagles 2013-03-13 16:22:31 UTC
This seems like it might be caused by a race on startup among the services. Further investigation is required to confirm.

Comment 5 Brent Eagles 2013-03-14 21:16:58 UTC
After monitoring the interaction between the services, including libvirt, I discovered that there is a race condition in the startup scripts between OpenStack and libvirt-guests. The "Failed to resume" error occurs because the libvirt-guests script is restarting the VMs at the same time as the startup in nova. It is a tricky interaction that isn't seen all of the time, because the OpenStack startup must have run past its initialization stage before any of the libvirt-guests attempts can run. Furthermore, for any of the libvirt-guests attempts to actually succeed, nova has to have set up the network interface, etc. I am testing a patch that makes nova *not* error out the VMs when this happens. Most likely the appropriate thing to do is to ignore/log the exception from the createDomainWithFlags() call at the site where it occurs. Other configuration will be necessary for the VM to be of any use. This is probably relevant to Grizzly and upstream Folsom as well.

Comment 6 Brent Eagles 2013-03-14 21:26:01 UTC
As a side note, you can pretty much tell when this happens by running virsh list. If a VM you are expecting not to be there is actually running, it is a pretty good indicator that this is what has occurred.
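That check can be sketched as a small shell snippet. Sample `virsh list` output is inlined here so the parsing is runnable anywhere; on a real compute node you would substitute the live `virsh list` output (the domain name below is illustrative, not from this bug):

```shell
# Inlined sample of `virsh list` output (assumption: RHEL 6-era table layout).
virsh_output=' Id    Name                           State
----------------------------------------------------
 3     instance-00000002              running'

# Print the names of domains libvirt reports as running; any name that Nova
# believes is stopped points at the libvirt-guests race described above.
printf '%s\n' "$virsh_output" | awk '$3 == "running" { print $2 }'
```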

Comment 7 Brent Eagles 2013-03-14 21:35:38 UTC
Probably a more effective way to resolve this is to disable automatic restart of persistent VMs by changing the value of the ON_BOOT variable in the libvirt-guests configuration to an empty string. This will avoid any other potential race conditions caused by libvirt concurrently kicking off VMs.
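A minimal sketch of that change, assuming the RHEL 6 layout of /etc/sysconfig/libvirt-guests; it edits a local stand-in file so the snippet is safe to run anywhere:

```shell
# Create a local stand-in for /etc/sysconfig/libvirt-guests (assumed layout).
printf '#ON_BOOT=start\nON_SHUTDOWN=suspend\n' > ./libvirt-guests

# Set ON_BOOT to an empty string so libvirt-guests never auto-starts VMs,
# leaving Nova in full control of instance restarts.
sed -i 's/^#\?ON_BOOT=.*/ON_BOOT=/' ./libvirt-guests
grep '^ON_BOOT=' ./libvirt-guests
```

On a real host the same `sed` expression would be applied to /etc/sysconfig/libvirt-guests itself.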

Comment 8 Dave Allan 2013-03-15 21:09:03 UTC
Dan, it doesn't look to me like we have agreement between OpenStack and libvirt about what component is going to be starting VMs.  Can you shed any light on how this is supposed to work?

Comment 9 Daniel Berrangé 2013-03-18 13:34:28 UTC
The libvirt-guests script simply should be enabled on any OpenStack system. Nova must retain full control over startup.

Comment 10 Daniel Berrangé 2013-03-18 14:29:26 UTC
And when I said 'enabled' there, I meant libvirt-guests should be DISABLED.

Comment 11 Ofer Blaut 2013-03-19 07:05:08 UTC
Adding an additional depends-on bug, since resume after host reboots = true will end up in SHUTOFF or ERROR states.

Comment 12 Derek Higgins 2013-04-25 14:46:28 UTC
What needs to happen here? I see it was assigned to packstack, but I am not clear what needs to happen. Disable the libvirt-guests script?

Comment 13 Brent Eagles 2013-04-26 12:52:44 UTC
Yes, that is correct. We need to alter the libvirt-guests file to prevent the start or restart actions from booting VMs, or remove the script altogether.

Comment 15 Nir Magnezi 2013-07-14 08:52:34 UTC
Verified NVR: openstack-packstack-2012.2.3-0.12.dev495

Verification steps:
====================
1. Installed openstack via packstack (all-in-one topology).
2. Uploaded an image to glance.
3. Launched 4 instances.
4. Rebooted the server.
5. Reconnected and listed instances via nova
   # nova list

+--------------------------------------+------+---------+--------------------------+
| ID                                   | Name | Status  | Networks                 |
+--------------------------------------+------+---------+--------------------------+
| 2565330b-056f-470e-b555-3849a9a9a3fe | test | SHUTOFF | novanetwork=192.168.32.4 |
| 5b94a2de-5c90-48c3-a3a8-4823dfc00403 | test | SHUTOFF | novanetwork=192.168.32.3 |
| ae80283d-98b4-4d85-969e-e93287b7d467 | test | SHUTOFF | novanetwork=192.168.32.5 |
| cb39ee9e-2f62-4cb7-9f8a-c7e57aa1c284 | test | SHUTOFF | novanetwork=192.168.32.2 |
+--------------------------------------+------+---------+--------------------------+

6. Hard-rebooted all instances and verified that their status changed back to ACTIVE.

7. Verified that there are no errors in either the nova or libvirt logs
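The check in step 5 can be scripted: scan the `nova list` table for instances that did not come back after the host reboot. A sample row is inlined so the snippet runs anywhere; on a real host you would pipe in the live `nova list` output instead:

```shell
# Inlined sample row from a `nova list` table like the one above.
nova_list='| 2565330b-056f-470e-b555-3849a9a9a3fe | test | SHUTOFF | novanetwork=192.168.32.4 |'

# Count instances stuck in SHUTOFF or ERROR; per step 6, these need a
# hard reboot to bring them back to ACTIVE.
stuck=$(printf '%s\n' "$nova_list" | awk -F'|' '$4 ~ /SHUTOFF|ERROR/' | wc -l)
echo "instances needing a hard reboot: $stuck"
```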


Additional Info:
================

answer-file I used:

CONFIG_GLANCE_INSTALL=y
CONFIG_CINDER_INSTALL=y
CONFIG_NOVA_INSTALL=y
CONFIG_HORIZON_INSTALL=y
CONFIG_SWIFT_INSTALL=y
CONFIG_CLIENT_INSTALL=y
CONFIG_NTP_SERVERS=
CONFIG_NAGIOS_INSTALL=y
CONFIG_SSH_KEY=/root/.ssh/id_rsa.pub
CONFIG_MYSQL_HOST=IP_Address
CONFIG_MYSQL_USER=root
CONFIG_MYSQL_PW=11b22a572e4e4dcd
CONFIG_QPID_HOST=IP_Address
CONFIG_KEYSTONE_HOST=IP_Address
CONFIG_KEYSTONE_DB_PW=194892c3b5964441
CONFIG_KEYSTONE_ADMIN_TOKEN=b078d5e8ef11425ba1eb2b8204000f57
CONFIG_KEYSTONE_ADMIN_PW=123456
CONFIG_GLANCE_HOST=IP_Address
CONFIG_GLANCE_DB_PW=9a2851280f594546
CONFIG_GLANCE_KS_PW=3037970a19c741b1
CONFIG_CINDER_HOST=IP_Address
CONFIG_CINDER_DB_PW=078eb967731c49fc
CONFIG_CINDER_KS_PW=05d43dda2d574c81
CONFIG_CINDER_VOLUMES_CREATE=y
CONFIG_CINDER_VOLUMES_SIZE=20G
CONFIG_NOVA_API_HOST=IP_Address
CONFIG_NOVA_CERT_HOST=IP_Address
CONFIG_NOVA_VNCPROXY_HOST=IP_Address
CONFIG_NOVA_COMPUTE_HOSTS=IP_Address
CONFIG_NOVA_COMPUTE_PRIVIF=eth1
CONFIG_NOVA_NETWORK_HOST=IP_Address
CONFIG_NOVA_DB_PW=4aba7dd07c2a46aa
CONFIG_NOVA_KS_PW=d6f04927d8fb4f23
CONFIG_NOVA_NETWORK_PUBIF=eth2
CONFIG_NOVA_NETWORK_PRIVIF=eth1
CONFIG_NOVA_NETWORK_FIXEDRANGE=192.168.32.0/22
CONFIG_NOVA_NETWORK_FLOATRANGE=10.3.4.0/22
CONFIG_NOVA_NETWORK_AUTOASSIGNFLOATINGIP=n
CONFIG_NOVA_SCHED_HOST=IP_Address
CONFIG_NOVA_SCHED_CPU_ALLOC_RATIO=16.0
CONFIG_NOVA_SCHED_RAM_ALLOC_RATIO=1.5
CONFIG_OSCLIENT_HOST=IP_Address
CONFIG_HORIZON_HOST=IP_Address
CONFIG_HORIZON_SSL=y
CONFIG_SSL_CERT=
CONFIG_SSL_KEY=
CONFIG_SWIFT_PROXY_HOSTS=IP_Address
CONFIG_SWIFT_KS_PW=ae9003c2cacd424a
CONFIG_SWIFT_STORAGE_HOSTS=IP_Address
CONFIG_SWIFT_STORAGE_ZONES=1
CONFIG_SWIFT_STORAGE_REPLICAS=1
CONFIG_SWIFT_STORAGE_FSTYPE=ext4
CONFIG_REPO=
CONFIG_RH_USER=
CONFIG_RH_PW=
CONFIG_RH_BETA_REPO=n
CONFIG_SATELLITE_URL=
CONFIG_SATELLITE_USER=
CONFIG_SATELLITE_PW=
CONFIG_SATELLITE_AKEY=
CONFIG_SATELLITE_CACERT=
CONFIG_SATELLITE_PROFILE=
CONFIG_SATELLITE_FLAGS=
CONFIG_SATELLITE_PROXY=
CONFIG_SATELLITE_PROXY_USER=
CONFIG_SATELLITE_PROXY_PW=
CONFIG_NAGIOS_HOST=IP_Address
CONFIG_NAGIOS_PW=123456

Comment 17 errata-xmlrpc 2013-07-16 17:11:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1082.html

