912284 – with resume_guests_state_on_host_boot=True rebooting host leaves VM's in Error state

Bug 912284 - with resume_guests_state_on_host_boot=True rebooting host leaves VM's in Error state

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-nova
Sub Component:
Version:	2.0 (Folsom)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	snapshot5
Target Release:	2.1
Assignee:	Brent Eagles
QA Contact:	Ofer Blaut
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	920704

Reported:	2013-02-18 09:50 UTC by Gary Kotton
Modified:	2022-07-09 06:22 UTC
CC List:	9 users
Fixed In Version:	openstack-nova-2012.2.3-6.el6ost
Doc Type:	Release Note
Doc Text:	Setting the configuration option resume_guests_state_on_host_boot to True (it is False by default) is not recommended. Setting it to True causes problems with re-spawning instances when many services are being restarted simultaneously. This usually occurs when the services are running on the same host that gets restarted.
Clone Of:
Clones:	920704
Environment:
Last Closed:	2013-04-04 20:21:16 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-16367	0	None	None	None	2022-07-09 06:22:43 UTC
Red Hat Product Errata	RHSA-2013:0709	0	normal	SHIPPED_LIVE	Moderate: openstack-nova security and bug fix update	2013-04-05 00:19:00 UTC

Comment 3 Ofer Blaut 2013-03-12 10:01:15 UTC

Hi

I encountered the same issue.

Bug moved to High, since the only workaround is to delete the VM.

It happens with nova-network as well.

openstack-nova-common-2012.2.3-4.el6ost.noarch


[root@puma04 ~(keystone_admin)]$ nova reboot c3074bdc-f93e-41d8-b409-a426be744326
ERROR: Cannot 'reboot' while instance is in vm_state error (HTTP 409) (Request-ID: req-9976e673-5d5b-439d-9a06-398deaf59c28)
[root@puma04 ~(keystone_admin)]$ nova reset-state c3074bdc-f93e-41d8-b409-a426be744326
[root@puma04 ~(keystone_admin)]$ nova reboot c3074bdc-f93e-41d8-b409-a426be744326
ERROR: Cannot 'reboot' while instance is in vm_state error (HTTP 409) (Request-ID: req-1e729b58-9f20-44cc-bf7c-7be7ba46cadf)



2013-03-12 11:27:25 INFO nova.compute.manager [req-c952cab3-134e-49c6-b8f1-ddcf9f6450ee None None] [instance: c3074bdc-f93e-41d8-b409-a426be744326] Rebooting instance after nova-compute restart.
2013-03-12 11:27:30 INFO nova.virt.libvirt.firewall [req-c952cab3-134e-49c6-b8f1-ddcf9f6450ee None None] [instance: c3074bdc-f93e-41d8-b409-a426be744326] Called setup_basic_filtering in nwfilter
2013-03-12 11:27:30 INFO nova.virt.libvirt.firewall [req-c952cab3-134e-49c6-b8f1-ddcf9f6450ee None None] [instance: c3074bdc-f93e-41d8-b409-a426be744326] Ensuring static filters
2013-03-12 11:27:33 WARNING nova.compute.manager [req-c952cab3-134e-49c6-b8f1-ddcf9f6450ee None None] [instance: c3074bdc-f93e-41d8-b409-a426be744326] Failed to resume instance

nova show output attached.

It seems to happen every 1-2 host reboots.

[root@puma04 ~(keystone_admin)]$ nova show c3074bdc-f93e-41d8-b409-a426be744326
+-------------------------------------+----------------------------------------------------------+
| Property                            | Value                                                    |
+-------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                   |
| OS-EXT-SRV-ATTR:host                | puma04.scl.lab.tlv.redhat.com                            |
| OS-EXT-SRV-ATTR:hypervisor_hostname | puma04.scl.lab.tlv.redhat.com                            |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000012                                        |
| OS-EXT-STS:power_state              | 1                                                        |
| OS-EXT-STS:task_state               | None                                                     |
| OS-EXT-STS:vm_state                 | error                                                    |
| accessIPv4                          |                                                          |
| accessIPv6                          |                                                          |
| config_drive                        |                                                          |
| created                             | 2013-03-10T07:14:58Z                                     |
| flavor                              | m1.tiny (1)                                              |
| hostId                              | fd68b52c230bee600e1b4819221ef01fe7b24c6221e37ecb6d770793 |
| id                                  | c3074bdc-f93e-41d8-b409-a426be744326                     |
| image                               | Fedora17-image (3a8b059f-6c4d-482f-883c-84b80bc7b86b)    |
| key_name                            | None                                                     |
| metadata                            | {}                                                       |
| name                                | VM-FED17                                                 |
| net_vlan_190 network                | 10.35.175.21                                             |
| security_groups                     | [{u'name': u'default'}]                                  |
| status                              | ERROR                                                    |
| tenant_id                           | 49202b0e09a4409c97475392341b57db                         |
| updated                             | 2013-03-12T09:27:33Z                                     |
| user_id                             | 1f0db08c839547339a4ede1d1fb99066                         |
+-------------------------------------+----------------------------------------------------------+
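
When this happens intermittently, it can help to pull the affected instance IDs out of the compute log rather than reading it by eye. The following is only a rough sketch: it assumes the log line layout seen in the excerpt above (timestamp, level, module, request context, instance ID, message), and the names LINE_RE, failed_resumes and SAMPLE are made up for illustration, not part of nova.

```python
import re

# Assumed layout, taken from the log excerpt in this comment:
# "<date> <time> <LEVEL> <module> [req-... ] [instance: <uuid>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<module>\S+) "
    r"\[[^\]]*\] \[instance: (?P<uuid>[0-9a-f-]{36})\] (?P<msg>.*)$"
)


def failed_resumes(lines):
    """Return (timestamp, instance uuid) for each 'Failed to resume instance' record."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("msg").startswith("Failed to resume instance"):
            hits.append((match.group("ts"), match.group("uuid")))
    return hits


# Two of the lines from the excerpt above, used as sample input.
SAMPLE = [
    "2013-03-12 11:27:30 INFO nova.virt.libvirt.firewall "
    "[req-c952cab3-134e-49c6-b8f1-ddcf9f6450ee None None] "
    "[instance: c3074bdc-f93e-41d8-b409-a426be744326] Ensuring static filters",
    "2013-03-12 11:27:33 WARNING nova.compute.manager "
    "[req-c952cab3-134e-49c6-b8f1-ddcf9f6450ee None None] "
    "[instance: c3074bdc-f93e-41d8-b409-a426be744326] Failed to resume instance",
]

for ts, uuid in failed_resumes(SAMPLE):
    print(ts, uuid)
```

In practice the lines would come from the compute log file on the rebooted host; the path to that file is deployment-specific and is deliberately not assumed here.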

Comment 4 Nikola Dipanov 2013-03-12 15:59:44 UTC

The resume_guests_state_on_host_boot config option we introduced seems to have issues when the whole host is restarted. We are investigating and will hopefully soon know more and propose an actual fix.

As a workaround, we will disable this option by default.

We have also opened a new bug, #920704, to track the progress of the actual fix.

To test this, restart the node and make sure that nova does not attempt to bring instances online immediately.

Please note that setting resume_guests_state_on_host_boot to True is now something we want our customers to avoid until we get a full fix.
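
Before rebooting a node for this test, it may be worth confirming what the option is actually set to. The sketch below is one possible way to check it, with some assumptions: that nova.conf is an INI-style file with the option under [DEFAULT], and that a missing option means False (the default described in this bug). The function name resume_on_boot_enabled is hypothetical.

```python
import configparser


def resume_on_boot_enabled(conf_text):
    """Return True if resume_guests_state_on_host_boot is enabled in the given config text."""
    # RawConfigParser avoids '%' interpolation; strict=False tolerates repeated keys.
    parser = configparser.RawConfigParser(strict=False)
    parser.read_string(conf_text)
    # Assumption: the option lives in [DEFAULT]; a missing option is treated as False.
    return parser.getboolean(
        "DEFAULT", "resume_guests_state_on_host_boot", fallback=False
    )


if __name__ == "__main__":
    sample = "[DEFAULT]\nresume_guests_state_on_host_boot = True\n"
    print(resume_on_boot_enabled(sample))
```

To check a real host, the text would be read from that host's nova.conf; if the function reports True, the node is still using the setting this bug advises against.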

Comment 5 Russell Bryant 2013-03-12 16:55:14 UTC

Note that this option is off by default upstream, and we had changed it for RHOS.  This workaround changes it back to the upstream default.

There is actually a good argument for leaving it off by default, anyway.  Many deployments would likely prefer it that way.  Having a node go down with running instances on it is a failure, and applications using the cloud would likely have moved on and treated the instances that failed as gone and spawned new ones.  For those types of applications, automatically trying to restart their instances may not be what they want.

So right now I'm thinking that once we turn this off, we should just leave it that way, but we should still get to the bottom of this and fix it since some deployments still may want to turn it on.

Comment 7 Ofer Blaut 2013-03-21 09:06:05 UTC

The default is now resume_guests_state_on_host_boot = false.

After a reboot, all VMs are in the shutoff state.

Tested on openstack-nova-compute-2012.2.3-7.el6ost.noarch

Comment 9 errata-xmlrpc 2013-04-04 20:21:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0709.html
