890512 – Some Instances are shutoff after host reboot

Summary: Some Instances are shutoff after host reboot

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-nova
Sub Component:
Version:	2.0 (Folsom)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	snapshot4
Target Release:	2.1
Assignee:	Brent Eagles
QA Contact:	Omri Hochman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	920704

Reported:	2012-12-27 12:39 UTC by Ofer Blaut
Modified:	2022-07-09 06:08 UTC
CC List:	4 users
Fixed In Version:	openstack-nova-2012.2.3-2.el6ost
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	913613
Environment:
Last Closed:	2013-03-21 18:14:28 UTC
Target Upstream Version:
Embargoed:

Attachments
Nova log per Instance (7.84 KB, application/octet-stream) 2012-12-27 12:39 UTC, Ofer Blaut

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1085771	None	None	None	Never
OpenStack gerrit	22232	'None'	MERGED	Support running periodic tasks immediately at startup	2020-09-16 12:07:46 UTC
Red Hat Issue Tracker	OSP-16346	None	None	None	2022-07-09 06:08:50 UTC
Red Hat Product Errata	RHSA-2013:0657	normal	SHIPPED_LIVE	Moderate: openstack-nova security, bug fix, and enhancement update	2013-03-21 22:12:14 UTC

Description Ofer Blaut 2012-12-27 12:39:13 UTC

Created attachment 669579
Nova log per Instance

Description of problem:

I have deployed instances that are running on an all-in-one host with Quantum.

While all instances were running, I issued a reboot to the host.

After the host powered up, some instances started running but were powered off shortly afterwards.

Error reported in the nova logs: "shutdown by itself. Calling the stop API"

Attached are debug logs for one instance.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Run a few instances on an all-in-one node
2. Reboot the host
3. Check the status of the instances

Actual results:

Some instances are powered off.

Expected results:

Either all instances are off or all are up.

Additional info:
There is a relevant bug in Ubuntu Launchpad.

Comment 2 Nikola Dipanov 2013-01-04 16:12:46 UTC

There seems to be a patch proposed for the upstream bug: https://review.openstack.org/#/c/17449/

The patch itself is very unobtrusive and can be easily backported.

After that, we should get the QA team to revisit.

Comment 3 Brent Eagles 2013-02-08 18:58:22 UTC

I believe I misinterpreted the nature of this bug when I brought in the upstream patch. I think this requires more investigation and probably a better defined picture of what is actually going on and what the expectations should be... at least with respect to this bug. 

Ignoring the possibility that OpenStack may have resume_guests_state_on_host_boot set to true, there is, as I understand it, a de facto expectation that states displayed when rebooting the host are "unreliable" until the scheduled _sync_power_states has a chance to run. If that is accurate, I wonder if it would be better if nova had a general "we don't know yet, we just started" state for server state until it gets a chance to poll or is otherwise updated with something definitive. I guess the problem then would be how to distinguish between services simply being restarted (e.g. service restart openstack-nova-compute) and the host being rebooted. Another way to go would be to "age" the information (i.e. display a last updated time), allowing the viewer to interpret the info as they see fit.
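
To make the two ideas above concrete, here is a minimal sketch of a reported state that starts out as "unknown" and carries a last-updated time. It is only an illustration: the ReportedState class, the UNKNOWN constant and the describe() method are hypothetical names and are not part of nova.

```python
import time
from dataclasses import dataclass
from typing import Optional

UNKNOWN = "UNKNOWN"  # hypothetical placeholder state, not a nova constant


@dataclass
class ReportedState:
    """Power state as shown to a viewer, plus when it was last confirmed."""
    power_state: str = UNKNOWN
    updated_at: Optional[float] = None  # None until the first sync after startup

    def update(self, power_state, now=None):
        # Record a definitive state and the time at which it was observed.
        self.power_state = power_state
        self.updated_at = time.time() if now is None else now

    def describe(self, now=None):
        # Before the first sync, say so instead of showing stale data.
        if self.updated_at is None:
            return "UNKNOWN (not yet synced since startup)"
        now = time.time() if now is None else now
        age = int(now - self.updated_at)
        return f"{self.power_state} (last updated {age}s ago)"
```

With something like this, a viewer sees either an explicit "not yet synced" marker or an age next to the state, and can decide how much to trust it. It does not by itself solve the question of telling a service restart apart from a host reboot.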

Comment 5 Brent Eagles 2013-02-12 17:43:51 UTC

I've moved status back to ON_DEV because we are looking at this issue more closely. Please stay tuned!

Comment 6 Brent Eagles 2013-02-15 16:21:04 UTC

If this is indeed simply a status update issue, status update timing comes up in a few different issues, e.g.:

https://bugzilla.redhat.com/show_bug.cgi?id=886337
https://bugzilla.redhat.com/show_bug.cgi?id=857256

and

https://bugs.launchpad.net/nova/+bug/928910

(This last one is important as it informs what to look out for!)

It seems reasonable to start updating information as soon as possible when a compute service starts. If the host is already up but the compute service is shut down for a period of time, an initial scan would address the issue of server changes while compute "was away". However, care must be taken to interpret "unknown" states properly when compute is starting on a system reboot, as it is possible that virtualization drivers are in the process of coming up as well. Fortunately there are a few patches for this already "in the queue".

https://review.openstack.org/#/c/21516/
https://code.engineering.redhat.com/gerrit/#/c/3056/

Currently compute actually *does* fire up the periodic updates early in the startup process. It has a configurable property "periodic_fuzzy_delay" (default 60s) that causes a random number between 0 and periodic_fuzzy_delay to be selected as an initial delay before running the scheduled processes. However, the sync_power_states helper is set to run only every 10 ticks (schedule periods), and that appears to function as a countdown. Basically, even though there is an "initial delay" of anywhere from 0 seconds to a minute, sync_power_states will miss out on the first runs. So there are a few different approaches to getting sync_power_states further up the queue.

1.) Alter Manager.periodic_tasks to "count up" instead of "count down". I would not favor this UNLESS the power state sync up thing was the only periodic task in all of OpenStack with a positive tick count. It is possible that some tasks rely on the initial grace period that the "count down" affords.

2.) Extend how periodic tasks are declared, either changing the count strategy (meh) or setting a "run asap" flag that is reset as soon as it is run once. Not awful and could be easily added/adopted without breaking anything else. e.g.

@manager.periodic_task(ticks_between_runs=10, immediate_on_start=True)

3.) Make a forced direct call to the power state update in the initialization code. This approach concerns me as it involves introducing potentially blocking calls in a startup sequence where they did not exist before.
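
For discussion, option 2 can be sketched with a toy manager. This is a simplified model under stated assumptions, not nova's implementation: the decorator, the Manager class and the _other_task method are hypothetical stand-ins, and the immediate_on_start keyword is simply the name proposed above.

```python
def periodic_task(ticks_between_runs=0, immediate_on_start=False):
    """Hypothetical decorator that tags a method with its schedule."""
    def decorator(func):
        func.ticks_between_runs = ticks_between_runs
        func.immediate_on_start = immediate_on_start
        return func
    return decorator


class Manager:
    """Toy manager that runs tagged methods on a tick-based countdown."""

    def __init__(self):
        self.log = []
        self._tasks = []
        self._ticks_to_skip = {}
        for name in dir(type(self)):
            func = getattr(type(self), name)
            if hasattr(func, "ticks_between_runs"):
                self._tasks.append((name, func))
                # "Count down": a task normally starts with a full counter and
                # is skipped for its first ticks_between_runs ticks. A task
                # flagged immediate_on_start starts with an empty counter, so
                # it runs on the very first tick and is then rescheduled.
                start = 0 if func.immediate_on_start else func.ticks_between_runs
                self._ticks_to_skip[name] = start

    def periodic_tasks(self):
        """One tick: run every task whose counter has reached zero."""
        for name, func in self._tasks:
            if self._ticks_to_skip[name] > 0:
                self._ticks_to_skip[name] -= 1
                continue
            self._ticks_to_skip[name] = func.ticks_between_runs
            func(self)

    @periodic_task(ticks_between_runs=10, immediate_on_start=True)
    def _sync_power_states(self):
        self.log.append("sync_power_states")

    @periodic_task(ticks_between_runs=10)
    def _other_task(self):
        self.log.append("other_task")
```

In this model the flagged task runs on the first tick and then every 11th tick after that, while the unflagged task keeps its initial grace period and first runs on tick 11. Tasks that rely on that grace period are unaffected, which is the main attraction of option 2 over option 1.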

Comment 10 Ofer Blaut 2013-03-10 08:00:39 UTC

Now a few VMs move to ERROR state after they were active and I rebooted the hosts.

This happened on setups with only 2 VMs (Quantum) and 20 VMs (Nova network).

openstack-nova-common-2012.2.3-4.el6ost.noarch
openstack-nova-network-2012.2.3-4.el6ost.noarch
openstack-nova-compute-2012.2.3-4.el6ost.noarch



[root@puma04 ~(keystone_admin)]$ grep b300c251-7d4a-465a-a209-77f87bee746d /var/log/nova/*
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.api.openstack.wsgi [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] GET http://10.35.160.17:8774/v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.api.openstack.wsgi [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] http://10.35.160.17:8774/v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d returned with HTTP 200
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.osapi_compute.wsgi.server [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] 10.35.160.17 - - [10/Mar/2013 09:52:56] "GET /v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d HTTP/1.1" 200 1633 0.085694
/var/log/nova/compute.log:2013-03-10 09:46:50 INFO nova.compute.manager [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Rebooting instance after nova-compute restart.
/var/log/nova/compute.log:2013-03-10 09:46:51 INFO nova.virt.libvirt.firewall [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Called setup_basic_filtering in nwfilter
/var/log/nova/compute.log:2013-03-10 09:46:51 INFO nova.virt.libvirt.firewall [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Ensuring static filters
/var/log/nova/compute.log:2013-03-10 09:46:56 WARNING nova.compute.manager [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Failed to resume instance

Comment 11 Brent Eagles 2013-03-11 15:28:34 UTC

Was this evaluation based on immediate examination through "nova list" on restart (or similar), or was there a short period of time before you ran it? There is anywhere from 0 to 60 seconds after the compute service starts up before it runs the periodic tasks, so it could be a minute or two before compute syncs up with whatever it can.
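
As a rough aid for judging how long to wait before trusting "nova list", the timing can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch: the function name is hypothetical, and the 60-second tick interval is an assumption rather than a value taken from this report.

```python
def first_sync_window(periodic_fuzzy_delay=60, periodic_interval=60,
                      initial_ticks_to_skip=0):
    """Return (earliest, latest) seconds after startup for the first sync.

    Assumes the first tick fires after a random delay in
    [0, periodic_fuzzy_delay] and later ticks follow every
    periodic_interval seconds (assumed value, not confirmed here).
    """
    skipped = initial_ticks_to_skip * periodic_interval
    return skipped, periodic_fuzzy_delay + skipped


# Sync runs on the first tick: somewhere in the first minute.
print(first_sync_window())                          # (0, 60)
# Sync waits out a 10-tick countdown first: roughly ten minutes later.
print(first_sync_window(initial_ticks_to_skip=10))  # (600, 660)
```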

Comment 12 Brent Eagles 2013-03-11 17:37:52 UTC

My apologies! I did not look closely enough at the provided log file data. The log file indicates that an instance failed to resume. This appears to be a different issue (see original description and following comments). Please file a new bug report with the information you've provided above. Also there may be other log information related to this failure in some of the other files. If so, please include them as well.

Comment 13 Brent Eagles 2013-03-11 18:37:04 UTC

Upon inspection of the other bugs, this comment 10 appears to be the same as the bug reported in https://bugzilla.redhat.com/show_bug.cgi?id=912284. I'm reverting this bug to ON_QA/VERIFIED.

Comment 14 Ofer Blaut 2013-03-12 07:50:11 UTC

Bug moved to verified 

Thanks

Comment 16 errata-xmlrpc 2013-03-21 18:14:28 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0657.html
