Bug 890512 - Some Instances are shutoff after host reboot
Some Instances are shutoff after host reboot
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova (Show other bugs)
2.0 (Folsom)
Unspecified Unspecified
high Severity high
: snapshot4
: 2.1
Assigned To: Brent Eagles
Omri Hochman
: Triaged
Depends On:
Blocks: 920704
  Show dependency treegraph
 
Reported: 2012-12-27 07:39 EST by Ofer Blaut
Modified: 2016-04-26 14:15 EDT (History)
4 users (show)

See Also:
Fixed In Version: openstack-nova-2012.2.3-2.el6ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 913613 (view as bug list)
Environment:
Last Closed: 2013-03-21 14:14:28 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
Nova log per Instance (7.84 KB, application/octet-stream)
2012-12-27 07:39 EST, Ofer Blaut
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1085771 None None None Never
OpenStack gerrit 22232 None None None Never

  None (edit)
Description Ofer Blaut 2012-12-27 07:39:13 EST
Created attachment 669579 [details]
Nova log per Instance

Description of problem:

I have deployed Instances which are running on All in one host with quantum.

While all instances are running i have isssued reboot to the host.

After host power up some instance starting to run but shortly powered off .

Error reported in nova logs "shutdown by itself. Calling the stop API "

Attached debug logs per one instance 

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. run few instances on All in One node 
2. reboot the HOST
3. check status of the instances  
  
Actual results:

some instances are powered off

Expected results:

either all are off or all are up 

Additional info:
There is relevant bug in ubunto lunchpad
Comment 2 Nikola Dipanov 2013-01-04 11:12:46 EST
There seems to be a patch proposed to the upstream bug https://review.openstack.org/#/c/17449/. 

The patch itself is very unobtrusive and can be easily backported.

After that - we should get QA teem to revisit.
Comment 3 Brent Eagles 2013-02-08 13:58:22 EST
I believe I misinterpreted the nature of this bug when I brought in the upstream patch. I think this requires more investigation and probably a better defined picture of what is actually going on and what the expectations should be... at least with respect to this bug. 

Ignoring the possibility that Openstack may have resume_guests_state_on_host_boot set to true, there is, as I understand it a de-facto expectation that states displayed when rebooting the host are "unreliable" until the scheduled _sync_power_states has a chance to run. If that is accurate, I wonder if it would be better if nova would have a general "we don't know yet, we just started" state for server state until it gets a chance to poll or is otherwise updated with something definitive. I guess the problem then would be how to distinguish between services simply being restarted (e.g. service restart openstack-nova-compute) or the host being reboot. Another way to go would be to "age" the information (i.e display a last updated time), allowing the viewer to interpret the info as they see fit.
Comment 5 Brent Eagles 2013-02-12 12:43:51 EST
I've moved status back to ON_DEV because we are looking at this issue more closely. Please stay tuned!
Comment 6 Brent Eagles 2013-02-15 11:21:04 EST
If this is indeed simply a status update issue, status update timing comes up in a few different issues, eg:

https://bugzilla.redhat.com/show_bug.cgi?id=886337
https://bugzilla.redhat.com/show_bug.cgi?id=857256

and

https://bugs.launchpad.net/nova/+bug/928910

(This last one is important as it informs what to look out for!)

It seems reasonable to start updating information as soon as possible when a compute service starts. If the host is already up but the compute service is shutdown for a period of time, an initial scan would address the issue of server changes while compute "was away". However, care must be taken to interpret "unknown" states properly when compute is starting on a system reboot as it is possible that virtualization drivers are in the process of coming up as well. Fortunately there are a few patches for this already "in the queue".

https://review.openstack.org/#/c/21516/
https://code.engineering.redhat.com/gerrit/#/c/3056/

Currently compute actually *does* fire up the periodic updates early in the startup process. It has a configurable property "periodic_fuzzy_delay" (default 60s) that causes random number to be selected between 0 and periodic_fuzzy_delay to use as an initial delay before running the scheduled processes. However, the sync_power_states helper is set to run only every 10 ticks (schedule periods) and that appears to function as a countdown. Basically, even though there is an "initial delay" of anywhere from 0 seconds to a minute, the sync_power_states will miss out. So there are few different approaches to getting sync_power_states further up the queue.

1.) Alter Manager.periodic_tasks to "count up" instead of "count down". I would not favor this UNLESS the power state sync up thing was the only periodic task in all of openstack with a positive tick count. It is possible that some tasks rely on the initial grace period that the "count down" affords.

2.) Extend how periodic tasks are declared, either changing the count strategy (meh) or setting a "run asap" flag that is reset as soon as it is run once. Not awful and could be easily added/adopted without breaking anything else. e.g.

@manager.periodic_task(ticks_between_runs=10, immediate_on_start=True)

3.) Do a forced direct calls on the power state update in the initialization code. This approach concerns me as it involves introducing potentially blocking calls in a startup sequence where it did not exist before.
Comment 10 Ofer Blaut 2013-03-10 04:00:39 EDT
No few VMs moves to ERROR state  after they were active and i have rebooted the HOSTS

heppned on setups with only 2 VMs ( quantum ) and 20 VMs ( Nova network )

openstack-nova-common-2012.2.3-4.el6ost.noarch
openstack-nova-network-2012.2.3-4.el6ost.noarch
openstack-nova-compute-2012.2.3-4.el6ost.noarch



[root@puma04 ~(keystone_admin)]$ grep b300c251-7d4a-465a-a209-77f87bee746d /var/log/nova/*
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.api.openstack.wsgi [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] GET http://10.35.160.17:8774/v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.api.openstack.wsgi [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] http://10.35.160.17:8774/v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d returned with HTTP 200
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.osapi_compute.wsgi.server [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] 10.35.160.17 - - [10/Mar/2013 09:52:56] "GET /v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d HTTP/1.1" 200 1633 0.085694
/var/log/nova/compute.log:2013-03-10 09:46:50 INFO nova.compute.manager [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Rebooting instance after nova-compute restart.
/var/log/nova/compute.log:2013-03-10 09:46:51 INFO nova.virt.libvirt.firewall [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Called setup_basic_filtering in nwfilter
/var/log/nova/compute.log:2013-03-10 09:46:51 INFO nova.virt.libvirt.firewall [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Ensuring static filters
/var/log/nova/compute.log:2013-03-10 09:46:56 WARNING nova.compute.manager [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Failed to resume instance
Comment 11 Brent Eagles 2013-03-11 11:28:34 EDT
Was this evaluation based on immediate examination through "nova list" on restart (or similar) or was there a short period of time before you ran it? There  is anywhere from 0 to 60 seconds when the compute service starts up before it runs the periodic tasks so it could be a minute or two before compute syncs up with whatever it can.
Comment 12 Brent Eagles 2013-03-11 13:37:52 EDT
My apologies! I did not look closely enough at the provided log file data. The log file indicates that an instance failed to resume. This appears to be a different issue (see original description and following comments). Please file a new bug report with the information you've provided above. Also there may be other log information related to this failure in some of the other files. If so, please include them as well.
Comment 13 Brent Eagles 2013-03-11 14:37:04 EDT
Upon inspection of the other bugs, this comment 10 appears to be the same as the bug reported in https://bugzilla.redhat.com/show_bug.cgi?id=912284. I'm reverting this bug to ON_QA/VERIFIED.
Comment 14 Ofer Blaut 2013-03-12 03:50:11 EDT
Bug moved to verified 

Thanks
Comment 16 errata-xmlrpc 2013-03-21 14:14:28 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0657.html

Note You need to log in before you can comment on or make changes to this bug.