Red Hat Bugzilla – Bug 890512
Some Instances are shutoff after host reboot
Last modified: 2016-04-26 14:15:18 EDT
Created attachment 669579 [details]
Nova log per Instance
Description of problem:
I have deployed Instances which are running on All in one host with quantum.
While all instances are running i have isssued reboot to the host.
After host power up some instance starting to run but shortly powered off .
Error reported in nova logs "shutdown by itself. Calling the stop API "
Attached debug logs per one instance
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. run few instances on All in One node
2. reboot the HOST
3. check status of the instances
some instances are powered off
either all are off or all are up
There is relevant bug in ubunto lunchpad
There seems to be a patch proposed to the upstream bug https://review.openstack.org/#/c/17449/.
The patch itself is very unobtrusive and can be easily backported.
After that - we should get QA teem to revisit.
I believe I misinterpreted the nature of this bug when I brought in the upstream patch. I think this requires more investigation and probably a better defined picture of what is actually going on and what the expectations should be... at least with respect to this bug.
Ignoring the possibility that Openstack may have resume_guests_state_on_host_boot set to true, there is, as I understand it a de-facto expectation that states displayed when rebooting the host are "unreliable" until the scheduled _sync_power_states has a chance to run. If that is accurate, I wonder if it would be better if nova would have a general "we don't know yet, we just started" state for server state until it gets a chance to poll or is otherwise updated with something definitive. I guess the problem then would be how to distinguish between services simply being restarted (e.g. service restart openstack-nova-compute) or the host being reboot. Another way to go would be to "age" the information (i.e display a last updated time), allowing the viewer to interpret the info as they see fit.
I've moved status back to ON_DEV because we are looking at this issue more closely. Please stay tuned!
If this is indeed simply a status update issue, status update timing comes up in a few different issues, eg:
(This last one is important as it informs what to look out for!)
It seems reasonable to start updating information as soon as possible when a compute service starts. If the host is already up but the compute service is shutdown for a period of time, an initial scan would address the issue of server changes while compute "was away". However, care must be taken to interpret "unknown" states properly when compute is starting on a system reboot as it is possible that virtualization drivers are in the process of coming up as well. Fortunately there are a few patches for this already "in the queue".
Currently compute actually *does* fire up the periodic updates early in the startup process. It has a configurable property "periodic_fuzzy_delay" (default 60s) that causes random number to be selected between 0 and periodic_fuzzy_delay to use as an initial delay before running the scheduled processes. However, the sync_power_states helper is set to run only every 10 ticks (schedule periods) and that appears to function as a countdown. Basically, even though there is an "initial delay" of anywhere from 0 seconds to a minute, the sync_power_states will miss out. So there are few different approaches to getting sync_power_states further up the queue.
1.) Alter Manager.periodic_tasks to "count up" instead of "count down". I would not favor this UNLESS the power state sync up thing was the only periodic task in all of openstack with a positive tick count. It is possible that some tasks rely on the initial grace period that the "count down" affords.
2.) Extend how periodic tasks are declared, either changing the count strategy (meh) or setting a "run asap" flag that is reset as soon as it is run once. Not awful and could be easily added/adopted without breaking anything else. e.g.
3.) Do a forced direct calls on the power state update in the initialization code. This approach concerns me as it involves introducing potentially blocking calls in a startup sequence where it did not exist before.
No few VMs moves to ERROR state after they were active and i have rebooted the HOSTS
heppned on setups with only 2 VMs ( quantum ) and 20 VMs ( Nova network )
[root@puma04 ~(keystone_admin)]$ grep b300c251-7d4a-465a-a209-77f87bee746d /var/log/nova/*
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.api.openstack.wsgi [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] GET http://10.35.160.17:8774/v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.api.openstack.wsgi [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] http://10.35.160.17:8774/v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d returned with HTTP 200
/var/log/nova/api.log:2013-03-10 09:52:56 INFO nova.osapi_compute.wsgi.server [req-7448cc62-30a7-4d4a-af41-058c81731a8a 1f0db08c839547339a4ede1d1fb99066 49202b0e09a4409c97475392341b57db] 10.35.160.17 - - [10/Mar/2013 09:52:56] "GET /v2/49202b0e09a4409c97475392341b57db/servers/b300c251-7d4a-465a-a209-77f87bee746d HTTP/1.1" 200 1633 0.085694
/var/log/nova/compute.log:2013-03-10 09:46:50 INFO nova.compute.manager [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Rebooting instance after nova-compute restart.
/var/log/nova/compute.log:2013-03-10 09:46:51 INFO nova.virt.libvirt.firewall [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Called setup_basic_filtering in nwfilter
/var/log/nova/compute.log:2013-03-10 09:46:51 INFO nova.virt.libvirt.firewall [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Ensuring static filters
/var/log/nova/compute.log:2013-03-10 09:46:56 WARNING nova.compute.manager [req-11834b94-cb42-4945-a463-4daba4d96a92 None None] [instance: b300c251-7d4a-465a-a209-77f87bee746d] Failed to resume instance
Was this evaluation based on immediate examination through "nova list" on restart (or similar) or was there a short period of time before you ran it? There is anywhere from 0 to 60 seconds when the compute service starts up before it runs the periodic tasks so it could be a minute or two before compute syncs up with whatever it can.
My apologies! I did not look closely enough at the provided log file data. The log file indicates that an instance failed to resume. This appears to be a different issue (see original description and following comments). Please file a new bug report with the information you've provided above. Also there may be other log information related to this failure in some of the other files. If so, please include them as well.
Upon inspection of the other bugs, this comment 10 appears to be the same as the bug reported in https://bugzilla.redhat.com/show_bug.cgi?id=912284. I'm reverting this bug to ON_QA/VERIFIED.
Bug moved to verified
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.