Bug 1563714 - Cannot start instance after reboot of OC nodes
Summary: Cannot start instance after reboot of OC nodes
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
high
high
Target Milestone: async
: 13.0 (Queens)
Assignee: Lee Yarwood
QA Contact: nova-maint
URL:
Whiteboard:
Depends On: 1571888
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-04 14:32 UTC by Artem Hrechanychenko
Modified: 2020-01-08 17:59 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1571888 (view as bug list)
Environment:
Last Closed: 2018-05-08 10:21:20 UTC
Target Upstream Version:


Attachments (Terms of Use)
nova compute logs (2.04 MB, text/plain)
2018-04-25 13:39 UTC, Artem Hrechanychenko
no flags Details

Description Artem Hrechanychenko 2018-04-04 14:32:36 UTC
Description of problem:

https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DF%20Current%20release/job/DFG-df-13-deployment-7.5-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv4-vxlan-RHELOSP-31811/10/console

Instance which was create after deploy of OC sometimes cannot be started after reboot of OC nodes

13:55:02 + nova start after_deploy
13:55:02 Cannot 'start' instance 6083abbb-dfa1-4cba-9d77-309747559122 while it is in vm_state active (HTTP 409) (Request-ID: req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a)

(overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
Request to start server after_deploy has been accepted.
(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+
| ID                                   | Name         | Status  | Task State  | Power State | Networks                              |
+--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+
| 6083abbb-dfa1-4cba-9d77-309747559122 | after_deploy | SHUTOFF | powering-on | Shutdown    | tenantvxlan=192.168.32.11, 10.0.0.176 |
| 283630da-03d1-4eca-af76-3e6261940991 | after_reboot | ACTIVE  | -           | Running     | tenantvxlan=192.168.32.6, 10.0.0.179  |
+--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+
(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+---------------------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks                              |
+--------------------------------------+--------------+--------+------------+-------------+---------------------------------------+
| 6083abbb-dfa1-4cba-9d77-309747559122 | after_deploy | ACTIVE | -          | Running     | tenantvxlan=192.168.32.11, 10.0.0.176 |
| 283630da-03d1-4eca-af76-3e6261940991 | after_reboot | ACTIVE | -          | Running     | tenantvxlan=192.168.32.6, 10.0.0.179  |
+--------------------------------------+--------------+--------+------------+-------------+---------------------------------------+
(overcloud) [stack@undercloud-0 ~]$

Version-Release number of selected component (if applicable):
openstack-nova-compute:2018-03-29.1 

How reproducible:
always

Steps to Reproduce:
1.Deploy OC
2.Launch instance
3. Reboot oc
4. Start instance was created on step2

Actual results:
13:55:02 Cannot 'start' instance 6083abbb-dfa1-4cba-9d77-309747559122 while it is in vm_state active (HTTP 409) (Request-ID: req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a)

Expected results:
Request to start server after_deploy has been accepted.

Additional info:

Comment 2 Lee Yarwood 2018-04-06 13:57:40 UTC
tl;dr - This doesn't look like a bug, our tests should be waiting for the instances to be marked as SHUTDOWN by n-cpu after a host reboot before attempting to restart them.

(In reply to Artem Hrechanychenko from comment #0)
> Description of problem:
> 
> https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/
> DF%20Current%20release/job/DFG-df-13-deployment-7.5-virthost-
> 1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv4-vxlan-RHELOSP-31811/10/
> console
> 
> Instance which was create after deploy of OC sometimes cannot be started
> after reboot of OC nodes
> 
> 13:55:02 + nova start after_deploy
> 13:55:02 Cannot 'start' instance 6083abbb-dfa1-4cba-9d77-309747559122 while
> it is in vm_state active (HTTP 409) (Request-ID:
> req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a)
> 
> (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
> Request to start server after_deploy has been accepted.
> (overcloud) [stack@undercloud-0 ~]$ nova list
> +--------------------------------------+--------------+---------+------------
> -+-------------+---------------------------------------+
> | ID                                   | Name         | Status  | Task State
> | Power State | Networks                              |
> +--------------------------------------+--------------+---------+------------
> -+-------------+---------------------------------------+
> | 6083abbb-dfa1-4cba-9d77-309747559122 | after_deploy | SHUTOFF |
> powering-on | Shutdown    | tenantvxlan=192.168.32.11, 10.0.0.176 |
> | 283630da-03d1-4eca-af76-3e6261940991 | after_reboot | ACTIVE  | -         
> | Running     | tenantvxlan=192.168.32.6, 10.0.0.179  |
> +--------------------------------------+--------------+---------+------------
> -+-------------+---------------------------------------+
> (overcloud) [stack@undercloud-0 ~]$ nova list
> +--------------------------------------+--------------+--------+------------
> +-------------+---------------------------------------+
> | ID                                   | Name         | Status | Task State
> | Power State | Networks                              |
> +--------------------------------------+--------------+--------+------------
> +-------------+---------------------------------------+
> | 6083abbb-dfa1-4cba-9d77-309747559122 | after_deploy | ACTIVE | -         
> | Running     | tenantvxlan=192.168.32.11, 10.0.0.176 |
> | 283630da-03d1-4eca-af76-3e6261940991 | after_reboot | ACTIVE | -         
> | Running     | tenantvxlan=192.168.32.6, 10.0.0.179  |
> +--------------------------------------+--------------+--------+------------
> +-------------+---------------------------------------+
> (overcloud) [stack@undercloud-0 ~]$

I assume this is from a follow up attempt to start the instance that succeeds? This isn't captured in the logs btw. 

> Version-Release number of selected component (if applicable):
> openstack-nova-compute:2018-03-29.1 

FWIW this is openstack-nova-17.0.2-0.20180323024604.0390d5f.el7ost as listed in the logs below.

AFAICT this is a simple race between n-cpu running _sync_power_states after the host reboot and n-api requesting the to start the instance.

We can see the n-api request in the logs @ 13:54:45.537 :

4707 2018-04-04 13:54:45.537 15 DEBUG nova.api.openstack.wsgi [req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a 6f2495f8d5654d54885c9ebd1c7ecf41 62fc93a096ff487997b723a4069a78c3 - default default] Action: 'action', calling method: <bound method ServersController._start_server of <nova.api.openstack.compute.servers.ServersController object at 0x7fd82a464310>>, body: {"os-start": null} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:604
[..]
4710 2018-04-04 13:54:45.591 15 INFO nova.api.openstack.wsgi [req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a 6f2495f8d5654d54885c9ebd1c7ecf41 62fc93a096ff487997b723a4069a78c3 - default default] HTTP exception thrown: Cannot 'start' instance 6083abbb-dfa1-4cba-9d77-309747559122 while it is in vm_state active

At this time n-cpu was still updating the power state of the instance, not finishing until 13:54:45.928 :

2880 2018-04-04 13:54:21.558 1 INFO nova.service [req-3079f43e-796e-4742-ac0d-7d1a853f1759 - - - - -] Starting compute node (version 17.0.2-0.20180323024604.0390d5f.el7ost)
[..]
3094 2018-04-04 13:54:27.882 1 DEBUG nova.compute.manager [req-3079f43e-796e-4742-ac0d-7d1a853f1759 - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1181
3095 2018-04-04 13:54:27.892 1 DEBUG nova.compute.manager [req-3079f43e-796e-4742-ac0d-7d1a853f1759 - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Current state is 4, state in DB is 1. _init_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:997
[..]
3115 2018-04-04 13:54:45.265 1 DEBUG oslo_service.periodic_task [req-3079f43e-796e-4742-ac0d-7d1a853f1759 - - - - -] Running periodic task ComputeManager._sync_power_states run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215
[..]
3122 2018-04-04 13:54:45.728 1 DEBUG oslo_concurrency.lockutils [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] Lock "6083abbb-dfa1-4cba-9d77-309747559122" acquired by "nova.compute.manager.do_stop_instance" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
[..]
3128 2018-04-04 13:54:45.831 1 INFO nova.virt.libvirt.driver [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Instance already shutdown.
3129 2018-04-04 13:54:45.836 1 INFO nova.virt.libvirt.driver [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Instance destroyed successfully.
3130 2018-04-04 13:54:45.836 1 DEBUG nova.compute.manager [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1181
3131 2018-04-04 13:54:45.928 1 DEBUG oslo_concurrency.lockutils [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] Lock "6083abbb-dfa1-4cba-9d77-309747559122" released by "nova.compute.manager.do_stop_instance" :: held 0.200s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285

Artem, I'm going to close this out as NOTABUG but please feel free to comment/reopen if you feel I've missed something here.

Comment 3 Artem Hrechanychenko 2018-04-25 13:39:19 UTC
Created attachment 1426673 [details]
nova compute logs

Comment 4 Artem Hrechanychenko 2018-04-25 13:40:45 UTC
    (overcloud) [stack@undercloud-0 ~]$ nova list
    +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+
    | ID                                   | Name         | Status  | Task State  | Power State | Networks                              |
    +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+
    | 92c27232-af71-44c3-82ef-e920b3d298f5 | after_deploy | SHUTOFF | powering-on | Shutdown    | tenantvxlan=192.168.32.12, 10.0.0.183 |
    +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+
    (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
    Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-468d1b79-e468-4ef1-8672-32fe1b9a898a)
    ERROR (CommandError): Unable to start the specified server(s).
    (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
    Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-551e80b8-a1bf-4eba-b4df-a02984c36003)
    ERROR (CommandError): Unable to start the specified server(s).
    (overcloud) [stack@undercloud-0 ~]$ date
    Wed Apr 25 09:21:46 EDT 2018
    (overcloud) [stack@undercloud-0 ~]$ date
    Wed Apr 25 09:21:49 EDT 2018
    (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
    Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-3c643204-8372-4c0f-b8c1-7ed4e7f00259)
    ERROR (CommandError): Unable to start the specified server(s).
    (overcloud) [stack@undercloud-0 ~]$ date
    Wed Apr 25 09:21:59 EDT 2018
    (overcloud) [stack@undercloud-0 ~]$ date
    Wed Apr 25 09:24:57 EDT 2018
    (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
    Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-f2c4d9b9-522c-4828-ace5-17208b22d2b4)
    ERROR (CommandError): Unable to start the specified server(s).
    (overcloud) [stack@undercloud-0 ~]$

(overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-f2c4d9b9-522c-4828-ace5-17208b22d2b4)
ERROR (CommandError): Unable to start the specified server(s).
(overcloud) [stack@undercloud-0 ~]$ date
Wed Apr 25 09:25:29 EDT 2018
(overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-5ba84e9a-a432-4c4e-8374-b1ecd453cc2c)
ERROR (CommandError): Unable to start the specified server(s).
(overcloud) [stack@undercloud-0 ~]$ nova start after_deploy
Request to start server after_deploy has been accepted.
(overcloud) [stack@undercloud-0 ~]$ date
Wed Apr 25 09:28:44 EDT 2018


so timeout after reboot were 10minutes, so regarding to lyarwood I re-opened issue

Comment 5 Lee Yarwood 2018-04-25 14:11:00 UTC
We also need the nova-api.log from a controller listing some of these example requests to boot the instance.

I'm still pretty sure this is just due to us not waiting until the instance is marked as Status == SHUTOFF, Task State == None and Power State == shutdown before attempting to start the instance.

Comment 6 Lee Yarwood 2018-04-25 14:59:20 UTC
The additional issue being seen in this environment is the following upstream bug currently being worked on in master:

Cannot hard reboot a libvirt instance in error state (mdev query fails)
https://bugs.launchpad.net/nova/+bug/1764460

I'll create a clone for this now so we can track this downstream.

Comment 8 Omri Hochman 2018-04-25 16:38:15 UTC
Adding blocker flag, as this BZ is blocking one of the basic 
DF:DFG Regression scenarios.

Deploy OSP13 --> Reboot all Overcloud nodes -->  wait 5 minutes --> Attempt to launch instance on compute nodes. 

Result: Launch Instance Fails.

Comment 12 Artem Hrechanychenko 2018-05-08 10:21:20 UTC
Wasn't reproduce on two previous runs
MOVE to Closed


Note You need to log in before you can comment on or make changes to this bug.