Description of problem: https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DF%20Current%20release/job/DFG-df-13-deployment-7.5-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv4-vxlan-RHELOSP-31811/10/console Instance which was create after deploy of OC sometimes cannot be started after reboot of OC nodes 13:55:02 + nova start after_deploy 13:55:02 Cannot 'start' instance 6083abbb-dfa1-4cba-9d77-309747559122 while it is in vm_state active (HTTP 409) (Request-ID: req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a) (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy Request to start server after_deploy has been accepted. (overcloud) [stack@undercloud-0 ~]$ nova list +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+ | ID | Name | Status | Task State | Power State | Networks | +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+ | 6083abbb-dfa1-4cba-9d77-309747559122 | after_deploy | SHUTOFF | powering-on | Shutdown | tenantvxlan=192.168.32.11, 10.0.0.176 | | 283630da-03d1-4eca-af76-3e6261940991 | after_reboot | ACTIVE | - | Running | tenantvxlan=192.168.32.6, 10.0.0.179 | +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+ (overcloud) [stack@undercloud-0 ~]$ nova list +--------------------------------------+--------------+--------+------------+-------------+---------------------------------------+ | ID | Name | Status | Task State | Power State | Networks | +--------------------------------------+--------------+--------+------------+-------------+---------------------------------------+ | 6083abbb-dfa1-4cba-9d77-309747559122 | after_deploy | ACTIVE | - | Running | tenantvxlan=192.168.32.11, 10.0.0.176 | | 283630da-03d1-4eca-af76-3e6261940991 | after_reboot | ACTIVE | - | Running | tenantvxlan=192.168.32.6, 10.0.0.179 | +--------------------------------------+--------------+--------+------------+-------------+---------------------------------------+ (overcloud) [stack@undercloud-0 ~]$ Version-Release number of selected component (if applicable): openstack-nova-compute:2018-03-29.1 How reproducible: always Steps to Reproduce: 1.Deploy OC 2.Launch instance 3. Reboot oc 4. Start instance was created on step2 Actual results: 13:55:02 Cannot 'start' instance 6083abbb-dfa1-4cba-9d77-309747559122 while it is in vm_state active (HTTP 409) (Request-ID: req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a) Expected results: Request to start server after_deploy has been accepted. Additional info:
tl;dr - This doesn't look like a bug, our tests should be waiting for the instances to be marked as SHUTDOWN by n-cpu after a host reboot before attempting to restart them. (In reply to Artem Hrechanychenko from comment #0) > Description of problem: > > https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/ > DF%20Current%20release/job/DFG-df-13-deployment-7.5-virthost- > 1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv4-vxlan-RHELOSP-31811/10/ > console > > Instance which was create after deploy of OC sometimes cannot be started > after reboot of OC nodes > > 13:55:02 + nova start after_deploy > 13:55:02 Cannot 'start' instance 6083abbb-dfa1-4cba-9d77-309747559122 while > it is in vm_state active (HTTP 409) (Request-ID: > req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a) > > (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy > Request to start server after_deploy has been accepted. > (overcloud) [stack@undercloud-0 ~]$ nova list > +--------------------------------------+--------------+---------+------------ > -+-------------+---------------------------------------+ > | ID | Name | Status | Task State > | Power State | Networks | > +--------------------------------------+--------------+---------+------------ > -+-------------+---------------------------------------+ > | 6083abbb-dfa1-4cba-9d77-309747559122 | after_deploy | SHUTOFF | > powering-on | Shutdown | tenantvxlan=192.168.32.11, 10.0.0.176 | > | 283630da-03d1-4eca-af76-3e6261940991 | after_reboot | ACTIVE | - > | Running | tenantvxlan=192.168.32.6, 10.0.0.179 | > +--------------------------------------+--------------+---------+------------ > -+-------------+---------------------------------------+ > (overcloud) [stack@undercloud-0 ~]$ nova list > +--------------------------------------+--------------+--------+------------ > +-------------+---------------------------------------+ > | ID | Name | Status | Task State > | Power State | Networks | > +--------------------------------------+--------------+--------+------------ > +-------------+---------------------------------------+ > | 6083abbb-dfa1-4cba-9d77-309747559122 | after_deploy | ACTIVE | - > | Running | tenantvxlan=192.168.32.11, 10.0.0.176 | > | 283630da-03d1-4eca-af76-3e6261940991 | after_reboot | ACTIVE | - > | Running | tenantvxlan=192.168.32.6, 10.0.0.179 | > +--------------------------------------+--------------+--------+------------ > +-------------+---------------------------------------+ > (overcloud) [stack@undercloud-0 ~]$ I assume this is from a follow up attempt to start the instance that succeeds? This isn't captured in the logs btw. > Version-Release number of selected component (if applicable): > openstack-nova-compute:2018-03-29.1 FWIW this is openstack-nova-17.0.2-0.20180323024604.0390d5f.el7ost as listed in the logs below. AFAICT this is a simple race between n-cpu running _sync_power_states after the host reboot and n-api requesting the to start the instance. We can see the n-api request in the logs @ 13:54:45.537 : 4707 2018-04-04 13:54:45.537 15 DEBUG nova.api.openstack.wsgi [req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a 6f2495f8d5654d54885c9ebd1c7ecf41 62fc93a096ff487997b723a4069a78c3 - default default] Action: 'action', calling method: <bound method ServersController._start_server of <nova.api.openstack.compute.servers.ServersController object at 0x7fd82a464310>>, body: {"os-start": null} _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:604 [..] 4710 2018-04-04 13:54:45.591 15 INFO nova.api.openstack.wsgi [req-e8d1bcc3-bb4c-4717-8e04-5b9ea8904b0a 6f2495f8d5654d54885c9ebd1c7ecf41 62fc93a096ff487997b723a4069a78c3 - default default] HTTP exception thrown: Cannot 'start' instance 6083abbb-dfa1-4cba-9d77-309747559122 while it is in vm_state active At this time n-cpu was still updating the power state of the instance, not finishing until 13:54:45.928 : 2880 2018-04-04 13:54:21.558 1 INFO nova.service [req-3079f43e-796e-4742-ac0d-7d1a853f1759 - - - - -] Starting compute node (version 17.0.2-0.20180323024604.0390d5f.el7ost) [..] 3094 2018-04-04 13:54:27.882 1 DEBUG nova.compute.manager [req-3079f43e-796e-4742-ac0d-7d1a853f1759 - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1181 3095 2018-04-04 13:54:27.892 1 DEBUG nova.compute.manager [req-3079f43e-796e-4742-ac0d-7d1a853f1759 - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Current state is 4, state in DB is 1. _init_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:997 [..] 3115 2018-04-04 13:54:45.265 1 DEBUG oslo_service.periodic_task [req-3079f43e-796e-4742-ac0d-7d1a853f1759 - - - - -] Running periodic task ComputeManager._sync_power_states run_periodic_tasks /usr/lib/python2.7/site-packages/oslo_service/periodic_task.py:215 [..] 3122 2018-04-04 13:54:45.728 1 DEBUG oslo_concurrency.lockutils [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] Lock "6083abbb-dfa1-4cba-9d77-309747559122" acquired by "nova.compute.manager.do_stop_instance" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273 [..] 3128 2018-04-04 13:54:45.831 1 INFO nova.virt.libvirt.driver [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Instance already shutdown. 3129 2018-04-04 13:54:45.836 1 INFO nova.virt.libvirt.driver [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Instance destroyed successfully. 3130 2018-04-04 13:54:45.836 1 DEBUG nova.compute.manager [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] [instance: 6083abbb-dfa1-4cba-9d77-309747559122] Checking state _get_power_state /usr/lib/python2.7/site-packages/nova/compute/manager.py:1181 3131 2018-04-04 13:54:45.928 1 DEBUG oslo_concurrency.lockutils [req-35fa5969-d248-4e7e-8d44-6708a2f1a93a - - - - -] Lock "6083abbb-dfa1-4cba-9d77-309747559122" released by "nova.compute.manager.do_stop_instance" :: held 0.200s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285 Artem, I'm going to close this out as NOTABUG but please feel free to comment/reopen if you feel I've missed something here.
Created attachment 1426673 [details] nova compute logs
(overcloud) [stack@undercloud-0 ~]$ nova list +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+ | ID | Name | Status | Task State | Power State | Networks | +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+ | 92c27232-af71-44c3-82ef-e920b3d298f5 | after_deploy | SHUTOFF | powering-on | Shutdown | tenantvxlan=192.168.32.12, 10.0.0.183 | +--------------------------------------+--------------+---------+-------------+-------------+---------------------------------------+ (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-468d1b79-e468-4ef1-8672-32fe1b9a898a) ERROR (CommandError): Unable to start the specified server(s). (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-551e80b8-a1bf-4eba-b4df-a02984c36003) ERROR (CommandError): Unable to start the specified server(s). (overcloud) [stack@undercloud-0 ~]$ date Wed Apr 25 09:21:46 EDT 2018 (overcloud) [stack@undercloud-0 ~]$ date Wed Apr 25 09:21:49 EDT 2018 (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-3c643204-8372-4c0f-b8c1-7ed4e7f00259) ERROR (CommandError): Unable to start the specified server(s). (overcloud) [stack@undercloud-0 ~]$ date Wed Apr 25 09:21:59 EDT 2018 (overcloud) [stack@undercloud-0 ~]$ date Wed Apr 25 09:24:57 EDT 2018 (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-f2c4d9b9-522c-4828-ace5-17208b22d2b4) ERROR (CommandError): Unable to start the specified server(s). (overcloud) [stack@undercloud-0 ~]$ (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-f2c4d9b9-522c-4828-ace5-17208b22d2b4) ERROR (CommandError): Unable to start the specified server(s). (overcloud) [stack@undercloud-0 ~]$ date Wed Apr 25 09:25:29 EDT 2018 (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy Cannot 'start' instance 92c27232-af71-44c3-82ef-e920b3d298f5 while it is in task_state powering-on (HTTP 409) (Request-ID: req-5ba84e9a-a432-4c4e-8374-b1ecd453cc2c) ERROR (CommandError): Unable to start the specified server(s). (overcloud) [stack@undercloud-0 ~]$ nova start after_deploy Request to start server after_deploy has been accepted. (overcloud) [stack@undercloud-0 ~]$ date Wed Apr 25 09:28:44 EDT 2018 so timeout after reboot were 10minutes, so regarding to lyarwood I re-opened issue
We also need the nova-api.log from a controller listing some of these example requests to boot the instance. I'm still pretty sure this is just due to us not waiting until the instance is marked as Status == SHUTOFF, Task State == None and Power State == shutdown before attempting to start the instance.
The additional issue being seen in this environment is the following upstream bug currently being worked on in master: Cannot hard reboot a libvirt instance in error state (mdev query fails) https://bugs.launchpad.net/nova/+bug/1764460 I'll create a clone for this now so we can track this downstream.
Adding blocker flag, as this BZ is blocking one of the basic DF:DFG Regression scenarios. Deploy OSP13 --> Reboot all Overcloud nodes --> wait 5 minutes --> Attempt to launch instance on compute nodes. Result: Launch Instance Fails.
Wasn't reproduce on two previous runs MOVE to Closed