Bug 1563303 - Can't delete a stack: provision state still deleting
Summary: Can't delete a stack: provision state still deleting
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: instack-undercloud
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: James Slagle
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-03 14:40 UTC by Udi Kalifon
Modified: 2021-03-17 19:11 UTC
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-20 17:33:37 UTC
Target Upstream Version:
Embargoed:



Description Udi Kalifon 2018-04-03 14:40:03 UTC
Description of problem:
Stack deletions fail with an error like: 
Resource DELETE failed: ResourceInError: resources.Controller.resources[1].resources.Controller: Went to status ERROR due to "Server overcloud-controller-1 delete failed: (500) Error destroying the instance on node e7169ae1-78d2-4cd4-830c-f3d5799b4081. Provision state still 'deleting'."

When running "openstack server list" you see some nodes in ERROR state.

This happens to me on bare metal (and I use the GUI), but I understand it's also happening to others in all kinds of environments. I tried restarting heat and the nova services, but nothing helps, and the only solution I have is to reinstall the undercloud.
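
(For reference, a minimal sketch of how to spot the failed servers and read the fault behind the ERROR state; the server name below is only a placeholder.)

# list the overcloud servers known to the undercloud's nova; failed ones show status ERROR
openstack server list
# dump the recorded fault for a server stuck in ERROR (substitute the real name or UUID)
openstack server show overcloud-controller-1 -c status -c fault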


Version-Release number of selected component (if applicable):
openstack-tripleo-common-8.5.1-0.20180326153322.91f52e9.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-0.20180327213843.f25e2d8.el7ost.noarch


How reproducible:
100%


Steps to Reproduce:
1. Delete a stack


Actual results:
Deletion fails


Expected results:
Stack should be deleted, all resources should be freed

Comment 1 Alex Schultz 2018-04-03 16:50:31 UTC
Please provide logs. Specifically we would want to see the ironic/nova logs.

Comment 6 Dmitry Tantsur 2018-04-04 18:35:45 UTC
The ironic-conductor logs look clean, it seems that unprovisioning did happen from ironic's standpoint. Could you please enable debug mode for the undercloud though? Also please show 'openstack baremetal node list' and 'openstack server list' after the first deletion attempt. If there are nodes and/or servers in ERROR state, please show their details. Thanks!
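
(A rough sketch of gathering that information, assuming undercloud.conf lives in the stack user's home directory and that undercloud_debug is the option used to turn on debug logging:)

# set undercloud_debug = true in ~/undercloud.conf, then re-apply the undercloud configuration
openstack undercloud install
# capture node and server state after the failed stack delete
openstack baremetal node list
openstack server list
# for anything left in ERROR, collect the full details
openstack baremetal node show <node-uuid>
openstack server show <server-uuid>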

Comment 7 Bob Fournier 2018-04-10 19:27:10 UTC
Any update on the requested info?

Comment 9 Bob Fournier 2018-04-17 20:57:41 UTC
From both the old and recent logs the Ironic unprovisioning looks clean:

2018-04-15 09:30:11.795 16550 INFO ironic.conductor.manager [req-2845fa49-3624-4974-85fa-ece252fa9675 2f2f560c1b7c496ea5d26970124fcab1 41ea856819f14463977262b07674a851 - default default] VIF 07deeff5-d2a2-4f33-bae8-13e016be3e0c successfully detached from node 47d3b7d7-639f-491a-9958-62988e96a969
2018-04-15 09:30:11.887 16550 INFO ironic.conductor.manager [req-1b2cced7-7e01-4821-8249-b869f0711570 2f2f560c1b7c496ea5d26970124fcab1 41ea856819f14463977262b07674a851 - default default] VIF e0faa0df-22e7-44a8-be50-f19b87ae01d1 successfully detached from node 456d6cce-0474-416b-ba22-93b8c1ea7ef8
2018-04-15 09:30:41.296 16550 INFO ironic.conductor.manager [req-6f5c8f57-8d1c-4908-be03-b1561fce170c 2f2f560c1b7c496ea5d26970124fcab1 41ea856819f14463977262b07674a851 - default default] VIF 07395809-d4d2-44d7-9f23-1f1dfe5498eb successfully detached from node d3d965a9-1b6e-4af6-a005-57ba92e47c75
2018-04-15 09:30:45.845 16550 INFO ironic.conductor.manager [req-feb393f6-7fcc-4b94-b65c-287563a4f708 2f2f560c1b7c496ea5d26970124fcab1 41ea856819f14463977262b07674a851 - default default] VIF cdd537c5-dbea-483e-b613-dc3841a39286 successfully detached from node fddb25dc-60f0-468f-90f0-2e265fee3299
2018-04-15 09:30:49.012 16550 INFO ironic.conductor.manager [req-6e3dbd8f-e2f7-4fd9-875d-c9263bd7375d 2f2f560c1b7c496ea5d26970124fcab1 41ea856819f14463977262b07674a851 - default default] VIF d288f20a-7176-43ab-9b55-2fb152dc9189 successfully detached from node c5467fdb-162d-4c23-b36f-1b091b9f4fca
2018-04-15 09:32:27.588 16550 INFO ironic.conductor.manager [req-889a746c-3bf3-493a-af68-1817ccefe2f5 2f2f560c1b7c496ea5d26970124fcab1 41ea856819f14463977262b07674a851 - default default] VIF d09064c2-4086-4a42-aeba-5b22e224997c successfully detached from node 4534385d-b6a4-4ed4-b6f5-a745731d04b6


The server details show an error in nova, which suggests the deletion is taking longer than expected:
Error destroying the instance on node fddb25dc-60f0-468f-90f0-2e265fee3299. Provision state still 'deleting'.", u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 202, in decorated_function\n    return function(self, context, *args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2559, in terminate_instance\n    do_terminate_instance(instance, bdms)\n  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner\n    return f(*args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2557, in do_terminate_instance\n    self._set_instance_obj_error_state(context, instance)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2547, in do_terminate_instance\n    self._delete_instance(context, instance, bdms)\n  File "/usr/lib/python2.7/site-packages/nova/hooks.py", line 154, in inner\n    rv = f(*args, **kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2476, in _delete_instance\n    self._shutdown_instance(context, instance, bdms)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2368, in _shutdown_instance\n    requested_networks)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2355, in _shutdown_instance\n    block_device_info)\n  File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1213, in destroy\n    self._unprovision(instance, node)\n  File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1186, in _unprovision\n    timer.start(interval=CONF.ironic.api_retry_interval).wait()\n  File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait\n    return hubs.get_hub().switch()\n  File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch\n    return self.greenlet.switch()\n  File "/usr/lib/python2.7/site-packages/oslo_service/loopingcall.py", line 137, in _run_loop\n    result = func(*self.args, **self.kw)\n  File "/usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py", line 1178, in _wait_for_provision_state\n    raise exception.NovaException(msg)\n', u'created': u'2018-04-15T06:32:48Z'

I'm also seeing lots of errors in nova-compute.log in the latest set of logs (stack-delete-logs.tgz).

I'd like to see if the compute team can take a look, especially at the errors in nova-compute.log, so I'm including the Compute DFG. The server details do show timeouts in virt/ironic/driver.py, but the ironic logs look clean. It's also confusing that this also seems to be occurring in a non-baremetal environment.

Please send back if it looks like this is an Ironic issue.

Comment 10 Matthew Booth 2018-04-20 09:24:14 UTC
Here's a relevant log sequence for a single failed delete:

===
2018-04-03 17:16:42.126 [./ironic/ironic-conductor.log]
successfully detached from node 2eee3de0-6e8e-46fc-8969-0e9e36102b47

2018-04-03 17:16:43.300 [./ironic/ironic-conductor.log]
Node 2eee3de0-6e8e-46fc-8969-0e9e36102b47 moved to provision state "deleting" from state "active"; target provision state is "available"

2018-04-03 17:18:45.400 [./nova/nova-compute.log]
Error destroying the instance on node 2eee3de0-6e8e-46fc-8969-0e9e36102b47. Provision state still 'deleting'.

2018-04-03 17:18:59.625 [./ironic/ironic-conductor.log]
Successfully set node 2eee3de0-6e8e-46fc-8969-0e9e36102b47 power state to power off by power off.
Successfully unprovisioned node 2eee3de0-6e8e-46fc-8969-0e9e36102b47 with instance 99feb01c-8f79-4714-94e9-81e90eb479b8.
Node 2eee3de0-6e8e-46fc-8969-0e9e36102b47 moved to provision state "cleaning" from state "deleting"; target provision state is "available"
Node 2eee3de0-6e8e-46fc-8969-0e9e36102b47 moved to provision state "available" from state "cleaning"; target provision state is "None"
Automated cleaning is disabled, node 2eee3de0-6e8e-46fc-8969-0e9e36102b47 has been successfully moved to AVAILABLE state.
===

The delete starts at 17:16:43. The Nova ironic driver fails it with a timeout at 17:18:45, 2 minutes and 2 seconds later. It actually finishes at 17:18:59, 14 seconds after the timeout.

The ironic driver sits in a wait loop controlled by two config variables:

ironic.api_retry_interval: default 2 seconds
ironic.api_max_retries: default 60

i.e., by default we've configured the ironic driver to time out if the delete doesn't complete within 2 minutes (60 retries x 2 seconds), which is what it does. As the delete actually takes about 2 minutes 17 seconds, it seems we need to either work out why it's slow or bump the timeout. I suspect the latter.
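
(For illustration only, and not through a supported TripleO interface (see the follow-up comments): the knobs in question live in the [ironic] section of the undercloud's nova.conf and could be raised by hand, assuming crudini is available:)

# raise the wait budget from 60 x 2s = 120s to 90 x 2s = 180s
crudini --set /etc/nova/nova.conf ironic api_max_retries 90
crudini --set /etc/nova/nova.conf ironic api_retry_interval 2
systemctl restart openstack-nova-compute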

I'm inclined to close this NOTABUG because it's a config issue, unless the Ironic folks think it's a bug that the delete takes that long.

Comment 11 Bob Fournier 2018-04-20 17:01:46 UTC
Matt - thanks for looking at this. In all of the logs, it's consistently taking Ironic just over 2 minutes for the node delete to complete. Here's an example from the last set where it takes 2 minutes 16 seconds:

2018-04-15 09:30:08.185 16550 INFO ironic.conductor.task_manager [req-0e907dd8-80de-4f12-be84-a697612bbb35 2f2f560c1b7c496ea5d26970124fcab1 41ea856819f14463977262b07674a851 - default default] Node 4534385d-b6a4-4ed4-b6f5-a745731d04b6 moved to provision state "deleting" from state "active"; target provision state is "available"
...
2018-04-15 09:32:24.678 16550 INFO ironic.conductor.utils [req-0e907dd8-80de-4f12-be84-a697612bbb35 2f2f560c1b7c496ea5d26970124fcab1 41ea856819f14463977262b07674a851 - default default] Successfully set node 4534385d-b6a4-4ed4-b6f5-a745731d04b6 power state to power off by power off.

There do not seem to be any issues in Ironic; it's just the length of time it takes the hardware to power off.

As Matt indicated, please set the nova ironic api_max_retries option higher (e.g. 90) to avoid this.

Comment 12 Alex Schultz 2018-04-20 17:15:31 UTC
Just as a note, this is not currently configurable on the undercloud via instack-undercloud. It's inheriting the defaults from the nova ironic configuration.  In puppet-nova we're just using the defaults provided via puppet.

https://github.com/openstack/puppet-nova/blob/master/manifests/ironic/common.pp#L66

If this is something that should be tuned, it'll need to be exposed via instack-undercloud and THT.

Comment 13 Alex Schultz 2018-04-20 17:17:04 UTC
Reopening because the suggested configuration change is not currently possible via tripleo mechanisms.

Comment 14 Alex Schultz 2018-04-20 17:21:50 UTC
Dropping the blocker flag portion as this is environmental and should not prevent anyone from deploying OSP13. This would need to be a new configuration item if a user is to configure these timeouts.

Comment 15 Alex Schultz 2018-04-20 17:33:37 UTC
I partially take back my comment in c12. A hieradata_override file could be used in undercloud.conf to specify these values.

nova::ironic::common::api_max_retries: 90
nova::ironic::common::api_retry_interval: 2

I do think it would be beneficial to expose these as top-level undercloud configuration options if an end user is supposed to tune them. But we can cross that bridge if more folks report this as a problem.
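
(Roughly, the workaround sketched above looks like this; the override file name and path are arbitrary examples:)

# 1. put the overrides in a hieradata file, e.g. /home/stack/hieradata-override.yaml:
#      nova::ironic::common::api_max_retries: 90
#      nova::ironic::common::api_retry_interval: 2
# 2. point undercloud.conf at it:
#      hieradata_override = /home/stack/hieradata-override.yaml
# 3. re-apply the undercloud configuration
openstack undercloud install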

