Deleting the OC stack occasionally fails

Environment:
puppet-heat-12.4.0-0.20180329033345.6577c1d.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180404165313.825731d.el7ost.noarch
instack-undercloud-8.4.0-3.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180404165313.825731d.el7ost.noarch
python2-heatclient-1.14.0-1.el7ost.noarch
python-heat-agent-1.5.4-0.20180308153305.ecf43c7.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-0.20180410170330.a39634a.el7ost.noarch
openstack-heat-common-10.0.1-0.20180404165313.825731d.el7ost.noarch
openstack-heat-api-10.0.1-0.20180404165313.825731d.el7ost.noarch

Steps to reproduce:
Attempt to deploy the overcloud (OC) with a faulty configuration (it should fail) a few times, say 5, deleting the failed stack between attempts.

Result:
At some point the deletion fails and the stack is left in the following status:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| 0fc28fc6-7101-4827-b04c-8e57def73f9f | overcloud  | b1362edcbce04b589b1dee1a125de5e7 | DELETE_FAILED | 2018-04-17T18:45:22Z | 2018-04-17T19:06:42Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+

The workaround is to retry the delete in a loop until it succeeds.
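The retry workaround can be sketched as a small shell helper; the `retry` function name and the attempt count are illustrative, not part of any tooling mentioned here:

```shell
#!/bin/sh
# Hypothetical retry helper: re-run a command until it succeeds
# or the maximum number of attempts is exhausted.
retry() {
  max=$1; shift
  i=1
  while [ "$i" -le "$max" ]; do
    # Run the command; return immediately on success.
    "$@" && return 0
    echo "attempt $i/$max failed, retrying..." >&2
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example usage on the undercloud (commented out here):
# retry 10 openstack stack delete overcloud -y --wait
```

A fixed sleep between attempts is the simplest choice; since the failure appears to be a transient race on the nova side, a longer delay between retries may also help.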
Note: We see it in many places. It doesn't happen on every attempt to delete a faulty stack.
Example (from another setup):

(undercloud) [stack@localhost undercloud_scripts]$ openstack stack delete openshift -y --wait
2018-04-17 20:23:33Z [openshift]: CREATE_FAILED Stack CREATE cancelled
2018-04-17 20:23:34Z [openshift.OpenShiftMaster]: CREATE_FAILED resources.OpenShiftMaster: Stack UPDATE cancelled
2018-04-17 20:23:34Z [openshift]: CREATE_FAILED Resource CREATE failed: resources.OpenShiftMaster: Stack UPDATE cancelled
2018-04-17 20:23:34Z [openshift.OpenShiftWorker]: CREATE_FAILED CREATE aborted (user triggered cancel)
2018-04-17 20:23:39Z [openshift]: DELETE_IN_PROGRESS Stack DELETE started
2018-04-17 20:23:48Z [openshift.OpenShiftMaster]: DELETE_IN_PROGRESS state changed
2018-04-17 20:23:51Z [openshift.OpenShiftMasterMergedConfigSettings]: DELETE_IN_PROGRESS state changed
2018-04-17 20:23:51Z [openshift.OpenShiftMasterMergedConfigSettings]: DELETE_COMPLETE state changed
2018-04-17 20:23:54Z [openshift.RedisVirtualIP]: DELETE_IN_PROGRESS state changed
2018-04-17 20:23:55Z [openshift.RedisVirtualIP]: DELETE_COMPLETE state changed
2018-04-17 20:24:02Z [openshift.VipHosts]: DELETE_IN_PROGRESS state changed
2018-04-17 20:24:02Z [openshift.VipHosts]: DELETE_COMPLETE state changed
2018-04-17 20:24:07Z [openshift.OpenShiftWorker]: DELETE_IN_PROGRESS state changed
2018-04-17 20:24:08Z [openshift.OpenShiftWorkerMergedConfigSettings]: DELETE_IN_PROGRESS state changed
2018-04-17 20:24:08Z [openshift.OpenShiftWorkerMergedConfigSettings]: DELETE_COMPLETE state changed
2018-04-17 20:24:20Z [openshift.OpenShiftMaster]: DELETE_FAILED ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:20Z [openshift]: DELETE_FAILED Resource DELETE failed: ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"

 Stack openshift DELETE_FAILED

Unable to delete 1 of the 1 stacks.
(undercloud) [stack@localhost undercloud_scripts]$ openstack stack delete openshift -y --wait
2018-04-17 20:24:41Z [OpenShiftWorker]: DELETE_FAILED DELETE aborted (user triggered cancel)
2018-04-17 20:24:46Z [openshift]: DELETE_IN_PROGRESS Stack DELETE started
2018-04-17 20:24:46Z [openshift.OpenShiftWorker]: DELETE_IN_PROGRESS state changed
2018-04-17 20:24:47Z [openshift.OpenShiftMaster]: DELETE_IN_PROGRESS state changed
2018-04-17 20:24:47Z [openshift.OpenShiftMaster]: DELETE_FAILED ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:47Z [openshift]: DELETE_FAILED Resource DELETE failed: ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:47Z [openshift.OpenShiftWorker]: DELETE_FAILED resources.OpenShiftWorker: Stack DELETE cancelled
2018-04-17 20:24:48Z [openshift]: DELETE_FAILED Resource DELETE failed: resources.OpenShiftWorker: Stack DELETE cancelled

 Stack openshift DELETE_FAILED

Unable to delete 1 of the 1 stacks.
The delete is failing because the Nova server is going into an ERROR state.
Agreed, this looks like a nova error, though we're missing logs to get more information. Do you have the nova logs for those errors?
This may be helped by the nova settings Alex mentions here: https://bugzilla.redhat.com/show_bug.cgi?id=1563303#c15
The last error is due to the virtualbmc/libvirt issue tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1571384. We can see these IPMI failures in ironic-conductor.log when attempting to power off the node, which eventually results in the nova error:

Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n'.: ProcessExecutionError: Unexpected error while running command.
2018-05-10 10:51:34.110 21902 ERROR ironic.conductor.manager [req-1bd1f299-c448-43c6-b5e2-10f7974aedb9 98893e94cf32457b8c839d3713adb313 1a4f67cfc5b54418a1f423c269626fc4 - default default] Error in tear_down of node a5afc780-6918-41eb-ba2c-ce80a5b67769: IPMI call failed: power status.: IPMIFailure: IPMI call failed: power status.

Since there is a libvirt patch in https://bugzilla.redhat.com/show_bug.cgi?id=1576464 (which is also the fix for bug 1571384) that should take care of it, can you install that patch and retry?
Since the logs show this is being caused by the libvirt/virtualbmc issue I'm closing it as a duplicate. *** This bug has been marked as a duplicate of bug 1571384 ***
I've seen this in a baremetal env (no vbmc), and I think the issue is a race in nova when trying to delete an IN_PROGRESS deployment. I'll try to reproduce and raise a new bug, but if anyone else does the same, please ensure you include the nova and ironic logs in any bug report. I don't think the error output from heat or tripleoclient is enough to pinpoint the issue; all it tells us is that nova had an error, but not why.
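To make that log collection easier, something like the following could be used before filing a bug. This is a sketch: the `collect_errors` helper name is hypothetical, and the log paths shown in the usage comment assume a typical Red Hat OSP undercloud layout and may differ on other setups.

```shell
#!/bin/sh
# Hypothetical helper: print ERROR lines (with line numbers) from each
# log file that exists, skipping any paths that are missing.
collect_errors() {
  for log in "$@"; do
    if [ -f "$log" ]; then
      grep -n 'ERROR' "$log"
    fi
  done
}

# Example usage (paths assume an OSP undercloud; adjust as needed):
# collect_errors /var/log/nova/nova-compute.log \
#                /var/log/ironic/ironic-conductor.log > stack-delete-errors.txt
```

Attaching the full nova and ironic logs is still preferable; the grep output is only a quick way to locate the relevant time window.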