Bug 1568578 - Deleting the OC stack occasionally fails
Summary: Deleting the OC stack occasionally fails
Keywords:
Status: CLOSED DUPLICATE of bug 1571384
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Assignee: RHOS Maint
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-17 19:45 UTC by Alexander Chuzhoy
Modified: 2019-09-09 16:30 UTC
CC: 26 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-11 20:06:21 UTC
Target Upstream Version:



Description Alexander Chuzhoy 2018-04-17 19:45:58 UTC
Deleting the OC stack occasionally fails


Environment:
puppet-heat-12.4.0-0.20180329033345.6577c1d.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180404165313.825731d.el7ost.noarch
instack-undercloud-8.4.0-3.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180404165313.825731d.el7ost.noarch
python2-heatclient-1.14.0-1.el7ost.noarch
python-heat-agent-1.5.4-0.20180308153305.ecf43c7.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-0.20180410170330.a39634a.el7ost.noarch
openstack-heat-common-10.0.1-0.20180404165313.825731d.el7ost.noarch
openstack-heat-api-10.0.1-0.20180404165313.825731d.el7ost.noarch


Steps to reproduce:
Attempt to deploy the overcloud (OC) with a faulty configuration (so the deployment fails), then delete the stack; repeat a few times, say five.

Result:

At some point the deletion will fail and the stack will be left in the following state:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| 0fc28fc6-7101-4827-b04c-8e57def73f9f | overcloud  | b1362edcbce04b589b1dee1a125de5e7 | DELETE_FAILED | 2018-04-17T18:45:22Z | 2018-04-17T19:06:42Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+



The workaround is to retry the delete in a loop until it succeeds.
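The retry-until-success workaround can be sketched as a small shell loop. This is only an illustration of the approach, not part of any OpenStack tooling: `retry_until_ok`, `RETRY_DELAY`, and the attempt limit are illustrative names and values; the actual delete command is the one shown in the transcripts above.

```shell
# Sketch of the workaround: keep retrying a command until it exits 0.
# retry_until_ok and RETRY_DELAY are illustrative, not OpenStack tooling.
retry_until_ok() {
  local attempts=0 max=10
  until "$@"; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge "$max" ]; then
      echo "giving up after $max attempts" >&2
      return 1
    fi
    sleep "${RETRY_DELAY:-30}"   # back off between attempts
  done
}
```

Usage against the stack from this report would look like `retry_until_ok openstack stack delete overcloud -y --wait`.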

Comment 1 Alexander Chuzhoy 2018-04-17 19:47:18 UTC
Note: We see it in many places. It doesn't happen on every attempt to delete a faulty stack.

Comment 3 Alexander Chuzhoy 2018-04-17 20:25:42 UTC
Example (from another setup):


(undercloud) [stack@localhost undercloud_scripts]$ openstack stack delete openshift -y --wait
2018-04-17 20:23:33Z [openshift]: CREATE_FAILED  Stack CREATE cancelled
2018-04-17 20:23:34Z [openshift.OpenShiftMaster]: CREATE_FAILED  resources.OpenShiftMaster: Stack UPDATE cancelled
2018-04-17 20:23:34Z [openshift]: CREATE_FAILED  Resource CREATE failed: resources.OpenShiftMaster: Stack UPDATE cancelled
2018-04-17 20:23:34Z [openshift.OpenShiftWorker]: CREATE_FAILED  CREATE aborted (user triggered cancel)
2018-04-17 20:23:39Z [openshift]: DELETE_IN_PROGRESS  Stack DELETE started
2018-04-17 20:23:48Z [openshift.OpenShiftMaster]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:23:51Z [openshift.OpenShiftMasterMergedConfigSettings]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:23:51Z [openshift.OpenShiftMasterMergedConfigSettings]: DELETE_COMPLETE  state changed
2018-04-17 20:23:54Z [openshift.RedisVirtualIP]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:23:55Z [openshift.RedisVirtualIP]: DELETE_COMPLETE  state changed
2018-04-17 20:24:02Z [openshift.VipHosts]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:02Z [openshift.VipHosts]: DELETE_COMPLETE  state changed
2018-04-17 20:24:07Z [openshift.OpenShiftWorker]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:08Z [openshift.OpenShiftWorkerMergedConfigSettings]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:08Z [openshift.OpenShiftWorkerMergedConfigSettings]: DELETE_COMPLETE  state changed
2018-04-17 20:24:20Z [openshift.OpenShiftMaster]: DELETE_FAILED  ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:20Z [openshift]: DELETE_FAILED  Resource DELETE failed: ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"

 Stack openshift DELETE_FAILED

Unable to delete 1 of the 1 stacks.
(undercloud) [stack@localhost undercloud_scripts]$ openstack stack delete openshift -y --wait
2018-04-17 20:24:41Z [OpenShiftWorker]: DELETE_FAILED  DELETE aborted (user triggered cancel)
2018-04-17 20:24:46Z [openshift]: DELETE_IN_PROGRESS  Stack DELETE started
2018-04-17 20:24:46Z [openshift.OpenShiftWorker]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:47Z [openshift.OpenShiftMaster]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:47Z [openshift.OpenShiftMaster]: DELETE_FAILED  ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:47Z [openshift]: DELETE_FAILED  Resource DELETE failed: ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:47Z [openshift.OpenShiftWorker]: DELETE_FAILED  resources.OpenShiftWorker: Stack DELETE cancelled
2018-04-17 20:24:48Z [openshift]: DELETE_FAILED  Resource DELETE failed: resources.OpenShiftWorker: Stack DELETE cancelled

 Stack openshift DELETE_FAILED
Unable to delete 1 of the 1 stacks.

Comment 4 Zane Bitter 2018-04-24 20:17:59 UTC
The delete is failing because the Nova server is going into an ERROR state.

Comment 5 Thomas Hervé 2018-04-25 08:07:19 UTC
Agreed, this looks like a nova error, though we're missing the logs needed for more information. Do you have nova logs for those errors?

Comment 6 Bob Fournier 2018-05-09 15:53:49 UTC
This may be helped by the nova settings Alex mentions here - https://bugzilla.redhat.com/show_bug.cgi?id=1563303#c15

Comment 8 Bob Fournier 2018-05-10 17:05:22 UTC
The last error is due to the virtualbmc/libvirt issue tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1571384. We can see these IPMI failures in ironic-conductor.log when attempting to power off the node, which eventually results in the nova error.

Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n'.: ProcessExecutionError: Unexpected error while running command.
2018-05-10 10:51:34.110 21902 ERROR ironic.conductor.manager [req-1bd1f299-c448-43c6-b5e2-10f7974aedb9 98893e94cf32457b8c839d3713adb313 1a4f67cfc5b54418a1f423c269626fc4 - default default] Error in tear_down of node a5afc780-6918-41eb-ba2c-ce80a5b67769: IPMI call failed: power status.: IPMIFailure: IPMI call failed: power status.

Since there is a libvirt patch in https://bugzilla.redhat.com/show_bug.cgi?id=1576464 (which is also the fix for bug 1571384) that should take care of this, can you install that patch and retry?

Comment 10 Bob Fournier 2018-05-11 20:06:21 UTC
Since the logs show this is being caused by the libvirt/virtualbmc issue I'm closing it as a duplicate.

*** This bug has been marked as a duplicate of bug 1571384 ***

Comment 11 Steven Hardy 2018-05-17 09:07:38 UTC
I've seen this in a baremetal env (no vbmc), and I think the issue is a race in nova when trying to delete an IN_PROGRESS deployment.

I'll try to reproduce and raise a new bug, but if anyone else does the same, please ensure you include the nova and ironic logs in any bug report. I don't think the error output from heat or tripleoclient is enough to pinpoint the issue; all it tells us is that nova had an error, not why.
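A quick way to gather the evidence requested above is to grep the relevant logs for the delete-time errors before filing. This is only a sketch: `find_delete_errors` is an illustrative helper name, and the log paths in the usage note are typical defaults that may differ on your undercloud.

```shell
# Sketch: print numbered log lines mentioning ERROR states or IPMI failures.
# find_delete_errors is an illustrative name, not part of any tooling.
find_delete_errors() {
  grep -nE 'ERROR|IPMIFailure|ProcessExecutionError' "$1"
}
```

For the failure in this report, something like `find_delete_errors /var/log/ironic/ironic-conductor.log` and the equivalent against the nova logs would capture the lines quoted in comment 8.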

