Bug 1568578

Summary: Deleting the OC stack occasionally fails
Product: Red Hat OpenStack
Component: rhosp-director
Version: 13.0 (Queens)
Status: CLOSED DUPLICATE
Severity: unspecified
Priority: unspecified
Reporter: Alexander Chuzhoy <sasha>
Assignee: RHOS Maint <rhos-maint>
QA Contact: Amit Ugol <augol>
CC: ahrechan, aschultz, berrange, bfournie, dasmith, dbecker, dprince, eglynn, imain, jhakimra, kchamart, maandre, mburns, mlammon, morazi, ohochman, sasha, sbaker, sbauza, sferdjao, sgordon, shardy, srevivo, stephenfin, therve, vromanso
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-05-11 20:06:21 UTC

Description Alexander Chuzhoy 2018-04-17 19:45:58 UTC
Deleting the OC stack occasionally fails


Environment:
puppet-heat-12.4.0-0.20180329033345.6577c1d.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-heat-api-cfn-10.0.1-0.20180404165313.825731d.el7ost.noarch
instack-undercloud-8.4.0-3.el7ost.noarch
openstack-heat-engine-10.0.1-0.20180404165313.825731d.el7ost.noarch
python2-heatclient-1.14.0-1.el7ost.noarch
python-heat-agent-1.5.4-0.20180308153305.ecf43c7.el7ost.noarch
openstack-tripleo-heat-templates-8.0.2-0.20180410170330.a39634a.el7ost.noarch
openstack-heat-common-10.0.1-0.20180404165313.825731d.el7ost.noarch
openstack-heat-api-10.0.1-0.20180404165313.825731d.el7ost.noarch


Steps to reproduce:
Attempt to deploy the OC with a faulty configuration (the deployment should fail), then delete the stack; repeat a few times, say 5.
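A minimal reproduction sketch of that loop, assuming the stack is named "overcloud" and "bad-config.yaml" is a hypothetical environment file that makes the deploy fail:

    for i in $(seq 1 5); do
        # deploy with a known-bad environment so the stack ends up CREATE_FAILED
        openstack overcloud deploy --templates -e bad-config.yaml
        # delete the failed stack; this is the step that occasionally fails
        openstack stack delete overcloud -y --wait
    done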

Result:

At some point the deletion will fail and the stack will be in the following status:

(undercloud) [stack@undercloud-0 ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Project                          | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
| 0fc28fc6-7101-4827-b04c-8e57def73f9f | overcloud  | b1362edcbce04b589b1dee1a125de5e7 | DELETE_FAILED | 2018-04-17T18:45:22Z | 2018-04-17T19:06:42Z |
+--------------------------------------+------------+----------------------------------+---------------+----------------------+----------------------+
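To see which nested resource is blocking the delete, something like the following usually helps (nesting depth of 5 is arbitrary):

    openstack stack resource list -n 5 overcloud | grep FAILED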



The workaround is to retry the delete in a loop until it succeeds.
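A sketch of that workaround as a shell loop, assuming the client exits non-zero when the delete fails (which the --wait output below suggests); the 30-second pause is arbitrary:

    until openstack stack delete overcloud -y --wait; do
        echo "delete failed, retrying..."
        sleep 30
    done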

Comment 1 Alexander Chuzhoy 2018-04-17 19:47:18 UTC
Note: We see it in many places. It doesn't happen on every attempt to delete a faulty stack.

Comment 3 Alexander Chuzhoy 2018-04-17 20:25:42 UTC
Example (from another setup):


(undercloud) [stack@localhost undercloud_scripts]$ openstack stack delete openshift -y --wait
2018-04-17 20:23:33Z [openshift]: CREATE_FAILED  Stack CREATE cancelled
2018-04-17 20:23:34Z [openshift.OpenShiftMaster]: CREATE_FAILED  resources.OpenShiftMaster: Stack UPDATE cancelled
2018-04-17 20:23:34Z [openshift]: CREATE_FAILED  Resource CREATE failed: resources.OpenShiftMaster: Stack UPDATE cancelled
2018-04-17 20:23:34Z [openshift.OpenShiftWorker]: CREATE_FAILED  CREATE aborted (user triggered cancel)
2018-04-17 20:23:39Z [openshift]: DELETE_IN_PROGRESS  Stack DELETE started
2018-04-17 20:23:48Z [openshift.OpenShiftMaster]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:23:51Z [openshift.OpenShiftMasterMergedConfigSettings]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:23:51Z [openshift.OpenShiftMasterMergedConfigSettings]: DELETE_COMPLETE  state changed
2018-04-17 20:23:54Z [openshift.RedisVirtualIP]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:23:55Z [openshift.RedisVirtualIP]: DELETE_COMPLETE  state changed
2018-04-17 20:24:02Z [openshift.VipHosts]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:02Z [openshift.VipHosts]: DELETE_COMPLETE  state changed
2018-04-17 20:24:07Z [openshift.OpenShiftWorker]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:08Z [openshift.OpenShiftWorkerMergedConfigSettings]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:08Z [openshift.OpenShiftWorkerMergedConfigSettings]: DELETE_COMPLETE  state changed
2018-04-17 20:24:20Z [openshift.OpenShiftMaster]: DELETE_FAILED  ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:20Z [openshift]: DELETE_FAILED  Resource DELETE failed: ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"

 Stack openshift DELETE_FAILED

Unable to delete 1 of the 1 stacks.
(undercloud) [stack@localhost undercloud_scripts]$ openstack stack delete openshift -y --wait
2018-04-17 20:24:41Z [OpenShiftWorker]: DELETE_FAILED  DELETE aborted (user triggered cancel)
2018-04-17 20:24:46Z [openshift]: DELETE_IN_PROGRESS  Stack DELETE started
2018-04-17 20:24:46Z [openshift.OpenShiftWorker]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:47Z [openshift.OpenShiftMaster]: DELETE_IN_PROGRESS  state changed
2018-04-17 20:24:47Z [openshift.OpenShiftMaster]: DELETE_FAILED  ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:47Z [openshift]: DELETE_FAILED  Resource DELETE failed: ResourceInError: resources.OpenShiftMaster.resources[0].resources.OpenShiftMaster: Went to status ERROR due to "Server openshift-openshiftmaster-0 delete failed: (None) Unknown"
2018-04-17 20:24:47Z [openshift.OpenShiftWorker]: DELETE_FAILED  resources.OpenShiftWorker: Stack DELETE cancelled
2018-04-17 20:24:48Z [openshift]: DELETE_FAILED  Resource DELETE failed: resources.OpenShiftWorker: Stack DELETE cancelled

 Stack openshift DELETE_FAILED

Unable to delete 1 of the 1 stacks.

Comment 4 Zane Bitter 2018-04-24 20:17:59 UTC
The delete is failing because the Nova server is going into an ERROR state.

Comment 5 Thomas Hervé 2018-04-25 08:07:19 UTC
Agreed, this looks like a nova error, though we're missing the logs needed for more information. Do you have the nova logs for those errors?
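For anyone reproducing this, a hedged way to pull the nova-side reason (server name taken from comment 3; the log path assumes a non-containerized undercloud):

    openstack server show openshift-openshiftmaster-0 -c fault
    sudo grep -iE 'error|traceback' /var/log/nova/nova-compute.log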

Comment 6 Bob Fournier 2018-05-09 15:53:49 UTC
This may be helped by the nova settings Alex mentions here - https://bugzilla.redhat.com/show_bug.cgi?id=1563303#c15

Comment 8 Bob Fournier 2018-05-10 17:05:22 UTC
The last error is due to the virtualbmc/libvirt issue tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1571384.  We can see these IPMI failures in ironic-conductor.log when Ironic attempts to power off the node, which eventually results in the nova error.

Stderr: u'Error: Unable to establish IPMI v2 / RMCP+ session\n'.: ProcessExecutionError: Unexpected error while running command.
2018-05-10 10:51:34.110 21902 ERROR ironic.conductor.manager [req-1bd1f299-c448-43c6-b5e2-10f7974aedb9 98893e94cf32457b8c839d3713adb313 1a4f67cfc5b54418a1f423c269626fc4 - default default] Error in tear_down of node a5afc780-6918-41eb-ba2c-ce80a5b67769: IPMI call failed: power status.: IPMIFailure: IPMI call failed: power status.
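A hedged way to spot these failures on the undercloud (the path assumes a non-containerized undercloud; on a containerized one the log lives under /var/log/containers/ironic/):

    sudo grep -E 'IPMIFailure|IPMI call failed|RMCP' /var/log/ironic/ironic-conductor.log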

There is a libvirt patch in https://bugzilla.redhat.com/show_bug.cgi?id=1576464 (which is also the fix for 1571384) that should take care of this. Can you install that patch and retry?

Comment 10 Bob Fournier 2018-05-11 20:06:21 UTC
Since the logs show this is being caused by the libvirt/virtualbmc issue I'm closing it as a duplicate.

*** This bug has been marked as a duplicate of bug 1571384 ***

Comment 11 Steven Hardy 2018-05-17 09:07:38 UTC
I've seen this in a baremetal env (no vbmc), and I think the issue is a race in nova when trying to delete an IN_PROGRESS deployment.

I'll try to reproduce and raise a new bug, but if anyone else does the same, please include the nova and ironic logs in any bug report. The error output from heat or tripleoclient isn't enough to pinpoint the issue; it only tells us that nova had an error, not why.
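For anyone filing that follow-up, a rough sketch of the minimum to attach (log paths assume a non-containerized undercloud):

    # heat's view of the failure
    openstack stack failures list --long overcloud
    # nova and ironic sides, ideally correlated by the request-id from the heat error
    sudo tar czf delete-failure-logs.tgz /var/log/nova/ /var/log/ironic/ /var/log/heat/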