Bug 1608022

Summary:	[osp10] heat unable to recover from a restart
Product:	Red Hat OpenStack	Reporter:	Mike McClure <mimcclur>
Component:	openstack-heat	Assignee:	Zane Bitter <zbitter>
Status:	CLOSED DUPLICATE	QA Contact:	Ronnie Rasouli <rrasouli>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	10.0 (Newton)	CC:	mburns, ramishra, sbaker, shardy, srevivo
Target Milestone:	---
Target Release:	---
Hardware:	x86_64
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-07-25 15:34:34 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Mike McClure 2018-07-24 18:59:46 UTC

Description of problem:

An attempt to remove a compute node using `openstack overcloud node delete` left resources stuck in UPDATE or DELETE status.

Actual results:

After running `openstack overcloud node delete`, the resources were stuck. The stack events showed:

2018-07-16 20:12:52Z [wcmsc2.ComputeV2.0]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:12:54Z [wcmsc2.ComputeV2.0]: UPDATE_FAILED  Engine went down during resource UPDATE
2018-07-16 20:12:54Z [wcmsc2.ComputeV2]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:12:56Z [wcmsc2-Compute-ok7lplq5ei5z-15-xi3zvn5j636s.SshHostPubKey]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:13:02Z [wcmsc2-Compute-ok7lplq5ei5z-25-xo75qxrqqg5v-NodeExtraConfig-uf2iausxikxw]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:13:08Z [wcmsc2-Compute-ok7lplq5ei5z-15-xi3zvn5j636s-NodeTLSCAData-blwce7cjaykx]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:13:17Z [wcmsc2-Controller-elifqhy64wnv.0.ControllerDeployment]: SIGNAL_IN_PROGRESS  Signal: deployment e9d5adb1-c723-4b36-8dec-34d58fd26ad1 succeeded
2018-07-17 04:07:14Z [wcmsc2-Compute-ok7lplq5ei5z-2-bs6ppl4pzqwp-NodeTLSCAData-tk6fdjwzgaja]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-17 04:07:22Z [wcmsc2-Controller-elifqhy64wnv-1-v2iedxyixliw-NodeTLSData-6wtyzwp5o5f4]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-17 04:07:24Z [wcmsc2-Controller-elifqhy64wnv-2-slqbov77khwc-NodeTLSData-ydnpwnpvlti7]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-17 04:10:15Z [wcmsc2]: UPDATE_FAILED  State invalid for UPDATE
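
To see at a glance which failure reasons dominate, the event lines above can be grouped by status and reason. This is only a minimal sketch: it assumes every line has the shape `timestamp [resource]: STATUS  reason` seen in the excerpt, and `parse_event` and `summarize` are hypothetical helpers written for this note, not part of any OpenStack tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
#   <date> <time>Z [<resource>]: <STATUS>  <reason>
EVENT_RE = re.compile(r"^(\S+ \S+) \[([^\]]+)\]: (\S+)\s+(.*)$")


def parse_event(line):
    """Split one event line into its fields; return None if it does not match."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    timestamp, resource, status, reason = match.groups()
    return {
        "timestamp": timestamp,
        "resource": resource,
        "status": status,
        "reason": reason,
    }


def summarize(lines):
    """Count events per (status, reason) pair, skipping unparseable lines."""
    events = [e for e in map(parse_event, lines) if e is not None]
    return Counter((e["status"], e["reason"]) for e in events)


# Three lines copied from the events in this report.
SAMPLE = [
    "2018-07-16 20:12:52Z [wcmsc2.ComputeV2.0]: UPDATE_FAILED  Engine went down during stack UPDATE",
    "2018-07-16 20:12:54Z [wcmsc2.ComputeV2.0]: UPDATE_FAILED  Engine went down during resource UPDATE",
    "2018-07-17 04:10:15Z [wcmsc2]: UPDATE_FAILED  State invalid for UPDATE",
]

if __name__ == "__main__":
    for (status, reason), count in summarize(SAMPLE).items():
        print(count, status, reason)
```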

Expected results:

The compute node is removed successfully.

Additional info:

The root cause appears to be that heat-engine was killed right after the scale-out was launched, because some templates had not been loaded. We are looking for a way to handle recovery better in this situation.

The stack update completed successfully, and they were able to scale out the compute nodes as needed. The command used to update the database was:

MariaDB [heat]> update resource set status='FAILED',status_reason='Stack status manually reset' where status like '%IN_PROG%';
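
For illustration, the effect of the statement above can be tried safely against an in-memory SQLite table before touching a real database. This is a hedged sketch, not the Heat schema: the `resource` table below is a simplified stand-in with only three assumed columns, and `reset_in_progress` is a hypothetical helper. It lists the matching rows first, then applies the same update, so the affected resources can be reviewed.

```python
import sqlite3


def reset_in_progress(conn, reason="Stack status manually reset"):
    """Mark every resource whose status matches %IN_PROG% as FAILED.

    Returns the names of the rows that matched, for review.
    Note: in LIKE, '_' matches any single character, so the pattern
    matches 'IN_PROGRESS' (and any other one-character separator).
    """
    pattern = "%IN_PROG%"
    stuck = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM resource WHERE status LIKE ? ORDER BY name",
            (pattern,),
        )
    ]
    with conn:  # commits on success, rolls back if the update raises
        conn.execute(
            "UPDATE resource SET status = 'FAILED', status_reason = ? "
            "WHERE status LIKE ?",
            (reason, pattern),
        )
    return stuck


def demo():
    """Build a toy table (hypothetical data) and run the reset against it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resource (name TEXT, status TEXT, status_reason TEXT)"
    )
    conn.executemany(
        "INSERT INTO resource VALUES (?, ?, ?)",
        [
            ("SshHostPubKey", "IN_PROGRESS", "state changed"),
            ("NodeTLSCAData", "COMPLETE", "state changed"),
            ("NodeExtraConfig", "IN_PROGRESS", "state changed"),
        ],
    )
    changed = reset_in_progress(conn)
    rows = conn.execute(
        "SELECT name, status, status_reason FROM resource ORDER BY name"
    ).fetchall()
    conn.close()
    return changed, rows


if __name__ == "__main__":
    changed, rows = demo()
    print("reset:", changed)
    for row in rows:
        print(row)
```

Rows that are not in progress are left untouched; only the matched rows receive the new status and reason.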

Comment 2 Mike McClure 2018-07-25 15:34:34 UTC


*** This bug has been marked as a duplicate of bug 1445484 ***