Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1608022

Summary: [osp10] heat unable to recover from a restart
Product: Red Hat OpenStack
Reporter: Mike McClure <mimcclur>
Component: openstack-heat
Assignee: Zane Bitter <zbitter>
Status: CLOSED DUPLICATE
QA Contact: Ronnie Rasouli <rrasouli>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 10.0 (Newton)
CC: mburns, ramishra, sbaker, shardy, srevivo
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-25 15:34:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Mike McClure 2018-07-24 18:59:46 UTC
Description of problem:

An attempt to remove a compute node using `openstack overcloud node delete` left resources stuck in UPDATE or DELETE status.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:

After running `openstack overcloud node delete`, the resources were stuck. The stack events showed:

2018-07-16 20:12:52Z [wcmsc2.ComputeV2.0]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:12:54Z [wcmsc2.ComputeV2.0]: UPDATE_FAILED  Engine went down during resource UPDATE
2018-07-16 20:12:54Z [wcmsc2.ComputeV2]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:12:56Z [wcmsc2-Compute-ok7lplq5ei5z-15-xi3zvn5j636s.SshHostPubKey]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:13:02Z [wcmsc2-Compute-ok7lplq5ei5z-25-xo75qxrqqg5v-NodeExtraConfig-uf2iausxikxw]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:13:08Z [wcmsc2-Compute-ok7lplq5ei5z-15-xi3zvn5j636s-NodeTLSCAData-blwce7cjaykx]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-16 20:13:17Z [wcmsc2-Controller-elifqhy64wnv.0.ControllerDeployment]: SIGNAL_IN_PROGRESS  Signal: deployment e9d5adb1-c723-4b36-8dec-34d58fd26ad1 succeeded
2018-07-17 04:07:14Z [wcmsc2-Compute-ok7lplq5ei5z-2-bs6ppl4pzqwp-NodeTLSCAData-tk6fdjwzgaja]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-17 04:07:22Z [wcmsc2-Controller-elifqhy64wnv-1-v2iedxyixliw-NodeTLSData-6wtyzwp5o5f4]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-17 04:07:24Z [wcmsc2-Controller-elifqhy64wnv-2-slqbov77khwc-NodeTLSData-ydnpwnpvlti7]: UPDATE_FAILED  Engine went down during stack UPDATE
2018-07-17 04:10:15Z [wcmsc2]: UPDATE_FAILED  State invalid for UPDATE
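
For triage, event output like the above can be summarized per status with a short script. The following is only a sketch: the line layout (timestamp, bracketed resource name, status, reason) is inferred from the events quoted in this report, and the helper names parse_event and summarize are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout of one event line, inferred from the log excerpt:
#   "<date> <time>Z [<resource>]: <STATUS>  <reason>"
EVENT_RE = re.compile(r"^(\S+ \S+Z) \[([^\]]+)\]: (\S+)\s+(.*)$")


def parse_event(line):
    """Return (timestamp, resource, status, reason), or None if the line does not match."""
    match = EVENT_RE.match(line.strip())
    return match.groups() if match else None


def summarize(lines):
    """Count events per status, skipping lines that are not event lines."""
    parsed = (parse_event(line) for line in lines)
    return Counter(event[2] for event in parsed if event)


# Three lines copied from the events in this report.
sample = [
    "2018-07-16 20:12:52Z [wcmsc2.ComputeV2.0]: UPDATE_FAILED  Engine went down during stack UPDATE",
    "2018-07-16 20:13:17Z [wcmsc2-Controller-elifqhy64wnv.0.ControllerDeployment]: SIGNAL_IN_PROGRESS  Signal: deployment e9d5adb1-c723-4b36-8dec-34d58fd26ad1 succeeded",
    "2018-07-17 04:10:15Z [wcmsc2]: UPDATE_FAILED  State invalid for UPDATE",
]

for status, count in summarize(sample).items():
    print(status, count)
```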

Expected results:

The compute node would be removed successfully.

Additional info:

It appears the root cause is that heat-engine was killed right after the scale-out was launched, because some templates had not been loaded. We are looking for a way to better handle recovery in this situation.

The stack update completed successfully and they were able to scale out the compute nodes as needed. The command used to update the DB was:

MariaDB [heat]> update resource set status='FAILED',status_reason='Stack status manually reset' where status like '%IN_PROG%';
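
To see what that statement touches before running anything similar against a real database, it can be rehearsed on a throwaway copy. The sketch below is illustrative only: it uses an in-memory SQLite database instead of MariaDB, the four-column resource table is a simplified stand-in rather than the actual Heat schema, and the row names and the reset_in_progress helper are hypothetical.

```python
import sqlite3

# In-memory stand-in for the heat database (assumption: simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (name TEXT, action TEXT, status TEXT, status_reason TEXT)"
)
conn.executemany(
    "INSERT INTO resource VALUES (?, ?, ?, ?)",
    [
        ("res_a", "UPDATE", "IN_PROGRESS", ""),  # hypothetical stuck rows
        ("res_b", "DELETE", "IN_PROGRESS", ""),
        ("res_c", "CREATE", "COMPLETE", "state changed"),
    ],
)


def reset_in_progress(connection, reason="Stack status manually reset"):
    """Mark every IN_PROGRESS row as FAILED and return the number of rows changed."""
    cursor = connection.execute(
        "UPDATE resource SET status = 'FAILED', status_reason = ? "
        "WHERE status LIKE '%IN_PROG%'",
        (reason,),
    )
    connection.commit()
    return cursor.rowcount


# Count the affected rows first, then apply the same filter in the UPDATE.
stuck = conn.execute(
    "SELECT COUNT(*) FROM resource WHERE status LIKE '%IN_PROG%'"
).fetchone()[0]
changed = reset_in_progress(conn)
print("stuck before:", stuck, "rows changed:", changed)
```

Counting with the same WHERE clause first gives a quick check that the update will only touch the rows that are expected to be stuck.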

Comment 2 Mike McClure 2018-07-25 15:34:34 UTC

*** This bug has been marked as a duplicate of bug 1445484 ***