Bug 1313885

Summary: Ability to recover from RHELUnregistrationDeployment if the nodes are gone
Product: Red Hat OpenStack Reporter: David Juran <djuran>
Component: openstack-heat Assignee: Dan Macpherson <dmacpher>
Status: CLOSED NEXTRELEASE QA Contact: Amit Ugol <augol>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.0 (Kilo) CC: dmacpher, gchenuet, ipilcher, jason.dobies, jliberma, mburns, mschuppe, nchandek, rhel-osp-director-maint, rlondhe, sbaker, shardy, srevivo, zbitter
Target Milestone: --- Keywords: ZStream
Target Release: 8.0 (Liberty)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
Cause: When deleting the overcloud, the RHN unregistration step can hang if there is a problem with the node being deleted.

Consequence: The stack delete waits until the unregister step times out, which makes the delete appear to be hung.

Fix: The unregistration step is, in Heat terminology, a "software deployment". Deployments wait for a signal from the node before moving out of the "in progress" state. This signal can be sent to the stack manually. The first step is to determine the ID of the nested stack where the deployment exists:

heat resource-list -n5 overcloud | grep RHELUnregistrationDeployment

The output includes a column titled "stack_name". This is the value to pass as <nested-stack-name> in the following command:

heat resource-signal <nested-stack-name> RHELUnregistrationDeployment

Result: The resource-signal command allows Heat to move past the unregistration step and finish the delete.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-02-03 13:52:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description David Juran 2016-03-02 14:31:33 UTC
Description of problem:
If your overcloud nodes are registered to a Red Hat Satellite (or the upstream CDN) and something bad happens that wipes out a node, deleting the stack to start over will appear to hang: Heat waits for RHELUnregistrationDeployment to time out, which by default takes four hours.
A documented way of recovering would be nice.


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch

How reproducible:
Every time


Steps to Reproduce:
1. Deploy an overcloud, making sure your nodes are registered to a Satellite
2. Wipe out a node
3. heat stack-delete overcloud



Additional info:
A few suggestions for improving the situation:

1. heat stack-delete overcloud

2. heat resource-list -n5 overcloud | grep RHELUnregistrationDeployment

3. heat resource-signal <nested stack id> RHELUnregistrationDeployment

Where <nested stack id> is the name/ID of the stack containing the
RHELUnregistrationDeployment resource. In recent heat/heatclient versions
this is included as the last column of the resource-list output;
alternatively, it can be derived from the output of
heat stack-list -n | grep NodeExtraConfig

Note that you'll repeat step (3) with a different nested stack ID each time,
i.e. once per node, after which the delete should complete OK.
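The three steps above can be sketched as a small script. This is only a sketch, not a tested procedure: it assumes recent heatclient table output where stack_name is the last column, and the extract_stacks helper and the sample stack name (overcloud-Compute-abcd-NodeExtraConfig-efgh) are made up for illustration.

```shell
#!/bin/sh
# Sketch: recover a hung "heat stack-delete overcloud" by signalling each
# stuck RHELUnregistrationDeployment. Assumes the nested stack name is the
# last column of the pipe-separated "heat resource-list" table output.

# Pull the nested stack name (second-to-last |-delimited field, since the
# row ends with a trailing |) out of each matching table row.
extract_stacks() {
    grep RHELUnregistrationDeployment |
        awk -F'|' '{gsub(/ /, "", $(NF-1)); print $(NF-1)}'
}

# Demonstrate the extraction on a captured sample row (fabricated IDs).
sample='| RHELUnregistrationDeployment | 1234 | OS::Heat::SoftwareDeployment | DELETE_IN_PROGRESS | 2016-03-02 | overcloud-Compute-abcd-NodeExtraConfig-efgh |'
printf '%s\n' "$sample" | extract_stacks
# prints: overcloud-Compute-abcd-NodeExtraConfig-efgh

# Against a live undercloud the loop would be (not run here):
# heat stack-delete overcloud
# for s in $(heat resource-list -n5 overcloud | extract_stacks); do
#     heat resource-signal "$s" RHELUnregistrationDeployment
# done
```

The live commands are left commented out because they require an undercloud; only the column extraction is demonstrated.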

Another suggestion is to run

heat deployment-delete <id_of_RHELUnregistrationDeployment> 

Yet another way would be to implement an independent timeout for the Software Deployment resource, but that is an RFE for heat.
Another approach would be to tell heat not to expect any signal for
unregistration:

http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Heat::SoftwareDeployment

e.g. we could set signal_transport to NO_SIGNAL (or make this easily
configurable). The main disadvantage of doing this by default is that you
then fail to catch any real unregistration failures when you're not
recovering from the nodes-aren't-there scenario under discussion here.
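For reference, a NO_SIGNAL deployment would look roughly like the fragment below. This is an illustrative sketch, not the actual tripleo-heat-templates resource: the resource name, server parameter, and config reference are assumptions; signal_transport and its NO_SIGNAL value are documented OS::Heat::SoftwareDeployment properties.

```
  RHELUnregistrationDeployment:
    type: OS::Heat::SoftwareDeployment
    properties:
      server: {get_param: server}                  # illustrative parameter name
      config: {get_resource: RHELUnregistration}   # illustrative config resource
      actions: ['DELETE']
      signal_transport: NO_SIGNAL   # don't wait for the node to report back
```

With NO_SIGNAL, Heat marks the deployment complete as soon as the config is made available, which is exactly why genuine unregistration failures would go unnoticed.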

Comment 2 Steve Baker 2016-03-02 22:54:55 UTC
Another option is to perform the scale-down steps for any node which was removed outside of heat.

Setting signal_transport:NO_SIGNAL is risky as the server may be halted before the unregister has a chance to run.

One possible fix which could happen in heat (and be backported) is for the deployment resource to check the nova server exists on DELETE, and if it doesn't then behave like NO_SIGNAL.

Comment 3 Steve Baker 2016-03-02 22:55:42 UTC
Will assign to heat just so we can discuss.

Comment 4 Jay Dobies 2016-03-16 13:45:57 UTC
RFE filed for deployment timeouts: https://bugs.launchpad.net/heat/+bug/1557764

Otherwise, I'm going to flag this as a doctext bug and write up the resource-signal approach.

Comment 5 Zane Bitter 2016-05-19 17:11:38 UTC
This is fixed in Newton by the linked OpenStack Gerrit review. For RHOS 8 (Liberty) we should first document the workaround in the release notes before considering a backport.

Comment 6 jliberma@redhat.com 2016-05-26 14:36:33 UTC
Is the workaround the same for OSP 7?

Comment 7 Zane Bitter 2016-05-26 14:43:08 UTC
Yes.

Comment 10 James Slagle 2016-10-14 15:23:20 UTC
*** Bug 1312989 has been marked as a duplicate of this bug. ***

Comment 11 Zane Bitter 2017-02-03 13:52:59 UTC
Fixed in OSP 10. Use the linked workaround https://access.redhat.com/site/solutions/2260561 for OSP 7/8/9.

Comment 12 Martin Schuppert 2018-11-28 12:18:52 UTC
*** Bug 1652784 has been marked as a duplicate of this bug. ***