Bug 1313885 - Ability to recover from RHELUnregistrationDeployment if the nodes are gone
Summary: Ability to recover from RHELUnregistrationDeployment if the nodes are gone
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assignee: Dan Macpherson
QA Contact: Amit Ugol
URL:
Whiteboard:
Duplicates: 1312989 1652784
Depends On:
Blocks:
 
Reported: 2016-03-02 14:31 UTC by David Juran
Modified: 2022-03-13 14:19 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: When deleting the overcloud, the RHN unregistration step can hang if there is a problem with the node being deleted.
Consequence: The stack delete waits until the unregister step times out, which makes the delete appear to be hung.
Fix: The unregistration step is, in Heat terminology, a "software deployment". Deployments wait for a signal from the node before moving out of the "in progress" state. This signal can be sent to the stack manually. First, determine the ID of the nested stack where the deployment exists:

heat resource-list -n5 overcloud | grep RHELUnregistrationDeployment

The output includes a column titled "stack_name". Pass that value as <nested-stack-name> in the following command:

heat resource-signal <nested-stack-name> RHELUnregistrationDeployment

Result: The resource-signal command allows Heat to move past the unregistration step and finish the delete.
Clone Of:
Environment:
Last Closed: 2017-02-03 13:52:59 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 301260 0 None MERGED Add check for server existence on software deployment delete 2020-12-04 21:07:34 UTC
Red Hat Issue Tracker OSP-13541 0 None None None 2022-03-13 14:19:30 UTC
Red Hat Knowledge Base (Solution) 2260561 0 None None None 2016-09-12 14:50:10 UTC

Description David Juran 2016-03-02 14:31:33 UTC
Description of problem:
If your overcloud nodes were registered to a Red Hat Satellite (upstream CDN) and something bad happens that wipes out a node, and you then want to delete your stack and start over, heat will currently wait for RHELUnregistrationDeployment to time out, which by default takes four hours.
A documented way of recovering would be nice.


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch

How reproducible:
Every time


Steps to Reproduce:
1. Deploy an overcloud, making sure your nodes are registered to a Satellite
2. Wipe out a node
3. heat stack-delete overcloud



Additional info:
A few suggestions for improving the situation:

1. heat stack-delete overcloud

2. heat resource-list -n5 overcloud | grep RHELUnregistrationDeployment

3. heat resource-signal <nested stack id> RHELUnregistrationDeployment

Where <nested stack id> is the name/ID of the stack containing the
RHELUnregistrationDeployment resource. Recent heat/heatclient versions
include this as the last column of the resource-list output; alternatively
it can be derived from the output of heat stack-list -n | grep
NodeExtraConfig.

Note that you'll repeat step (3) with a different nested stack ID, i.e. once
per node, after which the delete should complete successfully.
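For several nodes, the lookup-and-signal loop can be scripted. The sketch below is illustrative only: the sample row mimics the pipe-separated table that heat resource-list prints (the exact column layout varies by client version), and the commands are echoed rather than executed.

```shell
#!/bin/sh
# Illustrative sketch: extract the nested stack_name from resource-list style
# output and emit one resource-signal command per hung deployment. The sample
# row below is an ASSUMED resource-list line (column layout varies by client
# version); with a real overcloud you would pipe
#   heat resource-list -n5 overcloud
# into the same filter instead of using the sample.
sample_rows='| RHELUnregistrationDeployment | 1234-abcd | OS::Heat::SoftwareDeployment | DELETE_IN_PROGRESS | 2016-03-02 | overcloud-NodeExtraConfig-xyz123 |'

cmds=$(printf '%s\n' "$sample_rows" |
       grep RHELUnregistrationDeployment |
       awk -F'|' '{gsub(/ /, "", $7); print "heat resource-signal " $7 " RHELUnregistrationDeployment"}')
printf '%s\n' "$cmds"
```

Review the emitted commands before running them, since signalling the wrong deployment on a healthy node would skip a real unregistration.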

Another suggestion is to run

heat deployment-delete <id_of_RHELUnregistrationDeployment> 
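In the same illustrative vein, the deployment ID could be pulled out of heat deployment-list output; the sample row and column position below are assumptions about that table's format, so check them against what your client actually prints.

```shell
#!/bin/sh
# Illustrative sketch: pick the deployment ID out of a deployment-list style
# row and emit the matching deployment-delete command. The sample row is an
# ASSUMED line; on a real system, pipe `heat deployment-list` through the
# same filter after confirming which column holds the ID.
sample_row='| 5678-efgh | overcloud-compute-0 | IN_PROGRESS | 2016-03-02 |'

cmd=$(printf '%s\n' "$sample_row" |
      awk -F'|' '{gsub(/ /, "", $2); print "heat deployment-delete " $2}')
printf '%s\n' "$cmd"
```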

Yet another way would be to implement an independent timeout for the Software Deployment resource, but that is an RFE for heat.
Another approach would be to tell heat not to expect any signal for
unregistration:

http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Heat::SoftwareDeployment

e.g. we could set signal_transport to NO_SIGNAL (or make this easily
configurable). The main disadvantage of doing this by default is that you
then fail to catch any real unregistration failures when you're not
recovering from the nodes-are-gone scenario under discussion here.
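For reference, a template fragment setting that property might look like the following. The resource shape is a hypothetical sketch based on the OS::Heat::SoftwareDeployment documentation linked above, not the actual tripleo-heat-templates definition; only the signal_transport property name and the NO_SIGNAL value come from the docs.

```shell
#!/bin/sh
# Hypothetical template fragment: a SoftwareDeployment with signal_transport
# set to NO_SIGNAL, so heat would not wait for a signal from the node.
# The resource layout is illustrative, not the real tripleo template.
fragment=$(cat <<'EOF'
  RHELUnregistrationDeployment:
    type: OS::Heat::SoftwareDeployment
    properties:
      signal_transport: NO_SIGNAL
EOF
)
printf '%s\n' "$fragment"
```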

Comment 2 Steve Baker 2016-03-02 22:54:55 UTC
Another option is to perform the scale-down steps for any node which was removed outside of heat.

Setting signal_transport:NO_SIGNAL is risky as the server may be halted before the unregister has a chance to run.

One possible fix which could happen in heat (and be backported) is for the deployment resource to check the nova server exists on DELETE, and if it doesn't then behave like NO_SIGNAL.

Comment 3 Steve Baker 2016-03-02 22:55:42 UTC
Will assign to heat just so we can discuss.

Comment 4 Jay Dobies 2016-03-16 13:45:57 UTC
RFE filed for deployment timeouts: https://bugs.launchpad.net/heat/+bug/1557764

Otherwise, I'm going to flag this as a doctext bug and write up the resource-signal approach.

Comment 5 Zane Bitter 2016-05-19 17:11:38 UTC
This is fixed in Newton by the linked OpenStack Gerrit review. For RHOS 8 (Liberty) we should first document the workaround in the release notes before considering a backport.

Comment 6 jliberma@redhat.com 2016-05-26 14:36:33 UTC
Is the workaround the same for OSP 7?

Comment 7 Zane Bitter 2016-05-26 14:43:08 UTC
Yes.

Comment 10 James Slagle 2016-10-14 15:23:20 UTC
*** Bug 1312989 has been marked as a duplicate of this bug. ***

Comment 11 Zane Bitter 2017-02-03 13:52:59 UTC
Fixed in OSP 10. Use the linked workaround https://access.redhat.com/site/solutions/2260561 for OSP 7/8/9.

Comment 12 Martin Schuppert 2018-11-28 12:18:52 UTC
*** Bug 1652784 has been marked as a duplicate of this bug. ***

