Bug 1313885 - Ability to recover from RHELUnregistrationDeployment if the nodes are gone
Status: CLOSED NEXTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assigned To: Dan Macpherson
QA Contact: Amit Ugol
Keywords: ZStream
Duplicates: 1312989
Depends On:
Blocks:
 
Reported: 2016-03-02 09:31 EST by David Juran
Modified: 2017-03-22 02:31 EDT
CC List: 13 users

See Also:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: When deleting the overcloud, the RHN unregistration step can hang if there is a problem with the node being deleted.

Consequence: The stack delete will wait until the unregister step times out, which makes the delete appear to be hung.

Fix: The unregistration step is, in Heat terminology, a "software deployment". Deployments wait for a signal from the node before moving out of the "in progress" state. This signal can be sent to the stack manually. The first step is to determine the ID of the nested stack where the deployment exists:

heat resource-list -n5 overcloud | grep RHELUnregistrationDeployment

There is a column in that output titled "stack_name". This is the value to pass as <nested-stack-name> in the following command:

heat resource-signal <nested-stack-name> RHELUnregistrationDeployment

Result: The resource-signal command will allow Heat to move past the unregistration step and finish the delete.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-03 08:52:59 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: none


External Trackers
Tracker                            ID       Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)  2260561  None      None    None     2016-09-12 10:50 EDT
OpenStack gerrit                   301260   None      None    None     2016-04-13 17:15 EDT

Description David Juran 2016-03-02 09:31:33 EST
Description of problem:
If you've had your overcloud nodes registered to a Red Hat Satellite (or the upstream CDN) and something bad happens which wipes out a node, and you then want to delete your stack and start over, heat will currently wait for RHELUnregistrationDeployment to time out, which by default takes four hours.
A documented way of recovering would be nice.


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch

How reproducible:
Every time


Steps to Reproduce:
1. Deploy an overcloud, making sure your nodes are registered to a Satellite
2. Wipe out a node
3. heat stack-delete overcloud



Additional info:
A few suggestions for improving the situation:

1. heat stack-delete overcloud

2. heat resource-list -n5 overcloud | grep RHELUnregistrationDeployment

3. heat resource-signal <nested stack id> RHELUnregistrationDeployment

Where <nested stack id> is the name/ID of the stack containing the
RHELUnregistrationDeployment resource. In recent heat/heatclient versions
this is included as the last column of the resource-list output;
alternatively, it can be derived from the output of
heat stack-list -n | grep NodeExtraConfig.

Note that you'll repeat step (3) with a different nested stack ID once per
node, after which the delete should complete OK. (A scripted version is
sketched below.)
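
For illustration, here's an untested sketch that signals every remaining
RHELUnregistrationDeployment in one go. It assumes the stack name is the
last column of the resource-list table, as in recent heat/heatclient
versions:

# find each nested stack that still holds an unregistration deployment,
# strip the table formatting, and signal it
heat resource-list -n5 overcloud \
  | awk -F'|' '/RHELUnregistrationDeployment/ {gsub(/ /, "", $(NF-1)); print $(NF-1)}' \
  | while read stack; do
      heat resource-signal "$stack" RHELUnregistrationDeployment
    done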

Another suggestion is to run

heat deployment-delete <id_of_RHELUnregistrationDeployment> 
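
If the deployment ID isn't handy, a sketch for finding it (assuming the
deployment ID is exposed as the resource's physical_resource_id, which is
how OS::Heat::SoftwareDeployment resources normally surface it):

heat resource-show <nested stack id> RHELUnregistrationDeployment | grep physical_resource_id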

Yet another way would be to implement an independent timeout for the Software Deployment resource, but that is an RFE for heat.
Another approach would be to tell heat not to expect any signal for
unregistration:

http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Heat::SoftwareDeployment

e.g. we could set signal_transport to NO_SIGNAL (or make this easily
configurable). The main disadvantage of doing this by default is that you
then fail to catch any real unregistration failures when you're not
recovering from the nodes-aren't-there scenario under discussion here.
Comment 2 Steve Baker 2016-03-02 17:54:55 EST
Another option is to perform the scale-down steps for any node which was removed outside of heat.

Setting signal_transport:NO_SIGNAL is risky as the server may be halted before the unregister has a chance to run.

One possible fix which could happen in heat (and be backported) is for the deployment resource to check the nova server exists on DELETE, and if it doesn't then behave like NO_SIGNAL.
Comment 3 Steve Baker 2016-03-02 17:55:42 EST
Will assign to heat just so we can discuss.
Comment 4 Jay Dobies 2016-03-16 09:45:57 EDT
RFE filed for deployment timeouts: https://bugs.launchpad.net/heat/+bug/1557764

Otherwise, I'm going to flag this as a doctext bug and write up the resource-signal approach.
Comment 5 Zane Bitter 2016-05-19 13:11:38 EDT
This is fixed in Newton by the linked OpenStack Gerrit review. For RHOS 8 (Liberty) we should first document the workaround in the release notes before considering a backport.
Comment 6 jliberma@redhat.com 2016-05-26 10:36:33 EDT
Is the workaround the same for OSP 7?
Comment 7 Zane Bitter 2016-05-26 10:43:08 EDT
Yes.
Comment 10 James Slagle 2016-10-14 11:23:20 EDT
*** Bug 1312989 has been marked as a duplicate of this bug. ***
Comment 11 Zane Bitter 2017-02-03 08:52:59 EST
Fixed in OSP 10. Use the linked workaround https://access.redhat.com/site/solutions/2260561 for OSP 7/8/9.
