Description of problem:
If your overcloud nodes are registered to a Red Hat Satellite (upstream CDN), and a node is wiped out so that you want to delete the stack and start over, heat currently waits for RHELUnregistrationDeployment to time out. The default timeout is four hours. A documented way of recovering would be nice.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-121.el7ost.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Deploy an overcloud, making sure your nodes are registered to a Satellite
2. Wipe out a node
3. heat stack-delete overcloud

Additional info:
A few suggestions for improving the situation:

1. heat stack-delete overcloud
2. heat resource-list -n5 overcloud | grep RHELUnregistrationDeployment
3. heat resource-signal <nested stack id> RHELUnregistrationDeployment

Here <nested stack id> is the name or ID of the stack containing the RHELUnregistrationDeployment resource. Recent heat/heatclient versions include it as the last column of the resource-list output; alternatively, it can be derived from the output of heat stack-list -n | grep NodeExtraConfig.

Note that you repeat step 3 with a different nested stack ID each time, e.g. once per node, after which the delete should complete OK.

Another suggestion is to run heat deployment-delete <id_of_RHELUnregistrationDeployment>.

Yet another way would be to implement an independent timeout for the Software Deployment resource, but that is an RFE for heat.

Another approach would be to tell heat not to expect any signal for unregistration:

http://docs.openstack.org/developer/heat/template_guide/openstack.html#OS::Heat::SoftwareDeployment

For example, we could set signal_transport to NO_SIGNAL (or make this easily configurable). The main disadvantage of doing this by default is that you then fail to catch real unregistration failures when you are not recovering from the nodes-aren't-there scenario under discussion here.
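The resource-signal steps from the description can be sketched as a small shell helper. This is only a sketch under assumptions, not a tested recovery procedure: the function names are hypothetical, and it assumes that heat resource-list -n5 prints a pipe-delimited table whose last column is the nested stack name, which the description says holds only for recent heat/heatclient versions.

```shell
#!/bin/sh
# Hypothetical helper: read `heat resource-list -n5` table output on stdin and
# print the last column (assumed to be the nested stack name) of every row
# that mentions RHELUnregistrationDeployment, one unique name per line.
nested_stacks_for_unregistration() {
    grep RHELUnregistrationDeployment |
        awk -F'|' '{
            s = $(NF-1)                      # last populated column of the row
            gsub(/^[ \t]+|[ \t]+$/, "", s)   # trim surrounding whitespace
            if (s != "") print s
        }' |
        sort -u
}

# Hypothetical helper: signal each pending unregistration deployment in the
# given top-level stack so that an in-progress stack delete can continue.
signal_unregistration() {
    heat resource-list -n5 "$1" |
        nested_stacks_for_unregistration |
        while read -r nested; do
            heat resource-signal "$nested" RHELUnregistrationDeployment
        done
}
```

After starting the delete with heat stack-delete overcloud, this would be invoked as signal_unregistration overcloud. On older clients that lack the stack name column, the nested stack IDs would still have to be taken from the heat stack-list -n output by hand.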
Another option is to perform the scale-down steps for any node which was removed outside of heat. Setting signal_transport: NO_SIGNAL is risky, as the server may be halted before the unregistration has a chance to run. One possible fix which could happen in heat (and be backported) is for the deployment resource to check on DELETE whether the nova server exists, and if it doesn't, to behave like NO_SIGNAL.
Will assign to heat just so we can discuss.
RFE filed for deployment timeouts: https://bugs.launchpad.net/heat/+bug/1557764 Otherwise, I'm going to flag this as a doctext bug and write up the resource-signal approach.
This is fixed in Newton by the linked OpenStack Gerrit review. For RHOS 8 (Liberty) we should first document the workaround in the release notes before considering a backport.
Is the workaround the same for OSP 7?
Yes.
*** Bug 1312989 has been marked as a duplicate of this bug. ***
Fixed in OSP 10. Use the linked workaround https://access.redhat.com/site/solutions/2260561 for OSP 7/8/9.
*** Bug 1652784 has been marked as a duplicate of this bug. ***