Bug 1571864 - FFU: RHELRegistration resource hangs on DELETE
Summary: FFU: RHELRegistration resource hangs on DELETE
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 13.0 (Queens)
Assignee: Jiri Stransky
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On: 1547091 1574601 1574610
Blocks:
Reported: 2018-04-25 14:46 UTC by Andrew Austin
Modified: 2019-05-08 17:45 UTC (History)
CC List: 12 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Temporary removal of Heat stack resources during fast-forward upgrade preparation triggers RHEL unregistration. As a result, RHEL unregistration is stalled because Heat software deployment signalling does not work properly. To avoid the problem, while the overcloud is still on OSP 10 and ready to perform the last overcloud minor version update:
1. Edit the template file /usr/share/openstack-tripleo-heat-templates/extraconfig/pre_deploy/rhel-registration/rhel-registration.yaml
2. Delete RHELUnregistration and RHELUnregistrationDeployment resources from the template.
3. Proceed with the minor update and fast-forward upgrade procedure.
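A minimal sketch of step 2, assuming the template has already been parsed into a plain dictionary (for example with a YAML library) and that both resources sit under the top-level `resources` key. The function name `drop_unregistration_resources` is hypothetical and not part of TripleO. It also reports remaining resources that still mention a removed name, since such references would need manual cleanup.

```python
import copy

# Resource names taken from the Doc Text above.
UNREGISTRATION_RESOURCES = ("RHELUnregistration", "RHELUnregistrationDeployment")


def drop_unregistration_resources(template):
    """Return (new_template, removed_names, dangling) without mutating the input.

    `template` is assumed to be a parsed Heat template: a dict with a
    top-level "resources" mapping. `dangling` lists the remaining resources
    whose definition still mentions a removed name.
    """
    result = copy.deepcopy(template)
    resources = result.get("resources") or {}
    removed = [name for name in UNREGISTRATION_RESOURCES if name in resources]
    for name in removed:
        del resources[name]
    # A plain string search is a crude reference check, but it catches
    # get_resource, get_attr and depends_on usages alike.
    dangling = sorted(
        res_name
        for res_name, body in resources.items()
        if any(name in repr(body) for name in removed)
    )
    return result, removed, dangling
```

If `dangling` comes back non-empty, the edited template would still reference a deleted resource and should be reviewed by hand before continuing with the minor update.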
Clone Of:
Clones: 1574601 1574610
Environment:
Last Closed: 2018-06-28 15:17:09 UTC
Target Upstream Version:


Description Andrew Austin 2018-04-25 14:46:29 UTC
Description of problem:

When executing 'openstack overcloud ffwd-upgrade prepare' on an OSP10 environment that uses RHELRegistration, Heat attempts to delete the RHELRegistration resource, which triggers a deployment of RHELUnregister.

This deployment successfully executes on the nodes, but the completion callback is never registered by Heat.


Version-Release number of selected component (if applicable):


How reproducible:
Deploy an OSP10 overcloud registered with Satellite using the included rhel-registration Heat templates and attempt to upgrade to OSP 13.

Steps to Reproduce:
1. Begin with a deployed OSP10 overcloud registered to Satellite using the rhel-registration Heat template.
2. Proceed with the FFU process to the point of running 'ffwd-upgrade prepare' on the upgraded OSP13 undercloud.
3. Observe that 'ffwd-upgrade prepare' hangs while it appears to be updating NodeExtraConfig on the overcloud resources.
4. Check 'openstack software deployment list | grep -v COMPL' and observe an in-progress deployment for each overcloud node.
5. Check the config for each of the in-progress deployments and observe that they are all RHELUnregister.
6. Observe that the overcloud nodes are no longer registered to Satellite.
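
Steps 4 and 5 can be sketched as a small filter, assuming the deployment list was exported as JSON (for example with the client's `-f json` formatter) and that each row carries `id`, `config_id`, `server_id` and `status` keys. Those key names and the sample rows are assumptions for illustration and should be checked against the actual client output.

```python
import json


def pending_deployments(rows):
    """Return rows whose status is not COMPLETE, roughly like `grep -v COMPL`."""
    return [row for row in rows if row.get("status") != "COMPLETE"]


def pending_by_server(rows):
    """Map each server_id to the config_ids of its unfinished deployments."""
    by_server = {}
    for row in pending_deployments(rows):
        by_server.setdefault(row.get("server_id"), []).append(row.get("config_id"))
    return by_server


if __name__ == "__main__":
    # Hypothetical sample; real input would come from the JSON export.
    sample = json.loads("""[
      {"id": "d1", "config_id": "c1", "server_id": "s1", "status": "COMPLETE"},
      {"id": "d2", "config_id": "c2", "server_id": "s1", "status": "IN_PROGRESS"}
    ]""")
    print(pending_by_server(sample))
```

Each returned config ID could then be inspected (for example with 'openstack software config show') to confirm that it is the RHELUnregister config.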

Actual results:
Nodes are unregistered from Satellite and the ffwd-upgrade prepare process is stuck.

Expected results:
Nodes remain registered to Satellite so that they can eventually update to OSP13.

Additional info:

Comment 1 Andrew Austin 2018-04-25 14:55:25 UTC
When attempting to manually trigger the signal with curl, the following error is returned by Heat:

<ErrorResponse><Error><Message>A bad or out-of-range value was supplied:signal is not supported for resource.
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 409, in wrapped
    return func(self, ctx, *args, **kwargs)

  File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 1824, in resource_signal
    _resource_signal(stack, rsrc, details, False)

  File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 1789, in _resource_signal
    needs_metadata_updates = rsrc.signal(details, need_check)

  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 2508, in signal
    self._handle_signal(details)

  File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 2453, in _handle_signal
    raise exception.ResourceActionNotSupported(action='signal')

ResourceActionNotSupported: signal is not supported for resource.

Comment 2 Jiri Stransky 2018-04-27 14:42:16 UTC
We hit this while testing FFWD. It's a blocker, as the overcloud Heat stack effectively got into a state which was (AFAIU) unrecoverable without manual editing of the DB.

I think we managed to work around the issue by manually applying patches on the OSP 10 templates before going through FFWD procedure. I think the patches were probably:

https://review.openstack.org/#/c/522033/
https://review.openstack.org/#/c/546144/

but will confirm with Andrew / Randy.


^ If the above is correct, we may already have a patched OSP 10.z ready, through bug 1547091.

Comment 3 Jiri Stransky 2018-04-27 15:23:47 UTC
At another look, it seems that we'll also need this one:

https://review.openstack.org/#/c/492970

Comment 4 Zane Bitter 2018-04-27 15:25:22 UTC
You'll also need https://review.openstack.org/#/c/558541/

Comment 5 Randy Rubins 2018-04-30 12:17:11 UTC
For me, changing the deployment status in Heat's software_deployment table from IN-PROGRESS to COMPLETE got the ‘ffwd-upgrade prepare’ unstuck (verified 3 times now). I also had to re-register all overcloud nodes before ‘ffwd-upgrade run’.

It would be nice if we could have 'DeleteOnRHELUnregistration: True' or something like that in OSP10, based on https://review.openstack.org/#/c/492970

Comment 6 Lukas Bezdicka 2018-05-04 15:55:36 UTC
With #1574610 we should just document the issue in FFU now.

Comment 8 Jiri Stransky 2018-05-09 09:18:22 UTC
The OSP 10 backport is MODIFIED, so I'll move this at least to POST.

Comment 10 Mike Orazi 2018-05-09 18:32:44 UTC
Jirka,

I am thinking you are saying this bug is really meant as a queue to retest once all the bits have landed in OSP 10.  Are there any merges that we need to track in this bug against 13 or should this be TestOnly or something similar?

Comment 11 Jiri Stransky 2018-05-10 08:50:10 UTC
Yes, exactly. We're just waiting for the dependent BZ to land in an OSP 10 puddle so that we can retest. I was unsure how to properly reflect that in the state of this BZ.

I'm adding TestOnly keyword. Should this stay in POST or move to some other state?

Comment 12 Jon Schlueter 2018-05-10 11:08:14 UTC
Thanks, TestOnly is the right flag to add. Once we have a build in a puddle that you can test with, update this BZ to ON_QA.

Comment 16 Jiri Stransky 2018-05-30 08:33:21 UTC
Bug 1574610 is now ON_QA, moving this bug to ON_QA too.
