Bug 1966632

Summary: [4.8.0] [assisted operator] Unable to re-register an SNO instance if deleting CRDs during install
Product: OpenShift Container Platform
Reporter: Antoni Segura Puimedon <asegurap>
Component: assisted-installer
Assignee: Fred Rolland <frolland>
assisted-installer sub component: assisted-service
QA Contact: Chad Crum <ccrum>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: urgent
CC: alazar, aos-bugs, ccrum
Version: 4.8
Keywords: Triaged
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: AI-Team-Hive KNI-EDGE-4.8
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 23:10:52 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1947154
Bug Blocks:

Description Antoni Segura Puimedon 2021-06-01 14:50:29 UTC
This bug was initially created as a copy of Bug #1947154

Description of problem:
While an SNO instance is installing OpenShift onto the node (using the Assisted Service Operator + CRDs), if the instance is removed by deleting the relevant CRDs (ClusterDeployment / InstallEnv / Agent), it is not possible to re-register and install the same node until the installation has timed out or been manually aborted.

Version-Release number of selected component (if applicable):
Assisted Service Master (commit 5bc8d7ef053110bb3da7be9460284e930eb03b1e)

How reproducible:
100%

Steps to Reproduce:
1. Deploy an SNO cluster via CRDs + Assisted Operator
2. Delete the relevant CRDs while it is installing
3. Attempt to reapply the CRDs + start the SNO machine with a new discovery ISO

Actual results:
- Agents fail to start on the SNO machine and the Agent CR is not created
- ClusterDeployment CR says the state is "installing"

Assisted Service pod logs:
time="2021-04-07T19:07:41Z" level=error msg="failed to deregister cluster: sno-cluster-deployment: cluster 40a22dbd-d424-40dd-9100-13108fd5323b can not be removed while being installed" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).deregisterClusterIfNeeded.func1" file="/go/src/github.com/openshift/origin/internal/controller/controllers/clusterdeployments_controller.go:652"
time="2021-04-07T19:07:46Z" level=info msg="Deregister cluster id 40a22dbd-d424-40dd-9100-13108fd5323b" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).DeregisterClusterInternal" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:655" go-id=841 pkg=Inventory request_id=
time="2021-04-07T19:07:47Z" level=error msg="failed to deregister cluster 40a22dbd-d424-40dd-9100-13108fd5323b" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).DeregisterClusterInternal" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:674" error="cluster 40a22dbd-d424-40dd-9100-13108fd5323b can not be removed while being installed" go-id=841 pkg=Inventory request_id=
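For anyone triaging the same symptom from the pod logs rather than the UI, here is a minimal sketch of pulling the blocked-deregistration errors out of key=value log lines like the ones above. It assumes every value containing spaces is double-quoted, as in this excerpt; the helper names `parse_logrus_line` and `blocked_deregistrations` are hypothetical and not part of assisted-service.

```python
import shlex

# Error text taken verbatim from the log excerpt in this report.
MARKER = "can not be removed while being installed"


def parse_logrus_line(line):
    """Split one key=value log line into a dict of fields.

    Assumption: values with spaces are double-quoted, as in the excerpt.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip stray tokens that are not key=value pairs
            fields[key] = value
    return fields


def blocked_deregistrations(lines):
    """Yield parsed error lines reporting a cluster that cannot be removed."""
    for line in lines:
        fields = parse_logrus_line(line)
        if fields.get("level") != "error":
            continue
        # The marker shows up in msg= on some lines and in error= on others.
        text = fields.get("msg", "") + " " + fields.get("error", "")
        if MARKER in text:
            yield fields


if __name__ == "__main__":
    # Shortened copy of the third log line above (func= and file= omitted).
    sample = (
        'time="2021-04-07T19:07:47Z" level=error '
        'msg="failed to deregister cluster 40a22dbd-d424-40dd-9100-13108fd5323b" '
        'error="cluster 40a22dbd-d424-40dd-9100-13108fd5323b can not be removed '
        'while being installed" go-id=841 pkg=Inventory request_id='
    )
    for hit in blocked_deregistrations([sample]):
        print(hit["time"], hit.get("error") or hit["msg"])
```

In practice the lines could be piped in from the assisted-service pod logs instead of the hard-coded sample.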



Expected results:
Installation is halted immediately (Aborted automatically?)
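
To make the expected behavior concrete, here is a rough sketch of the flow I would expect when the CRs are deleted mid-install: cancel the in-progress installation first, then deregister. This is a toy in-memory model only. `Inventory`, `Cluster`, `Status` and `deregister_on_cr_delete` are hypothetical names, and this is not the assisted-service code nor necessarily how the fix was implemented.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    READY = "ready"
    INSTALLING = "installing"
    CANCELLED = "cancelled"


@dataclass
class Cluster:
    cluster_id: str
    status: Status


class Inventory:
    """Hypothetical stand-in for the service's cluster inventory."""

    def __init__(self):
        self.clusters = {}

    def register(self, cluster):
        self.clusters[cluster.cluster_id] = cluster

    def cancel_installation(self, cluster_id):
        self.clusters[cluster_id].status = Status.CANCELLED

    def deregister(self, cluster_id):
        cluster = self.clusters[cluster_id]
        # Mirrors the refusal seen in the pod logs.
        if cluster.status is Status.INSTALLING:
            raise RuntimeError(
                f"cluster {cluster_id} can not be removed while being installed"
            )
        del self.clusters[cluster_id]


def deregister_on_cr_delete(inventory, cluster_id):
    """Expected behavior: abort an in-progress install, then deregister."""
    cluster = inventory.clusters.get(cluster_id)
    if cluster is None:
        return  # already gone, nothing to do
    if cluster.status is Status.INSTALLING:
        inventory.cancel_installation(cluster_id)
    inventory.deregister(cluster_id)
```

With this ordering the same node can be registered again right after the CRs are removed, instead of waiting for the install to time out.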

Additional info:

Although the issue is alluded to in the assisted service pod logs, I did not fully realize what was happening until I checked the Assisted UI, which I should NOT need to do. It's a confusing situation.

Comment 3 Chad Crum 2021-06-19 13:33:06 UTC
I have validated the fix.

- 2.3.0-DOWNSTREAM-2021-06-17-01-26-58
- 4.8.0-fc.7


Steps:
- Deployed upstream Assisted/Hive operators on OCP 4.8 
- Created all required ZTP SNO CRs and got to the point of the SNO instance performing the install
- Deleted all SNO CRs (which deleted the Agent CR as expected)
- Recreated all of the same ZTP SNO CRs and confirmed that the SNO instance started performing the install again

Comment 5 errata-xmlrpc 2021-07-27 23:10:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438