1947154 – [master] [assisted operator] Unable to re-register an SNO instance if deleting CRDs during install

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	assisted-installer
Sub Component:
Version:	4.8
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Fred Rolland
QA Contact:	Chad Crum
Docs Contact:
URL:
Whiteboard:	AI-Team-Hive KNI-EDGE-4.8
Depends On:
Blocks:	1966632

Reported:	2021-04-07 19:18 UTC by Chad Crum
Modified:	2021-07-27 22:58 UTC
CC List:	5 users
Fixed In Version:	OCP-Metal-v1.0.21.3
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:57:59 UTC
Target Upstream Version:
Embargoed:
Flags:	frolland: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift assisted-service pull 1855	0	None	closed	Bug 1947154: Kubeapi cancel install before delete	2021-06-01 14:51:57 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:58:16 UTC

Description Chad Crum 2021-04-07 19:18:10 UTC

Description of problem:
If an SNO instance is in the process of installing OpenShift onto the node (using Assisted Service Operator + CRDs) and the instance is removed by deleting the relevant CRDs (ClusterDeployment / InstallEnv / Agent), it is not possible to re-register and install the same node until the installation has timed out or been manually aborted.

Version-Release number of selected component (if applicable):
Assisted Service Master (commit 5bc8d7ef053110bb3da7be9460284e930eb03b1e)

How reproducible:
100%

Steps to Reproduce:
1. Deploy an SNO cluster via CRDs + Assisted Operator
2. Delete the relevant CRDs while it is installing
3. Attempt to reapply the CRDs and start the SNO machine with a new discovery ISO

Actual results:
- Agents fail to start on the SNO machine and the Agent CR is not created
- ClusterDeployment CR says the state is "installing"

Assisted Service pod logs:
time="2021-04-07T19:07:41Z" level=error msg="failed to deregister cluster: sno-cluster-deployment: cluster 40a22dbd-d424-40dd-9100-13108fd5323b can not be removed while being installed" func="github.com/openshift/assisted-service/internal/controller/controllers.(*ClusterDeploymentsReconciler).deregisterClusterIfNeeded.func1" file="/go/src/github.com/openshift/origin/internal/controller/controllers/clusterdeployments_controller.go:652"
time="2021-04-07T19:07:46Z" level=info msg="Deregister cluster id 40a22dbd-d424-40dd-9100-13108fd5323b" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).DeregisterClusterInternal" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:655" go-id=841 pkg=Inventory request_id=
time="2021-04-07T19:07:47Z" level=error msg="failed to deregister cluster 40a22dbd-d424-40dd-9100-13108fd5323b" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).DeregisterClusterInternal" file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:674" error="cluster 40a22dbd-d424-40dd-9100-13108fd5323b can not be removed while being installed" go-id=841 pkg=Inventory request_id=
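
The log lines above are in a key=value format, so the blocking condition can be picked out mechanically instead of by reading the pod logs by eye. The following is a minimal sketch in Python, not part of assisted-service: the function names `parse_logfmt` and `is_blocked_deregister` are hypothetical, the sample line is copied from the logs above, and the parser assumes quoted values contain no stray apostrophes.

```python
import shlex

# Marker text taken verbatim from the log lines in this report.
BLOCKED_MARKER = "can not be removed while being installed"


def parse_logfmt(line):
    """Split one key=value log line into a dict.

    shlex handles double-quoted values that contain spaces.
    Tokens without '=' are ignored.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def is_blocked_deregister(line):
    """Return True if the line is an error saying the cluster is still installing."""
    fields = parse_logfmt(line)
    text = fields.get("msg", "") + " " + fields.get("error", "")
    return fields.get("level") == "error" and BLOCKED_MARKER in text


# Sample copied from the Assisted Service pod logs above.
SAMPLE = (
    'time="2021-04-07T19:07:47Z" level=error '
    'msg="failed to deregister cluster 40a22dbd-d424-40dd-9100-13108fd5323b" '
    'func="github.com/openshift/assisted-service/internal/bminventory.'
    '(*bareMetalInventory).DeregisterClusterInternal" '
    'file="/go/src/github.com/openshift/origin/internal/bminventory/inventory.go:674" '
    'error="cluster 40a22dbd-d424-40dd-9100-13108fd5323b can not be removed '
    'while being installed" go-id=841 pkg=Inventory request_id='
)

if __name__ == "__main__":
    print(is_blocked_deregister(SAMPLE))
```

Feeding each line of the pod log through `is_blocked_deregister` would flag the failed deregistration without needing the assisted UI.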



Expected results:
Installation is halted immediately (Aborted automatically?)

Additional info:

Although the issue is alluded to in the assisted service pod logs, I did not fully realize what was happening until I checked the assisted UI, which I should NOT need to do. It's a confusing situation.

Comment 1 Michael Filanov 2021-04-14 06:41:36 UTC

Will be resolved with https://issues.redhat.com/browse/MGMT-4261

Comment 2 Michael Filanov 2021-05-25 11:30:36 UTC

The issue is a bit different from what I was thinking: we are not able to delete the cluster during installation, and the user needs to cancel the installation first.
So in the cleanup case we need to check whether the cluster can be deleted, and only then delete it.
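
The ordering described here (check whether the cluster can be deleted, cancel a running installation if not, and only then delete) can be sketched as follows. This is an illustrative Python model under stated assumptions, not the assisted-service code: `FakeInventory`, `InstallInProgressError`, and `cleanup_cluster` are hypothetical names, and only the "installing" state and the error text come from this report.

```python
# States in which the backend refuses to delete a cluster.
# "installing" is the state reported in this bug; "cancelled" below is
# a hypothetical state name used only for illustration.
INSTALLING_STATES = {"installing"}


class InstallInProgressError(Exception):
    """Raised when a cluster is deregistered while it is still installing."""


class FakeInventory:
    """Hypothetical in-memory stand-in for the backend (not the real API)."""

    def __init__(self):
        self.clusters = {}  # cluster_id -> state

    def register(self, cluster_id, state):
        self.clusters[cluster_id] = state

    def cancel_installation(self, cluster_id):
        self.clusters[cluster_id] = "cancelled"

    def deregister(self, cluster_id):
        # Mirrors the behavior seen in the logs: deletion is refused mid-install.
        if self.clusters[cluster_id] in INSTALLING_STATES:
            raise InstallInProgressError(
                f"cluster {cluster_id} can not be removed while being installed"
            )
        del self.clusters[cluster_id]


def cleanup_cluster(inventory, cluster_id):
    """Cancel a running installation first, then deregister the cluster."""
    if cluster_id not in inventory.clusters:
        return  # nothing to clean up
    if inventory.clusters[cluster_id] in INSTALLING_STATES:
        inventory.cancel_installation(cluster_id)
    inventory.deregister(cluster_id)


if __name__ == "__main__":
    inv = FakeInventory()
    inv.register("sno-cluster", "installing")
    cleanup_cluster(inv, "sno-cluster")
    print(inv.clusters)
```

With this ordering, deleting the CRs mid-install no longer leaves a cluster record stuck in "installing", so the same node can be registered again.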

Comment 3 Fred Rolland 2021-05-31 12:09:39 UTC

@mfilanov We need to set priority/severity/blocker

WDYT?

Comment 4 Fred Rolland 2021-05-31 12:11:28 UTC

https://github.com/openshift/assisted-service/pull/1855

Comment 6 Chad Crum 2021-06-04 15:01:48 UTC

I have validated the fix.

Tested with quay.io/ocpmetal/assisted-service-operator-bundle@sha256:79515efe3fb20e6bdf31a67db068cb076665fd9b9227c7c829bcca6c5d9b7994


Steps:
- Deployed upstream Assisted/Hive operators on OCP 4.8 
- Created all required ZTP SNO CRs and got to the point of the SNO instance performing the install
- Deleted all SNO CRs (which deleted the Agent CR as expected)
- Recreated all of the same ZTP SNO CRs and confirmed that the SNO instance started performing the install again

Comment 9 errata-xmlrpc 2021-07-27 22:57:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
