Description of problem (please be as detailed as possible and provide log snippets):

After deleting the Storage Cluster with KMS integration and reinstalling it 3 times, the cluster is in an error state with errors in the rook-ceph-operator pod:

2021-03-09 14:39:33.879492 E | clusterdisruption-controller: failed to check cluster health: failed to get status. timed out . : exit status 1

I tried to delete the cluster after that, but the deletion got stuck with errors:

2021-03-09 15:02:14.601227 I | ceph-cluster-controller: deleting ceph cluster "ocs-storagecluster-cephcluster"
2021-03-09 15:02:14.623083 E | ceph-cluster-controller: failed to reconcile. failed to get mon secret, no cleanup: failed to get cluster info: not expected to create new cluster info and did not find existing secret
2021-03-09 15:10:22.283342 E | clusterdisruption-controller: failed to check cluster health: failed to get status. timed out . : exit status 1

Version of all relevant components (if applicable):
OCS ocs-operator.v4.7.0-284.ci
OCP 4.7.0-0.nightly-2021-03-06-183610

Is this issue reproducible?
I am not sure.

Can this issue be reproduced from the UI?
Partially.

Steps to Reproduce:
1. Install the OCP and OCS operators and set up a Vault KMS server.
2. Create an OCS Storage cluster with cluster-wide encryption that uses Vault.
3. Delete the cluster after installation is completed.
4. Delete the KMS resources because of https://bugzilla.redhat.com/show_bug.cgi?id=1925249 (a sketch of this step is included under Additional info below).
5. Repeat steps 2-4 multiple times (it took 3-4 times in my case).

Actual results:
The cluster is in a broken state and cannot be easily deleted.

Expected results:
The cluster should work. Uninstallation and installation of the cluster should work.

Additional info:
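For step 4, a minimal client-go sketch of what deleting the leftover KMS resources amounts to. The resource names and namespace here are assumptions about this environment (the token secret name matches the one seen in the operator log in a later comment); in practice the resources were removed via oc / the console, not this code.

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig from the default location (illustration only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "openshift-storage" // assumed OCS namespace

	// Remove the Vault token secret and the KMS connection-details ConfigMap that
	// are left behind after StorageCluster deletion (see bug 1925249). Errors are
	// ignored here only to keep the sketch short.
	_ = client.CoreV1().Secrets(ns).Delete(context.TODO(), "ocs-kms-token", metav1.DeleteOptions{})
	_ = client.CoreV1().ConfigMaps(ns).Delete(context.TODO(), "ocs-kms-connection-details", metav1.DeleteOptions{})
}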
Neha, it looks like the cluster is already gone when the deletion is requested; at least some resources are gone, such as the "rook-ceph-mon" secret.
@shan Even if some resources are gone, shouldn't we proceed with the uninstall from the Rook side? In ocs-operator, we do not have any control over these resources, so I don't think we can fix this in ocs-operator.
This is probably a Rook issue; please change the component accordingly if not. Assigning it to Nitin as he is taking a look.
Looking at the Rook logs, I think the last cleanup was performed on a cluster that was already in a bad state. The second-to-last cleanup was successful, but something then triggered a new cluster creation in an empty namespace:

2021-03-09 14:29:27.583680 I | ceph-spec: adding finalizer "cephcluster.ceph.rook.io" on "ocs-storagecluster-cephcluster"
2021-03-09 14:29:27.606337 I | ceph-cluster-controller: reconciling ceph cluster in namespace ""
2021-03-09 14:29:27.606386 I | ceph-cluster-controller: clusterInfo not yet found, must be a new cluster
2021-03-09 14:29:27.655199 E | ceph-cluster-controller: failed to reconcile. failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to fetch kms token secret "ocs-kms-token": an empty namespace may not be set when a resource name is provided
2021-03-09 14:29:28.655412 I | ceph-cluster-controller: reconciling ceph cluster in namespace ""
2021-03-09 14:29:28.655485 I | ceph-cluster-controller: clusterInfo not yet found, must be a new cluster
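The last error in that reconcile is the validation error client-go returns when a namespaced Get is attempted with a resource name but an empty namespace. A minimal, self-contained sketch that reproduces the same error shape (this is not Rook code; kubeconfig loading and error handling are simplified for illustration):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Namespace is "" while a resource name is given, mirroring the reconcile
	// above that ran in an empty namespace.
	_, err = client.CoreV1().Secrets("").Get(context.TODO(), "ocs-kms-token", metav1.GetOptions{})

	// Prints the "an empty namespace may not be set when a resource name is
	// provided" validation error seen in the operator log.
	fmt.Println(err)
}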
@Santosh In this code snippet, instead of failing the reconcile, how about if we just log the error, skip the cluster cleanup, and allow the finalizer to be removed?

monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
if err != nil {
    // log the error instead of failing the reconcile
    return reconcile.Result{}, errors.Wrap(err, "failed to get mon secret, no cleanup")
}
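For illustration only, a rough sketch of what that suggestion could look like. This is not the actual Rook change; logger, startClusterCleanUp, removeFinalizer, request, and r.client are placeholder names for whatever the controller already uses, not confirmed Rook functions.

monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
if err != nil {
    // The cluster data is already gone (or was never fully created), so there is
    // nothing to clean up; log instead of blocking the deletion.
    logger.Warningf("failed to get mon secret, skipping cleanup. %v", err)
} else {
    // Placeholder helper: run the usual cleanup with the retrieved details.
    r.clusterController.startClusterCleanUp(cephCluster, monSecret, clusterFSID)
}

// In either case, remove the finalizer so the CephCluster resource can be deleted.
if err := removeFinalizer(r.client, request.NamespacedName); err != nil {
    return reconcile.Result{}, errors.Wrap(err, "failed to remove finalizer")
}
return reconcile.Result{}, nil

The key point is that a failed getCleanUpDetails no longer blocks finalizer removal, so deletion can complete even when the mon secret is already gone.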
Acking so we can get this fix in for 4.7.
Also proposing as a blocker since it's a small fix, low risk, and improves the uninstall in corner cases.
(In reply to Travis Nielsen from comment #10)
> @Santosh In this code snippet, instead of failing the reconcile, how about
> if we just log the error, skip the cluster cleanup, and allow the finalizer
> to be removed?
>
> monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
> if err != nil {
>     // log the error instead of failing the reconcile
>     return reconcile.Result{}, errors.Wrap(err, "failed to get mon secret, no cleanup")
> }

Yes, this change should help us cover this corner case.
This seems to be fixed. There was no problem with uninstalling the cluster 10 times in a row with complete, incomplete, or misconfigured installations. --> VERIFIED

Tested with: 4.7.0-307.ci
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041