Description of problem (please be as detailed as possible and provide log snippets):

After deleting the Storage Cluster with KMS integration and reinstalling it 3 times, the cluster is in an error state with errors in the rook-ceph-operator pod:

2021-03-09 14:39:33.879492 E | clusterdisruption-controller: failed to check cluster health: failed to get status. timed out . : exit status 1

I tried to delete the cluster after that, but the deletion got stuck with errors:

2021-03-09 15:02:14.601227 I | ceph-cluster-controller: deleting ceph cluster "ocs-storagecluster-cephcluster"
2021-03-09 15:02:14.623083 E | ceph-cluster-controller: failed to reconcile. failed to get mon secret, no cleanup: failed to get cluster info: not expected to create new cluster info and did not find existing secret
2021-03-09 15:10:22.283342 E | clusterdisruption-controller: failed to check cluster health: failed to get status. timed out . : exit status 1

Version of all relevant components (if applicable):
OCS ocs-operator.v4.7.0-284.ci
OCP 4.7.0-0.nightly-2021-03-06-183610

Is this issue reproducible?
I am not sure.

Can this issue be reproduced from the UI?
Partially.

Steps to Reproduce:
1. Install the OCP and OCS operators and set up a Vault KMS server.
2. Create an OCS Storage cluster with cluster-wide encryption that uses Vault.
3. Delete the cluster after installation is completed.
4. Delete the KMS resources because of https://bugzilla.redhat.com/show_bug.cgi?id=1925249 (a sketch of this step is included under Additional info below).
5. Repeat steps 2-4 multiple times (it took 3-4 times in my case).

Actual results:
The cluster is in a broken state and cannot be easily deleted.

Expected results:
The cluster should work. Uninstallation and installation of the cluster should work.

Additional info:
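For step 4, a minimal client-go sketch of what deleting the leftover KMS resources amounts to. The resource names and namespace here are assumptions about this environment (the token secret name matches the one seen in the operator log in a later comment); in practice the resources were removed via oc / the console, not this code.

package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig from the default location (illustration only).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ns := "openshift-storage" // assumed OCS namespace

	// Remove the Vault token secret and the KMS connection-details ConfigMap that
	// are left behind after StorageCluster deletion (see bug 1925249). Errors are
	// ignored here only to keep the sketch short.
	_ = client.CoreV1().Secrets(ns).Delete(context.TODO(), "ocs-kms-token", metav1.DeleteOptions{})
	_ = client.CoreV1().ConfigMaps(ns).Delete(context.TODO(), "ocs-kms-connection-details", metav1.DeleteOptions{})
}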
Neha, it looks like the cluster is already gone when the deletion is requested; at least some resources are gone, such as the "rook-ceph-mon" secret.
@shan Even if some resources are gone, shouldn't we proceed with the uninstall from the Rook side? In ocs-operator, we do not have any control over these resources, so I don't think we can fix this in ocs-operator.
This is probably a Rook issue; please change the component accordingly if not. Assigning it to Nitin as he is taking a look.
Looking at the Rook logs, I think the last cleanup was performed on a cluster that was already in a bad state. The second-to-last cleanup was successful, but something then triggered a new cluster creation in an empty namespace:

2021-03-09 14:29:27.583680 I | ceph-spec: adding finalizer "cephcluster.ceph.rook.io" on "ocs-storagecluster-cephcluster"
2021-03-09 14:29:27.606337 I | ceph-cluster-controller: reconciling ceph cluster in namespace ""
2021-03-09 14:29:27.606386 I | ceph-cluster-controller: clusterInfo not yet found, must be a new cluster
2021-03-09 14:29:27.655199 E | ceph-cluster-controller: failed to reconcile. failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to fetch kms token secret "ocs-kms-token": an empty namespace may not be set when a resource name is provided
2021-03-09 14:29:28.655412 I | ceph-cluster-controller: reconciling ceph cluster in namespace ""
2021-03-09 14:29:28.655485 I | ceph-cluster-controller: clusterInfo not yet found, must be a new cluster
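The last error in that reconcile is the validation error client-go returns when a namespaced Get is attempted with a resource name but an empty namespace. A minimal, self-contained sketch that reproduces the same error shape (this is not Rook code; kubeconfig loading and error handling are simplified for illustration):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Namespace is "" while a resource name is given, mirroring the reconcile
	// above that ran in an empty namespace.
	_, err = client.CoreV1().Secrets("").Get(context.TODO(), "ocs-kms-token", metav1.GetOptions{})

	// Prints the "an empty namespace may not be set when a resource name is
	// provided" validation error seen in the operator log.
	fmt.Println(err)
}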
@Santosh In this code snippet, instead of failing the reconcile, how about if we just log the error, skip the cluster cleanup, and allow the finalizer to be removed?

monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
if err != nil {
    // log the error instead of failing the reconcile
    return reconcile.Result{}, errors.Wrap(err, "failed to get mon secret, no cleanup")
}
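For illustration only, a rough sketch of what that suggestion could look like. This is not the actual Rook change; logger, startClusterCleanUp, removeFinalizer, request, and r.client are placeholder names for whatever the controller already uses, not confirmed Rook functions.

monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
if err != nil {
    // The cluster data is already gone (or was never fully created), so there is
    // nothing to clean up; log instead of blocking the deletion.
    logger.Warningf("failed to get mon secret, skipping cleanup. %v", err)
} else {
    // Placeholder helper: run the usual cleanup with the retrieved details.
    r.clusterController.startClusterCleanUp(cephCluster, monSecret, clusterFSID)
}

// In either case, remove the finalizer so the CephCluster resource can be deleted.
if err := removeFinalizer(r.client, request.NamespacedName); err != nil {
    return reconcile.Result{}, errors.Wrap(err, "failed to remove finalizer")
}
return reconcile.Result{}, nil

The key point is that a failed getCleanUpDetails no longer blocks finalizer removal, so deletion can complete even when the mon secret is already gone.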
Acking so we can get this fix in for 4.7.
Also proposing as a blocker since it's a small fix, low risk, and improves the uninstall in corner cases.
(In reply to Travis Nielsen from comment #10)
> @Santosh In this code snippet, instead of failing the reconcile, how about
> if we just log the error, skip the cluster cleanup, and allow the finalizer
> to be removed?
>
> monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
> if err != nil {
>     // log the error instead of failing the reconcile
>     return reconcile.Result{}, errors.Wrap(err, "failed to get mon secret, no cleanup")
> }

Yes, this change should help us cover this corner case.
This seems to be fixed. There was no problem with uninstalling the cluster 10 times in a row with complete, incomplete, or misconfigured installations. --> VERIFIED

Tested with: 4.7.0-307.ci
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041