Bug 1937070
| Summary: | Storage cluster cannot be uninstalled when cluster not fully configured | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Filip Balák <fbalak> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.7 | CC: | ebenahar, jthottan, madam, muagarwa, nberry, nigoyal, ocs-bugs, rtalur, sapillai, shan, srai, tnielsen |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.7.0-307.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-19 09:20:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Filip Balák
2021-03-09 18:15:22 UTC
Neha, it looks like the cluster is already gone when the deletion is requested; at least some resources are gone, such as the "rook-ceph-mon" secret.

@shan Even if some resources are gone, shouldn't we proceed with the uninstall from the rook side? In ocs-operator we do not have any control over these resources, so I don't think we can fix this in ocs-operator. This is probably a rook issue; if not, please change the component accordingly.

Assigning it to Nitin as he is taking a look.

Looking at the rook logs, I think the last cleanup was performed on a cluster that was already in a bad state. The second-to-last cleanup was successful, but something then triggered a new cluster creation in an empty namespace:

2021-03-09 14:29:27.583680 I | ceph-spec: adding finalizer "cephcluster.ceph.rook.io" on "ocs-storagecluster-cephcluster"
2021-03-09 14:29:27.606337 I | ceph-cluster-controller: reconciling ceph cluster in namespace ""
2021-03-09 14:29:27.606386 I | ceph-cluster-controller: clusterInfo not yet found, must be a new cluster
2021-03-09 14:29:27.655199 E | ceph-cluster-controller: failed to reconcile. failed to reconcile cluster "ocs-storagecluster-cephcluster": failed to configure local ceph cluster: failed to perform validation before cluster creation: failed to validate kms connection details: failed to fetch kms token secret "ocs-kms-token": an empty namespace may not be set when a resource name is provided
2021-03-09 14:29:28.655412 I | ceph-cluster-controller: reconciling ceph cluster in namespace ""
2021-03-09 14:29:28.655485 I | ceph-cluster-controller: clusterInfo not yet found, must be a new cluster
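The last error in that log comes from client-go itself: a namespaced Get that is given an empty namespace together with a resource name is rejected by client-go's request validation. A minimal sketch that reproduces the message (the kubeconfig setup and direct clientset call are illustrative only, not rook's actual KMS code path):

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Illustrative client setup from the local kubeconfig.
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // An empty namespace plus a concrete secret name is rejected by client-go with:
        //   "an empty namespace may not be set when a resource name is provided"
        _, err = clientset.CoreV1().Secrets("").Get(context.TODO(), "ocs-kms-token", metav1.GetOptions{})
        fmt.Println(err)
    }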
@Santosh In this code snippet, instead of failing the reconcile, how about if we just log the error, skip the cluster cleanup, and allow the finalizer to be removed?

monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
if err != nil {
    // log the error instead of failing the reconcile
    return reconcile.Result{}, errors.Wrap(err, "failed to get mon secret, no cleanup")
}
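For context, a minimal sketch of the "log and continue" behavior being proposed; this is not the actual rook patch. getCleanUpDetails and the error wrapping come from the snippet above, while startClusterCleanUp and removeFinalizer are hypothetical stand-ins for rook's real cleanup and finalizer helpers:

monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
if err != nil {
    // The mon secret is already gone (partially configured or partially
    // deleted cluster), so there is nothing left to wipe. Log and carry on
    // so the finalizer can still be removed and the CR deletion completes.
    logger.Warningf("failed to get mon secret, skipping cluster cleanup. %v", err)
} else {
    // Hypothetical stand-in for launching rook's cleanup job.
    r.clusterController.startClusterCleanUp(cephCluster, monSecret, clusterFSID)
}

// Hypothetical stand-in for removing the "cephcluster.ceph.rook.io" finalizer;
// the point is that it now runs even when the cleanup details are unavailable.
if err := removeFinalizer(r.client, cephCluster); err != nil {
    return reconcile.Result{}, errors.Wrap(err, "failed to remove finalizer")
}
return reconcile.Result{}, nil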
Acking so we can get this fix in for 4.7. Also proposing as a blocker since it's a small fix, low risk, and it improves the uninstall in corner cases.

(In reply to Travis Nielsen from comment #10)
> @Santosh In this code snippet, instead of failing the reconcile, how about
> if we just log the error, skip the cluster cleanup, and allow the finalizer
> to be removed?
>
> monSecret, clusterFSID, err := r.clusterController.getCleanUpDetails(cephCluster.Namespace)
> if err != nil {
>     // log the error instead of failing the reconcile
>     return reconcile.Result{}, errors.Wrap(err, "failed to get mon secret, no cleanup")
> }

Yes, this change should help us cover this corner case.

This seems to be fixed. There was no problem with uninstalling the cluster 10 times in a row with complete, incomplete, or misconfigured installations. --> VERIFIED

Tested with: 4.7.0-307.ci

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041