Bug 1860670
| Summary: | OCS 4.5 Uninstall External: Openshift-storage namespace in Terminating state as CephObjectStoreUser had finalizers remaining | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Neha Berry <nberry> |
| Component: | ocs-operator | Assignee: | Raghavendra Talur <rtalur> |
| Status: | CLOSED ERRATA | QA Contact: | Neha Berry <nberry> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | aclewett, aeyal, assingh, ebenahar, jarrpa, jelopez, madam, muagarwa, nbecker, ocs-bugs, rtalur, sostapov |
| Target Milestone: | --- | Keywords: | AutomationBackLog |
| Target Release: | OCS 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.6.0-116.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-12-17 06:23:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1886873 | | |
| Bug Blocks: | | | |
Description
Neha Berry, 2020-07-26 13:05:40 UTC
Though the workaround was known and the uninstall docs point to the article https://access.redhat.com/solutions/3881901 for resolving these kinds of issues, we raised this BZ because it was seen during uninstall of an External Mode cluster (a new feature of OCS 4.5). Hence, we wanted to confirm whether there is a genuine issue with the deletion of the CephObjectStores during namespace deletion in External Mode. Thanks.

Proposing as a blocker in order to keep this BZ in 4.5, at least until it is investigated. We need to make sure that the uninstall is not broken with External Mode clusters.

@Jose, please confirm if this is a blocker. We have one more NooBaa-related uninstall issue (https://bugzilla.redhat.com/show_bug.cgi?id=1860418). @Nimrod, can someone take a look at both issues?

My bad, https://bugzilla.redhat.com/show_bug.cgi?id=1860418 is not NooBaa related; I have updated the BZ. But this one needs some expertise from the NooBaa team.

As discussed in a meeting today between engineering and QE, moving this to OCS 4.6. We will document a workaround for OCS 4.5.

jrivera, per Comment 13 and Comment 14, it seems that this BZ is already fixed in OCS 4.5. Is this correct? Was it fixed by Bug 1849105?

Seems like it! Reassigning to Talur for completion and moving to ON_QA.

As already pointed out, the finalizers remaining behind for CephObjectStoreUser was one of many intermittent issues seen during project deletion; we have seen many other resources getting stuck as well. Also, since this issue is seen very intermittently, we cannot be 100% sure that the code fixed it. I observed the same CephObjectStore finalizer issue while uninstalling 4.5.0-518 (1 out of 5 recent attempts). So it exists, but is rarely seen.
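For context on the linked solution article: when a namespace hangs in Terminating, the usual first step is to find which namespaced resources are still holding it open. A generic diagnostic pattern (not a command taken from this BZ; only the namespace name matches the bug):

```shell
# Enumerate every namespaced resource type and list any instances
# still present in the stuck namespace. Resources that show up here
# are what the namespace controller is waiting on (often because of
# finalizers that their owning operator never removed).
oc api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 oc get --show-kind --ignore-not-found -n openshift-storage
```

In the failure described here, this kind of listing would show the leftover CephObjectStoreUser with its finalizer still set.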
Also, if these issues do re-occur, we are adding a troubleshooting guide link describing how to patch the resources with finalizers: null - https://bugzilla.redhat.com/show_bug.cgi?id=1866809 https://docs.google.com/document/d/1_6VzcV_uaPaXUSaRSb9CDapfrwFsu6Klqbs-By6KCKw/edit#

Based on Comment 13, Comment 14, and our recent tests on 4.5.0-518 and 4.5.0-521, the issue was seen once across all these attempts. Let me know if we can still move the BZ to verified state.

It is currently targeted for 4.6. We moved it out of 4.5 because of the race. Although the other fixes have reduced the probability of hitting this bug, we don't think it should be considered fixed yet. We will move it to ON_QA for 4.6. Moving it back to ASSIGNED for now.

Talur, are we waiting for further changes, or can this be moved to ON_QA?

The OCS uninstall in the latest OCS 4.6 is getting stuck due to a remaining CephObjectStoreUser. Hence, until that bug is fixed, we cannot verify this BZ: Bug 1886873 - [OCS 4.6 External/Internal Uninstall] - Storage Cluster deletion stuck indefinitely, "failed to delete object store", remaining users: [noobaa-ceph-objectstore-user]

Verified the fix on an OCS 4.6 (4.6.0-144.ci) external mode cluster. Will test in internal mode too, before moving the BZ to verified state.

1. Created an OCS external mode cluster; the cluster is in Connected state.
2. Triggered OCS uninstall by deleting the storagecluster.
3. Deleted the namespace.

Observation: The storage cluster deletion succeeds, followed by successful deletion of the namespace. The operator now issues separate commands to delete the cephobjectstore and cephobjectstoreuser, so the chance of these resources staying behind is now gone.
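The finalizers:null workaround referenced in the troubleshooting guide amounts to a merge patch on the stuck resource. A hedged sketch (the resource name matches this bug; the exact command is illustrative, and clearing finalizers skips Rook's normal cleanup, so it should only be used when the operator can no longer reconcile the resource):

```shell
# Merge-patch the stuck CephObjectStoreUser so its finalizer list is
# emptied and namespace deletion can proceed. This bypasses the Rook
# cleanup path; illustrative, not an officially scripted step.
oc patch cephobjectstoreuser noobaa-ceph-objectstore-user \
  -n openshift-storage --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```

The same pattern applies to any other resource type left behind with FinalizersRemaining.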
But, if for any reason the whole uninstall process is affected (say the cluster was not in a good state prior to uninstall) and namespace deletion gets stuck due to FinalizersRemaining on some resources, we can always use the `oc patch` command to patch them with null (added in the troubleshooting guide).

OCP = 4.6.0-0.nightly-2020-10-22-034051
OCS = ocs-operator.v4.6.0-144.ci

Before triggering uninstall:

```
Wed Oct 28 16:45:23 UTC 2020
======== CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-144.ci   OpenShift Container Storage   4.6.0-144.ci              Succeeded

======= PODS ======
NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
csi-cephfsplugin-hh85d                          3/3     Running   0          35m   10.1.160.165   compute-0   <none>           <none>
csi-cephfsplugin-n7rgp                          3/3     Running   0          35m   10.1.160.180   compute-2   <none>           <none>
csi-cephfsplugin-nvnmn                          3/3     Running   0          35m   10.1.160.161   compute-1   <none>           <none>
csi-cephfsplugin-provisioner-56455449bd-6cmhn   6/6     Running   0          35m   10.131.0.205   compute-1   <none>           <none>
csi-cephfsplugin-provisioner-56455449bd-bnnvk   6/6     Running   0          35m   10.129.2.94    compute-2   <none>           <none>
csi-rbdplugin-68wgt                             3/3     Running   0          35m   10.1.160.165   compute-0   <none>           <none>
csi-rbdplugin-6xfvz                             3/3     Running   0          35m   10.1.160.180   compute-2   <none>           <none>
csi-rbdplugin-7wjdv                             3/3     Running   0          35m   10.1.160.161   compute-1   <none>           <none>
csi-rbdplugin-provisioner-586fc6cfc-d55ds       6/6     Running   0          35m   10.128.2.68    compute-0   <none>           <none>
csi-rbdplugin-provisioner-586fc6cfc-nh2br       6/6     Running   0          35m   10.131.0.204   compute-1   <none>           <none>
noobaa-core-0                                   1/1     Running   0          35m   10.128.2.69    compute-0   <none>           <none>
noobaa-db-0                                     1/1     Running   0          35m   10.131.0.206   compute-1   <none>           <none>
noobaa-endpoint-58dc95697d-4gnzc                1/1     Running   0          34m   10.131.0.207   compute-1   <none>           <none>
noobaa-operator-7bcf846c94-h722m                1/1     Running   0          36m   10.131.0.203   compute-1   <none>           <none>
ocs-metrics-exporter-777dc7b97f-4v4hm           1/1     Running   0          36m   10.129.2.93    compute-2   <none>           <none>
ocs-operator-86846df567-gmp25                   1/1     Running   0          36m   10.129.2.91    compute-2   <none>           <none>
rook-ceph-operator-f44db9fbf-4bkrh              1/1     Running   0          36m   10.129.2.92    compute-2   <none>           <none>

======= PVC ==========
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
db-noobaa-db-0   Bound    pvc-4c1a12e0-d866-4fe0-842d-95061698db86   50Gi       RWO            ocs-external-storagecluster-ceph-rbd   35m

======= storagecluster ==========
NAME                          AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-external-storagecluster   35m   Ready   true       2020-10-28T16:10:00Z   4.6.0
```

Watching the object store resources:

```
$ while true; do oc get cephobjectstore -n openshift-storage ; oc get cephobjectstoreuser; sleep 5; done
NAME                                          AGE
ocs-external-storagecluster-cephobjectstore   35m
NAME                           AGE
noobaa-ceph-objectstore-user   35m
```

2. Deleted the storage cluster:

```
$ date --utc; oc delete -n openshift-storage storagecluster --all --wait=true
Wed Oct 28 16:45:42 UTC 2020
storagecluster.ocs.openshift.io "ocs-external-storagecluster" deleted
```

3. Deleted the project:

```
$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted
$ oc get project openshift-storage
Error from server (NotFound): namespaces "openshift-storage" not found
```

rook-ceph-operator log snip:

```
2020-10-28 16:46:01.516215 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user". . could not remove user: unable to remove user, must specify purge data to remove user with buckets: failed to delete s3 user: exit status 17
2020-10-28 16:46:02.575081 I | ceph-spec: object "rook-ceph-config" matched on delete, reconciling
2020-10-28 16:46:02.575201 I | ceph-spec: removing finalizer "cephcluster.ceph.rook.io" on "ocs-external-storagecluster-cephcluster"
2020-10-28 16:46:02.591833 E | clusterdisruption-controller: cephcluster "openshift-storage/ocs-external-storagecluster-cephcluster" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.639919 I | ceph-spec: object "rook-ceph-mgr-external" matched on delete, reconciling
2020-10-28 16:46:02.711974 E | clusterdisruption-controller: cephcluster "openshift-storage/" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.712153 I | ceph-spec: removing finalizer "cephobjectstore.ceph.rook.io" on "ocs-external-storagecluster-cephobjectstore"
2020-10-28 16:46:02.739777 E | clusterdisruption-controller: cephcluster "openshift-storage/" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.755733 I | ceph-spec: object "rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore" matched on delete, reconciling
2020-10-28 16:46:02.795772 E | ceph-object-store-user-controller: failed to reconcile failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2020-10-28 16:46:03.796028 I | ceph-spec: removing finalizer "cephobjectstoreuser.ceph.rook.io" on "noobaa-ceph-objectstore-user"
2020-10-28 16:46:03.825505 I | ceph-spec: object "rook-ceph-object-user-ocs-external-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user" matched on delete, reconciling
```

ocs-operator log snip:

```
{"level":"info","ts":"2020-10-28T16:46:02.712Z","logger":"controller_storagecluster","msg":"Uninstall in progress","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","Status":"Uninstall: Waiting for cephObjectStore ocs-external-storagecluster-cephobjectstore to be deleted"}
{"level":"info","ts":"2020-10-28T16:46:02.756Z","logger":"controller_storagecluster","msg":"Reconciling external StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephCluster not found, can't set the cleanup policy and uninstall mode","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: NooBaa not found, can't set UninstallModeForced","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"NooBaa and noobaa-core PVC not found.","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephCluster not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephObjectStoreUser not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephObjectStoreUser Name":"ocs-external-storagecluster-cephobjectstoreuser"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephObjectStore not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephObjectStore Name":"ocs-external-storagecluster-cephobjectstore"}
{"level":"info","ts":"2020-10-28T16:46:02.898Z","logger":"controller_storagecluster","msg":"Uninstall: CephFilesystem not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephFilesystem Name":"ocs-external-storagecluster-cephfilesystem"}
{"level":"info","ts":"2020-10-28T16:46:02.999Z","logger":"controller_storagecluster","msg":"Uninstall: CephBlockPool not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephBlockPool Name":"ocs-external-storagecluster-cephblockpool"}
```

Watch loop after uninstall:

```
$ while true; do oc get cephobjectstore -n openshift-storage ; oc get cephobjectstoreuser; sleep 5; done
No resources found in openshift-storage namespace.
No resources found in openshift-storage namespace.
```

Hence, moving the BZ to verified state, as we have now added delete functions for the remaining resources as part of the uninstall procedure.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605