Description of problem (please be as detailed as possible and provide log snippets):
--------------------------------------------------
Created an OCS 4.5 External Mode cluster. On following the steps for uninstall, the openshift-storage project was stuck in the Terminating state with the following status:

  - lastTransitionTime: "2020-07-26T12:32:56Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2020-07-26T12:32:56Z"
    message: 'Some resources are remaining: cephobjectstoreusers.ceph.rook.io has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2020-07-26T12:32:56Z"
    message: 'Some content in the namespace has finalizers remaining: cephobjectstoreuser.ceph.rook.io in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating

The following resource was not automatically cleaned up:

$ oc get cephobjectstoreusers.ceph.rook.io
NAME                           AGE
noobaa-ceph-objectstore-user   26h

Version of all relevant components (if applicable):
------------------------------------------
OCP = 4.5.0-0.nightly-2020-07-24-091850
OCS = 4.5.0-494.ci
RHCS external = RHCS 4.1.z1 (14.2.8-81.el8cp)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
-----------------------------------------------
Yes. Since External Mode is a new feature of OCS 4.5, this BZ was raised to inspect whether the issue is caused by the external cluster setup.

Is there any workaround available to the best of your knowledge?
------------------------------------------------------------
Yes.
$ date --utc; oc patch cephobjectstoreusers.ceph.rook.io/noobaa-ceph-objectstore-user -n openshift-storage --type=merge -p '{"metadata": {"finalizers":null}}'
Sun Jul 26 12:38:29 UTC 2020
cephobjectstoreuser.ceph.rook.io/noobaa-ceph-objectstore-user patched

After the patch, the project was successfully deleted:
--------------
$ while true; do oc get project openshift-storage; sleep 10; done
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating
Error from server (NotFound): namespaces "openshift-storage" not found

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
------------------------------------------------------
3

Is this issue reproducible?
---------------------------
Tested once

Can this issue be reproduced from the UI?
----------------------------
Doesn't matter

If this is a regression, please provide more details to justify this:
-----------------------------------------
Not sure

Steps to Reproduce:
1. Create an OCP 4.5 cluster with the latest build.
2. Collect the RHCS external cluster details using the python exporter script and save them in a JSON file.
3. Upload the JSON during OCS StorageCluster Service creation; the External Mode cluster is created.
4. Create some PVCs/OBCs.
5. Follow these steps to uninstall OCS:
   a) Query for PVCs and OBCs that are using the storage class provisioners.
   b) Delete them once they are not in use by any pods.
   c) Delete the StorageCluster object. UI: Installed Operators -> OCS Operator -> Storage Cluster -> select the StorageCluster -> click the 3 dots -> Delete StorageCluster Service.
   d) Check that the RBD and CephFS SCs are deleted. Delete the noobaa-sc. Delete the noobaa-db PV if it is in the Released state (Bug 1860418).
   e) Delete the openshift-storage namespace: $ oc delete project openshift-storage --wait=true --timeout=5m
   f) Check the state of the openshift-storage project.
In case it is stuck in the Terminating state, check the reason:
$ oc get project openshift-storage -o yaml

In this attempt, the cephobjectstoreuser still existed in the namespace and blocked its deletion.

Actual results:
----------------------
openshift-storage namespace deletion is stuck in the Terminating state due to "SomeResourcesRemain" and "SomeFinalizersRemain":
message: 'Some content in the namespace has finalizers remaining: cephobjectstoreuser.ceph.rook.io in 1 resource instances'

Expected results:
---------------------
On deletion of the project, the cephobjectstoreuser resource should automatically be deleted.

Additional info:
--------------------
Though the workaround was known, and the uninstall docs mention using the article https://access.redhat.com/solutions/3881901 to resolve these kinds of issues, we raised this BZ because it was seen during uninstall of an External Mode cluster (a new feature of OCS 4.5). Hence, we wanted to confirm whether there is a genuine issue with the deletion of the CephObjectStore resources during namespace deletion in External Mode. Thanks.
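The workaround above (patching finalizers to null) can be sketched as a small helper. This is a hypothetical sketch, not part of the BZ or the OCS docs: it prints why the namespace is stuck, then clears finalizers on every remaining namespaced resource instance. It assumes an authenticated `oc` session and does nothing when `oc` is not on the PATH.

```shell
# Hypothetical helper for a namespace stuck in Terminating. Assumes an
# authenticated `oc` session; this is a sketch, not shipped OCS tooling.
NS=${NS:-openshift-storage}
PATCH='{"metadata":{"finalizers":null}}'

if ! command -v oc >/dev/null 2>&1; then
  echo "oc not found; run this from an authenticated cluster session"
else
  # Show why the namespace is stuck (SomeResourcesRemain, SomeFinalizersRemain, ...).
  oc get ns "$NS" -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'

  # Clear finalizers on every instance of every namespaced resource type left behind.
  for res in $(oc api-resources --verbs=list --namespaced -o name); do
    for obj in $(oc get "$res" -n "$NS" -o name 2>/dev/null); do
      echo "Removing finalizers on $obj"
      oc patch "$obj" -n "$NS" --type=merge -p "$PATCH"
    done
  done
fi
```

Note that clearing finalizers skips the controller's cleanup, so backend state (for example the RGW user on the external cluster) may be left behind and need manual removal.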
Proposing as a blocker in order to keep this BZ in 4.5, at least until it has been investigated. We need to make sure that the uninstall is not broken with External Mode clusters.
@Jose, please confirm if this is a blocker.
We have one more NooBaa-related uninstall issue (https://bugzilla.redhat.com/show_bug.cgi?id=1860418). @Nimrod, can someone take a look at both issues?
My bad, https://bugzilla.redhat.com/show_bug.cgi?id=1860418 is not NooBaa related; I have updated that BZ. But this one needs some expertise from the NooBaa team.
As discussed in a meeting today between engineering and QE, moving this to OCS 4.6. We will document a workaround for OCS 4.5.
jrivera, per Comment 13 and Comment 14, it seems that this BZ is already fixed in OCS 4.5. Is this correct? Was it fixed by Bug 1849105?
Seems like it! Reassigning to Talur for completion and moving to ON_QA.
As already pointed out, the finalizer remaining behind for CephObjectStoreUser was one of many intermittent issues seen during project deletion; we have seen several other resources getting stuck as well. Also, since this issue occurs very intermittently, we cannot be 100% sure that the code fixed it. I observed the same CephObjectStore finalizer issue while uninstalling 4.5.0-518 (1 out of 5 recent attempts). So it still exists, but is rarely seen.

In case these issues do re-occur, we are adding a troubleshooting guide link describing how to patch the resources with finalizers:null:
https://bugzilla.redhat.com/show_bug.cgi?id=1866809
https://docs.google.com/document/d/1_6VzcV_uaPaXUSaRSb9CDapfrwFsu6Klqbs-By6KCKw/edit#

Based on Comment 13, Comment 14, and our recent tests on 4.5.0-518 and 4.5.0-521, the issue was seen only once across all these attempts. Let me know if we can still move the BZ to the verified state.
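The generic form of that finalizers:null patch, with the resource type and name from this BZ as placeholder examples, looks like this (a sketch; substitute whatever object the namespace reports under SomeFinalizersRemain):

```shell
# Generic finalizer-clearing patch. RES/RES_NAME are example placeholders
# taken from this BZ, not a fixed part of any procedure.
RES=cephobjectstoreusers.ceph.rook.io
RES_NAME=noobaa-ceph-objectstore-user

if command -v oc >/dev/null 2>&1; then
  oc patch "$RES/$RES_NAME" -n openshift-storage --type=merge -p '{"metadata":{"finalizers":null}}'
else
  # Outside a cluster session, just show the command that would run.
  echo "oc patch $RES/$RES_NAME -n openshift-storage --type=merge -p '{\"metadata\":{\"finalizers\":null}}'"
fi
```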
It is currently targeted for 4.6. We moved it out of 4.5 because of the race. Although the other fixes have reduced the probability of hitting this bug, we don't think it should be considered fixed yet. We will move it to ON_QA for 4.6. Moving it back to assigned for now.
Talur, are we waiting for further changes or can this be moved to ON_QA?
The OCS uninstall in the latest OCS 4.6 is getting stuck due to a remaining cephobjectstoreuser. Hence, until that bug is fixed, we cannot verify this BZ:

Bug 1886873 - [OCS 4.6 External/Internal Uninstall] - Storage Cluster deletion stuck indefinitely, "failed to delete object store", remaining users: [noobaa-ceph-objectstore-user]
Verified the fix on an OCS 4.6 (4.6.0-144.ci) External Mode cluster. Will test in Internal Mode too, before moving the BZ to the verified state.

1. Created an OCS External Mode cluster; the cluster is in Connected state
2. Triggered OCS uninstall by deleting the storagecluster
3. Deleted the namespace

Observation: The storage cluster deletion succeeds, followed by successful deletion of the namespace. The operator now issues separate commands to delete the cephobjectstore and cephobjectstoreuser, hence the chance of these resources staying behind is now gone. But if for any reason the whole uninstall process is affected (say the cluster was not in a good state prior to uninstall) and namespace deletion gets stuck due to FinalizersRemaining on some resources, we can always use the oc patch command to patch them with finalizers:null (added in the troubleshooting guide).

OCP = 4.6.0-0.nightly-2020-10-22-034051
OCS = ocs-operator.v4.6.0-144.ci
_________________________________________________________________________________________________

Before triggering uninstall
=========================
Wed Oct 28 16:45:23 UTC 2020
--------------
======== CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-144.ci   OpenShift Container Storage   4.6.0-144.ci              Succeeded
--------------
======= PODS ======
NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
csi-cephfsplugin-hh85d                          3/3     Running   0          35m   10.1.160.165   compute-0   <none>           <none>
csi-cephfsplugin-n7rgp                          3/3     Running   0          35m   10.1.160.180   compute-2   <none>           <none>
csi-cephfsplugin-nvnmn                          3/3     Running   0          35m   10.1.160.161   compute-1   <none>           <none>
csi-cephfsplugin-provisioner-56455449bd-6cmhn   6/6     Running   0          35m   10.131.0.205   compute-1   <none>           <none>
csi-cephfsplugin-provisioner-56455449bd-bnnvk   6/6     Running   0          35m   10.129.2.94    compute-2   <none>           <none>
csi-rbdplugin-68wgt                             3/3     Running   0          35m   10.1.160.165   compute-0   <none>           <none>
csi-rbdplugin-6xfvz                             3/3     Running   0          35m   10.1.160.180   compute-2   <none>           <none>
csi-rbdplugin-7wjdv                             3/3     Running   0          35m   10.1.160.161   compute-1   <none>           <none>
csi-rbdplugin-provisioner-586fc6cfc-d55ds       6/6     Running   0          35m   10.128.2.68    compute-0   <none>           <none>
csi-rbdplugin-provisioner-586fc6cfc-nh2br       6/6     Running   0          35m   10.131.0.204   compute-1   <none>           <none>
noobaa-core-0                                   1/1     Running   0          35m   10.128.2.69    compute-0   <none>           <none>
noobaa-db-0                                     1/1     Running   0          35m   10.131.0.206   compute-1   <none>           <none>
noobaa-endpoint-58dc95697d-4gnzc                1/1     Running   0          34m   10.131.0.207   compute-1   <none>           <none>
noobaa-operator-7bcf846c94-h722m                1/1     Running   0          36m   10.131.0.203   compute-1   <none>           <none>
ocs-metrics-exporter-777dc7b97f-4v4hm           1/1     Running   0          36m   10.129.2.93    compute-2   <none>           <none>
ocs-operator-86846df567-gmp25                   1/1     Running   0          36m   10.129.2.91    compute-2   <none>           <none>
rook-ceph-operator-f44db9fbf-4bkrh              1/1     Running   0          36m   10.129.2.92    compute-2   <none>           <none>
--------------
======= PVC ==========
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
db-noobaa-db-0   Bound    pvc-4c1a12e0-d866-4fe0-842d-95061698db86   50Gi       RWO            ocs-external-storagecluster-ceph-rbd   35m
--------------
======= storagecluster ==========
NAME                          AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-external-storagecluster   35m   Ready   true       2020-10-28T16:10:00Z   4.6.0

>> while true; do oc get cephobjectstore -n openshift-storage ; oc get cephobjectstoreuser; sleep 5; done
NAME                                          AGE
ocs-external-storagecluster-cephobjectstore   35m
NAME                           AGE
noobaa-ceph-objectstore-user   35m

2.
Deleted the storage cluster:

$ date --utc; oc delete -n openshift-storage storagecluster --all --wait=true
Wed Oct 28 16:45:42 UTC 2020
storagecluster.ocs.openshift.io "ocs-external-storagecluster" deleted

3. Deleted the project:

$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted
$ oc get project openshift-storage
Error from server (NotFound): namespaces "openshift-storage" not found

>> rook-ceph operator log snippet
2020-10-28 16:46:01.516215 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user". . could not remove user: unable to remove user, must specify purge data to remove user with buckets: failed to delete s3 user: exit status 17
2020-10-28 16:46:02.575081 I | ceph-spec: object "rook-ceph-config" matched on delete, reconciling
2020-10-28 16:46:02.575201 I | ceph-spec: removing finalizer "cephcluster.ceph.rook.io" on "ocs-external-storagecluster-cephcluster"
2020-10-28 16:46:02.591833 E | clusterdisruption-controller: cephcluster "openshift-storage/ocs-external-storagecluster-cephcluster" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.639919 I | ceph-spec: object "rook-ceph-mgr-external" matched on delete, reconciling
2020-10-28 16:46:02.711974 E | clusterdisruption-controller: cephcluster "openshift-storage/" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.712153 I | ceph-spec: removing finalizer "cephobjectstore.ceph.rook.io" on "ocs-external-storagecluster-cephobjectstore"
2020-10-28 16:46:02.739777 E | clusterdisruption-controller: cephcluster "openshift-storage/" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.755733 I | ceph-spec: object "rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore" matched on delete, reconciling
2020-10-28 16:46:02.795772 E | ceph-object-store-user-controller: failed to reconcile failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2020-10-28 16:46:03.796028 I | ceph-spec: removing finalizer "cephobjectstoreuser.ceph.rook.io" on "noobaa-ceph-objectstore-user"
2020-10-28 16:46:03.825505 I | ceph-spec: object "rook-ceph-object-user-ocs-external-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user" matched on delete, reconciling

>> ocs-operator log snippet
{"level":"info","ts":"2020-10-28T16:46:02.712Z","logger":"controller_storagecluster","msg":"Uninstall in progress","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","Status":"Uninstall: Waiting for cephObjectStore ocs-external-storagecluster-cephobjectstore to be deleted"}
{"level":"info","ts":"2020-10-28T16:46:02.756Z","logger":"controller_storagecluster","msg":"Reconciling external StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephCluster not found, can't set the cleanup policy and uninstall mode","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: NooBaa not found, can't set UninstallModeForced","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"NooBaa and noobaa-core PVC not found.","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephCluster not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephObjectStoreUser not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephObjectStoreUser Name":"ocs-external-storagecluster-cephobjectstoreuser"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephObjectStore not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephObjectStore Name":"ocs-external-storagecluster-cephobjectstore"}
{"level":"info","ts":"2020-10-28T16:46:02.898Z","logger":"controller_storagecluster","msg":"Uninstall: CephFilesystem not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephFilesystem Name":"ocs-external-storagecluster-cephfilesystem"}
{"level":"info","ts":"2020-10-28T16:46:02.999Z","logger":"controller_storagecluster","msg":"Uninstall: CephBlockPool not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephBlockPool Name":"ocs-external-storagecluster-cephblockpool"}

>> while true; do oc get cephobjectstore -n openshift-storage ; oc get cephobjectstoreuser; sleep 5; done
No resources found in openshift-storage namespace.
No resources found in openshift-storage namespace.

Hence, moving the BZ to the verified state, as delete functions for the remaining resources have now been added as part of the uninstall procedure.
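For reference, the "must specify purge data to remove user with buckets" error in the rook log above is radosgw refusing to delete a user that still owns buckets. A hypothetical manual cleanup on the external RHCS side (destructive: --purge-data also removes the user's buckets and objects) would look like:

```shell
# Hypothetical manual cleanup on the external RHCS cluster, only needed if the
# object store user is left behind with buckets still attached. Destructive:
# --purge-data deletes the user's buckets and objects along with the user.
USER_UID=noobaa-ceph-objectstore-user

if command -v radosgw-admin >/dev/null 2>&1; then
  radosgw-admin user info --uid="$USER_UID"
  radosgw-admin user rm --uid="$USER_UID" --purge-data
else
  echo "radosgw-admin not found; run this on an RHCS admin node"
fi
```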
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605