Bug 1860670 - OCS 4.5 Uninstall External: Openshift-storage namespace in Terminating state as CephObjectStoreUser had finalizers remaining
Summary: OCS 4.5 Uninstall External: Openshift-storage namespace in Terminating state ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Raghavendra Talur
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On: 1886873
Blocks:
 
Reported: 2020-07-26 13:05 UTC by Neha Berry
Modified: 2020-12-17 06:23 UTC
CC List: 12 users

Fixed In Version: 4.6.0-116.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-17 06:23:00 UTC
Embargoed:


Attachments


Links:
- GitHub openshift/ocs-operator pull 782 (closed): StorageCluster: add delete functions for remaining resources as part of uninstall procedure (last updated 2021-02-02 20:51:45 UTC)
- Red Hat Product Errata RHSA-2020:5605 (last updated 2020-12-17 06:23:39 UTC)

Description Neha Berry 2020-07-26 13:05:40 UTC
Description of problem (please be as detailed as possible and provide log snippets):
--------------------------------------------------
Created an OCS 4.5 External Mode cluster. On following the uninstall steps, the openshift-storage project was stuck in the Terminating state with the following message:

- lastTransitionTime: "2020-07-26T12:32:56Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2020-07-26T12:32:56Z"
    message: 'Some resources are remaining: cephobjectstoreusers.ceph.rook.io has
      1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2020-07-26T12:32:56Z"
    message: 'Some content in the namespace has finalizers remaining: cephobjectstoreuser.ceph.rook.io
      in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating


The following resource was not automatically cleaned up:

$ oc get cephobjectstoreusers.ceph.rook.io
NAME                           AGE
noobaa-ceph-objectstore-user   26h
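
One way to confirm which finalizer is blocking the resource (a minimal sketch using a generic jsonpath query; it should report the cephobjectstoreuser.ceph.rook.io finalizer named in the namespace conditions above):

$ oc get cephobjectstoreusers.ceph.rook.io noobaa-ceph-objectstore-user -n openshift-storage -o jsonpath='{.metadata.finalizers}{"\n"}'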



Version of all relevant components (if applicable):
------------------------------------------
OCP =  4.5.0-0.nightly-2020-07-24-091850
OCS = 4.5.0-494.ci
RHCS external = RHCS 4.1.z1 (14.2.8-81.el8cp)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
-----------------------------------------------
Yes. Since External Mode is a new feature of OCS 4.5, we raised this BZ to investigate whether the issue is due to the external cluster setup.

Is there any workaround available to the best of your knowledge?
------------------------------------------------------------
Yes.

date --utc;  oc patch cephobjectstoreusers.ceph.rook.io/noobaa-ceph-objectstore-user  -n openshift-storage  --type=merge -p '{"metadata": {"finalizers":null}}'
Sun Jul 26 12:38:29 UTC 2020
cephobjectstoreuser.ceph.rook.io/noobaa-ceph-objectstore-user patched

The project got successfully deleted
--------------
$ while true; do oc get project openshift-storage; sleep 10; done
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating
Error from server (NotFound): namespaces "openshift-storage" not found



Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
------------------------------------------------------
3

Is this issue reproducible?
---------------------------
Tested once


Can this issue be reproduced from the UI?
----------------------------
Doesn't matter

If this is a regression, please provide more details to justify this:
-----------------------------------------
Not sure

Steps to Reproduce:
1. Create an OCP 4.5 cluster with the latest build
2. Collect the RHCS external cluster details using the python-exporter script and save them in a JSON file
3. Upload the JSON during OCS StorageCluster Service creation so that the External Mode cluster is created

4. Create some PVCs/OBCs

5. Follow these steps to uninstall OCS:
a) Query for PVCs and OBCs that are using the OCS storage class provisioners (a query sketch is shown after these steps).
b) Delete them once they are not in use by any pods.

c) Delete the StorageCluster object via the UI: Installed Operators -> OCS Operator -> Storage Cluster -> select the StorageCluster -> click the 3 dots -> Delete StorageCluster Service

d) Check that the RBD and CephFS SCs are deleted. Delete the noobaa-sc. Delete the noobaa-db PV if it is in Released state (Bug 1860418).

e) Delete the openshift-storage namespace:
$ oc delete project openshift-storage --wait=true --timeout=5m

f) Check the state of the openshift-storage project. In case it is stuck in the Terminating state, check the reason:

oc get project openshift-storage -o yaml

In this attempt, the CephObjectStoreUser still existed in the namespace and was blocking its deletion.
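
For step 5a, a minimal sketch of one way to query for PVCs and OBCs consuming the OCS storage classes (the provisioner/storage class name patterns are assumptions based on default OCS naming and may need adjusting per cluster):

# Storage classes backed by the OCS/Rook/NooBaa provisioners
$ oc get storageclass -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner | grep -E 'ceph|noobaa'

# PVCs across all namespaces, filtered by those storage classes
$ oc get pvc --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STORAGECLASS:.spec.storageClassName | grep -E 'ceph-rbd|cephfs|noobaa'

# OBCs across all namespaces
$ oc get obc --all-namespaces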


Actual results:
----------------------
The openshift-storage namespace deletion is stuck in the Terminating state due to "SomeResourcesRemain" and "SomeFinalizersRemain":

message: 'Some content in the namespace has finalizers remaining: cephobjectstoreuser.ceph.rook.io
      in 1 resource instances'


Expected results:
---------------------
On deletion of the project, the CephObjectStoreUser resource should be deleted automatically.

Additional info:
--------------------

Comment 3 Neha Berry 2020-07-26 13:15:02 UTC
Though the workaround was known and the uninstall docs mention using the article https://access.redhat.com/solutions/3881901 to resolve these kinds of issues, we raised this BZ because it was seen during uninstall of an External Mode cluster (a new feature of OCS 4.5).

Hence, we wanted to confirm whether there is a genuine issue with the deletion of the CephObjectStore resources during namespace deletion in External Mode.
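
For reference, a generic sketch of how the resources still remaining in a stuck namespace can be enumerated (this is the usual oc/kubectl pattern, not necessarily the exact steps from the article; some resource types may return errors that can be ignored):

$ oc api-resources --verbs=list --namespaced -o name | xargs -n 1 oc get -n openshift-storage --ignore-not-found --show-kind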

Thanks.

Comment 4 Elad 2020-07-27 07:01:31 UTC
Proposing as a blocker in order to keep this BZ in 4.5, at least until it has been investigated. We need to make sure that the uninstall is not broken with External Mode clusters.

Comment 5 Mudit Agarwal 2020-07-31 13:06:25 UTC
@Jose, please confirm if this is a blocker.

Comment 7 Mudit Agarwal 2020-08-04 07:19:58 UTC
We have one more Noobaa related uninstall issue (https://bugzilla.redhat.com/show_bug.cgi?id=1860418)
@Nimrod, can someone take a look at both the issues?

Comment 9 Mudit Agarwal 2020-08-04 12:49:06 UTC
My bad, https://bugzilla.redhat.com/show_bug.cgi?id=1860418 is not NooBaa related. I have updated the BZ.

But this one needs some expertise from the NooBaa team.

Comment 11 Jose A. Rivera 2020-08-05 16:33:54 UTC
As discussed in a meeting today between engineering and QE, moving this to OCS 4.6. We will document a workaround for OCS 4.5.

Comment 15 Anat Eyal 2020-08-10 17:22:59 UTC
jrivera, per Comment 13 and Comment 14, it seems that this BZ is already fixed in OCS 4.5. Is this correct? Was it fixed by Bug 1849105?

Comment 16 Jose A. Rivera 2020-08-11 17:40:20 UTC
Seems like it! Reassigning to Talur for completion and moving to ON_QA.

Comment 17 Neha Berry 2020-08-17 13:50:49 UTC
As already pointed out, the finalizer remaining behind for "CephObjectStoreUser" was one of many cases of intermittent issues seen during project deletion. We have seen many other resources getting stuck as well.

Also, since this issue is seen only intermittently, we cannot be 100% sure that the code fixed it. I observed the same CephObjectStore finalizer issue while uninstalling 4.5.0-518 (1 out of 5 recent attempts). So it still exists, but is rarely seen.

Also, if these issues do recur, we are adding a troubleshooting guide link for patching the resources with finalizers: null - https://bugzilla.redhat.com/show_bug.cgi?id=1866809

https://docs.google.com/document/d/1_6VzcV_uaPaXUSaRSb9CDapfrwFsu6Klqbs-By6KCKw/edit# 

Based on Comment 13, Comment 14, and our recent tests on 4.5.0-518 and 4.5.0-521, the issue was seen only once across all these attempts. Let me know if we can still move the BZ to the verified state.

Comment 18 Raghavendra Talur 2020-08-21 15:36:48 UTC
It is currently targeted for 4.6. We moved it out of 4.5 because of the race.

Although the other fixes have reduced the probability of hitting this bug, we don't think it should be considered fixed yet.

We will move it to ON_QA for 4.6. Moving it back to assigned for now.

Comment 19 Mudit Agarwal 2020-09-21 04:02:49 UTC
Talur, are we waiting for further changes or can this be moved to ON_QA?

Comment 24 Neha Berry 2020-10-22 08:31:57 UTC
The OCS uninstall in the latest OCS 4.6 build is getting stuck due to a remaining CephObjectStoreUser. Hence, until that bug is fixed, we cannot verify this BZ:


Bug 1886873 - [OCS 4.6 External/Internal Uninstall] - Storage Cluster deletion stuck indefinitely, "failed to delete object store", remaining users: [noobaa-ceph-objectstore-user]

Comment 25 Neha Berry 2020-10-28 17:09:43 UTC
Verified the fix on an OCS 4.6.0-144.ci external mode cluster. Will test in internal mode too before moving the BZ to the verified state.


1. Created an OCS external mode cluster. The cluster is in Connected state.
2. Triggered OCS uninstall by deleting the storagecluster
3. Deleted the namespace

Observation:

The storage cluster deletion succeeds, followed by successful deletion of the namespace. The operator now issues separate delete calls for the CephObjectStore and the CephObjectStoreUser, so the chance of these resources staying behind is now gone.

But if for any reason the whole uninstall process is affected (say, the cluster was not in a good state prior to uninstall) and namespace deletion gets stuck due to FinalizersRemaining for some resources, we can always use the oc patch command to set their finalizers to null (added to the troubleshooting guide); a sketch is shown below.
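
For reference, a minimal sketch of that patch looped over any CephObjectStoreUsers still present (the same pattern as the single-resource patch shown in the description; other stuck resource types can be handled the same way):

# Clear finalizers on every CephObjectStoreUser left in the namespace
$ for u in $(oc get cephobjectstoreusers.ceph.rook.io -n openshift-storage -o name); do
    oc patch "$u" -n openshift-storage --type=merge -p '{"metadata":{"finalizers":null}}'
  done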


OCP = 4.6.0-0.nightly-2020-10-22-034051
OCS = ocs-operator.v4.6.0-144.ci

_________________________________________________________________________________________________

Before triggering uninstall
=========================

Wed Oct 28 16:45:23 UTC 2020
--------------
========CSV ======
NAME                         DISPLAY                       VERSION        REPLACES   PHASE
ocs-operator.v4.6.0-144.ci   OpenShift Container Storage   4.6.0-144.ci              Succeeded
--------------
=======PODS ======
NAME                                            READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
csi-cephfsplugin-hh85d                          3/3     Running   0          35m   10.1.160.165   compute-0   <none>           <none>
csi-cephfsplugin-n7rgp                          3/3     Running   0          35m   10.1.160.180   compute-2   <none>           <none>
csi-cephfsplugin-nvnmn                          3/3     Running   0          35m   10.1.160.161   compute-1   <none>           <none>
csi-cephfsplugin-provisioner-56455449bd-6cmhn   6/6     Running   0          35m   10.131.0.205   compute-1   <none>           <none>
csi-cephfsplugin-provisioner-56455449bd-bnnvk   6/6     Running   0          35m   10.129.2.94    compute-2   <none>           <none>
csi-rbdplugin-68wgt                             3/3     Running   0          35m   10.1.160.165   compute-0   <none>           <none>
csi-rbdplugin-6xfvz                             3/3     Running   0          35m   10.1.160.180   compute-2   <none>           <none>
csi-rbdplugin-7wjdv                             3/3     Running   0          35m   10.1.160.161   compute-1   <none>           <none>
csi-rbdplugin-provisioner-586fc6cfc-d55ds       6/6     Running   0          35m   10.128.2.68    compute-0   <none>           <none>
csi-rbdplugin-provisioner-586fc6cfc-nh2br       6/6     Running   0          35m   10.131.0.204   compute-1   <none>           <none>
noobaa-core-0                                   1/1     Running   0          35m   10.128.2.69    compute-0   <none>           <none>
noobaa-db-0                                     1/1     Running   0          35m   10.131.0.206   compute-1   <none>           <none>
noobaa-endpoint-58dc95697d-4gnzc                1/1     Running   0          34m   10.131.0.207   compute-1   <none>           <none>
noobaa-operator-7bcf846c94-h722m                1/1     Running   0          36m   10.131.0.203   compute-1   <none>           <none>
ocs-metrics-exporter-777dc7b97f-4v4hm           1/1     Running   0          36m   10.129.2.93    compute-2   <none>           <none>
ocs-operator-86846df567-gmp25                   1/1     Running   0          36m   10.129.2.91    compute-2   <none>           <none>
rook-ceph-operator-f44db9fbf-4bkrh              1/1     Running   0          36m   10.129.2.92    compute-2   <none>           <none>
--------------
======= PVC ==========
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           AGE
db-noobaa-db-0   Bound    pvc-4c1a12e0-d866-4fe0-842d-95061698db86   50Gi       RWO            ocs-external-storagecluster-ceph-rbd   35m
--------------
======= storagecluster ==========
NAME                          AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-external-storagecluster   35m   Ready   true       2020-10-28T16:10:00Z   4.6.0

>> while true; do oc get cephobjectstore -n openshift-storage ; oc get cephobjectstoreuser; sleep 5; done


NAME                                          AGE
ocs-external-storagecluster-cephobjectstore   35m
NAME                           AGE
noobaa-ceph-objectstore-user   35m


2. Deleted the storage cluster

$ date --utc; oc delete -n openshift-storage storagecluster --all --wait=true
Wed Oct 28 16:45:42 UTC 2020
storagecluster.ocs.openshift.io "ocs-external-storagecluster" deleted


3.$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted

$ oc get project openshift-storage
Error from server (NotFound): namespaces "openshift-storage" not found



>> rook-log snip


2020-10-28 16:46:01.516215 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user". . could not remove user: unable to remove user, must specify purge data to remove user with buckets: failed to delete s3 user: exit status 17
2020-10-28 16:46:02.575081 I | ceph-spec: object "rook-ceph-config" matched on delete, reconciling
2020-10-28 16:46:02.575201 I | ceph-spec: removing finalizer "cephcluster.ceph.rook.io" on "ocs-external-storagecluster-cephcluster"
2020-10-28 16:46:02.591833 E | clusterdisruption-controller: cephcluster "openshift-storage/ocs-external-storagecluster-cephcluster" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.639919 I | ceph-spec: object "rook-ceph-mgr-external" matched on delete, reconciling
2020-10-28 16:46:02.711974 E | clusterdisruption-controller: cephcluster "openshift-storage/" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.712153 I | ceph-spec: removing finalizer "cephobjectstore.ceph.rook.io" on "ocs-external-storagecluster-cephobjectstore"
2020-10-28 16:46:02.739777 E | clusterdisruption-controller: cephcluster "openshift-storage/" seems to be deleted, not requeuing until triggered again
2020-10-28 16:46:02.755733 I | ceph-spec: object "rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore" matched on delete, reconciling
2020-10-28 16:46:02.795772 E | ceph-object-store-user-controller: failed to reconcile failed to populate cluster info: not expected to create new cluster info and did not find existing secret
2020-10-28 16:46:03.796028 I | ceph-spec: removing finalizer "cephobjectstoreuser.ceph.rook.io" on "noobaa-ceph-objectstore-user"
2020-10-28 16:46:03.825505 I | ceph-spec: object "rook-ceph-object-user-ocs-external-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user" matched on delete, reconciling



>> ocs-op snip


{"level":"info","ts":"2020-10-28T16:46:02.712Z","logger":"controller_storagecluster","msg":"Uninstall in progress","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","Status":"Uninstall: Waiting for cephObjectStore ocs-external-storagecluster-cephobjectstore to be deleted"}
{"level":"info","ts":"2020-10-28T16:46:02.756Z","logger":"controller_storagecluster","msg":"Reconciling external StorageCluster","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephCluster not found, can't set the cleanup policy and uninstall mode","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: NooBaa not found, can't set UninstallModeForced","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"NooBaa and noobaa-core PVC not found.","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephCluster not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephObjectStoreUser not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephObjectStoreUser Name":"ocs-external-storagecluster-cephobjectstoreuser"}
{"level":"info","ts":"2020-10-28T16:46:02.798Z","logger":"controller_storagecluster","msg":"Uninstall: CephObjectStore not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephObjectStore Name":"ocs-external-storagecluster-cephobjectstore"}
{"level":"info","ts":"2020-10-28T16:46:02.898Z","logger":"controller_storagecluster","msg":"Uninstall: CephFilesystem not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephFilesystem Name":"ocs-external-storagecluster-cephfilesystem"}
{"level":"info","ts":"2020-10-28T16:46:02.999Z","logger":"controller_storagecluster","msg":"Uninstall: CephBlockPool not found","Request.Namespace":"openshift-storage","Request.Name":"ocs-external-storagecluster","CephBlockPool Name":"ocs-external-storagecluster-cephblockpool"}


>>while true; do oc get cephobjectstore -n openshift-storage ; oc get cephobjectstoreuser; sleep 5; done
No resources found in openshift-storage namespace.
No resources found in openshift-storage namespace.

Hence, moving the BZ to the verified state, as delete functions for the remaining resources have now been added as part of the uninstall procedure.

Comment 28 errata-xmlrpc 2020-12-17 06:23:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

