Bug 2000941

Summary: Uninstall - finalizers blocking openshift-storage namespace deletion
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Jilju Joy <jijoy>
Component: ocs-operator
Assignee: Jose A. Rivera <jrivera>
Status: CLOSED CURRENTRELEASE
QA Contact: Anna Sandler <asandler>
Severity: high
Priority: unspecified
Version: 4.9
CC: aindenba, amagrawa, dzaken, ebenahar, etamir, idryomov, jrivera, madam, muagarwa, nbecker, nigoyal, ocs-bugs, odf-bz-bot, owasserm, shan, sostapov
Target Milestone: ---
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-07 17:46:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 2005040
Bug Blocks:

Description Jilju Joy 2021-09-03 10:45:26 UTC
Description of problem:

While deleting the openshift-storage namespace as part of the ODF uninstall process, the namespace remained in "Terminating" status due to finalizers still present on resources in the namespace, as seen in the YAML output below.

The StorageCluster and StorageSystem were deleted before deleting the openshift-storage namespace.

$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted

$ oc get project openshift-storage
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating

$ oc get project openshift-storage -o yaml
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c26,c0
    openshift.io/sa.scc.supplemental-groups: 1000650000/10000
    openshift.io/sa.scc.uid-range: 1000650000/10000
  creationTimestamp: "2021-09-02T07:52:41Z"
  deletionTimestamp: "2021-09-02T17:37:50Z"
  labels:
    kubernetes.io/metadata.name: openshift-storage
    olm.operatorgroup.uid/76c42cb4-e5d1-446e-82e9-c121c43996f7: ""
    openshift.io/cluster-monitoring: "true"
  name: openshift-storage
  resourceVersion: "388296"
  uid: 8534b9c6-002e-4afd-8c31-d08a8250c501
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2021-09-02T17:38:07Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-09-02T17:38:07Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-09-02T17:38:34Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-09-02T17:38:07Z"
    message: 'Some resources are remaining: backingstores.noobaa.io has 1 resource
      instances, bucketclasses.noobaa.io has 1 resource instances, noobaas.noobaa.io
      has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-09-02T17:38:07Z"
    message: 'Some content in the namespace has finalizers remaining: noobaa.io/finalizer
      in 2 resource instances, noobaa.io/graceful_finalizer in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating



These are the finalizers that blocked the deletion of the 'openshift-storage' namespace:

$ oc get noobaa noobaa -n openshift-storage -o yaml | grep finalizer
  finalizers:
  - noobaa.io/graceful_finalizer

$ oc get backingstore noobaa-default-backing-store -n openshift-storage -o yaml | grep finalizer
  finalizers:
  - noobaa.io/finalizer


$ oc get bucketclasses.noobaa.io noobaa-default-bucket-class -n openshift-storage -o yaml | grep finalizer
  finalizers:
  - noobaa.io/finalizer



Workaround:
Remove the finalizers from the remaining NooBaa resources while the namespace is in Terminating status.

$ oc patch -n openshift-storage noobaa/noobaa --type=merge -p '{"metadata": {"finalizers":null}}'
noobaa.noobaa.io/noobaa patched

$ oc patch -n openshift-storage backingstore/noobaa-default-backing-store --type=merge -p '{"metadata": {"finalizers":null}}'
backingstore.noobaa.io/noobaa-default-backing-store patched

$ oc patch -n openshift-storage bucketclasses.noobaa.io/noobaa-default-bucket-class --type=merge -p '{"metadata": {"finalizers":null}}'
bucketclass.noobaa.io/noobaa-default-bucket-class patched
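
The three patches above can be wrapped in a single helper. This is only a sketch of the workaround: the resource names are the ones observed in this report and may differ in other clusters.

```shell
# Sketch: clear the finalizers on the NooBaa resources named in this report.
# The resource names below are assumptions taken from this cluster; adjust
# them to whatever `oc get noobaa,backingstore,bucketclass -n <ns>` shows.
remove_noobaa_finalizers() {
  local ns="${1:-openshift-storage}"
  local r
  for r in noobaa/noobaa \
           backingstore/noobaa-default-backing-store \
           bucketclasses.noobaa.io/noobaa-default-bucket-class; do
    # A merge patch setting finalizers to null removes them all at once.
    oc patch -n "$ns" "$r" --type=merge -p '{"metadata":{"finalizers":null}}'
  done
}
```

Usage: `remove_noobaa_finalizers openshift-storage` once the namespace is already in Terminating status.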


=================================================================================
Version-Release number of selected component (if applicable):

$ oc get csv
NAME                            DISPLAY                       VERSION        REPLACES   PHASE
noobaa-operator.v4.9.0-123.ci   NooBaa Operator               4.9.0-123.ci              Succeeded
ocs-operator.v4.9.0-123.ci      OpenShift Container Storage   4.9.0-123.ci              Succeeded
odf-operator.v4.9.0-123.ci      OpenShift Data Foundation     4.9.0-123.ci              Succeeded

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-01-193941   True        False         10h     Cluster version is 4.9.0-0.nightly-2021-09-01-193941
====================================================================================

How reproducible:
Reporting the first failure.

================================================================================
Steps to Reproduce:

Follow the existing steps to uninstall OCS -
https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/deploying_openshift_container_storage_using_amazon_web_services/assembly_uninstalling-openshift-container-storage_rhocs#uninstalling-openshift-container-storage-in-internal-mode_rhocs
(testing was done on AWS)

After deleting the storagecluster (after step 5 in the given doc), delete the storage system.
$ oc delete storagesystem ocs-storagecluster-storagesystem
storagesystem.odf.openshift.io "ocs-storagecluster-storagesystem" deleted
$ oc get storagesyatem
error: the server doesn't have a resource type "storagesyatem"

Continue with the rest of the steps given in the doc. 


Actual results:

Step 7 in the documentation cannot be completed:
"Delete the namespace and wait till the deletion is complete. You will need to switch to another project if openshift-storage is the active project."

The namespace openshift-storage will remain in terminating state.
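
When a namespace is stuck in Terminating, a generic sweep over all namespaced resource types shows what is still blocking it (here it was the NooBaa CRs). This is a diagnostic sketch only; the set of resource types returned by `oc api-resources` varies by cluster.

```shell
# Sketch: list every namespaced resource still present in a Terminating
# namespace. Slow on large clusters, but it pinpoints the blockers.
list_remaining_resources() {
  local ns="${1:?usage: list_remaining_resources <namespace>}"
  oc api-resources --verbs=list --namespaced -o name \
    | while read -r kind; do
        # --ignore-not-found keeps the output limited to types with instances.
        oc get -n "$ns" "$kind" --ignore-not-found -o name 2>/dev/null
      done
}
```

Usage: `list_remaining_resources openshift-storage`.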


Expected results:
Uninstall process should succeed. The namespace openshift-storage should be deleted.


Additional info:

Comment 2 Jilju Joy 2021-09-03 10:47:30 UTC
Adding the Regression keyword because the uninstall process was working in 4.8.

Comment 7 Alexander Indenbaum 2021-10-17 05:51:08 UTC
Reproduced in a dev environment. I can see two issues in the namespace-removal scenario:
1. The operator Deployment is removed before NooBaa CR deletion is complete, leaving the NooBaa resource finalizers behind.
2. The operator's RBAC resources are removed before the NooBaa operator itself, causing ☠️  Panic Attack: [Unauthorized]

Comment 16 Sébastien Han 2021-10-21 15:03:17 UTC
This is strange. During uninstallation we probably have a step that removes the PVC, which should unmap the RBD block device and remove it from Ceph. At this point the entire cluster is gone, but it looks like the RBD block device is still there:

rbd0   252:0    0    50G  0 disk /var/lib/kubelet/pods/1798b7f3-7904-4b20-a76c-5c9d5bfbe97d/volumes/kubernetes.io~csi/pvc-c2d41602-3ddb-41fd-8ba6-9bea2a622dee/mount

Could this be a sequencing issue during deletion?

Everything we are seeing here is because of that lingering RBD device: the filesystem on top of it stops responding, and then Postgres hangs too.
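
A quick way to check each storage node for lingering RBD mappings is a debug-pod sweep, in the same style as the `ls /var/lib/rook` loop shown later in this report. A sketch, assuming the standard OCS node label:

```shell
# Sketch: look for still-mapped rbd devices on every OCS-labeled node.
# Any surviving "rbd*" entry after uninstall indicates the unmap never ran.
check_lingering_rbd() {
  local node
  for node in $(oc get node -l cluster.ocs.openshift.io/openshift-storage= \
      -o jsonpath='{.items[*].metadata.name}'); do
    echo "=== ${node} ==="
    oc debug "node/${node}" -- chroot /host \
        lsblk -o NAME,SIZE,MOUNTPOINT 2>/dev/null \
      | grep '^rbd' || echo "no rbd devices mapped"
  done
}
```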

Comment 17 Danny 2021-10-27 08:20:37 UTC
After discussing this issue with Sébastien, he thinks this is more of an ocs-operator bug than a Rook one. Changing the component to ocs-operator.

Comment 18 Mudit Agarwal 2021-11-09 15:03:22 UTC
Providing the dev ack; we are still looking for the RCA.

Comment 21 Jose A. Rivera 2021-11-15 19:26:12 UTC
Reviewing this as best I could, I can only come up with a few thoughts but no solutions:

* In following the uninstall documentation, what was the state of the cluster when you did Step 7 "Delete the namespace and wait till the deletion is complete."? Ideally there should have been only operator and CSI Pods present. If any other Pods were still running that means things did not resolve correctly.

* Part of the uninstall workflow has the user removing all OCS PVCs. While the documentation is careful to provide a script that ignores the NooBaa PVCs, can you verify this was done correctly?

* It seems a potentially related BZ (https://bugzilla.redhat.com/show_bug.cgi?id=2005040) was updated after the latest round of testing was done on this one. I also share the suspicion that it may be related... Since that one is ON_QA, could we also move this one to ON_QA?

Comment 22 Jilju Joy 2021-11-16 05:07:22 UTC
(In reply to Jose A. Rivera from comment #21)
> Reviewing this as best I could, I can only come up with a few thoughts but
> no solutions:
> 
> * In following the uninstall documentation, what was the state of the
> cluster when you did Step 7 "Delete the namespace and wait till the deletion
> is complete."? Ideally there should have been only operator and CSI Pods
> present. If any other Pods were still running that means things did not
> resolve correctly.

CSI, operator and NooBaa pods were present. Adding here the steps for deleting the storage cluster, the storage system and the openshift-storage namespace. These steps are captured from the initial reproducer of the issue given in comment #0.


(venv) [jijoy@localhost ocs-ci]$ oc delete -n openshift-storage storagecluster --all --wait=true
storagecluster.ocs.openshift.io "ocs-storagecluster" deleted
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get pods -n openshift-storage | grep -i cleanup
cluster-cleanup-job-ip-10-0-158-68.us-east-2.compute.i--1-j6pp8   0/1     Completed   0             29s
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS      AGE
cluster-cleanup-job-ip-10-0-158-68.us-east-2.compute.i--1-j6pp8   0/1     Completed   0             57s
csi-cephfsplugin-8hwqz                                            3/3     Running     0             3h56m
csi-cephfsplugin-9mk5q                                            3/3     Running     0             3h56m
csi-cephfsplugin-provisioner-8546f775c4-q7tnk                     6/6     Running     0             3h56m
csi-cephfsplugin-provisioner-8546f775c4-xwftd                     6/6     Running     0             3h56m
csi-cephfsplugin-snj9w                                            3/3     Running     0             3h56m
csi-rbdplugin-24sl8                                               3/3     Running     0             3h56m
csi-rbdplugin-5rcfb                                               3/3     Running     0             3h56m
csi-rbdplugin-provisioner-59dbd44fdd-hxlq2                        6/6     Running     0             3h56m
csi-rbdplugin-provisioner-59dbd44fdd-w8mz8                        6/6     Running     0             3h56m
csi-rbdplugin-shfrc                                               3/3     Running     0             3h56m
noobaa-core-0                                                     1/1     Running     0             3h53m
noobaa-db-pg-0                                                    1/1     Running     0             3h53m
noobaa-endpoint-9649f7f74-fhgth                                   1/1     Running     0             3h53m
noobaa-operator-6c4f6fcfb8-wkkb9                                  1/1     Running     1 (62s ago)   9h
ocs-metrics-exporter-564f89d788-6fkj5                             1/1     Running     0             9h
ocs-operator-7c9fcf7d74-chxbp                                     1/1     Running     0             9h
odf-console-7c6fd85bcf-ftxl4                                      2/2     Running     0             9h
odf-operator-controller-manager-55dcf859f9-sfdqz                  2/2     Running     0             9h
rook-ceph-operator-847c7bc6f4-f7lfg                               1/1     Running     0             9h
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get pods -n openshift-storage | grep -i cleanup
cluster-cleanup-job-ip-10-0-158-68.us-east-2.compute.i--1-j6pp8   0/1     Completed   0               3m14s
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get storagecluster
No resources found in openshift-storage namespace.
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get storagesystem
NAME                               STORAGE-SYSTEM-KIND                  STORAGE-SYSTEM-NAME
ocs-storagecluster-storagesystem   storagecluster.ocs.openshift.io/v1   ocs-storagecluster
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc delete storagesystem ocs-storagecluster-storagesystem
storagesystem.odf.openshift.io "ocs-storagecluster-storagesystem" deleted
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
db-noobaa-db-pg-0   Bound    pvc-f3a135b7-8d3c-4a8b-8e4c-29cfcd198134   50Gi       RWO            gp2            9h
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                 STORAGECLASS   REASON   AGE
pvc-f3a135b7-8d3c-4a8b-8e4c-29cfcd198134   50Gi       RWO            Delete           Bound    openshift-storage/db-noobaa-db-pg-0   gp2                     9h
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get storageclass
NAME                          PROVISIONER                       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)                 kubernetes.io/aws-ebs             Delete          WaitForFirstConsumer   true                   11h
gp2-csi                       ebs.csi.aws.com                   Delete          WaitForFirstConsumer   true                   11h
openshift-storage.noobaa.io   openshift-storage.noobaa.io/obc   Delete          Immediate              false                  9h
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get storagesyatem
error: the server doesn't have a resource type "storagesyatem"
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ for i in $(oc get node -l cluster.ocs.openshift.io/openshift-storage= -o jsonpath='{ .items[*].metadata.name }'); do oc debug node/${i} -- chroot /host  ls -l /var/lib/rook; done
Starting pod/ip-10-0-158-68us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 0

Removing debug pod ...
Starting pod/ip-10-0-164-221us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 0
drwxr-xr-x. 5 root root 129 Sep  2 13:36 openshift-storage

Removing debug pod ...
Starting pod/ip-10-0-221-123us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 0
drwxr-xr-x. 5 root root 129 Sep  2 13:36 openshift-storage

Removing debug pod ...
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc project default
Now using project "default" on server "https://api.jijoy-sep2.qe.rh-ocs.com:6443".
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted
(venv) [jijoy@localhost ocs-ci]$ 
(venv) [jijoy@localhost ocs-ci]$ oc get project openshift-storage
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating

> 
> * Part of the uninstall workflow has the user removing all OCS PVCs. While
> the documentation is careful to provide a script that ignores the NooBaa
> PVCs, can you verify this was done correctly?
Yes, this was done correctly.
> 
> * It seems a potentially related BZ
> (https://bugzilla.redhat.com/show_bug.cgi?id=2005040) was updated after the
> latest round of testing was done on this one. I also share the suspicion
> that it may be related... Since that one is ON_QA, could we also move this
> one to ON_QA?
The uninstall steps have changed. Now we delete only the StorageSystem, and the StorageCluster should be deleted automatically.
Bug 2005040 will verify whether the StorageSystem can be deleted successfully. If the complete uninstall flow is working now, this bug can also be considered fixed.

Comment 23 Mudit Agarwal 2021-11-16 07:43:12 UTC
BZ #2005040 is now in VERIFIED state, which means uninstallation is working properly.

I am moving it to ON_QA. Please re-test with the latest build and move it back to ASSIGNED if you still see the issue.

Comment 24 Anna Sandler 2021-11-16 23:06:17 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=2005040 is back to ASSIGNED; I'll wait on this verification.

Comment 25 Anna Sandler 2021-11-24 02:32:47 UTC
Verifying this, since there is no dependency on bug 2005040 and this bug is no longer seen:

[asandler@fedora ~]$ oc delete -n openshift-storage storagesystem --all --wait=true
storagesystem.odf.openshift.io "ocs-storagecluster-storagesystem" deleted
[asandler@fedora ~]$ oc get storagesystem -A
No resources found
[asandler@fedora ~]$ oc project default
Now using project "default" on server "https://api.asandler-bug.qe.rh-ocs.com:6443".
[asandler@fedora ~]$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted
[asandler@fedora ~]$ oc get project openshift-storage
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating
[asandler@fedora ~]$ oc get project openshift-storage
Error from server (NotFound): namespaces "openshift-storage" not found
[asandler@fedora ~]$ 

OCP 4.9 + ODF 4.9 on AWS