Description of problem:
While deleting the openshift-storage namespace as part of the ODF uninstall process, the namespace remained in "Terminating" status due to the presence of finalizers, as seen in the YAML output below. The StorageCluster and StorageSystem were deleted before deleting the openshift-storage namespace.

$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted

$ oc get project openshift-storage
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating

$ oc get project openshift-storage -o yaml
apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/sa.scc.mcs: s0:c26,c0
    openshift.io/sa.scc.supplemental-groups: 1000650000/10000
    openshift.io/sa.scc.uid-range: 1000650000/10000
  creationTimestamp: "2021-09-02T07:52:41Z"
  deletionTimestamp: "2021-09-02T17:37:50Z"
  labels:
    kubernetes.io/metadata.name: openshift-storage
    olm.operatorgroup.uid/76c42cb4-e5d1-446e-82e9-c121c43996f7: ""
    openshift.io/cluster-monitoring: "true"
  name: openshift-storage
  resourceVersion: "388296"
  uid: 8534b9c6-002e-4afd-8c31-d08a8250c501
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2021-09-02T17:38:07Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-09-02T17:38:07Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-09-02T17:38:34Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-09-02T17:38:07Z"
    message: 'Some resources are remaining: backingstores.noobaa.io has 1 resource
      instances, bucketclasses.noobaa.io has 1 resource instances, noobaas.noobaa.io
      has 1 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-09-02T17:38:07Z"
    message: 'Some content in the namespace has finalizers remaining: noobaa.io/finalizer
      in 2 resource instances, noobaa.io/graceful_finalizer in 1 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating

These are the finalizers which blocked the deletion of the 'openshift-storage' namespace:

$ oc get noobaa noobaa -n openshift-storage -o yaml | grep finalizer
  finalizers:
  - noobaa.io/graceful_finalizer

$ oc get backingstore noobaa-default-backing-store -n openshift-storage -o yaml | grep finalizer
  finalizers:
  - noobaa.io/finalizer

$ oc get bucketclasses.noobaa.io noobaa-default-bucket-class -n openshift-storage -o yaml | grep finalizer
  finalizers:
  - noobaa.io/finalizer

Workaround: Remove the finalizers when the namespace status is Terminating.
$ oc patch -n openshift-storage noobaa/noobaa --type=merge -p '{"metadata": {"finalizers":null}}'
noobaa.noobaa.io/noobaa patched

$ oc patch -n openshift-storage backingstore/noobaa-default-backing-store --type=merge -p '{"metadata": {"finalizers":null}}'
backingstore.noobaa.io/noobaa-default-backing-store patched

$ oc patch -n openshift-storage bucketclasses.noobaa.io/noobaa-default-bucket-class --type=merge -p '{"metadata": {"finalizers":null}}'
bucketclass.noobaa.io/noobaa-default-bucket-class patched

=================================================================================

Version-Release number of selected component (if applicable):

$ oc get csv
NAME                            DISPLAY                       VERSION        REPLACES   PHASE
noobaa-operator.v4.9.0-123.ci   NooBaa Operator               4.9.0-123.ci              Succeeded
ocs-operator.v4.9.0-123.ci      OpenShift Container Storage   4.9.0-123.ci              Succeeded
odf-operator.v4.9.0-123.ci      OpenShift Data Foundation     4.9.0-123.ci              Succeeded

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-01-193941   True        False         10h     Cluster version is 4.9.0-0.nightly-2021-09-01-193941

====================================================================================

How reproducible:
Reporting the first failure.

================================================================================

Steps to Reproduce:
1. Follow the existing steps to uninstall OCS: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/deploying_openshift_container_storage_using_amazon_web_services/assembly_uninstalling-openshift-container-storage_rhocs#uninstalling-openshift-container-storage-in-internal-mode_rhocs (testing was done on AWS)
2. After deleting the StorageCluster (after step 5 in the given doc), delete the StorageSystem.

$ oc delete storagesystem ocs-storagecluster-storagesystem
storagesystem.odf.openshift.io "ocs-storagecluster-storagesystem" deleted

$ oc get storagesyatem
error: the server doesn't have a resource type "storagesyatem"

3. Continue with the rest of the steps given in the doc.

Actual results:
Step 7 given in the documentation cannot be completed: "Delete the namespace and wait till the deletion is complete. You will need to switch to another project if openshift-storage is the active project." The openshift-storage namespace remains in Terminating state.

Expected results:
The uninstall process should succeed, and the openshift-storage namespace should be deleted.

Additional info:
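For reference, the three patch commands in the workaround can be generalized into a loop. This is a sketch only; it assumes the noobaa.io finalizers on these three kinds are the only ones blocking namespace deletion, as reported in the namespace conditions above:

# Sketch: strip finalizers from every remaining resource of the three kinds
# named in the NamespaceContentRemaining condition. Assumes no other
# finalizers are blocking deletion.
$ for kind in noobaa backingstore bucketclass; do
    for res in $(oc get ${kind} -n openshift-storage -o name); do
      oc patch -n openshift-storage ${res} --type=merge -p '{"metadata": {"finalizers":null}}'
    done
  done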
Adding the Regression keyword because the uninstall process was working in 4.8.
Reproduced in a dev environment. I can see two issues in the namespace removal scenario:
1. The operator deployment is removed before the NooBaa CR deletion is complete, leaving behind NooBaa resources with finalizers.
2. The operator's RBAC resources are removed before the NooBaa operator, causing "☠️ Panic Attack: [Unauthorized]".
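One way to guard against the first ordering issue (a sketch; it assumes the NooBaa CR is named "noobaa", as in this report) is to block until the CR is fully finalized before the operator deployment and its RBAC are torn down:

# Wait until the NooBaa CR (and its finalizers) are actually gone before
# removing the operator; times out instead of hanging forever.
$ oc wait --for=delete noobaa/noobaa -n openshift-storage --timeout=300s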
This is strange. During uninstallation we probably have a step to remove the PVC, which should unmap the RBD block device and remove it from Ceph. At this point the entire cluster is gone, but it looks like the RBD device is still there:

rbd0 252:0 0 50G 0 disk /var/lib/kubelet/pods/1798b7f3-7904-4b20-a76c-5c9d5bfbe97d/volumes/kubernetes.io~csi/pvc-c2d41602-3ddb-41fd-8ba6-9bea2a622dee/mount

Could this be a sequencing issue during deletion? Everything we are seeing here is caused by that lingering RBD device: the filesystem on top stops responding, and then Postgres hangs too.
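To check for leftover RBD mappings across the storage nodes, the same node-debug loop used elsewhere in this bug can be reused (a sketch; the label selector is the one OCS applies to storage nodes):

# List any rbd block devices still mapped on each storage node.
$ for i in $(oc get node -l cluster.ocs.openshift.io/openshift-storage= -o jsonpath='{ .items[*].metadata.name }'); do
    oc debug node/${i} -- chroot /host lsblk | grep rbd
  done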
After discussing this issue with Sebastien, he thinks this is more of an ocs-operator bug than a rook one. Changing the component to ocs-operator.
Providing the dev ack; we are still looking for the RCA.
Reviewing this as best I could, I can only come up with a few thoughts but no solutions:

* In following the uninstall documentation, what was the state of the cluster when you did step 7, "Delete the namespace and wait till the deletion is complete."? Ideally there should have been only operator and CSI pods present. If any other pods were still running, that means things did not resolve correctly.

* Part of the uninstall workflow has the user removing all OCS PVCs. While the documentation is careful to provide a script that ignores the NooBaa PVCs, can you verify this was done correctly? (See the sketch after this list for one way to check.)

* It seems a potentially related BZ (https://bugzilla.redhat.com/show_bug.cgi?id=2005040) was updated after the latest round of testing was done on this one. I also share the suspicion that it may be related... Since that one is ON_QA, could we also move this one to ON_QA?
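As a sketch of the PVC check mentioned above (the "ocs-storagecluster" storage class prefix and the "noobaa" name filter are assumptions based on the defaults seen in this bug, not taken from the documented script):

# List PVCs still bound to OCS storage classes, ignoring the NooBaa DB PVC.
$ oc get pvc --all-namespaces -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,SC:.spec.storageClassName' \
    | grep ocs-storagecluster | grep -v noobaa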
(In reply to Jose A. Rivera from comment #21)
> * In following the uninstall documentation, what was the state of the
> cluster when you did step 7 "Delete the namespace and wait till the deletion
> is complete."? Ideally there should have been only operator and CSI Pods
> present. If any other Pods were still running that means things did not
> resolve correctly.

CSI, operator, and NooBaa pods were present. Adding here the steps of deleting the StorageCluster, the StorageSystem, and the openshift-storage namespace. These steps are captured from the initial reproducer of the issue given in comment #0.

(venv) [jijoy@localhost ocs-ci]$ oc delete -n openshift-storage storagecluster --all --wait=true
storagecluster.ocs.openshift.io "ocs-storagecluster" deleted

(venv) [jijoy@localhost ocs-ci]$ oc get pods -n openshift-storage | grep -i cleanup
cluster-cleanup-job-ip-10-0-158-68.us-east-2.compute.i--1-j6pp8   0/1   Completed   0   29s

(venv) [jijoy@localhost ocs-ci]$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS      AGE
cluster-cleanup-job-ip-10-0-158-68.us-east-2.compute.i--1-j6pp8   0/1     Completed   0             57s
csi-cephfsplugin-8hwqz                                            3/3     Running     0             3h56m
csi-cephfsplugin-9mk5q                                            3/3     Running     0             3h56m
csi-cephfsplugin-provisioner-8546f775c4-q7tnk                     6/6     Running     0             3h56m
csi-cephfsplugin-provisioner-8546f775c4-xwftd                     6/6     Running     0             3h56m
csi-cephfsplugin-snj9w                                            3/3     Running     0             3h56m
csi-rbdplugin-24sl8                                               3/3     Running     0             3h56m
csi-rbdplugin-5rcfb                                               3/3     Running     0             3h56m
csi-rbdplugin-provisioner-59dbd44fdd-hxlq2                        6/6     Running     0             3h56m
csi-rbdplugin-provisioner-59dbd44fdd-w8mz8                        6/6     Running     0             3h56m
csi-rbdplugin-shfrc                                               3/3     Running     0             3h56m
noobaa-core-0                                                     1/1     Running     0             3h53m
noobaa-db-pg-0                                                    1/1     Running     0             3h53m
noobaa-endpoint-9649f7f74-fhgth                                   1/1     Running     0             3h53m
noobaa-operator-6c4f6fcfb8-wkkb9                                  1/1     Running     1 (62s ago)   9h
ocs-metrics-exporter-564f89d788-6fkj5                             1/1     Running     0             9h
ocs-operator-7c9fcf7d74-chxbp                                     1/1     Running     0             9h
odf-console-7c6fd85bcf-ftxl4                                      2/2     Running     0             9h
odf-operator-controller-manager-55dcf859f9-sfdqz                  2/2     Running     0             9h
rook-ceph-operator-847c7bc6f4-f7lfg                               1/1     Running     0             9h

(venv) [jijoy@localhost ocs-ci]$ oc get pods -n openshift-storage | grep -i cleanup
cluster-cleanup-job-ip-10-0-158-68.us-east-2.compute.i--1-j6pp8   0/1   Completed   0   3m14s

(venv) [jijoy@localhost ocs-ci]$ oc get storagecluster
No resources found in openshift-storage namespace.
(venv) [jijoy@localhost ocs-ci]$ oc get storagesystem
NAME                               STORAGE-SYSTEM-KIND                  STORAGE-SYSTEM-NAME
ocs-storagecluster-storagesystem   storagecluster.ocs.openshift.io/v1   ocs-storagecluster

(venv) [jijoy@localhost ocs-ci]$ oc delete storagesystem ocs-storagecluster-storagesystem
storagesystem.odf.openshift.io "ocs-storagecluster-storagesystem" deleted

(venv) [jijoy@localhost ocs-ci]$ oc get pvc
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
db-noobaa-db-pg-0   Bound    pvc-f3a135b7-8d3c-4a8b-8e4c-29cfcd198134   50Gi       RWO            gp2            9h

(venv) [jijoy@localhost ocs-ci]$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                 STORAGECLASS   REASON   AGE
pvc-f3a135b7-8d3c-4a8b-8e4c-29cfcd198134   50Gi       RWO            Delete           Bound    openshift-storage/db-noobaa-db-pg-0   gp2                     9h

(venv) [jijoy@localhost ocs-ci]$ oc get storageclass
NAME                          PROVISIONER                       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
gp2 (default)                 kubernetes.io/aws-ebs             Delete          WaitForFirstConsumer   true                   11h
gp2-csi                       ebs.csi.aws.com                   Delete          WaitForFirstConsumer   true                   11h
openshift-storage.noobaa.io   openshift-storage.noobaa.io/obc   Delete          Immediate              false                  9h

(venv) [jijoy@localhost ocs-ci]$ oc get storagesyatem
error: the server doesn't have a resource type "storagesyatem"

(venv) [jijoy@localhost ocs-ci]$ for i in $(oc get node -l cluster.ocs.openshift.io/openshift-storage= -o jsonpath='{ .items[*].metadata.name }'); do oc debug node/${i} -- chroot /host ls -l /var/lib/rook; done
Starting pod/ip-10-0-158-68us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 0
Removing debug pod ...
Starting pod/ip-10-0-164-221us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 0
drwxr-xr-x. 5 root root 129 Sep  2 13:36 openshift-storage
Removing debug pod ...
Starting pod/ip-10-0-221-123us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 0
drwxr-xr-x. 5 root root 129 Sep  2 13:36 openshift-storage
Removing debug pod ...

(venv) [jijoy@localhost ocs-ci]$ oc project default
Now using project "default" on server "https://api.jijoy-sep2.qe.rh-ocs.com:6443".

(venv) [jijoy@localhost ocs-ci]$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted

(venv) [jijoy@localhost ocs-ci]$ oc get project openshift-storage
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating

> * Part of the uninstall workflow has the user removing all OCS PVCs. While
> the documentation is careful to provide a script that ignores the NooBaa
> PVCs, can you verify this was done correctly?

Yes, this was done correctly.

> * It seems a potentially related BZ
> (https://bugzilla.redhat.com/show_bug.cgi?id=2005040) was updated after the
> latest round of testing was done on this one. I also share the suspicion
> that it may be related... Since that one is ON_QA, could we also move this
> one to ON_QA?

The uninstall steps have changed. Now we delete only the StorageSystem, and the StorageCluster should be deleted automatically. Bug 2005040 will verify whether the StorageSystem can be deleted successfully. If the complete uninstall flow is working now, this bug can also be considered fixed.
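For reference when re-testing: the namespace's deletion conditions identify the blocking resources and finalizers directly, as seen in comment #0. A sketch for pulling out only the failing conditions (standard jsonpath; the only assumption is the namespace name):

$ oc get project openshift-storage -o jsonpath='{range .status.conditions[?(@.status=="True")]}{.type}: {.message}{"\n"}{end}'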
BZ #2005040 is now in VERIFIED state, which means uninstallation is working properly. I am moving this bug to ON_QA; please re-test with the latest build and move it back to ASSIGNED if you still see the issue.
https://bugzilla.redhat.com/show_bug.cgi?id=2005040 is back to ASSIGNED. I'll wait with this verification.
Verifying this due to the fact that there is no dependency on bug 2005040 and this bug is not seen anymore.

[asandler@fedora ~]$ oc delete -n openshift-storage storagesystem --all --wait=true
storagesystem.odf.openshift.io "ocs-storagecluster-storagesystem" deleted

[asandler@fedora ~]$ oc get storagesystem -A
No resources found

[asandler@fedora ~]$ oc project default
Now using project "default" on server "https://api.asandler-bug.qe.rh-ocs.com:6443".

[asandler@fedora ~]$ oc delete project openshift-storage --wait=true --timeout=5m
project.project.openshift.io "openshift-storage" deleted

[asandler@fedora ~]$ oc get project openshift-storage
NAME                DISPLAY NAME   STATUS
openshift-storage                  Terminating

[asandler@fedora ~]$ oc get project openshift-storage
Error from server (NotFound): namespaces "openshift-storage" not found

OCP 4.9 + ODF 4.9 on AWS