Description of problem
======================

It's not possible to uninstall an ODF StorageSystem via the OCP Console web UI. This is a regression compared to OCS 4.8.

Version-Release number of selected component
============================================

OCP 4.9.0-0.nightly-2021-09-14-200602
LSO 4.9.0-202109132154
ODF 4.9.0-139.ci

How reproducible
================

2/2

Steps to Reproduce
==================

1. Install OCP cluster.
2. Install OCS/ODF (OpenShift Data Foundation) operator.
3. Install LSO operator.
4. Start the "Create a StorageSystem" wizard in the OCP Console web UI and complete the process.
5. Wait for the StorageSystem to be installed.
6. Initiate removal of the StorageSystem via the OCP Console.

Actual results
==============

Uninstallation fails, the StorageSystem gets stuck in Terminating state (see screenshot #1).

Expected results
================

Uninstallation finishes with success.

Additional info
===============

This is declared as a test blocker since it slows down testing in a significant way: instead of a quick uninstallation of the StorageSystem, we have to remove the whole cluster to be able to retry the StorageSystem installation or test a StorageSystem with a different configuration.

Please suggest a workaround which mitigates this problem (a quick and reliable way to remove the storage system) so that the test blocker status can be dropped.
StorageSystem deletion waits on the StorageCluster being deleted. If the StorageCluster still exists, the StorageSystem cannot be deleted at all without removing finalizers. So the question is: does the StorageCluster still exist? Was uninstalling a StorageCluster working in the previous release with the same type of deployment?
Ah, it seems that the root cause behind this problem is a bit more severe and complex, and that some design overhaul will be necessary. Let me explain what I mean:

In OCS 4.8, the operator clearly advertised the StorageCluster CRD as the most important API it provides, and when you wanted to actually install an OCS-managed Ceph cluster, you used the "Create StorageCluster" button to start the "Create StorageCluster" wizard. The StorageCluster resource was then clearly present in the OCS operator UI. If you decided to uninstall the cluster later, you used the "remove" operation on the StorageCluster resource.

In ODF 4.9, the operator advertises the StorageSystem CRD as the only API it provides, and when you want to have a cluster installed, you do so via the "Create StorageSystem" button and wizard. When the installation finishes, you have a StorageSystem resource clearly presented in the UI instead of a StorageCluster. One would expect this resource to be the main point for the user to understand the overall status of the storage system, and one would also assume it's the right place to uninstall it. While the StorageCluster is still there, it's basically hidden away so that the user won't see it via the UI (see e.g. BZ 2004030).

So it seems that the design of the new CRDs and the UI doesn't align well. The question is how we want to resolve it:

- Do we want to have StorageCluster as the main resource for the storage admin to work with? Then we need to change the way the StorageCluster CRD works and make it a better representation of the storage system as a whole.
- Do we want the storage admin to understand the components of a StorageSystem, so that it naturally occurs to them that to uninstall it, one needs to remove the StorageCluster first? Then we need to redesign most of the UI and make sure that we don't allow the delete operation on the StorageSystem CR (via k8s validation).

IMHO the 1st option makes more sense, but I don't see the whole picture here at this moment.
For context, I have to emphasize that the current backend design was taking into consideration scenarios outside of the UI, especially ones of headless automation. However, the StorageSystem CR is almost entirely a convenience API for the Console to deal with both OCS and IBM with some level of abstraction, deriving from multiple high-level discussions between product owners many months ago. It provides basically no technical benefit otherwise. As such, I'm open to doing what we can to make its interactions with the UI more agreeable, as long as the overall design doesn't suffer because of it.

This product has a history, from its inception, of designing its UI to abstract and hide. IMO it's basically a race to the bottom of "what's the absolutely bare minimum amount of information we need to expose", leading to all sorts of headaches and conversations around vague definitions of actual users. The direction this BZ has taken is a perfect example of this.

All that said, at this point this is more a product decision than a technical one. We're currently stuck with whatever UI is in the actual Console itself, but if I remember right everything outside the installation wizard will be coming in our dynamic plugin, so we still have some time and flexibility there. With that in mind we have to get a decision on just how much we want to lean on StorageSystems or StorageClusters as the primary interface for our UI customers. Obviously for existing customers upgrading this may be something of a jump, so that also needs to be considered.

Honestly, my ideal would be to not rely on any CRD-based interface at all, as it would give us much more flexibility in terms of displayed names and placements of a variety of UI elements, but I'm pretty sure that's out of the question for this release.
I can confirm that the suggested workaround (removing StorageCluster resource before removing StorageSystem) works fine. Dropping TestBlocker keyword.
I can imagine that the discussion leading towards the current design was not easy nor straightforward. That said, it seems to me that we need to consider keeping the StorageCluster CR visible in the new ODF UI. It would be really nice if we could do this by tweaking the UI code we ship in the operator, as Jose pointed out. There are other problems, such as BZ 2005014, which are caused by the redesign.
(In reply to Martin Bukatovic from comment #8)
> I can confirm that the suggested workaround (removing StorageCluster
> resource before removing StorageSystem) works fine.
>
> Dropping TestBlocker keyword.

The datapoint above was observed during removal of a storage cluster CR which got stuck in Error state during installation. When I try the scenario with a successfully installed cluster, I get the StorageCluster CR stuck in Deleting phase.

Command `oc describe storagecluster` reports:

```
Events:
  Type     Reason            Age  From                        Message
  ----     ------            ---  ----                        -------
  Warning  UninstallPending  15m  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  15m  controller_storagecluster  Uninstall: Waiting for Ceph RGW Route storagecluster-cephobjectstore to be deleted
  Warning  UninstallPending  15m  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  15m  controller_storagecluster  uninstall: Waiting for CephObjectStore storagecluster-cephobjectstore to be deleted
```

Has something changed? I was able to remove a StorageCluster this way before.
(In reply to Martin Bukatovic from comment #8)
> I can confirm that the suggested workaround (removing StorageCluster
> resource before removing StorageSystem) works fine.
>
> Dropping TestBlocker keyword.

I think I was not clear enough, let me explain again. I did not suggest removing the StorageCluster resource before removing the StorageSystem. odf-operator itself issues the delete and waits for the StorageCluster to be deleted, so no manual deletion of the StorageCluster is required. It works the same way CephCluster deletion works in 4.8: deleting the StorageCluster triggers the CephCluster deletion, and the StorageCluster waits for the CephCluster to be deleted. Now we just have one more layer on top, which is the StorageSystem. I hope that clears all doubts regarding the uninstall.

By removing finalizers I meant the same thing we do for the CephCluster if something is not right, in this case for StorageCluster deletion.

(In reply to Martin Bukatovic from comment #10)
> Has something changed? I was able to remove StorageCluster this way before.

On the ocs-operator side we only changed the order of CephCluster deletion, but on the Rook side a lot has changed. Blaine can help you with the Rook questions.

ocs-operator PR that changed the order: https://github.com/red-hat-storage/ocs-operator/pull/1293
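For reference, a minimal sketch of the finalizer-removal escape hatch mentioned above, assuming the default resource name ocs-storagecluster and the openshift-storage namespace. This is a last resort for test environments only, since it skips the operator's cleanup and may leave Ceph resources behind:

```
# Inspect the finalizers currently set on the StorageCluster (assumed name).
$ oc get storagecluster ocs-storagecluster -n openshift-storage \
    -o jsonpath='{.metadata.finalizers}{"\n"}'

# Drop the finalizers so a pending delete can complete; anything the skipped
# cleanup would have removed has to be cleaned up separately afterwards.
$ oc patch storagecluster ocs-storagecluster -n openshift-storage \
    --type=merge -p '{"metadata": {"finalizers": null}}'
```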
Hmm... it seems this BZ has somewhat evolved.

Martin, can you explicitly outline the testing process you're using, including any UI actions or CLI commands? If simply doing something like `oc delete storagecluster` isn't working, that is probably worth investigating via a must-gather.
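For completeness, a minimal sketch of collecting that data; `<odf-must-gather-image>` is a placeholder for whichever ODF/OCS must-gather image matches the installed 4.9 build:

```
# Collect generic OCP diagnostic data, then the ODF-specific data.
$ oc adm must-gather --dest-dir=./must-gather-default
$ oc adm must-gather --image=<odf-must-gather-image> --dest-dir=./must-gather-odf
```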
(In reply to Jose A. Rivera from comment #12)
> Hmm... it seems this BZ has somewhat evolved.
>
> Martin, can you explicitly outline the testing process you're using,
> including any UI actions or CLI commands? If simply doing something like `oc
> delete storagecluster` isn't working, that is probably worth investigating
> via a must-gather.

The reproducer from the bug description (and the must-gather referenced in comment 4) still applies. Sorry for the confusion on my side (which Nitin cleared up in comment 11).

I noticed the problem from comment 10 when I tried an invalid procedure (I misunderstood Nitin's comment). I'm not sure whether it's related to the actual bug as reported here (that would be visible in a must-gather, though) and whether it's worth chasing that use case as well (in a separate BZ, maybe).
Providing QE ack based on a triage meeting held on 2021-09-21. We have agreement that removal of the StorageSystem CR should remove the cluster; this bug should stay focused on uninstallation.
The same behaviour can be observed when one tries to remove the storage cluster without the OCP Console. The delete request gets stuck, the storage cluster moves into Deleting phase, but nothing is actually removed:

```
$ oc delete storagecluster/ocs-storagecluster -n openshift-storage
storagecluster.ocs.openshift.io "ocs-storagecluster" deleted
^C
$ oc get storagecluster -n openshift-storage
NAME                 AGE   PHASE      EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   15m   Deleting              2021-10-04T14:08:34Z   4.9.0
```

Checking details via `oc describe` shows the same set of warnings:

```
$ oc describe storagecluster -n openshift-storage | tail -5
  Normal   CreationSucceeded  21m    StorageCluster controller   StorageSystem ocs-storagecluster-storagesystem created for the StorageCluster ocs-storagecluster.
  Warning  UninstallPending   9m12s  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending   9m12s  controller_storagecluster  Uninstall: Waiting for Ceph RGW Route ocs-storagecluster-cephobjectstore to be deleted
  Warning  UninstallPending   9m11s  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending   9m10s  controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
```

I was using 4.9.0-164.ci.
Additional information
======================

When I tried to list the items controller_storagecluster is waiting on to be deleted, I noticed one discrepancy:

- controller_storagecluster is waiting for CephObjectStoreUser/ocs-storagecluster-cephobjectstoreuser to be deleted
- but I see that there is CephObjectStoreUser/noobaa-ceph-objectstore-user instead

```
$ oc get CephObjectStoreUser -n openshift-storage
NAME                           AGE
noobaa-ceph-objectstore-user   37m
```

I'm not sure if this is expected or not; it could be unrelated to the problem I see here.
Reattaching TestBlocker keyword, as there is no workaround other than removal and reinstallation of the whole OpenShift cluster.
Asking Nimrod whether it would be possible to come up with some workaround, which would allow a storage cluster to be removed.
Manually, we can create a different backing store (BS) and remove the one on top of RGW (assuming that backing store was not used for buckets; if it was, those buckets need to be deleted first); see the sketch below.

As a code fix, when NooBaa goes down it will also (or at least should) remove the objectstoreuser. From the comments above it seems like NooBaa is stuck in uninstall as well (or have I mixed up the real repro and the non-repro scenarios?).
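A rough sketch of what the manual path could look like: creating a replacement pv-pool BackingStore so the RGW-backed one can be retired. The name, sizing, and storage class below are illustrative assumptions, and the default BucketClass would still have to be re-pointed at the new backing store before the old one can be removed:

```
# Create a replacement pv-pool backing store (names and sizing are illustrative).
$ cat <<EOF | oc apply -f -
apiVersion: noobaa.io/v1alpha1
kind: BackingStore
metadata:
  name: replacement-backing-store
  namespace: openshift-storage
spec:
  type: pv-pool
  pvPool:
    numVolumes: 3
    storageClass: ocs-storagecluster-ceph-rbd
    resources:
      requests:
        storage: 50Gi
EOF
```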
According to the NooBaa operator logs provided by Martin in the must gather, during uninstall:

1. The NooBaa CR was removed: Not Found: NooBaa \"noobaa\"
2. The CephCluster CR continues to exist, also after the NooBaa CR removal: Exists: \"ocs-storagecluster-cephcluster\"
3. The NooBaa operator watches the CephCluster CR in order to react to Ceph cluster capacity changes. This controller was added in PR 511, https://github.com/noobaa/noobaa-operator/pull/511

I am not sure about the flow: is it expected that the CephCluster CR continues to exist after the NooBaa CR removal?
There might be a race condition with regard to `noobaa-ceph-objectstore-user` creation during system uninstall.

I pushed a NooBaa operator image based on https://github.com/noobaa/noobaa-operator/pull/755 to: quay.io/baum/noobaa-operator:bz_2005040_Oct_14_2021

It would be interesting to know whether the termination issue is reproducible with this change.
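In case it helps with reproducing, a rough sketch of pointing the running noobaa-operator Deployment at that test image; the container name is an assumption, and OLM may revert the change unless the operator's CSV is edited instead:

```
# Swap the noobaa-operator image for the test build (container name assumed).
$ oc set image deployment/noobaa-operator -n openshift-storage \
    noobaa-operator=quay.io/baum/noobaa-operator:bz_2005040_Oct_14_2021

# Watch the operator roll out with the new image.
$ oc rollout status deployment/noobaa-operator -n openshift-storage
```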
Verifying with:
- OCP 4.9.0-0.nightly-2021-10-19-063835
- LSO 4.9.0-202110012022
- ODF 4.9.0-193.ci

I tried to initiate storagecluster removal, and I see that after about 5 minutes, the cluster is stuck in Terminating state:

```
$ oc get storagecluster -n openshift-storage
NAME                 AGE    PHASE      EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   129m   Deleting              2021-10-19T16:56:56Z   4.9.0
```

Output of oc describe shows the following events:

```
Phase:  Deleting
Related Objects:
  API Version:       ceph.rook.io/v1
  Kind:              CephCluster
  Name:              ocs-storagecluster-cephcluster
  Namespace:         openshift-storage
  Resource Version:  114601
  UID:               47e67e29-b8c9-44b3-823c-2205fa412b8e
  API Version:       noobaa.io/v1alpha1
  Kind:              NooBaa
  Name:              noobaa
  Namespace:         openshift-storage
  Resource Version:  114984
  UID:               3f840ecd-e358-40cc-aac8-11e19b9dc899
Events:
  Type     Reason            Age  From                        Message
  ----     ------            ---  ----                        -------
  Warning  UninstallPending  20m  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  20m  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  20m  controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
```

Compared to the comment 20, there is one less warning, but otherwise the problem is still here.

List of pods running:

```
$ oc get pods -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS       AGE
csi-cephfsplugin-7ckrs                                            3/3     Running     0              117m
csi-cephfsplugin-cg84n                                            3/3     Running     0              117m
csi-cephfsplugin-jqsjq                                            3/3     Running     0              117m
csi-cephfsplugin-provisioner-5b4d988899-mg8pj                     6/6     Running     0              117m
csi-cephfsplugin-provisioner-5b4d988899-nvqsr                     6/6     Running     0              117m
csi-cephfsplugin-s252p                                            3/3     Running     0              117m
csi-cephfsplugin-wv572                                            3/3     Running     0              117m
csi-cephfsplugin-zh5lt                                            3/3     Running     0              117m
csi-rbdplugin-bgjjq                                               3/3     Running     0              117m
csi-rbdplugin-bql8h                                               3/3     Running     0              117m
csi-rbdplugin-fpwr7                                               3/3     Running     0              117m
csi-rbdplugin-hlq4p                                               3/3     Running     0              117m
csi-rbdplugin-nhmlt                                               3/3     Running     0              117m
csi-rbdplugin-provisioner-676987456c-s2cg7                        6/6     Running     0              117m
csi-rbdplugin-provisioner-676987456c-tzd6h                        6/6     Running     0              117m
csi-rbdplugin-vzldh                                               3/3     Running     0              117m
noobaa-operator-5895464d68-hgmht                                  1/1     Running     0              121m
ocs-metrics-exporter-6b8887d6ff-wqvnj                             1/1     Running     0              121m
ocs-operator-84cbfbcc97-7sc9p                                     1/1     Running     0              121m
odf-console-797d6f968f-8ljbf                                      1/1     Running     0              122m
odf-operator-controller-manager-58849f95c7-6q2dh                  2/2     Running     1 (120m ago)   122m
rook-ceph-crashcollector-compute-0-5df8f8fd78-zbz8x               1/1     Running     0              115m
rook-ceph-crashcollector-compute-1-c5b68564c-glhqf                1/1     Running     0              115m
rook-ceph-crashcollector-compute-2-7b649f48fb-jbv8t               1/1     Running     0              116m
rook-ceph-crashcollector-compute-3-85655499c9-7vchr               1/1     Running     0              116m
rook-ceph-crashcollector-compute-4-6d65d9d48b-mz6cn               1/1     Running     0              116m
rook-ceph-crashcollector-compute-5-6459f445f-8d5zd                1/1     Running     0              115m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5ff896dcbk2g2   2/2     Running     0              115m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c784564bk7cf   2/2     Running     0              115m
rook-ceph-mgr-a-7b9bf457fd-j9vss                                  2/2     Running     0              116m
rook-ceph-mon-a-655b6888f6-x9g64                                  2/2     Running     0              117m
rook-ceph-mon-b-667f56f7d5-vl2lt                                  2/2     Running     0              116m
rook-ceph-mon-c-697c784ccc-pd5sb                                  2/2     Running     0              116m
rook-ceph-operator-6bf667c8cf-zb4t6                               1/1     Running     0              121m
rook-ceph-osd-0-5f65c96569-5p89p                                  2/2     Running     0              116m
rook-ceph-osd-1-7545dbd854-j2rdh                                  2/2     Running     0              116m
rook-ceph-osd-2-5965dfc57d-tc9pn                                  2/2     Running     0              116m
rook-ceph-osd-3-54d9f9bb9c-rzzsm                                  2/2     Running     0              116m
rook-ceph-osd-4-c67b555dc-r97n7                                   2/2     Running     0              116m
rook-ceph-osd-5-74dbd866c4-6klp9                                  2/2     Running     0              115m
rook-ceph-osd-6-755c76cf77-5kh7j                                  2/2     Running     0              115m
rook-ceph-osd-7-85c8dd5b64-jv59v                                  2/2     Running     0              115m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-fwp5l   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-lql6x   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-0-data---1-zv5h8   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data---1-qfdtk   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data---1-vx8c5   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-1-data---1-xpj8m   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data---1-gcdb5   0/1     Completed   0              116m
rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data---1-qwbht   0/1     Completed   0              112m
rook-ceph-osd-prepare-ocs-deviceset-localblock-2-data---1-wbsvg   0/1     Completed   0              116m
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-59b5695sv9qt   2/2     Running     0              115m
rook-ceph-tools-77cb57894f-vl4pv                                  1/1     Running     0              115m
```

>>> ASSIGNED
At a quick glance, it seems that one bucket is still present, "nb.1634662874470.apps.mbukatov-1019b.qe.rh-ocs.com"; that's the reason why the ceph-object-store-user-controller failed to remove the user and the CephObjectStoreUser CR.

See the error:

```
2021-10-19T18:47:21.747629646Z 2021-10-19 18:47:21.747556 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user".: BucketAlreadyExists tx00000000000000000059e-00616f12b9-3a60-ocs-storagecluster-cephobjectstore 3a60-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
```

So the bucket must be cleaned up in order to proceed with the deletion.

Assigning to Blaine since this is related to the recent "dependent" patch.

Martin, where is this bucket coming from? An OBC? Thanks
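For anyone who needs to unblock a stuck test cluster in the meantime, a minimal sketch of cleaning up the leftover bucket from the rook-ceph-tools pod; the bucket name is taken from the error above, and this obviously destroys whatever data is in that bucket:

```
# Open a shell in the toolbox pod.
$ oc rsh -n openshift-storage deploy/rook-ceph-tools

# List RGW buckets to confirm the leftover NooBaa bucket is there.
sh-4.4$ radosgw-admin bucket list

# Remove the bucket together with its objects (destructive).
sh-4.4$ radosgw-admin bucket rm \
    --bucket=nb.1634662874470.apps.mbukatov-1019b.qe.rh-ocs.com --purge-objects
```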
I'm fine keeping it since I started the initial evaluation of the bug.
Hi Sebastien, this bucket is created by NooBaa to use as the default backing store. It is created via the S3 API and not via an OBC. As far as I remember, by design, NooBaa does not delete the buckets/containers it uses as storage for backing stores (regardless of the type: AWS, Azure, RGW, etc.).
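To double-check where the bucket comes from on a given cluster, something like the following should show the default backing store pointing at the RGW bucket; the `.spec.s3Compatible.targetBucket` path is my assumption about how the RGW-backed default backing store is represented:

```
# Show the type and target bucket of the default backing store.
$ oc get backingstore noobaa-default-backing-store -n openshift-storage \
    -o jsonpath='{.spec.type}{" "}{.spec.s3Compatible.targetBucket}{"\n"}'
```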
(In reply to Sébastien Han from comment #32) > Martin, where is this bucket coming from? OBC? I haven't created any buckets myself. It is created by NooBaa as explained by Danny in comment 34.
(In reply to Martin Bukatovic from comment #35)
> (In reply to Sébastien Han from comment #32)
> > Martin, where is this bucket coming from? OBC?
>
> I haven't created any buckets myself. It is created by NooBaa as explained
> by Danny in comment 34.

Ok, thanks Martin and Danny. It looks like we are catching this now because Rook has become more protective of the resources it creates.

If we want to force the deletion, the finalizer must be removed, or NooBaa should remove the bucket during uninstallation. José, is this something ocs-op could do (remove the finalizer)?
Seb, Blaine, and I discussed this and we don't see that Rook upstream should accommodate the forced uninstall case with a setting in the CR. The finalizers are the protection for the cluster. If the protection is not desired, the finalizers should be force removed, which means the OCS operator really is the only place this could happen.
We don't want to put in place mechanisms for upstream administrators to potentially destroy their data accidentally. We have put a lot of design and intention into Rook to have better default-safe behaviors for user data.

After discussion between myself, Jose, and Travis, we found that we can make an uninstallation optimization to meet the needs expressed here:

If:    we are deleting a CephObjectStore (could be any dependent resource, but let's keep this as the example)
and:   the CephCluster has yes-really-destroy-data set
and:   the CephCluster has a nonzero deletion timestamp
then:  we can treat deletion of the CephObjectStore as though the CephCluster doesn't exist, because we can be pretty sure it won't exist very soon (i.e., just delete the CephObjectStore resource).

I think we will want to track this BZ for both ocs-operator and Rook, since both will need to implement some changes. Rook implements the logic above, and ocs-operator needs to delete the CephCluster (with yes-really-destroy-data) and all other resources at the same time in order for Rook to proceed with deletion. I'll move this BZ to Rook for now.
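A quick way to check whether a live cluster actually satisfies those three conditions (a sketch, assuming the default resource names and namespace):

```
# 1. and 2.: cleanupPolicy confirmation and deletion timestamp on the CephCluster.
$ oc get cephcluster ocs-storagecluster-cephcluster -n openshift-storage \
    -o jsonpath='cleanupPolicy: {.spec.cleanupPolicy.confirmation}{"\n"}deletionTimestamp: {.metadata.deletionTimestamp}{"\n"}'

# 3.: deletion timestamp on the CephObjectStore.
$ oc get cephobjectstore ocs-storagecluster-cephobjectstore -n openshift-storage \
    -o jsonpath='deletionTimestamp: {.metadata.deletionTimestamp}{"\n"}'
```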
Verifying this bug by the updated 4.9 uninstall flow, after deleting finalizers and dependent resources, using the command "oc delete -n openshift-storage storagesystem --all --wait=true".

storagecluster and storagesystem were deleted successfully and were not stuck on terminating.

Tested on OCP 4.9 on AWS:

openshift-storage   mcg-operator.v4.9.0   NooBaa Operator               4.9.0   Succeeded
openshift-storage   ocs-operator.v4.9.0   OpenShift Container Storage   4.9.0   Succeeded
openshift-storage   odf-operator.v4.9.0   OpenShift Data Foundation     4.9.0   Succeeded
Verifying this bug by the updated 4.9 uninstall flow, after deleting finalizers and dependent resources; deleted the storagesystem using the UI.

storagecluster and storagesystem were deleted successfully and were not stuck on terminating. The process can be seen in the added attachment.

Tested on OCP 4.9 on AWS:

openshift-storage   mcg-operator.v4.9.0   NooBaa Operator               4.9.0   Succeeded
openshift-storage   ocs-operator.v4.9.0   OpenShift Container Storage   4.9.0   Succeeded
openshift-storage   odf-operator.v4.9.0   OpenShift Data Foundation     4.9.0   Succeeded
The bug status was moved to ON_QA by the errata system. We hit this issue again in a VMware platform internal mode cluster, so I am changing the bug status back to MODIFIED. @Mudit FYI, there are other bugs where the status is changed by the errata system.
Hit this issue as mentioned in Comment 14. Details and logs will be shared in next comment.
The command to delete the storagesystem is not completing because the storagesystem is waiting for the storagecluster to get deleted.

```
$ oc delete -n openshift-storage storagesystem --all --wait=true
storagesystem.odf.openshift.io "storagesystem-odf" deleted
```

Storagecluster is not getting deleted.

```
$ oc get storagecluster
NAME                 AGE   PHASE      EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   28h   Deleting              2021-11-15T05:42:30Z   4.9.0
```

Events from Storagecluster ocs-storagecluster:

```
Events:
  Type     Reason            Age  From                        Message
  ----     ------            ---  ----                        -------
  Warning  UninstallPending  31m  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  31m  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  31m  controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
```

storagecluster is not deleted due to the presence of CephObjectStoreUser and CephObjectStore.

```
$ oc get CephObjectStoreUser
NAME                           AGE
noobaa-ceph-objectstore-user   28h

$ oc get CephObjectStore
NAME                                 AGE
ocs-storagecluster-cephobjectstore   28h
```

Events from CephObjectStore ocs-storagecluster-cephobjectstore:

```
Events:
  Type     Reason           Age  From                         Message
  ----     ------           ---  ----                         -------
  Warning  ReconcileFailed  34m  rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
```

Tested in version:
4.9.0-237.ci
4.10.0-0.nightly-2021-11-14-184249

Platform is VMware. Internal mode cluster.
Adding to comment #46

```
$ oc get cephobjectstoreuser -o yaml
apiVersion: v1
items:
- apiVersion: ceph.rook.io/v1
  kind: CephObjectStoreUser
  metadata:
    creationTimestamp: "2021-11-15T05:46:29Z"
    deletionGracePeriodSeconds: 0
    deletionTimestamp: "2021-11-16T09:38:15Z"
    finalizers:
    - cephobjectstoreuser.ceph.rook.io
    generation: 2
    name: noobaa-ceph-objectstore-user
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: noobaa.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: NooBaa
      name: noobaa
      uid: ead94976-a44e-4ab1-b76f-8acc127f9d41
    resourceVersion: "1228781"
    uid: b298e368-79eb-4b68-ac10-936bf9db9c80
  spec:
    displayName: my display name
    store: ocs-storagecluster-cephobjectstore
  status:
    info:
      secretName: rook-ceph-object-user-ocs-storagecluster-cephobjectstore-noobaa-ceph-objectstore-user
    phase: Ready
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

```
$ oc describe cephobjectstore
Name:         ocs-storagecluster-cephobjectstore
Namespace:    openshift-storage
Labels:       <none>
Annotations:  <none>
API Version:  ceph.rook.io/v1
Kind:         CephObjectStore
Metadata:
  Creation Timestamp:             2021-11-15T05:42:31Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2021-11-16T09:38:16Z
  Finalizers:
    cephobjectstore.ceph.rook.io
  Generation:  2
  Managed Fields:
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"c75b415e-13fa-40fc-8e4c-0bbabdf62275"}:
      f:spec:
        .:
        f:dataPool:
          .:
          f:compressionMode:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:failureDomain:
          f:mirroring:
          f:quotas:
          f:replicated:
            .:
            f:replicasPerFailureDomain:
            f:size:
            f:targetSizeRatio:
          f:statusCheck:
            .:
            f:mirror:
        f:gateway:
          .:
          f:instances:
          f:placement:
            .:
            f:nodeAffinity:
              .:
              f:requiredDuringSchedulingIgnoredDuringExecution:
                .:
                f:nodeSelectorTerms:
            f:podAntiAffinity:
              .:
              f:preferredDuringSchedulingIgnoredDuringExecution:
              f:requiredDuringSchedulingIgnoredDuringExecution:
            f:tolerations:
          f:port:
          f:priorityClassName:
          f:resources:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
          f:securePort:
          f:service:
            .:
            f:annotations:
              .:
              f:service.beta.openshift.io/serving-cert-secret-name:
        f:healthCheck:
          .:
          f:bucket:
        f:metadataPool:
          .:
          f:compressionMode:
          f:erasureCoded:
            .:
            f:codingChunks:
            f:dataChunks:
          f:failureDomain:
          f:mirroring:
          f:quotas:
          f:replicated:
            .:
            f:replicasPerFailureDomain:
            f:size:
          f:statusCheck:
            .:
            f:mirror:
        f:zone:
          .:
          f:name:
    Manager:      ocs-operator
    Operation:    Update
    Time:         2021-11-15T05:42:31Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:bucketStatus:
          f:lastChecked:
    Manager:      rook
    Operation:    Update
    Time:         2021-11-16T09:38:10Z
    API Version:  ceph.rook.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:      rook
    Operation:    Update
    Subresource:  status
    Time:         2021-11-16T09:38:18Z
  Owner References:
    API Version:           ocs.openshift.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  StorageCluster
    Name:                  ocs-storagecluster
    UID:                   c75b415e-13fa-40fc-8e4c-0bbabdf62275
  Resource Version:        1325984
  UID:                     bf664b76-cb79-4420-ae0b-8968ecd41d83
Spec:
  Data Pool:
    Compression Mode:  none
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Failure Domain:   rack
    Mirroring:
    Quotas:
    Replicated:
      Replicas Per Failure Domain:  1
      Size:                         3
      Target Size Ratio:            0.49
    Status Check:
      Mirror:
  Gateway:
    Instances:  1
    Placement:
      Node Affinity:
        Required During Scheduling Ignored During Execution:
          Node Selector Terms:
            Match Expressions:
              Key:       cluster.ocs.openshift.io/openshift-storage
              Operator:  Exists
      Pod Anti Affinity:
        Preferred During Scheduling Ignored During Execution:
          Pod Affinity Term:
            Label Selector:
              Match Expressions:
                Key:       app
                Operator:  In
                Values:
                  rook-ceph-rgw
            Topology Key:  kubernetes.io/hostname
          Weight:          100
        Required During Scheduling Ignored During Execution:
          Label Selector:
            Match Expressions:
              Key:       app
              Operator:  In
              Values:
                rook-ceph-rgw
          Topology Key:  kubernetes.io/hostname
      Tolerations:
        Effect:    NoSchedule
        Key:       node.ocs.openshift.io/storage
        Operator:  Equal
        Value:     true
    Port:                 80
    Priority Class Name:  openshift-user-critical
    Resources:
      Limits:
        Cpu:     2
        Memory:  4Gi
      Requests:
        Cpu:     2
        Memory:  4Gi
    Secure Port:  443
    Service:
      Annotations:
        service.beta.openshift.io/serving-cert-secret-name:  ocs-storagecluster-cos-ceph-rgw-tls-cert
  Health Check:
    Bucket:
  Metadata Pool:
    Compression Mode:  none
    Erasure Coded:
      Coding Chunks:  0
      Data Chunks:    0
    Failure Domain:   rack
    Mirroring:
    Quotas:
    Replicated:
      Replicas Per Failure Domain:  1
      Size:                         3
    Status Check:
      Mirror:
  Zone:
    Name:
Status:
  Bucket Status:
    Health:        Connected
    Last Changed:  2021-11-15T05:46:55Z
    Last Checked:  2021-11-16T09:38:10Z
  Conditions:
    Last Heartbeat Time:   2021-11-16T11:58:45Z
    Last Transition Time:  2021-11-16T09:38:18Z
    Message:               CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
    Reason:                ObjectHasDependents
    Status:                True
    Type:                  DeletionIsBlocked
  Info:
    Endpoint:         http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
    Secure Endpoint:  https://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:443
  Phase:  Deleting
Events:
  Type     Reason           Age                 From                         Message
  ----     ------           ---                 ----                         -------
  Warning  ReconcileFailed  35m                 rook-ceph-object-controller  failed to check for object buckets. failed to get admin ops API context: failed to create or retrieve rgw admin ops user: failed to create object user "rgw-admin-ops-user". error code 1 for object store "ocs-storagecluster-cephobjectstore": failed to create s3 user. . : signal: interrupt
  Warning  ReconcileFailed  35m (x3 over 140m)  rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
```
The operator log shows that the object store user failed to be deleted because the bucket exists:

```
2021-11-16 09:38:17.566321 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user".: BucketAlreadyExists tx00000000000000000561e-0061937c09-5fe4-ocs-storagecluster-cephobjectstore 5fe4-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
2021-11-16 09:38:17.613729 I | op-mon: parsing mon endpoints: a=172.30.40.140:6789,b=172.30.128.131:6789,c=172.30.35.111:6789
2021-11-16 09:38:17.613837 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-11-16 09:38:17.613968 I | ceph-object-store-user-controller: CephObjectStore "ocs-storagecluster-cephobjectstore" found
2021-11-16 09:38:17.980512 E | ceph-object-store-user-controller: failed to reconcile failed to delete ceph object user "noobaa-ceph-objectstore-user": failed to delete ceph object user "noobaa-ceph-objectstore-user".: BucketAlreadyExists tx00000000000000000561f-0061937c09-5fe4-ocs-storagecluster-cephobjectstore 5fe4-ocs-storagecluster-cephobjectstore-ocs-storagecluster-cephobjectstore
2021-11-16 09:38:18.051215 I | ceph-object-controller: CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
2021-11-16 09:38:18.068891 E | ceph-object-controller: failed to reconcile CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore". CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
2021-11-16 09:38:18.068932 I | op-k8sutil: Reporting Event openshift-storage:ocs-storagecluster-cephobjectstore Warning:ReconcileFailed:CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1636955290789.apps.jijoy-nov15.qe.rh-ocs.com]
```

I don't see any OBCs existing in the cluster, so it seems the bucket should have been deleted. If the bucket didn't exist, the user would be deleted, the object store would be cleaned up, and the uninstall could proceed.

Blaine, could you take a look?
This comment explains why there is a bucket on the user created for NooBaa. NooBaa creates it to be a default bucket and doesn't delete it when NooBaa is deleted. https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c34 ----- The problem, as far as I can tell is that the CephCluster doesn't have a deletionTimestamp. It has the `cleanupPolicy` set, but not a deletionTimestamp, so the optimized/forced deletion is not happening as intended based on this comment: https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c39. Both are necessary for Rook to do the optimized/forced delete. I believe this suggests that we need a change from ocs-operator to request deletion of all resources (including the CephCluster) in order to proceed with the optimized/forced delete strategy. ----- I'm confused how the same procedure yielded different results here https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c41 and here https://bugzilla.redhat.com/show_bug.cgi?id=2005040#c46. @jijoy and @asandler, are these two different uninstall cases somehow?
I was testing on AWS and following https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.9/html-single/deploying_openshift_data_foundation_using_amazon_web_services/index?lb_target=preview
Testing was done on the VMware platform, where the cephobjectstoreuser "noobaa-ceph-objectstore-user" will be present.
I think I understand from the responses that the AWS tests don't install an object store or NooBaa, so they would naturally not be affected by this bug.

Given that the deletion timestamp was missing on the CephCluster resource in Jiju's tests, I believe this still means ocs-operator needs some slight adjustments to ensure it is deleting the CephCluster and all other Rook resources at the same time. Notably, the following command should cause ocs-operator to delete all the Rook resources:

> $ oc delete -n openshift-storage storagesystem --all --wait=true
(In reply to Anna Sandler from comment #41)
> Verifying this bug by the updated 4.9 uninstall flow
> after deleting finalizers

Right now, the current documentation:
https://gitlab.cee.redhat.com/red-hat-openshift-container-storage-documentation/openshift-data-foundation-documentation/tree/6c33df43168f6c21f7b221e27710684c7ef6788b
doesn't mention anything about removing finalizers. If this information was about to be added at the moment the above statement was made, I would expect a reference to a bug or JIRA ticket tracking it.

The note about finalizers also conflicts with Blaine's suggested approach for the fix noted in comment 39. If we have decided to do something else, I'm missing a clear statement about that between comment 39 and comment 41. Only in comment 58 do I see that the decision about this bug was basically not to fix it.

> the docs are clear

Does this mean that the hack with finalizers won't be necessary? Could you reference a description of this procedure somewhere?

> tested on OCP 4.9 on AWS
> the flow works as needed and the bug is fixed.

This also needs to be tested on vSphere with LSO. I haven't noted that this is vSphere-specific, but vSphere is the only on-premise platform where we can deploy LSO in a way which is usable for testing purposes (mimicking an on-premise LSO setup; using AWS with or without LSO won't do).
(In reply to Jose A. Rivera from comment #57)
> "It's not a good user experience" is not an argument for blocking a release.

I would not describe this as just bad UX. Problems like this are unacceptable, and if we let them in, we will end up with a hard-to-maintain mess.

> If there's no functional harm, no chance of production data corruption
> (we're intentionally destroying access to the data at this point!), and a
> workaround exists,

Do you mean the workaround noted in https://bugzilla.redhat.com/show_bug.cgi?id=2000941? I haven't found any direct reference to the workaround. At the moment I'm writing this, I don't see it in the KCS https://access.redhat.com/articles/6525111 either.

> it's not a *blocker*.

It is a blocker since it's a regression: as noted in the original bug report, the procedure explained in the reproducer was working fine before. Moreover, it has a nontrivial testing impact. Of course, if the program agrees on not fixing it for a particular reason, that is another question.
I just retried the original reproducer, and can still see the same behaviour:

```
Events:
  Type     Reason            Age    From                        Message
  ----     ------            ---    ----                        -------
  Warning  UninstallPending  3m33s  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  3m33s  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  3m32s  controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
```

Retried with a vSphere LSO cluster:

OCP 4.9.0-0.nightly-2021-11-24-090558
LSO 4.9.0-202111151318
OCS 4.9.0-249.ci
I also tried to perform the "workaround" suggested in BZ 2000941 before removing the ODF storage system:

```
$ oc patch -n openshift-storage noobaa/noobaa --type=merge -p '{"metadata": {"finalizers":null}}'
$ oc patch -n openshift-storage backingstore/noobaa-default-backing-store --type=merge -p '{"metadata": {"finalizers":null}}'
$ oc patch -n openshift-storage bucketclasses.noobaa.io/noobaa-default-bucket-class --type=merge -p '{"metadata": {"finalizers":null}}'
```

But I observed no difference; removal of the StorageCluster still got stuck. It's possible that I got the workaround wrong. In that case, could someone be so kind as to explain a workaround procedure so that the ODF StorageSystem can be consistently removed?
Martin and I had a sync-up chat about this. We are both pretty confused and not sure of the best way to proceed. I will lay out the component interactions that are causing this issue to the best of my knowledge, the workarounds I know, and from there we should figure out what to do to resolve the issue.

For 4.9, we should update documentation with a workaround. Otherwise, users will not know how to delete an OCS storage cluster. For 4.10, we will have to decide whether to fix this in a component or whether documentation is the preferred fix.

---
CURRENT STATE

This issue has arisen in 4.9 because Rook includes broader checks for user data when it deletes object stores. In order to not delete object stores with user data, Rook will block deleting a CephObjectStore if the object store has any user buckets created. In Rook upstream, the way to force delete the CephObjectStore resource (which won't delete the pools for the store) is to remove the finalizer on the CephObjectStore resource.

NooBaa creates at least one bucket in the CephObjectStore and does not delete the bucket when NooBaa is deleted. Because of this, Rook does not remove the CephObjectStore, not wanting to delete user data.

Travis implemented an "optimized" deletion path in Rook which will skip checks for user buckets if three criteria are met:
1. The CephCluster has the `cleanupPolicy` set
2. The CephCluster has a deletion timestamp (it has been requested to be deleted)
3. The CephObjectStore has a deletion timestamp (it has also been requested to be deleted)

From the GChat discussion here https://chat.google.com/room/AAAAREGEba8/9MaG2Ig_TWM, it is my understanding that Jose does not wish to use the optimized deletion path in OCS, in order to protect from accidental deletion of user data.

At this point, deleting an OCS cluster with a CephObjectStore and NooBaa will always hang because NooBaa does not delete the bucket when NooBaa is removed, OCS does not force the CephObjectStore to be deleted using the "optimized" deletion path, and Rook will not delete the object store with NooBaa's bucket still there.

---
WORKAROUNDS

There are 2 potential workarounds to this issue to force deletion (see the command sketch below):
1. Remove the finalizer from the CephObjectStore
2. Use `oc delete` (or `kubectl delete`) to delete the CephCluster to use the "optimized" deletion path

---
WHAT DO WE DO NEXT?

I see 3 possible paths forward; there may be more. I would like to get input from NooBaa, Jose, and Mudit on how to proceed. Options I see:
1. We update documentation to require users to perform one of the workaround steps when they want to remove the OCS storage cluster (bare minimum for 4.9)
2. Change NooBaa's uninstall procedure to remove the bucket(s) it creates in the CephObjectStore when NooBaa is being deleted
3. Use the "optimized" deletion path in ocs-operator

@muagarwa, @jrivera, @nbecker
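As promised above, a command-level sketch of the two workarounds, assuming the default resource names and the openshift-storage namespace; both destroy access to the data, so they are only appropriate when the cluster is being torn down anyway:

```
# Workaround 1: drop the finalizer so the stuck CephObjectStore can go away.
$ oc patch -n openshift-storage cephobjectstore/ocs-storagecluster-cephobjectstore \
    --type=merge -p '{"metadata": {"finalizers": null}}'

# Workaround 2: take the "optimized" deletion path by confirming data destruction
#               on the CephCluster and deleting it directly.
$ oc patch -n openshift-storage cephcluster/ocs-storagecluster-cephcluster \
    --type=merge -p '{"spec": {"cleanupPolicy": {"confirmation": "yes-really-destroy-data"}}}'
$ oc delete -n openshift-storage cephcluster/ocs-storagecluster-cephcluster
```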
Thanks for the good summary, Blaine. I want to add details regarding 4.10 which might complicate suggested approach #2, but also provide a suggestion out of the mess if we go with #1.

As you wrote, NooBaa creates a default BS on top of RGW for on-prem deployments. That is in addition to the fact that a customer can create any number of BackingStores on top of new or existing RGW buckets when RGW is available (internal or external). Since there is the option of creating a BS on top of an existing RGW bucket, with existing data which was not written via ODF/NooBaa, we get to the same point of protecting the user data that Rook implemented... we don't want to delete that data.

In 4.10, to make things a little more complicated, even the default BS won't be "safe" to delete, since we are adding the ability (requested by several customers) to set whichever default BS they want rather than keeping the out-of-the-box default. This means that even the default could now point to a bucket with data not written via ODF/NooBaa.

The only way I see having #2 work is by giving the customer (in the UI and CLI flows) a warning in case their default BS is on top of RGW, and asking them to confirm that ALL data will be deleted. This way the customer makes the decision, and if they are OK with that, so should we be. That would mean passing something similar to the "force" option to let the noobaa-operator know about this choice.

Even if we go with this path, we still need to think about the deletion process. Since there is no "delete all files" in S3, and a delete-bucket command fails if there is data in it, a client essentially iterates over all objects and deletes them. This can take quite some time, and I'm not sure we would want to wait that long during uninstall. So if we go with this approach (and add the warning to the customer), we would need to think about how we can efficiently delete, or mark for deletion, the bucket and the objects in it.

I have to admit that the path Travis implemented sounds the more reasonable to me: during uninstall we would mark certain things that let the components know we are uninstalling and that they should behave differently. That sounds like the right approach to me.
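To illustrate the point about S3 having no single "delete everything" call, a sketch of what a client has to do, here with the AWS CLI against the RGW endpoint; the bucket name, endpoint URL, and credentials (taken from the object store user's secret) are all assumptions:

```
# Recursively delete every object in the bucket (the CLI iterates over all keys),
# then remove the now-empty bucket itself.
$ aws s3 rm s3://<noobaa-default-bucket> --recursive \
    --endpoint-url https://<rgw-route-hostname>
$ aws s3 rb s3://<noobaa-default-bucket> \
    --endpoint-url https://<rgw-route-hostname>
```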
A small fix to the comment above, which I cannot edit in BZ: I meant a suggestion out of the mess if we go with #2, not #1.
I can confirm that removing finalizers of ocs-storagecluster-cephobjectstore:

```
oc patch -n openshift-storage CephObjectStore/ocs-storagecluster-cephobjectstore --type=merge -p '{"metadata": {"finalizers":null}}'
```

works as a workaround here.
Nimrod already replied, removing need info on me.
I don't entirely remember the full extent of the discussion on this BZ since it's been a while, and three weeks of PTO basically wiped my brain. That said, I believe we reached a general consensus that the "optimized" deletion strategy is valid and good to go. I don't foresee any changes to ocs-operator to accommodate this, so I think all the work is done? Giving devel_ack+ and moving to ON_QA. Testing for this is just validating the standard regression procedures.
Will be tested via normal uninstallation procedure.
My recollection is that there are changes needed in ocs-operator to enable this. During non-graceful deletion of a cluster, ocs-operator needs to set the `cleanupPolicy` on the `CephCluster`, then issue a delete call for the `CephCluster` before moving on to deleting the remainder of the `Ceph...` resources (chiefly the CephObjectStore). In the last conversation we had about it, I believe ocs-operator instead tries to delete all of the secondary `Ceph...` resources (including the `CephObjectStore`) before deleting the `CephCluster`.

IMO it is worth it for someone from the ocs-operator team to verify this behavior while QA is looking at it, so that we don't get a further time delay if QA comes back with a "no pass" result. @jrivera
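A lightweight way for whoever picks this up to verify the ordering on a live uninstall (a sketch, assuming the default names): watch which resources get a deletion timestamp first. With the intended fix, the CephCluster should be marked for deletion before, or together with, the CephObjectStore rather than after it.

```
# Show deletion timestamps of the CephCluster and its dependents side by side.
$ oc get cephcluster,cephobjectstore,cephblockpool,cephfilesystem -n openshift-storage \
    -o custom-columns='KIND:.kind,NAME:.metadata.name,DELETION:.metadata.deletionTimestamp'
```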
The uninstall cannot be completed without applying the manual workaround of deleting the finalizers. The actual issue is mentioned in comment #77.

```
$ oc describe storagesystem ocs-storagecluster-storagesystem | grep Events -A 4
Events:
  Type     Reason           Age  From                      Message
  ----     ------           ---  ----                      -------
  Warning  ReconcileFailed  13m  StorageSystem controller  Waiting for storagecluster.ocs.openshift.io/v1 ocs-storagecluster to be deleted

$ oc describe storagecluster ocs-storagecluster | grep Events -A 10
Events:
  Type     Reason            Age  From                        Message
  ----     ------            ---  ----                        -------
  Warning  UninstallPending  14m  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  14m  controller_storagecluster  Uninstall: Waiting for Ceph RGW Route ocs-storagecluster-cephobjectstore to be deleted
  Warning  UninstallPending  14m  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  14m  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser prometheus-user to be deleted
  Warning  UninstallPending  14m  controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted

$ oc describe CephObjectStore ocs-storagecluster-cephobjectstore | grep Events -A 10
Events:
  Type     Reason           Age                From                         Message
  ----     ------           ---                ----                         -------
  Warning  ReconcileFailed  5s (x82 over 15m)  rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1642741709887.apps.jijoy-jan21.qe.rh-ocs.com]

$ oc get CephObjectStoreUser noobaa-ceph-objectstore-user
NAME                           AGE
noobaa-ceph-objectstore-user   14h
```

The command "oc delete CephObjectStoreUser noobaa-ceph-objectstore-user" (deleting it manually is still a workaround) will not complete until the finalizers are removed by running the command given below.

```
$ oc patch -n openshift-storage CephObjectStoreUser/noobaa-ceph-objectstore-user --type=merge -p '{"metadata": {"finalizers":null}}'
cephobjectstoreuser.ceph.rook.io/noobaa-ceph-objectstore-user patched
```

Deleting the CephObjectStoreUser noobaa-ceph-objectstore-user did not help in deleting the CephObjectStore ocs-storagecluster-cephobjectstore automatically.

```
$ oc describe CephObjectStore ocs-storagecluster-cephobjectstore | grep Events -A 10
Events:
  Type     Reason           Age                    From                         Message
  ----     ------           ---                    ----                         -------
  Warning  ReconcileFailed  7m56s (x82 over 23m)   rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: CephObjectStoreUsers: [noobaa-ceph-objectstore-user], buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1642741709887.apps.jijoy-jan21.qe.rh-ocs.com]
  Warning  ReconcileFailed  2m54s (x3 over 3m16s)  rook-ceph-object-controller  CephObjectStore "openshift-storage/ocs-storagecluster-cephobjectstore" will not be deleted until all dependents are removed: buckets in the object store (could be from ObjectBucketClaims or COSI Buckets): [nb.1642741709887.apps.jijoy-jan21.qe.rh-ocs.com]
```

Removed the finalizer:

```
$ oc patch -n openshift-storage CephObjectStore/ocs-storagecluster-cephobjectstore --type=merge -p '{"metadata": {"finalizers":null}}'
cephobjectstore.ceph.rook.io/ocs-storagecluster-cephobjectstore patched
```
*** Bug 2049309 has been marked as a duplicate of this bug. ***
Upstream PR for ocs-operator is posted: https://github.com/red-hat-storage/ocs-operator/pull/1563 After talking it over with Blaine, we're fairly confident that this should resolve the problem. I both love and hate how often tiny changes like this end up being the solution.
Isn't this bug dependent on the fix for bug #2060897? According to the comment https://bugzilla.redhat.com/show_bug.cgi?id=2060897#c14, the issue described in #2060897 is related to the code changes linked to this bug.
The command given below became stuck because the storagecluster is not getting deleted.

```
$ oc delete -n openshift-storage storagesystem --all --wait=true
storagesystem.odf.openshift.io "ocs-storagecluster-storagesystem" deleted

$ oc describe storagesystem ocs-storagecluster-storagesystem | grep Events -A 30
Events:
  Type     Reason           Age  From                      Message
  ----     ------           ---  ----                      -------
  Warning  ReconcileFailed  22m  StorageSystem controller  Waiting for storagecluster.ocs.openshift.io/v1 ocs-storagecluster to be deleted

$ oc describe storagecluster ocs-storagecluster | grep Events -A 30
Events:
  Type     Reason            Age  From                        Message
  ----     ------            ---  ----                        -------
  Warning  UninstallPending  23m  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  23m  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser ocs-storagecluster-cephobjectstoreuser to be deleted
  Warning  UninstallPending  23m  controller_storagecluster  uninstall: Waiting for CephObjectStoreUser prometheus-user to be deleted
  Warning  UninstallPending  23m  controller_storagecluster  uninstall: Waiting for CephObjectStore ocs-storagecluster-cephobjectstore to be deleted
  Warning  UninstallPending  23m  controller_storagecluster  uninstall: Waiting for CephFileSystem ocs-storagecluster-cephfilesystem to be deleted
  Warning  UninstallPending  23m  controller_storagecluster  uninstall: Waiting for CephBlockPool ocs-storagecluster-cephblockpool to be deleted
  Warning  UninstallPending  23m  controller_storagecluster  uninstall: Waiting for CephCluster to be deleted

$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE     PHASE      MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          4h59m   Deleting   Deleting the CephCluster       HEALTH_OK

$ oc describe cephcluster | grep Events -A 30
Events:
  Type     Reason              Age                  From                          Message
  ----     ------              ---                  ----                          -------
  Normal   ReconcileSucceeded  44m (x2 over 4h56m)  rook-ceph-cluster-controller  successfully configured CephCluster "openshift-storage/ocs-storagecluster-cephcluster"
  Warning  ReconcileFailed     24m                  rook-ceph-cluster-controller  CephCluster "openshift-storage/ocs-storagecluster-cephcluster" will not be deleted until all dependents are removed: CephBlockPool: [ocs-storagecluster-cephblockpool], CephFilesystem: [ocs-storagecluster-cephfilesystem], CephObjectStore: [ocs-storagecluster-cephobjectstore]
  Normal   Deleting            20m (x12 over 24m)   rook-ceph-cluster-controller  deleting CephCluster "openshift-storage/ocs-storagecluster-cephcluster"
  Warning  ReconcileFailed     3m30s (x29 over 24m) rook-ceph-cluster-controller  failed to clean up CephCluster "openshift-storage/ocs-storagecluster-cephcluster": failed to check if volumes exist for CephCluster in namespace "openshift-storage": waiting for csi volume attachments in cluster "openshift-storage" to be cleaned up

$ oc get pvc,pv
NAME                                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/ocs-deviceset-0-data-05lxrb    Bound    pvc-f79af469-0dd7-4aed-8450-24d3b351b909   100Gi      RWO            thin           4h58m
persistentvolumeclaim/ocs-deviceset-1-data-0d9qd6    Bound    pvc-06a34bc6-b82b-4cd0-a545-ade14399f5c2   100Gi      RWO            thin           4h58m
persistentvolumeclaim/ocs-deviceset-2-data-0dsw7c    Bound    pvc-cfd4b5c9-7a70-4df2-851e-9589ccf9cf7f   100Gi      RWO            thin           4h58m
persistentvolumeclaim/rook-ceph-mon-a                Bound    pvc-4735b6fa-f60e-4d87-9141-a9dd8f4c8b2d   50Gi       RWO            thin           5h1m
persistentvolumeclaim/rook-ceph-mon-b                Bound    pvc-89a13c6b-28f6-432f-a3d9-c4c05dee77b3   50Gi       RWO            thin           5h1m
persistentvolumeclaim/rook-ceph-mon-c                Bound    pvc-276c2678-b663-4299-bc2f-3c5a42f53eaa   50Gi       RWO            thin           5h1m

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                           STORAGECLASS                  REASON   AGE
persistentvolume/pvc-06a34bc6-b82b-4cd0-a545-ade14399f5c2   100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-1-data-0d9qd6   thin                                   4h58m
persistentvolume/pvc-276c2678-b663-4299-bc2f-3c5a42f53eaa   50Gi       RWO            Delete           Bound      openshift-storage/rook-ceph-mon-c               thin                                   5h
persistentvolume/pvc-4735b6fa-f60e-4d87-9141-a9dd8f4c8b2d   50Gi       RWO            Delete           Bound      openshift-storage/rook-ceph-mon-a               thin                                   5h
persistentvolume/pvc-89a13c6b-28f6-432f-a3d9-c4c05dee77b3   50Gi       RWO            Delete           Bound      openshift-storage/rook-ceph-mon-b               thin                                   5h
persistentvolume/pvc-b79a0258-68ea-4aea-ae0c-c64b39da16cf   50Gi       RWO            Delete           Released   openshift-storage/db-noobaa-db-pg-0             ocs-storagecluster-ceph-rbd            4h56m
persistentvolume/pvc-cfd4b5c9-7a70-4df2-851e-9589ccf9cf7f   100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-2-data-0dsw7c   thin                                   4h58m
persistentvolume/pvc-f79af469-0dd7-4aed-8450-24d3b351b909   100Gi      RWO            Delete           Bound      openshift-storage/ocs-deviceset-0-data-05lxrb   thin                                   4h58m

$ oc describe pv pvc-b79a0258-68ea-4aea-ae0c-c64b39da16cf | grep Events -A 30
Events:
  Type     Reason              Age                From                                                                                                                        Message
  ----     ------              ---                ----                                                                                                                        -------
  Warning  VolumeFailedDelete  45s (x17 over 29m) openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-7b59944d67-pfsjb_21e16f1c-a476-4de0-8456-f47cc5f46d3b          rpc error: code = InvalidArgument desc = provided secret is empty
```

This issue is reported in the bug #2060897.

Tested in version:
ODF 4.10.0-184
ODF 4.10.0-0.nightly-2022-03-09-224546

Tested in VMware.
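For whoever hits the "waiting for csi volume attachments" message above, a quick way to see which attachments are still holding things up (a sketch; the column paths are standard VolumeAttachment fields):

```
# List CSI volume attachments with their attacher, backing PV and attachment state.
$ oc get volumeattachments \
    -o custom-columns='NAME:.metadata.name,ATTACHER:.spec.attacher,PV:.spec.source.persistentVolumeName,ATTACHED:.status.attached'
```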
Because the issue mentioned in the above comment is already tracked by bug #2060897, I am moving this BZ to MODIFIED and out of 4.10. It can be moved back to ON_QA when bug #2060897 is fixed. Let me know if I have missed anything here.
(In reply to Mudit Agarwal from comment #83)
> Because the issue mentioned in the above comment is already tracked by bug
> #2060897, I am moving this BZ to MODIFIED and out of 4.10
> Can be moved back to ON_QA when bug #2060897 is fixed.

Hi Mudit,

Is this bug actually ready for verification? Bug #2060897 is not fixed, and the "Target Release" and "Fixed In Version" do not match.

>
> Let me know if I have missed anything here.
This was moved to ON_QA automatically, moving it back.
Hi Mudit, why has this bug been moved out of 4.12.0? It got all the acks for 4.12.0.
This has been getting moved for many releases, not just 4.12; we don't have the bandwidth to fix uninstallation and it has low priority.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Data Foundation 4.13.0 enhancement and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3742