Created attachment 1868837 [details]
auth list showing 3 onboarded clients

Description of problem:
=====================================
To offboard a consumer cluster from the provider, we have tested add-on uninstall, whereby the storageconsumers and other resources are deleted from the provider.

However, when we delete an onboarded consumer directly from the OCM UI (OCM UI -> Delete cluster), offboarding is not kicked in and the storageconsumer and other resources still exist on the provider.

It is to be noted that, as part of cluster deletion, the UI first shows the add-on in the "Uninstalling" phase, so offboarding should have happened.

Version-Release number of selected component (if applicable):
=================================================================
provider images
==================
oc describe csv ocs-osd-deployer.v2.0.0 | grep -i image
Mediatype: image/svg+xml
Image: gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0
Image: quay.io/osd-addons/ocs-osd-deployer:2.0.0-2
Image: quay.io/osd-addons/ocs-osd-deployer:2.0.0-2

oc get csv -n openshift-storage -o json ocs-operator.v4.10.0 | jq '.metadata.labels["full_version"]'
"4.10.0-206"

consumer
==========
oc describe csv ocs-osd-deployer.v2.0.0 | grep -i image
Mediatype: image/svg+xml
Image: gcr.io/kubebuilder/kube-rbac-proxy:v0.5.0
Image: quay.io/osd-addons/ocs-osd-deployer:2.0.0-5
Image: quay.io/osd-addons/ocs-osd-deployer:2.0.0-5

How reproducible:
======================
Tested on 3 consumer clusters on the same ODF to ODF setup

Steps to Reproduce:
1. Created an ODF to ODF cluster using the steps provided here [1]
   [1] https://docs.google.com/document/d/1ehNBscWgLGNYqnnZUp6RPnkR9ByYU69BgXvr_z2n5sE/edit#heading=h.41dqse7bmiv5
2. Onboarded 3 consumers and created PVCs from each of the 3
3. With the PVCs still existing, started "Delete cluster" from the OCM UI, expecting it to fail since PVCs existed

Actual results:
===================
1. Consumer cluster deletion succeeds
2. On the provider side, all 3 storageconsumers and corresponding resources are still intact

Expected results:
=====================
If a consumer cluster is permanently deleted from the OCM UI, the corresponding provider resources should also be deleted to free up the space. Or should the provider have a mechanism to poll and delete the storageconsumers?
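A rough way to check from the provider side whether offboarding actually ran while the consumer is being deleted is to watch the consumer-specific resources; this is only a sketch and assumes the default openshift-storage namespace:

# on the provider, watch the storageconsumer CRs while the OCM deletion is in progress
oc get storageconsumer -n openshift-storage -w
oc get cephblockpool,cephfilesystemsubvolumegroup -n openshift-storage

If offboarding had kicked in during the "Uninstalling" phase, the storageconsumer-* entries for the deleted consumer would disappear from this output.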
Additional info:
=================
After deletion from the UI, only the provider cluster exists:

rosa list clusters
ID                                NAME            STATE
1r7eu5ehl2m97pad70lecuk9uljcmaa4  sgatfane-28pr1  ready

However, the corresponding storageconsumer resources are not deleted:
-----------------------------------
date --utc; oc get storageconsumer,cephblockpool,cephfilesystemsubvolumegroup
Mon Mar 28 08:07:23 PM UTC 2022
NAME                                                                                     AGE
storageconsumer.ocs.openshift.io/storageconsumer-0acdb819-f6bb-42ba-b80a-c0e9bd88a40e    6h3m
storageconsumer.ocs.openshift.io/storageconsumer-223e3a6c-205d-40ce-861e-2021a7ac4b62    6h20m
storageconsumer.ocs.openshift.io/storageconsumer-99d4f738-4241-401c-9b40-ba7eaafba343    6h11m

NAME                                                                                             AGE
cephblockpool.ceph.rook.io/cephblockpool-storageconsumer-0acdb819-f6bb-42ba-b80a-c0e9bd88a40e    6h3m
cephblockpool.ceph.rook.io/cephblockpool-storageconsumer-223e3a6c-205d-40ce-861e-2021a7ac4b62    6h20m
cephblockpool.ceph.rook.io/cephblockpool-storageconsumer-99d4f738-4241-401c-9b40-ba7eaafba343    6h11m
cephblockpool.ceph.rook.io/ocs-storagecluster-cephblockpool                                      14h

NAME                                                                                                                           AGE
cephfilesystemsubvolumegroup.ceph.rook.io/cephfilesystemsubvolumegroup-storageconsumer-0acdb819-f6bb-42ba-b80a-c0e9bd88a40e    6h4m
cephfilesystemsubvolumegroup.ceph.rook.io/cephfilesystemsubvolumegroup-storageconsumer-223e3a6c-205d-40ce-861e-2021a7ac4b62    6h20m
cephfilesystemsubvolumegroup.ceph.rook.io/cephfilesystemsubvolumegroup-storageconsumer-99d4f738-4241-401c-9b40-ba7eaafba343    6h11m
-----------------------

Before deletion
------------------
rosa list clusters
ID                                NAME            STATE
1r7eu5ehl2m97pad70lecuk9uljcmaa4  sgatfane-28pr1  ready
1r7ieje1qbl46k94cposunuijk81ksvf  sgatfane-28c3   ready
1r7kcdde31vldf7oi4e3r91f8jimbapl  jijoyc1         ready
1r7lbvnvpqbdlqmrro6uncn6aoldemm6  jijoyc2         ready

sh-4.4$ ceph fs subvolume ls ocs-storagecluster-cephfilesystem cephfilesystemsubvolumegroup-storageconsumer-99d4f738-4241-401c-9b40-ba7eaafba343
[
    {
        "name": "csi-vol-a240d41d-aea6-11ec-ae4e-0a580a800220"
    }
]
sh-4.4$ ceph fs subvolume ls ocs-storagecluster-cephfilesystem cephfilesystemsubvolumegroup-storageconsumer-223e3a6c-205d-40ce-861e-2021a7ac4b62
[
    {
        "name": "csi-vol-5bcd36c7-aebc-11ec-804c-0a580a83003c"
    },
    {
        "name": "csi-vol-5be9cf1d-aebc-11ec-804c-0a580a83003c"
    }
]
sh-4.4$ ceph fs subvolume ls ocs-storagecluster-cephfilesystem cephfilesystemsubvolumegroup-storageconsumer-0acdb819-f6bb-42ba-b80a-c0e9bd88a40e
[]
sh-4.4$ rbd ls -p cephblockpool-storageconsumer-99d4f738-4241-401c-9b40-ba7eaafba343
csi-vol-8f1e897a-aeaf-11ec-9f4e-0a580a80021f
csi-vol-c2e87e4a-aeb6-11ec-9f4e-0a580a80021f
csi-vol-c51c6faa-aeae-11ec-9f4e-0a580a80021f
csi-vol-cdf10ef2-aeb3-11ec-9f4e-0a580a80021f
csi-vol-e8749ded-aead-11ec-9f4e-0a580a80021f
sh-4.4$ rbd ls -p cephblockpool-storageconsumer-0acdb819-f6bb-42ba-b80a-c0e9bd88a40e
csi-vol-1a7800cd-aeb0-11ec-8556-0a580a830025
csi-vol-c2d9b816-aeb6-11ec-8556-0a580a830025
csi-vol-cdcbdd4e-aeb3-11ec-8556-0a580a830025
csi-vol-e872570b-aead-11ec-8556-0a580a830025
csi-vol-f6612c51-aeb0-11ec-8556-0a580a830025
csi-vol-fc78b9d7-aeb1-11ec-8556-0a580a830025
sh-4.4$ rbd ls -p cephblockpool-storageconsumer-223e3a6c-205d-40ce-861e-2021a7ac4b62
csi-vol-4baf67df-aeb8-11ec-9c01-0a580a83003b
I tried to reproduce the bug but was not able to. A few things I found:

- Your provider cluster is using an older image of the deployer: quay.io/osd-addons/ocs-osd-deployer:2.0.0-2, while your consumer cluster is using a newer image of the deployer: quay.io/osd-addons/ocs-osd-deployer:2.0.0-5.
- The doc that you are using has some steps that no longer need to be followed now that the deployer was updated; I have added comments to the doc.

The behavior I observed while reproducing the bug:

- The deployer doesn't allow uninstallation if PVCs are using OCS storage classes.
- As soon as I delete the PVCs using OCS storage classes, consumer offboarding starts, and the complete openshift-storage namespace is deleted after some time.
- When I checked the provider cluster for the storageConsumer resource, the resource still existed for the consumer that was offboarded.
- After debugging, I found that a PV was still utilizing storage on the consumer cluster via the cephfs storage class (this PV was created when I created PVCs with the OCS storage classes, i.e. cephrbd and cephfs).
- When I manually deleted that PV on the consumer and deleted the corresponding subvolume on the provider, the storageConsumer resource was removed.

So the deployer needs to allow uninstall only when there is no PV using OCS StorageClasses, instead of checking PVCs. I have raised a PR for that: https://github.com/red-hat-storage/ocs-osd-deployer/pull/152
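Until that change lands, a rough way to see whether any PVs (rather than PVCs) are still bound to OCS storage classes on the consumer is shown below; matching on "ceph" in the storage class name is an assumption and may need adjusting per cluster:

oc get pv -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.storageClassName}{"\n"}{end}' | grep -E 'ceph'

Any PV listed here would keep the consumer's resources from being cleaned up on the provider until it is removed.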
If I understand correctly, the problem is that when we delete the entire consumer instead of following the offboarding process, the consumer's resources on the provider side still exist. IIRC, there will be stale blockPools, filesystem subvolume groups and cephClients (confirmed with Druv). We need to delete:

1. The blockPool, in this case `cephblockpool-storageconsumer-326dfd52-773c-4c72-ac1c-6576380bfe37   10d`; notice the block pool name has the storageconsumer name at the end.
2. The filesystem subvolume group, `cephfilesystemsubvolumegroup-storageconsumer-326dfd52-773c-4c72-ac1c-6576380bfe37   10d`; notice the subvolume group name also has the storageconsumer name at the end.
3. The cephClients: to delete the cephClients linked to a particular consumer, list the cephClients and check the annotation with key `StorageConsumerAnnotation`, which will have the consumer name.
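A rough sketch of that manual cleanup on the provider, assuming the stale consumer's UUID is known (replace <uuid>; the resource names follow the pattern above, and the annotation lookup just lists all annotations rather than hard-coding a key, since the exact annotation key may differ from the Go constant name):

# 1. delete the per-consumer block pool
oc delete cephblockpool cephblockpool-storageconsumer-<uuid> -n openshift-storage

# 2. delete the per-consumer filesystem subvolume group
oc delete cephfilesystemsubvolumegroup cephfilesystemsubvolumegroup-storageconsumer-<uuid> -n openshift-storage

# 3. list cephClients with their annotations, pick the ones referencing the consumer, then delete them
oc get cephclient -n openshift-storage -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}' | grep storageconsumer-<uuid>
oc delete cephclient <matching-client-name> -n openshift-storage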
Moving the BZ to ON_QA as the tracker issue and fix in deployer were merged.
This is verified on the earlier deployer version (v2.0.10). No stale storageconsumer is observed once the PVCs are deleted from the consumer, and the openshift-storage project is deleted successfully from the consumer.
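For reference, a minimal outline of the verification flow with placeholder names (the PVC names and app namespaces depend on the test setup, and the add-on uninstall is assumed to already be in progress):

# on the consumer: delete the PVCs that use OCS storage classes, then confirm the namespace goes away
oc delete pvc <pvc-name> -n <app-namespace>
oc get ns openshift-storage

# on the provider: the per-consumer resources should no longer be listed
oc get storageconsumer,cephblockpool,cephfilesystemsubvolumegroup -n openshift-storage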