It does appear that the released PV is blocking deletion of the Ceph cluster. I'm not sure what secret is missing. Both the rook-csi-rbd-provisioner and rook-csi-rbd-node secrets still exist in the must-gather directory 'after/'. I don't think I can debug further very easily. Moving this to the Ceph-CSI component.

```
Name:            pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-ceph-rbd
Status:          Released
Claim:           openshift-storage/db-noobaa-db-pg-0
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        50Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.rbd.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f
    ReadOnly:          false
    VolumeAttributes:  clusterID=openshift-storage
                       csi.storage.k8s.io/pv/name=pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6
                       csi.storage.k8s.io/pvc/name=db-noobaa-db-pg-0
                       csi.storage.k8s.io/pvc/namespace=openshift-storage
                       imageFeatures=layering
                       imageFormat=2
                       imageName=csi-vol-56a6f1a1-9bb2-11ec-aef9-0a580a80020f
                       journalPool=ocs-storagecluster-cephblockpool
                       pool=ocs-storagecluster-cephblockpool
                       storage.kubernetes.io/csiProvisionerIdentity=1646394835158-8081-openshift-storage.rbd.csi.ceph.com
Events:
  Type     Reason              Age                    From                                                                                                               Message
  ----     ------              ----                   ----                                                                                                               -------
  Warning  VolumeFailedDelete  117s (x10 over 6m12s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-d98d9b847-pvfgg_c99699ca-2fb4-45cd-a97d-795b918521d3  rpc error: code = InvalidArgument desc = provided secret is empty
```
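The presence of both secrets can be double-checked directly against the must-gather dump. A rough sketch (the path layout inside `after/` is an assumption; adjust to the actual must-gather structure):

```shell
# From the root of the must-gather, list which collected files mention
# the two CSI secrets. Both should show up if the Secret objects were
# still present when the dump was taken.
grep -rlE 'rook-csi-rbd-(provisioner|node)' after/ | sort -u
```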
Going through the must-gather, I see that the storageclass `ocs-storagecluster-ceph-rbd` doesn't exist, which is why the PV below doesn't get deleted.

```
pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6   50Gi   RWO   Delete   Released   openshift-storage/db-noobaa-db-pg-0   ocs-storagecluster-ceph-rbd   55m
```

Also, per the documented uninstall process, we must delete the PVCs and OBCs before deleting the storagesystem. Here, however, there are many PVCs still present that need to be deleted.

```
---
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      crushDeviceClass: ""
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
      volume.kubernetes.io/selected-node: ip-10-0-186-155.us-east-2.compute.internal
      volume.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
    creationTimestamp: "2022-03-04T11:56:47Z"
    finalizers:
    - kubernetes.io/pvc-protection
    generateName: ocs-deviceset-gp2-0-data-0
    labels:
      ceph.rook.io/DeviceSet: ocs-deviceset-gp2-0
      ceph.rook.io/DeviceSetPVCId: ocs-deviceset-gp2-0-data-0
      ceph.rook.io/setIndex: "0"
    name: ocs-deviceset-gp2-0-data-0p4djr
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 9d961761-b5a7-4a26-b79e-bc784c2d7681
    resourceVersion: "39038"
    uid: 4d3df2c4-6083-4054-aa8c-257a5505c477
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 512Gi
    storageClassName: gp2
    volumeMode: Block
    volumeName: pvc-4d3df2c4-6083-4054-aa8c-257a5505c477
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 512Gi
    phase: Bound
```

And these PVCs are mostly not getting deleted due to the finalizers:

```
finalizers:
- kubernetes.io/pvc-protection
```
Yes @rgeorge, this is similar to the VMware bug. As mentioned in the comment above, the resources are not getting deleted due to their finalizers. We need to explicitly (forcefully) remove the finalizers, which will allow the resources to be deleted.
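For reference, forcefully clearing the finalizers would look roughly like this (a sketch using object names from the must-gather above; this is a last resort, not a recommended cleanup path):

```shell
# WARNING: clearing finalizers skips the normal cleanup path and can leak
# the backing storage (e.g. RBD images). Only do this when the normal
# deletion flow is permanently stuck.
kubectl -n openshift-storage patch pvc ocs-deviceset-gp2-0-data-0p4djr \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# Same idea for a stuck PV:
kubectl patch pv pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6 \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```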
Going through the discussion in bug 2005040, I see removing the finalizers isn't a good user experience. The deletion of the storagesystem is blocked for the following reason:

```
Name:         ocs-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  uninstall.ocs.openshift.io/cleanup-policy: delete
Events:
  Type     Reason            Age    From                       Message
  ----     ------            ----   ----                       -------
  Warning  UninstallPending  6m18s  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  6m17s  controller_storagecluster  uninstall: Waiting for CephFileSystem ocs-storagecluster-cephfilesystem to be deleted
  Warning  UninstallPending  6m15s  controller_storagecluster  uninstall: Waiting for CephBlockPool ocs-storagecluster-cephblockpool to be deleted
  Warning  UninstallPending  6m14s  controller_storagecluster  uninstall: Waiting for CephCluster to be deleted
```

And the Ceph health is also not healthy:

```
HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds daemon; 1 filesystem is offline; insufficient standby MDS daemons available
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] FS_WITH_FAILED_MDS: 1 filesystem has a failed mds daemon
    fs ocs-storagecluster-cephfilesystem has 1 failed mds
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs ocs-storagecluster-cephfilesystem is offline because no MDS is active for it.
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
```

And as stated above, the CephCluster doesn't get deleted due to the presence of the PVs and PVCs. We should move this to ocs-operator for a solution, as was done for the VMware bug.
Everything is fine on the ocs-operator side. Indeed, rook-ceph-operator is having problems removing the CephCluster:

```
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949016 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-06ccd9cc-8b4e-4c86-8e18-6ae0b95a2646"
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949033 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-4d3df2c4-6083-4054-aa8c-257a5505c477"
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949037 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-cb21cf2f-7af3-46e5-b4f6-e1ed120a324e"
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949040 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-dcc76cdb-015f-4b52-8852-122c62c88c49"
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949043 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-de4bffa8-28fd-4c04-b3d3-d2d5a1c47688"
2022-03-04T12:20:03.949085496Z 2022-03-04 12:20:03.949073 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to clean up CephCluster "openshift-storage/ocs-storagecluster-cephcluster": failed to check if volumes exist for CephCluster in namespace "openshift-storage": waiting for csi volume attachments in cluster "openshift-storage" to be cleaned up
```

Blaine found the problem PV:

```
pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6   50Gi   RWO   Delete   Released   openshift-storage/db-noobaa-db-pg-0   ocs-storagecluster-ceph-rbd   22m
```

So it's the NooBaa DB volume. However, the truly weird thing is that the PVC is gone. Looking at the following code, if the PVC is gone, it would explain how we moved along through the uninstall logic after the NooBaa CR was removed:

https://github.com/red-hat-storage/ocs-operator/blob/4f2dfefe4e014f1191ef88d4cdf831ecde3ef430/controllers/storagecluster/noobaa_system_reconciler.go#L229-L244

At that point we would continue deleting everything, including the StorageClasses.
If the CSI driver gets the PV's relevant Secret information from the StorageClass (man, that makes me mad...) then this behavior makes sense. That said, looking at the CSI provisioner logs I see something I don't understand. The initial DeleteVolume request(s?) came in and completed successfully. But then another one comes along and fails? Following the volume handle "0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f" in this Pod:

http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/2060897/after/must-gather.local.475967562843324470/quay-io-rhceph-dev-ocs-must-gather-sha256-1763ae725759f4fc78c9a7d432cd47a80f15487ef3ec79b6c7225b1e736529f1/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-d98d9b847-pvfgg/

From csi-rbdplugin:

```
2022-03-04T12:13:02.322224603Z I0304 12:13:02.322180 1 utils.go:191] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:13:02.322314962Z I0304 12:13:02.322302 1 utils.go:195] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:13:02.323395991Z I0304 12:13:02.323374 1 omap.go:87] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.e9976bc8-9bb2-11ec-aef9-0a580a80020f"): map[csi.imageid:5e491cd01c30 csi.imagename:csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f csi.volname:pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72 csi.volume.owner:test]
2022-03-04T12:13:02.337705434Z I0304 12:13:02.337672 1 utils.go:191] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:13:02.337730258Z I0304 12:13:02.337716 1 utils.go:195] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:13:02.338344191Z I0304 12:13:02.338328 1 omap.go:87] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.ecf0d81b-9bb2-11ec-aef9-0a580a80020f"): map[csi.imageid:5e4956a52cf0 csi.imagename:csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f csi.volname:pvc-821db073-46ae-4749-81ae-8cec1bf86b7a csi.volume.owner:test]
2022-03-04T12:13:02.388544034Z I0304 12:13:02.388511 1 rbd_util.go:647] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f rbd: delete csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f-temp using mon 172.30.183.200:6789,172.30.221.180:6789,172.30.55.98:6789, pool ocs-storagecluster-cephblockpool
2022-03-04T12:13:02.392576069Z I0304 12:13:02.392553 1 controllerserver.go:958] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f deleting image csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f
2022-03-04T12:13:02.392576069Z I0304 12:13:02.392567 1 rbd_util.go:647] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f rbd: delete csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f using mon 172.30.183.200:6789,172.30.221.180:6789,172.30.55.98:6789, pool ocs-storagecluster-cephblockpool
2022-03-04T12:13:02.404866765Z I0304 12:13:02.404839 1 rbd_util.go:647] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f rbd: delete csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f-temp using mon 172.30.183.200:6789,172.30.221.180:6789,172.30.55.98:6789, pool ocs-storagecluster-cephblockpool
2022-03-04T12:13:02.409296524Z I0304 12:13:02.409274 1 controllerserver.go:958] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f deleting image csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f
2022-03-04T12:13:02.409296524Z I0304 12:13:02.409287 1 rbd_util.go:647] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f rbd: delete csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f using mon 172.30.183.200:6789,172.30.221.180:6789,172.30.55.98:6789, pool ocs-storagecluster-cephblockpool
2022-03-04T12:13:02.427891448Z I0304 12:13:02.427867 1 rbd_util.go:682] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f rbd: adding task to remove image "ocs-storagecluster-cephblockpool/csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f" with id "5e491cd01c30" from trash
2022-03-04T12:13:02.444124128Z I0304 12:13:02.444099 1 rbd_util.go:682] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f rbd: adding task to remove image "ocs-storagecluster-cephblockpool/csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f" with id "5e4956a52cf0" from trash
2022-03-04T12:13:02.493368614Z I0304 12:13:02.493340 1 rbd_util.go:706] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f rbd: successfully added task to move image "ocs-storagecluster-cephblockpool/csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f" with id "5e491cd01c30" to trash
2022-03-04T12:13:02.500494605Z I0304 12:13:02.500465 1 omap.go:123] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f removed omap keys (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72]
2022-03-04T12:13:02.500558471Z I0304 12:13:02.500544 1 utils.go:202] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f GRPC response: {}
2022-03-04T12:13:02.504471563Z I0304 12:13:02.504446 1 rbd_util.go:706] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f rbd: successfully added task to move image "ocs-storagecluster-cephblockpool/csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f" with id "5e4956a52cf0" to trash
2022-03-04T12:13:02.510271302Z I0304 12:13:02.510240 1 omap.go:123] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f removed omap keys (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-821db073-46ae-4749-81ae-8cec1bf86b7a]
2022-03-04T12:13:02.510315138Z I0304 12:13:02.510300 1 utils.go:202] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f GRPC response: {}
2022-03-04T12:13:54.468124415Z I0304 12:13:54.468095 1 utils.go:191] ID: 44 GRPC call: /csi.v1.Identity/Probe
2022-03-04T12:13:54.468152266Z I0304 12:13:54.468137 1 utils.go:195] ID: 44 GRPC request: {}
2022-03-04T12:13:54.468173667Z I0304 12:13:54.468151 1 utils.go:202] ID: 44 GRPC response: {}
2022-03-04T12:14:32.324826039Z I0304 12:14:32.324791 1 utils.go:191] ID: 45 Req-ID: 0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:14:32.324897203Z I0304 12:14:32.324864 1 utils.go:195] ID: 45 Req-ID: 0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:14:32.324918706Z E0304 12:14:32.324897 1 utils.go:200] ID: 45 Req-ID: 0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f GRPC error: rpc error: code = InvalidArgument desc = provided secret is empty
```

From csi-provisioner on the same Pod:
```
2022-03-04T12:13:02.316830038Z I0304 12:13:02.316795 1 controller.go:1471] delete "pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72": started
2022-03-04T12:13:02.321784455Z I0304 12:13:02.321766 1 connection.go:183] GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:13:02.321840148Z I0304 12:13:02.321778 1 connection.go:184] GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:13:02.333250078Z I0304 12:13:02.333220 1 controller.go:1471] delete "pvc-821db073-46ae-4749-81ae-8cec1bf86b7a": started
2022-03-04T12:13:02.337452686Z I0304 12:13:02.337435 1 connection.go:183] GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:13:02.337492181Z I0304 12:13:02.337447 1 connection.go:184] GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:13:02.500825563Z I0304 12:13:02.500768 1 connection.go:186] GRPC response: {}
2022-03-04T12:13:02.500825563Z I0304 12:13:02.500808 1 connection.go:187] GRPC error: <nil>
2022-03-04T12:13:02.500825563Z I0304 12:13:02.500821 1 controller.go:1486] delete "pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72": volume deleted
2022-03-04T12:13:02.510495404Z I0304 12:13:02.510450 1 connection.go:186] GRPC response: {}
2022-03-04T12:13:02.510495404Z I0304 12:13:02.510479 1 connection.go:187] GRPC error: <nil>
2022-03-04T12:13:02.510495404Z I0304 12:13:02.510489 1 controller.go:1486] delete "pvc-821db073-46ae-4749-81ae-8cec1bf86b7a": volume deleted
2022-03-04T12:13:02.511190259Z I0304 12:13:02.511172 1 controller.go:1531] delete "pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72": persistentvolume deleted
2022-03-04T12:13:02.511202245Z I0304 12:13:02.511188 1 controller.go:1536] delete "pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72": succeeded
2022-03-04T12:13:02.523690731Z I0304 12:13:02.523663 1 controller.go:1531] delete "pvc-821db073-46ae-4749-81ae-8cec1bf86b7a": persistentvolume deleted
2022-03-04T12:13:02.523720594Z I0304 12:13:02.523687 1 controller.go:1536] delete "pvc-821db073-46ae-4749-81ae-8cec1bf86b7a": succeeded
[...]
2022-03-04T12:14:32.324464947Z I0304 12:14:32.324429 1 controller.go:1471] delete "pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6": started
2022-03-04T12:14:32.324509888Z W0304 12:14:32.324482 1 controller.go:1192] failed to get storageclass: ocs-storagecluster-ceph-rbd, proceeding to delete without secrets. storageclass.storage.k8s.io "ocs-storagecluster-ceph-rbd" not found
2022-03-04T12:14:32.324509888Z I0304 12:14:32.324499 1 connection.go:183] GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:14:32.324553506Z I0304 12:14:32.324505 1 connection.go:184] GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:14:32.325071220Z I0304 12:14:32.325035 1 connection.go:186] GRPC response: {}
2022-03-04T12:14:32.325071220Z I0304 12:14:32.325061 1 connection.go:187] GRPC error: rpc error: code = InvalidArgument desc = provided secret is empty
2022-03-04T12:14:32.325088744Z E0304 12:14:32.325079 1 controller.go:1481] delete "pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6": volume deletion failed: rpc error: code = InvalidArgument desc = provided secret is empty
2022-03-04T12:14:32.325118519Z W0304 12:14:32.325104 1 controller.go:989] Retrying syncing volume "pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6", failure 0
2022-03-04T12:14:32.325146494Z E0304 12:14:32.325133 1 controller.go:1007] error syncing volume "pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6": rpc error: code = InvalidArgument desc = provided secret is empty
2022-03-04T12:14:32.325217440Z I0304 12:14:32.325190 1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6", UID:"6b0e667f-7de2-462c-9317-0adc4bf15f7b", APIVersion:"v1", ResourceVersion:"52043", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = InvalidArgument desc = provided secret is empty
```

So in both cases, the CSI driver seems to believe the volume was successfully deleted, and then tried to delete it again...?

All of this is to say that this is not an ocs-operator problem at the moment. We need to learn how the PVC was removed such that the PV was not fully deleted before the StorageClasses were deleted. Is this a race condition? If so, how have we not hit this before? In any case, moving this back to the csi-driver. Tagging both Madhu and Yati for further insight, just in case.
Blaine and I made significant progress investigating this issue, but some further analysis and design is still pending. Thus far, we believe that in fixing our initial batch of uninstall problems we exposed one bug in the new changes to Rook-Ceph and a long-standing bug in the Ceph-CSI RBD provisioner. We can probably work around the latter bug while fixing the former. I'm moving this BZ to Rook-Ceph for now; we should create/clone a new one for the Ceph-CSI issue once we're certain what it is. My understanding is that uninstall is not a GA feature, so I don't think it qualifies for blocker status on ODF 4.10.0. Moving it back to ODF 4.10.z at the very least.
I am not sure what exactly is being referred to as a long-standing bug in the Ceph-CSI RBD provisioner in c#11. If it refers to the dependency on the StorageClass for DeleteVolume to succeed, in my opinion we **cannot** treat that as a "bug" at this stage; rather, it is by design, and Ceph-CSI has no role here at all. In short, this is how it has been since the very first version of Ceph-CSI, the external-provisioner sidecar, etc. In other words, the external-provisioner sidecar, which is responsible for CreateVolume and DeleteVolume, reads the credentials from the StorageClass and passes them to the CSI driver; the StorageClass is a requirement for these operations to succeed. That has been the behavior since the very first release of these components, and it is common to all volume plugins that need secrets. There are discussions upstream about getting rid of the StorageClass dependency, or treating the StorageClass as an independent object not tied to the lifecycle of the PV, but that is an enhancement/improvement still being discussed/proposed. In that sense, there is nothing we can fix in the Ceph-CSI layer here. This issue looks to be caused by a regression or race condition from Rook PR https://github.com/rook/rook/pull/9041
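For readers unfamiliar with this coupling: the secrets live in `parameters` on the StorageClass, which the external-provisioner resolves at CreateVolume/DeleteVolume time. A sketch of a typical ODF RBD StorageClass (field values based on this cluster's must-gather; exact parameters may vary per deployment):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-storagecluster-ceph-rbd
provisioner: openshift-storage.rbd.csi.ceph.com
parameters:
  clusterID: openshift-storage
  pool: ocs-storagecluster-cephblockpool
  imageFeatures: layering
  imageFormat: "2"
  # The external-provisioner reads these secret references from the
  # StorageClass on every CreateVolume and DeleteVolume call, which is
  # why deleting the StorageClass breaks deletion of its PVs:
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
reclaimPolicy: Delete
```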
Blaine, what are the next steps? Is there anything we can do? Thanks!
I believe work on CephFilesystem and CephBlockPool "dependents" in Rook (I'm currently investigating) will prevent the issue from occurring in the future. For now, given that uninstall is not a supported workflow, I don't believe we are expected to put extra priority on this.
Travis, Blaine, Jose, et al.: the improvement in the external-provisioner sidecar to no longer depend on the StorageClass after PV creation, so that volume deletion is not SC-dependent, has been fixed/done via https://github.com/kubernetes-csi/external-provisioner/pull/713. This should be part of the next external-provisioner upstream release.
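If I read the upstream change correctly, the provisioner records the deletion secret reference on the PV itself at provision time, so DeleteVolume no longer needs the StorageClass. A sketch of what that looks like on a PV (annotation names taken from my reading of the upstream change; treat them as illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6
  annotations:
    pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
    # Written at provision time and read back at DeleteVolume time,
    # even if the StorageClass has since been deleted:
    volume.kubernetes.io/provisioner-deletion-secret-name: rook-csi-rbd-provisioner
    volume.kubernetes.io/provisioner-deletion-secret-namespace: openshift-storage
```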
Ok, no changes needed in ODF except to pick up the image? Any timeline on when we can pick it up downstream? Moving to the build component to pick it up when available.
(In reply to Travis Nielsen from comment #20) > Ok, no changes needed in ODF except to pick up the image? Any timeline on > when we can pick it up downstream? > Moving to the build component to pick it up when available. I expect https://github.com/kubernetes-csi/external-provisioner/pull/713 to be part of the next external-provisioner update; however, I don't expect it to land in OCP versions earlier than 4.12. So that would be the first release where we can expect the feature to be available from the ODF point of view.
It looks better to park this on the `rook` component (rather than build), as I was told we are working on a similar solution/fix in Rook. Please feel free to revert/change the component accordingly.
Per Humble in comment 21, moving this to 4.12. We can wait for the external provisioner, but I believe there is more to this issue that needs to be ironed out. Note that https://github.com/rook/rook/pull/9915 is in progress to add more checks for filesystems/subvolumes in the finalizer. However, that does not help the forced-uninstall scenario. If the CephCluster CR is deleted, Rook doesn't wait for the finalizers of the child CRs. If we need to wait for those finalizers on the child CRs, the OCS operator can't delete the CephCluster CR until after those CRs are removed.
Per discussion in a tentative Rook PR [1], it would seem that OCS operator is setting the cleanup policy for forced deletion too soon. Blaine, before moving this BZ back to the ocs-operator, can you take a look to confirm if there is anything else Rook is missing with the finalizer design? [1] https://github.com/rook/rook/pull/10231#discussion_r876427414
It's a little hard for me to say. The interactions between Rook, CSI, and OCS are complicated. Certainly, part of the issue is that ocs-operator is deleting the storageclasses before PVs/PVCs are removed. I think Rook can help smooth the issue by disallowing the CephFilesystems/CephBlockPools from being deleted if they are in use (by PVs or otherwise). I am working on this part as I'm able. But I still worry that in the forced deletion case, ODF will still have the same issue due to OCS operator continuing to delete storageclasses before PVCs are removed. In that case, I would hope it is sufficient to instruct users: (1) don't uninstall ODF before deleting PVCs, and (2) force delete any remaining PVCs after uninstalling ODF if necessary. Also given that uninstall is not a supported workflow in ODF, I have had to push back my work on the fix for this a few times.
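The user instruction in (1) could be made checkable with something like the following (a sketch; the `ocs-storagecluster-` prefix is assumed to cover all ODF-provisioned storage classes in this cluster):

```shell
# Before uninstalling ODF, list any PVCs still bound to ODF storage classes.
# These must be deleted first, otherwise their PVs lose access to the
# deletion secrets once the StorageClasses are removed.
kubectl get pvc -A \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,SC:.spec.storageClassName' \
  | grep 'ocs-storagecluster-' \
  || echo "no ODF-backed PVCs remain"
```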
The current approach is implemented well for forced deletions. For an attempted clean uninstall, it seems we would need the following approach: 1. Delete the filesystem CR(s), block pool CR(s), and object store CR(s) 2. Wait for Rook to allow them to be deleted. Rook will check for PVs, buckets, and so on. When all the consumers are gone, rook will remove the finalizer and allow them to be deleted. 3. Delete the CephCluster CR 4. Wait for Rook to remove the finalizer on the CephCluster CR and thus delete it. 5. Delete the storage classes This way the storage classes won't be deleted too soon and Rook is the one that owns checking for all the consumers (and not OCS operator). Not exactly simple, but if we really want to support a clean uninstall, the complexity seems necessary.
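The ordering above could be sketched with kubectl roughly as follows (CR names per this cluster; the CephFS storage class name is an assumption, and the `--wait=true` flags stand in for "wait for Rook to remove the finalizer"):

```shell
# 1-2. Delete the Ceph resource CRs; Rook holds their finalizers until
#      all consumers (PVs, buckets, ...) are gone, so --wait blocks here.
kubectl -n openshift-storage delete cephfilesystem ocs-storagecluster-cephfilesystem --wait=true
kubectl -n openshift-storage delete cephblockpool ocs-storagecluster-cephblockpool --wait=true

# 3-4. Only then delete the CephCluster CR and wait for its finalizer.
kubectl -n openshift-storage delete cephcluster ocs-storagecluster-cephcluster --wait=true

# 5. Finally, remove the storage classes, which nothing references anymore.
kubectl delete storageclass ocs-storagecluster-ceph-rbd ocs-storagecluster-cephfs
```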
*** Bug 2081965 has been marked as a duplicate of this bug. ***
Will move back to 4.12 if the fix is ready in time
I have been working with our builds for the last 3 months or so, and I am no longer encountering the issue. I don't know what changed, but it seems the issue may no longer be there. Amrita, can you try to reproduce the issue on any available 4.12 or 4.13 build and tell me whether it is still reproducible?
In the last 3 months or so (4.12 and 4.13), I no longer see this problem. I don't know exactly what changed, but it is no longer reproducible. If someone hits the issue again, this can be reopened.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days