Bug 2060897 - storagesystem deletion is stuck during uninstall process
Summary: storagesystem deletion is stuck during uninstall process
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Malay Kumar parida
QA Contact:
URL:
Whiteboard:
Duplicates: 2081965 (view as bug list)
Depends On:
Blocks: 2005040
 
Reported: 2022-03-04 13:21 UTC by Amrita Mahapatra
Modified: 2023-12-08 04:27 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-03-20 09:02:56 UTC
Embargoed:
mparida: needinfo-




Links
Github red-hat-storage/ocs-operator pull 1819 (open): FIx storagecluster deletion is stuck forever (last updated 2022-09-17 16:30:19 UTC)

Comment 4 Blaine Gardner 2022-03-04 17:18:48 UTC
It does appear that the PVC is blocking deletion of the Ceph cluster. I'm not sure what secret is missing. Both the rook-csi-rbd-provisioner and rook-csi-rbd-node secrets still exist in the must-gather directory 'after/'. I don't think I can debug this further very easily. Moving this to the Ceph-CSI component.


Name:            pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-ceph-rbd
Status:          Released
Claim:           openshift-storage/db-noobaa-db-pg-0
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        50Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.rbd.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f
    ReadOnly:          false
    VolumeAttributes:      clusterID=openshift-storage
                           csi.storage.k8s.io/pv/name=pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6
                           csi.storage.k8s.io/pvc/name=db-noobaa-db-pg-0
                           csi.storage.k8s.io/pvc/namespace=openshift-storage
                           imageFeatures=layering
                           imageFormat=2
                           imageName=csi-vol-56a6f1a1-9bb2-11ec-aef9-0a580a80020f
                           journalPool=ocs-storagecluster-cephblockpool
                           pool=ocs-storagecluster-cephblockpool
                           storage.kubernetes.io/csiProvisionerIdentity=1646394835158-8081-openshift-storage.rbd.csi.ceph.com
Events:
  Type     Reason              Age                    From                                                                                                               Message
  ----     ------              ----                   ----                                                                                                               -------
  Warning  VolumeFailedDelete  117s (x10 over 6m12s)  openshift-storage.rbd.csi.ceph.com_csi-rbdplugin-provisioner-d98d9b847-pvfgg_c99699ca-2fb4-45cd-a97d-795b918521d3  rpc error: code = InvalidArgument desc = provided secret is empty

Comment 5 yati padia 2022-03-07 05:42:49 UTC
Going through the must-gather, I see that the storageclass `ocs-storagecluster-ceph-rbd` no longer exists, which is why the PV below does not get deleted.

```
pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6   50Gi       RWO            Delete           Released   openshift-storage/db-noobaa-db-pg-0                 ocs-storagecluster-ceph-rbd            55m
```

Also, as per the uninstall documentation, the PVCs and OBCs must be deleted before deleting the storagesystem. Here, however, many PVCs still exist and need to be deleted (see the sketch at the end of this comment).
```
---
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      crushDeviceClass: ""
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
      volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
      volume.kubernetes.io/selected-node: ip-10-0-186-155.us-east-2.compute.internal
      volume.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
    creationTimestamp: "2022-03-04T11:56:47Z"
    finalizers:
    - kubernetes.io/pvc-protection
    generateName: ocs-deviceset-gp2-0-data-0
    labels:
      ceph.rook.io/DeviceSet: ocs-deviceset-gp2-0
      ceph.rook.io/DeviceSetPVCId: ocs-deviceset-gp2-0-data-0
      ceph.rook.io/setIndex: "0"
    name: ocs-deviceset-gp2-0-data-0p4djr
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: ceph.rook.io/v1
      blockOwnerDeletion: true
      controller: true
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      uid: 9d961761-b5a7-4a26-b79e-bc784c2d7681
    resourceVersion: "39038"
    uid: 4d3df2c4-6083-4054-aa8c-257a5505c477
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 512Gi
    storageClassName: gp2
    volumeMode: Block
    volumeName: pvc-4d3df2c4-6083-4054-aa8c-257a5505c477
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 512Gi
    phase: Bound
```

And these PVCs are mostly not getting deleted due to their finalizers:
```
 finalizers:
    - kubernetes.io/pvc-protection
```
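
For context, the documented precondition (delete application PVCs and OBCs before removing the storagesystem) can be checked up front. A minimal client-go sketch, assuming the default ODF storage class names; this is illustrative only, not part of any operator:
```
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// List PVCs that are still bound to ODF storage classes; these must be
// deleted before the storagesystem, otherwise uninstall gets stuck.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	odfClasses := map[string]bool{
		"ocs-storagecluster-ceph-rbd": true,
		"ocs-storagecluster-cephfs":   true,
	}
	pvcs, err := cs.CoreV1().PersistentVolumeClaims("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pvc := range pvcs.Items {
		if pvc.Spec.StorageClassName != nil && odfClasses[*pvc.Spec.StorageClassName] {
			fmt.Printf("PVC %s/%s still uses %s; delete it before uninstalling\n",
				pvc.Namespace, pvc.Name, *pvc.Spec.StorageClassName)
		}
	}
}
```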

Comment 8 yati padia 2022-03-07 07:55:06 UTC
Yes @rgeorge, this is similar to the VMware bug. As mentioned in the comment above, the resources are not getting deleted due to the finalizers. We need to explicitly or forcefully remove the finalizers, which will allow the resources to be deleted.

Comment 9 yati padia 2022-03-07 08:14:46 UTC
Going through the discussion in bug 2005040, I see that removing the finalizers isn't a good user experience.
The deletion of the storagesystem is blocked for the following reason:
```
Name:         ocs-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  uninstall.ocs.openshift.io/cleanup-policy: delete

Events:
  Type     Reason            Age    From                       Message
  ----     ------            ----   ----                       -------
  Warning  UninstallPending  6m18s  controller_storagecluster  uninstall: Waiting on NooBaa system noobaa to be deleted
  Warning  UninstallPending  6m17s  controller_storagecluster  uninstall: Waiting for CephFileSystem ocs-storagecluster-cephfilesystem to be deleted
  Warning  UninstallPending  6m15s  controller_storagecluster  uninstall: Waiting for CephBlockPool ocs-storagecluster-cephblockpool to be deleted
  Warning  UninstallPending  6m14s  controller_storagecluster  uninstall: Waiting for CephCluster to be deleted
```
And the Ceph health is also in an error state:

```
HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds daemon; 1 filesystem is offline; insufficient standby MDS daemons available
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ocs-storagecluster-cephfilesystem is degraded
[WRN] FS_WITH_FAILED_MDS: 1 filesystem has a failed mds daemon
    fs ocs-storagecluster-cephfilesystem has 1 failed mds
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs ocs-storagecluster-cephfilesystem is offline because no MDS is active for it.
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
```
And, as stated above, the cephcluster doesn't get deleted due to the presence of PVs and PVCs.
We should move this to ocs-operator for a solution, as was done for the VMware bug.

Comment 10 Jose A. Rivera 2022-03-07 15:18:01 UTC
Everything is fine on the ocs-operator side. Indeed, rook-ceph-operator is having problems removing the CephCluster:

2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949016 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-06ccd9cc-8b4e-4c86-8e18-6ae0b95a2646"
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949033 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-4d3df2c4-6083-4054-aa8c-257a5505c477"
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949037 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-cb21cf2f-7af3-46e5-b4f6-e1ed120a324e"
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949040 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-dcc76cdb-015f-4b52-8852-122c62c88c49"
2022-03-04T12:20:03.949054839Z 2022-03-04 12:20:03.949043 E | ceph-cluster-controller: Spec.CSI is nil for PV "pvc-de4bffa8-28fd-4c04-b3d3-d2d5a1c47688"
2022-03-04T12:20:03.949085496Z 2022-03-04 12:20:03.949073 E | ceph-cluster-controller: failed to reconcile CephCluster "openshift-storage/ocs-storagecluster-cephcluster". failed to clean up CephCluster "openshift-storage/ocs-storagecluster-cephcluster": failed to check if volumes exist for CephCluster in namespace "openshift-storage": waiting for csi volume attachments in cluster "openshift-storage" to be cleaned up

Blaine found the problem PV:

pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6   50Gi       RWO            Delete           Released   openshift-storage/db-noobaa-db-pg-0                 ocs-storagecluster-ceph-rbd            22m

So it's the NooBaa DB volume. However, the truly weird thing is that the PVC is gone. Looking at the following code, if the PVC is gone, that would explain how we moved along through the uninstall logic after the NooBaa CR was removed: https://github.com/red-hat-storage/ocs-operator/blob/4f2dfefe4e014f1191ef88d4cdf831ecde3ef430/controllers/storagecluster/noobaa_system_reconciler.go#L229-L244 At that point we would continue deleting everything, including the StorageClasses. If the CSI driver gets the PV's relevant Secret information from the StorageClass (man, that makes me mad...) then this behavior makes sense.
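
The behavior described here would follow from a NotFound-tolerant check. A rough sketch, assuming a controller-runtime client; the function name and exact flow are mine and not the linked ocs-operator code:
```
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// noobaaDBPVCGone stands in for the linked uninstall check: once the NooBaa DB
// PVC is gone, the uninstall logic moves on, which is how the StorageClasses
// can end up deleted while the Released PV is still waiting for cleanup.
func noobaaDBPVCGone(ctx context.Context, c client.Client, ns string) (bool, error) {
	pvc := &corev1.PersistentVolumeClaim{}
	key := client.ObjectKey{Namespace: ns, Name: "db-noobaa-db-pg-0"}
	if err := c.Get(ctx, key, pvc); err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil // PVC gone; proceed with the rest of the uninstall
		}
		return false, err
	}
	return false, nil // PVC still present; keep waiting
}
```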

That said, looking at the CSI provisioner logs I see something I don't understand. The initial DeleteVolume request(s?) came in and completed successfully. But then another one comes along and fails?

Following the volume handle "0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f" in this Pod: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/2060897/after/must-gather.local.475967562843324470/quay-io-rhceph-dev-ocs-must-gather-sha256-1763ae725759f4fc78c9a7d432cd47a80f15487ef3ec79b6c7225b1e736529f1/namespaces/openshift-storage/pods/csi-rbdplugin-provisioner-d98d9b847-pvfgg/

From csi-rbdplugin:

2022-03-04T12:13:02.322224603Z I0304 12:13:02.322180       1 utils.go:191] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:13:02.322314962Z I0304 12:13:02.322302       1 utils.go:195] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:13:02.323395991Z I0304 12:13:02.323374       1 omap.go:87] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.e9976bc8-9bb2-11ec-aef9-0a580a80020f"): map[csi.imageid:5e491cd01c30 csi.imagename:csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f csi.volname:pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72 csi.volume.owner:test]
2022-03-04T12:13:02.337705434Z I0304 12:13:02.337672       1 utils.go:191] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:13:02.337730258Z I0304 12:13:02.337716       1 utils.go:195] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:13:02.338344191Z I0304 12:13:02.338328       1 omap.go:87] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.ecf0d81b-9bb2-11ec-aef9-0a580a80020f"): map[csi.imageid:5e4956a52cf0 csi.imagename:csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f csi.volname:pvc-821db073-46ae-4749-81ae-8cec1bf86b7a csi.volume.owner:test]
2022-03-04T12:13:02.388544034Z I0304 12:13:02.388511       1 rbd_util.go:647] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f rbd: delete csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f-temp using mon 172.30.183.200:6789,172.30.221.180:6789,172.30.55.98:6789, pool ocs-storagecluster-cephblockpool
2022-03-04T12:13:02.392576069Z I0304 12:13:02.392553       1 controllerserver.go:958] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f deleting image csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f
2022-03-04T12:13:02.392576069Z I0304 12:13:02.392567       1 rbd_util.go:647] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f rbd: delete csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f using mon 172.30.183.200:6789,172.30.221.180:6789,172.30.55.98:6789, pool ocs-storagecluster-cephblockpool
2022-03-04T12:13:02.404866765Z I0304 12:13:02.404839       1 rbd_util.go:647] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f rbd: delete csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f-temp using mon 172.30.183.200:6789,172.30.221.180:6789,172.30.55.98:6789, pool ocs-storagecluster-cephblockpool
2022-03-04T12:13:02.409296524Z I0304 12:13:02.409274       1 controllerserver.go:958] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f deleting image csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f
2022-03-04T12:13:02.409296524Z I0304 12:13:02.409287       1 rbd_util.go:647] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f rbd: delete csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f using mon 172.30.183.200:6789,172.30.221.180:6789,172.30.55.98:6789, pool ocs-storagecluster-cephblockpool
2022-03-04T12:13:02.427891448Z I0304 12:13:02.427867       1 rbd_util.go:682] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f rbd: adding task to remove image "ocs-storagecluster-cephblockpool/csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f" with id "5e491cd01c30" from trash
2022-03-04T12:13:02.444124128Z I0304 12:13:02.444099       1 rbd_util.go:682] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f rbd: adding task to remove image "ocs-storagecluster-cephblockpool/csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f" with id "5e4956a52cf0" from trash
2022-03-04T12:13:02.493368614Z I0304 12:13:02.493340       1 rbd_util.go:706] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f rbd: successfully added task to move image "ocs-storagecluster-cephblockpool/csi-vol-e9976bc8-9bb2-11ec-aef9-0a580a80020f" with id "5e491cd01c30" to trash
2022-03-04T12:13:02.500494605Z I0304 12:13:02.500465       1 omap.go:123] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f removed omap keys (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72]
2022-03-04T12:13:02.500558471Z I0304 12:13:02.500544       1 utils.go:202] ID: 42 Req-ID: 0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f GRPC response: {}
2022-03-04T12:13:02.504471563Z I0304 12:13:02.504446       1 rbd_util.go:706] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f rbd: successfully added task to move image "ocs-storagecluster-cephblockpool/csi-vol-ecf0d81b-9bb2-11ec-aef9-0a580a80020f" with id "5e4956a52cf0" to trash
2022-03-04T12:13:02.510271302Z I0304 12:13:02.510240       1 omap.go:123] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f removed omap keys (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volumes.default"): [csi.volume.pvc-821db073-46ae-4749-81ae-8cec1bf86b7a]
2022-03-04T12:13:02.510315138Z I0304 12:13:02.510300       1 utils.go:202] ID: 43 Req-ID: 0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f GRPC response: {}
2022-03-04T12:13:54.468124415Z I0304 12:13:54.468095       1 utils.go:191] ID: 44 GRPC call: /csi.v1.Identity/Probe
2022-03-04T12:13:54.468152266Z I0304 12:13:54.468137       1 utils.go:195] ID: 44 GRPC request: {}
2022-03-04T12:13:54.468173667Z I0304 12:13:54.468151       1 utils.go:202] ID: 44 GRPC response: {}
2022-03-04T12:14:32.324826039Z I0304 12:14:32.324791       1 utils.go:191] ID: 45 Req-ID: 0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:14:32.324897203Z I0304 12:14:32.324864       1 utils.go:195] ID: 45 Req-ID: 0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:14:32.324918706Z E0304 12:14:32.324897       1 utils.go:200] ID: 45 Req-ID: 0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f GRPC error: rpc error: code = InvalidArgument desc = provided secret is empty

From csi-provisioner on the same Pod:

2022-03-04T12:13:02.316830038Z I0304 12:13:02.316795       1 controller.go:1471] delete "pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72": started
2022-03-04T12:13:02.321784455Z I0304 12:13:02.321766       1 connection.go:183] GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:13:02.321840148Z I0304 12:13:02.321778       1 connection.go:184] GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-e9976bc8-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:13:02.333250078Z I0304 12:13:02.333220       1 controller.go:1471] delete "pvc-821db073-46ae-4749-81ae-8cec1bf86b7a": started
2022-03-04T12:13:02.337452686Z I0304 12:13:02.337435       1 connection.go:183] GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:13:02.337492181Z I0304 12:13:02.337447       1 connection.go:184] GRPC request: {"secrets":"***stripped***","volume_id":"0001-0011-openshift-storage-0000000000000001-ecf0d81b-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:13:02.500825563Z I0304 12:13:02.500768       1 connection.go:186] GRPC response: {}
2022-03-04T12:13:02.500825563Z I0304 12:13:02.500808       1 connection.go:187] GRPC error: <nil>
2022-03-04T12:13:02.500825563Z I0304 12:13:02.500821       1 controller.go:1486] delete "pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72": volume deleted
2022-03-04T12:13:02.510495404Z I0304 12:13:02.510450       1 connection.go:186] GRPC response: {}
2022-03-04T12:13:02.510495404Z I0304 12:13:02.510479       1 connection.go:187] GRPC error: <nil>
2022-03-04T12:13:02.510495404Z I0304 12:13:02.510489       1 controller.go:1486] delete "pvc-821db073-46ae-4749-81ae-8cec1bf86b7a": volume deleted
2022-03-04T12:13:02.511190259Z I0304 12:13:02.511172       1 controller.go:1531] delete "pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72": persistentvolume deleted
2022-03-04T12:13:02.511202245Z I0304 12:13:02.511188       1 controller.go:1536] delete "pvc-c24b4757-c0bb-49bc-ba07-95d197c89d72": succeeded
2022-03-04T12:13:02.523690731Z I0304 12:13:02.523663       1 controller.go:1531] delete "pvc-821db073-46ae-4749-81ae-8cec1bf86b7a": persistentvolume deleted
2022-03-04T12:13:02.523720594Z I0304 12:13:02.523687       1 controller.go:1536] delete "pvc-821db073-46ae-4749-81ae-8cec1bf86b7a": succeeded
[...]
2022-03-04T12:14:32.324464947Z I0304 12:14:32.324429       1 controller.go:1471] delete "pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6": started
2022-03-04T12:14:32.324509888Z W0304 12:14:32.324482       1 controller.go:1192] failed to get storageclass: ocs-storagecluster-ceph-rbd, proceeding to delete without secrets. storageclass.storage.k8s.io "ocs-storagecluster-ceph-rbd" not found
2022-03-04T12:14:32.324509888Z I0304 12:14:32.324499       1 connection.go:183] GRPC call: /csi.v1.Controller/DeleteVolume
2022-03-04T12:14:32.324553506Z I0304 12:14:32.324505       1 connection.go:184] GRPC request: {"volume_id":"0001-0011-openshift-storage-0000000000000001-56a6f1a1-9bb2-11ec-aef9-0a580a80020f"}
2022-03-04T12:14:32.325071220Z I0304 12:14:32.325035       1 connection.go:186] GRPC response: {}
2022-03-04T12:14:32.325071220Z I0304 12:14:32.325061       1 connection.go:187] GRPC error: rpc error: code = InvalidArgument desc = provided secret is empty
2022-03-04T12:14:32.325088744Z E0304 12:14:32.325079       1 controller.go:1481] delete "pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6": volume deletion failed: rpc error: code = InvalidArgument desc = provided secret is empty
2022-03-04T12:14:32.325118519Z W0304 12:14:32.325104       1 controller.go:989] Retrying syncing volume "pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6", failure 0
2022-03-04T12:14:32.325146494Z E0304 12:14:32.325133       1 controller.go:1007] error syncing volume "pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6": rpc error: code = InvalidArgument desc = provided secret is empty
2022-03-04T12:14:32.325217440Z I0304 12:14:32.325190       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-e5bd26c4-b448-4219-a18b-7b7fb3f4e1c6", UID:"6b0e667f-7de2-462c-9317-0adc4bf15f7b", APIVersion:"v1", ResourceVersion:"52043", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = InvalidArgument desc = provided secret is empty

So in both cases, the CSI driver seems to believe the volume was successfully deleted, and then tried to delete it again...?

All of this is to say that this is not an ocs-operator problem at the moment. We need to learn how the PVC was removed such that the PV did not get fully deleted before the StorageClasses were deleted. Is this a race condition? If so, how have we not hit this before? In any case, moving this back to the csi-driver. Tagging both Madhu and Yati for further insight, just in case.
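
For reference, the "provided secret is empty" failure is consistent with the driver validating the incoming secrets map before talking to Ceph at all. A simplified sketch of that kind of guard, not the actual ceph-csi implementation:
```
import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// The external-provisioner normally resolves the secret reference from the
// StorageClass parameters and sends the credentials with the request. Once the
// StorageClass is gone, the secrets map arrives empty and the call is rejected
// before the driver ever reaches the cluster.
func deleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	if len(req.GetSecrets()) == 0 {
		return nil, status.Error(codes.InvalidArgument, "provided secret is empty")
	}
	// ... credentials parsed from req.GetSecrets(), RBD image looked up and
	// moved to trash (omitted).
	return &csi.DeleteVolumeResponse{}, nil
}
```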

Comment 11 Jose A. Rivera 2022-03-07 19:46:01 UTC
Blaine and I made significant progress in investigating this issue, but some further analysis and design is still pending. Thus far, we believe that in fixing our initial batch of uninstall problems we exposed one bug in the new changes to Rook-Ceph and a long-standing bug in the Ceph-CSI RBD provisioner. We can probably work around the latter bug in fixing the former. 

I'm moving this BZ to Rook-Ceph for now. We should create/clone a new one for the Ceph-CSI issue once we're certain what it is. My understanding is that uninstall is not a GA feature, so I don't think it qualifies for blocker status on ODF 4.10.0. Moving it back to ODF 4.10.z at the very least.

Comment 12 Humble Chirammal 2022-03-08 05:12:09 UTC
I am not sure what exactly is being referred to as a long-outstanding bug in the Ceph CSI RBD provisioner in c#11. If it is the dependency on the SC for deleteVolume to succeed, IMO we **can not** treat that as a "bug" at this stage; rather, it is by design, and Ceph CSI doesn't really have a role here at all. In short, that is how it has been since the very first version of Ceph CSI, the external provisioner sidecar, etc. In other words, the external provisioner sidecar, which is responsible for createVolume and deleteVolume, takes the credentials by reading the SC and passes them to the CSI driver, so the SC is a requirement for these operations to succeed. It has been the same behavior since the very first release of these components, and it is common to all volume plugins that need secrets.

There are discussions upstream about getting rid of the SC dependency, or treating the SC as an independent object not completely tied to the lifecycle of the PV. But that is an enhancement or improvement still being discussed/proposed.

In that sense, we can not fix anything from the Ceph CSI layer here.

This issue looks to be caused by a regression or race condition with Rook PR https://github.com/rook/rook/pull/9041
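
To illustrate where those credentials come from: the secret references live in the RBD StorageClass parameters, which the external provisioner resolves on every CreateVolume/DeleteVolume call. The secret names below are the ones mentioned in comment 4; the exact parameter set shown is an assumption based on the standard CSI keys:
```
import (
	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Once this StorageClass is deleted, the provisioner has nothing left to
// resolve, so DeleteVolume is sent without secrets and the driver rejects it.
var rbdStorageClass = storagev1.StorageClass{
	ObjectMeta:  metav1.ObjectMeta{Name: "ocs-storagecluster-ceph-rbd"},
	Provisioner: "openshift-storage.rbd.csi.ceph.com",
	Parameters: map[string]string{
		"clusterID": "openshift-storage",
		"pool":      "ocs-storagecluster-cephblockpool",
		"csi.storage.k8s.io/provisioner-secret-name":      "rook-csi-rbd-provisioner",
		"csi.storage.k8s.io/provisioner-secret-namespace": "openshift-storage",
		"csi.storage.k8s.io/node-stage-secret-name":       "rook-csi-rbd-node",
		"csi.storage.k8s.io/node-stage-secret-namespace":  "openshift-storage",
	},
}
```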

Comment 17 Sébastien Han 2022-03-21 16:36:09 UTC
Blaine, what are the next steps? Is there anything we can do?
Thanks!

Comment 18 Blaine Gardner 2022-03-22 17:50:53 UTC
I believe work on CephFilesystem and CephBlockPool "dependents" in Rook (I'm currently investigating) will prevent the issue from occurring in the future. For now, given that uninstall is not a supported workflow, I don't believe we are expected to put extra priority on this.

Comment 19 Humble Chirammal 2022-04-12 06:50:04 UTC
Travis,Blaine,Jose..Et.Al,

The improvement to the external provisioner sidecar so that it no longer needs the SC after PV creation, and thus does not make volume deletion SC-dependent, has been done via

https://github.com/kubernetes-csi/external-provisioner/pull/713

This should be part of the next external provisioner sidecar upstream release.

Comment 20 Travis Nielsen 2022-04-12 15:11:10 UTC
Ok, no changes needed in ODF except to pick up the image? Any timeline on when we can pick it up downstream?
Moving to the build component to pick it up when available.

Comment 21 Humble Chirammal 2022-05-09 10:37:12 UTC
(In reply to Travis Nielsen from comment #20)
> Ok, no changes needed in ODF except to pick up the image? Any timeline on
> when we can pick it up downstream?
> Moving to the build component to pick it up when available.

I expect https://github.com/kubernetes-csi/external-provisioner/pull/713 to be part of the next external provisioner update; however, I don't expect it to land in OCP releases earlier than 4.12. So that would be the first release in which we can expect the feature to be available from the ODF point of view.

Comment 22 Humble Chirammal 2022-05-17 07:51:49 UTC
It looks better to park this on the `rook` component (rather than build), as I was told we are working on a similar solution/fix in Rook. Please feel free to revert/change the component accordingly.

Comment 24 Travis Nielsen 2022-05-17 18:48:34 UTC
Per Humble in comment 21, moving this to 4.12. We can wait for the external provisioner, but I believe there is more to this issue that needs to be ironed out.

Note that https://github.com/rook/rook/pull/9915 is in progress to add more checks for filesystems/subvolumes in the finalizer.
However, that does not help the forced-uninstall scenario. If the CephCluster CR is deleted, Rook doesn't wait for the finalizers of the child CRs. If we need to wait for those finalizers, the OCS operator can't delete the CephCluster CR until after those child CRs are removed.

Comment 26 Travis Nielsen 2022-05-18 22:34:05 UTC
Per discussion in a tentative Rook PR [1], it would seem that OCS operator is setting the cleanup policy for forced deletion too soon. 
Blaine, before moving this BZ back to the ocs-operator, can you take a look to confirm if there is anything else Rook is missing with the finalizer design?

[1] https://github.com/rook/rook/pull/10231#discussion_r876427414

Comment 27 Blaine Gardner 2022-05-23 19:53:44 UTC
It's a little hard for me to say. The interactions between Rook, CSI, and OCS are complicated.

Certainly, part of the issue is that ocs-operator is deleting the storageclasses before PVs/PVCs are removed. I think Rook can help smooth the issue by disallowing the CephFilesystems/CephBlockPools from being deleted if they are in use (by PVs or otherwise). I am working on this part as I'm able. 

But I still worry that in the forced deletion case, ODF will still have the same issue due to OCS operator continuing to delete storageclasses before PVCs are removed. In that case, I would hope it is sufficient to instruct users:
(1) don't uninstall ODF before deleting PVCs, and 
(2) force delete any remaining PVCs after uninstalling ODF if necessary.

Also, given that uninstall is not a supported workflow in ODF, I have had to push back my work on this fix a few times.
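
For case (2) above, a last-resort sketch of force-deleting a PVC that is stuck in Terminating by clearing its finalizers. This is a hypothetical helper, only safe when the backing storage is already gone; otherwise the volume is leaked:
```
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// forceDeletePVC clears the finalizers (e.g. kubernetes.io/pvc-protection) so
// a PVC already marked for deletion can finally go away after the CSI driver
// and StorageClasses have been removed.
func forceDeletePVC(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := cs.CoreV1().PersistentVolumeClaims(ns).Patch(
		ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```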

Comment 29 Travis Nielsen 2022-05-24 18:35:50 UTC
The current approach is implemented well for forced deletions.

For an attempted clean uninstall, it seems we would need the following approach:
1. Delete the filesystem CR(s), block pool CR(s), and object store CR(s)
2. Wait for Rook to allow them to be deleted. Rook will check for PVs, buckets, and so on. When all the consumers are gone, Rook will remove the finalizer and allow them to be deleted.
3. Delete the CephCluster CR
4. Wait for Rook to remove the finalizer on the CephCluster CR and thus delete it.
5. Delete the storage classes

This way the storage classes won't be deleted too soon and Rook is the one that owns checking for all the consumers (and not OCS operator). Not exactly simple, but if we really want to support a clean uninstall, the complexity seems necessary.
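
A rough sketch of that ordering, assuming a controller-runtime client and the resource names seen in this cluster; illustrative only, not the actual ocs-operator implementation (object store CRs would be handled the same way as the filesystem and block pool):
```
import (
	"context"
	"fmt"
	"time"

	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
	storagev1 "k8s.io/api/storage/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanUninstall follows the ordering above: delete the Ceph child CRs, let
// Rook's finalizers confirm no consumers remain, delete the CephCluster, and
// only then remove the StorageClasses.
func cleanUninstall(ctx context.Context, c client.Client, ns string) error {
	children := []client.Object{
		&cephv1.CephFilesystem{ObjectMeta: metav1.ObjectMeta{Name: "ocs-storagecluster-cephfilesystem", Namespace: ns}},
		&cephv1.CephBlockPool{ObjectMeta: metav1.ObjectMeta{Name: "ocs-storagecluster-cephblockpool", Namespace: ns}},
	}
	for _, obj := range children {
		if err := c.Delete(ctx, obj); client.IgnoreNotFound(err) != nil {
			return err
		}
	}
	if err := waitGone(ctx, c, children...); err != nil { // Rook checks for PVs, buckets, etc.
		return err
	}

	cluster := &cephv1.CephCluster{ObjectMeta: metav1.ObjectMeta{Name: "ocs-storagecluster-cephcluster", Namespace: ns}}
	if err := c.Delete(ctx, cluster); client.IgnoreNotFound(err) != nil {
		return err
	}
	if err := waitGone(ctx, c, cluster); err != nil {
		return err
	}

	// Only now is it safe to remove the StorageClasses.
	sc := &storagev1.StorageClass{ObjectMeta: metav1.ObjectMeta{Name: "ocs-storagecluster-ceph-rbd"}}
	return client.IgnoreNotFound(c.Delete(ctx, sc))
}

// waitGone polls until every object returns NotFound.
func waitGone(ctx context.Context, c client.Client, objs ...client.Object) error {
	for _, obj := range objs {
		key := client.ObjectKeyFromObject(obj)
		for {
			err := c.Get(ctx, key, obj)
			if apierrors.IsNotFound(err) {
				break
			}
			if err != nil {
				return err
			}
			select {
			case <-ctx.Done():
				return fmt.Errorf("timed out waiting for %s to be deleted", key)
			case <-time.After(5 * time.Second):
			}
		}
	}
	return nil
}
```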

Comment 30 Nitin Goyal 2022-05-30 08:06:31 UTC
*** Bug 2081965 has been marked as a duplicate of this bug. ***

Comment 31 Mudit Agarwal 2022-11-02 03:24:06 UTC
Will move back to 4.12 if the fix is ready in time

Comment 46 Malay Kumar parida 2023-01-23 10:44:30 UTC
I have been interacting with our builds for the last 3 months or so, and I am no longer encountering the issue. I don't know what changed, but it seems the issue may no longer be there.
Amrita, can you try to reproduce the issue on any available 4.12 or 4.13 build and tell me if it is still reproducible?

Comment 50 Malay Kumar parida 2023-03-20 09:02:56 UTC
In the last 3 months or so (4.12 & 4.13), I no longer see this problem. I don't know exactly what changed, but this is no longer reproducible. If someone hits the issue again, this can be reopened.

Comment 51 Red Hat Bugzilla 2023-12-08 04:27:54 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

