Bug 2203086

Summary: csi-snap images are not deleted while the volumesnapshotclass policy is set to 'Delete'
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: csi-driver
Version: 4.12
Reporter: David Vaanunu <dvaanunu>
Assignee: Niels de Vos <ndevos>
QA Contact: krishnaram Karthick <kramdoss>
CC: mrajanna, muagarwa, ndevos, odf-bz-bot
Status: CLOSED COMPLETED
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Type: Bug
Last Closed: 2023-08-10 11:19:35 UTC

Description David Vaanunu 2023-05-11 08:06:27 UTC
Description of problem (please be as detailed as possible and provide log snippets):


OADP testing (backup flow), with the volumesnapshotclass deletionPolicy set to 'Delete'.
While the backup is running, VolumeSnapshot & VolumeSnapshotContent objects are created, and at the end of the test both are deleted.

When checking the 'cephblockpool' afterwards, the csi-snap images still exist.

Version of all relevant components (if applicable):

OCP 4.12.9
ODF 4.12.2
OADP 1.2.0-63

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
The pool can accumulate thousands or even tens of thousands of csi-snap images, which may impact Ceph performance.

Is there any workaround available to the best of your knowledge?
Manually delete the leftover csi-snap images (a sketch of this is shown below).
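
As an illustration only (not part of the original report), a minimal sketch of what the manual cleanup could look like from the rook-ceph toolbox shell, assuming the toolbox is deployed and the leftover images are no longer referenced; the pool and image names are taken from the listings further down:

# list the leftover snapshot-backed images
rbd ls --pool=ocs-storagecluster-cephblockpool | grep '^csi-snap-'
# an image may first need its RBD snapshots purged before it can be removed
rbd snap purge ocs-storagecluster-cephblockpool/csi-snap-32a8aae4-4a13-4665-a4ec-d1966d6ca4d1
rbd rm ocs-storagecluster-cephblockpool/csi-snap-32a8aae4-4a13-4665-a4ec-d1966d6ca4d1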

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Delete any existing volumesnapshot, volumesnapshotcontent, and csi-snap images (from Ceph).
2. Set the volumesnapshotclass policy to 'Delete'.
3. Create a namespace with a few pods.
4. Run a CSI backup (OADP).
5. During the test, VolumeSnapshots & VolumeSnapshotContents are created.
6. When the test completes, the VolumeSnapshots & VolumeSnapshotContents are deleted.
7. Check the Ceph pool - the csi-snap images are not deleted (see the toolbox commands below).
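
A minimal sketch of step 7 (illustrative commands, not from the original report), assuming the ODF rook-ceph toolbox is deployed in the openshift-storage namespace:

oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
# inside the toolbox shell:
rbd ls --pool=ocs-storagecluster-cephblockpool | grep '^csi-snap-'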



must-gather output:
https://drive.google.com/drive/folders/1fr1g04Xj9I4la93neJqFaWKvHiqcinrG?usp=share_link


Actual results:

The csi-snap images are not deleted from the Ceph pool.

Expected results:

The csi-snap images are deleted together with the VolumeSnapshot and VolumeSnapshotContent objects.

Additional info:


[root@f01-h07-000-r640 playbooks]# oc get sc
NAME                                    PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
nvme-disks                              kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  36d
ocs-storagecluster-ceph-rbd (default)   openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   36d
ocs-storagecluster-ceph-rgw             openshift-storage.ceph.rook.io/bucket   Delete          Immediate              false                  36d
ocs-storagecluster-cephfs               openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   36d
ocs-storagecluster-cephfs-shallow       openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   22d
openshift-storage.noobaa.io             openshift-storage.noobaa.io/obc         Delete          Immediate              false                  36d
ssd-disks                               kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  36d


[root@f01-h07-000-r640 playbooks]# oc get volumesnapshotclass
NAME                                        DRIVER                                  DELETIONPOLICY   AGE
ocs-storagecluster-cephfsplugin-snapclass   openshift-storage.cephfs.csi.ceph.com   Delete           36d
ocs-storagecluster-rbdplugin-snapclass      openshift-storage.rbd.csi.ceph.com      Delete           36d
scale-volumesnapshotclass                   openshift-storage.rbd.csi.ceph.com      Delete           35d

[root@f01-h07-000-r640 playbooks]# oc get volumesnapshotclass scale-volumesnapshotclass -oyaml
apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Delete
driver: openshift-storage.rbd.csi.ceph.com
kind: VolumeSnapshotClass
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"snapshot.storage.k8s.io/v1","deletionPolicy":"Delete","driver":"openshift-storage.rbd.csi.ceph.com","kind":"VolumeSnapshotClass","metadata":{"annotations":{"snapshot.storage.kubernetes.io/is-default-class":"true"},"labels":{"velero.io/csi-volumesnapshot-class":"true"},"name":"scale-volumesnapshotclass"},"parameters":{"clusterID":"openshift-storage","csi.storage.k8s.io/snapshotter-secret-name":"rook-csi-rbd-provisioner","csi.storage.k8s.io/snapshotter-secret-namespace":"openshift-storage"}}
    snapshot.storage.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2023-04-05T11:01:53Z"
  generation: 52
  labels:
    velero.io/csi-volumesnapshot-class: "true"
  name: scale-volumesnapshotclass
  resourceVersion: "31746873"
  uid: 02635ee4-fb62-4aa1-8f8d-36df79bcaa0d
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: openshift-storage


sh-4.4$ rbd ls --pool=ocs-storagecluster-cephblockpool        
csi-snap-32a8aae4-4a13-4665-a4ec-d1966d6ca4d1
csi-snap-4f4cdeda-7170-4390-8616-8808fa4820d6
csi-snap-8637a09d-6415-41c5-956c-6423f10e6d8a
csi-snap-97fd08d1-5d51-4108-bde7-35eb98ec3c5d
csi-snap-a6596798-4d4f-47db-889c-cf85f67c7502
csi-snap-a9d4469d-bdba-4600-8f68-142df08e7f8d
csi-snap-baa534a7-3dbf-451b-b578-d1650e4e3769
csi-snap-ca800564-e4cf-4fa8-a6e2-735f99f65478
csi-snap-cfa2190b-fd9c-4602-8d9f-6e85ebd9824a
csi-snap-d740aebf-9170-4b53-a069-839eed37b85a
csi-vol-011c6410-83c6-4e73-916a-ce086ef9bf10
csi-vol-0f470f3a-8810-4396-96ce-1f31d5db8fe2
csi-vol-108050ba-bef8-4cf5-b9a6-a62fd816988e
csi-vol-121d9ef5-13a1-4b25-9eea-a40481f65e37
csi-vol-72edaf1f-f5ae-488d-a1b9-71aa2b1a0447
csi-vol-7dc4a909-43f9-4b9e-b279-1aa8b9b64f95
csi-vol-7eaed530-829d-4c91-9de7-ba979ef1ecf9
csi-vol-9806be2d-1313-4773-b272-1f1869c1cf9d
csi-vol-bd6e5b1f-b554-402e-8cae-0c02508e015e
csi-vol-c162d738-dc25-476f-aa0f-74975aa09fff

Comment 2 Niels de Vos 2023-05-11 09:49:04 UTC
It seems that the VolumeSnapshotClass is not correctly configured. The logs from the csi-rbdplugin-provisioner/csi-snapshotter contain many messages like the following:

2023-04-26T04:03:09.673248713Z E0426 04:03:09.673201       1 snapshot_controller_base.go:283] could not sync content "snapcontent-e5c948cd-6435-457b-8d22-67bd35db0398-clone": failed to delete snapshot "snapcontent-e5c948cd-6435-457b-8d22-67bd35db0398-clone", err: failed to delete snapshot content snapcontent-e5c948cd-6435-457b-8d22-67bd35db0398-clone: "rpc error: code = Internal desc = provided secret is empty"

This prevents the VolumeSnapshotContent from being deleted. These objects are expected to remain in the cluster until deletion succeeds.

Could you please check:

1. the secrets referenced in the VolumeSnapshotClass
2. the VolumeSnapshotContent objects in the cluster (for example, with the commands sketched below)
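
A hedged sketch of those checks (illustrative commands, not part of the original comment); <content-name> is a placeholder for a stuck object:

# 1. verify the snapshotter secret parameters on the class
oc get volumesnapshotclass scale-volumesnapshotclass -o yaml | grep snapshotter-secret
# 2. list the remaining content objects and look for deletion errors on a stuck one
oc get volumesnapshotcontent
oc describe volumesnapshotcontent <content-name>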

Comment 5 Mudit Agarwal 2023-05-15 17:27:49 UTC
Not a 4.13 blocker

Comment 6 Niels de Vos 2023-05-16 12:41:34 UTC
I have not been able to reproduce this with simple steps:

---- >% ----
#
# yaml files from github.com/ceph/ceph-csi/examples/rbd/
#

# poll the object until the given template field reaches the expected value
oc_wait_status() {
	local TEMPLATE="${1}" UNTIL="${2}" OBJ="${3}"

	local STATUS=''
	while [ "${STATUS}" != "${UNTIL}" ]
	do
		[ -z "${STATUS}" ] || sleep 1
		STATUS=$(oc get --template="{{${TEMPLATE}}}" "${OBJ}")
	done
}

create_pvc() {
	oc create -f pvc.yaml
	oc_wait_status .status.phase Bound persistentvolumeclaim/rbd-pvc
}

create_snapshot() {
	oc create -f snapshot.yaml
	oc_wait_status .status.readyToUse true volumesnapshot/rbd-pvc-snapshot
}

restore_pvc() {
	oc create -f pvc-restore.yaml
	oc_wait_status .status.phase Bound persistentvolumeclaim/rbd-pvc-restore
}

cleanup() {
	cat pvc.yaml snapshot.yaml pvc-restore.yaml | oc delete -f- --wait
}

RUNS=0

# repeatedly create, snapshot, restore and clean up, counting iterations
while true
do
	create_pvc
	create_snapshot
	restore_pvc

	cleanup

	RUNS=$((RUNS+1))
	echo "run ${RUNS} finished"
	sleep 3
done

---- %< ----

There is no growing list of snapshots on the Ceph side after running this for a working day (~5000 iterations).

Comment 9 David Vaanunu 2023-08-10 10:03:32 UTC
Hello Madhu,

The ODF version was updated to 4.12.5; the behavior is still the same.

I will send you the live cluster info in gchat.

Thanks

Comment 11 David Vaanunu 2023-08-10 11:18:05 UTC
Once the policy was changed to 'Delete' and the 'volumesnapshot' & 'volumesnapshotcontent' objects were deleted, the 'csi-snap' images were deleted as well.
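
For reference, a hedged sketch (not from the original comment) of how the deletionPolicy on an already-created, stuck VolumeSnapshotContent could be switched before deleting it; <content-name> is a placeholder:

oc patch volumesnapshotcontent <content-name> --type merge -p '{"spec":{"deletionPolicy":"Delete"}}'
oc delete volumesnapshotcontent <content-name>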