Bug 1951399
| Summary: | volumesnapshotcontent cannot be deleted; SnapshotDeleteError: Failed to delete snapshot | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | henrychi |
| Component: | csi-driver | Assignee: | Yug Gupta <ygupta> |
| Status: | CLOSED NOTABUG | QA Contact: | Elad <ebenahar> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.6 | CC: | hchiramm, madam, mrajanna, muagarwa, ocs-bugs, ygupta |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-18 07:25:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
henrychi
2021-04-20 04:33:16 UTC
> snapcontent-fea465c8-5485-48ba-b3de-897bd0f1bc4c   true   42949672960   Retain   openshift-storage.cephfs.csi.ceph.com   ocs-storagecluster-cephfsplugin-snapclass-velero   velero-demo-cephfs-pvc-vpl4t   4m12s

The Retain policy for the snapshot is not tested, and it may not be supported in OCS.

> I0420 01:08:31.456546 1 snapshot_controller.go:439] getSnapshotClass: VolumeSnapshotClassName [ocs-storagecluster-cephfsplugin-snapclass-velero]
> E0420 01:08:31.457834 1 snapshot_controller_base.go:261] could not sync content "velero-velero-demo-cephfs-pvc-vpl4t-rdnbj": failed to delete snapshot "velero-velero-demo-cephfs-pvc-vpl4t-rdnbj", err: failed to delete snapshot content velero-velero-demo-cephfs-pvc-vpl4t-rdnbj: "rpc error: code = InvalidArgument desc = provided secret is empty"

It looks like the snapshot class was deleted before the volume snapshot object ("provided secret is empty"); see https://bugzilla.redhat.com/show_bug.cgi?id=1893739#c7. This might be what is preventing the snapshot from being deleted.

I don't know how OADP behaves in the case of snapshot backup and restore; at least from the samples provided above, it creates a snapshot and snapshot content and then deletes them.

> oc delete volumesnapshotcontents velero-velero-demo-cephfs-pvc-vpl4t-rdnbj
> (hangs)

The snapshotcontent object is not meant to be deleted by the user, as it is dynamically provisioned. Maybe try removing the finalizers and then deleting the volumesnapshotcontent?

@Yug, can you please try to reproduce this issue and check what is missing here?

(In reply to Madhu Rajanna from comment #2)

> > snapcontent-fea465c8-5485-48ba-b3de-897bd0f1bc4c   true   42949672960   Retain   openshift-storage.cephfs.csi.ceph.com   ocs-storagecluster-cephfsplugin-snapclass-velero   velero-demo-cephfs-pvc-vpl4t   4m12s
>
> The Retain policy for the snapshot is not tested and maybe it is not
> supported in OCS.
> > I0420 01:08:31.456546 1 snapshot_controller.go:439] getSnapshotClass: VolumeSnapshotClassName [ocs-storagecluster-cephfsplugin-snapclass-velero]
> > E0420 01:08:31.457834 1 snapshot_controller_base.go:261] could not sync content "velero-velero-demo-cephfs-pvc-vpl4t-rdnbj": failed to delete snapshot "velero-velero-demo-cephfs-pvc-vpl4t-rdnbj", err: failed to delete snapshot content velero-velero-demo-cephfs-pvc-vpl4t-rdnbj: "rpc error: code = InvalidArgument desc = provided secret is empty"
>
> Looks like the snapshotclass is deleted before deleting the volume snapshot
> object (the provided secret is empty)
> https://bugzilla.redhat.com/show_bug.cgi?id=1893739#c7 . (this might be
> causing the issue to delete the snapshot)

The above is the case here: the snapshot class is not available for some reason. Maybe it was deleted manually, or the volume snapshot class is not available at restore time. henrychi, if you can list the exact process of snapshot backup and restore with respect to the volume snapshot class, that would help as well. Let me know if more info is needed.

While reproducing, deleting the backup via `./velero backup delete mybackup` does not seem to delete the volumesnapshot and volumesnapshotcontent created via velero.

Can you please share the configuration of the velero instance and the backup CRD used?
1) Example of velero configuration:

cat konveyor.openshift.io_v1alpha1_velero_cr.yaml

```yaml
apiVersion: konveyor.openshift.io/v1alpha1
kind: Velero
metadata:
  name: example-velero
spec:
  olm_managed: false
  default_velero_plugins:
  - aws
  - openshift
  - csi
  custom_velero_plugins:
  - name: cpdbr-velero-plugin
    image: image-registry.openshift-image-registry.svc:5000/oadp-operator/cpdbr-velero-plugin:latest
  backup_storage_locations:
  - name: default
    provider: aws
    object_storage:
      bucket: velero
    config:
      region: minio
      s_3__force_path_style: "true"
      s_3__url: http://minio-velero.apps.mycluster.ibm.com
    credentials_secret_ref:
      name: oadp-repo-secret
      namespace: oadp-operator
  enable_restic: true
  velero_resource_allocation:
    limits:
      cpu: "1"
      memory: 512Mi
    requests:
      cpu: 500m
      memory: 256Mi
  restic_resource_allocation:
    limits:
      cpu: "1"
      memory: 16Gi
    requests:
      cpu: 500m
      memory: 256Mi
  velero_image_fqin: velero/velero:v1.5.4
```

2) Example VolumeSnapshotClass, with deletionPolicy set to Retain:

cat ocs-storagecluster-cephfsplugin-snapclass-velero.yaml

```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
deletionPolicy: Retain
driver: openshift-storage.cephfs.csi.ceph.com
kind: VolumeSnapshotClass
metadata:
  name: ocs-storagecluster-cephfsplugin-snapclass-velero
  labels:
    velero.io/csi-volumesnapshot-class: "true"
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: openshift-storage
```

3) I didn't use a backup CRD; I just used the velero command line, e.g.
```
./velero backup create mybackup --include-namespaces testns --exclude-resources='Event,Event.events.k8s.io'
```

---snip--

2) Example VolumeSnapshotClass, with deletionPolicy set to Retain

cat ocs-storagecluster-cephfsplugin-snapclass-velero.yaml

```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
deletionPolicy: Retain
driver: openshift-storage.cephfs.csi.ceph.com
kind: VolumeSnapshotClass
metadata:
  name: ocs-storagecluster-cephfsplugin-snapclass-velero
  labels:
    velero.io/csi-volumesnapshot-class: "true"
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: openshift-storage
```

--/snip--

For a PVC dynamically provisioned from a StorageClass that has reclaimPolicy set to Retain, if a user deletes the PVC, the PV object and the underlying volume remain and have to be deleted manually. Similarly, if the VolumeSnapshotClass has its deletionPolicy set to Retain and you delete the VolumeSnapshot, I expect the VolumeSnapshotContent and the Ceph volume snapshot in the Ceph cluster to remain.

Isn't that the behaviour we are seeing here?

> For a PVC dynamically provisioned from a StorageClass that has
> reclaimPolicy set to Retain, if a user deletes the PVC, the PV object and
> the underlying volume remain and have to be deleted manually. Similarly,
> if the VolumeSnapshotClass has its deletionPolicy set to Retain and you
> delete the VolumeSnapshot, I expect the VolumeSnapshotContent and the Ceph
> volume snapshot in the Ceph cluster to remain.
>
> Isn't that the behaviour we are seeing here?

But aren't they trying to delete it manually?

> > oc delete volumesnapshotcontents velero-velero-demo-cephfs-pvc-vpl4t-rdnbj

I cannot even manually delete the volumesnapshotcontent. It hangs. From the problem description:

1) After restore, there are 2 volumesnapshotcontents.
```
oc get volumesnapshotcontents
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                                  VOLUMESNAPSHOTCLASS                                VOLUMESNAPSHOT                 AGE
snapcontent-fea465c8-5485-48ba-b3de-897bd0f1bc4c   true         42949672960   Retain           openshift-storage.cephfs.csi.ceph.com   ocs-storagecluster-cephfsplugin-snapclass-velero   velero-demo-cephfs-pvc-vpl4t   4m12s
velero-velero-demo-cephfs-pvc-vpl4t-rdnbj          true         0             Retain           openshift-storage.cephfs.csi.ceph.com   ocs-storagecluster-cephfsplugin-snapclass-velero   velero-demo-cephfs-pvc-vpl4t   32s
```

2) After deleting the velero backup, there is 1 volumesnapshotcontent.

```
oc get volumesnapshotcontents
NAME                                        READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                                  VOLUMESNAPSHOTCLASS                                VOLUMESNAPSHOT                 AGE
velero-velero-demo-cephfs-pvc-vpl4t-rdnbj   true         0             Delete           openshift-storage.cephfs.csi.ceph.com   ocs-storagecluster-cephfsplugin-snapclass-velero   velero-demo-cephfs-pvc-vpl4t   77s
```

3) I cannot manually delete the 1 remaining volumesnapshotcontent. It hangs.

```
oc delete volumesnapshotcontents velero-velero-demo-cephfs-pvc-vpl4t-rdnbj
(hangs)
```

Henry/Mudit, here is the confusion: how did the second VolumeSnapshotContent get created? That is, when we take a backup, I expect 1 VolumeSnapshot and 1 VolumeSnapshotContent to be created; then, after restore, we see ONLY one VolumeSnapshot but 2 VolumeSnapshotContents. Was the extra VolumeSnapshotContent created statically?

Also, if we look at the problem description, we can see the original VolumeSnapshotContent (velero-velero-demo-cephfs-pvc-vpl4t-rdnbj) has a restore size of "0", which shouldn't be the case. When was this size reflected/recorded: right after the backup creation, after restore, or after some other operation?

Also, the volume snapshot (velero-velero-demo-cephfs-pvc-vpl4t-rdnbj) refers to the "ocs-storagecluster-cephfsplugin-snapclass-velero" VolumeSnapshotClass. Does this still exist in the cluster (as asked in comments #3 and #4, which are not yet answered)?
I am not sure what Velero does in the backend at backup and restore time, so please provide these details, which could help us.

^ I don't know the internals of OADP/Velero/CSI driver, so I can't comment on why a second volumesnapshotcontent is created during restore.

The original volumesnapshotcontent is snapcontent-fea465c8-5485-48ba-b3de-897bd0f1bc4c, created from backup. The second volumesnapshotcontent is velero-velero-demo-cephfs-pvc-vpl4t-rdnbj, created from restore.

I can't see comment #3. Comment #4 is mine.

There are no volume snapshots existing in the cluster:

```
oc get volumesnapshot -A
No resources found
```

(In reply to henrychi from comment #12)
> I don't know the internals of OADP/Velero/CSI driver, so can't comment on
> why a second volumesnapshotcontent is created during restore.
>
> The original volumesnapshotcontent is
> snapcontent-fea465c8-5485-48ba-b3de-897bd0f1bc4c, created from backup.
> The second volumesnapshotcontent is
> velero-velero-demo-cephfs-pvc-vpl4t-rdnbj, created from restore.
>
> I can't see comment #3.
> Comment #4 is mine.
>
> There are no volume snapshots existing in the cluster:
> oc get volumesnapshot -A
> No resources found

I just noticed those comments are marked `internal`; I was thinking the internal comments were visible to you. That caused the confusion. I meant comments 2 and 3, instead of 3 and 4, though. I have just made comments 2 and 3 public.

In regards to the questions about the volumesnapshotclass: I manually created those before doing a backup and restore. After initial creation, they aren't touched. The volumesnapshotclasses still exist in my cluster.

1.
Create volumesnapshotclasses:

vi ocs-storagecluster-rbdplugin-snapclass-velero.yaml

```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
deletionPolicy: Retain
driver: openshift-storage.rbd.csi.ceph.com
kind: VolumeSnapshotClass
metadata:
  name: ocs-storagecluster-rbdplugin-snapclass-velero
  labels:
    velero.io/csi-volumesnapshot-class: "true"
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: openshift-storage
```

vi ocs-storagecluster-cephfsplugin-snapclass-velero.yaml

```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
deletionPolicy: Retain
driver: openshift-storage.cephfs.csi.ceph.com
kind: VolumeSnapshotClass
metadata:
  name: ocs-storagecluster-cephfsplugin-snapclass-velero
  labels:
    velero.io/csi-volumesnapshot-class: "true"
parameters:
  clusterID: openshift-storage
  csi.storage.k8s.io/snapshotter-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/snapshotter-secret-namespace: openshift-storage
```

2. Backup

3. Restore

4. Check if the volumesnapshotclasses still exist, and the answer is yes:

```
oc get volumesnapshotclass
NAME                                               DRIVER                                  DELETIONPOLICY   AGE
ocs-storagecluster-cephfsplugin-snapclass          openshift-storage.cephfs.csi.ceph.com   Delete           97d
ocs-storagecluster-cephfsplugin-snapclass-velero   openshift-storage.cephfs.csi.ceph.com   Retain           97d
ocs-storagecluster-rbdplugin-snapclass             openshift-storage.rbd.csi.ceph.com      Delete           97d
ocs-storagecluster-rbdplugin-snapclass-velero      openshift-storage.rbd.csi.ceph.com      Retain           97d
```

Thanks Henry, I think that answers the question of why we have two snapshot contents: restore will also create a new snapshot as well as a snapshot content.

Two things I want to mention here:

1. This is completely related to https://bugzilla.redhat.com/show_bug.cgi?id=1952708, and we need to see why the restore size for the snapshot is 0.

2.
As Madhu mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1951399#c2, we don't yet support the Retain policy for the snapshot class, and we need to experiment with it. Right now OCS supports the default snapshot classes, i.e. ocs-storagecluster-cephfsplugin-snapclass and ocs-storagecluster-rbdplugin-snapclass.

Thanks. I just want to add that using the Retain policy for the snapshot class was suggested to us by some folks from OADP, and it makes sense to me. A typical test scenario is to delete a namespace and then restore. If the policy is Delete, then when the volumesnapshot gets deleted, the volumesnapshotcontent gets deleted too, and restore won't work.

A normally created volumesnapshotcontent has the following annotations, which contain the secret name and namespace:

```
[ygupta@localhost cephfs]$ kubectl get volumesnapshotcontent snapcontent-6593fae7-5f12-41bd-b05f-c62d1a980ba4 -oyaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  annotations:
    snapshot.storage.kubernetes.io/deletion-secret-name: csi-cephfs-secret
    snapshot.storage.kubernetes.io/deletion-secret-namespace:
  creationTimestamp: "2021-05-12T05:48:23Z"
```

But on the other hand, when Madhu and I looked into the velero-created volumesnapshotcontent, it doesn't seem to have the above-mentioned annotations set, and it lacks the secret information:

```
[ygupta@localhost cephfs]$ kubectl get volumesnapshotcontent velero-velero-csi-cephfs-pvc-5brg9-5wjdg -oyaml
apiVersion: v1
items:
- apiVersion: snapshot.storage.k8s.io/v1
  kind: VolumeSnapshotContent
  metadata:
    annotations:
      snapshot.storage.kubernetes.io/volumesnapshot-being-deleted: "yes"
    creationTimestamp: "2021-05-11T11:49:37Z"
```

Because the necessary secret information is missing, deletion of the VolumeSnapshotContent created on velero restore gets stuck with the error "provided secret is empty".
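The missing-annotation situation described above can be checked mechanically. The sketch below is illustrative only: `vsc.yaml` stands in for a dumped VolumeSnapshotContent manifest (on a live cluster you would produce it with `kubectl get volumesnapshotcontent <name> -o yaml`), and its contents mirror the velero-restored object quoted above.

```shell
# Illustrative check for the deletion-secret annotations discussed above.
# vsc.yaml is a stand-in for a dumped object; on a live cluster:
#   kubectl get volumesnapshotcontent <name> -o yaml > vsc.yaml
cat > vsc.yaml <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  annotations:
    snapshot.storage.kubernetes.io/volumesnapshot-being-deleted: "yes"
EOF

# The velero-restored object lacks the deletion-secret annotations,
# so this branch prints "MISSING".
if grep -q 'snapshot.storage.kubernetes.io/deletion-secret-name' vsc.yaml; then
  echo "deletion-secret annotations present"
else
  echo "deletion-secret annotations MISSING"
fi
```

When the check reports the annotations as missing, the external-snapshotter has no secret to pass to the CSI driver on deletion, which matches the "provided secret is empty" error in the logs above.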
Based on the above, this does not look like an issue with the OCS operator, but with the restore operation by velero itself, as the restored objects seem to be missing some important annotations.

The first volumesnapshotcontent that velero backed up has the annotations. The mysterious second volumesnapshotcontent that is created during restore doesn't have the annotations. I'm just an end user of OADP, and don't know why or how the second one is created. Let me know if there's more info I can provide.

Is there a way to safely delete the second volumesnapshotcontent without leaking disk space?

(In reply to henrychi from comment #19)
> The first volumesnapshotcontent that velero backed up has the annotations.
> The mysterious second volumesnapshotcontent that is created during restore
> doesn't have the annotations.
> I'm just an end user of OADP, and don't know why or how the second one is
> created. Let me know if there's more info I can provide.

Regarding velero's internal implementation, maybe the velero team can provide more insight.

> Is there a way to safely delete the second volumesnapshotcontent, without
> leaking disk space?

As mentioned earlier, the second VolumeSnapshotContent seems to be missing the annotations required to perform the deletion. As a workaround, you can edit the VolumeSnapshotContent to add the required annotations manually, so that the deletion can go through. This might help you delete the VolumeSnapshotContent.

The workaround of adding the annotations manually allows the volumesnapshotcontent to be deleted. Thanks.

@henrychi, I am closing this BZ as not a bug from the OCS side. Please feel free to reopen if you think it's an OCS issue.
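The manual-annotation workaround from the closing comments can be sketched as a patch. This is an illustrative sketch, not an official procedure: the object name is the stuck VolumeSnapshotContent from this report, and the secret name/namespace are the CephFS provisioner values shown in the VolumeSnapshotClass above; verify both against your own cluster before applying.

```shell
# Sketch of the workaround: add the deletion-secret annotations to the
# stuck VolumeSnapshotContent so the snapshotter can delete the backing
# snapshot. Names below are taken from this bug report; adjust as needed.
VSC=velero-velero-demo-cephfs-pvc-vpl4t-rdnbj

PATCH='{"metadata":{"annotations":{
  "snapshot.storage.kubernetes.io/deletion-secret-name":"rook-csi-cephfs-provisioner",
  "snapshot.storage.kubernetes.io/deletion-secret-namespace":"openshift-storage"}}}'

# Validate the patch JSON locally before touching the cluster:
echo "$PATCH" | python3 -m json.tool > /dev/null && echo "patch OK"

# Apply on the cluster, then retry the delete (requires oc/kubectl access):
# oc patch volumesnapshotcontent "$VSC" --type merge -p "$PATCH"
# oc delete volumesnapshotcontent "$VSC"
```

If the delete still hangs after the patch, removing the object's finalizers (as suggested earlier in the thread) is the more forceful fallback, at the risk of leaking the underlying Ceph snapshot.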