Bug 1893739 - Force deletion doesn't work for snapshots if snapshotclass is already deleted
Summary: Force deletion doesn't work for snapshots if snapshotclass is already deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Christian Huffman
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-11-02 13:45 UTC by Neha Berry
Modified: 2021-02-24 15:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: For snapshots whose driver requires credentials, deleting the VolumeSnapshotClass prevented the resulting snapshots from being deleted.
Consequence: Once the VolumeSnapshotClass was deleted, the associated VolumeSnapshots and VolumeSnapshotContents could not be deleted.
Fix: The credentials are fetched from the VolumeSnapshotContent instead of relying on the VolumeSnapshotClass to exist.
Result: VolumeSnapshots and VolumeSnapshotContents that use credentials can now be deleted as long as the secret containing these credentials continues to exist.
Clone Of:
Environment:
Last Closed: 2021-02-24 15:29:19 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Finalizer-null-approach-to-delete-the-leftovers (154.88 KB, text/plain)
2020-11-06 16:35 UTC, Neha Berry


Links
System ID Private Priority Status Summary Last Updated
Github openshift csi-external-snapshotter pull 34 0 None closed Bug 1893739: UPSTREAM: 423: Get credentials before checking if the SnapshotClass exists 2021-02-16 22:11:40 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:29:55 UTC

Description Neha Berry 2020-11-02 13:45:14 UTC
Description of problem:
---------------------------------

When a PVC is deleted, nothing blocks the deletion even if the StorageClass has already been deleted. The external provisioner sends a request to the CSI driver with the volume ID and no secrets, because it cannot get the StorageClass name.

In the case of a volume snapshot, however, the VolumeSnapshot deletion never completes once the VolumeSnapshotClass has been deleted. Per the CSI spec, secrets are optional parameters, so why does the external snapshotter not send the delete snapshot request to the CSI driver? Is this expected behavior?


Especially in the case of an OCS uninstall, if users fail to delete the VolumeSnapshots (VS) before deleting the StorageCluster, the VolumeSnapshotClass gets deleted but the VS stay behind. Even a force deletion does not remove these dangling VS.

$ oc get volumesnapshotclass
No resources found

$ oc delete volumesnapshot -n default --all --force --grace-period=0
----

$ oc get volumesnapshot -A
NAMESPACE   NAME                   READYTOUSE   SOURCEPVC     SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                               SNAPSHOTCONTENT                                    CREATIONTIME   AGE
default     test-cephfs-snapshot   false        test-cephfs                           2Gi           ocs-storagecluster-cephfsplugin-snapclass   snapcontent-bc40d6e8-1387-40df-9e46-104dda851630   36h            36h
default     test-rbd-snapshot      false        test-rbd                              5Gi           ocs-storagecluster-rbdplugin-snapclass      snapcontent-602939aa-73dc-43b2-869e-db975a5a9b05   36h            36h


$ oc get volumesnapshotcontent -A
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                                  VOLUMESNAPSHOTCLASS                         VOLUMESNAPSHOT         AGE
snapcontent-602939aa-73dc-43b2-869e-db975a5a9b05   true         5368709120    Delete           openshift-storage.rbd.csi.ceph.com      ocs-storagecluster-rbdplugin-snapclass      test-rbd-snapshot      36h
snapcontent-bc40d6e8-1387-40df-9e46-104dda851630   true         2147483648    Delete           openshift-storage.cephfs.csi.ceph.com   ocs-storagecluster-cephfsplugin-snapclass   test-cephfs-snapshot   36h



Version-Release number of selected component (if applicable):
=============================================================
Tested last with 4.6.0-0.nightly-2020-10-22-034051 and OCS = 4.6.0-147.ci

How reproducible:
======================
Always

Steps to Reproduce:
1. Create an OCS 4.6 cluster with OCP 4.6
2. Create one each of CephFS and RBD PVCs and create snapshots using the default VS classes
3. To initiate OCS uninstall, delete the OBCs and PVCs but do not delete the VS

4. Delete the StorageCluster, which in turn deletes the VolumeSnapshotClass

$ oc delete -n openshift-storage storagecluster --all --wait=true

5. Try to delete the dangling, leftover VolumeSnapshots now that the CephCluster is already gone (no Ceph access)

$ oc delete volumesnapshot -n <project-name> --all --force --grace-period=0

6. See if the VS deletion succeeds

$ oc get volumesnapshotcontent -A

$ oc get volumesnapshot -A

Actual results:
---------------------
The VolumeSnapshots fail to get deleted, even with the force option


Expected results:
-------------------------
Users should be able to force-delete the leftovers in case the VolumeSnapshotClass is unknowingly deleted before the VS

Additional info:
=======================



$ oc logs csi-snapshot-controller-8bb7f7589-64tts -f -n openshift-cluster-storage-operator|tee csi-snapshot-controller-8bb7f7589-64tts.log



E1030 06:52:00.160184       1 snapshot_controller.go:1180] failed to retrieve snapshot class ocs-storagecluster-cephfsplugin-snapclass from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"ocs-storagecluster-cephfsplugin-snapclass\" not found"
E1030 06:52:00.160292       1 snapshot_controller_base.go:331] checkAndUpdateSnapshotClass failed to getSnapshotClass volumesnapshotclass.snapshot.storage.k8s.io "ocs-storagecluster-cephfsplugin-snapclass" not found
I1030 06:52:00.160341       1 snapshot_controller.go:897] cannot get claim from snapshot [test-cephfs-snapshot]: [failed to retrieve PVC test-cephfs from the lister: "persistentvolumeclaim \"test-cephfs\" not found"] Claim may be deleted already. No need to remove finalizer on the claim.


------------------------------------------------------------------

From GitHub issue
==================
https://github.com/kubernetes-csi/external-snapshotter/issues/412


```
I1029 10:13:02.926462       1 snapshot_controller.go:308] checkandRemoveSnapshotFinalizersAndCheckandDeleteContent: set DeletionTimeStamp on content [snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4].
I1029 10:13:02.934162       1 snapshot_controller.go:316] checkandRemoveSnapshotFinalizersAndCheckandDeleteContent: Remove Finalizer for VolumeSnapshot[default/rbd-pvc-snapshot]
I1029 10:13:02.934192       1 snapshot_controller.go:901] checkandRemovePVCFinalizer for snapshot [rbd-pvc-snapshot]: snapshot status [&v1beta1.VolumeSnapshotStatus{BoundVolumeSnapshotContentName:(*string)(0xc000285150), CreationTime:(*v1.Time)(0xc0004c2d80), ReadyToUse:(*bool)(0xc0002a8308), RestoreSize:(*resource.Quantity)(0xc00029f300), Error:(*v1beta1.VolumeSnapshotError)(0xc0002851b0)}]
I1029 10:13:02.943296       1 reflector.go:369] github.com/kubernetes-csi/external-snapshotter/client/v3/informers/externalversions/factory.go:117: forcing resync
I1029 10:13:02.943400       1 snapshot_controller_base.go:158] enqueued "default/rbd-pvc-snapshot" for sync
I1029 10:13:02.951105       1 util.go:264] storeObjectUpdate updating snapshot "default/rbd-pvc-snapshot" with version 1567
I1029 10:13:02.951304       1 snapshot_controller.go:1344] Removed protection finalizer from volume snapshot default/rbd-pvc-snapshot
I1029 10:13:02.951460       1 snapshot_controller_base.go:202] syncSnapshotByKey[default/rbd-pvc-snapshot]
I1029 10:13:02.951525       1 snapshot_controller_base.go:205] snapshotWorker: snapshot namespace [default] name [rbd-pvc-snapshot]
I1029 10:13:02.951545       1 snapshot_controller_base.go:328] checkAndUpdateSnapshotClass [rbd-pvc-snapshot]: VolumeSnapshotClassName [csi-rbdplugin-snapclass]
I1029 10:13:02.951555       1 snapshot_controller.go:1176] getSnapshotClass: VolumeSnapshotClassName [csi-rbdplugin-snapclass]
E1029 10:13:02.951573       1 snapshot_controller.go:1180] failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"csi-rbdplugin-snapclass\" not found"
E1029 10:13:02.951597       1 snapshot_controller_base.go:331] checkAndUpdateSnapshotClass failed to getSnapshotClass volumesnapshotclass.snapshot.storage.k8s.io "csi-rbdplugin-snapclass" not found
I1029 10:13:02.951609       1 snapshot_controller.go:721] updateSnapshotStatusWithEvent[default/rbd-pvc-snapshot]
I1029 10:13:02.951618       1 snapshot_controller.go:724] updateSnapshotStatusWithEvent[rbd-pvc-snapshot]: the same error &{2020-10-29 10:11:39 +0000 UTC 0xc0002851d0} is already set
I1029 10:13:02.951694       1 snapshot_controller_base.go:220] Snapshot "default/rbd-pvc-snapshot" is being deleted. SnapshotClass has already been removed
I1029 10:13:02.951715       1 snapshot_controller_base.go:222] Updating snapshot "default/rbd-pvc-snapshot"
I1029 10:13:02.951728       1 snapshot_controller_base.go:358] updateSnapshot "default/rbd-pvc-snapshot"
I1029 10:13:02.951751       1 util.go:264] storeObjectUpdate updating snapshot "default/rbd-pvc-snapshot" with version 1567
I1029 10:13:02.951778       1 snapshot_controller.go:180] synchronizing VolumeSnapshot[default/rbd-pvc-snapshot]: bound to: "snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4", Completed: false
I1029 10:13:02.951792       1 snapshot_controller.go:182] syncSnapshot [default/rbd-pvc-snapshot]: check if we should remove finalizer on snapshot PVC source and remove it if we can
I1029 10:13:02.951813       1 snapshot_controller.go:901] checkandRemovePVCFinalizer for snapshot [rbd-pvc-snapshot]: snapshot status [&v1beta1.VolumeSnapshotStatus{BoundVolumeSnapshotContentName:(*string)(0xc000285150), CreationTime:(*v1.Time)(0xc0004c2d80), ReadyToUse:(*bool)(0xc0002a8308), RestoreSize:(*resource.Quantity)(0xc00029f300), Error:(*v1beta1.VolumeSnapshotError)(0xc0002851b0)}]
I1029 10:13:02.951893       1 snapshot_controller.go:191] syncSnapshot[default/rbd-pvc-snapshot]: check if we should add invalid label on snapshot
I1029 10:13:02.951919       1 snapshot_controller.go:238] processSnapshotWithDeletionTimestamp VolumeSnapshot[default/rbd-pvc-snapshot]: bound to: "snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4", Completed: false
I1029 10:13:02.951936       1 snapshot_controller.go:272] processSnapshotWithDeletionTimestamp[default/rbd-pvc-snapshot]: delete snapshot content and remove finalizer from snapshot if needed
I1029 10:13:02.951951       1 snapshot_controller.go:278] checkandRemoveSnapshotFinalizersAndCheckandDeleteContent VolumeSnapshot[default/rbd-pvc-snapshot]: bound to: "snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4", Completed: false
I1029 10:13:02.951991       1 snapshot_controller.go:796] isVolumeBeingCreatedFromSnapshot: no volume is being created from snapshot default/rbd-pvc-snapshot
I1029 10:13:02.952021       1 snapshot_controller.go:297] checkandRemoveSnapshotFinalizersAndCheckandDeleteContent[default/rbd-pvc-snapshot]: Set VolumeSnapshotBeingDeleted annotation on the content [snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4]
I1029 10:13:02.952036       1 snapshot_controller.go:308] checkandRemoveSnapshotFinalizersAndCheckandDeleteContent: set DeletionTimeStamp on content [snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4].
I1029 10:13:02.964786       1 snapshot_controller.go:316] checkandRemoveSnapshotFinalizersAndCheckandDeleteContent: Remove Finalizer for VolumeSnapshot[default/rbd-pvc-snapshot]
I1029 10:13:02.964811       1 snapshot_controller.go:901] checkandRemovePVCFinalizer for snapshot [rbd-pvc-snapshot]: snapshot status [&v1beta1.VolumeSnapshotStatus{BoundVolumeSnapshotContentName:(*string)(0xc000285150), CreationTime:(*v1.Time)(0xc0004c2d80), ReadyToUse:(*bool)(0xc0002a8308), RestoreSize:(*resource.Quantity)(0xc00029f300), Error:(*v1beta1.VolumeSnapshotError)(0xc0002851b0)}]
I1029 10:13:02.974612       1 util.go:264] storeObjectUpdate updating snapshot "default/rbd-pvc-snapshot" with version 1567
I1029 10:13:02.974651       1 snapshot_controller.go:1344] Removed protection finalizer from volume snapshot default/rbd-pvc-snapshot
```

```
E1029 10:16:55.679819       1 snapshot_controller.go:224] getCSISnapshotInput failed to getClassFromVolumeSnapshot failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"csi-rbdplugin-snapclass\" not found"
E1029 10:16:55.685015       1 goroutinemap.go:150] Operation for "delete-snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4" failed. No retries permitted until 2020-10-29 10:17:27.679951762 +0000 UTC m=+442.637480008 (durationBeforeRetry 32s). Error: "failed to get input parameters to delete snapshot for content snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4: \"failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: \\\"volumesnapshotclass.snapshot.storage.k8s.io \\\\\\\"csi-rbdplugin-snapclass\\\\\\\" not found\\\"\""
I1029 10:16:55.684904       1 event.go:281] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4", UID:"ffc95922-3b31-494e-a40a-abc11f97c648", APIVersion:"snapshot.storage.k8s.io/v1beta1", ResourceVersion:"1565", FieldPath:""}): type: 'Warning' reason: 'SnapshotDeleteError' Failed to get snapshot class or credentials
E1029 10:17:55.679735       1 snapshot_controller.go:481] failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"csi-rbdplugin-snapclass\" not found"
E1029 10:17:55.679764       1 snapshot_controller.go:224] getCSISnapshotInput failed to getClassFromVolumeSnapshot failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"csi-rbdplugin-snapclass\" not found"
E1029 10:17:55.679819       1 goroutinemap.go:150] Operation for "delete-snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4" failed. No retries permitted until 2020-10-29 10:18:59.679780946 +0000 UTC m=+534.637309142 (durationBeforeRetry 1m4s). Error: "failed to get input parameters to delete snapshot for content snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4: \"failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: \\\"volumesnapshotclass.snapshot.storage.k8s.io \\\\\\\"csi-rbdplugin-snapclass\\\\\\\" not found\\\"\""
I1029 10:17:55.681554       1 event.go:281] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4", UID:"ffc95922-3b31-494e-a40a-abc11f97c648", APIVersion:"snapshot.storage.k8s.io/v1beta1", ResourceVersion:"1565", FieldPath:""}): type: 'Warning' reason: 'SnapshotDeleteError' Failed to get snapshot class or credentials
E1029 10:19:55.680458       1 snapshot_controller.go:481] failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"csi-rbdplugin-snapclass\" not found"
E1029 10:19:55.680653       1 snapshot_controller.go:224] getCSISnapshotInput failed to getClassFromVolumeSnapshot failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"csi-rbdplugin-snapclass\" not found"
E1029 10:19:55.680896       1 goroutinemap.go:150] Operation for "delete-snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4" failed. No retries permitted until 2020-10-29 10:21:57.68076332 +0000 UTC m=+712.638291701 (durationBeforeRetry 2m2s). Error: "failed to get input parameters to delete snapshot for content snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4: \"failed to retrieve snapshot class csi-rbdplugin-snapclass from the informer: \\\"volumesnapshotclass.snapshot.storage.k8s.io \\\\\\\"csi-rbdplugin-snapclass\\\\\\\" not found\\\"\""
I1029 10:19:55.687959       1 event.go:281] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"snapcontent-f4e2764a-7f9f-4619-a89e-9d7d1318bcd4", UID:"ffc95922-3b31-494e-a40a-abc11f97c648", APIVersion:"snapshot.storage.k8s.io/v1beta1", ResourceVersion:"1565", FieldPath:""}): type: 'Warning' reason: 'SnapshotDeleteError' Failed to get snapshot class or credentials
```

Comment 5 Neha Berry 2020-11-06 09:24:53 UTC
Hi Chris,

Does this have anything to do with the OCS-based CSI containers?

Anyway, sorry, I was working on something else. I can try to reproduce with the GA'd OCP 4.6 build by Monday (Nov 9) and provide you with the cluster.

Comment 9 Neha Berry 2020-11-06 16:35:23 UTC
Created attachment 1727165 [details]
Finalizer-null-approach-to-delete-the-leftovers

(In reply to Jan Safranek from comment #2)
> Yes, removing the finalizer with oc patch is the best workaround.

Yes, the finalizer approach worked.

$ oc patch -n default volumesnapshot/test-cephfs-snapshot --type=merge -p '{"metadata": {"finalizers":null}}'
volumesnapshot.snapshot.storage.k8s.io/test-cephfs-snapshot patched

$ oc patch -n default volumesnapshot/test-rbd-snapshot --type=merge -p '{"metadata": {"finalizers":null}}'; date --utc
volumesnapshot.snapshot.storage.k8s.io/test-rbd-snapshot patched
Fri Nov  6 16:31:41 UTC 2020

$ oc get volumesnapshot -A
No resources found


$ oc get volumesnapshotcontent 
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER                                  VOLUMESNAPSHOTCLASS                         VOLUMESNAPSHOT         AGE
snapcontent-45ed1739-4e66-47ae-a725-9f754c8fc418   true         5368709120    Delete           openshift-storage.cephfs.csi.ceph.com   ocs-storagecluster-cephfsplugin-snapclass   test-cephfs-snapshot   85m
snapcontent-730a0fc5-18b5-441e-8b6f-2c448c23300b   true         10737418240   Delete           openshift-storage.rbd.csi.ceph.com      ocs-storagecluster-rbdplugin-snapclass      test-rbd-snapshot      84m
[nberry@localhost bug-repro-1893739]$ oc patch -n default volumesnapshotcontent/snapcontent-45ed1739-4e66-47ae-a725-9f754c8fc418 --type=merge -p '{"metadata": {"finalizers":null}}'; date --utc
volumesnapshotcontent.snapshot.storage.k8s.io/snapcontent-45ed1739-4e66-47ae-a725-9f754c8fc418 patched
Fri Nov  6 16:32:29 UTC 2020
[nberry@localhost bug-repro-1893739]$ oc patch -n default volumesnapshotcontent/snapcontent-730a0fc5-18b5-441e-8b6f-2c448c23300b --type=merge -p '{"metadata": {"finalizers":null}}'; date --utc
volumesnapshotcontent.snapshot.storage.k8s.io/snapcontent-730a0fc5-18b5-441e-8b6f-2c448c23300b patched
Fri Nov  6 16:32:49 UTC 2020
[nberry@localhost bug-repro-1893739]$ oc get volumesnapshotcontent 
No resources found


Attached the controller logs for reference
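
The `--type=merge` patches above rely on JSON merge patch semantics, where a `null` value in the patch removes the corresponding key from the object. A minimal Go sketch of that rule follows. It only illustrates the merge step, not what the API server does around it; `applyMergePatch` is a hypothetical helper written for this illustration, and the finalizer string in the sample object is a placeholder, not a real finalizer name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// applyMergePatch is a hypothetical helper that applies a JSON merge patch
// (in the style of RFC 7386) to target and returns the result.
func applyMergePatch(target, patch map[string]any) map[string]any {
	if target == nil {
		target = map[string]any{}
	}
	for key, value := range patch {
		if value == nil {
			// A null in the patch deletes the key from the target.
			delete(target, key)
			continue
		}
		if sub, ok := value.(map[string]any); ok {
			// Nested objects are merged recursively.
			existing, _ := target[key].(map[string]any)
			target[key] = applyMergePatch(existing, sub)
			continue
		}
		// Any other value replaces what was there.
		target[key] = value
	}
	return target
}

func main() {
	var object, patch map[string]any
	// Placeholder object; the finalizer name is made up for the example.
	objectJSON := `{"metadata":{"name":"test-rbd-snapshot","finalizers":["example.com/placeholder-finalizer"]}}`
	patchJSON := `{"metadata": {"finalizers":null}}`
	if err := json.Unmarshal([]byte(objectJSON), &object); err != nil {
		panic(err)
	}
	if err := json.Unmarshal([]byte(patchJSON), &patch); err != nil {
		panic(err)
	}
	out, err := json.Marshal(applyMergePatch(object, patch))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // prints {"metadata":{"name":"test-rbd-snapshot"}}
}
```

With the finalizers gone, nothing holds the object back once its deletion timestamp is set, which is why the snapshots disappear right after the patch. The backend snapshot is not touched by this workaround.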

Comment 11 Christian Huffman 2020-11-06 19:04:59 UTC
This issue pertains to deleting the VolumeSnapshotClass when the driver requires credentials. In this case, the driver fails to remove the backend snapshot, which prevents the VolumeSnapshotContent from being deleted, which in turn prevents the VolumeSnapshot from being deleted.

Note that even if this is resolved, we still won't be able to delete a VolumeSnapshot if the secret itself has been deleted, as the driver requires the secret to exist to proceed.

At this point I think we can fix the issue with the VolumeSnapshotClass being deleted and attempt to include the description in the VolumeSnapshotContent status message.

The relevant sections are below:

- Here we try to get the class and return nil credentials if it's not found - https://github.com/kubernetes-csi/external-snapshotter/blob/23b415b6aaa7e0eba402bc3984ae2b726b101a80/pkg/sidecar-controller/snapshot_controller.go#L184-L189
- And then we pass these nil credentials into the deletion request - https://github.com/kubernetes-csi/external-snapshotter/blob/23b415b6aaa7e0eba402bc3984ae2b726b101a80/pkg/sidecar-controller/snapshot_controller.go#L341-L350

Since we have nil credentials, the deletion request fails, and then we see this error logged from line 350.

Comment 22 errata-xmlrpc 2021-02-24 15:29:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

