Bug 1808123 - Namespaces stuck in Terminating with volumesnapshot resources that can't be deleted
Summary: Namespaces stuck in Terminating with volumesnapshot resources that can't be deleted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Christian Huffman
QA Contact: Chao Yang
URL:
Whiteboard:
Depends On:
Blocks: 1815500
 
Reported: 2020-02-27 21:09 UTC by Mike Fiedler
Modified: 2020-07-13 17:22 UTC (History)
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If a VolumeSnapshotClass was removed before its associated VolumeSnapshots, it was no longer possible to delete those resources.
Consequence: VolumeSnapshots lingered on the cluster.
Fix: The VolumeSnapshot deletion logic now checks whether the associated VolumeSnapshotClass has already been deleted.
Result: VolumeSnapshots can now be deleted successfully even when no corresponding VolumeSnapshotClass exists.
Clone Of:
: 1815500 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:22:21 UTC
Target Upstream Version:


Attachments (Terms of Use)
csi-snapshot pod logs (12.84 KB, application/gzip)
2020-02-27 21:09 UTC, Mike Fiedler


Links
System ID Priority Status Summary Last Updated
Github openshift csi-external-snapshotter pull 17 None closed Bug 1808123: UPSTREAM 275: Allows VolumeSnapshot to be deleted if the class isn't found 2020-08-12 10:47:27 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:22:54 UTC

Description Mike Fiedler 2020-02-27 21:09:21 UTC
Created attachment 1666293 [details]
csi-snapshot pod logs

Description of problem:

While running openshift/conformance/parallel e2e tests on an AWS cluster, the cluster ended up with 2 namespaces stuck in Terminating state:

e2e-provisioning-8394                                                  Terminating                                                                                                                                                            
e2e-snapshotting-9935                                                  Terminating  

which never went away. Examining the contents of those namespaces, the only remaining items are volumesnapshots:

root@ip-172-31-64-58: ~ # oc get volumesnapshots --all-namespaces
NAMESPACE               NAME             AGE
e2e-provisioning-8394   snapshot-n47sz   21m
e2e-snapshotting-9935   snapshot-29v4b   17m

root@ip-172-31-64-58: ~ # oc get pv
NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                              STORAGECLASS                                                            REASON   AGE
local-pvml9nm   2Gi        RWO            Retain           Released   e2e-persistent-local-volumes-test-821/pvc-6scm7    local-volume-test-storageclass-e2e-persistent-local-volumes-test-821             13m
local-pvwq8bf   2Gi        RWO            Retain           Released   e2e-persistent-local-volumes-test-7626/pvc-b6zwf   local-volume-test-storageclass-e2e-persistent-local-volumes-test-7626            37m

root@ip-172-31-64-58: ~ # oc get pvc --all-namespaces
No resources found

The associated PVs can be deleted, but the volumesnapshots cannot. An oc delete command for one of the volumesnapshots hangs indefinitely.
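A delete that hangs like this usually means a finalizer on the object is never cleared. As a sketch (the namespace and snapshot name below are taken from the output above; this is a generic manual workaround for stuck objects, not the fix shipped for this bug), one could inspect and, with care, strip the finalizers:

```shell
# Show the finalizers that are blocking deletion of the stuck snapshot.
oc get volumesnapshot snapshot-n47sz -n e2e-provisioning-8394 \
  -o jsonpath='{.metadata.finalizers}'

# Manual workaround (destructive): clear the finalizers so the object can be
# garbage-collected. This bypasses the controller's cleanup of the backing
# VolumeSnapshotContent, so only use it when the controller can no longer act.
oc patch volumesnapshot snapshot-n47sz -n e2e-provisioning-8394 \
  --type=merge -p '{"metadata":{"finalizers":null}}'
```

Once the finalizers are gone, the namespace's Terminating state should resolve on its own.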

The snapshot-controller pod logs are full of the following messages in a repeating pattern. I will attach the controller and operator logs, as well as a full oc adm must-gather.

E0227 21:02:25.920146       1 snapshot_controller.go:1090] failed to retrieve snapshot class e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc\" not found"
E0227 21:02:25.920185       1 snapshot_controller.go:1090] failed to retrieve snapshot class e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc\" not found"
E0227 21:02:25.920216       1 snapshot_controller_base.go:330] checkAndUpdateSnapshotClass failed to getSnapshotClass failed to retrieve snapshot class e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc\" not found"
E0227 21:02:25.920230       1 snapshot_controller_base.go:330] checkAndUpdateSnapshotClass failed to getSnapshotClass failed to retrieve snapshot class e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc\" not found"
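The errors above suggest the VolumeSnapshotClass was deleted before its snapshots, which the controller then treats as a hard failure. A quick check (the class name below is copied from the log lines) would confirm that the referenced class is indeed gone:

```shell
# This should fail with a NotFound error, matching the controller logs.
oc get volumesnapshotclass e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc

# List whatever snapshot classes remain on the cluster for comparison.
oc get volumesnapshotclass
```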



Version-Release number of selected component (if applicable): 4.4.0-0.nightly-2020-02-26-104940

How reproducible: Always (2 times in a row, anyway)

Steps to Reproduce:
1.  Install standard 3 master, 3 worker cluster on AWS
2.  run openshift-tests run openshift/conformance/parallel (or run the specific snapshot tests - not sure how to do that)
3.  oc get projects at the end of the conformance run

Actual results:

e2e-provisioning-8394                                                  Terminating                                                                                                                                                            
e2e-snapshotting-9935                                                  Terminating  

Expected results:

All e2e projects removed

Will attach snapshot controller and operator logs and provide location of full oc adm must-gather

Comment 4 Mike Fiedler 2020-03-10 14:58:01 UTC
I ran today and got different dangling namespaces - maybe snapshots are a red herring?   

e2e-deployment-6833                                                    Terminating
e2e-deployment-7611                                                    Terminating

The first few times I repro'ed it, it was always the provisioning/snapshot namespaces. I'll keep trying as well.

Comment 7 Christian Huffman 2020-03-19 17:17:54 UTC
Upstream PR has been merged. https://github.com/openshift/csi-external-snapshotter/pull/17 has been submitted to cherry-pick this change.

Comment 11 Chao Yang 2020-03-31 08:59:13 UTC
Verification passed on 4.5.0-0.nightly-2020-03-29-224016.
Run below for several times:
openshift-tests run openshift/conformance/parallel --dry-run | grep Feature:VolumeSnapshotDataSource > tests
openshift-tests run openshift/conformance/parallel -f tests

During the test, we can see
oc get volumesnapshots --all-namespaces
NAMESPACE               NAME             READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                                                  SNAPSHOTCONTENT                                    CREATIONTIME   AGE
e2e-provisioning-9317   snapshot-n5p9t   true         pvc-x4xf6                           1Mi           e2e-provisioning-9317-csi-hostpath-e2e-provisioning-9317-vsc   snapcontent-a0422db9-de99-44fc-8f7f-5cf45efcbdea   22s            22s

But when the test is finished, 
oc get volumesnapshots --all-namespaces
No resources found

Comment 13 errata-xmlrpc 2020-07-13 17:22:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

