Bug 1808123

Summary: Namespaces stuck in Terminating with volumesnapshot resources that can't be deleted
Product: OpenShift Container Platform Reporter: Mike Fiedler <mifiedle>
Component: Storage Assignee: Christian Huffman <chuffman>
Status: CLOSED ERRATA QA Contact: Chao Yang <chaoyang>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.4 CC: aos-bugs, chaoyang, jsafrane, lxia
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: If a VolumeSnapshotClass was removed before its associated VolumeSnapshots, those VolumeSnapshots could no longer be deleted.
Consequence: VolumeSnapshots lingered on the cluster.
Fix: The VolumeSnapshot deletion logic now checks whether the associated VolumeSnapshotClass has already been deleted.
Result: VolumeSnapshots can now be deleted successfully even if no corresponding VolumeSnapshotClass exists.
Story Points: ---
Clone Of:
: 1815500 (view as bug list) Environment:
Last Closed: 2020-07-13 17:22:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1815500    
Attachments:
Description Flags
csi-snapshot pod logs none

Description Mike Fiedler 2020-02-27 21:09:21 UTC
Created attachment 1666293
csi-snapshot pod logs

Description of problem:

While running openshift/conformance/parallel e2e tests on an AWS cluster, the cluster ended up with 2 namespaces stuck in Terminating state:

e2e-provisioning-8394                                                  Terminating                                                                                                                                                            
e2e-snapshotting-9935                                                  Terminating  

which never went away. The only items left in those namespaces are volumesnapshots:

root@ip-172-31-64-58: ~ # oc get volumesnapshots --all-namespaces
NAMESPACE               NAME             AGE
e2e-provisioning-8394   snapshot-n47sz   21m
e2e-snapshotting-9935   snapshot-29v4b   17m

root@ip-172-31-64-58: ~ # oc get pv
NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                              STORAGECLASS                                                            REASON   AGE
local-pvml9nm   2Gi        RWO            Retain           Released   e2e-persistent-local-volumes-test-821/pvc-6scm7    local-volume-test-storageclass-e2e-persistent-local-volumes-test-821             13m
local-pvwq8bf   2Gi        RWO            Retain           Released   e2e-persistent-local-volumes-test-7626/pvc-b6zwf   local-volume-test-storageclass-e2e-persistent-local-volumes-test-7626            37m

root@ip-172-31-64-58: ~ # oc get pvc --all-namespaces
No resources found

The associated PVs can be deleted, but the volumesnapshots cannot. An oc delete command for one of the volumesnapshots hangs forever.
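
A delete that hangs generally means the object already has a deletionTimestamp but still carries finalizers that nothing has removed. This report does not show the finalizers on these snapshots, so the following is only a minimal sketch for checking that, assuming the output of `oc get volumesnapshots --all-namespaces -o json` is available as a string. The function name, the sample object names, and the finalizer string are illustrative placeholders, not values from this cluster.

```python
import json


def stuck_snapshots(list_json):
    """Return (namespace, name, finalizers) for every object that is marked
    for deletion (has a deletionTimestamp) but still carries finalizers."""
    stuck = []
    for item in json.loads(list_json).get("items", []):
        meta = item.get("metadata", {})
        finalizers = meta.get("finalizers") or []
        if meta.get("deletionTimestamp") and finalizers:
            stuck.append((meta.get("namespace"), meta.get("name"), finalizers))
    return stuck


# Hand-written sample shaped like a Kubernetes List object (placeholder values).
sample = json.dumps({
    "items": [
        {"metadata": {"namespace": "example-ns-1", "name": "snapshot-a",
                      "deletionTimestamp": "2020-01-01T00:00:00Z",
                      "finalizers": ["example.com/placeholder-finalizer"]}},
        {"metadata": {"namespace": "example-ns-2", "name": "snapshot-b"}},
    ]
})

for namespace, name, finalizers in stuck_snapshots(sample):
    print(namespace, name, finalizers)
```

Only the first sample object is reported, because the second has neither a deletionTimestamp nor finalizers.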

The snapshot-controller pod logs are full of the following messages in a repeating pattern. I will include the controller and operator logs, as well as a full oc adm must-gather.

E0227 21:02:25.920146       1 snapshot_controller.go:1090] failed to retrieve snapshot class e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc\" not found"
E0227 21:02:25.920185       1 snapshot_controller.go:1090] failed to retrieve snapshot class e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc\" not found"
E0227 21:02:25.920216       1 snapshot_controller_base.go:330] checkAndUpdateSnapshotClass failed to getSnapshotClass failed to retrieve snapshot class e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc\" not found"
E0227 21:02:25.920230       1 snapshot_controller_base.go:330] checkAndUpdateSnapshotClass failed to getSnapshotClass failed to retrieve snapshot class e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc\" not found"
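
Because the same errors repeat, it can be easier to reduce the log to the distinct VolumeSnapshotClass names it complains about. The following is a minimal sketch that assumes the message wording shown above; the function name is hypothetical, and the sample is three of the lines quoted in this report.

```python
import re

# Matches the wording of the snapshot-controller error lines quoted above.
MISSING_CLASS = re.compile(
    r"failed to retrieve snapshot class (\S+) from the informer"
)


def missing_snapshot_classes(log_text):
    """Return the distinct snapshot class names reported as not found,
    in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        match = MISSING_CLASS.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample_log = r'''
E0227 21:02:25.920146       1 snapshot_controller.go:1090] failed to retrieve snapshot class e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc\" not found"
E0227 21:02:25.920185       1 snapshot_controller.go:1090] failed to retrieve snapshot class e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc\" not found"
E0227 21:02:25.920216       1 snapshot_controller_base.go:330] checkAndUpdateSnapshotClass failed to getSnapshotClass failed to retrieve snapshot class e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc\" not found"
'''

for name in missing_snapshot_classes(sample_log):
    print(name)
```

For the lines above this yields two class names, one per stuck namespace, which matches the two volumesnapshots that could not be deleted.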



Version-Release number of selected component (if applicable): 4.4.0-0.nightly-2020-02-26-104940

How reproducible: Always (2 times in a row, anyway)

Steps to Reproduce:
1.  Install a standard 3-master, 3-worker cluster on AWS.
2.  Run openshift-tests run openshift/conformance/parallel (or run only the snapshot tests; I'm not sure how to do that).
3.  Run oc get projects at the end of the conformance run.

Actual results:

e2e-provisioning-8394                                                  Terminating                                                                                                                                                            
e2e-snapshotting-9935                                                  Terminating  

Expected results:

All e2e projects removed

I will attach the snapshot controller and operator logs and provide the location of the full oc adm must-gather.

Comment 4 Mike Fiedler 2020-03-10 14:58:01 UTC
I ran the tests again today and got different dangling namespaces. Maybe snapshots are a red herring?

e2e-deployment-6833                                                    Terminating
e2e-deployment-7611                                                    Terminating

The first few times I reproduced it, it was always the provisioning/snapshotting namespaces. I'll keep trying.

Comment 7 Christian Huffman 2020-03-19 17:17:54 UTC
Upstream PR has been merged. https://github.com/openshift/csi-external-snapshotter/pull/17 has been submitted to cherry-pick this change.
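
The change itself is in the Go snapshot controller and is not reproduced in this report. As a rough illustration of the behavior described in the Doc Text, here is a simplified Python model of the decision involved. Every name in it is hypothetical, and it is not the upstream code: it only shows a missing VolumeSnapshotClass no longer blocking a snapshot that is already being deleted.

```python
class ClassNotFound(Exception):
    """Raised when a snapshot class name is not present (hypothetical)."""


def get_snapshot_class(classes, name):
    """Look up a snapshot class by name in a dict standing in for the informer."""
    try:
        return classes[name]
    except KeyError:
        raise ClassNotFound(name)


def sync_snapshot(snapshot, classes):
    """Return the action a controller would take for one snapshot.

    snapshot is a dict with a 'className' and an optional 'deletionTimestamp'.
    """
    deleting = snapshot.get("deletionTimestamp") is not None
    try:
        get_snapshot_class(classes, snapshot["className"])
    except ClassNotFound:
        if deleting:
            # The class is already gone, but the snapshot is being removed
            # anyway, so deletion is allowed to continue.
            return "proceed-with-deletion"
        # A live snapshot still needs its class; report and retry.
        return "retry-later"
    return "proceed-with-deletion" if deleting else "sync"


print(sync_snapshot({"className": "gone-vsc",
                     "deletionTimestamp": "2020-01-01T00:00:00Z"}, {}))
```

In this model, the behavior reported in this bug corresponds to returning "retry-later" for every snapshot whose class is missing, including ones marked for deletion, which is why the errors repeated and the delete never completed.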

Comment 11 Chao Yang 2020-03-31 08:59:13 UTC
Verification passed on 4.5.0-0.nightly-2020-03-29-224016.
Ran the following several times:
openshift-tests run openshift/conformance/parallel --dry-run | grep Feature:VolumeSnapshotDataSource > tests
openshift-tests run openshift/conformance/parallel -f tests

During the test, the snapshot is visible:
oc get volumesnapshots --all-namespaces
NAMESPACE               NAME             READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                                                  SNAPSHOTCONTENT                                    CREATIONTIME   AGE
e2e-provisioning-9317   snapshot-n5p9t   true         pvc-x4xf6                           1Mi           e2e-provisioning-9317-csi-hostpath-e2e-provisioning-9317-vsc   snapcontent-a0422db9-de99-44fc-8f7f-5cf45efcbdea   22s            22s

But once the test has finished, no snapshots remain:
oc get volumesnapshots --all-namespaces
No resources found

Comment 13 errata-xmlrpc 2020-07-13 17:22:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409