Created attachment 1666293
csi-snapshot pod logs

Description of problem:

While running the openshift/conformance/parallel e2e tests on an AWS cluster, the cluster ended up with 2 namespaces stuck in the Terminating state, and they never went away:

e2e-provisioning-8394   Terminating
e2e-snapshotting-9935   Terminating

The only objects left in those namespaces are volumesnapshots:

root@ip-172-31-64-58: ~ # oc get volumesnapshots --all-namespaces
NAMESPACE               NAME             AGE
e2e-provisioning-8394   snapshot-n47sz   21m
e2e-snapshotting-9935   snapshot-29v4b   17m

root@ip-172-31-64-58: ~ # oc get pv
NAME            CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                             STORAGECLASS                                                           REASON   AGE
local-pvml9nm   2Gi        RWO            Retain           Released   e2e-persistent-local-volumes-test-821/pvc-6scm7   local-volume-test-storageclass-e2e-persistent-local-volumes-test-821            13m
local-pvwq8bf   2Gi        RWO            Retain           Released   e2e-persistent-local-volumes-test-7626/pvc-b6zwf  local-volume-test-storageclass-e2e-persistent-local-volumes-test-7626           37m

root@ip-172-31-64-58: ~ # oc get pvc --all-namespaces
No resources found

The associated PVs can be deleted, but the volumesnapshots cannot: an oc delete for one of the volumesnapshots hangs forever. The snapshot-controller pod logs are full of the following messages, in a repeating pattern:
E0227 21:02:25.920146 1 snapshot_controller.go:1090] failed to retrieve snapshot class e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc\" not found"
E0227 21:02:25.920185 1 snapshot_controller.go:1090] failed to retrieve snapshot class e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc\" not found"
E0227 21:02:25.920216 1 snapshot_controller_base.go:330] checkAndUpdateSnapshotClass failed to getSnapshotClass failed to retrieve snapshot class e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-provisioning-8394-csi-hostpath-e2e-provisioning-8394-vsc\" not found"
E0227 21:02:25.920230 1 snapshot_controller_base.go:330] checkAndUpdateSnapshotClass failed to getSnapshotClass failed to retrieve snapshot class e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc from the informer: "volumesnapshotclass.snapshot.storage.k8s.io \"e2e-snapshotting-9935-csi-hostpath-e2e-snapshotting-9935-vsc\" not found"

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-26-104940

How reproducible:
Always (twice in a row, anyway)

Steps to Reproduce:
1. Install a standard 3-master, 3-worker cluster on AWS.
2. Run openshift-tests run openshift/conformance/parallel (or run only the snapshot tests; I'm not sure how to do that).
3. Run oc get projects at the end of the conformance run.

Actual results:
e2e-provisioning-8394   Terminating
e2e-snapshotting-9935   Terminating

Expected results:
All e2e projects are removed.

I will attach the snapshot-controller and operator logs, and provide the location of a full oc adm must-gather.
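A minimal diagnostic sketch for step 3, assuming a POSIX shell and a logged-in oc session: it lists e2e-* namespaces stuck in Terminating and, for each one, the volumesnapshots still inside it, the snapshot class each references, and any finalizers they carry. A delete that never returns usually means the object still has finalizers that nothing removes. The function names terminating_e2e_namespaces and show_stuck_snapshots are hypothetical, not part of any OpenShift tooling.

```shell
#!/bin/sh
# Hypothetical helper: read "oc get namespaces --no-headers" output on stdin
# (columns: NAME STATUS AGE) and print the e2e-* namespaces stuck in Terminating.
terminating_e2e_namespaces() {
  awk '$2 == "Terminating" && $1 ~ /^e2e-/ { print $1 }'
}

# Hypothetical helper: for each stuck namespace, print every remaining
# volumesnapshot with its snapshot class name and its finalizers (if any).
# Requires a logged-in oc session.
show_stuck_snapshots() {
  oc get namespaces --no-headers | terminating_e2e_namespaces |
    while read -r ns; do
      echo "== $ns"
      oc get volumesnapshots -n "$ns" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumeSnapshotClassName}{"\t"}{.metadata.finalizers}{"\n"}{end}'
    done
}
```

Comparing the class names printed here with oc get volumesnapshotclass shows whether the referenced class still exists, which is what the controller log above complains about.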
I ran it again today and got different dangling namespaces, so maybe the snapshots are a red herring?

e2e-deployment-6833   Terminating
e2e-deployment-7611   Terminating

The first few times I reproduced it, it was always the provisioning/snapshotting namespaces. I'll keep trying as well.
The upstream PR has been merged, and https://github.com/openshift/csi-external-snapshotter/pull/17 has been submitted to cherry-pick the change.
Verification passed on 4.5.0-0.nightly-2020-03-29-224016.

Ran the following several times:

openshift-tests run openshift/conformance/parallel --dry-run | grep Feature:VolumeSnapshotDataSource > tests
openshift-tests run openshift/conformance/parallel -f tests

During the test, the snapshot is visible:

oc get volumesnapshots --all-namespaces
NAMESPACE               NAME             READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                                                  SNAPSHOTCONTENT                                    CREATIONTIME   AGE
e2e-provisioning-9317   snapshot-n5p9t   true         pvc-x4xf6                           1Mi           e2e-provisioning-9317-csi-hostpath-e2e-provisioning-9317-vsc   snapcontent-a0422db9-de99-44fc-8f7f-5cf45efcbdea   22s            22s

After the test finishes, none are left:

oc get volumesnapshots --all-namespaces
No resources found
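The verification steps above could be scripted roughly as follows. This is a sketch assuming a POSIX shell, the openshift-tests binary on PATH and a logged-in oc session; verify_snapshot_cleanup, count_rows and the default of 3 runs are hypothetical choices, not part of openshift-tests. It asks oc for --no-headers output and discards stderr, so an empty listing counts as zero rows instead of relying on the "No resources found" text.

```shell
#!/bin/sh
# Hypothetical helper: count non-empty lines on stdin.
# grep -c prints 0 but exits non-zero when nothing matches, so ignore its status.
count_rows() {
  grep -c . || true
}

# Hypothetical verification loop: select only the VolumeSnapshotDataSource
# tests, run them N times (default 3), and fail as soon as any volumesnapshot
# is left behind after a run.
verify_snapshot_cleanup() {
  runs=${1:-3}
  openshift-tests run openshift/conformance/parallel --dry-run |
    grep Feature:VolumeSnapshotDataSource > tests
  i=1
  while [ "$i" -le "$runs" ]; do
    openshift-tests run openshift/conformance/parallel -f tests
    left=$(oc get volumesnapshots --all-namespaces --no-headers 2>/dev/null | count_rows)
    if [ "$left" -ne 0 ]; then
      echo "run $i: $left volumesnapshot(s) left behind" >&2
      return 1
    fi
    echo "run $i: no volumesnapshots left"
    i=$((i + 1))
  done
}
```

For example, verify_snapshot_cleanup 5 would repeat the filtered test run five times and stop at the first run that leaves a snapshot behind.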
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409