After an e2e run, some pods are stuck in Terminating, and it is not clear why:

$ oc get pods -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/networks-status: ""
      openshift.io/scc: anyuid
    creationTimestamp: "2020-03-11T06:30:12Z"
    deletionGracePeriodSeconds: 30
    deletionTimestamp: "2020-03-14T04:05:45Z"
    generateName: pvc-volume-tester-
    name: pvc-volume-tester-dkp9s
    namespace: e2e-csi-mock-volumes-7010
    resourceVersion: "2073172"
    selfLink: /api/v1/namespaces/e2e-csi-mock-volumes-7010/pods/pvc-volume-tester-dkp9s
    uid: 6aefd56f-c478-4295-b8c8-779113edc0d2
  spec:
    containers:
    - image: k8s.gcr.io/pause:3.1
      imagePullPolicy: IfNotPresent
      name: volume-tester
      resources: {}
      securityContext:
        capabilities:
          drop:
          - MKNOD
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /mnt/test
        name: my-volume
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-gd6q6
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: default-dockercfg-fzcxt
    nodeName: ip-10-0-132-57.us-east-2.compute.internal
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext:
      seLinuxOptions:
        level: s0:c122,c49
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: pvc-zmb4r
    - name: default-token-gd6q6
      secret:
        defaultMode: 420
        secretName: default-token-gd6q6
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-03-11T06:30:12Z"
      reason: PodCompleted
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-03-14T04:05:16Z"
      reason: PodCompleted
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-03-14T04:05:16Z"
      reason: PodCompleted
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-03-11T06:30:12Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
      image: k8s.gcr.io/pause:3.1
      imageID: k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610
      lastState: {}
      name: volume-tester
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
          exitCode: 0
          finishedAt: "2020-03-14T04:05:15Z"
          reason: Completed
          startedAt: "2020-03-11T06:30:23Z"
    hostIP: 10.0.132.57
    phase: Succeeded
    podIP: 10.129.3.125
    podIPs:
    - ip: 10.129.3.125
    qosClass: BestEffort
    startTime: "2020-03-11T06:30:12Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc describe pod pvc-volume-tester-dkp9s
Name:                      pvc-volume-tester-dkp9s
Namespace:                 e2e-csi-mock-volumes-7010
Priority:                  0
Node:                      ip-10-0-132-57.us-east-2.compute.internal/10.0.132.57
Start Time:                Wed, 11 Mar 2020 02:30:12 -0400
Labels:                    <none>
Annotations:               k8s.v1.cni.cncf.io/networks-status:
                           openshift.io/scc: anyuid
Status:                    Terminating (lasts 3d10h)
Termination Grace Period:  30s
IP:                        10.129.3.125
IPs:
  IP:  10.129.3.125
Containers:
  volume-tester:
    Container ID:   cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
    Image:          k8s.gcr.io/pause:3.1
    Image ID:       k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 11 Mar 2020 02:30:23 -0400
      Finished:     Sat, 14 Mar 2020 00:05:15 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt/test from my-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gd6q6 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  my-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-zmb4r
    ReadOnly:   false
  default-token-gd6q6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gd6q6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.4.0-0.nightly-2020-03-15-215151
Kubernetes Version: v1.17.1

Putting this on the node team for now, since I'd expect the node to clean up this pod, but if it's an issue with the storage test itself, feel free to reassign. As long as it is not a test-config/behavior-specific bug, this is an urgent bug for 4.4, as we must ensure pods actually terminate.
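For anyone triaging a similar report, a few read-only checks can help narrow down where the pod is being held up. The namespace, pod, and node names below are taken from the output above; whether these checks turn up anything useful in this particular case is an assumption, not a finding:

# Any finalizers still set on the pod?
$ oc get pod pvc-volume-tester-dkp9s -n e2e-csi-mock-volumes-7010 -o jsonpath='{.metadata.finalizers}'

# Is the CSI volume still attached to the node?
$ oc get volumeattachments | grep ip-10-0-132-57

# What is the kubelet on that node doing with the pod?
$ oc adm node-logs ip-10-0-132-57.us-east-2.compute.internal -u kubelet | grep pvc-volume-tester-dkp9s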
Known upstream issue https://github.com/kubernetes/kubernetes/issues/51835
This is an urgent issue, moving back to 4.4.0. It may not be deferred without architect sign-off. We should look at what we can do to improve debugging; for example, we could make e2e runs fail if there are still terminating pods.
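As a rough sketch of that last idea (not what the suite does today, and assuming oc and jq are available on the path), a post-run step could fail whenever any pod still has a deletionTimestamp set, which is what "Terminating" means at the API level:

# List pods whose deletion has been requested but which are still present
stuck=$(oc get pods --all-namespaces -o json \
  | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | "\(.metadata.namespace)/\(.metadata.name)"')

# Fail the run if anything is left over
if [ -n "$stuck" ]; then
  echo "pods still terminating:"
  echo "$stuck"
  exit 1
fi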
@Ben Parees, may I verify this bug using the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1808123#c11, or is there another way I should verify it?
Yes, I think those steps are OK, but it might be best to confirm with Hemant, as he was fixing up the e2e test.
Verified on version 4.5.0-0.nightly-2020-04-09-231931.

Ran the commands below several times:

# openshift-tests run openshift/conformance/parallel --dry-run | grep Feature:VolumeSnapshotDataSource > tests
# openshift-tests run openshift/conformance/parallel -f tests

During the test we can see:

# oc get volumesnapshots --all-namespaces
NAMESPACE              NAME             READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                                                SNAPSHOTCONTENT                                    CREATIONTIME   AGE
e2e-provisioning-277   snapshot-h6rff   false        pvc-wjqxr                                         e2e-provisioning-277-csi-hostpath-e2e-provisioning-277-vsc   snapcontent-fc906e79-ffb8-4d8b-8fff-b9dfd6d350b1                  14s
e2e-provisioning-718   snapshot-g5xjd   true         pvc-lr9bl                           1Mi           e2e-provisioning-718-csi-hostpath-e2e-provisioning-718-vsc   snapcontent-770b62b5-c31e-462d-a021-b62ae3359015   20s            20s

But when the test is finished:

# oc get volumesnapshots --all-namespaces
No resources found
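While the suite runs, a simple poll (not part of the documented verification steps, just a convenience) makes it easy to spot leftover snapshots or terminating pods as they appear:

# Refresh every 10 seconds during the run
$ watch -n 10 'oc get volumesnapshots --all-namespaces; oc get pods --all-namespaces | grep Terminating'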
VERIFIED, no longer needs info.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409