Description of problem:
We're seeing frequent and consistent test failures that seem to be related to removing/evicting pods. It is rare for more than one of these tests to fail in a single run, but at least one of them fails consistently on a few platforms.

The failures are most consistent on openstack-serial and aws-serial:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.7&grid=old
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-blocking#release-openshift-origin-installer-e2e-aws-serial-4.7&grid=old
but are also showing up on other platforms.

The frequently failing tests, broken down by platform:
https://sippy.ci.openshift.org/testdetails?release=4.7&test=k8s.io]%20[sig-node]%20NoExecuteTaintManager%20Single%20Pod%20[Serial]%20evicts%20pods%20from%20tainted%20nodes&test=[k8s.io]%20[sig-node]%20NoExecuteTaintManager%20Single%20Pod%20[Serial]%20eventually%20evict%20pod%20with%20finite%20tolerations%20from%20tainted%20nodes&test=[k8s.io]%20[sig-node]%20NoExecuteTaintManager%20Multiple%20Pods%20[Serial]%20only%20evicts%20pods%20without%20tolerations%20from%20tainted%20nodes&test=[sig-api-machinery]%20Namespaces%20[Serial]%20should%20ensure%20that%20all%20pods%20are%20removed%20when%20a%20namespace%20is%20deleted

Additional info:
I have not been able to find any more useful information about the possible cause of this issue.
Specific failing job link: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1348509976322117632
*** This bug has been marked as a duplicate of bug 1908880 ***
Several of the taint test failures on the openstack platform seem to show errors about mounting volumes, for example:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1347785226629156864
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1348509976322117632

"[]VolumeDevice{},StartupProbe:nil,} start failed in pod taint-eviction-1_e2e-taint-single-pod-7551(76ddbe1e-163b-4be3-9473-371500d53b85): CreateContainerConfigError: cannot find volume "default-token-9xw5g" to mount into container "pause""

In other taint failures (like this one: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1345792152164110336) it is less clear why the pod is not evicted. Something also seems to be amiss with the container in the pod, which seems to be in a "paused" state, and the test fails waiting for the pod to be deleted/evicted.

Jan 3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Created: Created container pause
Jan 3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Started: Started container pause
Jan 3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {taint-controller } TaintManagerEviction: Marking for deletion Pod e2e-taint-single-pod-2191/taint-eviction-3
Jan 3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Killing: Stopping container pause
Jan 3 19:15:41.480: INFO: POD NODE PHASE GRACE CONDITIONS
Jan 3 19:15:41.480: INFO: taint-eviction-3 zj8bt86c-a9c3a-92s9q-worker-0-6dqq5 Running 30s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:13:31 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:14:42 +0000 UTC ContainersNotReady containers with unready status: [pause]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:14:42 +0000 UTC ContainersNotReady containers with unready status: [pause]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:13:31 +0000 UTC }]
Jan 3 19:15:41.480: INFO:
Jan 3 19:15:41.480: INFO: taint-eviction-3[e2e-taint-single-pod-2191].container[pause]=The container could not be located when the pod was deleted. The container used to be Running
Jan 3 19:15:41.544: INFO: skipping dumping cluster info - cluster too large
Jan 3 19:15:41.544: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-taint-single-pod-2191" for this suite.
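
For anyone comparing several of these job logs, the ordering of the pod events quoted above can be pulled out with a small script. The following is only a rough sketch: the regular expression is an assumption based on the four "event for" lines in this comment, and the names `parse_events` and `seconds_between` are hypothetical helpers, not part of the e2e framework.

```python
import re
from datetime import datetime

# Pattern assumed from the "event for" lines quoted in this comment.
# Other e2e output formats (or non-UTC offsets) are not handled.
EVENT_RE = re.compile(
    r"At (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 UTC - "
    r"event for (?P<obj>[^:]+): \{(?P<source>[^}]*)\} "
    r"(?P<reason>\w+): (?P<message>.*)"
)

# The four event lines from the log excerpt in this comment.
SAMPLE_LOG = """\
Jan 3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Created: Created container pause
Jan 3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Started: Started container pause
Jan 3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {taint-controller } TaintManagerEviction: Marking for deletion Pod e2e-taint-single-pod-2191/taint-eviction-3
Jan 3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Killing: Stopping container pause
"""


def parse_events(log_text):
    """Return a time-ordered list of event dicts found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not an event line (pod table, STEP lines, etc.)
        source = match.group("source").split()
        events.append({
            "time": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
            "object": match.group("obj"),
            "component": source[0] if source else "",
            "reason": match.group("reason"),
            "message": match.group("message").strip(),
        })
    # sorted() is stable, so events sharing a timestamp keep their log order.
    return sorted(events, key=lambda e: e["time"])


def seconds_between(events, first_reason, second_reason):
    """Seconds from the first event with first_reason to the first with second_reason."""
    first = next(e for e in events if e["reason"] == first_reason)
    second = next(e for e in events if e["reason"] == second_reason)
    return (second["time"] - first["time"]).total_seconds()


if __name__ == "__main__":
    for event in parse_events(SAMPLE_LOG):
        print(event["time"], event["component"], event["reason"], event["message"])
```

Running this against the full build log of each failing job (instead of `SAMPLE_LOG`) would show whether the "Killing" event from the kubelet always follows the TaintManagerEviction event, which might help separate the volume-mount failures from the ones where the container is stopped but the pod is never removed.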