1915494 – Frequent taint-related test failures

Summary: Frequent taint-related test failures

Keywords:
Status:	CLOSED DUPLICATE of bug 1908880
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Elana Hashman
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:	TechnicalReleaseBlocker
Depends On:
Blocks:

Reported:	2021-01-12 18:05 UTC by Fabian von Feilitzsch
Modified:	2021-01-12 23:03 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-01-12 18:31:01 UTC
Target Upstream Version:
Embargoed:

Description Fabian von Feilitzsch 2021-01-12 18:05:42 UTC

Description of problem:
We're seeing frequent and consistent test failures that seem to be related to removing/evicting pods. It is rare for more than one of these tests to fail in a single run, but at least one of them fails consistently on a few platforms.

The failures are most consistent on openstack-serial and aws-serial:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.7&grid=old
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-blocking#release-openshift-origin-installer-e2e-aws-serial-4.7&grid=old

but are also showing up on other platforms.

The frequently failing tests, broken down by platform:
https://sippy.ci.openshift.org/testdetails?release=4.7&test=[k8s.io]%20[sig-node]%20NoExecuteTaintManager%20Single%20Pod%20[Serial]%20evicts%20pods%20from%20tainted%20nodes&test=[k8s.io]%20[sig-node]%20NoExecuteTaintManager%20Single%20Pod%20[Serial]%20eventually%20evict%20pod%20with%20finite%20tolerations%20from%20tainted%20nodes&test=[k8s.io]%20[sig-node]%20NoExecuteTaintManager%20Multiple%20Pods%20[Serial]%20only%20evicts%20pods%20without%20tolerations%20from%20tainted%20nodes&test=[sig-api-machinery]%20Namespaces%20[Serial]%20should%20ensure%20that%20all%20pods%20are%20removed%20when%20a%20namespace%20is%20deleted

Additional info:

I have not been able to find any more useful information about the possible cause of this issue.

Comment 1 Fabian von Feilitzsch 2021-01-12 18:13:53 UTC

Specific failing job link: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1348509976322117632

Comment 2 Elana Hashman 2021-01-12 18:31:01 UTC


*** This bug has been marked as a duplicate of bug 1908880 ***

Comment 3 Adolfo Duarte 2021-01-12 23:03:32 UTC

Several of the taint test failures on the OpenStack platform seem to show errors about mounting volumes, for example:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1347785226629156864
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1348509976322117632

"[]VolumeDevice{},StartupProbe:nil,} start failed in pod taint-eviction-1_e2e-taint-single-pod-7551(76ddbe1e-163b-4be3-9473-371500d53b85): CreateContainerConfigError: cannot find volume "default-token-9xw5g" to mount into container "pause""

In other taint failures (like this one: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1345792152164110336) it is not clear why the pod is not evicted. Something also seems to be amiss with the container in the pod: it seems to be in a "paused" state, and the test fails while waiting for the pod to be deleted/evicted.

Jan  3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Created: Created container pause
Jan  3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Started: Started container pause
Jan  3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {taint-controller } TaintManagerEviction: Marking for deletion Pod e2e-taint-single-pod-2191/taint-eviction-3
Jan  3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Killing: Stopping container pause
Jan  3 19:15:41.480: INFO: POD               NODE                                 PHASE    GRACE  CONDITIONS
Jan  3 19:15:41.480: INFO: taint-eviction-3  zj8bt86c-a9c3a-92s9q-worker-0-6dqq5  Running  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:13:31 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:14:42 +0000 UTC ContainersNotReady containers with unready status: [pause]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:14:42 +0000 UTC ContainersNotReady containers with unready status: [pause]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:13:31 +0000 UTC  }]
Jan  3 19:15:41.480: INFO: 
Jan  3 19:15:41.480: INFO: taint-eviction-3[e2e-taint-single-pod-2191].container[pause]=The container could not be located when the pod was deleted.  The container used to be Running
Jan  3 19:15:41.544: INFO: skipping dumping cluster info - cluster too large
Jan  3 19:15:41.544: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-taint-single-pod-2191" for this suite.
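
One way to read the excerpt above is to compare the time the taint controller marked the pod for deletion with the time the pod table was dumped. The sketch below is illustrative only: it parses three lines copied from the log (the CONDITIONS column is shortened to `[...]`), and the function name `eviction_lag` and both regular expressions are assumptions made for this example, not part of the e2e framework. It also assumes the log-line prefix falls in the same year as the event timestamp.

```python
import re
from datetime import datetime

# Three lines copied from the log excerpt; the CONDITIONS column is shortened to [...].
LOG = """\
Jan  3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {taint-controller } TaintManagerEviction: Marking for deletion Pod e2e-taint-single-pod-2191/taint-eviction-3
Jan  3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Killing: Stopping container pause
Jan  3 19:15:41.480: INFO: taint-eviction-3  zj8bt86c-a9c3a-92s9q-worker-0-6dqq5  Running  30s    [...]
"""

# Event line: "At <timestamp> +0000 UTC - event for <pod>: {<source>} <reason>: <message>"
EVENT_RE = re.compile(
    r"At (?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) \+0000 UTC - event for (?P<pod>\S+): "
    r"\{(?P<source>.*?)\} (?P<reason>\w+): (?P<msg>.*)"
)

# Pod table row: "<Mon> <day> <time>.<ms>: INFO: <pod>  <node>  <phase>  <grace>s  ..."
POD_ROW_RE = re.compile(
    r"^(?P<mon>\w{3})\s+(?P<day>\d+) (?P<time>\d\d:\d\d:\d\d)\.\d+: INFO: "
    r"(?P<pod>\S+)\s+(?P<node>\S+)\s+(?P<phase>\w+)\s+(?P<grace>\d+)s\s"
)


def eviction_lag(log):
    """Return how long the pod was still listed after being marked for deletion."""
    marked_at = None
    for line in log.splitlines():
        event = EVENT_RE.search(line)
        if event:
            if event["reason"] == "TaintManagerEviction":
                marked_at = datetime.strptime(event["ts"], "%Y-%m-%d %H:%M:%S")
            continue
        row = POD_ROW_RE.match(line)
        if row and marked_at is not None:
            # The log prefix has no year, so borrow it from the event timestamp.
            dumped_at = datetime.strptime(
                f"{marked_at.year} {row['mon']} {row['day']} {row['time']}",
                "%Y %b %d %H:%M:%S",
            )
            return {
                "pod": row["pod"],
                "phase": row["phase"],
                "grace_s": int(row["grace"]),
                "elapsed_s": (dumped_at - marked_at).total_seconds(),
            }
    return None


result = eviction_lag(LOG)
print(result)
```

For this excerpt the pod is still reported as Running 60 seconds after it was marked for deletion, which is twice the 30s grace period shown in the GRACE column.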

