Bug 1915494

Summary: Frequent taint-related test failures
Product: OpenShift Container Platform
Reporter: Fabian von Feilitzsch <fabian>
Component: Node
Assignee: Elana Hashman <ehashman>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Severity: high
Priority: unspecified
CC: adduarte, aos-bugs, ehashman, wking
Version: 4.7
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: TechnicalReleaseBlocker
Last Closed: 2021-01-12 18:31:01 UTC
Type: Bug

Comment 2 Elana Hashman 2021-01-12 18:31:01 UTC

*** This bug has been marked as a duplicate of bug 1908880 ***

Comment 3 Adolfo Duarte 2021-01-12 23:03:32 UTC
Several of the taint test failures on the OpenStack platform seem to show errors about mounting volumes, for example:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1347785226629156864
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1348509976322117632

"[]VolumeDevice{},StartupProbe:nil,} start failed in pod taint-eviction-1_e2e-taint-single-pod-7551(76ddbe1e-163b-4be3-9473-371500d53b85): CreateContainerConfigError: cannot find volume "default-token-9xw5g" to mount into container "pause""

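To check how often this error shows up across runs, the quoted message can be matched with a regular expression. The following is only a sketch: the pattern is derived from the single error line quoted above, and the function and group names are illustrative assumptions, not part of any kubelet or CI tooling.

```python
import re

# Pattern derived from the one error line quoted in this comment.
# Group names (pod, namespace, uid, volume, container) are illustrative.
ERROR_RE = re.compile(
    r'start failed in pod (?P<pod>[^_\s]+)_(?P<namespace>[^(\s]+)'
    r'\((?P<uid>[0-9a-f-]+)\): '
    r'CreateContainerConfigError: cannot find volume "(?P<volume>[^"]+)" '
    r'to mount into container "(?P<container>[^"]+)"'
)


def find_missing_volume_errors(log_text):
    """Return one dict per 'cannot find volume' error found in log_text."""
    return [m.groupdict() for m in ERROR_RE.finditer(log_text)]


# The error text from the first linked run, copied from this comment.
sample = (
    '[]VolumeDevice{},StartupProbe:nil,} start failed in pod '
    'taint-eviction-1_e2e-taint-single-pod-7551'
    '(76ddbe1e-163b-4be3-9473-371500d53b85): '
    'CreateContainerConfigError: cannot find volume "default-token-9xw5g" '
    'to mount into container "pause"'
)

for hit in find_missing_volume_errors(sample):
    print(hit["namespace"], hit["pod"], hit["volume"], hit["container"])
```

Running the same function over the saved build log of each failing job would show whether every run is missing the service account token volume or only some of them.
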
In other taint failures (like this one: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1345792152164110336), it is not clear why the pod is not evicted. Something also seems to be amiss with the container in the pod (the "pause" container is reported as not ready), and the test fails waiting for the pod to be deleted/evicted.

Jan  3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Created: Created container pause
Jan  3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Started: Started container pause
Jan  3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {taint-controller } TaintManagerEviction: Marking for deletion Pod e2e-taint-single-pod-2191/taint-eviction-3
Jan  3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5} Killing: Stopping container pause
Jan  3 19:15:41.480: INFO: POD               NODE                                 PHASE    GRACE  CONDITIONS
Jan  3 19:15:41.480: INFO: taint-eviction-3  zj8bt86c-a9c3a-92s9q-worker-0-6dqq5  Running  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:13:31 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:14:42 +0000 UTC ContainersNotReady containers with unready status: [pause]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:14:42 +0000 UTC ContainersNotReady containers with unready status: [pause]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2021-01-03 19:13:31 +0000 UTC  }]
Jan  3 19:15:41.480: INFO: 
Jan  3 19:15:41.480: INFO: taint-eviction-3[e2e-taint-single-pod-2191].container[pause]=The container could not be located when the pod was deleted.  The container used to be Running
Jan  3 19:15:41.544: INFO: skipping dumping cluster info - cluster too large
Jan  3 19:15:41.544: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-taint-single-pod-2191" for this suite.
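
The timing in the excerpt above can be worked out with a small script. This is a sketch only: the regular expression is fitted to the four event lines shown here, the helper name is made up for illustration, and it assumes the GRACE column (30s) is the pod's deletion grace period.

```python
import re
from datetime import datetime, timezone

# Fitted to the "At <timestamp> - event for <object>: {<source>} <Reason>: ..."
# lines in the excerpt above; not a general parser for e2e output.
EVENT_RE = re.compile(
    r'At (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 UTC - '
    r'event for (?P<obj>[^:]+): \{(?P<source>[^}]*)\} (?P<reason>\w+): '
)


def parse_events(lines):
    """Return (timestamp, reason, source) tuples for each event line."""
    events = []
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
            events.append((ts.replace(tzinfo=timezone.utc),
                           m["reason"], m["source"].strip()))
    return events


# Event lines copied from the log excerpt above.
node = "kubelet zj8bt86c-a9c3a-92s9q-worker-0-6dqq5"
log_lines = [
    "Jan  3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {%s} Created: Created container pause" % node,
    "Jan  3 19:15:41.445: INFO: At 2021-01-03 19:13:34 +0000 UTC - event for taint-eviction-3: {%s} Started: Started container pause" % node,
    "Jan  3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {taint-controller } TaintManagerEviction: Marking for deletion Pod e2e-taint-single-pod-2191/taint-eviction-3",
    "Jan  3 19:15:41.445: INFO: At 2021-01-03 19:14:41 +0000 UTC - event for taint-eviction-3: {%s} Killing: Stopping container pause" % node,
]

events = parse_events(log_lines)
evicted_at = next(ts for ts, reason, _ in events if reason == "TaintManagerEviction")

# Time of the pod status dump (Jan 3 19:15:41.480), truncated to seconds.
dumped_at = datetime(2021, 1, 3, 19, 15, 41, tzinfo=timezone.utc)
grace_s = 30  # GRACE column; assumed to be the deletion grace period

elapsed_s = int((dumped_at - evicted_at).total_seconds())
print(f"pod still listed {elapsed_s}s after eviction, "
      f"{elapsed_s - grace_s}s past the grace period")
```

Under those assumptions, the pod was still listed as Running 60s after being marked for deletion, about 30s longer than its grace period, which matches the test timing out while waiting for the pod to go away.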
