Bug 1733581

Summary: failing tests: [sig-scheduling] NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds
Product: OpenShift Container Platform
Reporter: Hongkai Liu <hongkliu>
Component: kube-scheduler
Assignee: Mike Dame <mdame>
Status: CLOSED WORKSFORME
QA Contact: Xingxing Xia <xxia>
Severity: high
Docs Contact:
Priority: urgent
Version: 4.2.0
CC: aos-bugs, eparis, jerzhang, jokerman, maszulik, mfojtik, rgudimet, shlao, wking
Target Milestone: ---
Keywords: Reopened
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: buildcop
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-23 09:12:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Hongkai Liu 2019-07-26 15:49:59 UTC
Additional info:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.2/2512#0:build-log.txt%3A18770


Failing tests:
 [sig-scheduling] NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds [Suite:openshift/conformance/serial] [Suite:k8s]
 Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190726-121548.xml
 error: 1 fail, 49 pass, 167 skip (1h0m56s)
2019/07/26 12:15:49 Container test in pod e2e-aws-serial failed, exit code 1, reason Error
2019/07/26 12:21:54 Copied 192.20Mi of artifacts from e2e-aws-serial to /logs/artifacts/e2e-aws-serial
2019/07/26 12:22:00 Ran for 1h35m31s
error: could not run steps: step e2e-aws-serial failed: template pod "e2e-aws-serial" failed: the pod ci-op-4nikhb87/e2e-aws-serial failed after 1h33m29s (failed containers: test): ContainerFailed one or more containers exited
 Container test exited with code 1, reason Error
---
5 I ns/openshift-image-registry pod/node-ca-6hftd node/ created
Jul 26 12:13:56.002 I ns/openshift-image-registry daemonset/node-ca Created pod: node-ca-6hftd
Jul 26 12:13:56.005 I ns/openshift-image-registry pod/node-ca-6hftd Successfully assigned openshift-image-registry/node-ca-6hftd to ip-10-0-135-33.ec2.internal
Jul 26 12:14:25.483 I ns/openshift-machine-api machine/ci-op-4nikhb87-ce7d8-28skd-master-2 Updated machine ci-op-4nikhb87-ce7d8-28skd-master-2 (11 times)
Jul 26 12:14:27.222 I ns/openshift-machine-api machine/ci-op-4nikhb87-ce7d8-28skd-master-0 Updated machine ci-op-4nikhb87-ce7d8-28skd-master-0 (11 times)
Jul 26 12:14:27.223 I ns/openshift-machine-api machine/ci-op-4nikhb87-ce7d8-28skd-worker-us-east-1a-vj9mx Updated machine ci-op-4nikhb87-ce7d8-28skd-worker-us-east-1a-vj9mx (14 times)
Jul 26 12:14:27.464 I ns/openshift-machine-api machine/ci-op-4nikhb87-ce7d8-28skd-worker-us-east-1a-fc545 Updated machine ci-op-4nikhb87-ce7d8-28skd-worker-us-east-1a-fc545 (14 times)
Jul 26 12:14:27.696 I ns/openshift-machine-api machine/ci-op-4nikhb87-ce7d8-28skd-worker-us-east-1b-xfhdz Updated machine ci-op-4nikhb87-ce7d8-28skd-worker-us-east-1b-xfhdz (14 times)
Jul 26 12:14:29.242 I ns/openshift-machine-api machine/ci-op-4nikhb87-ce7d8-28skd-master-1 Updated machine ci-op-4nikhb87-ce7d8-28skd-master-1 (11 times)
Jul 26 12:14:57.503 I ns/openshift-image-registry pod/node-ca-6hftd Container image "registry.svc.ci.openshift.org/ocp/4.2-2019-07-26-104231@sha256:0f8ea602298e98ad6b3bd049b318783c8e303ca9fe60d6a24b5c1b19a2c6e909" already present on machine
Jul 26 12:14:57.701 I ns/openshift-image-registry pod/node-ca-6hftd Created container node-ca
Jul 26 12:14:57.901 I ns/openshift-image-registry pod/node-ca-6hftd Started container node-ca
 Failing tests:
 [sig-scheduling] NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds [Suite:openshift/conformance/serial] [Suite:k8s]
 Writing JUnit report to /tmp/artifacts/junit/junit_e2e_20190726-121548.xml
 error: 1 fail, 49 pass, 167 skip (1h0m56s)

Comment 7 W. Trevor King 2019-08-13 22:08:16 UTC
Clayton pointed out that this test accounts for 10% of 4.2 serial promotion gate failures [1].  Recent example [2]:

  [sig-scheduling] NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds [Suite:openshift/conformance/serial] [Suite:k8s]
  fail [k8s.io/kubernetes/test/e2e/scheduling/taints.go:440]: Aug 13 18:51:12.866: Failed to evict all Pods. 1 pod(s) is not evicted.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=release-openshift-origin-installer-e2e-aws-serial-4.2&search=k8s.io/kubernetes/test/e2e/scheduling/taints.go.*%20Failed%20to%20evict%20all%20Pods
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.2/3256
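
For local triage, the search expression from the ci-search link in [1] can also be applied to a downloaded build log. The following is a minimal sketch in Python, assuming the log is available as plain text. The function name `matching_lines` and the `sample` string are hypothetical and not part of the CI tooling. The pattern is the URL-decoded query from [1], used unchanged.

```python
import re

# URL-decoded search expression from link [1]; the dots are left unescaped,
# exactly as in the original query, so they match any character.
PATTERN = re.compile(
    r"k8s.io/kubernetes/test/e2e/scheduling/taints.go.* Failed to evict all Pods"
)


def matching_lines(log_text):
    """Return the log lines that match the eviction-failure pattern."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]


# Sample built from the failure output quoted in this comment.
sample = (
    "[sig-scheduling] NoExecuteTaintManager Multiple Pods [Serial] "
    "evicts pods with minTolerationSeconds\n"
    "fail [k8s.io/kubernetes/test/e2e/scheduling/taints.go:440]: "
    "Aug 13 18:51:12.866: Failed to evict all Pods. 1 pod(s) is not evicted.\n"
)

for line in matching_lines(sample):
    print(line)
```

Only the `fail [...]` line matches. The test-name line does not contain the `taints.go` path, so it is skipped.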

Comment 9 Xingxing Xia 2019-08-22 10:15:18 UTC
https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=release-openshift-origin-installer-e2e-aws-serial-4.2&search=k8s.io/kubernetes/test/e2e/scheduling/taints.go.*%20Failed%20to%20evict%20all%20Pods shows:
58 recent release-openshift-origin-installer-e2e-aws-serial-4.2 jobs
0 (0% of all failures) k8s.io/kubernetes/test/e2e/scheduling/taints.go.* Failed to evict all Pod

In https://testgrid.k8s.io/redhat-openshift-release-blocking#redhat-release-openshift-origin-installer-e2e-aws-serial-4.2&sort-by-flakiness= the case is also continuously green in the executed jobs now, so I am changing the bug status.

Comment 15 Xingxing Xia 2019-09-23 09:12:13 UTC
Checked jobs from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-azure-serial-4.2/216 through the latest job, 232 as of this comment. The case for this bug, "NoExecuteTaintManager Multiple Pods [Serial] evicts pods with minTolerationSeconds", is always green (passed). The fix PR changes test code, not functional code, so I am changing the bug status.