4.7 release promotion is fighting with this high-flake test case:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=eventually+evict+pod+with+finite+tolerations+from+tainted+nodes' | grep 'failures match' | sort
release-openshift-ocp-installer-e2e-aws-serial-4.7 - 4 runs, 50% failed, 50% of failures match
release-openshift-origin-installer-e2e-aws-serial-4.7 - 20 runs, 15% failed, 33% of failures match
```

Example job [1] failed:

  [k8s.io] [sig-node] NoExecuteTaintManager Single Pod [Serial] eventually evict pod with finite tolerations from tainted nodes [Suite:openshift/conformance/serial] [Suite:k8s]

with:

```
fail [k8s.io/kubernetes.2/test/e2e/node/taints.go:274]: Dec 17 18:41:42.555: Pod wasn't evicted
```

stdout for the test included:

```
Dec 17 18:41:42.647: INFO: POD NODE PHASE GRACE CONDITIONS
Dec 17 18:41:42.647: INFO: taint-eviction-3 ip-10-0-160-228.ec2.internal Running 30s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-17 18:39:32 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-12-17 18:40:43 +0000 UTC ContainersNotReady containers with unready status: [pause]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-12-17 18:40:43 +0000 UTC ContainersNotReady containers with unready status: [pause]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-17 18:39:32 +0000 UTC }]
Dec 17 18:41:42.647: INFO:
Dec 17 18:41:42.647: INFO: taint-eviction-3[e2e-taint-single-pod-7908].container[pause]=The container could not be located when the pod was deleted. The container used to be Running
```

but I have no idea if that's relevant. I'm guessing at the sub-component too.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1339617837265719296
xref upstream issue https://github.com/kubernetes/kubernetes/issues/42685. This is a very old test, so I'm wondering whether it just has a tuning issue (as encountered in the upstream issue). I'll take a closer look.
I am consistently seeing the error mentioned above on all the 4.7 failures:

```
The container could not be located when the pod was deleted. The container used to be Running
```

This matches https://github.com/kubernetes/kubernetes/issues/97288, an upstream regression in the 1.20 release: "after patching a deployment, the old pod sticks around for over a minute (or test times out after a minute). This is despite terminationGracePeriodSeconds: 30s." That is consistent with the behaviour we're seeing here on the flaky tests.
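For context on what "finite tolerations" means in this test: a NoExecute taint normally evicts a pod immediately, but a toleration with a finite tolerationSeconds only delays the eviction, and a toleration with no tolerationSeconds suppresses it entirely. The sketch below is a simplified model of that decision (hypothetical Python, not the actual Kubernetes taint-manager code; all names are made up):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Toleration:
    effect: str                           # e.g. "NoExecute"
    toleration_seconds: Optional[int]     # None means "tolerate forever"

def eviction_deadline(taint_applied_at: float,
                      tolerations: List[Toleration]) -> Optional[float]:
    """Return the time at which a pod should be evicted after a NoExecute
    taint lands on its node, or None if the pod tolerates it forever.

    Simplified model: a finite tolerationSeconds delays eviction; an
    unbounded toleration prevents it; no toleration means evict now.
    """
    matching = [t for t in tolerations if t.effect == "NoExecute"]
    if not matching:
        return taint_applied_at  # no toleration: evict immediately
    if any(t.toleration_seconds is None for t in matching):
        return None              # unbounded toleration: never evict
    return taint_applied_at + min(t.toleration_seconds for t in matching)
```

Under this model, the e2e test's pod (finite toleration) must eventually get a deadline and be evicted; the flake is that the pod was still Running past that point.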
*** Bug 1915494 has been marked as a duplicate of this bug. ***
Checking for this test failure, I see it last failed 4 days ago in 4.7 serial tests. I do not see any recent failures after the fix merged: https://search.ci.openshift.org/?search=eventually+evict+pod+with+finite+tolerations+from+tainted+nodes&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633