Bug 1908880

Summary: 4.7 aws-serial CI: NoExecuteTaintManager Single Pod [Serial] eventually evict pod with finite tolerations from tainted nodes
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: Node Assignee: Elana Hashman <ehashman>
Node sub component: Kubelet QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, fabian, rphillips, tsweeney
Version: 4.7 Keywords: UpcomingSprint
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: A performance regression in Kubernetes 1.20: checking sandbox deletion caused pod deletions to take much longer.
Consequence: Many tests that expected pods to be deleted quickly began flaking because pods were not deleted in time.
Fix: Reverted the sandbox deletion logic.
Result: Pod deletions should now finish in the expected amount of time without a performance regression.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:46:26 UTC Type: Bug

Description W. Trevor King 2020-12-17 19:13:11 UTC
4.7 release promotion is fighting with this high-flake test case:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=eventually+evict+pod+with+finite+tolerations+from+tainted+nodes' | grep 'failures match' | sort
release-openshift-ocp-installer-e2e-aws-serial-4.7 - 4 runs, 50% failed, 50% of failures match
release-openshift-origin-installer-e2e-aws-serial-4.7 - 20 runs, 15% failed, 33% of failures match
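
As a rough reading of those two lines, the percentages can be multiplied out to estimate how many runs hit this test. This is only a sketch: it assumes "N% of failures match" means the share of failed runs that match the search, and the percentages are already rounded by search.ci, so the results are approximate. The helper names (`estimate_impact`, `summarize`) are made up for illustration.

```python
import re

# Lines copied from the search.ci output above.
REPORT = """\
release-openshift-ocp-installer-e2e-aws-serial-4.7 - 4 runs, 50% failed, 50% of failures match
release-openshift-origin-installer-e2e-aws-serial-4.7 - 20 runs, 15% failed, 33% of failures match
"""

LINE_RE = re.compile(
    r"^(?P<job>\S+) - (?P<runs>\d+) runs, (?P<failed>\d+)% failed, "
    r"(?P<match>\d+)% of failures match$"
)


def estimate_impact(line):
    """Return (job, estimated matching runs, estimated share of all runs)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    runs = int(m["runs"])
    # Assumption: share of all runs = failure rate * share of failures matching.
    share = int(m["failed"]) / 100 * int(m["match"]) / 100
    return m["job"], runs * share, share


def summarize(report):
    """Parse every non-empty line of the report."""
    return [estimate_impact(line) for line in report.splitlines() if line.strip()]


for job, matching_runs, share in summarize(REPORT):
    print(f"{job}: ~{matching_runs:.1f} matching runs ({share:.0%} of all runs)")
```

Under that assumption, both jobs work out to roughly one matching run each in the 24h window, so the ocp job's higher rate rests on a very small sample.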

Example job [1] failed:

  [k8s.io] [sig-node] NoExecuteTaintManager Single Pod [Serial] eventually evict pod with finite tolerations from tainted nodes [Suite:openshift/conformance/serial] [Suite:k8s]

with:

  fail [k8s.io/kubernetes.2/test/e2e/node/taints.go:274]: Dec 17 18:41:42.555: Pod wasn't evicted

stdout for the test included:

  Dec 17 18:41:42.647: INFO: POD               NODE                          PHASE    GRACE  CONDITIONS
  Dec 17 18:41:42.647: INFO: taint-eviction-3  ip-10-0-160-228.ec2.internal  Running  30s    [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-12-17 18:39:32 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-12-17 18:40:43 +0000 UTC ContainersNotReady containers with unready status: [pause]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-12-17 18:40:43 +0000 UTC ContainersNotReady containers with unready status: [pause]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-12-17 18:39:32 +0000 UTC  }]
  Dec 17 18:41:42.647: INFO: 
  Dec 17 18:41:42.647: INFO: taint-eviction-3[e2e-taint-single-pod-7908].container[pause]=The container could not be located when the pod was deleted.  The container used to be Running

but I have no idea if that's relevant.  I'm guessing at the sub-component too.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1339617837265719296
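
To help judge whether the stdout above is relevant, the timestamps it contains can be lined up against the 30s grace period. This is a minimal sketch, not a diagnosis: it assumes the Ready=False transition at 18:40:43 roughly marks the start of pod termination, which the log alone does not confirm. Variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the failure message and stdout above (all UTC).
scheduled = datetime.strptime("2020-12-17 18:39:32.000", FMT)    # PodScheduled True
not_ready = datetime.strptime("2020-12-17 18:40:43.000", FMT)    # Ready False
test_failed = datetime.strptime("2020-12-17 18:41:42.555", FMT)  # "Pod wasn't evicted"

grace = timedelta(seconds=30)  # GRACE column for taint-eviction-3


def seconds_between(start, end):
    """Elapsed wall-clock seconds from start to end."""
    return (end - start).total_seconds()


running_for = seconds_between(scheduled, not_ready)
# Assumption: the pod began terminating around the Ready=False transition.
lingering_for = seconds_between(not_ready, test_failed)
overrun = lingering_for - grace.total_seconds()

print(f"scheduled -> not ready:   {running_for:.3f}s")
print(f"not ready -> test failed: {lingering_for:.3f}s")
print(f"time past the 30s grace:  {overrun:.3f}s")
```

If that assumption holds, the pod was still listed as Running almost a minute after it went not-ready, about twice its grace period, when the test gave up.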

Comment 2 Elana Hashman 2021-01-08 21:38:37 UTC
xref https://github.com/kubernetes/kubernetes/issues/42685 upstream

This is a very old test, so I'm wondering if it just has a tuning issue (as encountered in the upstream issue). I'll take a closer look.

Comment 3 Elana Hashman 2021-01-08 22:04:11 UTC
I am consistently seeing the error mentioned above on all the 4.7 failures:

```
The container could not be located when the pod was deleted.  The container used to be Running
```


This matches https://github.com/kubernetes/kubernetes/issues/97288, an upstream regression in the 1.20 release.

The upstream report says: "after patching a deployment, the old pod sticks around for over a minute (or test times out after a minute). This is despite terminationGracePeriodSeconds: 30s." That is consistent with the behaviour we're seeing here on the flaky tests.

Comment 4 Elana Hashman 2021-01-12 18:30:58 UTC
*** Bug 1915494 has been marked as a duplicate of this bug. ***

Comment 6 Sunil Choudhary 2021-01-19 12:04:45 UTC
Checking for this test failure, I see it last failed 4 days ago in the 4.7 serial tests. I do not see any recent failures since the fix merged.
https://search.ci.openshift.org/?search=eventually+evict+pod+with+finite+tolerations+from+tainted+nodes&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 9 errata-xmlrpc 2021-02-24 15:46:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633