Bug 1876918 - scheduler test leaves taint behind
Summary: scheduler test leaves taint behind
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Mike Dame
QA Contact: RamaKasturi
URL:
Whiteboard: LifecycleReset
Depends On:
Blocks:
 
Reported: 2020-09-08 14:02 UTC by David Eads
Modified: 2021-07-27 22:33 UTC
CC: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:32:55 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/kubernetes pull 513 (closed): Bug 1876918: Move deferred taint cleanup call to ensure all are removed. Last updated 2021-02-15 05:34:17 UTC
Red Hat Product Errata RHSA-2021:2438. Last updated 2021-07-27 22:33:19 UTC

Description David Eads 2020-09-08 14:02:04 UTC
This is weird. I found a failure in this test https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.6/1303077798943920128 that is actually caused by 

node/ci-op-4it1h40d-22d24-w4mls-worker-c-tc9qh had unexpected taints: []v1.Taint{v1.Taint{Key:"kubernetes.io/e2e-taint-key-90ae0008-a503-4dd3-baf9-b8b02123f7f7", Value:"testing-taint-value-1fe3e0bf-1727-44ba-97cb-c6900294e005", Effect:"PreferNoSchedule", TimeAdded:(*v1.Time)(nil)}}

The only spot where I could find that taint was in a scheduler serial test. The test appears to clean up after itself, but something weird is happening. Maybe conflicting patches somehow?

Comment 1 Mike Dame 2020-09-10 16:11:36 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 2 Mike Dame 2020-10-02 21:27:09 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug in an upcoming sprint.

Comment 3 Michal Fojtik 2020-10-08 14:04:07 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Comment 4 Mike Dame 2020-10-23 19:05:22 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug in an upcoming sprint.

Comment 5 Michal Fojtik 2020-12-05 01:13:33 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset.
The bug assignee was notified.

Comment 6 Mike Dame 2020-12-08 18:39:00 UTC
This taint is only used by the scheduler priorities test: https://github.com/kubernetes/kubernetes/blob/64c099d/test/e2e/scheduling/priorities.go#L309, where 10 taints are added to each node and the removal is deferred after each add.

I notice, though, that the taints are added by a call to a helper, "addRandomTaintToNode", which does two things: (1) adds a taint to the node, and (2) verifies with the testing framework that the taint was added: https://github.com/kubernetes/kubernetes/blob/64c099d/test/e2e/scheduling/priorities.go#L585-L594

What I'm curious about is what happens if addRandomTaintToNode fails to add the taint and the framework.Expect call fails the test. From what I can tell, that eventually triggers a panic in the Ginkgo wrapper, which exits the test.

In that case, the defer call to remove that specific taint never gets registered on the defer stack, which would leave the taint on the node. (https://github.com/kubernetes/kubernetes/blob/64c099d/test/e2e/scheduling/priorities.go#L309-L311)
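
To illustrate, here is a self-contained sketch of that pattern (helper and variable names are invented, not the actual e2e code): if verification panics after the taint has already landed on the node, the matching defer is never registered, so earlier taints are cleaned up but the most recent one is leaked.

package main

import "fmt"

// addAndVerify stands in for addRandomTaintToNode: the taint lands on the
// "node" first, then verification panics (as a failed framework assertion
// would under Ginkgo).
func addAndVerify(applied *[]string, taint string, failVerify bool) string {
	*applied = append(*applied, taint)
	if failVerify {
		panic("verification failed after the taint was applied")
	}
	return taint
}

func runTest(applied *[]string) {
	defer func() { recover() }() // stand-in for Ginkgo's panic recovery
	for i := 0; i < 3; i++ {
		// The cleanup defer is only registered once addAndVerify returns,
		// so a panic inside it leaks the taint that was just applied.
		taint := addAndVerify(applied, fmt.Sprintf("taint-%d", i), i == 2)
		defer func(t string) {
			for j, v := range *applied {
				if v == t {
					*applied = append((*applied)[:j], (*applied)[j+1:]...)
					return
				}
			}
		}(taint)
	}
}

func main() {
	var applied []string
	runTest(&applied)
	fmt.Println("leaked:", applied) // prints: leaked: [taint-2]
}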

@David, you probably understand the Go stack better than I do. Does that sound like a possible explanation?

Comment 7 Michal Fojtik 2021-01-07 19:24:34 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Comment 8 David Eads 2021-01-07 19:55:57 UTC
ginkgo is a bit of a mystery to me. Do you see evidence of the panics you have described in the test output? Does it only happen when the test fails?

It does seem like it would be easier to use a special taint prefix unique to the test run and simply remove everything with that prefix at the end of the run. That way, the only thing left to chase is whether a very late patch outran the client test timeout. A rough sketch of the idea follows.
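
A sketch of that prefix-sweep idea (names and the prefix value are invented for illustration; this is not the actual e2e code):

package main

import (
	"fmt"
	"strings"
)

// sweepRunTaints drops every taint key carrying the per-run prefix,
// regardless of how many individual adds succeeded or failed earlier.
func sweepRunTaints(nodeTaints []string, runPrefix string) []string {
	var kept []string
	for _, key := range nodeTaints {
		if !strings.HasPrefix(key, runPrefix) {
			kept = append(kept, key)
		}
	}
	return kept
}

func main() {
	prefix := "kubernetes.io/e2e-taint-run42-" // hypothetical per-run prefix
	taints := []string{
		prefix + "a",
		"node.kubernetes.io/unreachable", // unrelated taint survives the sweep
		prefix + "b",
	}
	fmt.Println(sweepRunTaints(taints, prefix))
	// prints: [node.kubernetes.io/unreachable]
}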

Comment 9 Mike Dame 2021-01-07 20:08:59 UTC
I agree; refactoring this to have a guaranteed "remove all these test taints" call instead of stacking defer calls could fix this. I'll create a PR to add that and hopefully address this flake. Roughly the shape I have in mind is sketched below.
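
A sketch only (the names are illustrative, and the actual PR may structure this differently): record each taint before the add/verify step can panic, and register one cleanup defer up front instead of one per taint.

package main

import "fmt"

func main() {
	var applied []string // taints currently on the "node"
	func() {
		defer func() { recover() }() // stand-in for Ginkgo's panic recovery
		var added []string
		// One defer, registered before any taint is applied, removes
		// everything recorded even if a later iteration panics.
		defer func() {
			for _, t := range added {
				for j, v := range applied {
					if v == t {
						applied = append(applied[:j], applied[j+1:]...)
						break
					}
				}
			}
		}()
		for i := 0; i < 3; i++ {
			t := fmt.Sprintf("taint-%d", i)
			added = append(added, t)     // record first, so cleanup covers it
			applied = append(applied, t) // then apply to the node
			if i == 2 {
				panic("verification failed") // the cleanup above still runs
			}
		}
	}()
	fmt.Println("leaked:", applied) // prints: leaked: []
}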

Comment 10 Michal Fojtik 2021-01-07 20:24:39 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently.
The bug assignee was notified.

Comment 11 Mike Dame 2021-02-04 18:09:11 UTC
The upstream PR for this is blocked waiting for approval from all of the reviewers, who have been notified.

Comment 13 RamaKasturi 2021-02-15 16:56:21 UTC
Verified via the link below: I did not see this flaky test fail in the last 2 days, so I am moving the bug to the verified state.

[1] https://search.ci.openshift.org/?search=Managed+cluster+should+grow+and+decrease+when+scaling+different+machineSets+simultaneously&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 16 errata-xmlrpc 2021-07-27 22:32:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

