Bug 1915945 - [sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works [Conformance]
Summary: [sig-scheduling] SchedulerPreemption [Serial] validates basic preemption work...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-13 19:18 UTC by jamo luhrsen
Modified: 2021-02-24 15:53 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:53:00 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 530 0 None closed bug 1915945: UPSTREAM: 96990: Increase preemption timeout from 1 minute to 2 minutes 2021-01-25 10:28:56 UTC
Github openshift origin pull 25809 0 None closed bug 1915945: UPSTREAM: 96990: Increase preemption timeout from 1 minute to 2 minutes 2021-01-25 10:29:38 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:53:24 UTC

Description jamo luhrsen 2021-01-13 19:18:54 UTC
Description of problem:

While monitoring the 4.7 CI signal, where the openstack platform had degraded a bit,
we noticed that the following test case fails periodically
(NOTE: it is not isolated to the openstack platform):

  [sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]

I have not found a root cause in the logs yet, but here is a failed job:

  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.7/1349053534636609536

Also, I can reproduce this on a cluster-bot cluster by launching 4.7 on aws and running the following
simple steps:

  ❯ sudo podman run -it -v $LOCAL_KUBECONFIG_FROM_CLUSTER_BOT:/root/.kube/config:z -i quay.io/openshift/origin-tests:4.7 /bin/bash
  [root@55f251f22bc9 /]# export KUBECONFIG=/root/.kube/config
  [root@55f251f22bc9 /]# time while true; do
      openshift-tests run-test "[sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]"
      if [ $? -ne 0 ]; then
        break
      fi
    done

That loops the single test case and stops once it fails. I was able to reproduce the failure a few
times in the few hours my cluster was available. Unfortunately, the cluster aged out before I had
time to collect more debug information than we get from the logs in CI.


Version-Release number of selected component (if applicable):

4.7

How reproducible:

infrequent, but this search shows it happening a handful of times:
  https://search.ci.openshift.org/?search=validates+basic+preemption+works&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Steps to Reproduce:

see description above

Comment 1 jamo luhrsen 2021-01-14 00:17:39 UTC
I was able to poke around the master nodes after I caught a failure. Again, I lost the cluster before
figuring out too much, but I did see the snippet below (which I don't find in the logs from CI, so I
hope it's a little bit helpful). It comes from this file:

pods/openshift-kube-scheduler_openshift-kube-scheduler-ip-10-0-132-85.us-west-2.compute.internal_59db6637-ae92-4357-b607-9312ce4dab44/kube-scheduler/0.log


2021-01-13T21:35:07.403973257+00:00 stderr F I0113 21:35:07.403873       1 scheduler.go:592] "Successfully bound pod to node" pod="e2e-sched-preemption-3378/pod0-sched-preemption-low-priority" node="ip-10-0-158-73.us-west-2.compute.internal" evaluatedNodes=6 feasibleNodes=1
2021-01-13T21:35:07.672966782+00:00 stderr F I0113 21:35:07.672915       1 scheduler.go:592] "Successfully bound pod to node" pod="e2e-sched-preemption-3378/pod1-sched-preemption-medium-priority" node="ip-10-0-175-169.us-west-2.compute.internal" evaluatedNodes=6 feasibleNodes=1
2021-01-13T21:35:07.884861457+00:00 stderr F I0113 21:35:07.884261       1 scheduler.go:592] "Successfully bound pod to node" pod="e2e-sched-preemption-3378/pod2-sched-preemption-medium-priority" node="ip-10-0-202-167.us-west-2.compute.internal" evaluatedNodes=6 feasibleNodes=1
2021-01-13T21:35:14.378479856+00:00 stderr F I0113 21:35:14.378416       1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-3378/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
2021-01-13T21:35:14.387474415+00:00 stderr F I0113 21:35:14.387419       1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-3378/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
2021-01-13T21:36:17.579211762+00:00 stderr F E0113 21:36:17.579164       1 framework.go:777] "Failed running Bind plugin" err="pods \"preemptor-pod\" is forbidden: unable to create new content in namespace e2e-sched-preemption-3378 because it is being terminated" plugin="DefaultBinder" pod="e2e-sched-preemption-3378/preemptor-pod"
2021-01-13T21:36:17.579351098+00:00 stderr F I0113 21:36:17.579310       1 scheduler.go:420] Failed to bind pod: e2e-sched-preemption-3378/preemptor-pod
2021-01-13T21:36:17.579393569+00:00 stderr F E0113 21:36:17.579377       1 factory.go:337] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": pods \"preemptor-pod\" is forbidden: unable to create new content in namespace e2e-sched-preemption-3378 because it is being terminated" pod="e2e-sched-preemption-3378/preemptor-pod"
2021-01-13T21:36:20.361326967+00:00 stderr F E0113 21:36:20.361207       1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"preemptor-pod.1659e851eed18ccf", GenerateName:"", Namespace:"e2e-sched-preemption-3378", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, EventTime:v1.MicroTime{Time:time.Time{wall:0xbff7f7d5155c2ae8, ext:16359491835897, loc:(*time.Location)(0x2ceed00)}}, Series:(*v1.EventSeries)(nil), ReportingController:"default-scheduler", ReportingInstance:"default-scheduler-ip-10-0-132-85", Action:"Scheduling", Reason:"FailedScheduling", Regarding:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-sched-preemption-3378", Name:"preemptor-pod", UID:"fad0c735-b398-4dfc-a2d5-8774b1db4474", APIVersion:"v1", ResourceVersion:"127277", FieldPath:""}, Related:(*v1.ObjectReference)(nil), Note:"skip schedule deleting pod: e2e-sched-preemption-3378/preemptor-pod", Type:"Warning", DeprecatedSource:v1.EventSource{Component:"", Host:""}, DeprecatedFirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeprecatedLastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeprecatedCount:0}': 'events.events.k8s.io "preemptor-pod.1659e851eed18ccf" is forbidden: unable to create new content in namespace e2e-sched-preemption-3378 because it is being terminated' (will not retry!)

These are the logs from the test case iteration that failed. The timestamps show roughly 1 minute of
messages about 0 nodes being available, then finally the 'Error scheduling pod' message, which I
believe appears because the test case has given up and is tearing things down: it deletes the
namespace, and on the next retry the scheduler notices the namespace is terminating and gives up on
finding a node to schedule on.
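
As a side note, oc can pull the same kube-scheduler log from a live cluster before it ages out,
instead of reading the file on the node. Something like the following should work; the pod name
below comes from this cluster's log path and will differ elsewhere:

  # List the scheduler static pods, then dump the log of the affected one and
  # filter for the preemption test's pods.
  oc get pods -n openshift-kube-scheduler
  oc logs -n openshift-kube-scheduler \
      openshift-kube-scheduler-ip-10-0-132-85.us-west-2.compute.internal \
      -c kube-scheduler | grep -E 'preemptor-pod|sched-preemption'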

Comment 2 Jan Chaloupka 2021-01-19 12:52:29 UTC
Checking the recent failures, every test times out after 1 minute of waiting for the preemptor pod to be scheduled, with kube-scheduler logs similar to these:
```
I0115 04:44:43.912012       1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-930/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
I0115 04:44:43.920822       1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-930/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
...
E0115 04:45:49.610424       1 framework.go:777] "Failed running Bind plugin" err="pods \"preemptor-pod\" is forbidden: unable to create new content in namespace e2e-sched-preemption-930 because it is being terminated" plugin="DefaultBinder" pod="e2e-sched-preemption-930/preemptor-pod"
I0115 04:45:49.610463       1 scheduler.go:420] Failed to bind pod: e2e-sched-preemption-930/preemptor-pod
```

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1350446580682985472
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1350355743152279552
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1349924268133060608

It seems it takes slightly over a minute to bind the preemptor-pod and then some time to actually run it.

In some rare cases the preemptor pod gets scheduled (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.7/1349872296105349120):

```
I0115 02:23:50.826264       1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-3163/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
I0115 02:23:50.841743       1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-3163/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
I0115 02:24:49.538649       1 scheduler.go:592] "Successfully bound pod to node" pod="e2e-sched-preemption-3163/preemptor-pod" node="ip-10-0-146-177.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=1
```

Though even then it is pretty tight, so the test still does not wait long enough for the pod to get to Running.
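
The actual timeout bump is in the Go e2e change linked above; purely to illustrate the same idea
(wait up to 2 minutes instead of 1 for the preemptor pod), a rough equivalent against a live
cluster with plain oc commands would be something like this, using the namespace name from one of
the runs above (it changes every run):

```
# Wait up to 2 minutes (rather than 1) for the preemptor pod to be bound to a
# node, then up to 2 more minutes for it to become Ready.
NS=e2e-sched-preemption-930
oc wait --for=condition=PodScheduled pod/preemptor-pod -n "$NS" --timeout=2m
oc wait --for=condition=Ready pod/preemptor-pod -n "$NS" --timeout=2m
```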

Comment 3 jamo luhrsen 2021-01-19 18:04:01 UTC
I don't see this really happening in 4.6, but it is there in 4.7 and 4.8, which makes me think it's
potentially a real regression in how long it takes to bring up a pod. Also, there is this bz
(https://bugzilla.redhat.com/show_bug.cgi?id=1915085) calling out a similar thing that could explain
this problem.

Not that increasing the timeout from 1m to 2m would be wrong; if a pod were really taking
5m to come up, we'd still be failing these test cases.

Just FYI.

Comment 4 jamo luhrsen 2021-01-19 18:05:46 UTC
(In reply to jamo luhrsen from comment #3)
> I don't see this really happening in 4.6, but it is there in 4.7 and 4.8, which makes me
> think it's potentially a real regression in how long it takes to bring up a pod. Also,
> there is this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1915085) calling out a
> similar thing that could explain this problem.

BTW, the upstream issue (https://github.com/kubernetes/kubernetes/issues/98142) noted in that
bz does have some chatter and a proposed hack/fix that could help. Maybe we can wait for
that issue to be resolved before increasing the test timeout?

> Not that increasing the timeout from 1m to 2m would be wrong; if a pod were really taking
> 5m to come up, we'd still be failing these test cases.
>
> Just FYI.

Comment 5 Jan Chaloupka 2021-01-20 09:55:34 UTC
The upstream PR has already merged into o/k. The change would land in the next rebase anyway.

Thanks for pointing out the other bug. It seems it would be better to have a dedicated test in the kubelet test suite to measure kubelet performance with runc.

Comment 7 RamaKasturi 2021-01-25 10:42:38 UTC
From the link below, I see that the test has not failed for about 7 days, so I am moving the bug to the verified state.

https://search.ci.openshift.org/?search=validates+basic+preemption+works&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 10 errata-xmlrpc 2021-02-24 15:53:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

