Bug 1915945
| Field | Value |
| --- | --- |
| Summary | [sig-scheduling] SchedulerPreemption [Serial] validates basic preemption works [Conformance] |
| Product | OpenShift Container Platform |
| Component | kube-scheduler |
| Version | 4.7 |
| Target Release | 4.7.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | medium |
| Reporter | jamo luhrsen <jluhrsen> |
| Assignee | Jan Chaloupka <jchaloup> |
| QA Contact | RamaKasturi <knarra> |
| CC | aos-bugs, maszulik, mfojtik, wking |
| Keywords | UpcomingSprint |
| Type | Bug |
| Doc Type | No Doc Update |
| Last Closed | 2021-02-24 15:53:00 UTC |
Description (jamo luhrsen, 2021-01-13 19:18:54 UTC)
I was able to poke around the master nodes after I caught a failure. Again, I lost the cluster before figuring out too much, but I did see the snippet below (which I don't find in the logs from CI, so hoping it's a little bit helpful) in this file:

pods/openshift-kube-scheduler_openshift-kube-scheduler-ip-10-0-132-85.us-west-2.compute.internal_59db6637-ae92-4357-b607-9312ce4dab44/kube-scheduler/0.log

```
2021-01-13T21:35:07.403973257+00:00 stderr F I0113 21:35:07.403873 1 scheduler.go:592] "Successfully bound pod to node" pod="e2e-sched-preemption-3378/pod0-sched-preemption-low-priority" node="ip-10-0-158-73.us-west-2.compute.internal" evaluatedNodes=6 feasibleNodes=1
2021-01-13T21:35:07.672966782+00:00 stderr F I0113 21:35:07.672915 1 scheduler.go:592] "Successfully bound pod to node" pod="e2e-sched-preemption-3378/pod1-sched-preemption-medium-priority" node="ip-10-0-175-169.us-west-2.compute.internal" evaluatedNodes=6 feasibleNodes=1
2021-01-13T21:35:07.884861457+00:00 stderr F I0113 21:35:07.884261 1 scheduler.go:592] "Successfully bound pod to node" pod="e2e-sched-preemption-3378/pod2-sched-preemption-medium-priority" node="ip-10-0-202-167.us-west-2.compute.internal" evaluatedNodes=6 feasibleNodes=1
2021-01-13T21:35:14.378479856+00:00 stderr F I0113 21:35:14.378416 1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-3378/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
2021-01-13T21:35:14.387474415+00:00 stderr F I0113 21:35:14.387419 1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-3378/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
2021-01-13T21:36:17.579211762+00:00 stderr F E0113 21:36:17.579164 1 framework.go:777] "Failed running Bind plugin" err="pods \"preemptor-pod\" is forbidden: unable to create new content in namespace e2e-sched-preemption-3378 because it is being terminated" plugin="DefaultBinder" pod="e2e-sched-preemption-3378/preemptor-pod"
2021-01-13T21:36:17.579351098+00:00 stderr F I0113 21:36:17.579310 1 scheduler.go:420] Failed to bind pod: e2e-sched-preemption-3378/preemptor-pod
2021-01-13T21:36:17.579393569+00:00 stderr F E0113 21:36:17.579377 1 factory.go:337] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": pods \"preemptor-pod\" is forbidden: unable to create new content in namespace e2e-sched-preemption-3378 because it is being terminated" pod="e2e-sched-preemption-3378/preemptor-pod"
2021-01-13T21:36:20.361326967+00:00 stderr F E0113 21:36:20.361207 1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"preemptor-pod.1659e851eed18ccf", GenerateName:"", Namespace:"e2e-sched-preemption-3378", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, EventTime:v1.MicroTime{Time:time.Time{wall:0xbff7f7d5155c2ae8, ext:16359491835897, loc:(*time.Location)(0x2ceed00)}}, Series:(*v1.EventSeries)(nil), ReportingController:"default-scheduler", ReportingInstance:"default-scheduler-ip-10-0-132-85", Action:"Scheduling", Reason:"FailedScheduling", Regarding:v1.ObjectReference{Kind:"Pod", Namespace:"e2e-sched-preemption-3378", Name:"preemptor-pod", UID:"fad0c735-b398-4dfc-a2d5-8774b1db4474", APIVersion:"v1", ResourceVersion:"127277", FieldPath:""}, Related:(*v1.ObjectReference)(nil), Note:"skip schedule deleting pod: e2e-sched-preemption-3378/preemptor-pod", Type:"Warning", DeprecatedSource:v1.EventSource{Component:"", Host:""}, DeprecatedFirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeprecatedLastTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeprecatedCount:0}': 'events.events.k8s.io "preemptor-pod.1659e851eed18ccf" is forbidden: unable to create new content in namespace e2e-sched-preemption-3378 because it is being terminated' (will not retry!)
```

These are the logs from the test case iteration that failed. You can see from the timestamps that roughly one minute passes with messages about 0 nodes being available, and then finally the 'Error scheduling pod' message, which I believe comes because the test case has given up and is tearing things down: it deletes the namespace, and on the next retry to find a node the scheduler notices the namespace is terminating and gives up.
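For context, the scenario those logs describe roughly boils down to the sketch below: a high-priority "preemptor" pod requests one unit of the extended resource scheduling.k8s.io/foo, which the lower-priority pods already hold, so it can only run after one of them is preempted, and the test waits about a minute for that to happen. This is only an illustrative, simplified sketch; the namespace, image, PriorityClass name, and wiring are placeholders, not the actual upstream e2e test code.

```go
// Simplified, hypothetical sketch of the preemption scenario seen in the logs.
// Not the upstream test code; names and images are placeholders.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ns := "e2e-sched-preemption-demo" // hypothetical namespace

	// Preemptor pod: high priority, requesting one unit of the extended
	// resource scheduling.k8s.io/foo that the low/medium-priority pods
	// already consume, so the scheduler has to preempt one of them first.
	preemptor := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "preemptor-pod"},
		Spec: corev1.PodSpec{
			PriorityClassName: "high-priority", // assumed PriorityClass created beforehand
			Containers: []corev1.Container{{
				Name:  "pause",
				Image: "registry.k8s.io/pause:3.9", // placeholder image
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						"scheduling.k8s.io/foo": resource.MustParse("1"),
					},
					Limits: corev1.ResourceList{
						"scheduling.k8s.io/foo": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	if _, err := cs.CoreV1().Pods(ns).Create(context.TODO(), preemptor, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// The test then waits for the preemptor pod to be running. With a
	// 1-minute wait, a slow preemption plus pod startup overruns the
	// timeout; the framework deletes the namespace, and the scheduler's
	// deferred bind fails with "unable to create new content in namespace
	// ... because it is being terminated", as seen in the logs above.
	err = wait.PollImmediate(2*time.Second, 1*time.Minute, func() (bool, error) {
		p, err := cs.CoreV1().Pods(ns).Get(context.TODO(), "preemptor-pod", metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		return p.Status.Phase == corev1.PodRunning, nil
	})
	fmt.Println("preemptor running:", err == nil)
}
```

Bumping the wait in the last step from one to two minutes is essentially the timeout change discussed in the comments below.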
Checking the recent failures, every test times out after 1 minute of waiting for the preemptor pod to be scheduled, with similar kube-scheduler logs like these:

```
I0115 04:44:43.912012 1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-930/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
I0115 04:44:43.920822 1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-930/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
...
E0115 04:45:49.610424 1 framework.go:777] "Failed running Bind plugin" err="pods \"preemptor-pod\" is forbidden: unable to create new content in namespace e2e-sched-preemption-930 because it is being terminated" plugin="DefaultBinder" pod="e2e-sched-preemption-930/preemptor-pod"
I0115 04:45:49.610463 1 scheduler.go:420] Failed to bind pod: e2e-sched-preemption-930/preemptor-pod
```

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1350446580682985472
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1350355743152279552
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.7/1349924268133060608

It seems it takes slightly over a minute to bind the preemptor-pod, and then some more time to actually run it. In some rare cases the preemptor pod does get scheduled (https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.7/1349872296105349120):

```
I0115 02:23:50.826264 1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-3163/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
I0115 02:23:50.841743 1 factory.go:321] "Unable to schedule pod; no fit; waiting" pod="e2e-sched-preemption-3163/preemptor-pod" err="0/6 nodes are available: 3 Insufficient scheduling.k8s.io/foo, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate."
I0115 02:24:49.538649 1 scheduler.go:592] "Successfully bound pod to node" pod="e2e-sched-preemption-3163/preemptor-pod" node="ip-10-0-146-177.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=1
```

Even then it's pretty tight, so the test still does not wait long enough for the pod to actually get running.

I don't see this really happening in 4.6, but it is there in 4.7 and 4.8, which makes me think it's potentially a real regression in the performance of bringing up a pod. Also, there is this bz (https://bugzilla.redhat.com/show_bug.cgi?id=1915085) sort of calling out a similar thing that could explain this problem.

Not that increasing the timeout from 1m to 2m would be wrong, because if a pod was really taking 5m to come up we'd still be failing these test cases. Just FYI.

(In reply to jamo luhrsen from comment #3)
> I don't see this really happening in 4.6, but it is there in 4.7 and 4.8
> which makes me think it's potentially a real regression in performance to
> bring up a pod. Also, there is this bz
> (https://bugzilla.redhat.com/show_bug.cgi?id=1915085) sort of calling out
> a similar thing that could explain this problem.

BTW, the upstream issue (https://github.com/kubernetes/kubernetes/issues/98142) noted in that bz does have some chatter and a proposed hack/fix that could help. Maybe we can wait for that issue to be resolved before increasing the test timeout?

> Not that increasing the timeout from 1m to 2m would be wrong, because if a
> pod was really taking 5m to come up we'd still be failing these test cases.
>
> just fyi

The upstream PR has already merged in o/k.
The change would land in the next rebase anyway.

Thanks for pointing out the other bug. It seems like it would be better to have a separate test in the kubelet test suite to measure kubelet/runc performance.

From the link below, I see that the test has not been failing for about 7 days, so moving the bug to the verified state.

https://search.ci.openshift.org/?search=validates+basic+preemption+works&maxAge=168h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633