test: [sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run [Conformance] is failing frequently in CI; see search results: https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-scheduling%5C%5D+SchedulerPredicates+%5C%5BSerial%5C%5D+validates+resource+limits+of+pods+that+are+allowed+to+run++%5C%5BConformance%5C%5D

This is the number 2 serial flake in the last two days: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.6/1296119701423788032
From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.6/1296119701423788032/artifacts/e2e-aws-serial/pods/openshift-kube-scheduler_openshift-kube-scheduler-ip-10-0-134-89.us-west-1.compute.internal_kube-scheduler.log:
```
I0819 17:30:37.588242 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-1338/filler-pod-f202bfd0-5e18-4591-937f-352e9124db0e" node="ip-10-0-138-200.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=1
I0819 17:30:37.668452 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-1338/filler-pod-1c9c1bb8-dba8-41e4-a63a-a1946422ed55" node="ip-10-0-153-85.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=1
I0819 17:30:37.753573 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-1338/filler-pod-d0f8b31d-a8ca-4be1-bea6-c430ce2471ba" node="ip-10-0-255-251.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=1
I0819 17:30:42.350926 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-1338/additional-pod" node="ip-10-0-153-85.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=2
```
All 3 worker nodes should have >= 70% of their CPU allocated (the filler pod on each node, together with the rest of the scheduled pods, consumes 70% of allocatable CPU), so additional-pod should really stay pending (it requests 50% of allocatable CPU). Possibly some pods are terminated between the filler pods being scheduled and the additional-pod being scheduled (which can take up to 5s).
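The arithmetic the test relies on can be sketched as follows (a minimal illustration in plain Python, not the actual e2e code; the 4000m allocatable figure and the helper name are made up for the example):

```python
def filler_cpu_millis(allocatable_m, already_requested_m, target_fraction=0.7):
    """CPU request (in millicores) for a filler pod so that the node ends up
    at target_fraction of its allocatable CPU."""
    return max(int(allocatable_m * target_fraction) - already_requested_m, 0)

# Hypothetical 4-core worker (4000m allocatable) already running 773m of pods:
filler = filler_cpu_millis(4000, 773)   # filler pod requests 2027m
remaining = 4000 - (773 + filler)       # 1200m (30%) left on the node

# additional-pod requests 50% of allocatable (2000m), which exceeds the 1200m
# remaining, so it should stay Pending -- unless some of the counted pods
# terminate in the meantime and free their requests.
print(filler, remaining)
```

If any pod counted into `already_requested_m` terminates between the filler pods being bound and additional-pod being evaluated, the node effectively frees that CPU and additional-pod may fit after all, which is exactly the flake observed above.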
Without exposing the scheduler's internals (e.g. the node cache or the list of known pods), it is quite non-trivial to figure out why the resource utilization computed by the e2e exceeds the node's capacity. I suggest creating a new scheduling plugin which will log such information and will be disabled in production (see https://github.com/kubernetes/kubernetes/issues/94192). In case the suggested solution is too insecure or is rejected upstream, we can still enable the plugin downstream and remove it before the code gets released into production.
Introducing a debugger plugin upstream which can at least show the discrepancy between the resource consumption assumed by the scheduler and the consumption assumed by the flaking e2e: https://github.com/kubernetes/kubernetes/pull/94423. We might temporarily merge it downstream to get the difference in assumptions.
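What such a plugin reports is essentially the scheduler's per-node view of assumed requests. A rough sketch of that aggregation (illustrative only; the real plugin reads the scheduler's NodeInfo cache in Go, and the function and field names here are invented):

```python
def assumed_requests(pods_by_node):
    """Aggregate the requests the scheduler believes each node is serving --
    the same numbers the debugger plugin prints as 'Dumping node infos'."""
    totals = {}
    for node, pods in pods_by_node.items():
        cpu_m = sum(p["cpu_m"] for p in pods)     # millicores
        mem_mi = sum(p["mem_mi"] for p in pods)   # MiB
        totals[node] = (cpu_m, mem_mi)
        print(f'node="{node}" requested="[cpu={cpu_m}m memory={mem_mi}Mi]"')
    return totals

assumed_requests({
    "worker-0": [{"cpu_m": 773, "mem_mi": 3097}],
    "worker-1": [{"cpu_m": 468, "mem_mi": 2931}],
})
```

Comparing this dump against the sums the e2e computes from the pod list is what exposes the disagreement between the two views.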
From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.6/1301858872989847552/artifacts/e2e-metal-serial/pods/openshift-kube-scheduler_openshift-kube-scheduler-master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com_kube-scheduler.log (what the scheduler sees):
```
I0904 13:19:21.079071 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=5579Mi pods=0 ephemeral-storage=0 cpu=1895m]"
I0904 13:19:21.079095 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:19:21.079108 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=1934m memory=5999Mi pods=0]"
I0904 13:19:21.079121 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=773m memory=3097Mi pods=0 ephemeral-storage=0]"
I0904 13:19:21.079133 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=3282Mi pods=0 ephemeral-storage=0 cpu=572m]"
I0904 13:19:21.079145 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=468m memory=2931Mi pods=0 ephemeral-storage=0]"
I0904 13:19:21.083259 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/ead2872e-4ad3-4244-bb02-9f7ff0b6a428-0" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:19:26.257626 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:19:26.257646 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:19:26.257656 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:19:26.257668 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=14938440Ki pods=0 ephemeral-storage=0]"
I0904 13:19:26.257677 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=572m memory=3282Mi pods=0 ephemeral-storage=0]"
I0904 13:19:26.257689 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=2931Mi pods=0 ephemeral-storage=0 cpu=468m]"
I0904 13:19:26.263121 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/67e791a6-f170-4aff-b2b1-e42a0a6a7c3c-0" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:19:31.436220 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:19:31.436239 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=1720m memory=5049Mi pods=0]"
I0904 13:19:31.436251 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:19:31.436264 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=3650m memory=14938440Ki pods=0]"
I0904 13:19:31.436275 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=572m memory=3282Mi pods=0 ephemeral-storage=0]"
I0904 13:19:31.436288 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=14874430Ki pods=0 ephemeral-storage=0]"
I0904 13:19:31.440420 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/ffecd964-0982-4541-ba35-bc53ef1bcd52-0" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:19:40.139882 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:19:40.139904 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:19:40.139914 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:19:40.139922 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=3650m memory=14938440Ki]"
I0904 13:19:40.139930 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=15231428607 pods=0 ephemeral-storage=0 cpu=3650m]"
I0904 13:19:40.139942 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=3650m memory=14874430Ki]"
I0904 13:19:40.145351 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/scheduler-priority-avoid-pod-j72gx" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=3
I0904 13:19:40.146963 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=1895m memory=5579Mi]"
I0904 13:19:40.147004 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=1720m memory=5049Mi]"
I0904 13:19:40.147026 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:19:40.147046 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=14938440Ki pods=0 ephemeral-storage=0]"
I0904 13:19:40.147068 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=15231428607 pods=0 ephemeral-storage=0 cpu=3650m]"
I0904 13:19:40.147092 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=14976830Ki pods=0 ephemeral-storage=0 cpu=3750m]"
I0904 13:19:40.150793 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/scheduler-priority-avoid-pod-j6kgx" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=3
I0904 13:20:00.462039 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.462066 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.462079 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.462091 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=3650m memory=14938440Ki]"
I0904 13:20:00.462103 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=15231428607 pods=0 ephemeral-storage=0]"
I0904 13:20:00.462122 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=3650m memory=14874430Ki pods=0]"
I0904 13:20:00.466956 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-4521/filler-pod-5be568c0-1552-4f2e-8ff2-869c5cd610f7" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:20:00.541768 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=1895m memory=5579Mi]"
I0904 13:20:00.541793 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.541806 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.541818 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=6345m memory=14938440Ki pods=0 ephemeral-storage=0]"
I0904 13:20:00.541833 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=15231428607 pods=0 ephemeral-storage=0]"
I0904 13:20:00.541847 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=14874430Ki pods=0 ephemeral-storage=0]"
I0904 13:20:00.545296 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-4521/filler-pod-8e8309c7-b858-481c-8052-246ba24580d0" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:20:00.622148 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=5579Mi pods=0 ephemeral-storage=0 cpu=1895m]"
I0904 13:20:00.622175 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=5049Mi pods=0 ephemeral-storage=0 cpu=1720m]"
I0904 13:20:00.622188 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=1934m memory=5999Mi pods=0]"
I0904 13:20:00.622199 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=6345m memory=14938440Ki pods=0 ephemeral-storage=0]"
I0904 13:20:00.622213 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=15231428607 pods=0 ephemeral-storage=0]"
I0904 13:20:00.622229 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=6345m memory=14874430Ki]"
I0904 13:20:00.625867 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-4521/filler-pod-e98507a6-2221-4312-9f1b-34b7d1124b91" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:20:05.309985 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:20:05.310011 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:20:05.310021 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=1934m memory=5999Mi pods=0]"
I0904 13:20:05.310029 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3468m memory=3097Mi pods=0 ephemeral-storage=0]"
I0904 13:20:05.310037 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=3267m memory=3282Mi pods=0]"
I0904 13:20:05.310044 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=6345m memory=14874430Ki]"
I0904 13:20:05.314450 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-4521/additional-pod" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=2
```
From junit (what the e2e test thinks after scheduling the resource fillers):
```
Sep 4 13:20:00.412: INFO: Pod 67e791a6-f170-4aff-b2b1-e42a0a6a7c3c-0 requesting resource cpu=3182m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ead2872e-4ad3-4244-bb02-9f7ff0b6a428-0 requesting resource cpu=2877m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ffecd964-0982-4541-ba35-bc53ef1bcd52-0 requesting resource cpu=3078m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod tuned-cmmn2 requesting resource cpu=10m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod tuned-tljds requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod tuned-v48gt requesting resource cpu=10m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod dns-default-rc6ht requesting resource cpu=65m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod dns-default-tbqpg requesting resource cpu=65m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod dns-default-x9v97 requesting resource cpu=65m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod image-registry-7cf7d78fd6-4mxr9 requesting resource cpu=100m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-ca-72247 requesting resource cpu=10m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-ca-8nmtz requesting resource cpu=10m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-ca-zr995 requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod router-default-556cc9b846-8zrxd requesting resource cpu=100m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod router-default-556cc9b846-x8j5j requesting resource cpu=100m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod migrator-865bcf9488-tkjff requesting resource cpu=100m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod machine-config-daemon-7gnl4 requesting resource cpu=40m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod machine-config-daemon-n895c requesting resource cpu=40m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod machine-config-daemon-wf995 requesting resource cpu=40m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod certified-operators-c4kzf requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod certified-operators-whjv4 requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod community-operators-88cwp requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod community-operators-qvc86 requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod redhat-marketplace-dxk6r requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod redhat-marketplace-fnsfv requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod redhat-operators-n9vwc requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod redhat-operators-xz9qd requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod alertmanager-main-0 requesting resource cpu=8m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod alertmanager-main-1 requesting resource cpu=8m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod alertmanager-main-2 requesting resource cpu=8m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod grafana-5dc687d5df-mltbk requesting resource cpu=5m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod kube-state-metrics-747cdddc64-frwfw requesting resource cpu=4m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-exporter-mjnv9 requesting resource cpu=9m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-exporter-rklqh requesting resource cpu=9m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-exporter-w8279 requesting resource cpu=9m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod openshift-state-metrics-59c8cd5968-5rdz7 requesting resource cpu=3m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod prometheus-adapter-6b7f847c77-pc945 requesting resource cpu=1m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod prometheus-adapter-6b7f847c77-xs5kb requesting resource cpu=1m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod prometheus-k8s-0 requesting resource cpu=76m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod prometheus-k8s-1 requesting resource cpu=76m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod telemeter-client-657bd8cbd6-7ll4v requesting resource cpu=3m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod thanos-querier-5c57566d79-c5n5d requesting resource cpu=9m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod thanos-querier-5c57566d79-zc2sf requesting resource cpu=9m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod multus-9p7zj requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod multus-b6z24 requesting resource cpu=10m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod multus-h7dtf requesting resource cpu=10m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod network-metrics-daemon-9nvjv requesting resource cpu=20m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod network-metrics-daemon-cmnw6 requesting resource cpu=20m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod network-metrics-daemon-k2xxs requesting resource cpu=20m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ovs-k7n5z requesting resource cpu=100m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ovs-shgxf requesting resource cpu=100m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ovs-ssnc6 requesting resource cpu=100m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod sdn-sblm6 requesting resource cpu=110m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod sdn-wzgwj requesting resource cpu=110m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod sdn-xw75k requesting resource cpu=110m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
```
If you observe the e2e logs you can see 67e791a6-f170-4aff-b2b1-e42a0a6a7c3c-0,
ead2872e-4ad3-4244-bb02-9f7ff0b6a428-0 and ffecd964-0982-4541-ba35-bc53ef1bcd52-0 are still counted when computing the requested CPU. However, all three pods were terminated before the final e2e-sched-pred-4521/additional-pod was scheduled. The previous test, "[sig-scheduling] SchedulerPriorities [Serial] Pod should avoid nodes that have avoidPod annotation", does not clean up its created resources properly and does not wait until all three pods are terminated and deleted. As a result, the worker nodes are not equalized wrt. CPU, since the e2e code does not expect pods to be terminated during the test.
The issue is known. I just need to open a PR upstream and make sure the pods are deleted before the e2e test finishes. Adding UpcomingSprint in case the fix does not get merged by EOW.
I have updated https://github.com/kubernetes/kubernetes/pull/94684 so that the "validates resource limits of pods that are allowed to run" e2e test ignores all pods with a DeletionTimestamp set, instead of having the previous test wait longer for its filler pods to be deleted. There may be other tests that do not properly delete their pods, which may affect the resource computation.
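The gist of the fix: when the e2e sums up per-node requests, skip pods that are already terminating. A minimal sketch (plain Python for brevity; the field names are simplified stand-ins for the Pod API, where a set `metadata.deletionTimestamp` marks a terminating pod):

```python
def requested_cpu_by_node(pods):
    """Sum CPU requests (millicores) per node, ignoring pods whose deletion
    timestamp is set -- their resources are about to be freed, so counting
    them overstates node utilization."""
    totals = {}
    for pod in pods:
        if pod.get("deletion_timestamp") is not None:
            continue  # terminating pod left over from a previous test
        totals[pod["node"]] = totals.get(pod["node"], 0) + pod["cpu_m"]
    return totals

pods = [
    # leftover pod from the previous test, already terminating
    {"node": "worker-1", "cpu_m": 3182, "deletion_timestamp": "2020-09-04T13:19:55Z"},
    {"node": "worker-1", "cpu_m": 10, "deletion_timestamp": None},
]
print(requested_cpu_by_node(pods))  # {'worker-1': 10}
```

This makes the e2e's view converge toward the scheduler's: once a pod terminates, neither side counts its requests, so the filler-pod sizing stays correct even if a previous test leaves pods behind.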
If the proposed fix gets merged upstream in time, I will move the bug back to 4.6.
Still seeing this fail occasionally in CI: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.7/1336764367294697472
Need more time to address the issue properly upstream.
Still seeing the test fail; will wait another day before moving the bug back to the assigned state. https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-scheduling%5C%5D+SchedulerPredicates+%5C%5BSerial%5C%5D+validates+resource+limits+of+pods+that+are+allowed+to+run++%5C%5BConformance%5C%5D
Have not seen this failing in the last day, so moving the bug to the verified state. https://search.ci.openshift.org/?search=%5C%5Bsig-scheduling%5C%5D+SchedulerPredicates+%5C%5BSerial%5C%5D+validates+resource+limits+of+pods+that+are+allowed+to+run++%5C%5BConformance%5C%5D&maxAge=24h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633