test: [sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run [Conformance] is failing frequently in CI; see search results: https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-scheduling%5C%5D+SchedulerPredicates+%5C%5BSerial%5C%5D+validates+resource+limits+of+pods+that+are+allowed+to+run++%5C%5BConformance%5C%5D

This is the number 2 serial flake in the last two days: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.6/1296119701423788032
From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.6/1296119701423788032/artifacts/e2e-aws-serial/pods/openshift-kube-scheduler_openshift-kube-scheduler-ip-10-0-134-89.us-west-1.compute.internal_kube-scheduler.log:
```
I0819 17:30:37.588242 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-1338/filler-pod-f202bfd0-5e18-4591-937f-352e9124db0e" node="ip-10-0-138-200.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=1
I0819 17:30:37.668452 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-1338/filler-pod-1c9c1bb8-dba8-41e4-a63a-a1946422ed55" node="ip-10-0-153-85.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=1
I0819 17:30:37.753573 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-1338/filler-pod-d0f8b31d-a8ca-4be1-bea6-c430ce2471ba" node="ip-10-0-255-251.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=1
I0819 17:30:42.350926 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-1338/additional-pod" node="ip-10-0-153-85.us-west-1.compute.internal" evaluatedNodes=6 feasibleNodes=2
```
All 3 worker nodes should have >= 70% of their CPU allocated (the filler pod on each node, together with the rest of the scheduled pods, consumes 70% of allocatable CPU), so additional-pod should really stay pending (it requests 50% of allocatable CPU). Possibly some pods are terminated between the filler pods being scheduled and the additional-pod being scheduled (which can take up to 5s).
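The arithmetic the test relies on can be sketched as follows (a minimal illustration in plain Python, not the actual e2e code; the 4000m allocatable figure and the helper name are made up for the example):

```python
def filler_cpu_millis(allocatable_m, already_requested_m, target_fraction=0.7):
    """CPU request (in millicores) for a filler pod so that the node ends up
    at target_fraction of its allocatable CPU."""
    return max(int(allocatable_m * target_fraction) - already_requested_m, 0)

# Hypothetical 4-core worker (4000m allocatable) already running 773m of pods:
filler = filler_cpu_millis(4000, 773)   # filler pod requests 2027m
remaining = 4000 - (773 + filler)       # 1200m (30%) left on the node

# additional-pod requests 50% of allocatable (2000m), which exceeds the 1200m
# remaining, so it should stay Pending -- unless some of the counted pods
# terminate in the meantime and free their requests.
print(filler, remaining)
```

If any pod counted into `already_requested_m` terminates between the filler pods being bound and additional-pod being evaluated, the node effectively frees that CPU and additional-pod may fit after all, which is exactly the flake observed above.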
Without exposing the scheduler's internals (e.g. the node cache or the list of known pods), it is quite non-trivial to figure out why the resource utilization computed by the e2e exceeds the node's capacity. I suggest creating a new scheduling plugin which will log such information and will be disabled in production (see https://github.com/kubernetes/kubernetes/issues/94192). In case the suggested solution is too insecure or is rejected upstream, we can still enable the plugin downstream and remove it before the code gets released into production.
Introducing a debugger plugin upstream which can at least show the discrepancy between the resource consumption assumed by the scheduler and the consumption assumed by the flaking e2e: https://github.com/kubernetes/kubernetes/pull/94423. We might temporarily merge it downstream to get the difference in assumptions.
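What such a plugin reports is essentially the scheduler's per-node view of assumed requests. A rough sketch of that aggregation (illustrative only; the real plugin reads the scheduler's NodeInfo cache in Go, and the function and field names here are invented):

```python
def assumed_requests(pods_by_node):
    """Aggregate the requests the scheduler believes each node is serving --
    the same numbers the debugger plugin prints as 'Dumping node infos'."""
    totals = {}
    for node, pods in pods_by_node.items():
        cpu_m = sum(p["cpu_m"] for p in pods)     # millicores
        mem_mi = sum(p["mem_mi"] for p in pods)   # MiB
        totals[node] = (cpu_m, mem_mi)
        print(f'node="{node}" requested="[cpu={cpu_m}m memory={mem_mi}Mi]"')
    return totals

assumed_requests({
    "worker-0": [{"cpu_m": 773, "mem_mi": 3097}],
    "worker-1": [{"cpu_m": 468, "mem_mi": 2931}],
})
```

Comparing this dump against the sums the e2e computes from the pod list is what exposes the disagreement between the two views.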
From https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.6/1301858872989847552/artifacts/e2e-metal-serial/pods/openshift-kube-scheduler_openshift-kube-scheduler-master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com_kube-scheduler.log (what the scheduler sees):
```
I0904 13:19:21.079071 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=5579Mi pods=0 ephemeral-storage=0 cpu=1895m]"
I0904 13:19:21.079095 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:19:21.079108 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=1934m memory=5999Mi pods=0]"
I0904 13:19:21.079121 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=773m memory=3097Mi pods=0 ephemeral-storage=0]"
I0904 13:19:21.079133 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=3282Mi pods=0 ephemeral-storage=0 cpu=572m]"
I0904 13:19:21.079145 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=468m memory=2931Mi pods=0 ephemeral-storage=0]"
I0904 13:19:21.083259 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/ead2872e-4ad3-4244-bb02-9f7ff0b6a428-0" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:19:26.257626 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:19:26.257646 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:19:26.257656 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:19:26.257668 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=14938440Ki pods=0 ephemeral-storage=0]"
I0904 13:19:26.257677 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=572m memory=3282Mi pods=0 ephemeral-storage=0]"
I0904 13:19:26.257689 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=2931Mi pods=0 ephemeral-storage=0 cpu=468m]"
I0904 13:19:26.263121 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/67e791a6-f170-4aff-b2b1-e42a0a6a7c3c-0" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:19:31.436220 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:19:31.436239 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=1720m memory=5049Mi pods=0]"
I0904 13:19:31.436251 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:19:31.436264 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=3650m memory=14938440Ki pods=0]"
I0904 13:19:31.436275 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=572m memory=3282Mi pods=0 ephemeral-storage=0]"
I0904 13:19:31.436288 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=14874430Ki pods=0 ephemeral-storage=0]"
I0904 13:19:31.440420 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/ffecd964-0982-4541-ba35-bc53ef1bcd52-0" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:19:40.139882 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:19:40.139904 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:19:40.139914 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:19:40.139922 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=3650m memory=14938440Ki]"
I0904 13:19:40.139930 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=15231428607 pods=0 ephemeral-storage=0 cpu=3650m]"
I0904 13:19:40.139942 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=3650m memory=14874430Ki]"
I0904 13:19:40.145351 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/scheduler-priority-avoid-pod-j72gx" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=3
I0904 13:19:40.146963 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=1895m memory=5579Mi]"
I0904 13:19:40.147004 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=1720m memory=5049Mi]"
I0904 13:19:40.147026 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:19:40.147046 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=14938440Ki pods=0 ephemeral-storage=0]"
I0904 13:19:40.147068 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=15231428607 pods=0 ephemeral-storage=0 cpu=3650m]"
I0904 13:19:40.147092 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=14976830Ki pods=0 ephemeral-storage=0 cpu=3750m]"
I0904 13:19:40.150793 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-priority-3802/scheduler-priority-avoid-pod-j6kgx" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=3
I0904 13:20:00.462039 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.462066 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.462079 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.462091 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=3650m memory=14938440Ki]"
I0904 13:20:00.462103 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=15231428607 pods=0 ephemeral-storage=0]"
I0904 13:20:00.462122 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=3650m memory=14874430Ki pods=0]"
I0904 13:20:00.466956 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-4521/filler-pod-5be568c0-1552-4f2e-8ff2-869c5cd610f7" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:20:00.541768 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=1895m memory=5579Mi]"
I0904 13:20:00.541793 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.541806 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1934m memory=5999Mi pods=0 ephemeral-storage=0]"
I0904 13:20:00.541818 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=6345m memory=14938440Ki pods=0 ephemeral-storage=0]"
I0904 13:20:00.541833 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=15231428607 pods=0 ephemeral-storage=0]"
I0904 13:20:00.541847 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=14874430Ki pods=0 ephemeral-storage=0]"
I0904 13:20:00.545296 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-4521/filler-pod-8e8309c7-b858-481c-8052-246ba24580d0" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:20:00.622148 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=5579Mi pods=0 ephemeral-storage=0 cpu=1895m]"
I0904 13:20:00.622175 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[memory=5049Mi pods=0 ephemeral-storage=0 cpu=1720m]"
I0904 13:20:00.622188 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=1934m memory=5999Mi pods=0]"
I0904 13:20:00.622199 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=6345m memory=14938440Ki pods=0 ephemeral-storage=0]"
I0904 13:20:00.622213 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3650m memory=15231428607 pods=0 ephemeral-storage=0]"
I0904 13:20:00.622229 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=6345m memory=14874430Ki]"
I0904 13:20:00.625867 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-4521/filler-pod-e98507a6-2221-4312-9f1b-34b7d1124b91" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=1
I0904 13:20:05.309985 1 debugger.go:65] "Dumping node infos" node="master-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1895m memory=5579Mi pods=0 ephemeral-storage=0]"
I0904 13:20:05.310011 1 debugger.go:65] "Dumping node infos" node="master-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=1720m memory=5049Mi pods=0 ephemeral-storage=0]"
I0904 13:20:05.310021 1 debugger.go:65] "Dumping node infos" node="master-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=1934m memory=5999Mi pods=0]"
I0904 13:20:05.310029 1 debugger.go:65] "Dumping node infos" node="worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[cpu=3468m memory=3097Mi pods=0 ephemeral-storage=0]"
I0904 13:20:05.310037 1 debugger.go:65] "Dumping node infos" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[ephemeral-storage=0 cpu=3267m memory=3282Mi pods=0]"
I0904 13:20:05.310044 1 debugger.go:65] "Dumping node infos" node="worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" requested="[pods=0 ephemeral-storage=0 cpu=6345m memory=14874430Ki]"
I0904 13:20:05.314450 1 scheduler.go:598] "Successfully bound pod to node" pod="e2e-sched-pred-4521/additional-pod" node="worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com" evaluatedNodes=6 feasibleNodes=2
```
From junit (what the e2e test thinks after scheduling the resource fillers):
```
Sep 4 13:20:00.412: INFO: Pod 67e791a6-f170-4aff-b2b1-e42a0a6a7c3c-0 requesting resource cpu=3182m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ead2872e-4ad3-4244-bb02-9f7ff0b6a428-0 requesting resource cpu=2877m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ffecd964-0982-4541-ba35-bc53ef1bcd52-0 requesting resource cpu=3078m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod tuned-cmmn2 requesting resource cpu=10m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod tuned-tljds requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod tuned-v48gt requesting resource cpu=10m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod dns-default-rc6ht requesting resource cpu=65m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod dns-default-tbqpg requesting resource cpu=65m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod dns-default-x9v97 requesting resource cpu=65m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod image-registry-7cf7d78fd6-4mxr9 requesting resource cpu=100m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-ca-72247 requesting resource cpu=10m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-ca-8nmtz requesting resource cpu=10m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-ca-zr995 requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod router-default-556cc9b846-8zrxd requesting resource cpu=100m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod router-default-556cc9b846-x8j5j requesting resource cpu=100m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod migrator-865bcf9488-tkjff requesting resource cpu=100m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod machine-config-daemon-7gnl4 requesting resource cpu=40m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod machine-config-daemon-n895c requesting resource cpu=40m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod machine-config-daemon-wf995 requesting resource cpu=40m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod certified-operators-c4kzf requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod certified-operators-whjv4 requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod community-operators-88cwp requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod community-operators-qvc86 requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod redhat-marketplace-dxk6r requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod redhat-marketplace-fnsfv requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod redhat-operators-n9vwc requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod redhat-operators-xz9qd requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod alertmanager-main-0 requesting resource cpu=8m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod alertmanager-main-1 requesting resource cpu=8m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod alertmanager-main-2 requesting resource cpu=8m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod grafana-5dc687d5df-mltbk requesting resource cpu=5m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod kube-state-metrics-747cdddc64-frwfw requesting resource cpu=4m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-exporter-mjnv9 requesting resource cpu=9m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-exporter-rklqh requesting resource cpu=9m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod node-exporter-w8279 requesting resource cpu=9m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod openshift-state-metrics-59c8cd5968-5rdz7 requesting resource cpu=3m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod prometheus-adapter-6b7f847c77-pc945 requesting resource cpu=1m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod prometheus-adapter-6b7f847c77-xs5kb requesting resource cpu=1m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod prometheus-k8s-0 requesting resource cpu=76m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod prometheus-k8s-1 requesting resource cpu=76m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod telemeter-client-657bd8cbd6-7ll4v requesting resource cpu=3m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod thanos-querier-5c57566d79-c5n5d requesting resource cpu=9m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod thanos-querier-5c57566d79-zc2sf requesting resource cpu=9m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod multus-9p7zj requesting resource cpu=10m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod multus-b6z24 requesting resource cpu=10m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod multus-h7dtf requesting resource cpu=10m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod network-metrics-daemon-9nvjv requesting resource cpu=20m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod network-metrics-daemon-cmnw6 requesting resource cpu=20m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod network-metrics-daemon-k2xxs requesting resource cpu=20m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ovs-k7n5z requesting resource cpu=100m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ovs-shgxf requesting resource cpu=100m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod ovs-ssnc6 requesting resource cpu=100m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod sdn-sblm6 requesting resource cpu=110m on Node worker-2.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod sdn-wzgwj requesting resource cpu=110m on Node worker-0.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
Sep 4 13:20:00.412: INFO: Pod sdn-xw75k requesting resource cpu=110m on Node worker-1.ci-op-qfypnb1x-18a82.origin-ci-int-aws.dev.rhcloud.com
```
If you observe the e2e logs you can see 67e791a6-f170-4aff-b2b1-e42a0a6a7c3c-0,
ead2872e-4ad3-4244-bb02-9f7ff0b6a428-0 and ffecd964-0982-4541-ba35-bc53ef1bcd52-0 are still counted when computing the requested CPU. However, all three pods were terminated before the final e2e-sched-pred-4521/additional-pod was scheduled. The previous test, "[sig-scheduling] SchedulerPriorities [Serial] Pod should avoid nodes that have avoidPod annotation", does not clean up its created resources properly and does not wait until all three pods are terminated and deleted. As a result, the worker nodes are not equalized wrt. CPU, since the e2e code does not expect pods to be terminated during the test.
The issue is known. I just need to open a PR upstream and make sure the pods are deleted before the e2e test finishes. Adding UpcomingSprint in case the fix does not get merged by EOW.
I have updated https://github.com/kubernetes/kubernetes/pull/94684 so that the "validates resource limits of pods that are allowed to run" e2e test ignores all pods with a DeletionTimestamp set, instead of having the previous test wait longer for its filler pods to be deleted. There may be other tests that do not properly delete their pods, which may affect the resource computation.
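The gist of the fix: when the e2e sums up per-node requests, skip pods that are already terminating. A minimal sketch (plain Python for brevity; the field names are simplified stand-ins for the Pod API, where a set `metadata.deletionTimestamp` marks a terminating pod):

```python
def requested_cpu_by_node(pods):
    """Sum CPU requests (millicores) per node, ignoring pods whose deletion
    timestamp is set -- their resources are about to be freed, so counting
    them overstates node utilization."""
    totals = {}
    for pod in pods:
        if pod.get("deletion_timestamp") is not None:
            continue  # terminating pod left over from a previous test
        totals[pod["node"]] = totals.get(pod["node"], 0) + pod["cpu_m"]
    return totals

pods = [
    # leftover pod from the previous test, already terminating
    {"node": "worker-1", "cpu_m": 3182, "deletion_timestamp": "2020-09-04T13:19:55Z"},
    {"node": "worker-1", "cpu_m": 10, "deletion_timestamp": None},
]
print(requested_cpu_by_node(pods))  # {'worker-1': 10}
```

This makes the e2e's view converge toward the scheduler's: once a pod terminates, neither side counts its requests, so the filler-pod sizing stays correct even if a previous test leaves pods behind.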
If the proposed fix gets merged upstream in time, I will move the bug back to 4.6.
Still seeing this fail occasionally in CI: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.7/1336764367294697472
Need more time to address the issue properly upstream.
Still seeing the test fail; will wait another day before moving the bug back to the assigned state. https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-scheduling%5C%5D+SchedulerPredicates+%5C%5BSerial%5C%5D+validates+resource+limits+of+pods+that+are+allowed+to+run++%5C%5BConformance%5C%5D
Have not seen this failing in the last day, so moving the bug to the verified state. https://search.ci.openshift.org/?search=%5C%5Bsig-scheduling%5C%5D+SchedulerPredicates+%5C%5BSerial%5C%5D+validates+resource+limits+of+pods+that+are+allowed+to+run++%5C%5BConformance%5C%5D&maxAge=24h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633