Bug 1779303
Summary: | serial CI failing with: SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking>
Component: | kube-scheduler | Assignee: | Jan Chaloupka <jchaloup>
Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.3.0 | CC: | aos-bugs, maszulik, mfojtik, yinzhou
Target Milestone: | --- | |
Target Release: | 4.4.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-05-13 21:53:49 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
W. Trevor King
2019-12-03 17:41:45 UTC
Checking occurrences of "validates resource limits of pods that are allowed to run": [1] https://search.svc.ci.openshift.org/?search=validates+resource+limits+of+pods+that+are+allowed+to+run&maxAge=336h&context=2&type=all Reported failures: - release-openshift-origin-installer-e2e-aws-serial-4.4 #29 35 hours Failed due to: ``` Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:21 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(7ccc1871d3a4cb9304767f874467b5815bdf27164f7f7c18002ddbac8e9a8339): netplugin failed with no error message Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:35 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(8941d6a67456a47467ce359ea5ddcb7d5e8f2dadcb5c0e369b0a61d5b39d7455): netplugin failed with no error message Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:48 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(9ed80bd3da297d0ea5efc8b420362b421694b003a0967eecf37903d85e54275c): netplugin failed with no error message Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:00 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(05858b36d9ceea1a324a90988ad8d7f9454a7e646e9e8b07f03eda7ff9139ca5): netplugin failed with no error message Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:13 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(8d74a246740f45bce77d7c98361a3f2641243eb885e0dcac8fa9b3c8cbd3759d): netplugin failed with no error message Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:27 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(6a52bb72ffbac3f3d955c9203862b625329ba302e544657686c5a4abd4b8e74b): netplugin failed with no error message Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:40 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: (combined from similar events): Failed to 
create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(ced44010c7fdefa2593314f46b651f1f7112120c2cff62c9efa5f49e90100d8c): netplugin failed with no error message ``` - release-openshift-ocp-installer-e2e-openstack-serial-4.3 #782 42 hours Here, the scheduler were unexpectedly able to schedule three more pods. Checking the code I can see the pod template is actually setting just limits, missing requests part (https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/scheduling/predicates.go#L340-L342). Limit constrains the maximal amount of resources, not minimal. Plus, limits are read when priorities are computed, not predicates. - release-openshift-ocp-installer-e2e-metal-serial-4.3 #689 2 days Seems to be the same case as the previous (just limits, missing requests) - release-openshift-origin-installer-e2e-gcp-serial-4.4 #710 2 days Seems to be the same case as the previous (just limits, missing requests) - pull-ci-openshift-origin-master-e2e-aws-serial #11669 6 days Seems to be the same case as the previous (just limits, missing requests) In general, when the test passes, kube-scheduler logs: - from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.4/370: ``` I0112 02:44:07.732967 1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting I0112 02:44:07.741405 1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting I0112 02:44:08.865687 1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting I0112 02:44:11.106238 1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting ``` - from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/368: ``` I0112 00:39:03.446244 1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting I0112 00:39:04.587016 1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting I0112 00:39:07.107098 1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting ``` - from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/784: ``` I0112 01:02:57.656163 1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting I0112 01:02:58.785092 1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't 
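To make the limits-versus-requests point concrete, here is a minimal sketch of the two shapes of pod resource spec under discussion. This is not the e2e test code; the 600m value and variable names are illustrative, and it assumes, per the analysis above, that requests really are absent at scheduling time in the failing case:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// What the filler pod template sets today: limits only. A limit caps
	// what the container may consume; it does not, by itself, tell the
	// scheduler's fit predicate how much CPU to reserve on the node.
	limitsOnly := v1.ResourceRequirements{
		Limits: v1.ResourceList{
			v1.ResourceCPU: resource.MustParse("600m"),
		},
	}

	// The direction of the upstream fix: explicit requests, which is what
	// the scheduler accounts against when deciding whether the extra
	// "additional-pod" fits.
	withRequests := v1.ResourceRequirements{
		Limits: v1.ResourceList{
			v1.ResourceCPU: resource.MustParse("600m"),
		},
		Requests: v1.ResourceList{
			v1.ResourceCPU: resource.MustParse("600m"),
		},
	}

	fmt.Printf("limits only:   %v\n", limitsOnly.Limits.Cpu())
	fmt.Printf("with requests: %v\n", withRequests.Requests.Cpu())
}
```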
In general, when the test passes, kube-scheduler logs:

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.4/370:

```
I0112 02:44:07.732967 1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:07.741405 1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:08.865687 1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:11.106238 1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/368:

```
I0112 00:39:03.446244 1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 00:39:04.587016 1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 00:39:07.107098 1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/784:

```
I0112 01:02:57.656163 1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 01:02:58.785092 1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 01:03:00.900971 1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/734:

```
I0112 00:56:23.946836 1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:24.967432 1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:27.844969 1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.4/368:

```
I0112 00:56:55.511789 1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:56.650100 1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:58.829270 1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

When the test fails, kube-scheduler logs:

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24364/pull-ci-openshift-origin-master-e2e-aws-serial/11669:

```
I0106 18:47:51.747996 1 scheduler.go:751] pod e2e-sched-pred-8851/additional-pod is bound successfully on node "ip-10-0-135-144.ec2.internal", 6 nodes evaluated, 1 nodes were found feasible.
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/710:

```
I0110 18:11:03.795308 1 scheduler.go:751] pod e2e-sched-pred-5236/additional-pod is bound successfully on node "ci-op-kcwht-w-c-zr47n.c.openshift-gce-devel-ci.internal", 6 nodes evaluated, 1 nodes were found feasible.
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.3/689:

```
I0111 01:29:47.214258 1 scheduler.go:667] pod e2e-sched-pred-2745/additional-pod is bound successfully on node "worker-2.ci-op-7fbyp60w-6c024.origin-ci-int-aws.dev.rhcloud.com", 6 nodes evaluated, 1 nodes were found feasible.
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/782:

```
I0111 16:28:46.160042 1 scheduler.go:667] pod e2e-sched-pred-8927/additional-pod is bound successfully on node "lilmihkr-9f2ed-clrqw-worker-jl4jf", 6 nodes evaluated, 2 nodes were found feasible.
```

Based on the logs from both sides, it seems the test passes by accident: either due to taints not being tolerated, or due to insufficient CPU. Unfortunately, junit does not store logs of passing tests, so I can't validate what causes the insufficient-CPU case. Most likely it is the filler pods that eat 70% of the CPU, and/or the test being run too soon (while some of the node taints are still set), or other tests not clearing their taints (at all, or too late).
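For reference on the filler-pod mechanism mentioned above: the test books CPU on the nodes and then expects the additional pod to be rejected with "Insufficient cpu". Below is a rough sketch of that accounting with a hypothetical helper and illustrative numbers; the real test derives its amounts from node allocatable and already-requested CPU, and the exact fraction it fills is not visible from this report:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// fillerCPU computes how much CPU a filler pod would need to request so
// that a node ends up fully booked: allocatable minus what is already
// requested by pods on the node. All inputs are in millicores.
// (Hypothetical helper for illustration, not the e2e code.)
func fillerCPU(allocatableMilli, alreadyRequestedMilli int64) *resource.Quantity {
	free := allocatableMilli - alreadyRequestedMilli
	if free < 0 {
		free = 0
	}
	return resource.NewMilliQuantity(free, resource.DecimalSI)
}

func main() {
	// Illustrative numbers: a 4-core node with 1500m already requested by
	// system pods leaves 2500m for the filler pod. Once fillers like this
	// are scheduled on every worker, the extra "additional-pod" should hit
	// "Insufficient cpu" everywhere, which is the passing-path log above.
	q := fillerCPU(4000, 1500)
	fmt.Println("filler pod CPU request:", q.String()) // prints "2500m"
}
```

If the filler pods carry this amount only as a limit rather than a request, the scheduler's bookkeeping never sees it, which matches the "bound successfully" logs in the failing runs.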
Usage of the limits in the test is unclear to me. Asking upstream for guidance: https://github.com/kubernetes/kubernetes/issues/87191

Upstream PR: https://github.com/kubernetes/kubernetes/pull/87242

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581