Bug 1779303 - serial CI failing with: SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run
Summary: serial CI failing with: SchedulerPredicates [Serial] validates resource limit...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-12-03 17:41 UTC by W. Trevor King
Modified: 2020-05-13 21:53 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-13 21:53:49 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift origin pull 24498 (closed): bug 1779303: UPSTREAM: 87242: SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run - s... (last updated 2021-02-12 21:41:25 UTC)
- Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-13 21:53:50 UTC)

Description W. Trevor King 2019-12-03 17:41:45 UTC
Failures like [1]:

  fail [k8s.io/kubernetes/test/e2e/scheduling/predicates.go:749]: Unexpected error:
    <*errors.errorString | 0xc00028c1d0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
  occurred

  failed: (2m15s) 2019-12-02T20:56:59 "[sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run  [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]"

This shows up in 8% of our serial failures, although ~50% of the serial failures are setup-time issues, so it's really ~16% of the serial failures that made it far enough to start running the e2e suite [2].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.3/427
[2]: https://search.svc.ci.openshift.org/chart?name=serial&search=failed:.*validates%20resource%20limits%20of%20pods%20that%20are%20allowed%20to%20run&search=Container%20setup%20exited%20with%20code%20.,%20reason%20Error

Comment 1 Jan Chaloupka 2020-01-13 13:12:20 UTC
Checking occurrences of "validates resource limits of pods that are allowed to run":

[1] https://search.svc.ci.openshift.org/?search=validates+resource+limits+of+pods+that+are+allowed+to+run&maxAge=336h&context=2&type=all

Reported failures:
- release-openshift-origin-installer-e2e-aws-serial-4.4 #29 35 hours

Failed due to:
```
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:21 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(7ccc1871d3a4cb9304767f874467b5815bdf27164f7f7c18002ddbac8e9a8339): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:35 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(8941d6a67456a47467ce359ea5ddcb7d5e8f2dadcb5c0e369b0a61d5b39d7455): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:48 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(9ed80bd3da297d0ea5efc8b420362b421694b003a0967eecf37903d85e54275c): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:00 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(05858b36d9ceea1a324a90988ad8d7f9454a7e646e9e8b07f03eda7ff9139ca5): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:13 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(8d74a246740f45bce77d7c98361a3f2641243eb885e0dcac8fa9b3c8cbd3759d): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:27 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(6a52bb72ffbac3f3d955c9203862b625329ba302e544657686c5a4abd4b8e74b): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:40 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(ced44010c7fdefa2593314f46b651f1f7112120c2cff62c9efa5f49e90100d8c): netplugin failed with no error message
```

- release-openshift-ocp-installer-e2e-openstack-serial-4.3 #782 42 hours
Here, the scheduler was unexpectedly able to schedule three more pods. Checking the code, I can see the pod template actually sets only limits and is missing the requests part (https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/scheduling/predicates.go#L340-L342). A limit constrains the maximum amount of resources, not the minimum. Plus, limits are read when priorities are computed, not predicates (see the sketch after this list of jobs).

- release-openshift-ocp-installer-e2e-metal-serial-4.3 #689 2 days
Seems to be the same case as the previous (just limits, missing requests)

- release-openshift-origin-installer-e2e-gcp-serial-4.4 #710 2 days
Seems to be the same case as the previous (just limits, missing requests)

- pull-ci-openshift-origin-master-e2e-aws-serial #11669 6 days
Seems to be the same case as the previous (just limits, missing requests)
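
To make the limits-vs-requests point above concrete, here is a minimal, hypothetical Go sketch (not the code from predicates.go and not the upstream fix): a pod that sets only Limits caps its runtime usage but reserves nothing for the scheduler's fit check, whereas matching Requests make the filler pods actually consume schedulable capacity.
```
// Hypothetical sketch only; not the e2e test code and not the upstream fix.
// It illustrates why limits-only filler pods do not block the additional pod.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// fillerPodResources builds the resources section of a filler pod. With only
// Limits set, the scheduler's requests-based fit check ignores the pod; the
// matching Requests are what make the node look full.
func fillerPodResources(cpuMilli int64) v1.ResourceRequirements {
	cpu := *resource.NewMilliQuantity(cpuMilli, resource.DecimalSI)
	return v1.ResourceRequirements{
		Limits:   v1.ResourceList{v1.ResourceCPU: cpu},
		Requests: v1.ResourceList{v1.ResourceCPU: cpu},
	}
}

func main() {
	fmt.Printf("%+v\n", fillerPodResources(500))
}
```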

Comment 2 Jan Chaloupka 2020-01-13 14:04:34 UTC
In general, when the test passes, kube-scheduler logs:

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.4/370:
```
I0112 02:44:07.732967       1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:07.741405       1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:08.865687       1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:11.106238       1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/368:
```
I0112 00:39:03.446244       1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 00:39:04.587016       1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 00:39:07.107098       1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/784:
```
I0112 01:02:57.656163       1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 01:02:58.785092       1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 01:03:00.900971       1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/734:
```
I0112 00:56:23.946836       1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:24.967432       1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:27.844969       1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.4/368
```
I0112 00:56:55.511789       1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:56.650100       1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:58.829270       1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

When the test fails, kube-scheduler logs:

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24364/pull-ci-openshift-origin-master-e2e-aws-serial/11669:
```
I0106 18:47:51.747996       1 scheduler.go:751] pod e2e-sched-pred-8851/additional-pod is bound successfully on node "ip-10-0-135-144.ec2.internal", 6 nodes evaluated, 1 nodes were found feasible.
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/710:
```
I0110 18:11:03.795308       1 scheduler.go:751] pod e2e-sched-pred-5236/additional-pod is bound successfully on node "ci-op-kcwht-w-c-zr47n.c.openshift-gce-devel-ci.internal", 6 nodes evaluated, 1 nodes were found feasible.
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.3/689:
```
I0111 01:29:47.214258       1 scheduler.go:667] pod e2e-sched-pred-2745/additional-pod is bound successfully on node "worker-2.ci-op-7fbyp60w-6c024.origin-ci-int-aws.dev.rhcloud.com", 6 nodes evaluated, 1 nodes were found feasible. 
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/782:
```
I0111 16:28:46.160042       1 scheduler.go:667] pod e2e-sched-pred-8927/additional-pod is bound successfully on node "lilmihkr-9f2ed-clrqw-worker-jl4jf", 6 nodes evaluated, 2 nodes were found feasible.
```

Based on the logs from both sides, it seems the test passes by accident: either because of taints the pod does not tolerate, or because of insufficient cpu. Unfortunately, junit does not store logs of passing tests, so I can't validate what causes the insufficient-cpu case. Most likely it is the filler pods that eat 70% of the cpu, and/or the test being run too soon (when some node taints are still set), or other tests not clearing their taints (at all, or too late). A rough sketch of the requests-based check behind "Insufficient cpu" is below.
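
A rough, hypothetical sketch of the requests-based accounting behind the "Insufficient cpu" messages above (it ignores init containers, pod overhead and extended resources, and is not the scheduler's real implementation); the point is that limits never enter the calculation:
```
// Rough, hypothetical sketch; not the scheduler's real fit implementation.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// podCPURequestMilli sums the CPU requests of a pod's containers. Limits are
// never consulted, so limits-only filler pods leave the node looking empty.
func podCPURequestMilli(pod *v1.Pod) int64 {
	var total int64
	for _, c := range pod.Spec.Containers {
		if req, ok := c.Resources.Requests[v1.ResourceCPU]; ok {
			total += req.MilliValue()
		}
	}
	return total
}

// fitsCPU reports whether the incoming pod's CPU request still fits into the
// node's allocatable CPU once the requests of already-scheduled pods are
// accounted for.
func fitsCPU(allocatableMilli int64, scheduled []*v1.Pod, incoming *v1.Pod) bool {
	var used int64
	for _, p := range scheduled {
		used += podCPURequestMilli(p)
	}
	return used+podCPURequestMilli(incoming) <= allocatableMilli
}

func main() {
	fmt.Println(fitsCPU(4000, nil, &v1.Pod{}))
}
```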

Comment 3 Jan Chaloupka 2020-01-14 12:19:38 UTC
Usage of the limits in the test is unclear to me. Asking upstream for guidance: https://github.com/kubernetes/kubernetes/issues/87191

Comment 4 Jan Chaloupka 2020-01-15 12:42:14 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/87242

Comment 8 errata-xmlrpc 2020-05-13 21:53:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

