Bug 1779303

Summary: serial CI failing with: SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run
Product: OpenShift Container Platform Reporter: W. Trevor King <wking>
Component: kube-schedulerAssignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA QA Contact: RamaKasturi <knarra>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.3.0CC: aos-bugs, maszulik, mfojtik, yinzhou
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-05-13 21:53:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description W. Trevor King 2019-12-03 17:41:45 UTC
Failures like [1]:

  fail [k8s.io/kubernetes/test/e2e/scheduling/predicates.go:749]: Unexpected error:
    <*errors.errorString | 0xc00028c1d0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
  occurred

  failed: (2m15s) 2019-12-02T20:56:59 "[sig-scheduling] SchedulerPredicates [Serial] validates resource limits of pods that are allowed to run  [Conformance] [Suite:openshift/conformance/serial/minimal] [Suite:k8s]"

This shows up in 8% of our serial failures, although ~50% of the serial failures are setup-time issues, so it's really ~16% of the serial failures that made it far enough to start running the e2e suite [2].

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.3/427
[2]: https://search.svc.ci.openshift.org/chart?name=serial&search=failed:.*validates%20resource%20limits%20of%20pods%20that%20are%20allowed%20to%20run&search=Container%20setup%20exited%20with%20code%20.,%20reason%20Error

Comment 1 Jan Chaloupka 2020-01-13 13:12:20 UTC
Checking occurrences of "validates resource limits of pods that are allowed to run":

[1] https://search.svc.ci.openshift.org/?search=validates+resource+limits+of+pods+that+are+allowed+to+run&maxAge=336h&context=2&type=all

Reported failures:
- release-openshift-origin-installer-e2e-aws-serial-4.4 #29 35 hours

Failed due to:
```
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:21 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(7ccc1871d3a4cb9304767f874467b5815bdf27164f7f7c18002ddbac8e9a8339): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:35 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(8941d6a67456a47467ce359ea5ddcb7d5e8f2dadcb5c0e369b0a61d5b39d7455): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:31:48 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(9ed80bd3da297d0ea5efc8b420362b421694b003a0967eecf37903d85e54275c): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:00 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(05858b36d9ceea1a324a90988ad8d7f9454a7e646e9e8b07f03eda7ff9139ca5): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:13 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(8d74a246740f45bce77d7c98361a3f2641243eb885e0dcac8fa9b3c8cbd3759d): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:27 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(6a52bb72ffbac3f3d955c9203862b625329ba302e544657686c5a4abd4b8e74b): netplugin failed with no error message
Jan 11 23:35:57.568: INFO: At 2020-01-11 23:32:40 +0000 UTC - event for filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc: {kubelet ip-10-0-134-77.ec2.internal} FailedCreatePodSandBox: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_filler-pod-c3c825d5-8ea6-4e0e-b37d-0d33e39216fc_e2e-sched-pred-9151_7bcf8102-6434-4661-8696-d1f71d1eae0c_0(ced44010c7fdefa2593314f46b651f1f7112120c2cff62c9efa5f49e90100d8c): netplugin failed with no error message
```

- release-openshift-ocp-installer-e2e-openstack-serial-4.3 #782 42 hours
Here, the scheduler were unexpectedly able to schedule three more pods. Checking the code I can see the pod template is actually setting just limits, missing requests part (https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/test/e2e/scheduling/predicates.go#L340-L342). Limit constrains the maximal amount of resources, not minimal. Plus, limits are read when priorities are computed, not predicates.

- release-openshift-ocp-installer-e2e-metal-serial-4.3 #689 2 days
Seems to be the same case as the previous (just limits, missing requests)

- release-openshift-origin-installer-e2e-gcp-serial-4.4 #710 2 days
Seems to be the same case as the previous (just limits, missing requests)

- pull-ci-openshift-origin-master-e2e-aws-serial #11669 6 days
Seems to be the same case as the previous (just limits, missing requests)

Comment 2 Jan Chaloupka 2020-01-13 14:04:34 UTC
In general, when the test passes, kube-scheduler logs:

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.4/370:
```
I0112 02:44:07.732967       1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:07.741405       1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:08.865687       1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 02:44:11.106238       1 factory.go:453] Unable to schedule e2e-sched-pred-6058/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.4/368:
```
I0112 00:39:03.446244       1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 00:39:04.587016       1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 00:39:07.107098       1 factory.go:453] Unable to schedule e2e-sched-pred-9787/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-serial-4.3/784:
```
I0112 01:02:57.656163       1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 01:02:58.785092       1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
I0112 01:03:00.900971       1 factory.go:545] Unable to schedule e2e-sched-pred-7986/additional-pod: no fit: 0/6 nodes are available: 3 Insufficient cpu, 3 node(s) had taints that the pod didn't tolerate.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/734:
```
I0112 00:56:23.946836       1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:24.967432       1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:27.844969       1 factory.go:453] Unable to schedule e2e-sched-pred-7953/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-serial-4.4/368
```
I0112 00:56:55.511789       1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:56.650100       1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
I0112 00:56:58.829270       1 factory.go:453] Unable to schedule e2e-sched-pred-1477/additional-pod: no fit: 0/6 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 5 Insufficient cpu.; waiting
```

In case it fails, kube-scheduler logs:

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24364/pull-ci-openshift-origin-master-e2e-aws-serial/11669:
```
I0106 18:47:51.747996       1 scheduler.go:751] pod e2e-sched-pred-8851/additional-pod is bound successfully on node "ip-10-0-135-144.ec2.internal", 6 nodes evaluated, 1 nodes were found feasible.
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.4/710:
```
I0110 18:11:03.795308       1 scheduler.go:751] pod e2e-sched-pred-5236/additional-pod is bound successfully on node "ci-op-kcwht-w-c-zr47n.c.openshift-gce-devel-ci.internal", 6 nodes evaluated, 1 nodes were found feasible.
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-metal-serial-4.3/689:
```
I0111 01:29:47.214258       1 scheduler.go:667] pod e2e-sched-pred-2745/additional-pod is bound successfully on node "worker-2.ci-op-7fbyp60w-6c024.origin-ci-int-aws.dev.rhcloud.com", 6 nodes evaluated, 1 nodes were found feasible. 
```

- from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-serial-4.3/782:
```
I0111 16:28:46.160042       1 scheduler.go:667] pod e2e-sched-pred-8927/additional-pod is bound successfully on node "lilmihkr-9f2ed-clrqw-worker-jl4jf", 6 nodes evaluated, 2 nodes were found feasible.
```

Based on the logs from both sides it seems the test passes by accident. Either due to taints not being tolerated or due to insufficient cpu. Unfortunately, junit does not store logs of passing tests so I can't validate what causes the insufficient cpu case. Most likely the filler pods that eat 70% of cpu. And/or test being run too soon (when some of the node taints are still set) or other tests not clearing their taints (at all or too late).

Comment 3 Jan Chaloupka 2020-01-14 12:19:38 UTC
Usage of the limits in the test is unclear to me. Asking upstream for guidance: https://github.com/kubernetes/kubernetes/issues/87191

Comment 4 Jan Chaloupka 2020-01-15 12:42:14 UTC
Upstream PR: https://github.com/kubernetes/kubernetes/pull/87242

Comment 8 errata-xmlrpc 2020-05-13 21:53:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581