Bug 1734486 - Configure new Scheduler Policy should take effect
Summary: Configure new Scheduler Policy should take effect
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2.0
Assignee: Mike Dame
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1729918
 
Reported: 2019-07-30 16:40 UTC by Mike Dame
Modified: 2019-10-16 06:34 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1729918
Environment:
Last Closed: 2019-10-16 06:34:08 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:2922 0 None None None 2019-10-16 06:34:20 UTC

Comment 1 Mike Dame 2019-07-30 16:42:54 UTC
Opening this to confirm that this problem no longer exists in 4.2 after https://github.com/openshift/cluster-kube-scheduler-operator/pull/136, so that we can justify backporting the changes in https://github.com/openshift/cluster-kube-scheduler-operator/pull/158

Comment 7 ge liu 2019-08-02 03:45:10 UTC
Regarding comment 6, please don't miss the last character '=' in the kubeconfig's URL.

Comment 8 Mike Dame 2019-08-09 14:01:44 UTC
Ravi, could you please take a look at this? Since we see the log line change after editing the policy configmap (indicating that the scheduler is aware of the new policy), the policy should be taking effect. If it is not, wouldn't that indicate a problem with the Kubernetes scheduler itself?

Comment 10 Mike Dame 2019-08-21 20:00:58 UTC
I’ve done a lot of debugging on this, and I think I figured out why, in your tests, the pod isn’t landing on the node you expect. The policy *is* taking effect (which I’ll demonstrate below), but the issue is that scheduler priorities are not inherently deterministic; they only apply formulas that attempt to calculate a normalized priority score for each node. This score can only be between 0 and 10, and the scheduler does a lot of rounding, so in some cases (especially small test cases such as this one) it’s very likely that a “desired” node and an “undesired” node ultimately end up with the same score.

I can show this using two simpler priorities: `LeastRequestedPriority` and `MostRequestedPriority`.

My cluster started with 3 nodes:
----------------------------
Node1:
$ oc describe node ip-10-0-129-96.us-west-2.compute.internal
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1110m (74%)   300m (20%)
  memory                      2429Mi (34%)  587Mi (8%)

Node2:
$ oc describe node ip-10-0-157-205.us-west-2.compute.internal
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         840m (56%)    100m (6%)
  memory                      1821Mi (25%)  537Mi (7%)

Node3:
$ oc describe node ip-10-0-167-240.us-west-2.compute.internal
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1210m (80%)   300m (20%)
  memory                      2841Mi (40%)  587Mi (8%)
--------------------------------

While these appear to be different enough to score different priorities (e.g., for LeastRequested a pod should always land on Node2), they are not. I found this by logging the node scores in the scheduler (see [1]):
---------------------------------
I0821 18:23:26.443610       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1740 millicores 3796893696 memory bytes, score 2

I0821 18:23:26.443636       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1610 millicores 3817865216 memory bytes, score 2

I0821 18:23:26.443645       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1610 millicores 3595567104 memory bytes, score 2
--------------------------------------

You can see that, in the scheduler, the total requested CPU for each node is actually higher than the allocatable (to be honest I don’t know why this is, but it makes following the calculations easier, because the CPU score then defaults to `0` [2][3]). So in that case, if we plug the remaining numbers into the scoring functions [4][5], the result rounds out to 2 for each node, making each node equally schedulable.
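
To make the rounding concrete, here is a small self-contained Go sketch (my own approximation, not the scheduler’s actual code) that plugs the numbers from the log lines above into the least-requested formula as I read [1] and [4]; the final (cpu + mem) / 2 averaging is my reading of resource_allocation.go, so treat the details as approximate:
```
package main

import "fmt"

// maxPriority mirrors the scheduler's 0-10 score range.
const maxPriority = 10

// leastRequestedScore is my reading of the formula in [4]: the unused
// fraction of a resource, scaled to 0-10 with truncating integer division.
// A request larger than capacity scores 0, per [2][3].
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return ((capacity - requested) * maxPriority) / capacity
}

func main() {
	type node struct {
		name           string
		cpuReq, cpuCap int64 // millicores
		memReq, memCap int64 // bytes
	}
	// Values copied from the LeastResourceAllocation log lines above.
	nodes := []node{
		{"ip-10-0-157-205", 1740, 1500, 3796893696, 7420715008},
		{"ip-10-0-167-240", 1610, 1500, 3817865216, 7420715008},
		{"ip-10-0-129-96", 1610, 1500, 3595567104, 7420715008},
	}
	for _, n := range nodes {
		cpu := leastRequestedScore(n.cpuReq, n.cpuCap) // 0: request exceeds capacity
		mem := leastRequestedScore(n.memReq, n.memCap) // 4 or 5 after truncation
		// resource_allocation.go averages the two scores, again with
		// integer division, which is where the last bit of detail is lost.
		fmt.Printf("%s: cpu=%d mem=%d final=%d\n", n.name, cpu, mem, (cpu+mem)/2)
	}
}
```
Running it prints final=2 for all three nodes, matching the logged scores.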

Creating a filler pod to take up significantly more space on a node produces new values for Node1:
$ cat filler-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: filler-pod
spec:
  containers:
    - image: "docker.io/ocpqe/hello-pod"
      name: hello-pod
      resources:
        requests:
          memory: "2000Mi"

$ oc describe node ip-10-0-129-96.us-west-2.compute.internal
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1110m (74%)   300m (20%)
  memory                      4429Mi (62%)  587Mi (8%)

Now this is enough of a difference for the scheduler to produce different scores for each node:
---------------------
I0821 18:50:23.651263       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: MostResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1610 millicores 3817865216 memory bytes, score 2

I0821 18:50:23.651284       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: MostResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1710 millicores 5692719104 memory bytes, score 3

I0821 18:50:23.651292       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: MostResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1740 millicores 3796893696 memory bytes, score 2
----------------

Here you can see that while using `MostRequestedPriority`, Node1 always gets a higher score, and the pod is always scheduled onto that node:

$ oc get -o wide all
NAME                     READY   STATUS              RESTARTS   AGE     IP           NODE                                        NOMINATED NODE   READINESS GATES
pod/empty-operator-pod   0/1     ContainerCreating   0          3s      <none>       ip-10-0-129-96.us-west-2.compute.internal   <none>           <none>
pod/filler-pod           1/1     Running             0          2m27s   10.131.0.8   ip-10-0-129-96.us-west-2.compute.internal   <none>           <none>

Additionally, when the scheduler config is flipped to use `LeastRequestedPriority`, the node scores are now all consistently flipped!
---------------------
I0821 19:06:32.022833       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1740 millicores 3796893696 memory bytes, score 2

I0821 19:06:32.022833       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1610 millicores 3817865216 memory bytes, score 2

I0821 19:06:32.022873       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1710 millicores 5692719104 memory bytes, score 1
-----------------------

Now, Node1 always has the *lowest* score, and the pod is accordingly not scheduled onto it.
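
(If I’ve read the scoring code correctly, the same integer arithmetic reproduces these scores: with the filler pod, Node1’s memory gives 5692719104*10/7420715008 = 7 under MostRequested, which averages with the CPU score of 0 down to 3, and (7420715008-5692719104)*10/7420715008 = 2 under LeastRequested, which averages down to 1, matching the logged 3 and 1.)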

`ImageLocalityPriority`, which was first tested here, suffers from similar rounding and normalization woes (see its calculation functions at [6]). It is based on the number of nodes that have the image, the total number of nodes, and the size of the image itself, all of which, in our case, are small enough that I believe they might always produce a priority score of 0, the minimum, which puts the node that has the image at the same priority as any node that does not.
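
For reference, here is a rough, self-contained Go sketch of how I read the scoring in [6]; the 23MB/1000MB thresholds and the node-spread scaling come from that file, but this is my own simplification, and the image sizes and node counts below are illustrative, not measured:
```
package main

import "fmt"

// Thresholds as I read them in image_locality.go [6]; treat as approximate.
const (
	mb           int64 = 1024 * 1024
	minThreshold       = 23 * mb   // scaled sizes below this score 0
	maxThreshold       = 1000 * mb // scaled sizes at or above this score 10
	maxPriority        = 10
)

// scaledImageSize spreads the image size over the cluster: size * (nodes
// that already have the image / total nodes considered).
func scaledImageSize(sizeBytes int64, nodesWithImage, totalNodes int) int64 {
	return int64(float64(sizeBytes) * float64(nodesWithImage) / float64(totalNodes))
}

// priority maps the scaled size onto 0-10 between the two thresholds,
// truncating to an integer just like the requested-resource priorities.
func priority(scaled int64) int64 {
	if scaled < minThreshold {
		scaled = minThreshold
	} else if scaled > maxThreshold {
		scaled = maxThreshold
	}
	return maxPriority * (scaled - minThreshold) / (maxThreshold - minThreshold)
}

func main() {
	// Illustrative numbers only: a ~150MB image present on 1 of 6 nodes
	// scales to ~25MB, barely over the lower threshold -> score 0.
	fmt.Println(priority(scaledImageSize(150*mb, 1, 6)))
	// A ~1GB image on 1 of 6 nodes scales to ~170MB -> score 1.
	fmt.Println(priority(scaledImageSize(1024*mb, 1, 6)))
}
```
With numbers in this ballpark, the scaled size barely clears the lower threshold, so the score truncates to 0 or 1.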

However, I believe that by using `LeastRequestedPriority` and `MostRequestedPriority` (two of the simpler, polar-opposite priority functions) and much bigger differences in node requests, I’ve demonstrated here that changing the policy configmap does in fact take effect (as evidenced by the logs we’ve already seen).

If you would like to recreate this using my custom scheduler image (normally these scores aren’t logged unless `--v=10`), it is built into my scheduler operator image available at `docker.io/mdame/kso`.

[1] https://github.com/kubernetes/kubernetes/blob/62e97173867a5a7817c2e2ecea78a0e91a370675/pkg/scheduler/algorithm/priorities/resource_allocation.go#L68-L85
[2] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/most_requested.go#L46
[3] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/least_requested.go
[4] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/least_requested.go#L36
[5] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/most_requested.go#L34
[6] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/image_locality.go#L50

Comment 11 Mike Dame 2019-08-21 20:36:48 UTC
I updated my hyperkube image to also log ImageLocalityPriority scores after [1], and, as I thought, the score for every node is 0:

I0821 20:34:53.064475       1 image_locality.go:57] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:34:53.064500       1 image_locality.go:57] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:34:53.064507       1 image_locality.go:57] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: ImageLocalityPriority, score 0

In this case, node `ip-10-0-129-96.us-west-2.compute.internal` was the only node where the image was present. We are using too small of an image on too few nodes for it to make a difference in the algorithm.
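
(For what it’s worth, plugging rough numbers into the thresholds as I read them in [1]: hello-pod is only about 147MB, so with the image present on one node the scaled size is roughly 25-50MB depending on whether the spread is computed over 3 or 6 nodes, and 10 * (scaled - 23MB) / (1000MB - 23MB) truncates to 0 either way.)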

[1] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/image_locality.go#L50

Comment 12 Mike Dame 2019-08-21 21:04:46 UTC
To prove the point about `ImageLocalityPriority`, I tried creating a big pod that just pulls the OpenShift release image (it is >1GB in size, compared to hello-pod, which is only 147MB):

$ cat big-pod.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: empty-operator-pod
spec:
  containers:
    - image: "registry.svc.ci.openshift.org/openshift/release:golang-1.12"
      name: big-test

The first time I create it, none of the nodes have the image so it gets put on a random node:
I0821 20:58:24.209323       1 image_locality.go:57] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:58:24.210491       1 image_locality.go:57] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:58:24.210518       1 image_locality.go:57] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:58:24.216285       1 scheduler.go:572] pod test-pod/empty-operator-pod is bound successfully on node ip-10-0-167-240.us-west-2.compute.internal, 6 nodes evaluated, 3 nodes were found feasible

$ oc get pods -o wide
NAME                 READY   STATUS      RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
empty-operator-pod   0/1     Completed   3          90s    10.129.2.11   ip-10-0-167-240.us-west-2.compute.internal   <none>           <none>

On subsequent attempts, the node with the image gets a different score, and the new pod is thus scheduled onto that node:

I0821 21:00:29.239119       1 image_locality.go:57] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 21:00:29.239144       1 image_locality.go:57] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: ImageLocalityPriority, score 1
I0821 21:00:29.239149       1 image_locality.go:57] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 21:00:29.244555       1 scheduler.go:572] pod test-pod/empty-operator-pod is bound successfully on node ip-10-0-167-240.us-west-2.compute.internal, 6 nodes evaluated, 3 nodes were found feasible

$ oc get pods -o wide
NAME                 READY   STATUS             RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
empty-operator-pod   0/1     CrashLoopBackOff   5          3m31s   10.129.2.12   ip-10-0-167-240.us-west-2.compute.internal   <none>           <none>
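
(That score of 1 is consistent with the thresholds as I read them in image_locality.go: a ~1GB image spread over 1 of the 6 evaluated nodes scales to roughly 170MB, and 10 * (170MB - 23MB) / (1000MB - 23MB) truncates to 1.)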

Comment 14 Xingxing Xia 2019-08-27 10:00:46 UTC
Mike, Ge Liu and I checked this together; thanks for the detailed debugging and demo. We understand that your specific data proves the policy takes effect.
However, from the customer's perspective, it is hard to convince a customer of this by crafting data with such significant differences and logging the scores via the additional setting --v=10. A customer doesn't know how large the differences need to be to matter, so this is not UX-friendly. The 0-10 score granularity [1], after rounding, is too coarse; if the score were finer-grained, e.g. 0.00-10.00, it would solve this issue.
[1] https://docs.openshift.com/container-platform/4.1/nodes/scheduling/nodes-scheduler-default.html#nodes-scheduler-default-about_nodes-scheduler-default

Thus assigning back. If it is better to move this to VERIFIED and open a separate bug for the score-granularity issue, @Ge Liu, please help do that.
Thanks

Comment 15 Maciej Szulik 2019-08-27 12:39:39 UTC
The configuration and current algorithm of the scheduler are working as designed. Mike clearly pointed out that the functionality at fault was fixed and works as expected. If you think we need to improve the algorithm, that looks to me like an RFE; in that case, feel free to open a card on the workloads team's Jira and we'll consider it among the other items this team is currently working on. I'm moving this back to test; if the functionality is working as it should, please close the bug as VERIFIED.

Comment 17 Mike Dame 2019-08-30 19:21:55 UTC
By the way, there is some discussion upstream (as of the sig-scheduling meeting yesterday, coincidentally) about changing these priority scores to 0-100, or some range better than 0-10. So we may not even need a card for that if we get it upstream for free.

Comment 18 ge liu 2019-09-05 08:04:10 UTC
Yes, that makes sense. Thanks for the kind explanation.

Comment 19 errata-xmlrpc 2019-10-16 06:34:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

