Opening this to confirm that in 4.2, this problem no longer exists after https://github.com/openshift/cluster-kube-scheduler-operator/pull/136 so that we can justify backporting the changes in https://github.com/openshift/cluster-kube-scheduler-operator/pull/158
Regarding comment 6: please don't omit the last character '=' from the kubeconfig's URL.
Ravi, could you please take a look at this? Since we see the log line change after editing the policy configmap (indicating that the scheduler is aware of the new policy), the policy should be taking effect. If it is not, wouldn't that indicate a problem with the Kubernetes scheduler itself?
I’ve done a lot of debugging into this, and I think I figured out why in your tests the pod isn’t landing on the node you expect. The policy *is* taking effect (which I’ll demonstrate), but the issue is that scheduler priorities are not inherently deterministic; they only enable formulas that attempt to calculate a normalized priority score for each node. This score can only be between 0 and 10, and the scheduler does a lot of rounding, so in some cases (especially small test cases such as this) it’s very likely that a “desired” node and an “undesired” node ultimately end up with the same score. I can show this using two simpler priorities: `LeastRequestedPriority` and `MostRequestedPriority`.

My cluster started with 3 nodes:

----------------------------
Node1:
$ oc describe node ip-10-0-129-96.us-west-2.compute.internal
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource  Requests      Limits
    --------  --------      ------
    cpu       1110m (74%)   300m (20%)
    memory    2429Mi (34%)  587Mi (8%)

Node2:
$ oc describe node ip-10-0-157-205.us-west-2.compute.internal
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource  Requests      Limits
    --------  --------      ------
    cpu       840m (56%)    100m (6%)
    memory    1821Mi (25%)  537Mi (7%)

Node3:
$ oc describe node ip-10-0-167-240.us-west-2.compute.internal
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource  Requests      Limits
    --------  --------      ------
    cpu       1210m (80%)   300m (20%)
    memory    2841Mi (40%)  587Mi (8%)
--------------------------------

While these appear to be different enough to score different priorities (e.g., for LeastRequested a pod should always land on Node2), they are not.
I found this by logging the node scores in the scheduler (see [1]):

---------------------------------
I0821 18:23:26.443610       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1740 millicores 3796893696 memory bytes, score 2
I0821 18:23:26.443636       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1610 millicores 3817865216 memory bytes, score 2
I0821 18:23:26.443645       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1610 millicores 3595567104 memory bytes, score 2
--------------------------------------

You can see that, in the scheduler, the total requested CPU scored for each node is actually higher than the allocatable (to be honest I don’t know why this is, but it makes following the calculations easier, because the CPU score then defaults to `0` [2][3]). So if we plug the remaining numbers into the scoring functions [4][5], the result rounds out to 2 for each node, making each node equally schedulable.

Creating a filler pod to take up significantly more space on a node produces new values for Node1:

filler-pod.yaml:
```
apiVersion: v1
kind: Pod
metadata:
  name: filler-pod
spec:
  containers:
  - image: "docker.io/ocpqe/hello-pod"
    name: hello-pod
    resources:
      requests:
        memory: "2000Mi"
```

$ oc describe node ip-10-0-129-96.us-west-2.compute.internal
  Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource  Requests      Limits
    --------  --------      ------
    cpu       1110m (74%)   300m (20%)
    memory    4429Mi (62%)  587Mi (8%)

Now this is enough of a difference for the scheduler to produce different scores for each node:

---------------------
I0821 18:50:23.651263       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: MostResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1610 millicores 3817865216 memory bytes, score 2
I0821 18:50:23.651284       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: MostResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1710 millicores 5692719104 memory bytes, score 3
I0821 18:50:23.651292       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: MostResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1740 millicores 3796893696 memory bytes, score 2
----------------

Here you can see that while using `MostRequestedPriority`, Node1 always gets a higher score, and the pod is always scheduled onto that node:

$ oc get -o wide all
NAME                     READY   STATUS              RESTARTS   AGE     IP           NODE                                        NOMINATED NODE   READINESS GATES
pod/empty-operator-pod   0/1     ContainerCreating   0          3s      <none>       ip-10-0-129-96.us-west-2.compute.internal   <none>           <none>
pod/filler-pod           1/1     Running             0          2m27s   10.131.0.8   ip-10-0-129-96.us-west-2.compute.internal   <none>           <none>

Additionally, when the scheduler config is flipped to use `LeastRequestedPriority`, the node scores are all consistently flipped!
---------------------
I0821 19:06:32.022833       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1740 millicores 3796893696 memory bytes, score 2
I0821 19:06:32.022833       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1610 millicores 3817865216 memory bytes, score 2
I0821 19:06:32.022873       1 resource_allocation.go:78] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: LeastResourceAllocation, capacity 1500 millicores 7420715008 memory bytes, total request 1710 millicores 5692719104 memory bytes, score 1
-----------------------

Now, Node1 always has the *lowest* score, and the pod is accordingly not scheduled onto it.

`ImageLocalityPriority`, which was first tested here, suffers from similar rounding and normalization woes (see its calculation functions at [6]). It is based on the number of nodes that have the image, the total number of nodes, and the size of the image itself. All of these, in our case, are small enough that I believe they might always produce a priority score of 0, which is the minimum and would put the node that has the image at the same priority as any node that does not.

However, using `LeastRequestedPriority` and `MostRequestedPriority` (two of the simpler, polar-opposite priority functions) with much bigger differences in node requests, I believe I’ve demonstrated here that changing the policy configmap does in fact take effect (as evidenced by the logs we’ve already seen).

If you would like to recreate this using my custom scheduler image (normally these scores aren’t logged unless `--v=10`), it is built into my scheduler operator image available at `docker.io/mdame/kso`.
[1] https://github.com/kubernetes/kubernetes/blob/62e97173867a5a7817c2e2ecea78a0e91a370675/pkg/scheduler/algorithm/priorities/resource_allocation.go#L68-L85
[2] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/most_requested.go#L46
[3] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/least_requested.go
[4] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/least_requested.go#L36
[5] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/most_requested.go#L34
[6] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/image_locality.go#L50
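For anyone who wants to check the arithmetic without reading the Go source, the integer score math from [4][5] can be reproduced with a short sketch. This is my own simplification, not the actual scheduler code; it just applies the per-resource 0-10 integer-division formula and averages the CPU and memory scores:

```python
# Sketch of the LeastRequested/MostRequested score math (simplified,
# not the real scheduler code): each resource gets a 0-10 score via
# integer division, then CPU and memory are averaged (also integer).

def least_requested(capacity, requested):
    if requested > capacity:
        return 0  # over-requested resources default to a score of 0
    return (capacity - requested) * 10 // capacity

def most_requested(capacity, requested):
    if requested > capacity:
        return 0
    return requested * 10 // capacity

def node_score(per_resource, cpu_cap, cpu_req, mem_cap, mem_req):
    return (per_resource(cpu_cap, cpu_req) + per_resource(mem_cap, mem_req)) // 2

# Values taken from the LeastResourceAllocation log lines above.
# CPU requests exceed the 1500m capacity on every node, so CPU scores 0
# and only memory differentiates the nodes -- and not by enough.
nodes = {
    "ip-10-0-157-205": (1500, 1740, 7420715008, 3796893696),
    "ip-10-0-167-240": (1500, 1610, 7420715008, 3817865216),
    "ip-10-0-129-96":  (1500, 1610, 7420715008, 3595567104),
}
for name, args in nodes.items():
    print(name, node_score(least_requested, *args))  # prints 2 for all three
```

Plugging in Node1's post-filler numbers (1710 millicores, 5692719104 memory bytes) gives a MostRequested score of 3 and a LeastRequested score of 1, matching the logs above.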
I updated my hyperkube image to also log for ImageLocalityPriority after [1], and as I suspected, every node's score is 0:

I0821 20:34:53.064475       1 image_locality.go:57] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:34:53.064500       1 image_locality.go:57] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:34:53.064507       1 image_locality.go:57] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: ImageLocalityPriority, score 0

In this case, node `ip-10-0-129-96.us-west-2.compute.internal` was the only node where the image was present. We are using too small an image on too few nodes for it to make a difference in the algorithm.

[1] https://github.com/kubernetes/kubernetes/blob/90df64b75b61e6ea45f211144dd391ecd4263fb5/pkg/scheduler/algorithm/priorities/image_locality.go#L50
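For what it's worth, here is a rough sketch of how I read the calculation at [1]. The 23MB/1000MB thresholds and the spread scaling are my reading of that commit, so treat this as an approximation rather than the exact scheduler code:

```python
# Approximation of the ImageLocalityPriority score (my reading of [1];
# the thresholds and spread scaling are assumptions from that commit).
MB = 1024 * 1024
MIN_THRESHOLD = 23 * MB
MAX_THRESHOLD = 1000 * MB

def image_locality_score(image_size_bytes, nodes_with_image, total_nodes):
    # The image size is scaled down by how few nodes already have it,
    # then clamped into the [min, max] threshold window.
    scaled = image_size_bytes * nodes_with_image / total_nodes
    scaled = min(max(scaled, MIN_THRESHOLD), MAX_THRESHOLD)
    return int((scaled - MIN_THRESHOLD) * 10 / (MAX_THRESHOLD - MIN_THRESHOLD))

# hello-pod (147MB) present on 1 of the 6 evaluated nodes: the scaled
# size (~24.5MB) barely clears the minimum threshold, so the score is 0.
print(image_locality_score(147 * MB, 1, 6))
```

This matches the logs: an image needs to be both large and reasonably well spread before it moves the score off 0.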
To further demonstrate the `ImageLocalityPriority` scoring, I tried creating a big pod that just pulls the OpenShift release image (it is >1GB in size, compared to hello-pod which is only 147MB):

$ cat big-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: empty-operator-pod
spec:
  containers:
  - image: "registry.svc.ci.openshift.org/openshift/release:golang-1.12"
    name: big-test

The first time I create it, none of the nodes have the image, so it gets put on an arbitrary node:

I0821 20:58:24.209323       1 image_locality.go:57] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:58:24.210491       1 image_locality.go:57] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:58:24.210518       1 image_locality.go:57] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 20:58:24.216285       1 scheduler.go:572] pod test-pod/empty-operator-pod is bound successfully on node ip-10-0-167-240.us-west-2.compute.internal, 6 nodes evaluated, 3 nodes were found feasible

$ oc get pods -o wide
NAME                 READY   STATUS      RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
empty-operator-pod   0/1     Completed   3          90s   10.129.2.11   ip-10-0-167-240.us-west-2.compute.internal   <none>           <none>

On subsequent attempts, the node with the image gets a different score, and the new pod is thus scheduled onto that node:

I0821 21:00:29.239119       1 image_locality.go:57] empty-operator-pod -> ip-10-0-129-96.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 21:00:29.239144       1 image_locality.go:57] empty-operator-pod -> ip-10-0-167-240.us-west-2.compute.internal: ImageLocalityPriority, score 1
I0821 21:00:29.239149       1 image_locality.go:57] empty-operator-pod -> ip-10-0-157-205.us-west-2.compute.internal: ImageLocalityPriority, score 0
I0821 21:00:29.244555       1 scheduler.go:572] pod test-pod/empty-operator-pod is bound successfully on node ip-10-0-167-240.us-west-2.compute.internal, 6 nodes evaluated, 3 nodes were found feasible

$ oc get pods -o wide
NAME                 READY   STATUS             RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
empty-operator-pod   0/1     CrashLoopBackOff   5          3m31s   10.129.2.12   ip-10-0-167-240.us-west-2.compute.internal   <none>           <none>
Mike, Ge Liu and I checked this together. Thanks for the detailed debugging and demo. We understand that your specific data proves the policy takes effect. However, from the customer's perspective, it is hard to convince customers of this by crafting such significantly large data and logging it via the additional `--v=10` setting; a customer doesn't know what data scale is significant enough. This is not customer-friendly UX. The 0-10 score granularity [1], after rounding, is too coarse. If the score were more fine-grained, e.g. 0.00-10.00, it would solve this issue.

[1] https://docs.openshift.com/container-platform/4.1/nodes/scheduling/nodes-scheduler-default.html#nodes-scheduler-default-about_nodes-scheduler-default

Thus assigning back. If it is better to move this to VERIFIED and open a separate bug for the score granularity issue, @Ge Liu, please help do it. Thanks.
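To illustrate the granularity point with the memory numbers from the earlier comment (this is hypothetical: the same LeastRequested memory formula, just keeping two decimal places instead of truncating to an integer):

```python
# Hypothetical fractional scoring: same LeastRequested memory formula
# as in the logs above, without the integer truncation.
capacity = 7420715008
requests = [3796893696, 3817865216, 3595567104]
for req in requests:
    print(round((capacity - req) * 10 / capacity, 2))
# prints 4.88, 4.86, 5.15 -- the three nodes now score differently,
# instead of all collapsing to the same integer score of 2.
```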
The configuration and current algorithm of the scheduler are working as designed. Mike clearly showed that the functionality at fault was fixed and works as expected. If you think we need to improve the algorithm, that looks to me like an RFE; in that case, feel free to open a card on the Workloads team Jira and we'll consider it among the other items this team is currently working on. I'm moving this back to test; if the functionality is working as it should, please close the bug as VERIFIED.
By the way, there is some discussion upstream (as of the sig-scheduling meeting yesterday, coincidentally) about changing these priority scores to 0-100, or some better range than 0-10. So we may not even need a card for that if we get it upstream for free.
Yes, that makes sense. Thanks for the kind explanation.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922