Description of problem:
Observed failure for test: [sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s]

fail [k8s.io/kubernetes/test/e2e/scheduling/ubernetes_lite.go:169]: Pods were not evenly spread across zones. 1 in one zone and 4 in another zone
Expected
    <int>: 1
to be within 1 of ~
    <int>: 4

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.2/47
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.2/7

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Scheduler logs:

```
I1009 12:47:17.788394 1 scheduler.go:572] pod e2e-multi-az-13/test-service-0 is bound successfully on node ip-10-0-148-231.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 12:47:17.847418 1 scheduler.go:572] pod e2e-multi-az-13/test-service-1 is bound successfully on node ip-10-0-142-212.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 12:47:17.925481 1 scheduler.go:572] pod e2e-multi-az-13/test-service-2 is bound successfully on node ip-10-0-138-0.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 12:47:17.959932 1 scheduler.go:572] pod e2e-multi-az-13/test-service-3 is bound successfully on node ip-10-0-148-231.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 12:47:17.987308 1 scheduler.go:572] pod e2e-multi-az-13/test-service-4 is bound successfully on node ip-10-0-138-0.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.875870 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-wtsbr is bound successfully on node ip-10-0-142-212.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.905571 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-q8v64 is bound successfully on node ip-10-0-138-0.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.905971 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-dxp5j is bound successfully on node ip-10-0-148-231.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.944989 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-qwqv6 is bound successfully on node ip-10-0-142-212.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.945616 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-fbb9b is bound successfully on node ip-10-0-138-0.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
```

Nodes:
- ip-10-0-142-212.ec2.internal in us-east-1a
- ip-10-0-138-0.ec2.internal in us-east-1a
- ip-10-0-148-231.ec2.internal in us-east-1b

SelectorSpreadPriority is the priority that takes zone spreading into account. Computed scores for the nodes:

1. Before e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-wtsbr is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=0, result[0].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=0, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[0].Score=10
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=0, result[1].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=0, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[1].Score=10
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=0, result[2].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=0, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[2].Score=10

Final score vector (10, 10, 10). Node "ip-10-0-142-212.ec2.internal" was picked.

2. Before e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-q8v64 is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=1, result[0].Score=1 => fScore=0
  - maxCountByZone=1 => zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[0].Score=0
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=1, result[1].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=1, zoneScore=0
  - fScore = 10*1/3 + 0*2/3 = 10/3, result[1].Score=10/3
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=1, result[2].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=1, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[2].Score=10

Final score vector (0, 10/3, 10). Although "ip-10-0-148-231.ec2.internal" scored higher than "ip-10-0-138-0.ec2.internal", node "ip-10-0-138-0.ec2.internal" was picked instead.
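The arithmetic above can be sanity-checked with a small sketch of the SelectorSpreadPriority math. This is a simplified model, not the actual scheduler code; it assumes zoneWeighting = 2/3 and MaxPriority = 10, with the per-node and per-zone scores computed as 10*(maxCount - count)/maxCount:

```python
# Simplified model of SelectorSpreadPriority (an assumption, not the real
# kube-scheduler implementation): per-node spreading score blended with a
# per-zone spreading score at zoneWeighting = 2/3.
MAX_PRIORITY = 10.0
ZONE_WEIGHT = 2.0 / 3.0

ZONES = {
    "ip-10-0-142-212.ec2.internal": "us-east-1a",
    "ip-10-0-138-0.ec2.internal": "us-east-1a",
    "ip-10-0-148-231.ec2.internal": "us-east-1b",
}

def selector_spread(counts):
    """counts: matching pods already bound to each node -> blended score per node."""
    max_by_node = max(counts.values())
    zone_counts = {}
    for node, n in counts.items():
        zone_counts[ZONES[node]] = zone_counts.get(ZONES[node], 0) + n
    max_by_zone = max(zone_counts.values())
    scores = {}
    for node, n in counts.items():
        node_score = MAX_PRIORITY if max_by_node == 0 else \
            MAX_PRIORITY * (max_by_node - n) / max_by_node
        zone_score = MAX_PRIORITY if max_by_zone == 0 else \
            MAX_PRIORITY * (max_by_zone - zone_counts[ZONES[node]]) / max_by_zone
        scores[node] = node_score * (1 - ZONE_WEIGHT) + zone_score * ZONE_WEIGHT
    return scores

# State before pod ...-q8v64 is scheduled: one pod already on ip-10-0-142-212.
print(selector_spread({"ip-10-0-142-212.ec2.internal": 1,
                       "ip-10-0-138-0.ec2.internal": 0,
                       "ip-10-0-148-231.ec2.internal": 0}))
# The scores come out as 0.0, ~3.33 and 10.0, matching the (0, 10/3, 10) vector.
```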
So there must be other priorities that give "ip-10-0-138-0.ec2.internal" a higher overall score than "ip-10-0-148-231.ec2.internal". Based on what can be assumed from the test and the scheduler logs, we can rule out the following:
- ImageLocalityPriority: all the nodes already have the image (pulled by pods under e2e-multi-az-13 in the previous test).
- LeastRequestedPriority: it takes into account only the node's capacity and the pod's requested resources, and all the pods are exactly the same.
- BalancedResourceAllocation: the same reasoning as for LeastRequestedPriority.
- ServiceSpreadingPriority: the pod does not match any service's label selector.
- InterPodAffinityPriority: no affinity/anti-affinity is specified in the pod's spec.
- NodePreferAvoidPodsPriority: scoped to a single namespace, not to "e2e-multi-az-847".
- NodeAffinityPriority: no node affinity is specified in the pod's spec.

It is hard to say which priority is the culprit here. It would be helpful to run the scheduler at log level 10; SelectorSpreadPriority then prints the score for each node. The same holds for a number of other priority functions:
- InterPodAffinityPriority
- EvenPodsSpreadPriority
- ResourceLimitsPriority

In the third scheduling iteration, when pod "e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-dxp5j" is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=1, result[0].Score=1 => fScore=0
  - maxCountByZone=2 => zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[0].Score=0
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=1, result[1].Score=1 => fScore=0
  - maxCountByZone=2, zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[1].Score=0
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=1, result[2].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=2, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[2].Score=10

Final score vector (0, 0, 10). Here, node "ip-10-0-148-231.ec2.internal" is picked as expected.
So it might be the case that the total score added by the other priorities lies in the interval (10/3, 10).

In the fourth iteration, when pod "e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-qwqv6" is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=1, result[0].Score=1 => fScore=0
  - maxCountByZone=2 => zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[0].Score=0
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=1, result[1].Score=1 => fScore=0
  - maxCountByZone=2, zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[1].Score=0
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=1, result[2].Score=1 => fScore=0
  - maxCountByZone=2, zoneScore=5
  - fScore = 0*1/3 + 5*2/3 = 10/3, result[2].Score=10/3

Final score vector (0, 0, 10/3). Node "ip-10-0-148-231.ec2.internal" was supposed to be picked; node "ip-10-0-142-212.ec2.internal" was picked instead.

In the last iteration, when pod "e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-fbb9b" is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=2, result[0].Score=2 => fScore=0
  - maxCountByZone=3 => zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[0].Score=0
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=2, result[1].Score=1 => fScore=MaxPriority*0.5=5
  - maxCountByZone=3, zoneScore=0
  - fScore = 5*1/3 + 0*2/3 = 5/3, result[1].Score=5/3
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=2, result[2].Score=1 => fScore=MaxPriority*0.5=5
  - maxCountByZone=3, zoneScore=MaxPriority*2/3=20/3
  - fScore = 5*1/3 + (20/3)*2/3 = 5/3 + 40/9 = 55/9, result[2].Score=55/9

Final score vector (0, 5/3, 55/9). Node "ip-10-0-148-231.ec2.internal" was supposed to be picked; node "ip-10-0-138-0.ec2.internal" was picked instead.

On the other hand, if we assume the zone component is not taken into account (only the per-node spreading score), the observed node selection is not incorrect.
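To get a feel for how large the combined contribution of the other priorities would have to be, here is a small hypothetical model (an assumption for illustration, not derived from the logs): suppose the remaining priorities add a constant bonus `delta` to both us-east-1a nodes relative to the us-east-1b node, and check which values of `delta` reproduce all four observed picks. The 55/9 value is the last-iteration score recomputed as 5*1/3 + (20/3)*2/3.

```python
# SelectorSpread score vectors per iteration as (142-212, 138-0, 148-231),
# paired with the zone of the node that was actually picked per the log.
# The flat per-node bonus `delta` is a modeling assumption.
VECTORS = {
    "q8v64": ((0.0, 10 / 3, 10.0), "us-east-1a"),
    "dxp5j": ((0.0, 0.0, 10.0), "us-east-1b"),
    "qwqv6": ((0.0, 0.0, 10 / 3), "us-east-1a"),
    "fbb9b": ((0.0, 5 / 3, 55 / 9), "us-east-1a"),  # 5*1/3 + (20/3)*2/3 = 55/9
}

def winner(vector, delta):
    """Zone of the highest-scoring node once both us-east-1a nodes get +delta."""
    a1, a2, b = vector
    return "us-east-1a" if max(a1, a2) + delta > b else "us-east-1b"

for delta in (3.0, 4.5, 7.0, 11.0):
    ok = all(winner(v, delta) == zone for v, zone in VECTORS.values())
    print("delta=%.1f consistent with all observed picks: %s" % (delta, ok))
# Under this model, only deltas strictly between 20/3 and 10 reproduce all four
# picks, i.e. a narrower window than the interval suggested above.
```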
From https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.2/7:

```
I1002 20:05:46.994366 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-cqgzz is bound successfully on node ip-10-0-132-156.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1002 20:05:47.016929 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-rt5zg is bound successfully on node ip-10-0-132-74.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1002 20:05:47.044041 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-jhkgd is bound successfully on node ip-10-0-145-131.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1002 20:05:47.065871 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-jrh2l is bound successfully on node ip-10-0-132-74.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1002 20:05:47.079153 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-c5z52 is bound successfully on node ip-10-0-132-156.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
```

Nodes:
- ip-10-0-132-156.ec2.internal in us-east-1a
- ip-10-0-132-74.ec2.internal in us-east-1a
- ip-10-0-145-131.ec2.internal in us-east-1b

The same explanation applies.
From the scheduler logs:

```
I1002 19:44:01.375473 1 factory.go:412] Creating scheduler with fit predicates 'map[CheckNodeUnschedulable:{} CheckVolumeBinding:{} GeneralPredicates:{} MatchInterPodAffinity:{} MaxAzureDiskVolumeCount:{} MaxCSIVolumeCountPred:{} MaxEBSVolumeCount:{} MaxGCEPDVolumeCount:{} NoDiskConflict:{} NoVolumeZoneConflict:{} PodToleratesNodeTaints:{}]' and priority functions 'map[BalancedResourceAllocation:{} ImageLocalityPriority:{} InterPodAffinityPriority:{} LeastRequestedPriority:{} NodeAffinityPriority:{} NodePreferAvoidPodsPriority:{} SelectorSpreadPriority:{} TaintTolerationPriority:{}]'
```

The following priorities are taken into account:
- BalancedResourceAllocation
- ImageLocalityPriority
- InterPodAffinityPriority
- LeastRequestedPriority
- NodeAffinityPriority
- NodePreferAvoidPodsPriority
- SelectorSpreadPriority
- TaintTolerationPriority

Analyzing BalancedResourceAllocation and LeastRequestedPriority again:
- BalancedResourceAllocation: 10*(1 - |cpuFraction - memoryFraction|), where cpuFraction/memoryFraction = (total requested by the pod plus already scheduled pods) / (node allocatable)
- LeastRequestedPriority: 5*[(cpuCapacity - cpuRequested)/cpuCapacity + (memCapacity - memRequested)/memCapacity]

Both priorities actually take already scheduled pods into account. So if node "ip-10-0-148-231.ec2.internal" (in us-east-1b) had significantly fewer free resources left than the nodes in the us-east-1a zone, node "ip-10-0-148-231.ec2.internal" might have been less suitable for scheduling after all.

In the first analysis, in the case where both us-east-1a nodes get a 0 score from SelectorSpreadPriority, i.e. (0, 0, 10), node "ip-10-0-148-231.ec2.internal" is picked. On the other hand, if at least one of the us-east-1a nodes gets a score of at least 10/3, that can be enough to push node "ip-10-0-148-231.ec2.internal" into second place.
From the first analysis:
- (10, 10, 10) -> any node can be picked
- (0, 10/3, 10) -> a us-east-1a node is picked
- (0, 0, 10) -> the us-east-1b node is picked
- (0, 0, 10/3) -> the 10/3 advantage is not sufficient to beat the 0 scores; a us-east-1a node is still picked
- (0, 5/3, 55/9) -> a us-east-1a node is picked even though 55/9 is the highest score

With respect to CPU/memory resources, let's consider the following situation:
- node "ip-10-0-142-212.ec2.internal" (us-east-1a) has 40% of its CPU free; from observation, memory is usually abundantly available, so let's assume 50% free
- node "ip-10-0-138-0.ec2.internal" has 40% of its CPU free
- node "ip-10-0-148-231.ec2.internal" has only 10% of its CPU free

Computed scores [1]:

Node ip-10-0-142-212.ec2.internal, BalancedResourceAllocation score=9.0, LeastRequestedPriority=4.5, in total=13.5
Node ip-10-0-138-0.ec2.internal, BalancedResourceAllocation score=9.0, LeastRequestedPriority=4.5, in total=13.5
Node ip-10-0-148-231.ec2.internal, BalancedResourceAllocation score=6.0, LeastRequestedPriority=3.0, in total=9.0

Together, BalancedResourceAllocation and LeastRequestedPriority give (13.5, 13.5, 9.0), i.e. a 4.5-point gap between the us-east-1a and us-east-1b nodes. That gap is sufficient to flip the (0, 0, 10/3) and (0, 5/3, 55/9) cases in favor of the second us-east-1a node.

Flipping the (0, 10/3, 10) case requires a gap larger than 10 - 10/3 = 20/3 ≈ 6.67. For example, with 47% of CPU free on the us-east-1a nodes (and 10% CPU / 60% memory free on the us-east-1b node):

Node ip-10-0-142-212.ec2.internal, BalancedResourceAllocation score=9.7, LeastRequestedPriority=4.85, in total=14.55
Node ip-10-0-138-0.ec2.internal, BalancedResourceAllocation score=9.7, LeastRequestedPriority=4.85, in total=14.55
Node ip-10-0-148-231.ec2.internal, BalancedResourceAllocation score=5.0, LeastRequestedPriority=3.5, in total=8.5

That is a gap of 6.05, still slightly short of 20/3, so flipping this case would need an even larger imbalance, or a small contribution from one of the remaining priorities.

Unfortunately, it is impossible to retrieve the state of the cluster at the time the "Multi-AZ Clusters should spread the pods of a replication controller across zones" test ran to confirm these numbers.
------------------------------------------------------------------------------------------------
[1] The following Python code was used to compute the scores:
```
def balancedResourceAllocation(cpuFraction, memoryFraction):
    return 10.0 * (1.0 - abs(cpuFraction - memoryFraction))

def leastRequestedPriority(cpuFraction, memoryFraction):
    return 5 * (cpuFraction + memoryFraction)

# fractions of resources left free on each node
nodes = {"ip-10-0-142-212.ec2.internal": {"cpu": 0.4, "memory": 0.5},
         "ip-10-0-138-0.ec2.internal": {"cpu": 0.4, "memory": 0.5},
         "ip-10-0-148-231.ec2.internal": {"cpu": 0.1, "memory": 0.5}}

for node in sorted(nodes.keys()):
    resources = nodes[node]
    s1 = balancedResourceAllocation(resources["cpu"], resources["memory"])
    s2 = leastRequestedPriority(resources["cpu"], resources["memory"])
    print("Node {}, BalancedResourceAllocation score={}, LeastRequestedPriority={}, "
          "in total={}".format(node, s1, s2, s1 + s2))
```
*** Bug 1724337 has been marked as a duplicate of this bug. ***
This is causing 21% of failures: https://search.svc.ci.openshift.org/?search=Multi-AZ+Clusters+should+spread+the+pods+of+a+service+across+zones&maxAge=336h&context=2&type=all
The 'service' variant of this test is currently causing trouble in the 4.3 branch: https://bugzilla.redhat.com/show_bug.cgi?id=1806594
Duping against bug 1806594, which I plan to fix and backport. *** This bug has been marked as a duplicate of bug 1806594 ***