Description of problem:
Observed failure for test: [sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s]

fail [k8s.io/kubernetes/test/e2e/scheduling/ubernetes_lite.go:169]: Pods were not evenly spread across zones. 1 in one zone and 4 in another zone
Expected
    <int>: 1
to be within 1 of ~
    <int>: 4

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.2/47
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.2/7

Version-Release number of selected component (if applicable):

How reproducible:
Sometimes

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Scheduler logs:

```
I1009 12:47:17.788394 1 scheduler.go:572] pod e2e-multi-az-13/test-service-0 is bound successfully on node ip-10-0-148-231.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 12:47:17.847418 1 scheduler.go:572] pod e2e-multi-az-13/test-service-1 is bound successfully on node ip-10-0-142-212.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 12:47:17.925481 1 scheduler.go:572] pod e2e-multi-az-13/test-service-2 is bound successfully on node ip-10-0-138-0.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 12:47:17.959932 1 scheduler.go:572] pod e2e-multi-az-13/test-service-3 is bound successfully on node ip-10-0-148-231.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 12:47:17.987308 1 scheduler.go:572] pod e2e-multi-az-13/test-service-4 is bound successfully on node ip-10-0-138-0.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.875870 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-wtsbr is bound successfully on node ip-10-0-142-212.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.905571 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-q8v64 is bound successfully on node ip-10-0-138-0.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.905971 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-dxp5j is bound successfully on node ip-10-0-148-231.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.944989 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-qwqv6 is bound successfully on node ip-10-0-142-212.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1009 13:02:04.945616 1 scheduler.go:572] pod e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-fbb9b is bound successfully on node ip-10-0-138-0.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
```

Nodes:
- ip-10-0-142-212.ec2.internal in us-east-1a
- ip-10-0-138-0.ec2.internal in us-east-1a
- ip-10-0-148-231.ec2.internal in us-east-1b

SelectorSpreadPriority is the priority that takes zone spreading into account. Computed scores for the nodes:

1. Before e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-wtsbr is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=0, result[0].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=0, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[0].Score=10
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=0, result[1].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=0, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[1].Score=10
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=0, result[2].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=0, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[2].Score=10

Final score vector (10, 10, 10). Node "ip-10-0-142-212.ec2.internal" was picked.

2. Before e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-q8v64 is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=1, result[0].Score=1 => fScore=0
  - maxCountByZone=1 => zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[0].Score=0
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=1, result[1].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=1, zoneScore=0
  - fScore = 10*1/3 + 0*2/3 = 10/3, result[1].Score=10/3
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=1, result[2].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=1, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[2].Score=10

Final score vector (0, 10/3, 10). Although "ip-10-0-148-231.ec2.internal" scored higher than "ip-10-0-138-0.ec2.internal", node "ip-10-0-138-0.ec2.internal" was picked instead.
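The arithmetic above can be sanity-checked with a small sketch of the SelectorSpreadPriority math. This is a simplified model, not the actual scheduler code; it assumes zoneWeighting = 2/3 and MaxPriority = 10, with the per-node and per-zone scores computed as 10*(maxCount - count)/maxCount:

```python
# Simplified model of SelectorSpreadPriority (an assumption, not the real
# kube-scheduler implementation): per-node spreading score blended with a
# per-zone spreading score at zoneWeighting = 2/3.
MAX_PRIORITY = 10.0
ZONE_WEIGHT = 2.0 / 3.0

ZONES = {
    "ip-10-0-142-212.ec2.internal": "us-east-1a",
    "ip-10-0-138-0.ec2.internal": "us-east-1a",
    "ip-10-0-148-231.ec2.internal": "us-east-1b",
}

def selector_spread(counts):
    """counts: matching pods already bound to each node -> blended score per node."""
    max_by_node = max(counts.values())
    zone_counts = {}
    for node, n in counts.items():
        zone_counts[ZONES[node]] = zone_counts.get(ZONES[node], 0) + n
    max_by_zone = max(zone_counts.values())
    scores = {}
    for node, n in counts.items():
        node_score = MAX_PRIORITY if max_by_node == 0 else \
            MAX_PRIORITY * (max_by_node - n) / max_by_node
        zone_score = MAX_PRIORITY if max_by_zone == 0 else \
            MAX_PRIORITY * (max_by_zone - zone_counts[ZONES[node]]) / max_by_zone
        scores[node] = node_score * (1 - ZONE_WEIGHT) + zone_score * ZONE_WEIGHT
    return scores

# State before pod ...-q8v64 is scheduled: one pod already on ip-10-0-142-212.
print(selector_spread({"ip-10-0-142-212.ec2.internal": 1,
                       "ip-10-0-138-0.ec2.internal": 0,
                       "ip-10-0-148-231.ec2.internal": 0}))
# The scores come out as 0.0, ~3.33 and 10.0, matching the (0, 10/3, 10) vector.
```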
So there must be other priorities that give "ip-10-0-138-0.ec2.internal" a higher overall score than "ip-10-0-148-231.ec2.internal". Based on what can be assumed from the test and the scheduler logs, we can rule out the following:
- ImageLocalityPriority: all the nodes already have the image (pulled by pods under e2e-multi-az-13 in the previous test).
- LeastRequestedPriority: it takes into account only the node's capacity and the pod's requested resources, and all the pods are exactly the same.
- BalancedResourceAllocation: the same reasoning as for LeastRequestedPriority.
- ServiceSpreadingPriority: the pod does not match any service's label selector.
- InterPodAffinityPriority: no affinity/anti-affinity is specified in the pod's spec.
- NodePreferAvoidPodsPriority: scoped to a single namespace, not to "e2e-multi-az-847".
- NodeAffinityPriority: no node affinity is specified in the pod's spec.

It is hard to say which priority is the culprit here. It would be helpful to run the scheduler at log level 10; SelectorSpreadPriority then prints the score for each node. The same holds for a number of other priority functions:
- InterPodAffinityPriority
- EvenPodsSpreadPriority
- ResourceLimitsPriority

In the third scheduling iteration, when pod "e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-dxp5j" is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=1, result[0].Score=1 => fScore=0
  - maxCountByZone=2 => zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[0].Score=0
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=1, result[1].Score=1 => fScore=0
  - maxCountByZone=2, zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[1].Score=0
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=1, result[2].Score=0 => fScore=MaxPriority=10
  - maxCountByZone=2, zoneScore=10
  - fScore = 10*1/3 + 10*2/3 = 10, result[2].Score=10

Final score vector (0, 0, 10). Here, node "ip-10-0-148-231.ec2.internal" is picked as expected.
So it might be the case that the total score added by the other priorities lies in the interval (10/3, 10).

In the fourth iteration, when pod "e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-qwqv6" is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=1, result[0].Score=1 => fScore=0
  - maxCountByZone=2 => zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[0].Score=0
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=1, result[1].Score=1 => fScore=0
  - maxCountByZone=2, zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[1].Score=0
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=1, result[2].Score=1 => fScore=0
  - maxCountByZone=2, zoneScore=5
  - fScore = 0*1/3 + 5*2/3 = 10/3, result[2].Score=10/3

Final score vector (0, 0, 10/3). Node "ip-10-0-148-231.ec2.internal" was supposed to be picked; node "ip-10-0-142-212.ec2.internal" was picked instead.

In the last iteration, when pod "e2e-multi-az-847/ubelite-spread-rc-fdf50acf-ea94-11e9-81ad-0a58ac10c406-fbb9b" is scheduled:
- ip-10-0-142-212.ec2.internal:
  - maxCountByNodeName=2, result[0].Score=2 => fScore=0
  - maxCountByZone=3 => zoneScore=0
  - fScore = 0*1/3 + 0*2/3 = 0, result[0].Score=0
- ip-10-0-138-0.ec2.internal:
  - maxCountByNodeName=2, result[1].Score=1 => fScore=MaxPriority*0.5=5
  - maxCountByZone=3, zoneScore=0
  - fScore = 5*1/3 + 0*2/3 = 5/3, result[1].Score=5/3
- ip-10-0-148-231.ec2.internal:
  - maxCountByNodeName=2, result[2].Score=1 => fScore=MaxPriority*0.5=5
  - maxCountByZone=3, zoneScore=MaxPriority*2/3=20/3
  - fScore = 5*1/3 + (20/3)*2/3 = 5/3 + 40/9 = 55/9, result[2].Score=55/9

Final score vector (0, 5/3, 55/9). Node "ip-10-0-148-231.ec2.internal" was supposed to be picked; node "ip-10-0-138-0.ec2.internal" was picked instead.

On the other hand, if we assume the zone component is not taken into account (only the per-node spreading score), the observed node selection is not incorrect.
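To get a feel for how large the combined contribution of the other priorities would have to be, here is a small hypothetical model (an assumption for illustration, not derived from the logs): suppose the remaining priorities add a constant bonus `delta` to both us-east-1a nodes relative to the us-east-1b node, and check which values of `delta` reproduce all four observed picks. The 55/9 value is the last-iteration score recomputed as 5*1/3 + (20/3)*2/3.

```python
# SelectorSpread score vectors per iteration as (142-212, 138-0, 148-231),
# paired with the zone of the node that was actually picked per the log.
# The flat per-node bonus `delta` is a modeling assumption.
VECTORS = {
    "q8v64": ((0.0, 10 / 3, 10.0), "us-east-1a"),
    "dxp5j": ((0.0, 0.0, 10.0), "us-east-1b"),
    "qwqv6": ((0.0, 0.0, 10 / 3), "us-east-1a"),
    "fbb9b": ((0.0, 5 / 3, 55 / 9), "us-east-1a"),  # 5*1/3 + (20/3)*2/3 = 55/9
}

def winner(vector, delta):
    """Zone of the highest-scoring node once both us-east-1a nodes get +delta."""
    a1, a2, b = vector
    return "us-east-1a" if max(a1, a2) + delta > b else "us-east-1b"

for delta in (3.0, 4.5, 7.0, 11.0):
    ok = all(winner(v, delta) == zone for v, zone in VECTORS.values())
    print("delta=%.1f consistent with all observed picks: %s" % (delta, ok))
# Under this model, only deltas strictly between 20/3 and 10 reproduce all four
# picks, i.e. a narrower window than the interval suggested above.
```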
From https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-proxy-4.2/7:

```
I1002 20:05:46.994366 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-cqgzz is bound successfully on node ip-10-0-132-156.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1002 20:05:47.016929 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-rt5zg is bound successfully on node ip-10-0-132-74.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1002 20:05:47.044041 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-jhkgd is bound successfully on node ip-10-0-145-131.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1002 20:05:47.065871 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-jrh2l is bound successfully on node ip-10-0-132-74.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
I1002 20:05:47.079153 1 scheduler.go:572] pod e2e-multi-az-2870/ubelite-spread-rc-05d39ae6-e550-11e9-a30f-0a58ac105e0a-c5z52 is bound successfully on node ip-10-0-132-156.ec2.internal, 6 nodes evaluated, 3 nodes were found feasible
```

Nodes:
- ip-10-0-132-156.ec2.internal in us-east-1a
- ip-10-0-132-74.ec2.internal in us-east-1a
- ip-10-0-145-131.ec2.internal in us-east-1b

The same explanation applies.
From the scheduler logs:

```
I1002 19:44:01.375473 1 factory.go:412] Creating scheduler with fit predicates 'map[CheckNodeUnschedulable:{} CheckVolumeBinding:{} GeneralPredicates:{} MatchInterPodAffinity:{} MaxAzureDiskVolumeCount:{} MaxCSIVolumeCountPred:{} MaxEBSVolumeCount:{} MaxGCEPDVolumeCount:{} NoDiskConflict:{} NoVolumeZoneConflict:{} PodToleratesNodeTaints:{}]' and priority functions 'map[BalancedResourceAllocation:{} ImageLocalityPriority:{} InterPodAffinityPriority:{} LeastRequestedPriority:{} NodeAffinityPriority:{} NodePreferAvoidPodsPriority:{} SelectorSpreadPriority:{} TaintTolerationPriority:{}]'
```

The following priorities are taken into account:
- BalancedResourceAllocation
- ImageLocalityPriority
- InterPodAffinityPriority
- LeastRequestedPriority
- NodeAffinityPriority
- NodePreferAvoidPodsPriority
- SelectorSpreadPriority
- TaintTolerationPriority

Analyzing BalancedResourceAllocation and LeastRequestedPriority again:
- BalancedResourceAllocation: 10*(1 - |cpuFraction - memoryFraction|), where cpuFraction/memoryFraction = (total requested by the pod plus already scheduled pods) / (node allocatable)
- LeastRequestedPriority: 5*[(cpuCapacity - cpuRequested)/cpuCapacity + (memCapacity - memRequested)/memCapacity]

Both priorities actually take already scheduled pods into account. So if node "ip-10-0-148-231.ec2.internal" (in us-east-1b) had significantly fewer free resources left than the nodes in the us-east-1a zone, node "ip-10-0-148-231.ec2.internal" might have been less suitable for scheduling after all.

In the first analysis, in the case where both us-east-1a nodes get a 0 score from SelectorSpreadPriority, i.e. (0, 0, 10), node "ip-10-0-148-231.ec2.internal" is picked. On the other hand, if at least one of the us-east-1a nodes gets a score of at least 10/3, that can be enough to push node "ip-10-0-148-231.ec2.internal" into second place.
From the first analysis:
- (10, 10, 10) -> any node can be picked
- (0, 10/3, 10) -> a us-east-1a node is picked
- (0, 0, 10) -> the us-east-1b node is picked
- (0, 0, 10/3) -> the 10/3 advantage is not sufficient to beat the 0 scores; a us-east-1a node is still picked
- (0, 5/3, 55/9) -> a us-east-1a node is picked even though 55/9 is the highest score

With respect to CPU/memory resources, let's consider the following situation:
- node "ip-10-0-142-212.ec2.internal" (us-east-1a) has 40% of its CPU free; from observation, memory is usually abundantly available, so let's assume 50% free
- node "ip-10-0-138-0.ec2.internal" has 40% of its CPU free
- node "ip-10-0-148-231.ec2.internal" has only 10% of its CPU free

Computed scores [1]:

Node ip-10-0-142-212.ec2.internal, BalancedResourceAllocation score=9.0, LeastRequestedPriority=4.5, in total=13.5
Node ip-10-0-138-0.ec2.internal, BalancedResourceAllocation score=9.0, LeastRequestedPriority=4.5, in total=13.5
Node ip-10-0-148-231.ec2.internal, BalancedResourceAllocation score=6.0, LeastRequestedPriority=3.0, in total=9.0

Together, BalancedResourceAllocation and LeastRequestedPriority give (13.5, 13.5, 9.0), i.e. a 4.5-point gap between the us-east-1a and us-east-1b nodes. That gap is sufficient to flip the (0, 0, 10/3) and (0, 5/3, 55/9) cases in favor of the second us-east-1a node.

Flipping the (0, 10/3, 10) case requires a gap larger than 10 - 10/3 = 20/3 ≈ 6.67. For example, with 47% of CPU free on the us-east-1a nodes (and 10% CPU / 60% memory free on the us-east-1b node):

Node ip-10-0-142-212.ec2.internal, BalancedResourceAllocation score=9.7, LeastRequestedPriority=4.85, in total=14.55
Node ip-10-0-138-0.ec2.internal, BalancedResourceAllocation score=9.7, LeastRequestedPriority=4.85, in total=14.55
Node ip-10-0-148-231.ec2.internal, BalancedResourceAllocation score=5.0, LeastRequestedPriority=3.5, in total=8.5

That is a gap of 6.05, still slightly short of 20/3, so flipping this case would need an even larger imbalance, or a small contribution from one of the remaining priorities.

Unfortunately, it is impossible to retrieve the state of the cluster at the time the "Multi-AZ Clusters should spread the pods of a replication controller across zones" test ran to confirm these numbers.
------------------------------------------------------------------------------------------------
[1] The following Python code was used to compute the scores:
```
def balancedResourceAllocation(cpuFraction, memoryFraction):
    return 10.0 * (1.0 - abs(cpuFraction - memoryFraction))

def leastRequestedPriority(cpuFraction, memoryFraction):
    return 5 * (cpuFraction + memoryFraction)

# fractions of resources left free on each node
nodes = {"ip-10-0-142-212.ec2.internal": {"cpu": 0.4, "memory": 0.5},
         "ip-10-0-138-0.ec2.internal": {"cpu": 0.4, "memory": 0.5},
         "ip-10-0-148-231.ec2.internal": {"cpu": 0.1, "memory": 0.5}}

for node in sorted(nodes.keys()):
    resources = nodes[node]
    s1 = balancedResourceAllocation(resources["cpu"], resources["memory"])
    s2 = leastRequestedPriority(resources["cpu"], resources["memory"])
    print("Node {}, BalancedResourceAllocation score={}, LeastRequestedPriority={}, "
          "in total={}".format(node, s1, s2, s1 + s2))
```
*** Bug 1724337 has been marked as a duplicate of this bug. ***
This is causing 21% of failures: https://search.svc.ci.openshift.org/?search=Multi-AZ+Clusters+should+spread+the+pods+of+a+service+across+zones&maxAge=336h&context=2&type=all
The 'service' variant of this test is currently causing trouble in the 4.3 branch: https://bugzilla.redhat.com/show_bug.cgi?id=1806594
Duping against bug 1806594, which I plan to fix and backport. *** This bug has been marked as a duplicate of bug 1806594 ***