Bug 1900819 - Scaled ingress replicas following sharded pattern don't balance evenly across multi-AZ
Summary: Scaled ingress replicas following sharded pattern don't balance evenly across...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: jechen
URL:
Whiteboard:
Depends On:
Blocks: 1978845 1984103
 
Reported: 2020-11-23 19:12 UTC by Keith Wall
Modified: 2022-08-04 22:30 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:34:24 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-ingress-operator pull 577 (closed): Bug 1900819: Specify topology spread constraints (last updated 2021-06-08 05:04:59 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 22:34:53 UTC)

Description Keith Wall 2020-11-23 19:12:38 UTC
Description of problem:

Our application runs on OSD and targets a multi-AZ (with 3 zones) cluster for resiliency purposes.

It implements its own ingresscontrollers following the sharded pattern described in the OpenShift documentation. 

https://docs.openshift.com/container-platform/4.5/networking/ingress-operator.html#nw-ingress-sharding_configuring-ingress

When ingress is scaled to 3 replicas (spec.replicas in the IngressController), 1 replica gets created per AZ. However, if the replicas are scaled to, say, 6, sometimes the resulting ingress pods aren't balanced evenly across the zones:

oc get nodes  $(oc get pods -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=sharded -o jsonpath='{range .items[*]} {.spec.nodeName}{"\n"}') -o jsonpath='{range .items[*]} {.metadata.name} {" "} {.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone} {"\n"} {end}'

ip-10-0-154-239.ec2.internal us-east-1a
ip-10-0-145-224.ec2.internal us-east-1a
ip-10-0-207-167.ec2.internal us-east-1c
ip-10-0-140-78.ec2.internal us-east-1a
ip-10-0-213-39.ec2.internal us-east-1c
ip-10-0-165-73.ec2.internal us-east-1b

(notice the imbalance: us-east-1a has 3 instances while us-east-1b has only 1)

There doesn't seem to be a way to influence this from the ingresscontroller config.

I think I need a way to specify pod topology spread constraints: https://docs.openshift.com/container-platform/4.6/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html


Steps to Reproduce:
1. Implement sharded ingress as per the docs (a minimal sketch is included after these steps).
2. Scale to 6 replicas.
3. Run the command above.
4. Sometimes the ingress pods are out of balance.
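
For reference, a minimal sharded IngressController along these lines is what step 1 amounts to (the name, domain, and label below are illustrative placeholders, not our exact config):

kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: sharded
  namespace: openshift-ingress-operator
spec:
  domain: sharded.example.com
  replicas: 3
  routeSelector:
    matchLabels:
      type: sharded

The scale-up in step 2 can then be done by editing spec.replicas or with something like:

$ oc -n openshift-ingress-operator patch ingresscontroller/sharded --type=merge -p '{"spec":{"replicas":6}}'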


How reproducible: 50%


Actual results:

As above


Expected results:

Balanced ingress across the multi-AZ cluster

Additional info:

Comment 1 Miciah Dashiel Butler Masters 2020-12-07 03:19:48 UTC
We'll look into this in the upcoming sprint.

Comment 2 Miciah Dashiel Butler Masters 2021-02-06 00:23:04 UTC
We don't want to prevent scheduling more replicas than there are AZs, so we should use "ScheduleAnyway".  We can use a label selector with the deployment's hash so that replicas from the same generation of the same ingresscontroller are spread out, if possible.  So we could do something like the following:  

    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector: 
          matchLabels:
            "ingresscontroller.operator.openshift.io/hash": <hash>

I'll look into this for the next release.

Comment 3 Miciah Dashiel Butler Masters 2021-02-26 06:29:58 UTC
I'll work on this in the upcoming sprint.

Comment 5 jechen 2021-05-08 02:05:00 UTC
Verified in 4.8.0-0.nightly-2021-05-07-075528
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-07-075528   True        False         88m     Cluster version is 4.8.0-0.nightly-2021-05-07-075528

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-144-180.us-east-2.compute.internal   Ready    master   44m   v1.21.0-rc.0+291e731
ip-10-0-156-137.us-east-2.compute.internal   Ready    worker   36m   v1.21.0-rc.0+291e731
ip-10-0-176-65.us-east-2.compute.internal    Ready    master   44m   v1.21.0-rc.0+291e731
ip-10-0-191-128.us-east-2.compute.internal   Ready    worker   36m   v1.21.0-rc.0+291e731
ip-10-0-210-199.us-east-2.compute.internal   Ready    master   44m   v1.21.0-rc.0+291e731
ip-10-0-211-12.us-east-2.compute.internal    Ready    worker   35m   v1.21.0-rc.0+291e731

1. Check the current machinesets
$ oc get machinesets -n openshift-machine-api
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
jechen-0507a-tgh97-worker-us-east-2a   1         1         1       1           49m
jechen-0507a-tgh97-worker-us-east-2b   1         1         1       1           49m
jechen-0507a-tgh97-worker-us-east-2c   1         1         1       1           49m

2. Scale each of the above machinesets
$  oc -n openshift-machine-api scale --replicas=2  machinesets jechen-0507a-tgh97-worker-us-east-2a
machineset.machine.openshift.io/jechen-0507a-tgh97-worker-us-east-2a scaled
$ oc -n openshift-machine-api scale --replicas=2  machinesets jechen-0507a-tgh97-worker-us-east-2b
machineset.machine.openshift.io/jechen-0507a-tgh97-worker-us-east-2b scaled
$ oc -n openshift-machine-api scale --replicas=2  machinesets jechen-0507a-tgh97-worker-us-east-2c
machineset.machine.openshift.io/jechen-0507a-tgh97-worker-us-east-2c scaled

3. Check that the machineset scalings were successful
$ oc get machinesets -n openshift-machine-api
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
jechen-0507a-tgh97-worker-us-east-2a   2         2         1       1           56m
jechen-0507a-tgh97-worker-us-east-2b   2         2         1       1           56m
jechen-0507a-tgh97-worker-us-east-2c   2         2         1       1           56m

$ oc get node |grep worker
ip-10-0-144-156.us-east-2.compute.internal   Ready    worker   78m    v1.21.0-rc.0+291e731
ip-10-0-156-137.us-east-2.compute.internal   Ready    worker   121m   v1.21.0-rc.0+291e731
ip-10-0-177-87.us-east-2.compute.internal    Ready    worker   77m    v1.21.0-rc.0+291e731
ip-10-0-191-128.us-east-2.compute.internal   Ready    worker   121m   v1.21.0-rc.0+291e731
ip-10-0-211-12.us-east-2.compute.internal    Ready    worker   121m   v1.21.0-rc.0+291e731
ip-10-0-214-102.us-east-2.compute.internal   Ready    worker   78m    v1.21.0-rc.0+291e731


4. Create a custom ingresscontroller with a routeSelector
$ cat ingressctl-route-selector.yaml
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: test
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: router-certs-default
  domain: router-test.jechen-0507a.qe.devcluster.openshift.com
  replicas: 1
  endpointPublishingStrategy:
    type: NodePortService
  routeSelector:
    matchLabels:
      route: router-test
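
(Note: the routeSelector above only controls which routes this shard admits; a route would need the matching label, e.g. something like "oc label route <route-name> route=router-test". For this verification only the pod placement across zones matters.)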


$ oc create -f ingressctl-route-selector.yaml

5. Scale up the ingresscontroller above to 6 replicas
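The scale-up command itself isn't captured here; spec.replicas can be bumped by editing the ingresscontroller or with something along the lines of:

$ oc -n openshift-ingress-operator patch ingresscontroller/test --type=merge -p '{"spec":{"replicas":6}}'
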
$ oc -n openshift-ingress get pod -owide | grep router-test
NAME                              READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
router-test-84d997cdf8-27s5v      1/1     Running   0          13s    10.131.0.40   ip-10-0-156-137.us-east-2.compute.internal   <none>           <none>
router-test-84d997cdf8-6bvzw      1/1     Running   0          13s    10.128.2.50   ip-10-0-191-128.us-east-2.compute.internal   <none>           <none>
router-test-84d997cdf8-dbcpr      1/1     Running   0          116s   10.130.2.9    ip-10-0-144-156.us-east-2.compute.internal   <none>           <none>
router-test-84d997cdf8-mxxhb      1/1     Running   0          13s    10.131.2.8    ip-10-0-214-102.us-east-2.compute.internal   <none>           <none>
router-test-84d997cdf8-vkvv6      1/1     Running   0          13s    10.128.4.7    ip-10-0-177-87.us-east-2.compute.internal    <none>           <none>
router-test-84d997cdf8-x2xtf      1/1     Running   0          13s    10.129.2.30   ip-10-0-211-12.us-east-2.compute.internal    <none>           <none>


$ oc get nodes  $(oc get pods -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=test -o jsonpath='{range .items[*]} {.spec.nodeName}{"\n"}') -o jsonpath='{range .items[*]} {.metadata.name} {" "} {.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone} {"\n"} {end}'
  ip-10-0-156-137.us-east-2.compute.internal   us-east-2a 
  ip-10-0-191-128.us-east-2.compute.internal   us-east-2b 
  ip-10-0-144-156.us-east-2.compute.internal   us-east-2a 
  ip-10-0-214-102.us-east-2.compute.internal   us-east-2c 
  ip-10-0-177-87.us-east-2.compute.internal   us-east-2b 
  ip-10-0-211-12.us-east-2.compute.internal   us-east-2c 

Ingress pods are balanced across the multi-AZ cluster.

Comment 6 jechen 2021-05-08 02:20:11 UTC
Second test: verify using namespaceSelector

1. Create another custom ingresscontroller with a namespaceSelector
$ cat ingressctl-namespace-selector.yaml
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: test2
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: router-certs-default 
  domain: router-test2.jechen-0507a.qe.devcluster.openshift.com
  replicas: 1
  endpointPublishingStrategy:
    type: NodePortService
  namespaceSelector:
    matchLabels:
      namespace: router-test2
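
(Note: the namespaceSelector above only controls which namespaces this shard serves; a namespace would need the matching label, e.g. something like "oc label namespace <some-namespace> namespace=router-test2". As in the first test, only the pod placement matters here.)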


$ oc create -f ./test2/ingressctl-namespace-selector.yaml
ingresscontroller.operator.openshift.io/test2 created


$ oc -n openshift-ingress get pod -owide |grep router-test2
router-test2-6d74dc5656-5n9rh     1/1     Running   0          7s     10.131.2.9    ip-10-0-214-102.us-east-2.compute.internal   <none>           <none>

2. Scale up the ingresscontroller to 6 replicas
$ oc -n openshift-ingress-operator edit ingresscontroller/test2
ingresscontroller.operator.openshift.io/test2 edited

$ oc -n openshift-ingress get pod -owide |grep router-test2
router-test2-6d74dc5656-2xr7w     1/1     Running   0          7s     10.129.2.32   ip-10-0-211-12.us-east-2.compute.internal    <none>           <none>
router-test2-6d74dc5656-j5xrp     1/1     Running   0          7s     10.130.2.11   ip-10-0-144-156.us-east-2.compute.internal   <none>           <none>
router-test2-6d74dc5656-jbw8m     1/1     Running   0          70s    10.131.2.10   ip-10-0-214-102.us-east-2.compute.internal   <none>           <none>
router-test2-6d74dc5656-jmz25     1/1     Running   0          7s     10.128.4.9    ip-10-0-177-87.us-east-2.compute.internal    <none>           <none>
router-test2-6d74dc5656-jnp7s     1/1     Running   0          7s     10.131.0.42   ip-10-0-156-137.us-east-2.compute.internal   <none>           <none>
router-test2-6d74dc5656-k2xsv     1/1     Running   0          7s     10.128.2.73   ip-10-0-191-128.us-east-2.compute.internal   <none>           <none>

3. Verify the ingress pods are balanced across the multi-AZ cluster
$ oc get nodes  $(oc get pods -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=test2 -o jsonpath='{range .items[*]} {.spec.nodeName}{"\n"}') -o jsonpath='{range .items[*]} {.metadata.name} {" "} {.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone} {"\n"} {end}'
 ip-10-0-211-12.us-east-2.compute.internal   us-east-2c 
  ip-10-0-144-156.us-east-2.compute.internal   us-east-2a 
  ip-10-0-214-102.us-east-2.compute.internal   us-east-2c 
  ip-10-0-177-87.us-east-2.compute.internal   us-east-2b 
  ip-10-0-156-137.us-east-2.compute.internal   us-east-2a 
  ip-10-0-191-128.us-east-2.compute.internal   us-east-2b

Comment 8 Brandi Munilla 2021-06-24 16:50:33 UTC
Hi, does this bug require doc text? If so, please update the doc text field.

Comment 10 errata-xmlrpc 2021-07-27 22:34:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

