Description of problem:

Our application runs on OSD and targets a multi-AZ (3-zone) cluster for resiliency purposes. It implements its own ingresscontrollers following the sharded pattern described in the OpenShift documentation:

https://docs.openshift.com/container-platform/4.5/networking/ingress-operator.html#nw-ingress-sharding_configuring-ingress

When the ingresscontroller is scaled to 3 replicas (spec.replicas in the IngressController), one replica gets created per AZ. However, if the replicas are scaled to, say, 6, the resulting ingress pods are sometimes not balanced across the zones:

$ oc get nodes $(oc get pods -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=sharded -o jsonpath='{range .items[*]} {.spec.nodeName}{"\n"}') -o jsonpath='{range .items[*]} {.metadata.name} {" "} {.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone} {"\n"} {end}'
 ip-10-0-154-239.ec2.internal   us-east-1a
 ip-10-0-145-224.ec2.internal   us-east-1a
 ip-10-0-207-167.ec2.internal   us-east-1c
 ip-10-0-140-78.ec2.internal    us-east-1a
 ip-10-0-213-39.ec2.internal    us-east-1c
 ip-10-0-165-73.ec2.internal    us-east-1b

(Notice the imbalance: us-east-1a has 3 instances while us-east-1b has only 1.)

There doesn't seem to be a way to influence this from the ingresscontroller's config. I think I need a way to specify pod topology spread constraints:

https://docs.openshift.com/container-platform/4.6/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html

Steps to Reproduce:
1. Implement sharded ingress as per the docs.
2. Scale to 6 replicas.
3. Run the command above.
4. Sometimes the ingress pods are out of balance.

How reproducible: 50%

Actual results: As above.

Expected results: Ingress pods balanced across the multi-AZ cluster.

Additional info:
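Something along these lines from the linked doc is what I have in mind (purely a sketch; the label under matchLabels is made up, since I don't know what labels the operator puts on the router pods):

spec:
  topologySpreadConstraints:
  - maxSkew: 1                                 # zones may differ by at most one matching pod
    topologyKey: topology.kubernetes.io/zone   # spread across AZs
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: sharded-router                    # made-up label for illustration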
We'll look into this in the upcoming sprint.
We don't want to prevent scheduling more replicas than there are AZs, so we should use "ScheduleAnyway". We can use a label selector with the deployment's hash so that replicas from the same generation of the same ingresscontroller are spread out, if possible. So we could do something like the following:

spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        "ingresscontroller.operator.openshift.io/hash": <hash>

I'll look into this for the next release.
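For context, a sketch of where this would end up in the generated router deployment; the pod-template nesting is the important part, and the shard deployment name here is hypothetical:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: router-sharded             # hypothetical shard deployment name
  namespace: openshift-ingress
spec:
  template:
    spec:                          # the constraint goes on the pod spec, not the deployment spec
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            ingresscontroller.operator.openshift.io/hash: <hash>   # per-generation pod-template hash

Because whenUnsatisfiable is ScheduleAnyway, the spread is only a scheduler preference: scaling to more replicas than the zones can hold evenly still succeeds, at the cost of some skew.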
I'll work on this in the upcoming sprint.
Verified in 4.8.0-0.nightly-2021-05-07-075528

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-07-075528    True        False         88m     Cluster version is 4.8.0-0.nightly-2021-05-07-075528

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-144-180.us-east-2.compute.internal   Ready    master   44m   v1.21.0-rc.0+291e731
ip-10-0-156-137.us-east-2.compute.internal   Ready    worker   36m   v1.21.0-rc.0+291e731
ip-10-0-176-65.us-east-2.compute.internal    Ready    master   44m   v1.21.0-rc.0+291e731
ip-10-0-191-128.us-east-2.compute.internal   Ready    worker   36m   v1.21.0-rc.0+291e731
ip-10-0-210-199.us-east-2.compute.internal   Ready    master   44m   v1.21.0-rc.0+291e731
ip-10-0-211-12.us-east-2.compute.internal    Ready    worker   35m   v1.21.0-rc.0+291e731

1. Check the current machinesets:

$ oc get machinesets -n openshift-machine-api
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
jechen-0507a-tgh97-worker-us-east-2a   1         1         1       1           49m
jechen-0507a-tgh97-worker-us-east-2b   1         1         1       1           49m
jechen-0507a-tgh97-worker-us-east-2c   1         1         1       1           49m

2. Scale each of the above machinesets:

$ oc -n openshift-machine-api scale --replicas=2 machinesets jechen-0507a-tgh97-worker-us-east-2a
machineset.machine.openshift.io/jechen-0507a-tgh97-worker-us-east-2a scaled
$ oc -n openshift-machine-api scale --replicas=2 machinesets jechen-0507a-tgh97-worker-us-east-2b
machineset.machine.openshift.io/jechen-0507a-tgh97-worker-us-east-2b scaled
$ oc -n openshift-machine-api scale --replicas=2 machinesets jechen-0507a-tgh97-worker-us-east-2c
machineset.machine.openshift.io/jechen-0507a-tgh97-worker-us-east-2c scaled

3. Check that the machineset scalings are successful:

$ oc get machinesets -n openshift-machine-api
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
jechen-0507a-tgh97-worker-us-east-2a   2         2         1       1           56m
jechen-0507a-tgh97-worker-us-east-2b   2         2         1       1           56m
jechen-0507a-tgh97-worker-us-east-2c   2         2         1       1           56m

$ oc get node | grep worker
ip-10-0-144-156.us-east-2.compute.internal   Ready    worker   78m    v1.21.0-rc.0+291e731
ip-10-0-156-137.us-east-2.compute.internal   Ready    worker   121m   v1.21.0-rc.0+291e731
ip-10-0-177-87.us-east-2.compute.internal    Ready    worker   77m    v1.21.0-rc.0+291e731
ip-10-0-191-128.us-east-2.compute.internal   Ready    worker   121m   v1.21.0-rc.0+291e731
ip-10-0-211-12.us-east-2.compute.internal    Ready    worker   121m   v1.21.0-rc.0+291e731
ip-10-0-214-102.us-east-2.compute.internal   Ready    worker   78m    v1.21.0-rc.0+291e731

4. Create a custom ingresscontroller with a routeSelector:

$ cat ingressctl-route-selector.yaml
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: test
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: router-certs-default
  domain: router-test.jechen-0507a.qe.devcluster.openshift.com
  replicas: 1
  endpointPublishingStrategy:
    type: NodePortService
  routeSelector:
    matchLabels:
      route: router-test

$ oc create -f ingressctl-route-selector.yaml
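Note that a route is served by this shard only if it carries the label matched by the routeSelector above; for example (hypothetical route and project names):

$ oc label route my-route route=router-test -n my-project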
5. Scale up the above ingresscontroller to 6 replicas:

$ oc -n openshift-ingress get pod -owide | grep router-test
NAME                           READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
router-test-84d997cdf8-27s5v   1/1     Running   0          13s    10.131.0.40   ip-10-0-156-137.us-east-2.compute.internal   <none>           <none>
router-test-84d997cdf8-6bvzw   1/1     Running   0          13s    10.128.2.50   ip-10-0-191-128.us-east-2.compute.internal   <none>           <none>
router-test-84d997cdf8-dbcpr   1/1     Running   0          116s   10.130.2.9    ip-10-0-144-156.us-east-2.compute.internal   <none>           <none>
router-test-84d997cdf8-mxxhb   1/1     Running   0          13s    10.131.2.8    ip-10-0-214-102.us-east-2.compute.internal   <none>           <none>
router-test-84d997cdf8-vkvv6   1/1     Running   0          13s    10.128.4.7    ip-10-0-177-87.us-east-2.compute.internal    <none>           <none>
router-test-84d997cdf8-x2xtf   1/1     Running   0          13s    10.129.2.30   ip-10-0-211-12.us-east-2.compute.internal    <none>           <none>

$ oc get nodes $(oc get pods -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=test -o jsonpath='{range .items[*]} {.spec.nodeName}{"\n"}') -o jsonpath='{range .items[*]} {.metadata.name} {" "} {.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone} {"\n"} {end}'
 ip-10-0-156-137.us-east-2.compute.internal   us-east-2a
 ip-10-0-191-128.us-east-2.compute.internal   us-east-2b
 ip-10-0-144-156.us-east-2.compute.internal   us-east-2a
 ip-10-0-214-102.us-east-2.compute.internal   us-east-2c
 ip-10-0-177-87.us-east-2.compute.internal    us-east-2b
 ip-10-0-211-12.us-east-2.compute.internal    us-east-2c

Ingress is balanced across the multi-AZ cluster.
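As a quick sanity check, the same lookup can be piped through sort and uniq to tally pods per zone (a convenience variant of the command above; the counts shown are what a balanced spread looks like):

$ oc get nodes $(oc get pods -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=test -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}') -o jsonpath='{range .items[*]}{.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone}{"\n"}{end}' | sort | uniq -c
      2 us-east-2a
      2 us-east-2b
      2 us-east-2c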
Second test, verifying with a namespaceSelector:

1. Create another custom ingresscontroller with a namespaceSelector:

$ cat ingressctl-namespace-selector.yaml
kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: test2
  namespace: openshift-ingress-operator
spec:
  defaultCertificate:
    name: router-certs-default
  domain: router-test2.jechen-0507a.qe.devcluster.openshift.com
  replicas: 1
  endpointPublishingStrategy:
    type: NodePortService
  namespaceSelector:
    matchLabels:
      namespace: router-test2

$ oc create -f ./test2/ingressctl-namespace-selector.yaml
ingresscontroller.operator.openshift.io/test2 created

$ oc -n openshift-ingress get pod -owide | grep router-test2
router-test2-6d74dc5656-5n9rh   1/1   Running   0   7s   10.131.2.9   ip-10-0-214-102.us-east-2.compute.internal   <none>   <none>

2. Scale up the ingresscontroller to 6 replicas:

$ oc -n openshift-ingress-operator edit ingresscontroller/test2
ingresscontroller.operator.openshift.io/test2 edited

$ oc -n openshift-ingress get pod -owide | grep router-test2
router-test2-6d74dc5656-2xr7w   1/1   Running   0   7s    10.129.2.32   ip-10-0-211-12.us-east-2.compute.internal    <none>   <none>
router-test2-6d74dc5656-j5xrp   1/1   Running   0   7s    10.130.2.11   ip-10-0-144-156.us-east-2.compute.internal   <none>   <none>
router-test2-6d74dc5656-jbw8m   1/1   Running   0   70s   10.131.2.10   ip-10-0-214-102.us-east-2.compute.internal   <none>   <none>
router-test2-6d74dc5656-jmz25   1/1   Running   0   7s    10.128.4.9    ip-10-0-177-87.us-east-2.compute.internal    <none>   <none>
router-test2-6d74dc5656-jnp7s   1/1   Running   0   7s    10.131.0.42   ip-10-0-156-137.us-east-2.compute.internal   <none>   <none>
router-test2-6d74dc5656-k2xsv   1/1   Running   0   7s    10.128.2.73   ip-10-0-191-128.us-east-2.compute.internal   <none>   <none>

3. Verify that the ingress pods are balanced across the multi-AZ cluster:

$ oc get nodes $(oc get pods -n openshift-ingress -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=test2 -o jsonpath='{range .items[*]} {.spec.nodeName}{"\n"}') -o jsonpath='{range .items[*]} {.metadata.name} {" "} {.metadata.labels.failure-domain\.beta\.kubernetes\.io/zone} {"\n"} {end}'
 ip-10-0-211-12.us-east-2.compute.internal    us-east-2c
 ip-10-0-144-156.us-east-2.compute.internal   us-east-2a
 ip-10-0-214-102.us-east-2.compute.internal   us-east-2c
 ip-10-0-177-87.us-east-2.compute.internal    us-east-2b
 ip-10-0-156-137.us-east-2.compute.internal   us-east-2a
 ip-10-0-191-128.us-east-2.compute.internal   us-east-2b
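Analogous to the routeSelector case, routes are picked up by this shard only when their namespace carries the label matched by the namespaceSelector; for example (hypothetical namespace name):

$ oc label namespace my-project namespace=router-test2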
Hi, does this bug require doc text? If so, please update the doc text field.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438