Description of problem:

In OpenShift 4.7, the default scheduler switched from the SelectorSpread plugin to the new PodTopologySpread plugin. SelectorSpread tries to schedule replicas on different nodes. PodTopologySpread's internal default requires two node labels, kubernetes.io/hostname and topology.kubernetes.io/zone; it spreads replicas based on these labels and yields essentially the same result as SelectorSpread. However, PodTopologySpread does not work when the required labels are not defined. This change does not appear to be well documented for 4.7, which puts many users at risk of non-HA pod placement with no tolerance for node-level failures. Most bare-metal UPI users, who have no zone label and rely on the default scheduler, will be affected; possibly RHV and vSphere users as well.

Version-Release number of selected component (if applicable): 4.7

How reproducible: Always

Steps to Reproduce:

1. Create pods on imbalanced nodes (three compute nodes in this case) and confirm the small pods spread across at least two nodes:

```
oc create deployment large --image gcr.io/google-samples/node-hello:1.0 --replicas 0
oc set resources deploy/large --limits=cpu=2,memory=4Gi
oc scale deploy/large --replicas=2
oc create deployment small --image gcr.io/google-samples/node-hello:1.0 --replicas 0
oc set resources deploy/small --limits=cpu=0.1,memory=128Mi
oc scale deploy/small --replicas 6
```

2. Remove the zone label and re-test the small pod scheduling. All pods go to one node:
```
$ oc label nodes -l node-role.kubernetes.io/worker topology.kubernetes.io/zone-
$ oc delete pod -l app=small
$ oc get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE                                              NOMINATED NODE   READINESS GATES
large-794448d7c7-45tr4   1/1     Running   0          6m36s   10.129.2.27   ip-10-0-128-109.ap-northeast-1.compute.internal   <none>           <none>
large-794448d7c7-d7p87   1/1     Running   0          6m4s    10.128.2.53   ip-10-0-183-92.ap-northeast-1.compute.internal    <none>           <none>
small-8c4c9dc99-4ggft    1/1     Running   0          34s     10.130.2.48   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-fp2w5    1/1     Running   0          34s     10.130.2.50   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-mgldf    1/1     Running   0          35s     10.130.2.47   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-nd72c    1/1     Running   0          35s     10.130.2.46   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-wszcq    1/1     Running   0          35s     10.130.2.49   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-zp6g4    1/1     Running   0          34s     10.130.2.51   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
```

Actual results: Without the zone label, the pods do not spread; all six small pods land on a single node.

Expected results: Pods spread across multiple hosts.

Additional info:
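Before relying on the default scheduler's spreading in 4.7, it is worth confirming which topology labels the nodes actually carry. A minimal check (requires a live cluster; the `-L`/`--label-columns` flag prints the given label values as extra columns):

```shell
# Show the two labels PodTopologySpread's internal default relies on;
# an empty ZONE column means the zone label is missing on that node
oc get nodes -L kubernetes.io/hostname -L topology.kubernetes.io/zone
```

kubernetes.io/hostname is set by the kubelet on every node, so in practice it is the zone column that reveals the gap on bare-metal UPI clusters.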
> The PodTopologySpread does not work when the required labels are not defined.
>
> It seems this change is not well documented in 4.7, putting a lot of users under risks of non-HA, no node level failure tolerant pod placement.
>
> Most of Baremetal UPI users who don't have zone label and rely on the default scheduler will be affected. Possibly RHV and vSphere users as well.

This is a well-known limitation, documented upstream: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#prerequisites. It is also mentioned in https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html#nodes-scheduler-pod-topology-spread-constraints-configuring_nodes-scheduler-pod-topology-spread-constraints:

```
Prerequisites
A cluster administrator has added the required labels to nodes.
```

It is true that the migration from SelectorSpread to PodTopologySpread is performed under the hood, so a user may not know in advance that all relevant nodes have to carry the labels for PodTopologySpread to work properly. Something we might stress in the release notes, e.g.:

```
Starting with 4.7, the SelectorSpread plugin is replaced by PodTopologySpread. Some of the original SelectorSpread functionality is emulated by the PodTopologySpread plugin. It is strongly recommended to switch to the PodTopologySpread plugin directly. In either case, all nodes now have to be labeled correctly, as described in https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#prerequisites, so that both plugins work as expected.
```

or similar.
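For workloads that cannot wait for nodes to be labeled, a deployment can opt into explicit spreading on the hostname label alone, which the kubelet sets on every node. A minimal sketch (the deployment name and app label mirror the reproducer above and are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small            # illustrative, matching the reproducer
spec:
  replicas: 6
  selector:
    matchLabels:
      app: small
  template:
    metadata:
      labels:
        app: small
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        # kubernetes.io/hostname exists on every node, so this
        # constraint works even when no zone label is defined
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: small
      containers:
      - name: node-hello
        image: gcr.io/google-samples/node-hello:1.0
```

With `whenUnsatisfiable: ScheduleAnyway` the constraint only influences scoring, like the old SelectorSpread behavior; `DoNotSchedule` would make it a hard requirement.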
Andrea, can you take a look at this?
Known upstream issue: https://github.com/kubernetes/kubernetes/issues/102136
Moving the bug to the verified state, as the required cluster changes that ensure pod replicas are spread properly are already present in the 4.7 release notes and work well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.24 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3032