Bug 1979433 - Default PodTopologySpread doesn't work in non-CloudProvider env in OpenShift 4.7
Summary: Default PodTopologySpread doesn't work in non-CloudProvider env in OpenShift...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-07-06 02:57 UTC by Takayoshi Kimura
Modified: 2021-09-23 07:58 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-17 12:12:09 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
- Github kubernetes/kubernetes issue 102136 (open): "scheduler: non-compatible change in default topology spread constraints" - last updated 2021-07-12 11:26:02 UTC
- Github openshift/openshift-docs pull 34459 (closed): BZ-1979433: Adding RN for pod topology spread constraints by default - last updated 2021-07-26 09:56:08 UTC
- Red Hat Knowledge Base (Solution) 6166232 - last updated 2021-07-06 03:02:28 UTC
- Red Hat Product Errata RHBA-2021:3032 - last updated 2021-08-17 12:12:50 UTC

Description Takayoshi Kimura 2021-07-06 02:57:41 UTC
Description of problem:

In OpenShift 4.7, the default scheduler changed to use a new plugin, PodTopologySpread, instead of SelectorSpread. SelectorSpread tries to schedule replicas onto different nodes.

The newly introduced PodTopologySpread plugin's internal default constraints require two node labels, kubernetes.io/hostname and topology.kubernetes.io/zone. It spreads replicas based on these labels and produces basically the same result as SelectorSpread.
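
For reference, the upstream documentation describes these internal defaults as roughly equivalent to the following cluster-level default constraints (a sketch based on the upstream docs; it is an assumption that the 4.7 scheduler ships exactly these values):

```
defaultConstraints:
- maxSkew: 3
  topologyKey: kubernetes.io/hostname        # set by the kubelet on every node
  whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
  topologyKey: topology.kubernetes.io/zone   # normally set by the cloud provider
  whenUnsatisfiable: ScheduleAnyway
```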

PodTopologySpread does not work when the required labels are not defined on the nodes.

It seems this change is not well documented in 4.7, putting many users at risk of non-HA pod placement with no tolerance for node-level failures.

Most bare metal UPI users, who don't have the zone label and rely on the default scheduler, will be affected; possibly RHV and vSphere users as well.

Version-Release number of selected component (if applicable):

4.7

How reproducible:

Always

Steps to Reproduce:

1. Create pods on imbalanced nodes. In this case we have 3 compute nodes. Confirm the small pods spread across at least 2 nodes.

oc create deployment large --image gcr.io/google-samples/node-hello:1.0 --replicas 0
oc set resources deploy/large --limits=cpu=2,memory=4Gi
oc scale deploy/large --replicas=2

oc create deployment small --image gcr.io/google-samples/node-hello:1.0 --replicas 0
oc set resources deploy/small --limits=cpu=0.1,memory=128Mi
oc scale deploy/small --replicas 6

2. Remove the zone label and re-test the small pod scheduling. All small pods go to 1 node.

$ oc label nodes -l node-role.kubernetes.io/worker topology.kubernetes.io/zone-
$ oc delete pod -l app=small
$ oc get pod -o wide
NAME                     READY   STATUS        RESTARTS   AGE     IP            NODE                                              NOMINATED NODE   READINESS GATES
large-794448d7c7-45tr4   1/1     Running       0          6m36s   10.129.2.27   ip-10-0-128-109.ap-northeast-1.compute.internal   <none>           <none>
large-794448d7c7-d7p87   1/1     Running       0          6m4s    10.128.2.53   ip-10-0-183-92.ap-northeast-1.compute.internal    <none>           <none>
small-8c4c9dc99-4ggft    1/1     Running       0          34s     10.130.2.48   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-fp2w5    1/1     Running       0          34s     10.130.2.50   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-mgldf    1/1     Running       0          35s     10.130.2.47   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-nd72c    1/1     Running       0          35s     10.130.2.46   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-wszcq    1/1     Running       0          35s     10.130.2.49   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-zp6g4    1/1     Running       0          34s     10.130.2.51   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>

Actual results:

Without the zone label, the pods don't spread.

Expected results:

Pods spread to multiple hosts.

Additional info:
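
A possible mitigation sketch for clusters that lack the zone label: define explicit topologySpreadConstraints on the workload and spread by hostname only, since kubernetes.io/hostname exists on every node. The snippet below is illustrative, not a confirmed fix, and uses the app: small label from the deployment created in the reproduction steps:

```
# Illustrative addition to the pod template of the "small" deployment:
# spread replicas across hostnames, which does not depend on the zone label.
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: small
```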

Comment 1 Jan Chaloupka 2021-07-12 09:39:40 UTC
> The PodTopologySpread does not work when the required labels are not defined.
>
> It seems this change is not well documented in 4.7, putting a lot of users under risks of non-HA, no node level failure tolerant pod placement.
>
> Most of Baremetal UPI users who don't have zone label and rely on the default scheduler will be affected. Possibly RHV and vSphere users as well.

This is a well-known limitation, documented upstream: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#prerequisites.

It's also mentioned in https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html#nodes-scheduler-pod-topology-spread-constraints-configuring_nodes-scheduler-pod-topology-spread-constraints:

```
Prerequisites
A cluster administrator has added the required labels to nodes.
```
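
For illustration, a node that satisfies both default topology keys would carry labels like the following (node name and zone value are made up; on bare metal UPI the zone label has to be added manually, for example with oc label node):

```
apiVersion: v1
kind: Node
metadata:
  name: worker-0                            # hypothetical node name
  labels:
    kubernetes.io/hostname: worker-0        # added automatically by the kubelet
    topology.kubernetes.io/zone: zone-a     # must be added by the admin on bare metal
```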

It's true that the migration from SelectorSpread to PodTopologySpread is performed under the hood, so a user may not know in advance that all the relevant nodes need the labels set for PodTopologySpread to work properly. This is something we might stress in the release notes, something like:

```
Starting with 4.7, the SelectorSpread plugin is replaced by PodTopologySpread. Some of the original SelectorSpread plugin functionality is emulated by the PodTopologySpread plugin. It's strongly recommended to switch to the PodTopologySpread plugin directly. In both cases, all nodes have to be correctly labeled from now on, as described in https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#prerequisites, so that both plugins work as expected.
```

or similar.

Comment 2 Jan Chaloupka 2021-07-12 09:40:51 UTC
Andrea, can you take a look at this?

Comment 3 Jan Chaloupka 2021-07-12 11:26:02 UTC
Known upstream issue: https://github.com/kubernetes/kubernetes/issues/102136

Comment 6 RamaKasturi 2021-08-10 06:36:02 UTC
Moving the bug to the verified state, as the required changes to the cluster that ensure pod replicas are spread properly are already documented in the 4.7 release notes and they work well.

Comment 8 errata-xmlrpc 2021-08-17 12:12:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.24 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3032

