Description of problem:

In OpenShift 4.7, the default scheduler switched from the SelectorSpread plugin to the new PodTopologySpread plugin. SelectorSpread tries to schedule replicas on different nodes. PodTopologySpread's internal default requires two node labels, kubernetes.io/hostname and topology.kubernetes.io/zone; it spreads replicas based on these labels and yields essentially the same result as SelectorSpread. However, PodTopologySpread does not work when the required labels are not defined. This change does not appear to be well documented for 4.7, which puts many users at risk of non-HA pod placement with no tolerance for node-level failures. Most bare-metal UPI users, who have no zone label and rely on the default scheduler, will be affected; possibly RHV and vSphere users as well.

Version-Release number of selected component (if applicable): 4.7

How reproducible: Always

Steps to Reproduce:

1. Create pods on imbalanced nodes (three compute nodes in this case) and confirm the small pods spread across at least two nodes:

```
oc create deployment large --image gcr.io/google-samples/node-hello:1.0 --replicas 0
oc set resources deploy/large --limits=cpu=2,memory=4Gi
oc scale deploy/large --replicas=2
oc create deployment small --image gcr.io/google-samples/node-hello:1.0 --replicas 0
oc set resources deploy/small --limits=cpu=0.1,memory=128Mi
oc scale deploy/small --replicas 6
```

2. Remove the zone label and re-test the small pod scheduling. All pods go to one node:
```
$ oc label nodes -l node-role.kubernetes.io/worker topology.kubernetes.io/zone-
$ oc delete pod -l app=small
$ oc get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP            NODE                                              NOMINATED NODE   READINESS GATES
large-794448d7c7-45tr4   1/1     Running   0          6m36s   10.129.2.27   ip-10-0-128-109.ap-northeast-1.compute.internal   <none>           <none>
large-794448d7c7-d7p87   1/1     Running   0          6m4s    10.128.2.53   ip-10-0-183-92.ap-northeast-1.compute.internal    <none>           <none>
small-8c4c9dc99-4ggft    1/1     Running   0          34s     10.130.2.48   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-fp2w5    1/1     Running   0          34s     10.130.2.50   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-mgldf    1/1     Running   0          35s     10.130.2.47   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-nd72c    1/1     Running   0          35s     10.130.2.46   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-wszcq    1/1     Running   0          35s     10.130.2.49   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
small-8c4c9dc99-zp6g4    1/1     Running   0          34s     10.130.2.51   ip-10-0-185-241.ap-northeast-1.compute.internal   <none>           <none>
```

Actual results: Without the zone label, the pods do not spread; all six small pods land on a single node.

Expected results: Pods spread across multiple hosts.

Additional info:
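Before relying on the default scheduler's spreading in 4.7, it is worth confirming which topology labels the nodes actually carry. A minimal check (requires a live cluster; the `-L`/`--label-columns` flag prints the given label values as extra columns):

```shell
# Show the two labels PodTopologySpread's internal default relies on;
# an empty ZONE column means the zone label is missing on that node
oc get nodes -L kubernetes.io/hostname -L topology.kubernetes.io/zone
```

kubernetes.io/hostname is set by the kubelet on every node, so in practice it is the zone column that reveals the gap on bare-metal UPI clusters.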
> The PodTopologySpread does not work when the required labels are not defined.
>
> It seems this change is not well documented in 4.7, putting a lot of users under risks of non-HA, no node level failure tolerant pod placement.
>
> Most of Baremetal UPI users who don't have zone label and rely on the default scheduler will be affected. Possibly RHV and vSphere users as well.

This is a well-known limitation, documented upstream: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#prerequisites. It is also mentioned in https://docs.openshift.com/container-platform/4.7/nodes/scheduling/nodes-scheduler-pod-topology-spread-constraints.html#nodes-scheduler-pod-topology-spread-constraints-configuring_nodes-scheduler-pod-topology-spread-constraints:

```
Prerequisites
A cluster administrator has added the required labels to nodes.
```

It is true that the migration from SelectorSpread to PodTopologySpread is performed under the hood, so a user may not know in advance that all relevant nodes have to carry the labels for PodTopologySpread to work properly. Something we might stress in the release notes, e.g.:

```
Starting with 4.7, the SelectorSpread plugin is replaced by PodTopologySpread. Some of the original SelectorSpread functionality is emulated by the PodTopologySpread plugin. It is strongly recommended to switch to the PodTopologySpread plugin directly. In either case, all nodes now have to be labeled correctly, as described in https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/#prerequisites, so that both plugins work as expected.
```

or similar.
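For workloads that cannot wait for nodes to be labeled, a deployment can opt into explicit spreading on the hostname label alone, which the kubelet sets on every node. A minimal sketch (the deployment name and app label mirror the reproducer above and are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small            # illustrative, matching the reproducer
spec:
  replicas: 6
  selector:
    matchLabels:
      app: small
  template:
    metadata:
      labels:
        app: small
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        # kubernetes.io/hostname exists on every node, so this
        # constraint works even when no zone label is defined
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: small
      containers:
      - name: node-hello
        image: gcr.io/google-samples/node-hello:1.0
```

With `whenUnsatisfiable: ScheduleAnyway` the constraint only influences scoring, like the old SelectorSpread behavior; `DoNotSchedule` would make it a hard requirement.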
Andrea, can you take a look at this?
Known upstream issue: https://github.com/kubernetes/kubernetes/issues/102136
Moving the bug to the verified state, as the required cluster changes that ensure pod replicas are spread properly are already present in the 4.7 release notes and work well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.24 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3032