Bug 1945387
Summary: | Image Registry deployment should have 2 replicas and hard anti-affinity rules | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | ravig <rgudimet>
Component: | Image Registry | Assignee: | Ricardo Maraschini <rmarasch>
Status: | CLOSED ERRATA | QA Contact: | XiuJuan Wang <xiuwang>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 4.8 | CC: | aos-bugs, gagore, mrobson, oarribas, obulatov, rmarasch, wewang, wking, xiuwang
Target Milestone: | --- | Keywords: | TestCaseNeeded
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: |
Cause: Image registry pods were being scheduled on the same node.
Consequence: If the node where the pods were scheduled had problems, the registry became unavailable for a while.
Fix: When we have only two replicas, we now require (requiredDuringSchedulingIgnoredDuringExecution) the pods to be scheduled on different nodes. If the number of replicas is higher than 2, we prefer (preferredDuringSchedulingIgnoredDuringExecution) that they run on different nodes but do not enforce it. This fix also added maxUnavailable (1) and maxSurge (1) rules for when the number of replicas is 2 (a minimal manifest sketch follows the header table below).
Result: Image registry pods are fairly distributed among the nodes, allowing a node to fail without making the registry unavailable.
|
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2021-07-27 22:57:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1973693, 1986486 | |
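As referenced in the Doc Text above, here is a minimal sketch of the scheduling constraints described for the two-replica case. It is a hypothetical Deployment fragment: the label selector and namespace are taken from the affinity JSON quoted in the comments below, and the manifest the operator actually generates may differ in detail.

```yaml
# Hypothetical fragment of the image-registry Deployment with replicas: 2.
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # added by the fix when replicas == 2
      maxSurge: 1         # added by the fix when replicas == 2
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule for replicas == 2: each pod must land on a different node.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                docker-registry: default
            namespaces:
            - openshift-image-registry
            topologyKey: kubernetes.io/hostname
```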
Description (ravig, 2021-03-31 18:39:37 UTC)
On a bare-metal SNO cluster, the default image registry has replicas=1. After raising replicas to 2 and checking the image registry pods, the two new pods cannot run because they do not match the pod anti-affinity rules.

```
$ oc get pods
NAME                                               READY   STATUS      RESTARTS   AGE
cluster-image-registry-operator-7879c9b8bb-bwp8d   1/1     Running     5          17h
image-pruner-27015840-hcmhx                        0/1     Completed   0          3h43m
image-registry-5f5cbb89c4-qz7mv                    0/1     Pending     0          2m27s
image-registry-5f5cbb89c4-rg7rv                    0/1     Pending     0          2m27s
image-registry-66dd4f45fb-ljj8n                    1/1     Running     0          17h
node-ca-k4nxf                                      1/1     Running     0          17h

$ oc describe pods image-registry-5f5cbb89c4-qz7mv
Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  5s (x5 over 2m20s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules.
```

Verified on a 4.8.0-0.nightly-2021-05-13-002125 cluster.

On a multi-node cluster (3 masters, 3 workers), when replicas is set to 3, the 3 image registry pods sometimes schedule on the same node, since the pod anti-affinity follows preferredDuringSchedulingIgnoredDuringExecution rules. Should we consider this scenario?

```
$ oc patch config.image cluster -p '{"spec":{"replicas":3}}' --type=merge

$ oc get pod -o wide
NAME                                               READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-5b597c45bc-2msvq   1/1     Running   0          46m   10.129.0.68   ip-10-0-56-47.us-east-2.compute.internal    <none>           <none>
image-registry-77d5f86878-8f74x                    1/1     Running   0          60s   10.128.2.29   ip-10-0-77-236.us-east-2.compute.internal   <none>           <none>
image-registry-77d5f86878-gz2jh                    1/1     Running   0          81s   10.128.2.27   ip-10-0-77-236.us-east-2.compute.internal   <none>           <none>
image-registry-77d5f86878-w4pq6                    1/1     Running   0          70s   10.128.2.28   ip-10-0-77-236.us-east-2.compute.internal   <none>           <none>

$ oc get pods image-registry-77d5f86878-4jtgc -o json | jq -r '[.spec.affinity]'
[
  {
    "podAntiAffinity": {
      "preferredDuringSchedulingIgnoredDuringExecution": [
        {
          "podAffinityTerm": {
            "labelSelector": {
              "matchLabels": {
                "docker-registry": "default"
              }
            },
            "namespaces": [
              "openshift-image-registry"
            ],
            "topologyKey": "kubernetes.io/hostname"
          },
          "weight": 100
        }
      ]
    }
  }
]
```

When we have two replicas, we start to require (requiredDuringSchedulingIgnoredDuringExecution) the pods to be scheduled on different nodes. If the number of replicas is higher than 2, we prefer (preferredDuringSchedulingIgnoredDuringExecution) that they run on different nodes but do not enforce it. This fix also added maxUnavailable (1) and maxSurge (1) rules for when the number of replicas is 2.

Ricardo, thank you for explaining that.

Hi,

Will this be backported to 4.7? I have a customer who is looking for similar flexibility to change the hard anti-affinity rules. Currently the default is "preferredDuringSchedulingIgnoredDuringExecution"; however, the customer would like to change it to "requiredDuringSchedulingIgnoredDuringExecution" so that the pods must get scheduled on different nodes if the condition matches. The customer is running OCP 4.7. Let me know if I need to open a separate bug for 4.7.

Thanks & Regards,
Ganesh Gore

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
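For contrast with the required rule sketched after the header table, here is the preferred variant that the affinity JSON quoted above corresponds to, rewritten as a hypothetical manifest fragment for the replicas > 2 case; field values follow that JSON output, not a confirmed operator manifest.

```yaml
# Hypothetical fragment of the image-registry Deployment with replicas > 2.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule for replicas > 2: spreading across nodes is preferred
          # but not enforced, so pods can still be co-located (as seen in
          # the replicas=3 verification comment above).
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  docker-registry: default
              namespaces:
              - openshift-image-registry
              topologyKey: kubernetes.io/hostname
```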