Bug 1945387 - Image Registry deployment should have 2 replicas and hard anti-affinity rules
Summary: Image Registry deployment should have 2 replicas and hard anti-affinity rules
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Ricardo Maraschini
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 1973693 1986486
 
Reported: 2021-03-31 18:39 UTC by ravig
Modified: 2023-09-18 00:25 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Image registry pods were being scheduled on the same node.
Consequence: If the node hosting the pods had problems, the registry became unavailable for a while.
Fix: When there are only two replicas, the pods are now required (requiredDuringSchedulingIgnoredDuringExecution) to be scheduled on different nodes. If the number of replicas is higher than 2, they are preferred (preferredDuringSchedulingIgnoredDuringExecution) to run on different nodes, but this is not enforced. The fix also adds maxUnavailable (1) and maxSurge (1) rules when the number of replicas is 2.
Result: Image registry pods are fairly distributed among the nodes, allowing nodes to fail without making the registry unavailable.
Clone Of:
Environment:
Last Closed: 2021-07-27 22:57:00 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-image-registry-operator pull 681 (open): Bug 1945387: Setting required pod anti-affinity rules, last updated 2021-04-29 09:58:36 UTC
Red Hat Knowledge Base (Solution) 5397921, last updated 2021-09-22 15:36:03 UTC
Red Hat Product Errata RHSA-2021:2438, last updated 2021-07-27 22:57:28 UTC

Description ravig 2021-03-31 18:39:37 UTC
Description of problem:
Currently, all the image registry deployment replicas can land on a single node, which creates a single point of failure. To make the registry highly available, the operator should set the pod anti-affinity to `requiredDuringSchedulingIgnoredDuringExecution` instead of `preferredDuringSchedulingIgnoredDuringExecution`.
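
For illustration, a minimal sketch (Kubernetes manifest YAML) of the hard anti-affinity this bug asks for on the registry deployment; the label selector, namespace, and topology key are taken from the pod spec quoted in comment 6 below, and the spec the operator actually generates may differ in detail:

affinity:
  podAntiAffinity:
    # Hard rule: with this in place, two registry pods cannot share a node.
    # Selector and namespace assumed from the affinity JSON in comment 6.
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          docker-registry: default
      namespaces:
      - openshift-image-registry
      topologyKey: kubernetes.io/hostname

With a hard rule like this, the scheduler leaves the second replica Pending on a single-node cluster, which is the SNO behavior observed later in comment 5.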

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 XiuJuan Wang 2021-05-14 03:46:13 UTC
On a baremetal SNO cluster the default image registry has replicas=1. After raising replicas to 2, I checked whether the image registry pods were running.
The two new pods cannot run because they do not match the pod anti-affinity rules.

$oc get pods 
NAME                                               READY   STATUS      RESTARTS   AGE
cluster-image-registry-operator-7879c9b8bb-bwp8d   1/1     Running     5          17h
image-pruner-27015840-hcmhx                        0/1     Completed   0          3h43m
image-registry-5f5cbb89c4-qz7mv                    0/1     Pending     0          2m27s
image-registry-5f5cbb89c4-rg7rv                    0/1     Pending     0          2m27s
image-registry-66dd4f45fb-ljj8n                    1/1     Running     0          17h
node-ca-k4nxf                                      1/1     Running     0          17h


$oc describe pods image-registry-5f5cbb89c4-qz7mv

Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  5s (x5 over 2m20s)  default-scheduler  0/1 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules.

Verified on 4.8.0-0.nightly-2021-05-13-002125 cluster.

Comment 6 XiuJuan Wang 2021-05-14 06:03:03 UTC
On a cluster with multiple nodes (3 masters, 3 workers), when replicas is set to 3, the 3 image registry pods are sometimes all scheduled on the same node, since the pod anti-affinity follows the preferredDuringSchedulingIgnoredDuringExecution rule.

Should we consider this scenario?

$oc patch config.image cluster -p '{"spec":{"replicas":3}}' --type=merge

$oc get pod -o wide
NAME                                               READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-5b597c45bc-2msvq   1/1     Running   0          46m    10.129.0.68   ip-10-0-56-47.us-east-2.compute.internal    <none>           <none>
image-registry-77d5f86878-8f74x                    1/1     Running   0          60s    10.128.2.29   ip-10-0-77-236.us-east-2.compute.internal   <none>           <none>
image-registry-77d5f86878-gz2jh                    1/1     Running   0          81s    10.128.2.27   ip-10-0-77-236.us-east-2.compute.internal   <none>           <none>
image-registry-77d5f86878-w4pq6                    1/1     Running   0          70s    10.128.2.28   ip-10-0-77-236.us-east-2.compute.internal   <none>           <none>

$oc get pods image-registry-77d5f86878-4jtgc -o json  | jq -r '[.spec.affinity]'
[
  {
    "podAntiAffinity": {
      "preferredDuringSchedulingIgnoredDuringExecution": [
        {
          "podAffinityTerm": {
            "labelSelector": {
              "matchLabels": {
                "docker-registry": "default"
              }
            },
            "namespaces": [
              "openshift-image-registry"
            ],
            "topologyKey": "kubernetes.io/hostname"
          },
          "weight": 100
        }
      ]
    }
  }
]

Comment 7 Ricardo Maraschini 2021-05-14 07:35:45 UTC
When we have exactly two replicas we require (requiredDuringSchedulingIgnoredDuringExecution) the pods to be scheduled on different nodes. If the number of replicas is higher than 2, we prefer (preferredDuringSchedulingIgnoredDuringExecution) that they run on different nodes but do not enforce it. This fix also adds maxUnavailable (1) and maxSurge (1) rules when the number of replicas is 2.
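
For reference, the maxUnavailable/maxSurge values mentioned above map onto the standard Deployment rolling-update strategy fields; a minimal YAML sketch follows (the exact spec the operator generates may differ):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1  # at most one of the two registry pods may be down during a rollout
    maxSurge: 1        # one extra pod may be created temporarily during a rollout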

Comment 8 XiuJuan Wang 2021-05-14 09:35:39 UTC
Ricardo, thank you for explaining that.

Comment 10 Ganesh Gore 2021-06-17 17:56:20 UTC
Hi,

Will this be backported to 4.7?

I have a customer who is looking for similar flexibility to switch to hard anti-affinity rules.

Currently, the rule is set by default to "preferredDuringSchedulingIgnoredDuringExecution"; however, the customer would like to change it to "requiredDuringSchedulingIgnoredDuringExecution" so that the pods must be scheduled on different nodes when the condition matches.

The customer is running OCP 4.7.

Let me know if I need to open a separate bug for 4.7.

Thanks & Regards,
Ganesh Gore

Comment 12 errata-xmlrpc 2021-07-27 22:57:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 14 Red Hat Bugzilla 2023-09-18 00:25:32 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

