Bug 1981639 - Image registry bumps out N+1 pods when replicas is set to N (N>2); when Y (= number of workers) pods are scheduled to different workers, the remaining pods keep pending
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Image Registry
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Oleg Bulatov
QA Contact: XiuJuan Wang
URL:
Whiteboard:
Depends On:
Blocks: 2004028
TreeView+ depends on / blocked
 
Reported: 2021-07-13 04:06 UTC by XiuJuan Wang
Modified: 2022-10-12 02:05 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:38:57 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
reproducer.sh (819 bytes, text/plain)
2021-07-30 11:32 UTC, Oleg Bulatov
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-image-registry-operator pull 709 0 None None None 2021-08-11 10:52:07 UTC
Red Hat Bugzilla 1986486 1 high CLOSED Image Registry deployment should have 2 replicas and hard anti-affinity rules 2021-09-22 15:35:53 UTC
Red Hat Knowledge Base (Solution) 5397921 0 None None None 2021-09-22 15:37:06 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:38:59 UTC

Description XiuJuan Wang 2021-07-13 04:06:31 UTC
Description of problem:
There are 3 workers and the image registry replicas are set to 3. When the pods are scheduled to 3 different workers, a fourth pod appears and stays pending due to anti-affinity.
If the 3 pods are scheduled to the same worker, all 3 pods run and no fourth pod appears.


Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-12-143404
4.8.0-0.nightly-2021-07-09-181248

How reproducible:
90%

Steps to Reproduce:
1. Set the image registry replicas to 3
2. Check the image registry pods

Actual results:

$ oc get pods -n openshift-image-registry  -o wide
NAME                                               READY   STATUS      RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-6bf6f8bf9c-tscmp   1/1     Running     1          166m   10.129.0.19    ip-10-0-157-212.us-east-2.compute.internal   <none>           <none>
image-pruner-27102240-6rh66                        0/1     Completed   0          150m   10.129.2.8     ip-10-0-221-163.us-east-2.compute.internal   <none>           <none>
image-registry-7894857cd-csrpx                     1/1     Running     0          36m    10.131.0.15    ip-10-0-190-62.us-east-2.compute.internal    <none>           <none>
image-registry-7894857cd-sgdxg                     1/1     Running     0          36m    10.128.2.64    ip-10-0-128-6.us-east-2.compute.internal     <none>           <none>
image-registry-7894857cd-txwn9                     1/1     Running     0          4s     10.129.2.16    ip-10-0-221-163.us-east-2.compute.internal   <none>           <none>
image-registry-dddb4c944-4l92q                     0/1     Pending     0          4s     <none>         <none>                                       <none>           <none>

$ oc get pod image-registry-dddb4c944-4l92q -n openshift-image-registry  -o json | jq -r .status
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2021-07-13T02:30:00Z",
      "message": "0/6 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}

Expected results:
Only the configured number of pods should run.

Additional info:

Comment 1 XiuJuan Wang 2021-07-13 04:16:50 UTC
Must gather log http://virt-openshift-05.lab.eng.nay.redhat.com/xiuwang/1981639.tar.gz

Comment 2 Oleg Bulatov 2021-07-30 11:32:59 UTC
Created attachment 1809373 [details]
reproducer.sh

When the image registry operator updates the deployment

  * replicas: 2 -> 3
  * anti-affinity: requiredDuringSchedulingIgnoredDuringExecution -> preferredDuringSchedulingIgnoredDuringExecution

the deployment controller tries to create 3 pods with the required anti-affinity and 1 pod with the preferred one.

As I have 3 worker nodes, every node already has a pod with the required anti-affinity. The new pod with the preferred anti-affinity cannot be scheduled because all nodes are occupied, so the deployment cannot proceed.
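For reference, the two anti-affinity variants the operator switches between look roughly like this in the pod template spec. This is a sketch, not the operator's actual manifest: the field names follow the Kubernetes apps/v1 API, but the label selector is illustrative.

```yaml
affinity:
  podAntiAffinity:
    # Used at replicas == 2: the scheduler refuses to co-locate registry pods.
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          docker-registry: default   # illustrative selector
      topologyKey: kubernetes.io/hostname
    # Used at replicas >= 3: co-location is discouraged but allowed.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            docker-registry: default   # illustrative selector
        topologyKey: kubernetes.io/hostname
```

The key point is that a "preferred" pod can still be rejected by other pods' "required" rules: the existing required-anti-affinity pods veto any node that already hosts a registry pod.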

Output from ./reproducer.sh:
...
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
    deployment.kubernetes.io/revision: "2"
  generation: 2
  replicas: 2
          requiredDuringSchedulingIgnoredDuringExecution:
  replicas: 2
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^    name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "1"
      deployment.kubernetes.io/max-replicas: "2"
      deployment.kubernetes.io/revision: "1"
    name: cluster-image-registry-operator-7bbfb5995d
    replicas: 1
    replicas: 1
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "2"
    name: image-registry-5b4f6b6b6f
    replicas: 2
            requiredDuringSchedulingIgnoredDuringExecution:
    replicas: 2
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "1"
    name: image-registry-5f4c7bd8d5
    replicas: 0
            requiredDuringSchedulingIgnoredDuringExecution:
    replicas: 0
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"replicas":3}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
    deployment.kubernetes.io/revision: "3"
  generation: 3
  replicas: 3
          preferredDuringSchedulingIgnoredDuringExecution:
  replicas: 4
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^    name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "1"
      deployment.kubernetes.io/max-replicas: "2"
      deployment.kubernetes.io/revision: "1"
    name: cluster-image-registry-operator-7bbfb5995d
    replicas: 1
    replicas: 1
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "3"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "2"
    name: image-registry-5b4f6b6b6f
    replicas: 3
            requiredDuringSchedulingIgnoredDuringExecution:
    replicas: 3
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "1"
    name: image-registry-5f4c7bd8d5
    replicas: 0
            requiredDuringSchedulingIgnoredDuringExecution:
    replicas: 0
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "3"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "3"
    name: image-registry-79f5d7599f
    replicas: 1
            preferredDuringSchedulingIgnoredDuringExecution:
    replicas: 1

Comment 3 Filip Krepinsky 2021-08-04 12:23:58 UTC
The problem is with the cluster-image-registry-operator configuration.

When replicas are set to 2:

- PodAntiAffinity is set to RequiredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L394 )
- maxUnavailable is set to 1 ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/deployment.go#L99 )


but when we scale to 3:

- we change PodAntiAffinity to PreferredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L376 )
- maxUnavailable falls back to the default of 25%, which for 3 replicas is less than 1 pod (0.75, rounded down to 0)

1. the old ReplicaSet is scaled to 3 to honor the new maxUnavailable
2. a new ReplicaSet is created with 1 pod with PreferredDuringSchedulingIgnoredDuringExecution, but that pod cannot be scheduled because the old ReplicaSet's pods carry PodAntiAffinity with RequiredDuringSchedulingIgnoredDuringExecution

So the Deployment's own rules prevent it from progressing.
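The 0.75-pod arithmetic can be sketched as follows. `rollout_limits` is a hypothetical helper, not operator code; it applies the documented apps/v1 rounding (maxUnavailable rounds down, maxSurge rounds up):

```python
# Sketch of how the Deployment controller resolves percentage-based
# rollout limits: maxUnavailable rounds down, maxSurge rounds up.
import math

def rollout_limits(replicas, max_unavailable, max_surge):
    def resolve(value, round_up):
        # Percentages like "25%" are resolved against the replica count;
        # plain integers are used as-is.
        if isinstance(value, str) and value.endswith("%"):
            fraction = int(value[:-1]) / 100
            rounder = math.ceil if round_up else math.floor
            return rounder(replicas * fraction)
        return value

    unavailable = resolve(max_unavailable, round_up=False)
    surge = resolve(max_surge, round_up=True)
    return {
        "min_available": replicas - unavailable,  # old pods that must stay up
        "max_total": replicas + surge,            # pods allowed mid-rollout
    }

# Defaults (25%/25%) at 3 replicas: 0 pods may be unavailable, 1 may surge,
# so all 3 old required-anti-affinity pods stay and a 4th pod is created.
print(rollout_limits(3, "25%", "25%"))   # {'min_available': 3, 'max_total': 4}
# With maxUnavailable pinned to 1, one old pod can be removed first.
print(rollout_limits(3, 1, "25%"))       # {'min_available': 2, 'max_total': 4}
```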

To fix this, I propose setting maxUnavailable to 1 for 3-replica deployments as well, or tuning the PodAntiAffinity rules.
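The first option would roughly correspond to pinning the rollout budget in the Deployment strategy; this is a sketch of the idea, not the actual patch from the linked PR:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1   # allows one old pod to be removed, freeing a node
```

With one node freed, the new preferred-anti-affinity pod has somewhere to land and the rollout can proceed.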


@obulatov What do you think about these options?

Comment 11 wewang 2021-08-16 06:03:42 UTC
Verified in version
Version:
4.9.0-0.nightly-2021-08-14-065522

[wewang@localhost work]$ oc get pods -n openshift-image-registry  -o wide
NAME                                              READY   STATUS      RESTARTS        AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-5dc84c844-bzhlw   1/1     Running     1 (6h38m ago)   6h45m   10.128.0.31    ip-10-0-137-141.eu-west-2.compute.internal   <none>           <none>
image-pruner-27151200--1-rhvvn                    0/1     Completed   0               5h58m   10.128.2.15    ip-10-0-201-158.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-68vq8                    1/1     Running     0               42m     10.129.2.163   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-jzptq                    1/1     Running     0               2m40s   10.131.0.42    ip-10-0-156-75.eu-west-2.compute.internal    <none>           <none>
image-registry-7ddf5ccf9-nlncg                    1/1     Running     0               42m     10.129.2.162   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-v68ww                    1/1     Running     0               42m     10.129.2.164   ip-10-0-165-218.eu-west-2.compute.internal   <none>

Comment 16 errata-xmlrpc 2021-10-18 17:38:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

