Bug 1981639

Summary: Image registry rolls out N+1 pods when replicas is set to N (N>2); when the N pods (N = number of workers) are scheduled to different workers, the extra pod stays Pending
Product: OpenShift Container Platform    Reporter: XiuJuan Wang <xiuwang>
Component: Image Registry    Assignee: Oleg Bulatov <obulatov>
Status: CLOSED ERRATA    QA Contact: XiuJuan Wang <xiuwang>
Severity: high    Docs Contact:
Priority: high    
Version: 4.9    CC: aos-bugs, fkrepins, gagore, oarribas, obulatov, wewang
Target Milestone: ---   
Target Release: 4.9.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-18 17:38:57 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 2004028    
Attachments:
Description Flags
reproducer.sh none

Description XiuJuan Wang 2021-07-13 04:06:31 UTC
Description of problem:
The cluster has 3 workers, and the image registry replicas are set to 3. When the 3 pods are scheduled to 3 different workers, a fourth pod is created and stays Pending due to anti-affinity rules.
If the 3 pods are scheduled to the same worker, all 3 pods run and no fourth pod is created.


Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-12-143404
4.8.0-0.nightly-2021-07-09-181248

How reproducible:
90%

Steps to Reproduce:
1. Set the image registry replicas to 3
2. Check the image registry pods
3. Check the status of any Pending pod
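Step 1 can be done by patching the operator config; this sketch reuses the same `oc patch` command that appears in the reproducer output in comment 2 below:

```shell
# Scale the image registry to 3 replicas via the operator config
oc patch configs.imageregistry.operator.openshift.io/cluster \
  --type merge -p '{"spec":{"replicas":3}}'

# Watch the registry pods; an extra pod appears and stays Pending
oc get pods -n openshift-image-registry -o wide
```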

Actual results:

$ oc get pods -n openshift-image-registry  -o wide
NAME                                               READY   STATUS      RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-6bf6f8bf9c-tscmp   1/1     Running     1          166m   10.129.0.19    ip-10-0-157-212.us-east-2.compute.internal   <none>           <none>
image-pruner-27102240-6rh66                        0/1     Completed   0          150m   10.129.2.8     ip-10-0-221-163.us-east-2.compute.internal   <none>           <none>
image-registry-7894857cd-csrpx                     1/1     Running     0          36m    10.131.0.15    ip-10-0-190-62.us-east-2.compute.internal    <none>           <none>
image-registry-7894857cd-sgdxg                     1/1     Running     0          36m    10.128.2.64    ip-10-0-128-6.us-east-2.compute.internal     <none>           <none>
image-registry-7894857cd-txwn9                     1/1     Running     0          4s     10.129.2.16    ip-10-0-221-163.us-east-2.compute.internal   <none>           <none>
image-registry-dddb4c944-4l92q                     0/1     Pending     0          4s     <none>         <none>                                       <none>           <none>

$ oc get pod image-registry-dddb4c944-4l92q -n openshift-image-registry  -o json | jq -r .status
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2021-07-13T02:30:00Z",
      "message": "0/6 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}

Expected results:
Only the configured number of pods (3) should be created and run.

Additional info:

Comment 1 XiuJuan Wang 2021-07-13 04:16:50 UTC
Must gather log http://virt-openshift-05.lab.eng.nay.redhat.com/xiuwang/1981639.tar.gz

Comment 2 Oleg Bulatov 2021-07-30 11:32:59 UTC
Created attachment 1809373 [details]
reproducer.sh

When the image registry operator updates the deployment

  * replicas: 2 -> 3
  * anti-affinity: requiredDuringSchedulingIgnoredDuringExecution -> preferredDuringSchedulingIgnoredDuringExecution

the deployment controller tries to create 3 pods with the required anti-affinity and 1 pod with preferred.

As I have 3 worker nodes, all my nodes have pods with the required anti-affinity. The new pod with the preferred anti-affinity cannot be scheduled as all nodes are occupied, so the deployment cannot proceed.
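For reference, the two anti-affinity variants the operator emits look roughly like the following. This is a simplified sketch: the label selector and topologyKey are illustrative, not copied verbatim from the operator's pod template.

```yaml
# replicas <= 2: hard rule, at most one registry pod per node
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          docker-registry: default   # illustrative selector
      topologyKey: kubernetes.io/hostname

# replicas >= 3: soft rule, co-location allowed if no free node
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            docker-registry: default   # illustrative selector
        topologyKey: kubernetes.io/hostname
```

The scheduler treats these differently: a required term is a hard constraint that blocks scheduling, while a preferred term only influences node scoring. Old pods carrying the required term therefore block the new preferred-term pod from every occupied node.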

Output from ./reproducer.sh:
...
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
    deployment.kubernetes.io/revision: "2"
  generation: 2
  replicas: 2
          requiredDuringSchedulingIgnoredDuringExecution:
  replicas: 2
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^    name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "1"
      deployment.kubernetes.io/max-replicas: "2"
      deployment.kubernetes.io/revision: "1"
    name: cluster-image-registry-operator-7bbfb5995d
    replicas: 1
    replicas: 1
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "2"
    name: image-registry-5b4f6b6b6f
    replicas: 2
            requiredDuringSchedulingIgnoredDuringExecution:
    replicas: 2
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "1"
    name: image-registry-5f4c7bd8d5
    replicas: 0
            requiredDuringSchedulingIgnoredDuringExecution:
    replicas: 0
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"replicas":3}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
    deployment.kubernetes.io/revision: "3"
  generation: 3
  replicas: 3
          preferredDuringSchedulingIgnoredDuringExecution:
  replicas: 4
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^    name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "1"
      deployment.kubernetes.io/max-replicas: "2"
      deployment.kubernetes.io/revision: "1"
    name: cluster-image-registry-operator-7bbfb5995d
    replicas: 1
    replicas: 1
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "3"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "2"
    name: image-registry-5b4f6b6b6f
    replicas: 3
            requiredDuringSchedulingIgnoredDuringExecution:
    replicas: 3
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "1"
    name: image-registry-5f4c7bd8d5
    replicas: 0
            requiredDuringSchedulingIgnoredDuringExecution:
    replicas: 0
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "3"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "3"
    name: image-registry-79f5d7599f
    replicas: 1
            preferredDuringSchedulingIgnoredDuringExecution:
    replicas: 1

Comment 3 Filip Krepinsky 2021-08-04 12:23:58 UTC
The problem is in the cluster-image-registry-operator configuration.

When replicas is set to 2:

- PodAntiAffinity is set to RequiredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L394 )
- maxUnavailable is set to 1 ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/deployment.go#L99 )

But when we scale to 3:

- PodAntiAffinity is changed to PreferredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L376 )
- maxUnavailable falls back to the default 25%, which is less than 1 pod (0.75 to be precise) and is rounded down to 0

As a result:

1. the old ReplicaSet is scaled to 3 to honor the new maxUnavailable
2. a new ReplicaSet is created with 1 pod using PreferredDuringSchedulingIgnoredDuringExecution, but that pod cannot be scheduled because the old ReplicaSet's pods still carry the RequiredDuringSchedulingIgnoredDuringExecution PodAntiAffinity

So the rules of the Deployment prevent it from progressing.

To fix this, I propose setting maxUnavailable to 1 for 3-replica deployments as well, or tuning the PodAntiAffinity rules.
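The maxUnavailable rounding can be checked with simple integer arithmetic: Kubernetes rounds a percentage maxUnavailable down, so 25% of 3 replicas permits zero unavailable pods, which forces the old ReplicaSet to stay at full size during the rollout. A quick sketch:

```shell
replicas=3
percent=25
# Kubernetes rounds percentage maxUnavailable values DOWN,
# so 25% of 3 replicas -> floor(0.75) = 0 pods may be unavailable.
max_unavailable=$(( replicas * percent / 100 ))
echo "maxUnavailable=$max_unavailable"   # prints maxUnavailable=0
```

With maxUnavailable effectively 0, the Deployment cannot take down any old required-anti-affinity pod before the new preferred-anti-affinity pod becomes Ready, and that new pod can never schedule, so the rollout deadlocks.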


@obulatov What do you think about these options?

Comment 11 wewang 2021-08-16 06:03:42 UTC
Verified in version:
4.9.0-0.nightly-2021-08-14-065522

[wewang@localhost work]$ oc get pods -n openshift-image-registry  -o wide
NAME                                              READY   STATUS      RESTARTS        AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-5dc84c844-bzhlw   1/1     Running     1 (6h38m ago)   6h45m   10.128.0.31    ip-10-0-137-141.eu-west-2.compute.internal   <none>           <none>
image-pruner-27151200--1-rhvvn                    0/1     Completed   0               5h58m   10.128.2.15    ip-10-0-201-158.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-68vq8                    1/1     Running     0               42m     10.129.2.163   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-jzptq                    1/1     Running     0               2m40s   10.131.0.42    ip-10-0-156-75.eu-west-2.compute.internal    <none>           <none>
image-registry-7ddf5ccf9-nlncg                    1/1     Running     0               42m     10.129.2.162   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-v68ww                    1/1     Running     0               42m     10.129.2.164   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>

Comment 16 errata-xmlrpc 2021-10-18 17:38:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759