Description of problem:
There are 3 workers, and imageregistry replicas is set to 3. When the pods are scheduled to 3 different workers, a fourth pod appears and stays Pending due to anti-affinity. If all 3 pods are scheduled to the same worker, all 3 pods are Running and no fourth pod appears.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-12-143404
4.8.0-0.nightly-2021-07-09-181248

How reproducible:
90%

Steps to Reproduce:
1. Set imageregistry replicas to 3
2. Check the image registry pods

Actual results:

$ oc get pods -n openshift-image-registry -o wide
NAME                                               READY   STATUS      RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-6bf6f8bf9c-tscmp   1/1     Running     1          166m   10.129.0.19   ip-10-0-157-212.us-east-2.compute.internal   <none>           <none>
image-pruner-27102240-6rh66                        0/1     Completed   0          150m   10.129.2.8    ip-10-0-221-163.us-east-2.compute.internal   <none>           <none>
image-registry-7894857cd-csrpx                     1/1     Running     0          36m    10.131.0.15   ip-10-0-190-62.us-east-2.compute.internal    <none>           <none>
image-registry-7894857cd-sgdxg                     1/1     Running     0          36m    10.128.2.64   ip-10-0-128-6.us-east-2.compute.internal     <none>           <none>
image-registry-7894857cd-txwn9                     1/1     Running     0          4s     10.129.2.16   ip-10-0-221-163.us-east-2.compute.internal   <none>           <none>
image-registry-dddb4c944-4l92q                     0/1     Pending     0          4s     <none>        <none>                                       <none>           <none>

$ oc get pod image-registry-dddb4c944-4l92q -n openshift-image-registry -o json | jq -r .status
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2021-07-13T02:30:00Z",
      "message": "0/6 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}

Expected results:
Only the configured number of pods should run.

Additional info:
Must gather log http://virt-openshift-05.lab.eng.nay.redhat.com/xiuwang/1981639.tar.gz
Created attachment 1809373 [details]
reproducer.sh

When the image registry operator updates the deployment with

* replicas: 2 -> 3
* anti-affinity: requiredDuringSchedulingIgnoredDuringExecution -> preferredDuringSchedulingIgnoredDuringExecution

the deployment controller tries to create 3 pods with the required anti-affinity and 1 pod with the preferred one. As I have 3 worker nodes, all my nodes already host pods with the required anti-affinity. The new pod with the preferred anti-affinity cannot be scheduled because all nodes are occupied, so the deployment cannot proceed.

Output from ./reproducer.sh:

...
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
    deployment.kubernetes.io/revision: "2"
  generation: 2
  replicas: 2
          requiredDuringSchedulingIgnoredDuringExecution:
  replicas: 2
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^ name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "1"
      deployment.kubernetes.io/max-replicas: "2"
      deployment.kubernetes.io/revision: "1"
    name: cluster-image-registry-operator-7bbfb5995d
  replicas: 1
  replicas: 1
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "2"
    name: image-registry-5b4f6b6b6f
  replicas: 2
          requiredDuringSchedulingIgnoredDuringExecution:
  replicas: 2
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "1"
    name: image-registry-5f4c7bd8d5
  replicas: 0
          requiredDuringSchedulingIgnoredDuringExecution:
  replicas: 0
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"replicas":3}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
    deployment.kubernetes.io/revision: "3"
  generation: 3
  replicas: 3
          preferredDuringSchedulingIgnoredDuringExecution:
  replicas: 4
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^ name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "1"
      deployment.kubernetes.io/max-replicas: "2"
      deployment.kubernetes.io/revision: "1"
    name: cluster-image-registry-operator-7bbfb5995d
  replicas: 1
  replicas: 1
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "3"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "2"
    name: image-registry-5b4f6b6b6f
  replicas: 3
          requiredDuringSchedulingIgnoredDuringExecution:
  replicas: 3
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "2"
      deployment.kubernetes.io/max-replicas: "3"
      deployment.kubernetes.io/revision: "1"
    name: image-registry-5f4c7bd8d5
  replicas: 0
          requiredDuringSchedulingIgnoredDuringExecution:
  replicas: 0
- apiVersion: apps/v1
      deployment.kubernetes.io/desired-replicas: "3"
      deployment.kubernetes.io/max-replicas: "4"
      deployment.kubernetes.io/revision: "3"
    name: image-registry-79f5d7599f
  replicas: 1
          preferredDuringSchedulingIgnoredDuringExecution:
  replicas: 1
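For reference, the two anti-affinity shapes involved look roughly like this. This is a hand-written sketch, not copied from the operator; the `kubernetes.io/hostname` topology key and the `docker-registry: default` label selector are illustrative assumptions.

```yaml
# Old ReplicaSet pods (replicas <= 2): hard anti-affinity.
# The scheduler refuses to co-locate matching pods on one node.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname
      labelSelector:
        matchLabels:
          docker-registry: default   # illustrative selector
---
# New ReplicaSet pods (replicas >= 3): soft anti-affinity.
# The scheduler merely prefers to spread matching pods out.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            docker-registry: default   # illustrative selector
```

Note that the preferred rule only lowers a node's score; it is the *required* rule on the still-running old pods that blocks the new pod from landing anywhere.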
The problem is with the cluster-image-registry-operator configuration.

When replicas is set to 2:
- PodAntiAffinity is set to RequiredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L394 )
- maxUnavailable is set to 1 ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/deployment.go#L99 )

But when we scale to 3:
- PodAntiAffinity changes to PreferredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L376 )
- maxUnavailable falls back to the default 25%, which is not even 1 pod (25% of 3 replicas is 0.75, and it rounds down)

As a result:
1. The old ReplicaSet is scaled to 3 to honor the new maxUnavailable.
2. The new ReplicaSet is created with 1 pod using PreferredDuringSchedulingIgnoredDuringExecution, but that pod cannot be scheduled because the old ReplicaSet's pods still carry RequiredDuringSchedulingIgnoredDuringExecution anti-affinity and occupy all three workers.

So the rules of the Deployment prevent it from progressing. To fix this I propose either setting maxUnavailable to 1 for 3-replica deployments as well, or tuning the PodAntiAffinity rules.

@obulatov What do you think about these options?
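The rounding behavior above can be sketched as follows. This is a minimal illustration, not operator code; `resolve_rolling_update` is a hypothetical helper that mirrors how Kubernetes resolves percentage values in a Deployment's rolling-update strategy (maxUnavailable rounds down, maxSurge rounds up).

```python
import math

def resolve_rolling_update(replicas, max_unavailable_pct, max_surge_pct):
    """Resolve percentage-based rolling-update parameters to pod counts.

    Kubernetes rounds maxUnavailable DOWN and maxSurge UP, so with the
    default 25%/25% and 3 replicas no old pod may be taken down, while
    only one surge pod may be created.
    """
    max_unavailable = math.floor(replicas * max_unavailable_pct / 100)
    max_surge = math.ceil(replicas * max_surge_pct / 100)
    return max_unavailable, max_surge

# replicas=3 with the defaults: maxUnavailable=0, maxSurge=1.
# The old ReplicaSet must keep all 3 pods (each with the required
# anti-affinity), and the single surge pod has nowhere to schedule.
print(resolve_rolling_update(3, 25, 25))  # (0, 1)
```

With the operator's explicit maxUnavailable of 1 (as used for 2 replicas), one old pod could be removed first, freeing a node for the new pod — which is why keeping maxUnavailable at 1 for 3 replicas is one of the proposed fixes.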
Verified in version:
4.9.0-0.nightly-2021-08-14-065522

[wewang@localhost work]$ oc get pods -n openshift-image-registry -o wide
NAME                                              READY   STATUS      RESTARTS        AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-5dc84c844-bzhlw   1/1     Running     1 (6h38m ago)   6h45m   10.128.0.31    ip-10-0-137-141.eu-west-2.compute.internal   <none>           <none>
image-pruner-27151200--1-rhvvn                    0/1     Completed   0               5h58m   10.128.2.15    ip-10-0-201-158.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-68vq8                    1/1     Running     0               42m     10.129.2.163   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-jzptq                    1/1     Running     0               2m40s   10.131.0.42    ip-10-0-156-75.eu-west-2.compute.internal    <none>           <none>
image-registry-7ddf5ccf9-nlncg                    1/1     Running     0               42m     10.129.2.162   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-v68ww                    1/1     Running     0               42m     10.129.2.164   ip-10-0-165-218.eu-west-2.compute.internal   <none>
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759