Bug 1981639
| Summary: | Image registry brings up N+1 pods when replicas is set to N (N>2); when N (= number of workers) pods are scheduled to different workers, the remaining pod stays Pending |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | XiuJuan Wang <xiuwang> |
| Component: | Image Registry |
| Assignee: | Oleg Bulatov <obulatov> |
| Status: | CLOSED ERRATA |
| QA Contact: | XiuJuan Wang <xiuwang> |
| Severity: | high |
| Docs Contact: | |
| Priority: | high |
| Version: | 4.9 |
| CC: | aos-bugs, fkrepins, gagore, oarribas, obulatov, wewang |
| Target Milestone: | --- |
| Target Release: | 4.9.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | If docs needed, set a value |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2021-10-18 17:38:57 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 2004028 |
| Attachments: | |
Created attachment 1809373: reproducer.sh
When the image registry operator updates the deployment with
* replicas: 2 -> 3
* anti-affinity: requiredDuringSchedulingIgnoredDuringExecution -> preferredDuringSchedulingIgnoredDuringExecution
the deployment controller tries to run 3 pods with the required anti-affinity (the old ReplicaSet) and 1 pod with the preferred anti-affinity (the new ReplicaSet).
As I have 3 worker nodes, every worker already runs a pod with the required anti-affinity. The new pod with the preferred anti-affinity cannot be scheduled, because placing it on any of those nodes would violate the existing pods' required anti-affinity, so the deployment cannot proceed.
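A quick way to see this mixed state on the pods themselves (a side check of mine, not part of reproducer.sh; it only relies on the standard pod affinity fields and filters by the image-registry- name prefix):

# List each registry pod together with the kind of pod anti-affinity it carries.
oc get pods -n openshift-image-registry -o json \
  | jq -r '.items[]
      | select(.metadata.name | startswith("image-registry-"))
      | [.metadata.name, ((.spec.affinity.podAntiAffinity // {}) | keys | join(","))]
      | @tsv'

Pods from the old ReplicaSet should report requiredDuringSchedulingIgnoredDuringExecution, while the pod from the new ReplicaSet should report preferredDuringSchedulingIgnoredDuringExecution, matching the ReplicaSet dumps below.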
Output from ./reproducer.sh:
...
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
deployment.kubernetes.io/revision: "2"
generation: 2
replicas: 2
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 2
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^ name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "1"
deployment.kubernetes.io/max-replicas: "2"
deployment.kubernetes.io/revision: "1"
name: cluster-image-registry-operator-7bbfb5995d
replicas: 1
replicas: 1
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "2"
deployment.kubernetes.io/max-replicas: "3"
deployment.kubernetes.io/revision: "2"
name: image-registry-5b4f6b6b6f
replicas: 2
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 2
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "2"
deployment.kubernetes.io/max-replicas: "3"
deployment.kubernetes.io/revision: "1"
name: image-registry-5f4c7bd8d5
replicas: 0
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 0
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"replicas":3}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
deployment.kubernetes.io/revision: "3"
generation: 3
replicas: 3
preferredDuringSchedulingIgnoredDuringExecution:
replicas: 4
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^ name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "1"
deployment.kubernetes.io/max-replicas: "2"
deployment.kubernetes.io/revision: "1"
name: cluster-image-registry-operator-7bbfb5995d
replicas: 1
replicas: 1
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "3"
deployment.kubernetes.io/max-replicas: "4"
deployment.kubernetes.io/revision: "2"
name: image-registry-5b4f6b6b6f
replicas: 3
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 3
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "2"
deployment.kubernetes.io/max-replicas: "3"
deployment.kubernetes.io/revision: "1"
name: image-registry-5f4c7bd8d5
replicas: 0
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 0
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "3"
deployment.kubernetes.io/max-replicas: "4"
deployment.kubernetes.io/revision: "3"
name: image-registry-79f5d7599f
replicas: 1
preferredDuringSchedulingIgnoredDuringExecution:
replicas: 1
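To connect the desired-replicas/max-replicas annotations above with the rollout strategy, the deployment's RollingUpdate parameters can be read directly (a follow-up query, not part of reproducer.sh; it only uses standard Deployment fields):

# Show how many pods the rollout may surge above / take below spec.replicas.
oc get deploy image-registry -n openshift-image-registry \
  -o jsonpath='replicas={.spec.replicas} rollingUpdate={.spec.strategy.rollingUpdate}{"\n"}'

Kubernetes rounds a percentage maxUnavailable down and a percentage maxSurge up, so a defaulted 25% on 3 replicas allows 0 unavailable pods and 1 surge pod, which matches the max-replicas: "4" annotation above.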
The problem is with the cluster-image-registry-operator configuration.

When replicas is set to 2:
- PodAntiAffinity is set to RequiredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L394 )
- maxUnavailable is set to 1 ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/deployment.go#L99 )

But when we scale to 3:
- PodAntiAffinity is changed to PreferredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L376 )
- maxUnavailable falls back to the default 25%, which is not even 1 pod (0.75 to be precise) and therefore rounds down to 0

As a result:
1. The old ReplicaSet is scaled to 3 to honor the new maxUnavailable.
2. The new ReplicaSet is created with 1 pod with PreferredDuringSchedulingIgnoredDuringExecution, but that pod cannot be scheduled because the old ReplicaSet's pods still carry the RequiredDuringSchedulingIgnoredDuringExecution PodAntiAffinity.

So the rules of the Deployment prevent it from progressing. To fix this I propose to either set maxUnavailable to 1 for 3-replica deployments as well, or tune the PodAntiAffinity rules.

@obulatov What do you think about these options?

Verified in version 4.9.0-0.nightly-2021-08-14-065522:

[wewang@localhost work]$ oc get pods -n openshift-image-registry -o wide
NAME                                              READY   STATUS      RESTARTS        AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-5dc84c844-bzhlw   1/1     Running     1 (6h38m ago)   6h45m   10.128.0.31    ip-10-0-137-141.eu-west-2.compute.internal   <none>           <none>
image-pruner-27151200--1-rhvvn                    0/1     Completed   0               5h58m   10.128.2.15    ip-10-0-201-158.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-68vq8                    1/1     Running     0               42m     10.129.2.163   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-jzptq                    1/1     Running     0               2m40s   10.131.0.42    ip-10-0-156-75.eu-west-2.compute.internal    <none>           <none>
image-registry-7ddf5ccf9-nlncg                    1/1     Running     0               42m     10.129.2.162   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-v68ww                    1/1     Running     0               42m     10.129.2.164   ip-10-0-165-218.eu-west-2.compute.internal   <none>

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
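For anyone checking a cluster for this symptom, a quick spot-check looks like this (standard oc usage, not taken from the verification above):

# Reports whether the image-registry rollout completes or stays stuck waiting for pods.
oc rollout status deploy/image-registry -n openshift-image-registry --timeout=2m
# Confirm all registry pods are Running and none is left Pending.
oc get pods -n openshift-image-registry -o wide | grep '^image-registry-'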
Description of problem:
There are 3 workers and the image registry replicas is set to 3. When the pods are scheduled to 3 different workers, a fourth pod is created and stays Pending because of the pod anti-affinity rules. If the 3 pods are scheduled to the same worker, all 3 pods run and no fourth pod is created.

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-07-12-143404
4.8.0-0.nightly-2021-07-09-181248

How reproducible:
90%

Steps to Reproduce:
1. Set the image registry replicas to 3 (see the command sketch at the end of this report)
2. Check the image registry pods

Actual results:
$ oc get pods -n openshift-image-registry -o wide
NAME                                               READY   STATUS      RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-6bf6f8bf9c-tscmp   1/1     Running     1          166m   10.129.0.19   ip-10-0-157-212.us-east-2.compute.internal   <none>           <none>
image-pruner-27102240-6rh66                        0/1     Completed   0          150m   10.129.2.8    ip-10-0-221-163.us-east-2.compute.internal   <none>           <none>
image-registry-7894857cd-csrpx                     1/1     Running     0          36m    10.131.0.15   ip-10-0-190-62.us-east-2.compute.internal    <none>           <none>
image-registry-7894857cd-sgdxg                     1/1     Running     0          36m    10.128.2.64   ip-10-0-128-6.us-east-2.compute.internal     <none>           <none>
image-registry-7894857cd-txwn9                     1/1     Running     0          4s     10.129.2.16   ip-10-0-221-163.us-east-2.compute.internal   <none>           <none>
image-registry-dddb4c944-4l92q                     0/1     Pending     0          4s     <none>        <none>                                       <none>           <none>

$ oc get pod image-registry-dddb4c944-4l92q -n openshift-image-registry -o json | jq -r .status
{
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2021-07-13T02:30:00Z",
      "message": "0/6 nodes are available: 3 node(s) didn't match pod affinity/anti-affinity rules, 3 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
    }
  ],
  "phase": "Pending",
  "qosClass": "Burstable"
}

Expected results:
Only the configured number of pods should run; no extra Pending pod should be created.

Additional info:
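The steps above, expressed as commands (step 1 is the same patch reproducer.sh runs earlier in this report; step 2 is a plain pod listing):

# Step 1: scale the registry to 3 replicas through the operator config.
oc patch configs.imageregistry.operator.openshift.io/cluster \
  --type merge -p '{"spec":{"replicas":3}}'
# Step 2: watch the pods; on affected builds a fourth pod appears and stays Pending.
oc get pods -n openshift-image-registry -o wide -w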