Bug 1981639
| Summary: | Image registry rolls out N+1 pods when replicas is set to N (N>2); only Y (= number of worker nodes) pods are scheduled, one per worker, and the remaining pods stay Pending | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | XiuJuan Wang <xiuwang> |
| Component: | Image Registry | Assignee: | Oleg Bulatov <obulatov> |
| Status: | CLOSED ERRATA | QA Contact: | XiuJuan Wang <xiuwang> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.9 | CC: | aos-bugs, fkrepins, gagore, oarribas, obulatov, wewang |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:38:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2004028 | | |
| Attachments: | | | |
Description
XiuJuan Wang
2021-07-13 04:06:31 UTC
Created attachment 1809373 [details]
reproducer.sh
When the image registry operator updates the deployment
* replicas: 2 -> 3
* anti-affinity: requiredDuringSchedulingIgnoredDuringExecution -> preferredDuringSchedulingIgnoredDuringExecution
the deployment controller ends up with 3 pods that carry the required anti-affinity (the old ReplicaSet) plus 1 pod with the preferred anti-affinity (the new ReplicaSet).
As I have 3 worker nodes, every node already hosts a pod with the required anti-affinity. The new pod with the preferred anti-affinity cannot be scheduled because all nodes are occupied, so the deployment cannot proceed.
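For illustration, here is a minimal Go sketch of the two anti-affinity shapes the operator toggles between, built with the upstream k8s.io/api types. This is not the operator's exact code; the label selector and topology key are assumptions made for the example.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// registrySelector is an assumed label selector for the registry pods;
// the real deployment's labels may differ.
var registrySelector = &metav1.LabelSelector{
	MatchLabels: map[string]string{"docker-registry": "default"},
}

// requiredAntiAffinity is the shape used while replicas == 2: the scheduler
// refuses to place two matching pods on the same node.
func requiredAntiAffinity() *corev1.PodAntiAffinity {
	return &corev1.PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
			LabelSelector: registrySelector,
			TopologyKey:   "kubernetes.io/hostname",
		}},
	}
}

// preferredAntiAffinity is the shape used for other replica counts: spreading
// is only a preference, so pods may share a node if no free node is left.
func preferredAntiAffinity() *corev1.PodAntiAffinity {
	return &corev1.PodAntiAffinity{
		PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{{
			Weight: 100,
			PodAffinityTerm: corev1.PodAffinityTerm{
				LabelSelector: registrySelector,
				TopologyKey:   "kubernetes.io/hostname",
			},
		}},
	}
}
```

The deadlock arises because old pods keep the required form while the single new pod uses the preferred form, so the new pod has nowhere to go until an old pod is removed.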
Output from ./reproducer.sh:
...
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState":"Managed"}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
deployment.kubernetes.io/revision: "2"
generation: 2
replicas: 2
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 2
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^ name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "1"
deployment.kubernetes.io/max-replicas: "2"
deployment.kubernetes.io/revision: "1"
name: cluster-image-registry-operator-7bbfb5995d
replicas: 1
replicas: 1
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "2"
deployment.kubernetes.io/max-replicas: "3"
deployment.kubernetes.io/revision: "2"
name: image-registry-5b4f6b6b6f
replicas: 2
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 2
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "2"
deployment.kubernetes.io/max-replicas: "3"
deployment.kubernetes.io/revision: "1"
name: image-registry-5f4c7bd8d5
replicas: 0
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 0
+ oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"replicas":3}}'
config.imageregistry.operator.openshift.io/cluster patched
+ sleep 10
+ dump_deployment
+ kubectl get -n openshift-image-registry deploy image-registry -o yaml
+ grep -e generation: -e replicas: -e DuringExec -e deployment.kubernetes.io
deployment.kubernetes.io/revision: "3"
generation: 3
replicas: 3
preferredDuringSchedulingIgnoredDuringExecution:
replicas: 4
+ dump_replicasets
+ kubectl get -n openshift-image-registry rs -o yaml
+ grep -e '^-' -e '^ name:' -e replicas: -e DuringExec -e deployment.kubernetes.io
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "1"
deployment.kubernetes.io/max-replicas: "2"
deployment.kubernetes.io/revision: "1"
name: cluster-image-registry-operator-7bbfb5995d
replicas: 1
replicas: 1
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "3"
deployment.kubernetes.io/max-replicas: "4"
deployment.kubernetes.io/revision: "2"
name: image-registry-5b4f6b6b6f
replicas: 3
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 3
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "2"
deployment.kubernetes.io/max-replicas: "3"
deployment.kubernetes.io/revision: "1"
name: image-registry-5f4c7bd8d5
replicas: 0
requiredDuringSchedulingIgnoredDuringExecution:
replicas: 0
- apiVersion: apps/v1
deployment.kubernetes.io/desired-replicas: "3"
deployment.kubernetes.io/max-replicas: "4"
deployment.kubernetes.io/revision: "3"
name: image-registry-79f5d7599f
replicas: 1
preferredDuringSchedulingIgnoredDuringExecution:
replicas: 1
The problem is with the cluster-image-registry-operator configuration. When replicas are set to 2:
- PodAntiAffinity is set to RequiredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L394 )
- maxUnavailable is set to 1 ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/deployment.go#L99 )

But when we scale to 3:
- PodAntiAffinity is changed to PreferredDuringSchedulingIgnoredDuringExecution ( https://github.com/openshift/cluster-image-registry-operator/blob/master/pkg/resource/podtemplatespec.go#L376 )
- maxUnavailable falls back to the default 25%, which is not even 1 pod (0.75 to be precise)

As a result:
1. The old ReplicaSet is scaled to 3 to honor the new maxUnavailable.
2. The new ReplicaSet is created with 1 pod with PreferredDuringSchedulingIgnoredDuringExecution, but that pod cannot be scheduled because the old ReplicaSet's pods still carry the RequiredDuringSchedulingIgnoredDuringExecution PodAntiAffinity.

So the rules of the Deployment prevent it from progressing. To fix this I propose to either set maxUnavailable to 1 for 3-replica deployments as well, or to tune the PodAntiAffinity rules.

@obulatov What do you think about these options?

Verified in version 4.9.0-0.nightly-2021-08-14-065522:

[wewang@localhost work]$ oc get pods -n openshift-image-registry -o wide
NAME                                              READY   STATUS      RESTARTS        AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-image-registry-operator-5dc84c844-bzhlw   1/1     Running     1 (6h38m ago)   6h45m   10.128.0.31    ip-10-0-137-141.eu-west-2.compute.internal   <none>           <none>
image-pruner-27151200--1-rhvvn                    0/1     Completed   0               5h58m   10.128.2.15    ip-10-0-201-158.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-68vq8                    1/1     Running     0               42m     10.129.2.163   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-jzptq                    1/1     Running     0               2m40s   10.131.0.42    ip-10-0-156-75.eu-west-2.compute.internal    <none>           <none>
image-registry-7ddf5ccf9-nlncg                    1/1     Running     0               42m     10.129.2.162   ip-10-0-165-218.eu-west-2.compute.internal   <none>           <none>
image-registry-7ddf5ccf9-v68ww                    1/1     Running     0               42m     10.129.2.164   ip-10-0-165-218.eu-west-2.compute.internal   <none>

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
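As a footnote to the first proposal discussed above, here is a minimal Go sketch of pinning the rolling-update parameters instead of relying on the 25% defaults (which round maxUnavailable down to 0 for a 3-replica deployment). This only illustrates that suggestion under assumed names; it is not the patch that actually shipped.

```go
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// explicitRollingUpdate pins maxUnavailable (and, for completeness, maxSurge)
// to 1, so the rollout may terminate one old pod even when every node is
// already occupied, freeing a node for the new pod.
func explicitRollingUpdate() appsv1.DeploymentStrategy {
	maxUnavailable := intstr.FromInt(1) // default 25% of 3 rounds down to 0
	maxSurge := intstr.FromInt(1)       // default 25% of 3 rounds up to 1
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxUnavailable: &maxUnavailable,
			MaxSurge:       &maxSurge,
		},
	}
}
```

With maxUnavailable at 1, the deployment controller can remove one old pod carrying the required anti-affinity, which unblocks scheduling of the new pod carrying the preferred anti-affinity.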