Description of problem:
Installed a cluster (UPI on GCP); the etcd operator reports an error:

etcd   4.11.0-0.nightly-2022-04-10-231114   True   False   True   168m   DefragControllerDegraded: failed to dial endpoint https://10.0.0.7:2379 with maintenance client: context canceled

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-10-231114   True        False         158m    Error while reconciling 4.11.0-0.nightly-2022-04-10-231114: the cluster operator etcd is degraded

# oc get etcd cluster -o yaml
  - lastTransitionTime: "2022-04-11T02:18:52Z"
    message: 'failed to dial endpoint https://10.0.0.7:2379 with maintenance client: context canceled'
    reason: Error
    status: "True"
    type: DefragControllerDegraded

Checked the operator log and found the error messages below, although the etcd member and operator pods are all in Running status:

E0411 02:14:14.989364 1 base_controller.go:272] EtcdEndpointsController reconciliation failed: no etcd members are present
E0411 02:14:15.364304 1 base_controller.go:272] DefragController reconciliation failed: Operation cannot be fulfilled on etcds.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again

How reproducible:
Sometimes

Steps to Reproduce:
Install a cluster and check the etcd cluster operator (co) status; it is abnormal.

Actual results:
As described above.

Expected results:
The etcd cluster operator status is normal.
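
For anyone triaging the same symptom, the check above can be scripted. The following is only a rough sketch, assuming the JSON form of the same resource (`oc get etcd cluster -o json`) exposes the conditions under `status.conditions`; the helper name `degraded_conditions` and the embedded `SAMPLE` document are hypothetical, and the sample contains only the single condition quoted in this report.

```python
import json

# Hypothetical sample, trimmed to the one condition quoted in this report.
# A real `oc get etcd cluster -o json` document carries many more fields.
SAMPLE = json.dumps({
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2022-04-11T02:18:52Z",
                "message": "failed to dial endpoint https://10.0.0.7:2379 "
                           "with maintenance client: context canceled",
                "reason": "Error",
                "status": "True",
                "type": "DefragControllerDegraded",
            }
        ]
    }
})


def degraded_conditions(doc):
    """Return (type, message) pairs for conditions whose type ends in
    'Degraded' and whose status is the string 'True'."""
    conditions = doc.get("status", {}).get("conditions", [])
    return [
        (c.get("type", ""), c.get("message", ""))
        for c in conditions
        if c.get("type", "").endswith("Degraded") and c.get("status") == "True"
    ]


if __name__ == "__main__":
    # Print one line per active Degraded condition found in the sample.
    for ctype, message in degraded_conditions(json.loads(SAMPLE)):
        print(f"{ctype}: {message}")
```

To run it against a live cluster, the sample could be replaced by the saved output of the `oc` command, loaded with `json.load`. This only surfaces the condition; it does not explain why the maintenance client dial was canceled.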
must-gather: https://virt-openshift-05.lab.eng.nay.redhat.com/geliu/must-gather-0411.tar.gz
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/96014/console hits the installation failure again:

04-21 03:10:32.198 etcd 4.11.0-0.nightly-2022-04-20-215725 True False True 35m DefragControllerDegraded: failed to dial endpoint https://10.0.0.6:2379 with maintenance client: context canceled
04-21 03:10:32.200 Using oc describe to check status of bad core clusteroperators ...
04-21 03:10:32.200 Name: etcd
04-21 03:10:32.200 Status:
04-21 03:10:32.200   Conditions:
04-21 03:10:32.200     Last Transition Time: 2022-04-21T02:46:45Z
04-21 03:10:32.200     Message: DefragControllerDegraded: failed to dial endpoint https://10.0.0.6:2379 with maintenance client: context canceled
04-21 03:10:32.200     Reason: DefragController_Error
04-21 03:10:32.200     Status: True
04-21 03:10:32.200     Type: Degraded
Verified with 4.11.0-0.nightly-2022-05-24-062131: installing an OCP cluster in UPI-on-GCP mode succeeded.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069