Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2073901

Summary: Installation failed due to etcd operator Err:DefragControllerDegraded: failed to dial endpoint https://10.0.0.7:2379 with maintenance client: context canceled
Product: OpenShift Container Platform
Reporter: ge liu <geliu>
Component: Etcd
Assignee: Thomas Jungblut <tjungblu>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: high
Docs Contact:
Priority: medium
Version: 4.11
CC: mmasters, tjungblu, xxia, yanyang
Target Milestone: ---
Target Release: 4.11.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 11:05:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description ge liu 2022-04-11 05:17:33 UTC
Description of problem:

Installed a cluster (UPI on GCP); the etcd operator reports the following error:
etcd 4.11.0-0.nightly-2022-04-10-231114   True        False         True       168m    DefragControllerDegraded: failed to dial endpoint https://10.0.0.7:2379 with maintenance client: context canceled

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-10-231114   True        False         158m    Error while reconciling 4.11.0-0.nightly-2022-04-10-231114: the cluster operator etcd is degraded

# oc get etcd cluster -o yaml
- lastTransitionTime: "2022-04-11T02:18:52Z"
  message: 'failed to dial endpoint https://10.0.0.7:2379 with maintenance client:
    context canceled'
  reason: Error
  status: "True"
  type: DefragControllerDegraded
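
To pick out conditions like the one above without scanning the full resource, one could filter the conditions list programmatically. The following is a minimal sketch, not part of the original report. It assumes the JSON form of the same resource (`oc get etcd cluster -o json`) with conditions under `status.conditions`; the helper name `degraded_conditions` and the second, healthy sample condition are hypothetical.

```python
import json

# Sample mirroring the condition quoted in this report. A real run would read
# the output of `oc get etcd cluster -o json` instead (assumed layout).
SAMPLE = json.dumps({
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2022-04-11T02:18:52Z",
                "message": "failed to dial endpoint https://10.0.0.7:2379 "
                           "with maintenance client: context canceled",
                "reason": "Error",
                "status": "True",
                "type": "DefragControllerDegraded",
            },
            # Hypothetical healthy condition, added only to show the filtering.
            {
                "reason": "AsExpected",
                "status": "False",
                "type": "ExampleControllerDegraded",
            },
        ]
    }
})


def degraded_conditions(resource_json):
    """Return conditions whose type ends in 'Degraded' and whose status is 'True'."""
    resource = json.loads(resource_json)
    conditions = resource.get("status", {}).get("conditions", [])
    return [
        c for c in conditions
        if c.get("type", "").endswith("Degraded") and c.get("status") == "True"
    ]


if __name__ == "__main__":
    for cond in degraded_conditions(SAMPLE):
        print(f'{cond["type"]}: {cond.get("message", "")}')
```

With the sample data, only the DefragControllerDegraded condition is returned.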

The operator log shows the following errors, although the etcd member pods and the operator pod are all in Running status:

E0411 02:14:14.989364       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: no etcd members are present
E0411 02:14:15.364304       1 base_controller.go:272] DefragController reconciliation failed: Operation cannot be fulfilled on etcds.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
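
The second log line is the standard Kubernetes optimistic-concurrency conflict: a write carried a stale resource version, and the usual remedy is to re-read the object and retry. The toy below illustrates that pattern in memory only. `Store`, `ConflictError`, and `update_with_retry` are hypothetical names, unrelated to the operator's actual code, and the sketch makes no claim about the root cause of this bug.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale resource version."""


class Store:
    """Toy in-memory object store with a version counter (hypothetical)."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = 1

    def get(self):
        # Return a copy plus the version it was read at.
        return dict(self.data), self.version

    def update(self, data, version):
        # Reject writes that were based on an older read.
        if version != self.version:
            raise ConflictError("the object has been modified")
        self.data = dict(data)
        self.version += 1


def update_with_retry(store, mutate, attempts=3, before_write=None):
    """Re-read, mutate, and write; retry on conflict up to `attempts` times.

    Returns the number of attempts used. `before_write` is a test hook that
    runs between the read and the write so a race can be simulated.
    """
    for attempt in range(attempts):
        data, version = store.get()
        mutate(data)
        if before_write is not None:
            before_write(attempt)
        try:
            store.update(data, version)
            return attempt + 1
        except ConflictError:
            continue
    raise ConflictError(f"still conflicting after {attempts} attempts")


if __name__ == "__main__":
    store = Store({"degraded": True})

    def racing_writer(attempt):
        # Simulate another writer landing between our read and write, once.
        if attempt == 0:
            data, version = store.get()
            data["other"] = "x"
            store.update(data, version)

    tries = update_with_retry(
        store, lambda d: d.update(degraded=False), before_write=racing_writer
    )
    print(tries, store.data)
```

In the demo, the first write conflicts with the simulated concurrent writer and the second attempt succeeds against the freshly read object.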


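How reproducible:
Sometimes

Steps to Reproduce:
Install a cluster (UPI on GCP) and check the etcd cluster operator (co) status.
Actual results:
As described above: the etcd cluster operator status is abnormal (Degraded).
Expected results:
The etcd cluster operator status is normal.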

Comment 2 Xingxing Xia 2022-04-21 03:43:39 UTC
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/96014/console hits the installation failure again:
04-21 03:10:32.198  etcd                                       4.11.0-0.nightly-2022-04-20-215725   True   False   True    35m   DefragControllerDegraded: failed to dial endpoint https://10.0.0.6:2379 with maintenance client: context canceled
04-21 03:10:32.200  Using oc describe to check status of bad core clusteroperators ...
04-21 03:10:32.200  Name: etcd
04-21 03:10:32.200  Status:
04-21 03:10:32.200    Conditions:
04-21 03:10:32.200      Last Transition Time:  2022-04-21T02:46:45Z
04-21 03:10:32.200      Message:               DefragControllerDegraded: failed to dial endpoint https://10.0.0.6:2379 with maintenance client: context canceled
04-21 03:10:32.200      Reason:                DefragController_Error
04-21 03:10:32.200      Status:                True
04-21 03:10:32.200      Type:                  Degraded

Comment 5 ge liu 2022-05-24 11:32:30 UTC
Verified with 4.11.0-0.nightly-2022-05-24-062131: installing an OCP cluster in UPI-on-GCP mode succeeded.

Comment 7 errata-xmlrpc 2022-08-10 11:05:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069