Bug 2073901 - Installation failed due to etcd operator Err:DefragControllerDegraded: failed to dial endpoint https://10.0.0.7:2379 with maintenance client: context canceled
Summary: Installation failed due to etcd operator Err:DefragControllerDegraded: failed...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.11
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Thomas Jungblut
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-11 05:17 UTC by ge liu
Modified: 2022-08-10 11:05 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:05:44 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 812 0 None open Bug 2073901: revisit defrag controller degradation 2022-05-03 12:08:35 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:05:58 UTC

Description ge liu 2022-04-11 05:17:33 UTC
Description of problem:

After installing a cluster (UPI on GCP), the etcd operator reports the error:
etcd 4.11.0-0.nightly-2022-04-10-231114   True        False         True       168m    DefragControllerDegraded: failed to dial endpoint https://10.0.0.7:2379 with maintenance client: context canceled

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-10-231114   True        False         158m    Error while reconciling 4.11.0-0.nightly-2022-04-10-231114: the cluster operator etcd is degraded

# oc get etcd cluster -o yaml
  - lastTransitionTime: "2022-04-11T02:18:52Z"
    message: 'failed to dial endpoint https://10.0.0.7:2379 with maintenance client:
      context canceled'
    reason: Error
    status: "True"
    type: DefragControllerDegraded
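
For context, below is a minimal sketch (not the cluster-etcd-operator's actual code) of how an etcd maintenance client is typically dialed and used for defragmentation with go.etcd.io/etcd/client/v3. The defragEndpoint helper and the timeout values are illustrative assumptions; the endpoint string is taken from the condition above. If the parent context is canceled while the RPC is in flight, the error surfaces as "context canceled", which is the message the DefragControllerDegraded condition reports.

// Illustrative sketch only; not the operator's implementation.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragEndpoint is a hypothetical helper that dials a single member
// with a maintenance client and defragments it.
func defragEndpoint(parent context.Context, endpoint string) error {
	// Per-call timeout derived from the parent context; if the parent is
	// already canceled (e.g. the controller is shutting down or resyncing),
	// the Defragment RPC below fails with a context-canceled error.
	ctx, cancel := context.WithTimeout(parent, 30*time.Second)
	defer cancel()

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint}, // e.g. "https://10.0.0.7:2379"
		DialTimeout: 5 * time.Second,
		Context:     ctx,
	})
	if err != nil {
		return fmt.Errorf("failed to dial endpoint %s with maintenance client: %w", endpoint, err)
	}
	defer cli.Close()

	// Defragment is part of the Maintenance interface embedded in clientv3.Client.
	if _, err := cli.Defragment(ctx, endpoint); err != nil {
		return fmt.Errorf("failed to defragment endpoint %s: %w", endpoint, err)
	}
	return nil
}

func main() {
	// Simulate a parent context canceled before the call runs.
	parent, cancel := context.WithCancel(context.Background())
	cancel()
	fmt.Println(defragEndpoint(parent, "https://10.0.0.7:2379"))
}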

Checking the logs shows error messages, but the etcd member and operator pods are in Running status:

E0411 02:14:14.989364       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: no etcd members are present
E0411 02:14:15.364304       1 base_controller.go:272] DefragController reconciliation failed: Operation cannot be fulfilled on etcds.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
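
The second message is a standard Kubernetes optimistic-concurrency conflict: the controller tried to update the etcds.operator.openshift.io "cluster" object against a stale resourceVersion. A generic way to handle this (not necessarily what the operator does) is to re-read the object and retry the write, for example with client-go's retry.RetryOnConflict. The sketch below is an assumption-laden illustration; updateStatus is a hypothetical callback.

// Illustrative conflict-retry pattern with client-go; not the
// cluster-etcd-operator's actual status-update code.
package example

import (
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

// updateStatus is a hypothetical callback that must fetch a fresh copy
// of the "cluster" object on every attempt (so the resourceVersion is
// current) and then attempt the status update once.
func syncWithConflictRetry(updateStatus func() error) error {
	err := retry.RetryOnConflict(retry.DefaultRetry, updateStatus)
	if errors.IsConflict(err) {
		// Still conflicting after the default number of retries;
		// surface the error so the controller can requeue.
		return err
	}
	return err
}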


How reproducible:
sometimes

Steps to Reproduce:
Install a cluster; the etcd cluster operator status is abnormal.

Actual results:
As described above.

Expected results:
The etcd cluster operator status is normal.

Comment 2 Xingxing Xia 2022-04-21 03:43:39 UTC
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/96014/console hits the installation failure again:
04-21 03:10:32.198  etcd                                       4.11.0-0.nightly-2022-04-20-215725   True   False   True    35m   DefragControllerDegraded: failed to dial endpoint https://10.0.0.6:2379 with maintenance client: context canceled
04-21 03:10:32.200  Using oc describe to check status of bad core clusteroperators ...
04-21 03:10:32.200  Name: etcd
04-21 03:10:32.200  Status:
04-21 03:10:32.200    Conditions:
04-21 03:10:32.200      Last Transition Time:  2022-04-21T02:46:45Z
04-21 03:10:32.200      Message:               DefragControllerDegraded: failed to dial endpoint https://10.0.0.6:2379 with maintenance client: context canceled
04-21 03:10:32.200      Reason:                DefragController_Error
04-21 03:10:32.200      Status:                True
04-21 03:10:32.200      Type:                  Degraded

Comment 5 ge liu 2022-05-24 11:32:30 UTC
Verified with 4.11.0-0.nightly-2022-05-24-062131: installed an OCP cluster in UPI + GCP mode and the installation succeeded.

Comment 7 errata-xmlrpc 2022-08-10 11:05:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

