2073901 – Installation failed due to etcd operator Err:DefragControllerDegraded: failed to dial endpoint https://10.0.0.7:2379 with maintenance client: context canceled

Summary: Installation failed due to etcd operator Err:DefragControllerDegraded: failed...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.11
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Thomas Jungblut
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-04-11 05:17 UTC by ge liu
Modified:	2022-08-10 11:05 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-10 11:05:44 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 812	0	None	open	Bug 2073901: revisit defrag controller degradation	2022-05-03 12:08:35 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 11:05:58 UTC

Description ge liu 2022-04-11 05:17:33 UTC

Description of problem:

After installing a cluster (UPI on GCP), the etcd cluster operator reports the following error:
etcd 4.11.0-0.nightly-2022-04-10-231114   True        False         True       168m    DefragControllerDegraded: failed to dial endpoint https://10.0.0.7:2379 with maintenance client: context canceled
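
When checking many install logs, a status row like the one above can be split into named fields. This is a minimal Python sketch, assuming the standard `oc get clusteroperators` column order (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE); the helper name `parse_co_row` is hypothetical and not part of any tool.

```python
# Column order assumed from the default `oc get clusteroperators` table.
COLUMNS = ["name", "version", "available", "progressing", "degraded", "since", "message"]

def parse_co_row(row):
    """Split one clusteroperator data row into a dict keyed by column name."""
    # Limit the split so that spaces inside MESSAGE are preserved.
    fields = row.split(None, len(COLUMNS) - 1)
    # MESSAGE is empty for healthy operators, so pad missing trailing fields.
    fields += [""] * (len(COLUMNS) - len(fields))
    return dict(zip(COLUMNS, fields))

# The row reported in this bug.
row = ("etcd 4.11.0-0.nightly-2022-04-10-231114   True        False         True"
       "       168m    DefragControllerDegraded: failed to dial endpoint "
       "https://10.0.0.7:2379 with maintenance client: context canceled")

status = parse_co_row(row)
if status["degraded"] == "True":
    print(f'{status["name"]} is degraded: {status["message"]}')
```

Note that in this row the operator is Available and not Progressing while still Degraded, which is why the install is reported as failed even though etcd itself is serving.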

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-10-231114   True        False         158m    Error while reconciling 4.11.0-0.nightly-2022-04-10-231114: the cluster operator etcd is degraded

# oc get etcd cluster -o yaml
- lastTransitionTime: "2022-04-11T02:18:52Z"
  message: 'failed to dial endpoint https://10.0.0.7:2379 with maintenance client:
    context canceled'
  reason: Error
  status: "True"
  type: DefragControllerDegraded
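
To find conditions like this one without reading the whole object, the status can be filtered programmatically. The following is a minimal Python sketch, assuming the object was exported with `oc get etcd cluster -o json` and that conditions sit under `.status.conditions` with the `type`, `status`, `reason` and `message` fields shown above. The function name `degraded_conditions` and the second sample condition are hypothetical, added only to show the filtering.

```python
def degraded_conditions(etcd_obj):
    """Return conditions whose type ends in 'Degraded' and whose status is "True"."""
    conditions = etcd_obj.get("status", {}).get("conditions", [])
    return [
        c for c in conditions
        if c.get("type", "").endswith("Degraded") and c.get("status") == "True"
    ]

# Sample data: the first condition mirrors the one in this report; the second
# is a made-up healthy condition. Real input would be the parsed JSON object.
sample = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2022-04-11T02:18:52Z",
                "message": "failed to dial endpoint https://10.0.0.7:2379 "
                           "with maintenance client: context canceled",
                "reason": "Error",
                "status": "True",
                "type": "DefragControllerDegraded",
            },
            {
                "reason": "AsExpected",
                "status": "False",
                "type": "ExampleControllerDegraded",  # hypothetical
            },
        ]
    }
}

for cond in degraded_conditions(sample):
    print(f'{cond["type"]} since {cond["lastTransitionTime"]}: {cond["message"]}')
```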

The operator log shows the following errors, although the etcd member pods and the operator pod are all in Running status:

E0411 02:14:14.989364       1 base_controller.go:272] EtcdEndpointsController reconciliation failed: no etcd members are present
E0411 02:14:15.364304       1 base_controller.go:272] DefragController reconciliation failed: Operation cannot be fulfilled on etcds.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
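
The second log line is the Kubernetes API server's optimistic-concurrency conflict: the update was based on a version of the `cluster` object that another writer had already replaced. Such conflicts are normally transient and resolved by re-reading the object and retrying. The sketch below illustrates that general pattern in Python with an in-memory stand-in for the API server; `Store`, `ConflictError` and `update_with_retry` are hypothetical names, and this is not the operator's actual code or a claim about how the fix works.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale resource version."""

class Store:
    """Tiny in-memory stand-in for an API server holding one object."""
    def __init__(self, data):
        self.version = 1
        self.data = dict(data)

    def get(self):
        return self.version, dict(self.data)

    def update(self, version, data):
        # Reject writes that were computed from an outdated read.
        if version != self.version:
            raise ConflictError("the object has been modified")
        self.data = dict(data)
        self.version += 1

def update_with_retry(store, mutate, attempts=5):
    """Re-read and re-apply `mutate` until the write lands or attempts run out."""
    for _ in range(attempts):
        version, data = store.get()
        mutate(data)
        try:
            store.update(version, data)
            return True
        except ConflictError:
            continue  # another writer got in first; fetch the latest and retry
    return False

store = Store({"degraded": False})
calls = {"n": 0}

def set_degraded(data):
    calls["n"] += 1
    if calls["n"] == 1:
        # Simulate another controller writing between our read and our write.
        v, d = store.get()
        d["other"] = "x"
        store.update(v, d)
    data["degraded"] = True

ok = update_with_retry(store, set_degraded)
print(ok, calls["n"], store.data)
```

In this run the first attempt conflicts with the simulated concurrent write and the second attempt succeeds, keeping both writers' changes.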


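How reproducible:
Sometimes

Steps to Reproduce:
1. Install a cluster (UPI on GCP).
2. Check the etcd cluster operator (co) status.

Actual results:
The etcd cluster operator is degraded, as described above.

Expected results:
The etcd cluster operator status is normal.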

Comment 1 ge liu 2022-04-11 05:25:41 UTC

must-gather: https://virt-openshift-05.lab.eng.nay.redhat.com/geliu/must-gather-0411.tar.gz

Comment 2 Xingxing Xia 2022-04-21 03:43:39 UTC

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/96014/console hits the installation failure again:
04-21 03:10:32.198  etcd                                       4.11.0-0.nightly-2022-04-20-215725   True   False   True    35m   DefragControllerDegraded: failed to dial endpoint https://10.0.0.6:2379 with maintenance client: context canceled
04-21 03:10:32.200  Using oc describe to check status of bad core clusteroperators ...
04-21 03:10:32.200  Name: etcd
04-21 03:10:32.200  Status:
04-21 03:10:32.200    Conditions:
04-21 03:10:32.200      Last Transition Time:  2022-04-21T02:46:45Z
04-21 03:10:32.200      Message:               DefragControllerDegraded: failed to dial endpoint https://10.0.0.6:2379 with maintenance client: context canceled
04-21 03:10:32.200      Reason:                DefragController_Error
04-21 03:10:32.200      Status:                True
04-21 03:10:32.200      Type:                  Degraded

Comment 5 ge liu 2022-05-24 11:32:30 UTC

Verified with 4.11.0-0.nightly-2022-05-24-062131: installed an OCP cluster in UPI-on-GCP mode, and the installation succeeded.

Comment 7 errata-xmlrpc 2022-08-10 11:05:44 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069
