Bug 2077833 - Frequent failure to upgrade etcd operator on ovn clusters: operator was not available (EtcdMembers_No quorum): EtcdMembersAvailable: 1 of 3 members are available
Summary: Frequent failure to upgrade etcd operator on ovn clusters: operator was not available (EtcdMembers_No quorum): EtcdMembersAvailable: 1 of 3 members are available
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Thomas Jungblut
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 2016574
Depends On:
Blocks:
 
Reported: 2022-04-22 11:16 UTC by Devan Goodwin
Modified: 2022-07-13 16:21 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-13 16:21:22 UTC
Target Upstream Version:
Embargoed:




Links
System                                            Status  Summary                        Last Updated
Github openshift cluster-etcd-operator pull 800   Merged  Bug 2077833: add more logging  2022-04-27 06:38:39 UTC
Github openshift cluster-etcd-operator pull 801   Merged  fix races in etcdclient        2022-05-13 15:43:25 UTC

Description Devan Goodwin 2022-04-22 11:16:52 UTC
Description of problem:

TRT has identified a problem, fairly common on OVN jobs (particularly on Azure), where the etcd operator fails to upgrade:

https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Some snapshots of results from today:

periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade (all) - 211 runs, 85% failed, 20% of failures match = 17% impact

periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 437 runs, 79% failed, 9% of failures match = 7% impact

periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-rt-upgrade (all) - 28 runs, 100% failed, 7% of failures match = 7% impact

periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-ovirt-upgrade (all) - 28 runs, 68% failed, 5% of failures match = 4% impact

If you exclude ovn job results, there are NO hits. This problem seemingly never happens on SDN.
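
As a quick sanity check, the same skew shows up from the command line by dumping the search page and grepping the per-job summary lines; excludeName is one of the (currently empty) parameters in the URL above, so something along these lines should come back empty:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&type=junit&excludeName=ovn' | grep 'failures match' | sort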


In the main examples TRT analyzed, the etcd operator was reporting Available=false for this reason for hours:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516832201574977536

We dug through the logs but were unable to determine why the members are unhealthy, though the etcd pods on the affected nodes show suspicious "apply request took too long" entries throughout their logs.

TRT is going to investigate breaking out a clear test for this specific case.
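
For anyone poking at a live cluster hitting this, a rough sketch of how to look at what the operator is reporting (the pod name is a placeholder; the etcdctl container and its pre-set env are the usual 4.x layout, adjust as needed):

$ oc get clusteroperator etcd -o yaml                 # Available/Degraded conditions and messages
$ oc rsh -n openshift-etcd -c etcdctl etcd-<master-node>
sh-4.4# etcdctl member list -w table                  # members the cluster knows about
sh-4.4# etcdctl endpoint health --cluster             # per-member health, should be 3/3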

Comment 2 Devan Goodwin 2022-04-26 10:59:42 UTC
No hits for three days; I'll keep watching and report back as soon as we see something.

Comment 5 Thomas Jungblut 2022-04-27 10:56:47 UTC
*** Bug 2016574 has been marked as a duplicate of this bug. ***

Comment 7 W. Trevor King 2022-04-28 07:04:12 UTC
With bug 2016574 closed as a dup, this bug is also now aiming to fix:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=120h&type=junit&search=event+happened.*times.*something+is+wrong.*deployment/etcd-operator.*Degraded+message+changed.*EndpointsDegraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact

To avoid issues like:

  : [sig-arch] events should not repeat pathologically	0s
  2 events happened too frequently

  event happened 32 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
  event happened 29 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
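
For context on the second linked PR ("fix races in etcdclient"): the flapping EtcdEndpointsDegraded message above is the kind of error you see when a shared *clientv3.Client is closed by one goroutine while another is still issuing RPCs on it. The sketch below is illustrative only (hypothetical names, not the actual cluster-etcd-operator code); it just shows that failure mode and one way to serialize client creation and teardown:

// Illustrative sketch only, not the cluster-etcd-operator implementation.
// Closing a cached etcd client while another goroutine is still using it
// surfaces errors like:
//   rpc error: code = Canceled desc = grpc: the client connection is closing
// Guarding creation and invalidation with one lock avoids handing out a
// client that is about to be closed.
package sketch

import (
    "sync"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

type cachedClient struct {
    mu        sync.Mutex
    endpoints []string
    client    *clientv3.Client
}

// get returns the cached client, creating it on first use.
func (c *cachedClient) get() (*clientv3.Client, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.client != nil {
        return c.client, nil
    }
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   c.endpoints,
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        return nil, err
    }
    c.client = cli
    return cli, nil
}

// invalidate closes and drops the cached client under the same lock, so a
// concurrent get() can never return a client that is being closed here.
func (c *cachedClient) invalidate() {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.client != nil {
        _ = c.client.Close()
        c.client = nil
    }
}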

