Bug 2077833

Summary: Frequent failure to upgrade etcd operator on ovn clusters: operator was not available (EtcdMembers_No quorum): EtcdMembersAvailable: 1 of 3 members are available
Product: OpenShift Container Platform
Component: Etcd
Version: 4.11
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DEFERRED
Severity: low
Priority: low
Reporter: Devan Goodwin <dgoodwin>
Assignee: Thomas Jungblut <tjungblu>
QA Contact: ge liu <geliu>
CC: sreber, tjungblu, wking
Last Closed: 2022-07-13 16:21:22 UTC
Type: Bug

Description Devan Goodwin 2022-04-22 11:16:52 UTC
Description of problem:

TRT has identified a problem that appears fairly common on OVN clusters (particularly on Azure) where the etcd operator fails to upgrade:

https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Some snapshots of results from today:

periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade (all) - 211 runs, 85% failed, 20% of failures match = 17% impact

periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 437 runs, 79% failed, 9% of failures match = 7% impact

periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-rt-upgrade (all) - 28 runs, 100% failed, 7% of failures match = 7% impact

periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-ovirt-upgrade (all) - 28 runs, 68% failed, 5% of failures match = 4% impact

If you exclude OVN job results, there are NO hits; this problem seemingly never happens on SDN.
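
For reference, the same query can be reproduced from a terminal (a sketch, reusing the w3m pattern that appears later in this bug; the grep on "failures match" just pulls the per-job impact lines):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match' | sort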


In the main example TRT analyzed, the etcd operator was reporting Available=false for this reason for hours:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516832201574977536

We dug through the logs but were unable to determine why the members are unhealthy, though there were suspicious "apply request took too long" entries throughout the log files of the main etcd pods on the affected nodes.
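
For anyone re-running this analysis against a live cluster, a rough sketch of the checks involved (container and pod names follow the usual openshift-etcd layout; <master-node> is a placeholder):

# Operator conditions, including the EtcdMembersAvailable message
$ oc get clusteroperator etcd -o yaml

# Member list and health as etcd itself reports it, via the etcdctl container
$ oc -n openshift-etcd exec -c etcdctl etcd-<master-node> -- etcdctl member list -w table
$ oc -n openshift-etcd exec -c etcdctl etcd-<master-node> -- etcdctl endpoint health --cluster

# The slow-apply warnings mentioned above
$ oc -n openshift-etcd logs etcd-<master-node> -c etcd | grep 'apply request took too long'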

TRT is going to investigate breaking out a clear test for this specific case.

Comment 2 Devan Goodwin 2022-04-26 10:59:42 UTC
No hits for three days; I'll keep watching and report back as soon as we see something.

Comment 5 Thomas Jungblut 2022-04-27 10:56:47 UTC
*** Bug 2016574 has been marked as a duplicate of this bug. ***

Comment 7 W. Trevor King 2022-04-28 07:04:12 UTC
With bug 2016574 closed as a dup, this bug is also now aiming to fix:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=120h&type=junit&search=event+happened.*times.*something+is+wrong.*deployment/etcd-operator.*Degraded+message+changed.*EndpointsDegraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact

To avoid issues like:

  : [sig-arch] events should not repeat pathologically	0s
  2 events happened too frequently

  event happened 32 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
  event happened 29 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
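
If it helps with triage, a sketch of how to surface the same flapping on a live cluster (the jsonpath output format is illustrative):

# Repeat counts for OperatorStatusChanged events in the operator namespace
$ oc -n openshift-etcd-operator get events --field-selector reason=OperatorStatusChanged \
    -o jsonpath='{range .items[*]}{.count}{"\t"}{.message}{"\n"}{end}'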