Bug 2077833

Summary: Frequent failure to upgrade etcd operator on ovn clusters: operator was not available (EtcdMembers_No quorum): EtcdMembersAvailable: 1 of 3 members are available
Product: OpenShift Container Platform
Component: Etcd
Version: 4.11
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DEFERRED
Severity: low
Priority: low
Reporter: Devan Goodwin <dgoodwin>
Assignee: Thomas Jungblut <tjungblu>
QA Contact: ge liu <geliu>
CC: sreber, tjungblu, wking
Last Closed: 2022-07-13 16:21:22 UTC
Type: Bug

Description Devan Goodwin 2022-04-22 11:16:52 UTC
Description of problem:

TRT has identified a problem that appears fairly common on OVN clusters (particularly on Azure) where the etcd operator fails to upgrade:

https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Some snapshots of results from today:

periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade (all) - 211 runs, 85% failed, 20% of failures match = 17% impact

periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 437 runs, 79% failed, 9% of failures match = 7% impact

periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-rt-upgrade (all) - 28 runs, 100% failed, 7% of failures match = 7% impact

periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-ovirt-upgrade (all) - 28 runs, 68% failed, 5% of failures match = 4% impact

If you exclude OVN job results, there are NO hits; this problem seemingly never happens on SDN.
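
For reference, the same query can be reproduced from a terminal (a sketch, reusing the w3m pattern that appears later in this bug; the grep on "failures match" just pulls the per-job impact lines):

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job' | grep 'failures match' | sort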


In the main example TRT analyzed, the etcd operator was reporting Available=false for this reason for hours:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516832201574977536

We dug through the logs but were unable to determine why the members are unhealthy, though there were suspicious "apply request took too long" entries throughout the log files of the main etcd pods on the affected nodes.
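
For anyone re-running this analysis against a live cluster, a rough sketch of the checks involved (container and pod names follow the usual openshift-etcd layout; <master-node> is a placeholder):

# Operator conditions, including the EtcdMembersAvailable message
$ oc get clusteroperator etcd -o yaml

# Member list and health as etcd itself reports it, via the etcdctl container
$ oc -n openshift-etcd exec -c etcdctl etcd-<master-node> -- etcdctl member list -w table
$ oc -n openshift-etcd exec -c etcdctl etcd-<master-node> -- etcdctl endpoint health --cluster

# The slow-apply warnings mentioned above
$ oc -n openshift-etcd logs etcd-<master-node> -c etcd | grep 'apply request took too long'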

TRT is going to investigate breaking out a clear test for this specific case.

Comment 2 Devan Goodwin 2022-04-26 10:59:42 UTC
No hits for three days; I'll keep watching and report back as soon as we see something.

Comment 5 Thomas Jungblut 2022-04-27 10:56:47 UTC
*** Bug 2016574 has been marked as a duplicate of this bug. ***

Comment 7 W. Trevor King 2022-04-28 07:04:12 UTC
With bug 2016574 closed as a dup, this bug is also now aiming to fix:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=120h&type=junit&search=event+happened.*times.*something+is+wrong.*deployment/etcd-operator.*Degraded+message+changed.*EndpointsDegraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact

To avoid issues like:

  : [sig-arch] events should not repeat pathologically	0s
  2 events happened too frequently

  event happened 32 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
  event happened 29 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
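
If it helps with triage, a sketch of how to surface the same flapping on a live cluster (the jsonpath output format is illustrative):

# Repeat counts for OperatorStatusChanged events in the operator namespace
$ oc -n openshift-etcd-operator get events --field-selector reason=OperatorStatusChanged \
    -o jsonpath='{range .items[*]}{.count}{"\t"}{.message}{"\n"}{end}'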