Bug 2077833 - Frequent failure to upgrade etcd operator on ovn clusters: operator was not available (EtcdMembers_No quorum): EtcdMembersAvailable: 1 of 3 members are available
Summary: Frequent failure to upgrade etcd operator on ovn clusters: operator was not available (EtcdMembers_No quorum): EtcdMembersAvailable: 1 of 3 members are available
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: Thomas Jungblut
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 2016574
Depends On:
Blocks:
 
Reported: 2022-04-22 11:16 UTC by Devan Goodwin
Modified: 2022-07-13 16:21 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-13 16:21:22 UTC
Target Upstream Version:
Embargoed:




Links
System                                            Status  Summary                        Last Updated
Github openshift cluster-etcd-operator pull 800   Merged  Bug 2077833: add more logging  2022-04-27 06:38:39 UTC
Github openshift cluster-etcd-operator pull 801   Merged  fix races in etcdclient        2022-05-13 15:43:25 UTC

Description Devan Goodwin 2022-04-22 11:16:52 UTC
Description of problem:

TRT has identified a problem, fairly common on OVN jobs (particularly on Azure), where the etcd operator fails to upgrade:

https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Some snapshots of results from today:

periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade (all) - 211 runs, 85% failed, 20% of failures match = 17% impact

periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 437 runs, 79% failed, 9% of failures match = 7% impact

periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-gcp-ovn-rt-upgrade (all) - 28 runs, 100% failed, 7% of failures match = 7% impact

periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-ovirt-upgrade (all) - 28 runs, 68% failed, 5% of failures match = 4% impact

If you exclude ovn job results, there are NO hits. This problem seemingly never happens on SDN.
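
As a quick sanity check, the same skew shows up from the command line by dumping the search page and grepping the per-job summary lines; excludeName is one of the (currently empty) parameters in the URL above, so something along these lines should come back empty:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Failed+to+upgrade+etcd&maxAge=168h&type=junit&excludeName=ovn' | grep 'failures match' | sort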


In the main examples TRT analyzed, the etcd operator was reporting Available=false for this reason for hours:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-e2e-azure-ovn-upgrade/1516832201574977536

We dug through the logs but were unable to determine why the members are unhealthy, though the etcd pods on the affected nodes show suspicious "apply request took too long" entries throughout their logs.

TRT is going to investigate breaking out a clear test for this specific case.
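
For anyone poking at a live cluster hitting this, a rough sketch of how to look at what the operator is reporting (the pod name is a placeholder; the etcdctl container and its pre-set env are the usual 4.x layout, adjust as needed):

$ oc get clusteroperator etcd -o yaml                 # Available/Degraded conditions and messages
$ oc rsh -n openshift-etcd -c etcdctl etcd-<master-node>
sh-4.4# etcdctl member list -w table                  # members the cluster knows about
sh-4.4# etcdctl endpoint health --cluster             # per-member health, should be 3/3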

Comment 2 Devan Goodwin 2022-04-26 10:59:42 UTC
No hits for three days; I'll keep watching and report back as soon as we see something.

Comment 5 Thomas Jungblut 2022-04-27 10:56:47 UTC
*** Bug 2016574 has been marked as a duplicate of this bug. ***

Comment 7 W. Trevor King 2022-04-28 07:04:12 UTC
With bug 2016574 closed as a dup, this bug is also now aiming to fix:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=120h&type=junit&search=event+happened.*times.*something+is+wrong.*deployment/etcd-operator.*Degraded+message+changed.*EndpointsDegraded' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-azure-upgrade (all) - 5 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.9-upgrade-from-stable-4.8-e2e-openstack-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact

To avoid issues like:

  : [sig-arch] events should not repeat pathologically	0s
  2 events happened too frequently

  event happened 32 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
  event happened 29 times, something is wrong: ns/openshift-etcd-operator deployment/etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
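
For context on the second linked PR ("fix races in etcdclient"): the flapping EtcdEndpointsDegraded message above is the kind of error you see when a shared *clientv3.Client is closed by one goroutine while another is still issuing RPCs on it. The sketch below is illustrative only (hypothetical names, not the actual cluster-etcd-operator code); it just shows that failure mode and one way to serialize client creation and teardown:

// Illustrative sketch only, not the cluster-etcd-operator implementation.
// Closing a cached etcd client while another goroutine is still using it
// surfaces errors like:
//   rpc error: code = Canceled desc = grpc: the client connection is closing
// Guarding creation and invalidation with one lock avoids handing out a
// client that is about to be closed.
package sketch

import (
    "sync"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

type cachedClient struct {
    mu        sync.Mutex
    endpoints []string
    client    *clientv3.Client
}

// get returns the cached client, creating it on first use.
func (c *cachedClient) get() (*clientv3.Client, error) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.client != nil {
        return c.client, nil
    }
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   c.endpoints,
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        return nil, err
    }
    c.client = cli
    return cli, nil
}

// invalidate closes and drops the cached client under the same lock, so a
// concurrent get() can never return a client that is being closed here.
func (c *cachedClient) invalidate() {
    c.mu.Lock()
    defer c.mu.Unlock()
    if c.client != nil {
        _ = c.client.Close()
        c.client = nil
    }
}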

