2006975 – clusteroperator/etcd status condition should not change reasons frequently due to EtcdEndpointsDegraded

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Haseeb Tariq
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	2009016

Reported:	2021-09-22 18:38 UTC by Haseeb Tariq
Modified:	2022-03-10 16:13 UTC
CC List:	0 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	2009016
Environment:
Last Closed:	2022-03-10 16:12:52 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 660	0	None	open	Bug 2006975: Suppress noisy logs and improve client errors	2021-09-29 08:30:48 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:13:13 UTC

Description Haseeb Tariq 2021-09-22 18:38:06 UTC

Description of problem:
The following test failure is showing up in recent CI runs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/664/pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic/1440433281509101568

```
: [sig-arch] events should not repeat pathologically 

event happened 21 times, something is wrong: ns/openshift-etcd-operator namespace/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
event happened 21 times, something is wrong: ns/openshift-etcd-operator namespace/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
```
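
For context, the failing check counts how often an identical event recurs during a run. The Go sketch below illustrates that idea only; it is not the origin test code. The function name `repeatedEvents` and the cutoff of 20 are assumptions made for illustration (the output above only shows that 21 occurrences were flagged).

```go
package main

import (
	"fmt"
	"sort"
)

// pathologicalThreshold is an assumed cutoff for this sketch. The report only
// shows that an event seen 21 times was flagged.
const pathologicalThreshold = 20

// repeatedEvents counts identical event messages and returns a report line for
// each message seen more than threshold times. The result is sorted so the
// output is deterministic.
func repeatedEvents(events []string, threshold int) []string {
	counts := make(map[string]int)
	for _, e := range events {
		counts[e]++
	}

	var flagged []string
	for msg, n := range counts {
		if n > threshold {
			flagged = append(flagged,
				fmt.Sprintf("event happened %d times, something is wrong: %s", n, msg))
		}
	}
	sort.Strings(flagged)
	return flagged
}

func main() {
	// Simulate one event that repeats 21 times and one that occurs once.
	var events []string
	for i := 0; i < 21; i++ {
		events = append(events, "reason/OperatorStatusChanged Degraded message changed")
	}
	events = append(events, "reason/Example happened once")

	for _, line := range repeatedEvents(events, pathologicalThreshold) {
		fmt.Println(line)
	}
}
```

Because the two messages in the failure differ only in direction (reason added versus reason removed), each direction is counted as its own event, which is why the test reports two lines.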

The Degraded status message flaps because the following reason is repeatedly added and removed:
```
EtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing
```

It needs to be determined whether this is expected behavior (e.g., during upgrade jobs) or whether there is an issue with how the clusteroperator/etcd status condition is updated.
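
If the error turns out to be a benign side effect of a client being closed, one way to stop the flapping would be to leave that error out of the Degraded message. The Go sketch below illustrates the idea under that assumption. The `condition` type, `isTransient`, and `degradedMessage` are hypothetical names, the string match stands in for proper gRPC status-code inspection, and none of this is necessarily what the operator does.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Sample errors for the sketch. The first carries the text from the report;
// the second is an arbitrary non-transient example.
var (
	errClientClosing = errors.New("rpc error: code = Canceled desc = grpc: the client connection is closing")
	errDeadline      = errors.New("context deadline exceeded")
)

// condition is a hypothetical, simplified view of one controller's input to
// the Degraded message.
type condition struct {
	Reason  string
	Message string // informational text, used when Err is nil
	Err     error  // optional error reported by the controller
}

// isTransient reports whether err looks like the client-shutdown error from
// the report. Matching on the error string is an assumption that keeps the
// sketch dependency-free.
func isTransient(err error) bool {
	return err != nil &&
		strings.Contains(err.Error(), "grpc: the client connection is closing")
}

// degradedMessage joins the conditions into one message in input order,
// skipping errors classified as transient so they cannot cause flapping.
func degradedMessage(conds []condition) string {
	var lines []string
	for _, c := range conds {
		switch {
		case c.Err == nil && c.Message != "":
			lines = append(lines, c.Reason+": "+c.Message)
		case c.Err != nil && !isTransient(c.Err):
			lines = append(lines, c.Reason+": "+c.Err.Error())
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	steady := []condition{
		{Reason: "NodeControllerDegraded", Message: "All master nodes are ready"},
		{Reason: "EtcdMembersDegraded", Message: "No unhealthy members found"},
	}
	// Same state, but with the transient endpoint error present.
	flapping := []condition{
		steady[0],
		{Reason: "EtcdEndpointsDegraded", Err: errClientClosing},
		steady[1],
	}

	// With the transient error filtered, both states yield the same message.
	fmt.Println(degradedMessage(steady) == degradedMessage(flapping))

	// A non-transient error is still surfaced.
	fmt.Println(degradedMessage([]condition{{Reason: "EtcdEndpointsDegraded", Err: errDeadline}}))
}
```

The trade-off in this approach is that a real connectivity problem that only ever surfaces as this error would be hidden, so the classification would need to stay narrow.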

Version-Release number of selected component (if applicable):
Seen on 4.10 and/or CI runs on master.


Steps to Reproduce:
As seen in CI runs:
https://search.ci.openshift.org/?search=EtcdEndpointsDegraded%3A+rpc+error%3A+code+%3D+Canceled+desc+%3D+grpc%3A+the+client+connection+is+closing&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job&wrap=on

Comment 1 Haseeb Tariq 2021-09-22 18:50:43 UTC

Bumping to high since this has been failing across multiple release jobs:
https://search.ci.openshift.org/?search=EtcdEndpointsDegraded%3A+rpc+error%3A+code+%3D+Canceled+desc+%3D+grpc%3A+the+client+connection+is+closing&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-canary/1436114643499094016

Comment 2 Haseeb Tariq 2021-09-22 19:50:47 UTC

Adding to the known event exceptions list for now: https://github.com/openshift/origin/pull/26475

Comment 5 ge liu 2021-10-08 09:23:47 UTC

According to the CI logs, this issue still exists in 4.9, so the fix needs to be backported to 4.9 after this.

Comment 6 Haseeb Tariq 2021-10-08 18:27:17 UTC

@geliu Thanks for verifying.
The 4.9 backport is ready and waiting on staff-eng-approved labels:
https://bugzilla.redhat.com/show_bug.cgi?id=2009016
https://github.com/openshift/cluster-etcd-operator/pull/679

Comment 9 errata-xmlrpc 2022-03-10 16:12:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
