Bug 2009016 - clusteroperator/etcd status condition should not change reasons frequently due to EtcdEndpointsDegraded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.9.z
Assignee: Haseeb Tariq
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 2006975
Blocks:
Reported: 2021-09-29 17:25 UTC by Haseeb Tariq
Modified: 2021-10-26 17:22 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2006975
Environment:
Last Closed: 2021-10-26 17:22:42 UTC
Target Upstream Version:


Links
| System | ID | Private | Priority | Status | Summary | Last Updated |
| --- | --- | --- | --- | --- | --- | --- |
| Github | openshift cluster-etcd-operator pull 679 | 0 | None | open | [release-4.9] Bug 2009016: Suppress noisy logs and improve client errors | 2021-09-30 08:22:25 UTC |
| Red Hat Product Errata | RHBA-2021:3935 | 0 | None | None | None | 2021-10-26 17:22:57 UTC |

Description Haseeb Tariq 2021-09-29 17:25:21 UTC
+++ This bug was initially created as a clone of Bug #2006975 +++

Description of problem:
The following test failure is showing up in recent CI runs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/664/pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic/1440433281509101568

```
: [sig-arch] events should not repeat pathologically 

event happened 21 times, something is wrong: ns/openshift-etcd-operator namespace/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
event happened 21 times, something is wrong: ns/openshift-etcd-operator namespace/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
```

The Degraded condition's message flaps because the following reason is repeatedly added to and removed from it:
```
EtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing
```

It needs to be determined whether this is expected behavior (e.g., during upgrade jobs) or whether there is an issue with how the clusteroperator/etcd status condition is updated.

Version-Release number of selected component (if applicable):
Seen on 4.10 and/or CI runs on master.


Steps to Reproduce:
As seen in CI runs:
https://search.ci.openshift.org/?search=EtcdEndpointsDegraded%3A+rpc+error%3A+code+%3D+Canceled+desc+%3D+grpc%3A+the+client+connection+is+closing&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job&wrap=on

--- Additional comment from Haseeb Tariq on 2021-09-22 18:50:43 UTC ---

Bumping to high since this has been failing across multiple release jobs:
https://search.ci.openshift.org/?search=EtcdEndpointsDegraded%3A+rpc+error%3A+code+%3D+Canceled+desc+%3D+grpc%3A+the+client+connection+is+closing&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-canary/1436114643499094016

--- Additional comment from Haseeb Tariq on 2021-09-22 19:50:47 UTC ---

Adding to the known event exceptions list for now: https://github.com/openshift/origin/pull/26475
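
As an illustration of what such an exception amounts to, here is a minimal Go sketch of a repeat-count check with an allow list of regular expressions. The threshold of 20 is an assumption (the test output only shows that 21 repeats failed), and the pattern and the names `maxRepeats`, `knownExceptions`, and `isPathological` are hypothetical. The actual list is in the linked openshift/origin pull request and may look different.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxRepeats is an assumed threshold, not taken from the origin test code.
const maxRepeats = 20

// knownExceptions is a hypothetical allow list of event messages that are
// permitted to repeat. (?s) lets ".*" span real newlines in the message.
var knownExceptions = []*regexp.Regexp{
	regexp.MustCompile(`(?s)reason/OperatorStatusChanged .*EtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing`),
}

// isPathological reports whether an event repeated too often and is not
// covered by a known exception.
func isPathological(msg string, count int) bool {
	if count <= maxRepeats {
		return false
	}
	for _, re := range knownExceptions {
		if re.MatchString(msg) {
			return false
		}
	}
	return true
}

func main() {
	flap := `ns/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"`
	fmt.Println(isPathological(flap, 21))                          // prints false
	fmt.Println(isPathological("reason/SomethingElse repeated", 21)) // prints true
}
```

An exception of this kind only silences the test; it does not stop the condition from flapping, which is why it is described as a measure "for now".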

Comment 3 ge liu 2021-10-15 07:58:48 UTC
Verified with 4.9.0-0.nightly-2021-10-14-182021

Comment 6 errata-xmlrpc 2021-10-26 17:22:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3935

