Bug 2009016

Summary: clusteroperator/etcd status condition should not change reasons frequently due to EtcdEndpointsDegraded
Product: OpenShift Container Platform
Reporter: Haseeb Tariq <htariq>
Component: Etcd
Assignee: Haseeb Tariq <htariq>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.10
CC: geliu
Target Release: 4.9.z
Hardware: Unspecified
OS: Unspecified
Clone Of: 2006975
Last Closed: 2021-10-26 17:22:42 UTC
Bug Depends On: 2006975

Description Haseeb Tariq 2021-09-29 17:25:21 UTC
+++ This bug was initially created as a clone of Bug #2006975 +++

Description of problem:
The following test failure is seen in recent CI runs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/664/pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic/1440433281509101568

```
: [sig-arch] events should not repeat pathologically 

event happened 21 times, something is wrong: ns/openshift-etcd-operator namespace/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
event happened 21 times, something is wrong: ns/openshift-etcd-operator namespace/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
```
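
The check behind this failure can be pictured roughly as follows: identical events are counted, and any event seen more than a threshold number of times fails the test. This is a minimal Go sketch, not the implementation in openshift/origin; the `event` type, the `findPathological` function, and the threshold of 20 are assumptions (the threshold is only chosen to be consistent with failures being reported at 21 occurrences).

```go
package main

import "fmt"

// pathologicalThreshold is an ASSUMED limit for illustration only: an event
// seen more times than this is reported as "repeating pathologically".
const pathologicalThreshold = 20

// event is a hypothetical, simplified event record holding only the fields
// used to decide whether two events are "the same".
type event struct {
	Namespace string
	Reason    string
	Message   string
}

// findPathological counts identical events and returns one failure line for
// each event whose count exceeds the threshold, in first-seen order.
func findPathological(events []event) []string {
	counts := map[event]int{}
	var order []event
	for _, e := range events {
		if counts[e] == 0 {
			order = append(order, e)
		}
		counts[e]++
	}
	var failures []string
	for _, e := range order {
		if n := counts[e]; n > pathologicalThreshold {
			failures = append(failures, fmt.Sprintf(
				"event happened %d times, something is wrong: ns/%s - reason/%s %s",
				n, e.Namespace, e.Reason, e.Message))
		}
	}
	return failures
}

func main() {
	// Two opposite transitions of the same condition, alternating 21 times.
	toB := event{"openshift-etcd-operator", "OperatorStatusChanged", "Degraded message changed from A to B"}
	toA := event{"openshift-etcd-operator", "OperatorStatusChanged", "Degraded message changed from B to A"}
	var events []event
	for i := 0; i < 21; i++ {
		events = append(events, toB, toA)
	}
	for _, f := range findPathological(events) {
		fmt.Println(f)
	}
}
```

Because a flapping condition flips back and forth, the two opposite transitions accumulate the same count, which matches the paired 21-occurrence failures above.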

The Degraded condition message flaps because the following reason is repeatedly added and removed:
```
EtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing
```
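
To see why one transient error is enough to produce an event, consider a Degraded message built by joining each per-controller `*Degraded` condition as a `Type: message` line, in the order the conditions appear (which matches the ordering in the events above). This is a hypothetical Go sketch, not the library code the operator uses; `condition` and `degradedMessage` are made-up names, and the rule that an empty message contributes no line is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// condition is a hypothetical, simplified operator condition.
type condition struct {
	Type    string
	Message string
}

// degradedMessage joins the messages of all "*Degraded" conditions, in input
// order, as "Type: message" lines. ASSUMPTION: a condition with an empty
// message contributes no line.
func degradedMessage(conds []condition) string {
	var lines []string
	for _, c := range conds {
		if strings.HasSuffix(c.Type, "Degraded") && c.Message != "" {
			lines = append(lines, c.Type+": "+c.Message)
		}
	}
	return strings.Join(lines, "\n")
}

func main() {
	healthy := []condition{
		{"NodeControllerDegraded", "All master nodes are ready"},
		{"EtcdEndpointsDegraded", ""},
		{"EtcdMembersDegraded", "No unhealthy members found"},
	}
	// One sync hits a transient gRPC error in the endpoints controller.
	transient := []condition{
		healthy[0],
		{"EtcdEndpointsDegraded", "rpc error: code = Canceled desc = grpc: the client connection is closing"},
		healthy[2],
	}
	// The aggregated message differs, so the status update is reported as
	// a change; clearing the error on the next sync flips it back again.
	fmt.Println(degradedMessage(healthy) != degradedMessage(transient))
}
```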

It still needs to be determined whether this is expected behavior (e.g. during upgrade jobs) or whether there is an issue with how the clusteroperator/etcd status condition is updated.
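
If the error turns out to be transient (for example, a client connection closing while etcd members roll), one hypothetical way to keep a single failed sync from flipping the condition is to publish an error only after it persists across several consecutive syncs. The Go sketch below illustrates that idea under this assumption; it is not the fix that shipped for this bug, and `debouncer`, `observe`, and the threshold of 3 are all made up for the example.

```go
package main

import "fmt"

// debouncer is a HYPOTHETICAL helper: an error is only published after it has
// been seen on `threshold` consecutive syncs; a successful sync clears it.
type debouncer struct {
	threshold int
	streak    int
}

// observe records one sync result (errMsg == "" means success) and returns
// the message that would be published in the Degraded condition.
func (d *debouncer) observe(errMsg string) string {
	if errMsg == "" {
		d.streak = 0
		return ""
	}
	d.streak++
	if d.streak >= d.threshold {
		return errMsg
	}
	return ""
}

func main() {
	const grpcCanceled = "rpc error: code = Canceled desc = grpc: the client connection is closing"
	d := &debouncer{threshold: 3}
	// Alternating transient failures, like the flapping seen in CI.
	syncs := []string{grpcCanceled, "", grpcCanceled, "", grpcCanceled, ""}
	changes, prev := 0, ""
	for _, s := range syncs {
		if msg := d.observe(s); msg != prev {
			changes++
			prev = msg
		}
	}
	fmt.Println("status changes:", changes)
}
```

The trade-off is latency: a genuinely broken endpoint is only reported after `threshold` syncs, so the threshold would have to stay small relative to the sync interval.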

Version-Release number of selected component (if applicable):
Seen on 4.10 and in CI runs on master.


Steps to Reproduce:
As seen in CI runs:
https://search.ci.openshift.org/?search=EtcdEndpointsDegraded%3A+rpc+error%3A+code+%3D+Canceled+desc+%3D+grpc%3A+the+client+connection+is+closing&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job&wrap=on

--- Additional comment from Haseeb Tariq on 2021-09-22 18:50:43 UTC ---

Bumping the severity to high, since this has been failing across multiple release jobs:
https://search.ci.openshift.org/?search=EtcdEndpointsDegraded%3A+rpc+error%3A+code+%3D+Canceled+desc+%3D+grpc%3A+the+client+connection+is+closing&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-canary/1436114643499094016

--- Additional comment from Haseeb Tariq on 2021-09-22 19:50:47 UTC ---

Adding to the known event exceptions list for now: https://github.com/openshift/origin/pull/26475
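
The contents of that PR are not reproduced here. As a rough illustration of what an exceptions list does, the sketch assumes it is a set of regular expressions matched against the event message, so a repeated event that matches is skipped by the pathological-event check. `knownExceptions` and `isKnownException` are hypothetical names, not identifiers from openshift/origin.

```go
package main

import (
	"fmt"
	"regexp"
)

// knownExceptions is a HYPOTHETICAL allowlist: repeated events whose message
// matches one of these patterns are not reported as failures.
var knownExceptions = []*regexp.Regexp{
	regexp.MustCompile(`EtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing`),
}

// isKnownException reports whether msg matches any allowlisted pattern.
func isKnownException(msg string) bool {
	for _, re := range knownExceptions {
		if re.MatchString(msg) {
			return true
		}
	}
	return false
}

func main() {
	msg := `Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"`
	fmt.Println(isKnownException(msg))
}
```

An exception like this only hides the symptom in CI; the underlying flapping is what the bug itself tracks.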

Comment 3 ge liu 2021-10-15 07:58:48 UTC
Verified with 4.9.0-0.nightly-2021-10-14-182021

Comment 6 errata-xmlrpc 2021-10-26 17:22:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.9.4 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3935