Bug 2006975

Summary: clusteroperator/etcd status condition should not change reasons frequently due to EtcdEndpointsDegraded
Product: OpenShift Container Platform Reporter: Haseeb Tariq <htariq>
Component: Etcd Assignee: Haseeb Tariq <htariq>
Status: CLOSED ERRATA QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: high    
Version: 4.10   
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2009016 Environment:
Last Closed: 2022-03-10 16:12:52 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 2009016    

Description Haseeb Tariq 2021-09-22 18:38:06 UTC
Description of problem:
The following test failure is showing up in recent CI runs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/664/pull-ci-openshift-cluster-etcd-operator-master-e2e-agnostic/1440433281509101568

```
: [sig-arch] events should not repeat pathologically 

event happened 21 times, something is wrong: ns/openshift-etcd-operator namespace/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
event happened 21 times, something is wrong: ns/openshift-etcd-operator namespace/openshift-etcd-operator - reason/OperatorStatusChanged Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nEtcdMembersDegraded: No unhealthy members found"
```

The message of the Degraded status condition flaps because the following reason is repeatedly added and removed:
```
EtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing
```
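To confirm that this one reason is the only difference between the two messages in the event, the messages can be compared per reason prefix. The following is a minimal sketch, not operator code: `parse_reasons` and `diff_reasons` are hypothetical helper names, and the literal `\n` sequences from the event text are assumed to have been unescaped into real newlines.

```python
def parse_reasons(message: str) -> dict:
    """Split a condition message into {reason prefix: detail}, one entry per line."""
    reasons = {}
    for line in message.split("\n"):
        # Split on the first ": " only, so details containing ": " stay intact.
        prefix, sep, detail = line.partition(": ")
        if sep:
            reasons[prefix] = detail
    return reasons


def diff_reasons(old: str, new: str):
    """Return (added, removed) reason prefixes when going from old to new."""
    old_reasons, new_reasons = parse_reasons(old), parse_reasons(new)
    return set(new_reasons) - set(old_reasons), set(old_reasons) - set(new_reasons)


# The two messages quoted in the failing test output above.
WITH_ENDPOINTS = (
    "NodeControllerDegraded: All master nodes are ready\n"
    "EtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: "
    "the client connection is closing\n"
    "EtcdMembersDegraded: No unhealthy members found"
)
WITHOUT_ENDPOINTS = (
    "NodeControllerDegraded: All master nodes are ready\n"
    "EtcdMembersDegraded: No unhealthy members found"
)

added, removed = diff_reasons(WITHOUT_ENDPOINTS, WITH_ENDPOINTS)
print("added:", added, "removed:", removed)
```

Swapping the two arguments gives the opposite transition, which corresponds to the other event line in the test output.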

It still needs to be determined whether this is expected behavior (e.g., during upgrade jobs) or whether there is an issue with how the clusteroperator/etcd status condition is updated.

Version-Release number of selected component (if applicable):
Seen on 4.10 and in CI runs on master.


Steps to Reproduce:
As seen in CI runs:
https://search.ci.openshift.org/?search=EtcdEndpointsDegraded%3A+rpc+error%3A+code+%3D+Canceled+desc+%3D+grpc%3A+the+client+connection+is+closing&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job&wrap=on

Comment 2 Haseeb Tariq 2021-09-22 19:50:47 UTC
Adding to the known event exceptions list for now: https://github.com/openshift/origin/pull/26475
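
For illustration only, an exception of this kind can be expressed as a regular expression over the event text. The pattern below is hypothetical and is not copied from the linked PR; `KNOWN_FLAP` and `is_known_exception` are made-up names, and the second sample event is invented as a negative case.

```python
import re

# Hypothetical pattern (an assumption, not the one in the linked PR): matches
# OperatorStatusChanged events for clusteroperator/etcd whose Degraded message
# mentions the transient "client connection is closing" endpoint error.
KNOWN_FLAP = re.compile(
    r"ns/openshift-etcd-operator .* reason/OperatorStatusChanged "
    r"Status for clusteroperator/etcd changed: Degraded message changed from "
    r".*EtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: "
    r"the client connection is closing"
)


def is_known_exception(event: str) -> bool:
    """Return True if the event text matches the known flapping pattern."""
    return KNOWN_FLAP.search(event) is not None


# Event text as reported by the test; "\\n" is a literal backslash followed by "n".
FLAPPING_EVENT = (
    "ns/openshift-etcd-operator namespace/openshift-etcd-operator - "
    "reason/OperatorStatusChanged Status for clusteroperator/etcd changed: "
    'Degraded message changed from "NodeControllerDegraded: All master nodes are ready'
    "\\nEtcdEndpointsDegraded: rpc error: code = Canceled desc = grpc: "
    "the client connection is closing"
    '\\nEtcdMembersDegraded: No unhealthy members found" to '
    '"NodeControllerDegraded: All master nodes are ready'
    '\\nEtcdMembersDegraded: No unhealthy members found"'
)

# Invented negative example: same event shape without the endpoint error.
OTHER_EVENT = (
    "ns/openshift-etcd-operator namespace/openshift-etcd-operator - "
    "reason/OperatorStatusChanged Status for clusteroperator/etcd changed: "
    'Degraded message changed from "NodeControllerDegraded: All master nodes are ready" '
    'to "NodeControllerDegraded: All master nodes are ready'
    '\\nEtcdMembersDegraded: No unhealthy members found"'
)

print(is_known_exception(FLAPPING_EVENT), is_known_exception(OTHER_EVENT))
```

An exception like this only silences the test; it does not change how the condition itself is updated.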

Comment 5 ge liu 2021-10-08 09:23:47 UTC
According to the CI logs, this issue still exists in 4.9, so the fix needs to be backported to 4.9 after this.

Comment 6 Haseeb Tariq 2021-10-08 18:27:17 UTC
@geliu Thanks for verifying.
The 4.9 backport is ready and is waiting on the staff-eng-approved label:
https://bugzilla.redhat.com/show_bug.cgi?id=2009016
https://github.com/openshift/cluster-etcd-operator/pull/679

Comment 9 errata-xmlrpc 2022-03-10 16:12:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update),
and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056