Bug 1832986 - EtcdMembersDegraded false alarms
Summary: EtcdMembersDegraded false alarms
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd Operator
Version: 4.4
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
: 1820734 1834976 1835368 1844501 (view as bug list)
Depends On:
Blocks: 1838781
TreeView+ depends on / blocked
 
Reported: 2020-05-07 15:08 UTC by ggore
Modified: 2020-09-20 13:04 UTC (History)
51 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Improper health checks Consequence: Incorrect event reports and misleading log messages Fix: Detect the health status correctly and prepare complete English sentences. Result: Proper status is reported in events, and log messages reflect the current health status with complete English sentences.
Clone Of:
: 1838781 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:35:59 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 343 None closed Bug 1832986: pkg/etcdcli: fix health checks 2020-09-25 11:34:40 UTC
Red Hat Knowledge Base (Solution) 5070671 None None None 2020-05-26 15:18:50 UTC
Red Hat Product Errata RHBA-2020:2409 None None None 2020-07-13 17:36:19 UTC

Internal Links: 1852309

Description ggore 2020-05-07 15:08:33 UTC
Description of problem:

After fresh installation of 4.4.3 or upgrading from 4.3 to 4.4.3, etcd operator consistently shows these logs:
~~~
57m         Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 5\nEtcdMembersAvailable: node3.example.com,node4.example.com,node5.example.com members are available,  have not started,  are unhealthy,  are unknown" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 5\nEtcdMembersAvailable: node3.example.com,node5.example.com members are available,  have not started, node4.example.com are unhealthy,  are unknown"
~~~

Etcd cluster is healthy and all etcd members are available.

Comment 2 Morgan Peterman 2020-05-13 15:32:04 UTC
*** Bug 1834976 has been marked as a duplicate of this bug. ***

Comment 3 kedar 2020-05-14 09:42:26 UTC
Hello,

Cu has provided more information on the issue as follows

The cluster is experiencing regular operator failures for various operators, logging messages such as regarding pods for a few operators such as. Often the operator's pods redeploy shortly after:

community-operators-57bcd7676c-v4pw2
Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s 

catalog-operator-6c6c549cf7-8p542
Liveness probe failed: Get http://10.130.0.19:8080/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We presume this could be related to etcd.


We have a different cluster which is running on vSphere Infrastructure (rather than OpenStack), deployed using UPI (rather than IPI) which appears to have this identical issue. We have been unable to deploy any 4.4 cluster without experiencing this issue.

Comment 4 Sam Batschelet 2020-05-14 13:43:49 UTC
*** Bug 1835368 has been marked as a duplicate of this bug. ***

Comment 6 kedar 2020-05-15 08:44:06 UTC
Hello,

Cu reported back with useful information regarding the issue.

`````````````````
This issue is becoming urgent because our largest customer has requested an OpenShift 4.4 cluster.


I have noticed something odd which could be related - in all clusters (OpenStack IPI and VMware UPI), the logs for etcd-operator pod (in openshift-etcd-operator project) is showing continual and repeated connection errors, trying to connect to the IP address of the bootstrap VM:


From VMware-based UPI cluster:


$ oc logs etcd-operator-5d97b6445f-6xjhh | grep dialling
...
W0514 14:33:04.546487       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: connect: no route to host". Reconnecting...
W0514 14:33:07.345430       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: operation was canceled". Reconnecting...
W0514 14:33:08.642433       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: connect: no route to host". Reconnecting...
and in the OpenStack IPI cluster (where the earlier logs uploaded to this case came from)


$ oc logs etcd-operator-59cf47554b-m95s9 | grep dialing
W0514 14:52:00.115637       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.422120       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.436588       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.976723       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: connect: no route to host". Reconnecting...
As we know, one of the final steps of an install is to delete the bootstrap VM; manually in UPI and automatically in IPI. In our UPI install, we delete the bootstrap about 30 seconds after openshift-install wait-for bootstrap-complete completes. Therefore, the etcd-operator should NOT be trying to connect to the bootstrap.


Note - I have confirmed that the unhealthy member issue (and the bootstrap etcd-operator behaviour) remains after the cluster is updated to version 4.4.4



`````````````````

Comment 7 john schiwitz 2020-05-16 15:03:38 UTC
i also tried an upgrade to 4.4.3 from 4.3.19 last night.  the upgrade is stalled because of a degraded etcd operator
[ocp41@dhmaapp286 ocp4upc]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        True          10h     Working towards 4.4.3: 11% complete
[ocp41@dhmaapp286 ocp4upc]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.19    True        False         False      121d
cloud-credential                           4.3.19    True        False         False      121d
cluster-autoscaler                         4.3.19    True        False         False      121d
console                                    4.3.19    True        False         False      11h
dns                                        4.3.19    True        False         False      23d
etcd                                       4.4.3     False       True          True       10h
image-registry                             4.3.19    True        False         False      11h
ingress                                    4.3.19    True        False         False      22h
insights                                   4.3.19    True        False         False      120d
kube-apiserver                             4.3.19    True        False         False      121d
kube-controller-manager                    4.3.19    True        False         False      121d
kube-scheduler                             4.3.19    True        False         False      121d
machine-api                                4.3.19    True        False         False      121d
machine-config                             4.3.19    True        False         False      19h
marketplace                                4.3.19    True        False         False      11h
monitoring                                 4.3.19    True        False         False      11h
network                                    4.3.19    True        False         False      121d
node-tuning                                4.3.19    True        False         False      5h32m
openshift-apiserver                        4.3.19    True        False         False      11h
openshift-controller-manager               4.3.19    True        False         False      23d
openshift-samples                          4.3.19    True        False         False      2m19s
operator-lifecycle-manager                 4.3.19    True        False         False      121d
operator-lifecycle-manager-catalog         4.3.19    True        False         False      121d
operator-lifecycle-manager-packageserver   4.3.19    True        False         False      11h
service-ca                                 4.3.19    True        False         False      121d
service-catalog-apiserver                  4.3.19    True        False         False      121d
service-catalog-controller-manager         4.3.19    True        False         False      121d
storage                                    4.3.19    True        False         False      11h

getting similar messages in the etcd operator logs

[ocp41@dhmaapp286 ocp4upc]$ oc logs etcd-operator-59cf47554b-shccm |grep dial
W0516 14:16:58.777781       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:20:56.916372       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
E0516 14:23:13.636242       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr003.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.28:2379 with maintenance client: context canceled"}
W0516 14:25:13.039698       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
E0516 14:27:13.618487       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr002.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.27:2379 with maintenance client: context canceled"}
W0516 14:30:38.613126       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:35:49.848968       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:43:13.557082       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:43:54.288387       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.28:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.28:2379: operation was canceled". Reconnecting...
E0516 14:45:13.597490       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr003.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.28:2379 with maintenance client: context canceled"}
W0516 14:46:00.858231       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:46:13.720890       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:47:01.127274       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.28:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.28:2379: operation was canceled". Reconnecting...
E0516 14:56:13.578309       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr002.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.27:2379 with maintenance client: context canceled"}
E0516 15:01:13.590141       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr001.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.26:2379 with maintenance client: context canceled"}

Comment 8 Alessandro Rossi 2020-05-16 23:00:19 UTC
Same on 4.4.3 fresh install, but be aware, it's not only a false flag, etcd operator keeps restarting randomly from time to time about every 2hrs, and it could be stuck in degrading status for hours.

In addition to this, etcd and etcd-operator still keep dialing BOOTSTRAP machine, even hours/days later the setup.

Comment 10 Sam Batschelet 2020-05-19 16:30:15 UTC
We have a fix for health checks that is going through the last review stages. The plan is to backport to 4.4.z.

https://github.com/openshift/cluster-etcd-operator/pull/343

Comment 12 Sam Batschelet 2020-05-19 16:44:10 UTC
I would like to clarify the impact of this bug. Currently, we have a bug in cluster-etcd-operator that results in false negatives with our etcd health checks. etcd is still monitored by Prometheus and etcd-quorum-guard. In the event of an actual outage, you will receive Alerts against `etcdMemberDown`. So unless you are observing this Alert these events can safely be ignored.

Comment 13 Dan Mace 2020-05-20 16:28:13 UTC
*** Bug 1820734 has been marked as a duplicate of this bug. ***

Comment 16 Alessandro Rossi 2020-05-21 14:10:52 UTC
Can confirm that it is still present in 4.4.4 release.

Comment 17 Sam Batschelet 2020-05-21 18:45:42 UTC
Fix will land in 4.4.6

Comment 20 Mark McKinstry 2020-05-26 23:26:51 UTC
Is there a workaround for this until the new version is released?

Comment 21 Sam Batschelet 2020-05-27 00:06:14 UTC
> Is there a workaround for this until the new version is released?

From 4.4.4 release notes[1]

When updating an OpenShift Container Platform cluster from version 4.3 to 4.4, the etcd Operator sometimes fails to upgrade because it is in a degraded state. This is caused by an InstallerPod failure. As a workaround, you must force a new revision on etcd to overcome the InstallerPod failure, which allows the etcd Operator to recover:

Force a new revision on etcd:


$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
Verify the nodes are at the latest revision:


$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

[]1https://docs.openshift.com/container-platform/4.4/release_notes/ocp-4-4-release-notes.html

Comment 22 Sam Batschelet 2020-05-27 00:17:35 UTC
To be clear we are talking about 2 bugs. This BZ covers a bug that reports invalid health checks. So for example etcd eventing unhealthy member when the member is actually up. If you are seeing the issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c19 you are most likely seeing 1830510[1]. That fix landed in 4.4.4.

# invalid events such as reported in the description[2]
$ oc get events -n openshift-etcd-operator
LAST SEEN   TYPE      REASON                  OBJECT                     MESSAGE
3s          Warning   UnhealthyEtcdMember     deployment/etcd-operator   unhealthy members: ip-10-0-137-159.us-west-1.compute.internal, ip-10-0-17-119.us-west-1.compute.internal

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1830510
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c0

Comment 25 kedar 2020-05-28 13:57:09 UTC
Hello Team,

Tried Updating to 4.4.6 does not seem to have resolve this issue. 

Can you confirm if the proposed fix was actually included?

At the moment, the errata document attached to 4.4.6 appears to be 404 - https://access.redhat.com/errata/RHBA-2020:2310

Comment 27 Sam Batschelet 2020-05-28 14:19:00 UTC
looks like it missed 4.4.6 so 4.4.7

https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.4.6

Comment 29 ge liu 2020-06-05 07:30:50 UTC
Hello Sam, pls help to take a look comments 28

Comment 31 Sam Batschelet 2020-06-05 14:38:27 UTC
*** Bug 1844501 has been marked as a duplicate of this bug. ***

Comment 34 Christopher Wawak 2020-06-09 18:31:51 UTC
What is the probability this will land in 4.4.7? Thanks!

Comment 35 Sam Batschelet 2020-06-09 18:34:35 UTC
> What is the probability this will land in 4.4.7? Thanks!

100%

Comment 36 David Hernández Fernández 2020-06-15 14:49:50 UTC
We still see the following switchover on 4.4.8, is it expected? 
NAMESPACE                 LAST SEEN   TYPE      REASON                  OBJECT                                          MESSAGE
openshift-etcd-operator   2m         Normal    OperatorStatusChanged   deployment/etcd-operator                        Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
openshift-etcd-operator   2m         Normal    OperatorStatusChanged   deployment/etcd-operator                        Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"

Comment 37 Sam Batschelet 2020-06-15 17:49:21 UTC
we could explore if the transient failure should result in degraded state but the underlying message is unrelated to this bug.

Comment 41 Andreas Karis 2020-06-24 08:27:53 UTC
Hi,

Which version number of the etcd container contains the fix? Is this already fixed in 4.4.7 or 4.4.8?

Thanks,

Andreas

Comment 42 Sam Batschelet 2020-06-24 13:36:16 UTC
4.4.7

https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c35

Comment 43 Pedro Ibáñez 2020-06-26 16:30:24 UTC
Hello,

An upgrade from 4.4.6 to 4.4.7 got an etcd cluster with 4 members, the 4th member was the etcd-bootstrap.
A manual remove with the "etcdctl member remove <memberid>" command solved the issue with the etcd operator as it was in a degraded status.
Is there anything else to be done?

Thanks

Comment 44 W. Trevor King 2020-06-26 19:58:25 UTC
> An upgrade from 4.4.6 to 4.4.7 got an etcd cluster with 4 members, the 4th member was the etcd-bootstrap.

No idea why etcd-bootstrap would still be around by the time you initiate an update.  But sounds like a different bug from this one.  Can you file a new bug and provide more details (e.g. linking to a must-gather, or giving steps to reproduce)?

Comment 45 Andreas Karis 2020-06-27 07:14:36 UTC
No need to open a new BZ. Look at https://access.redhat.com/solutions/5161361 | https://bugzilla.redhat.com/show_bug.cgi?id=1832923 for further details about the bootstrap node bug.

Comment 46 Sam Batschelet 2020-06-28 15:14:35 UTC
> No need to open a new BZ. Look at https://access.redhat.com/solutions/5161361 | https://bugzilla.redhat.com/show_bug.cgi?id=1832923 for further details about the bootstrap node bug.

Well no that is not true this is 100% a bug if the report is accurate. I have never seen this in CI myself.

Comment 47 Pedro Ibáñez 2020-06-29 08:30:20 UTC
Thanks W. Trevor, Andreas and Sam,
can be that the etcd-bootstrap member was there since installation? I'm trying to get a must-gather tarball and I'll open a new BZ.

Comment 48 errata-xmlrpc 2020-07-13 17:35:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409


Note You need to log in before you can comment on or make changes to this bug.