1832986 – EtcdMembersDegraded false alarms

Bug 1832986 - EtcdMembersDegraded false alarms

Summary: EtcdMembersDegraded false alarms

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd Operator
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Duplicates (4):	1820734 1834976 1835368 1844501 (view as bug list)
Depends On:
Blocks:	1838781
TreeView+	depends on / blocked

Reported:	2020-05-07 15:08 UTC by ggore
Modified:	2023-12-15 17:51 UTC (History)
CC List:	51 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Improper health checks Consequence: Incorrect event reports and misleading log messages Fix: Detect the health status correctly and prepare complete English sentences. Result: Proper status is reported in events, and log messages reflect the current health status with complete English sentences.
Clone Of:
Clones:	1838781 (view as bug list)
Environment:
Last Closed:	2020-07-13 17:35:59 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-etcd-operator pull 343	None	closed	Bug 1832986: pkg/etcdcli: fix health checks	2021-02-16 14:03:34 UTC
Red Hat Knowledge Base (Solution)	5070671	None	None	None	2020-05-26 15:18:50 UTC
Red Hat Product Errata	RHBA-2020:2409	None	None	None	2020-07-13 17:36:19 UTC

Internal Links: 1852309

Description ggore 2020-05-07 15:08:33 UTC

Description of problem:

After fresh installation of 4.4.3 or upgrading from 4.3 to 4.4.3, etcd operator consistently shows these logs:
~~~
57m         Normal   OperatorStatusChanged   deployment/etcd-operator   Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 5\nEtcdMembersAvailable: node3.example.com,node4.example.com,node5.example.com members are available,  have not started,  are unhealthy,  are unknown" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 5\nEtcdMembersAvailable: node3.example.com,node5.example.com members are available,  have not started, node4.example.com are unhealthy,  are unknown"
~~~

Etcd cluster is healthy and all etcd members are available.

Comment 2 Morgan Peterman 2020-05-13 15:32:04 UTC

*** Bug 1834976 has been marked as a duplicate of this bug. ***

Comment 3 kedar 2020-05-14 09:42:26 UTC

Hello,

Cu has provided more information on the issue as follows

The cluster is experiencing regular operator failures for various operators, logging messages such as regarding pods for a few operators such as. Often the operator's pods redeploy shortly after:

community-operators-57bcd7676c-v4pw2
Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s 

catalog-operator-6c6c549cf7-8p542
Liveness probe failed: Get http://10.130.0.19:8080/healthz: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

We presume this could be related to etcd.


We have a different cluster which is running on vSphere Infrastructure (rather than OpenStack), deployed using UPI (rather than IPI) which appears to have this identical issue. We have been unable to deploy any 4.4 cluster without experiencing this issue.

Comment 4 Sam Batschelet 2020-05-14 13:43:49 UTC

*** Bug 1835368 has been marked as a duplicate of this bug. ***

Comment 6 kedar 2020-05-15 08:44:06 UTC

Hello,

Cu reported back with useful information regarding the issue.

`````````````````
This issue is becoming urgent because our largest customer has requested an OpenShift 4.4 cluster.


I have noticed something odd which could be related - in all clusters (OpenStack IPI and VMware UPI), the logs for etcd-operator pod (in openshift-etcd-operator project) is showing continual and repeated connection errors, trying to connect to the IP address of the bootstrap VM:


From VMware-based UPI cluster:


$ oc logs etcd-operator-5d97b6445f-6xjhh | grep dialling
...
W0514 14:33:04.546487       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: connect: no route to host". Reconnecting...
W0514 14:33:07.345430       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: operation was canceled". Reconnecting...
W0514 14:33:08.642433       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.1.1.250:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.1.250:2379: connect: no route to host". Reconnecting...
and in the OpenStack IPI cluster (where the earlier logs uploaded to this case came from)


$ oc logs etcd-operator-59cf47554b-m95s9 | grep dialing
W0514 14:52:00.115637       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.422120       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.436588       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: operation was canceled". Reconnecting...
W0514 14:52:00.976723       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.0.0.19:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.19:2379: connect: no route to host". Reconnecting...
As we know, one of the final steps of an install is to delete the bootstrap VM; manually in UPI and automatically in IPI. In our UPI install, we delete the bootstrap about 30 seconds after openshift-install wait-for bootstrap-complete completes. Therefore, the etcd-operator should NOT be trying to connect to the bootstrap.


Note - I have confirmed that the unhealthy member issue (and the bootstrap etcd-operator behaviour) remains after the cluster is updated to version 4.4.4



`````````````````

Comment 7 john schiwitz 2020-05-16 15:03:38 UTC

i also tried an upgrade to 4.4.3 from 4.3.19 last night.  the upgrade is stalled because of a degraded etcd operator
[ocp41@dhmaapp286 ocp4upc]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.19    True        True          10h     Working towards 4.4.3: 11% complete
[ocp41@dhmaapp286 ocp4upc]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.19    True        False         False      121d
cloud-credential                           4.3.19    True        False         False      121d
cluster-autoscaler                         4.3.19    True        False         False      121d
console                                    4.3.19    True        False         False      11h
dns                                        4.3.19    True        False         False      23d
etcd                                       4.4.3     False       True          True       10h
image-registry                             4.3.19    True        False         False      11h
ingress                                    4.3.19    True        False         False      22h
insights                                   4.3.19    True        False         False      120d
kube-apiserver                             4.3.19    True        False         False      121d
kube-controller-manager                    4.3.19    True        False         False      121d
kube-scheduler                             4.3.19    True        False         False      121d
machine-api                                4.3.19    True        False         False      121d
machine-config                             4.3.19    True        False         False      19h
marketplace                                4.3.19    True        False         False      11h
monitoring                                 4.3.19    True        False         False      11h
network                                    4.3.19    True        False         False      121d
node-tuning                                4.3.19    True        False         False      5h32m
openshift-apiserver                        4.3.19    True        False         False      11h
openshift-controller-manager               4.3.19    True        False         False      23d
openshift-samples                          4.3.19    True        False         False      2m19s
operator-lifecycle-manager                 4.3.19    True        False         False      121d
operator-lifecycle-manager-catalog         4.3.19    True        False         False      121d
operator-lifecycle-manager-packageserver   4.3.19    True        False         False      11h
service-ca                                 4.3.19    True        False         False      121d
service-catalog-apiserver                  4.3.19    True        False         False      121d
service-catalog-controller-manager         4.3.19    True        False         False      121d
storage                                    4.3.19    True        False         False      11h

getting similar messages in the etcd operator logs

[ocp41@dhmaapp286 ocp4upc]$ oc logs etcd-operator-59cf47554b-shccm |grep dial
W0516 14:16:58.777781       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:20:56.916372       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
E0516 14:23:13.636242       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr003.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.28:2379 with maintenance client: context canceled"}
W0516 14:25:13.039698       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
E0516 14:27:13.618487       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr002.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.27:2379 with maintenance client: context canceled"}
W0516 14:30:38.613126       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:35:49.848968       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:43:13.557082       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:43:54.288387       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.28:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.28:2379: operation was canceled". Reconnecting...
E0516 14:45:13.597490       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr003.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.28:2379 with maintenance client: context canceled"}
W0516 14:46:00.858231       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:46:13.720890       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.26:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.26:2379: operation was canceled". Reconnecting...
W0516 14:47:01.127274       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://10.131.27.28:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.131.27.28:2379: operation was canceled". Reconnecting...
E0516 14:56:13.578309       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr002.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.27:2379 with maintenance client: context canceled"}
E0516 15:01:13.590141       1 etcdcli.go:324] error getting etcd member etcd-member-shmamtr001.hma.am.honda.com status: &errors.errorString{s:"failed to dial endpoint https://10.131.27.26:2379 with maintenance client: context canceled"}

Comment 8 Alessandro Rossi 2020-05-16 23:00:19 UTC

Same on 4.4.3 fresh install, but be aware, it's not only a false flag, etcd operator keeps restarting randomly from time to time about every 2hrs, and it could be stuck in degrading status for hours.

In addition to this, etcd and etcd-operator still keep dialing BOOTSTRAP machine, even hours/days later the setup.

Comment 10 Sam Batschelet 2020-05-19 16:30:15 UTC

We have a fix for health checks that is going through the last review stages. The plan is to backport to 4.4.z.

https://github.com/openshift/cluster-etcd-operator/pull/343

Comment 12 Sam Batschelet 2020-05-19 16:44:10 UTC

I would like to clarify the impact of this bug. Currently, we have a bug in cluster-etcd-operator that results in false negatives with our etcd health checks. etcd is still monitored by Prometheus and etcd-quorum-guard. In the event of an actual outage, you will receive Alerts against `etcdMemberDown`. So unless you are observing this Alert these events can safely be ignored.

Comment 13 Dan Mace 2020-05-20 16:28:13 UTC

*** Bug 1820734 has been marked as a duplicate of this bug. ***

Comment 16 Alessandro Rossi 2020-05-21 14:10:52 UTC

Can confirm that it is still present in 4.4.4 release.

Comment 17 Sam Batschelet 2020-05-21 18:45:42 UTC

Fix will land in 4.4.6

Comment 20 Mark McKinstry 2020-05-26 23:26:51 UTC

Is there a workaround for this until the new version is released?

Comment 21 Sam Batschelet 2020-05-27 00:06:14 UTC

> Is there a workaround for this until the new version is released?

From 4.4.4 release notes[1]

When updating an OpenShift Container Platform cluster from version 4.3 to 4.4, the etcd Operator sometimes fails to upgrade because it is in a degraded state. This is caused by an InstallerPod failure. As a workaround, you must force a new revision on etcd to overcome the InstallerPod failure, which allows the etcd Operator to recover:

Force a new revision on etcd:


$ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
Verify the nodes are at the latest revision:


$ oc get etcd '-o=jsonpath={range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

[]1https://docs.openshift.com/container-platform/4.4/release_notes/ocp-4-4-release-notes.html

Comment 22 Sam Batschelet 2020-05-27 00:17:35 UTC

To be clear we are talking about 2 bugs. This BZ covers a bug that reports invalid health checks. So for example etcd eventing unhealthy member when the member is actually up. If you are seeing the issues reported in https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c19 you are most likely seeing 1830510[1]. That fix landed in 4.4.4.

# invalid events such as reported in the description[2]
$ oc get events -n openshift-etcd-operator
LAST SEEN   TYPE      REASON                  OBJECT                     MESSAGE
3s          Warning   UnhealthyEtcdMember     deployment/etcd-operator   unhealthy members: ip-10-0-137-159.us-west-1.compute.internal, ip-10-0-17-119.us-west-1.compute.internal

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1830510
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c0

Comment 25 kedar 2020-05-28 13:57:09 UTC

Hello Team,

Tried Updating to 4.4.6 does not seem to have resolve this issue. 

Can you confirm if the proposed fix was actually included?

At the moment, the errata document attached to 4.4.6 appears to be 404 - https://access.redhat.com/errata/RHBA-2020:2310

Comment 27 Sam Batschelet 2020-05-28 14:19:00 UTC

looks like it missed 4.4.6 so 4.4.7

https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.4.6

Comment 29 ge liu 2020-06-05 07:30:50 UTC

Hello Sam, pls help to take a look comments 28

Comment 31 Sam Batschelet 2020-06-05 14:38:27 UTC

*** Bug 1844501 has been marked as a duplicate of this bug. ***

Comment 34 Christopher Wawak 2020-06-09 18:31:51 UTC

What is the probability this will land in 4.4.7? Thanks!

Comment 35 Sam Batschelet 2020-06-09 18:34:35 UTC

> What is the probability this will land in 4.4.7? Thanks!

100%

Comment 36 David Hernández Fernández 2020-06-15 14:49:50 UTC

We still see the following switchover on 4.4.8, is it expected? 
NAMESPACE                 LAST SEEN   TYPE      REASON                  OBJECT                                          MESSAGE
openshift-etcd-operator   2m         Normal    OperatorStatusChanged   deployment/etcd-operator                        Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
openshift-etcd-operator   2m         Normal    OperatorStatusChanged   deployment/etcd-operator                        Status for clusteroperator/etcd changed: Degraded message changed from "EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing\nNodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"

Comment 37 Sam Batschelet 2020-06-15 17:49:21 UTC

we could explore if the transient failure should result in degraded state but the underlying message is unrelated to this bug.

Comment 41 Andreas Karis 2020-06-24 08:27:53 UTC

Hi,

Which version number of the etcd container contains the fix? Is this already fixed in 4.4.7 or 4.4.8?

Thanks,

Andreas

Comment 42 Sam Batschelet 2020-06-24 13:36:16 UTC

4.4.7

https://bugzilla.redhat.com/show_bug.cgi?id=1832986#c35

Comment 43 Pedro Ibáñez 2020-06-26 16:30:24 UTC

Hello,

An upgrade from 4.4.6 to 4.4.7 got an etcd cluster with 4 members, the 4th member was the etcd-bootstrap.
A manual remove with the "etcdctl member remove <memberid>" command solved the issue with the etcd operator as it was in a degraded status.
Is there anything else to be done?

Thanks

Comment 44 W. Trevor King 2020-06-26 19:58:25 UTC

> An upgrade from 4.4.6 to 4.4.7 got an etcd cluster with 4 members, the 4th member was the etcd-bootstrap.

No idea why etcd-bootstrap would still be around by the time you initiate an update.  But sounds like a different bug from this one.  Can you file a new bug and provide more details (e.g. linking to a must-gather, or giving steps to reproduce)?

Comment 45 Andreas Karis 2020-06-27 07:14:36 UTC

No need to open a new BZ. Look at https://access.redhat.com/solutions/5161361 | https://bugzilla.redhat.com/show_bug.cgi?id=1832923 for further details about the bootstrap node bug.

Comment 46 Sam Batschelet 2020-06-28 15:14:35 UTC

> No need to open a new BZ. Look at https://access.redhat.com/solutions/5161361 | https://bugzilla.redhat.com/show_bug.cgi?id=1832923 for further details about the bootstrap node bug.

Well no that is not true this is 100% a bug if the report is accurate. I have never seen this in CI myself.

Comment 47 Pedro Ibáñez 2020-06-29 08:30:20 UTC

Thanks W. Trevor, Andreas and Sam,
can be that the etcd-bootstrap member was there since installation? I'm trying to get a must-gather tarball and I'll open a new BZ.

Comment 48 errata-xmlrpc 2020-07-13 17:35:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Note You need to log in before you can comment on or make changes to this bug.

adeshpan
aelganzo
akaris
alchan
alerossi
alpatel
aprajapa
christoph.obexer
cwawak
dahernan
dhanacek
dsundqvi
farrotin
geliu
hector.valverde
ian.r.radtkerosen
igortiunov
ingvarr.zhmakin
jcall
jkaur
john_schiwitz
jritter
kgarriso
ksalunkh
laparici
mgahagan
michele.sandro.emma
mmckinst
morgan.peterman
nschuetz
oarribas
ocasalsa
pasik
pfruth
pibanezr
pkhaire
prsurve
rdave
rgregory
rkshirsa
sauchter
sbatsche
scott.worthington
sferguso
skolicha
susuresh
tim.sterk
travi
vz.mec
wking
zisis.lianas