1834976 – VMWare UPI - etcd-operator reporting unhealthy members: master2,master1,master3 after fresh install

Bug 1834976 - VMWare UPI - etcd-operator reporting unhealthy members: master2,master1,master3 after fresh install

Keywords:
Status:	CLOSED DUPLICATE of bug 1832986
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-05-12 19:19 UTC by Morgan Peterman
Modified:	2020-05-18 19:10 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-05-13 15:32:04 UTC
Target Upstream Version:
Embargoed:

Attachments
etcd-master1 logs (128.30 KB, text/plain) 2020-05-12 19:19 UTC, Morgan Peterman	no flags	Details
etcd-master2 logs (138.78 KB, text/plain) 2020-05-12 19:19 UTC, Morgan Peterman	no flags	Details
etcd-master3 logs (174.22 KB, text/plain) 2020-05-12 19:20 UTC, Morgan Peterman	no flags	Details

Description Morgan Peterman 2020-05-12 19:19:09 UTC

Created attachment 1687791 [details]
etcd-master1 logs

Description of problem:
After provisioning a new OCP 4.4.3 cluster on VMware using UPI, etcd-operator reports all of the etcd members as unhealthy.

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Every time

Steps to Reproduce:
1. Provision a new OCP 4.4.3 cluster using VMware UPI

Actual results:
The following error is continuously reported:
Generated from openshift-cluster-etcd-operator-etcd-member-ip-migrator
unhealthy members: master2,master1,master3

Additional error(s): 
Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready
EtcdMembersDegraded: master2 members are unhealthy, members are unknown" to "NodeControllerDegraded: All master nodes are ready
EtcdMemberIPMigratorDegraded: rpc error: code = Canceled desc = grpc: the client connection is closing
EtcdMembersDegraded: master2 members are unhealthy, members are unknown"

Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 2
EtcdMembersAvailable: master2,master1,master3 members are available, have not started, are unhealthy, are unknown" to "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 2
EtcdMembersAvailable: master1,master3 members are available, have not started, master2 are unhealthy, are unknown"

Expected results:
Etcd members should not be reported as unhealthy.

Additional info:

$ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="EtcdMembersAvailable")]}{.message}{"\n"}'
master2,master1,master3 members are available,  have not started,  are unhealthy,  are unknown
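
The EtcdMembersAvailable message is hard to read because empty name lists leave dangling phrases. Comparing the two variants quoted above suggests a fixed template of four comma-separated name lists (available, have not started, unhealthy, unknown). That template is an inference from these samples, not something confirmed against the operator source. Under that assumption, a rough Python sketch to split the message into its four groups (the function name and dictionary keys are made up for illustration):

```python
import re

# Assumed template, inferred only from the messages quoted in this report:
#   "<names> members are available, <names> have not started,
#    <names> are unhealthy, <names> are unknown"
# Any of the four name lists may be empty.
MEMBERS_MESSAGE = re.compile(
    r"^(?P<available>\S*)\s*members are available,"
    r"\s*(?P<not_started>\S*?)\s*have not started,"
    r"\s*(?P<unhealthy>\S*?)\s*are unhealthy,"
    r"\s*(?P<unknown>\S*?)\s*are unknown$"
)


def parse_members_message(message):
    """Split an EtcdMembersAvailable message into four lists of member names."""
    match = MEMBERS_MESSAGE.match(message.strip())
    if match is None:
        raise ValueError("message does not follow the assumed template")
    # Each group is a comma-separated list of names, possibly empty.
    return {
        group: [name for name in names.split(",") if name]
        for group, names in match.groupdict().items()
    }


if __name__ == "__main__":
    msg = ("master1,master3 members are available, have not started, "
           "master2 are unhealthy, are unknown")
    print(parse_members_message(msg))
```

If the assumed template holds, the message returned by the `oc get etcd` query above lists all three members as available and none as unhealthy, while the second Available message lists only master2 as unhealthy.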

sh-4.2# etcdctl endpoint health --cluster
https://192.168.50.61:2379 is healthy: successfully committed proposal: took = 18.957755ms
https://192.168.50.63:2379 is healthy: successfully committed proposal: took = 21.740721ms
https://192.168.50.62:2379 is healthy: successfully committed proposal: took = 26.35663ms
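
To compare etcdctl's view with what the operator reports, the health output can be reduced to a list of healthy endpoints. The sketch below only recognises the "is healthy:" line shape shown in this report; the function name is illustrative, and any line of another shape is simply ignored.

```python
import re

# Matches only the line shape seen in this report:
#   "<endpoint> is healthy: successfully committed proposal: took = ..."
HEALTHY_LINE = re.compile(r"^(?P<endpoint>\S+) is healthy:")


def healthy_endpoints(output):
    """Return the endpoints that etcdctl reported as healthy, in output order."""
    endpoints = []
    for line in output.splitlines():
        match = HEALTHY_LINE.match(line.strip())
        if match:
            endpoints.append(match.group("endpoint"))
    return endpoints


if __name__ == "__main__":
    sample = (
        "https://192.168.50.61:2379 is healthy: successfully committed proposal: took = 18.957755ms\n"
        "https://192.168.50.63:2379 is healthy: successfully committed proposal: took = 21.740721ms\n"
        "https://192.168.50.62:2379 is healthy: successfully committed proposal: took = 26.35663ms\n"
    )
    print(healthy_endpoints(sample))
```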

sh-4.2# etcdctl member list -w table
+------------------+---------+---------+----------------------------+----------------------------+
|        ID        | STATUS  |  NAME   |         PEER ADDRS         |        CLIENT ADDRS        |
+------------------+---------+---------+----------------------------+----------------------------+
| 3c3af7d7f9c3b05d | started | master2 | https://192.168.50.62:2380 | https://192.168.50.62:2379 |
| 79a810c120bd61aa | started | master1 | https://192.168.50.61:2380 | https://192.168.50.61:2379 |
| be34329b46ef3c2f | started | master3 | https://192.168.50.63:2380 | https://192.168.50.63:2379 |
+------------------+---------+---------+----------------------------+----------------------------+

$ oc get pods -n openshift-etcd -o wide                                                                 
NAME           READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
etcd-master1   3/3     Running   0          64m   192.168.50.61   master1   <none>           <none>
etcd-master2   3/3     Running   0          60m   192.168.50.62   master2   <none>           <none>
etcd-master3   3/3     Running   0          60m   192.168.50.63   master3   <none>           <none>

etcd-master1 etcd container errors:
2020-05-12 19:00:52.084812 I | embed: rejected connection from "192.168.50.62:57686" (error "EOF", ServerName "")
2020-05-12 19:01:04.824170 I | embed: rejected connection from "192.168.50.62:58252" (error "EOF", ServerName "")
2020-05-12 19:01:16.218079 I | embed: rejected connection from "192.168.50.62:58716" (error "read tcp 192.168.50.61:2379->192.168.50.62:58716: read: connection reset by peer", ServerName "")
2020-05-12 19:01:22.230850 I | embed: rejected connection from "192.168.50.62:58988" (error "EOF", ServerName "")
2020-05-12 19:02:31.635544 I | embed: rejected connection from "192.168.50.62:33666" (error "EOF", ServerName "")

etcd-master2 etcd container errors:
2020-05-12 19:09:39.752452 I | embed: rejected connection from "10.254.0.13:37826" (error "EOF", ServerName "")
2020-05-12 19:09:39.752937 I | embed: rejected connection from "10.254.0.13:37842" (error "read tcp 192.168.50.62:2379->10.254.0.13:37842: read: connection reset by peer", ServerName "")
2020-05-12 19:10:28.006962 I | embed: rejected connection from "10.254.0.13:39802" (error "EOF", ServerName "")
2020-05-12 19:10:31.023392 I | embed: rejected connection from "10.254.0.13:39924" (error "EOF", ServerName "")

etcd-master3 etcd container errors:
2020-05-12 19:05:38.544817 I | embed: rejected connection from "192.168.50.62:36284" (error "EOF", ServerName "")
2020-05-12 19:06:04.892292 I | embed: rejected connection from "192.168.50.62:37378" (error "EOF", ServerName "")
2020-05-12 19:06:53.933542 I | embed: rejected connection from "192.168.50.62:39444" (error "read tcp 192.168.50.63:2379->192.168.50.62:39444: read: connection reset by peer", ServerName "")
2020-05-12 19:06:59.982829 I | embed: rejected connection from "192.168.50.62:39710" (error "EOF", ServerName "")
2020-05-12 19:07:05.270823 I | embed: rejected connection from "192.168.50.62:39922" (error "EOF", ServerName "")
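
To see which clients the rejected connections come from across the three attached logs, the "rejected connection" lines can be tallied by source address. This is a rough sketch that assumes every relevant line has the same shape as the excerpts above; the function name is illustrative.

```python
import re
from collections import Counter

# Matches the excerpted line shape:
#   ... embed: rejected connection from "<host>:<port>" (error "...", ServerName "")
REJECTED = re.compile(
    r'embed: rejected connection from "(?P<host>[^"]+):(?P<port>\d+)"'
)


def count_rejections_by_source(log_text):
    """Count rejected-connection log lines per source host (port ignored)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REJECTED.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_rejections.py etcd-master1.log [more logs...]
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            print(path, dict(count_rejections_by_source(handle.read())))
```

In the excerpts above, the rejections on master1 and master3 all come from 192.168.50.62 (master2), and those on master2 all come from 10.254.0.13.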

Comment 1 Morgan Peterman 2020-05-12 19:19:43 UTC

Created attachment 1687793 [details]
etcd-master2 logs

Comment 2 Morgan Peterman 2020-05-12 19:20:10 UTC

Created attachment 1687795 [details]
etcd-master3 logs

Comment 3 Morgan Peterman 2020-05-13 15:32:04 UTC


*** This bug has been marked as a duplicate of bug 1832986 ***
