Bug 1805487 - oc command no longer works after rebooting all nodes
Summary: oc command no longer works after rebooting all nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-20 21:25 UTC by Weibin Liang
Modified: 2020-05-04 11:39 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:38:36 UTC
Target Upstream Version:
weliang: needinfo-


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Github openshift etcd pull 37 0 None closed [openshift-4.4] Bug 1808546: If we weren't able to get client or target member go ahead and start 2020-03-16 02:41:05 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:39:11 UTC

Description Weibin Liang 2020-02-20 21:25:21 UTC
Description of problem:
oc get nodes no longer works after rebooting all the nodes

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-19-080151

How reproducible:
Tested twice; the issue occurred both times

Steps to Reproduce:
[root@dhcp-41-193 FILE]# oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-164.us-east-2.compute.internal   Ready    worker   25m   v1.17.1
ip-10-0-141-118.us-east-2.compute.internal   Ready    master   31m   v1.17.1
ip-10-0-145-3.us-east-2.compute.internal     Ready    worker   24m   v1.17.1
ip-10-0-147-212.us-east-2.compute.internal   Ready    master   31m   v1.17.1
ip-10-0-168-89.us-east-2.compute.internal    Ready    worker   24m   v1.17.1
ip-10-0-172-216.us-east-2.compute.internal   Ready    master   31m   v1.17.1

#### Reboot all the nodes now

[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
Unable to connect to the server: net/http: TLS handshake timeout
[root@dhcp-41-193 FILE]# date
Thu Feb 20 16:08:20 EST 2020
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# date
Thu Feb 20 16:12:56 EST 2020
[root@dhcp-41-193 FILE]# date
Thu Feb 20 16:15:45 EST 2020
[root@dhcp-41-193 FILE]# oc get nodes
Unable to connect to the server: net/http: TLS handshake timeout
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get clusterversions
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# date
Thu Feb 20 16:24:47 EST 2020
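Two distinct failures appear in the transcript above: "connection refused" (nothing is listening on port 6443 yet) and "TLS handshake timeout" (the listener is up but the apiserver is not yet serving). Instead of rerunning oc by hand for 15+ minutes, a small polling helper can wait out the recovery. This is an illustrative sketch, not part of any OpenShift tooling; the retry function name and parameters are invented here:

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command...>
retry() {
    max=$1; shift
    delay=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example: poll the API server for up to 30 minutes after a full reboot.
#   retry 180 10 oc get nodes
```

The 180x10s numbers are arbitrary; the recovery window observed above was roughly 16 minutes.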


Actual results:
The oc command no longer works; all requests to the API server on port 6443 fail with "connection refused" or "TLS handshake timeout".

Expected results:
The oc command should work again once the nodes finish rebooting.

Additional info:

Comment 3 Juan Luis de Sousa-Valadas 2020-02-24 17:15:26 UTC
This happens regardless of the SDN plugin, and the issue looks like etcd:

# crictl ps
CONTAINER           IMAGE                                                              CREATED             STATE               NAME                                  ATTEMPT             POD ID
e9efb5c2ed817       5bcff854afb83e019bbe7a4ccf66ddc9e7f3a56cfd5ca98dad24f807a8d9cc5d   1 second ago        Running             etcd                                  16                  54341e8fdae1a
ae31dc1f52f60       dee5b59b53245bbb743f4b61ff4e4cf662e919ae15a7705ff4c54ff0d60c5282   15 minutes ago      Running             kube-apiserver-insecure-readyz        1                   a81cdd7fc52ec
bd21ec14cc8e5       b28324f4fa8d6103e8dd542ae7c61f2930be68140244f902ec15c9151463f9a7   15 minutes ago      Running             kube-controller-manager-cert-syncer   1                   861445b239a11
d12bc3a66a32e       dee5b59b53245bbb743f4b61ff4e4cf662e919ae15a7705ff4c54ff0d60c5282   15 minutes ago      Running             kube-apiserver-cert-syncer            1                   a81cdd7fc52ec
938d6d7e16508       09d121f059abf8e5e217b666fedf0aa1607966bc5878be08e87d5178202f4c71   15 minutes ago      Running             cluster-policy-controller             1                   861445b239a11
2789eed0edaef       c7ee309a23bf38345d94aa2cee0c1bb8ed91184309e26db18affdc2cf74ffcdb   15 minutes ago      Running             scheduler                             1                   e3e8a82d232f4
5705b73f70333       c7ee309a23bf38345d94aa2cee0c1bb8ed91184309e26db18affdc2cf74ffcdb   15 minutes ago      Running             kube-controller-manager               1                   861445b239a11
18c17ea0b8141       5bcff854afb83e019bbe7a4ccf66ddc9e7f3a56cfd5ca98dad24f807a8d9cc5d   15 minutes ago      Running             etcd-metrics                          1                   54341e8fdae1a

# crictl logs -f e9efb5c2ed817
{"level":"warn","ts":"2020-02-24T17:05:49.183Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-c4107ef1-452f-4ae6-8032-6953f06b1696/10.0.61.129:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.64.254:2379: connect: connection refused\""}
Error: context deadline exceeded
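The crictl ps output above already points at the culprit: every control-plane container is on ATTEMPT 1, while etcd is on ATTEMPT 16 and only 1 second old, i.e. crash-looping. A small awk filter can flag such containers automatically; this is a hypothetical sketch whose field positions assume the three-word CREATED column ("15 minutes ago") seen in the output above:

```shell
# Hypothetical crash-loop detector for `crictl ps` output (not an OpenShift tool).
# Assumes the CREATED column is three whitespace-separated words, as in the
# output above, so NAME lands in field 7 and ATTEMPT in field 8.
flag_crashloops() {
    # Skip the header row; print any container restarted more than 5 times.
    awk 'NR > 1 && $8 > 5 { print $7, "restarts:", $8 }'
}

# Usage on a master node:
#   crictl ps | flag_crashloops
```

Against the output above this would report only the etcd container with its 16 restarts.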

Comment 13 Weibin Liang 2020-03-11 18:51:58 UTC
Tested and verified in 4.4.0-0.nightly-2020-03-11-095741

Comment 16 errata-xmlrpc 2020-05-04 11:38:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

