Description of problem:
oc get nodes no longer works after rebooting all nodes

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-19-080151

How reproducible:
Tested twice; happened both times

Steps to Reproduce:
[root@dhcp-41-193 FILE]# oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-129-164.us-east-2.compute.internal   Ready    worker   25m   v1.17.1
ip-10-0-141-118.us-east-2.compute.internal   Ready    master   31m   v1.17.1
ip-10-0-145-3.us-east-2.compute.internal     Ready    worker   24m   v1.17.1
ip-10-0-147-212.us-east-2.compute.internal   Ready    master   31m   v1.17.1
ip-10-0-168-89.us-east-2.compute.internal    Ready    worker   24m   v1.17.1
ip-10-0-172-216.us-east-2.compute.internal   Ready    master   31m   v1.17.1

#### Reboot all the nodes now

[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
Unable to connect to the server: net/http: TLS handshake timeout
[root@dhcp-41-193 FILE]# date
Thu Feb 20 16:08:20 EST 2020
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# date
Thu Feb 20 16:12:56 EST 2020
[root@dhcp-41-193 FILE]# date
Thu Feb 20 16:15:45 EST 2020
[root@dhcp-41-193 FILE]# oc get nodes
Unable to connect to the server: net/http: TLS handshake timeout
[root@dhcp-41-193 FILE]# oc get nodes
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# oc get clusterversions
The connection to the server api.weliang-ovn1202.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
[root@dhcp-41-193 FILE]# date
Thu Feb 20 16:24:47 EST 2020

Actual results:
The oc command does not work

Expected results:
oc should work

Additional info:
This happens regardless of the SDN plugin, and the issue looks like etcd (note that the etcd container is on restart attempt 16 and was created 1 second ago, i.e. it is crash-looping):

# crictl ps
CONTAINER       IMAGE                                                              CREATED          STATE     NAME                                  ATTEMPT   POD ID
e9efb5c2ed817   5bcff854afb83e019bbe7a4ccf66ddc9e7f3a56cfd5ca98dad24f807a8d9cc5d   1 second ago     Running   etcd                                  16        54341e8fdae1a
ae31dc1f52f60   dee5b59b53245bbb743f4b61ff4e4cf662e919ae15a7705ff4c54ff0d60c5282   15 minutes ago   Running   kube-apiserver-insecure-readyz        1         a81cdd7fc52ec
bd21ec14cc8e5   b28324f4fa8d6103e8dd542ae7c61f2930be68140244f902ec15c9151463f9a7   15 minutes ago   Running   kube-controller-manager-cert-syncer   1         861445b239a11
d12bc3a66a32e   dee5b59b53245bbb743f4b61ff4e4cf662e919ae15a7705ff4c54ff0d60c5282   15 minutes ago   Running   kube-apiserver-cert-syncer            1         a81cdd7fc52ec
938d6d7e16508   09d121f059abf8e5e217b666fedf0aa1607966bc5878be08e87d5178202f4c71   15 minutes ago   Running   cluster-policy-controller             1         861445b239a11
2789eed0edaef   c7ee309a23bf38345d94aa2cee0c1bb8ed91184309e26db18affdc2cf74ffcdb   15 minutes ago   Running   scheduler                             1         e3e8a82d232f4
5705b73f70333   c7ee309a23bf38345d94aa2cee0c1bb8ed91184309e26db18affdc2cf74ffcdb   15 minutes ago   Running   kube-controller-manager               1         861445b239a11
18c17ea0b8141   5bcff854afb83e019bbe7a4ccf66ddc9e7f3a56cfd5ca98dad24f807a8d9cc5d   15 minutes ago   Running   etcd-metrics                          1         54341e8fdae1a

# crictl logs -f e9efb5c2ed817
{"level":"warn","ts":"2020-02-24T17:05:49.183Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-c4107ef1-452f-4ae6-8032-6953f06b1696/10.0.61.129:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.64.254:2379: connect: connection refused\""}
Error: context deadline exceeded
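For anyone triaging similar reports: the member endpoint that etcd's client could not reach is embedded in that clientv3 retry_interceptor warning. A minimal extraction sketch, with the relevant log fields inlined as a shell variable for illustration (`endpoint` is just an illustrative variable name, not part of any tool):

```shell
# Sample clientv3 retry_interceptor warning from the etcd container logs above,
# inlined here so the extraction can be shown standalone.
log='{"level":"warn","msg":"retrying of unary invoker failed","error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.64.254:2379: connect: connection refused\""}'

# Pull out the host:port that could not be dialed.
endpoint=$(printf '%s\n' "$log" | grep -oE 'dial tcp [0-9.]+:[0-9]+' | awk '{print $3}')
echo "$endpoint"   # -> 10.0.64.254:2379
```

Here it yields 10.0.64.254:2379, i.e. the etcd client port on another node is not answering, which is consistent with the crash-looping etcd container shown above.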
Tested and verified on 4.4.0-0.nightly-2020-03-11-095741.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581