Description of problem:
There appears to be a 15-minute delay before a master node is taken out of the cluster after it is abruptly shut down (i.e. forced power-off, crash, etc.).

Version-Release number of selected component (if applicable):
OpenShift 3.4

How reproducible:
Seems to be always

Steps to Reproduce:
1- Shut down Master1.
2- Master2 and Master3 saw Master1 go down:

Mar 6 17:29:29 xxx00189 etcd: lost the TCP streaming connection with peer f1a8c1bd301a53b2 (stream MsgApp v2 reader)
Mar 6 17:29:29 xxx00189 etcd: peer f1a8c1bd301a53b2 became inactive
Mar 6 17:29:29 xxx00189 etcd: lost the TCP streaming connection with peer f1a8c1bd301a53b2 (stream Message reader)
Mar 6 17:30:02 xxx00189 etcd: lost the TCP streaming connection with peer f1a8c1bd301a53b2 (stream MsgApp v2 writer)
Mar 6 17:30:12 xxx00189 etcd: lost the TCP streaming connection with peer f1a8c1bd301a53b2 (stream Message writer)

3- Increased the replicas from 1 to 3 for a particular pod, but nothing happened. It just sat there saying "Scaling to 3".
4- Deleted a pod; the pod did not spin up again.
5- Ran 'oc get nodes'; the status for Master1 did not change from Ready.

After 15 minutes (see logs below), the OCP cluster finally took Master1 out:

Mar 6 17:44:56 xxx00189 etcd: failed to reach the peerURL(https://10.245.160.88:2380) of member f1a8c1bd301a53b2 (Get https://10.245.160.88:2380/version: dial tcp 10.245.160.88:2380: i/o timeout)
Mar 6 17:45:01 xxx00189 etcd: failed to reach the peerURL(https://10.245.160.88:2380) of member f1a8c1bd301a53b2 (Get https://10.245.160.88:2380/version: dial tcp 10.245.160.88:2380: i/o timeout)
...
Mar 6 17:45:43 xxx00189 etcd: health check for peer f1a8c1bd301a53b2 could not connect: dial tcp 10.245.160.88:2380: i/o timeout

Once the above messages started to flow on Master2 and Master3, Master1's status became NotReady, the deleted pod was respawned, and the scale-up from 1 to 3 replicas finally took effect.
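For context, the "failed to reach the peerURL" messages correspond to etcd probing each member's peer URL over HTTPS and hitting a dial timeout once the host is gone. Below is a minimal, self-contained Go sketch of that kind of probe, useful for checking a dead peer by hand. The peer URL is taken from the logs above; the 5-second timeout and the relaxed TLS verification are illustrative assumptions for the sketch only, not etcd's actual configuration (real etcd peers use mutual TLS with proper CAs).

package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// probePeer issues a GET to <peerURL>/version, the same endpoint the
// etcd health check in the logs above is querying. An unreachable peer
// surfaces as an i/o timeout error from the dial.
func probePeer(peerURL string) error {
	client := &http.Client{
		Timeout: 5 * time.Second, // illustrative timeout, not etcd's
		Transport: &http.Transport{
			// Verification skipped only for this sketch.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(peerURL + "/version")
	if err != nil {
		return fmt.Errorf("failed to reach the peerURL(%s): %v", peerURL, err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	if err := probePeer("https://10.245.160.88:2380"); err != nil {
		fmt.Println(err)
	}
}

Running this against a powered-off master should report a dial timeout almost immediately, which makes the 15-minute gap between the lost stream connections and the first "failed to reach the peerURL" message stand out as the delay in question.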
Verified with:
openshift v3.4.1.44.52
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1134