Description of problem:
One of the etcd members failed with the following issue and is not serving:

#### attempt 9
member={name="etcd-bootstrap", peerURLs=[https://192.168.1.239:2380], clientURLs=[https://192.168.1.239:2379]}
member={name="wj45uos929a-4qdsw-master-2", peerURLs=[https://192.168.1.29:2380], clientURLs=[https://192.168.1.29:2379]}
member={name="wj45uos929a-4qdsw-master-1", peerURLs=[https://192.168.1.11:2380], clientURLs=[https://192.168.1.11:2379]}
target=nil, err=<nil>
#### sleeping...
timed out

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-09-27-230429

How reproducible:
Not sure; gathered info will be attached.

Steps to Reproduce:
1. Install an OCP cluster.
2. Check whether the etcd cluster is running well.

Actual results:
1/3 etcd members is not serving.

Expected results:
3/3 etcd members should be running well.

Additional info:
Since the gathered info is fairly large, it is shared via the shared drive: http://file.apac.redhat.com/~wjiang/etcd_clock_difference.tar.gz
This is an interesting situation. When the cluster-etcd-operator attempts to create an etcd client, it expects to be able to connect to at least one of the endpoints, or it errors out (context deadline exceeded). Here the client being created cannot connect to https://192.168.1.121:2379, https://192.168.1.11:2379, or https://192.168.1.29:2379 (connection refused), and when it tries https://192.168.1.239:2379 (bootstrap) it gets a TLS auth error instead.

### operator unable to connect
```
https://192.168.1.121:2379 https://192.168.1.11:2379 https://192.168.1.29:2379 https://192.168.1.239:2379]: context deadline exceeded"
W0929 07:50:26.160549 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.121:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.121:2379: connect: connection refused". Reconnecting...
W0929 07:50:26.160863 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.11:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.11:2379: connect: connection refused". Reconnecting...
W0929 07:50:26.164114 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.29:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.29:2379: connect: connection refused". Reconnecting...
W0929 07:50:26.170630 1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.239:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority". Reconnecting...
```

### etcd bootstrap logging TLS error
```
2020-09-29 07:50:18.275973 I | embed: rejected connection from "192.168.1.121:40526" (error "remote error: tls: bad certificate", ServerName "")
2020-09-29 07:50:24.124276 I | embed: rejected connection from "192.168.1.121:40608" (error "remote error: tls: bad certificate", ServerName "")
2020-09-29 07:50:25.129634 I | embed: rejected connection from "192.168.1.121:40688" (error "remote error: tls: bad certificate", ServerName "")
2020-09-29 07:50:26.447765 I | embed: rejected connection from "192.168.1.121:40752" (error "remote error: tls: bad certificate", ServerName "")
2020-09-29 07:50:29.395488 I | embed: rejected connection from "192.168.1.121:40860" (error "remote error: tls: bad certificate", ServerName "")
2020-09-29 07:50:33.302567 I | embed: rejected connection from "192.168.1.121:40934" (error "remote error: tls: bad certificate", ServerName "")
```

The operator keeps retrying, but the process then dies after it loses its leader status.

> W0929 07:50:37.860383 1 leaderelection.go:69] leader election lost

Then, when the operator starts back up, because the etcd static pod has already been scheduled (and is failing), we seem to have forgotten that the member still needs to be added. I see a few bugs here.
Verified on 4.6.0-0.nightly-2020-10-08-210814 with regression testing; have not hit it again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196