Bug 1883772

Summary: 1/3 etcd peer connection failed with 10 tries and timeout
Product: OpenShift Container Platform
Component: Etcd
Version: 4.5
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Status: CLOSED ERRATA
Reporter: weiwei jiang <wjiang>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
Type: Bug
Bug Blocks: 1884011
Last Closed: 2020-10-27 16:46:56 UTC

Description weiwei jiang 2020-09-30 08:16:20 UTC
Description of problem:
One of the etcd members failed with the following issue and is not serving:
#### attempt 9
      member={name="etcd-bootstrap", peerURLs=[https://192.168.1.239:2380}, clientURLs=[https://192.168.1.239:2379]
      member={name="wj45uos929a-4qdsw-master-2", peerURLs=[https://192.168.1.29:2380}, clientURLs=[https://192.168.1.29:2379]
      member={name="wj45uos929a-4qdsw-master-1", peerURLs=[https://192.168.1.11:2380}, clientURLs=[https://192.168.1.11:2379]
      target=nil, err=<nil>
#### sleeping...
timed out

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-09-27-230429

How reproducible:
Not sure; the gathered debug info will be attached.

Steps to Reproduce:
1. Install an OCP cluster
2. Check whether the etcd cluster is running well
3.

Actual results:
1 of 3 etcd members is not serving

Expected results:
3 of 3 etcd members should be running well

Additional info:

Comment 1 weiwei jiang 2020-09-30 08:18:05 UTC
Since the gathered info is a little big, it is shared via the shared drive: http://file.apac.redhat.com/~wjiang/etcd_clock_difference.tar.gz

Comment 3 Sam Batschelet 2020-09-30 13:52:36 UTC
This is an interesting situation. When the cluster-etcd-operator attempts to create an etcd client, it expects to be able to connect to at least one of the endpoints, or it will error (context deadline exceeded).

So here is the client being created: it can't connect to https://192.168.1.121:2379, https://192.168.1.11:2379, or https://192.168.1.29:2379, and when it tries to connect to https://192.168.1.239:2379 (bootstrap) it gets a TLS auth error instead.

### operator unable to connect
```
https://192.168.1.121:2379 https://192.168.1.11:2379 https://192.168.1.29:2379 https://192.168.1.239:2379]: context deadline exceeded"
  W0929 07:50:26.160549       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.121:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.121:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.160863       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.11:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.11:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.164114       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.29:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.29:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.170630       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.239:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority". Reconnecting...
```
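For context, here is a minimal sketch of how such a client gets constructed and where "context deadline exceeded" surfaces when no endpoint is reachable. This is not the operator's actual code; the endpoints, timeout, and omitted TLS config are illustrative, and the import path is the etcd v3.4 one.

```
// Minimal sketch of an operator-style etcd client: if none of the
// endpoints can be reached, requests fail with "context deadline
// exceeded", matching the log above. Endpoints, timeout, and the
// missing TLS config are illustrative only.
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3" // etcd v3.4 import path
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://192.168.1.121:2379",
			"https://192.168.1.11:2379",
			"https://192.168.1.29:2379",
			"https://192.168.1.239:2379",
		},
		DialTimeout: 5 * time.Second,
		// TLS config omitted here; without the right CA bundle the
		// bootstrap endpoint rejects the handshake, as shown above.
	})
	if err != nil {
		log.Fatalf("create etcd client: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	// With every endpoint refusing connections, a call like this is what
	// ultimately surfaces as "context deadline exceeded".
	if _, err := cli.MemberList(ctx); err != nil {
		log.Fatalf("member list: %v", err)
	}
}
```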

### etcd bootstrap logging TLS errors
```
  2020-09-29 07:50:18.275973 I | embed: rejected connection from "192.168.1.121:40526" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:24.124276 I | embed: rejected connection from "192.168.1.121:40608" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:25.129634 I | embed: rejected connection from "192.168.1.121:40688" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:26.447765 I | embed: rejected connection from "192.168.1.121:40752" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:29.395488 I | embed: rejected connection from "192.168.1.121:40860" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:33.302567 I | embed: rejected connection from "192.168.1.121:40934" (error "remote error: tls: bad certificate", ServerName "")
```
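These two messages are the two halves of the same mutual-TLS failure: the operator-side "certificate signed by unknown authority" means the client's CA bundle does not trust the serving cert presented by the bootstrap member, while the bootstrap-side "tls: bad certificate" means the bootstrap member does not trust the client certificate it was offered. Below is a hedged sketch of the client-side TLS config involved, using only the Go standard library; the file paths are hypothetical placeholders, not the operator's real cert locations.

```
// Hedged sketch of the client-side mutual-TLS config an etcd client
// needs. A wrong CA bundle here yields "certificate signed by unknown
// authority" on the client; an untrusted client cert yields
// "tls: bad certificate" on the server. Paths are hypothetical.
package tlsutil

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
)

// EtcdClientTLSConfig builds a *tls.Config from a client cert/key pair
// and the CA bundle that signed the etcd serving certificates.
func EtcdClientTLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
	// Client certificate presented to etcd (checked by the server).
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}

	// CA pool used to verify etcd's serving certificate (checked by the client).
	caPEM, err := ioutil.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates found in %s", caFile)
	}

	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	}, nil
}
```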

The operator keeps trying but then the process dies after it loses its leader status.

>   W0929 07:50:37.860383       1 leaderelection.go:69] leader election lost

Then, when we start the operator back up, because we have already scheduled the etcd static pod (which is failing), we seem to have forgotten that the member still needs to be added.

I see a few bugs here.
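The last point, forgetting after a restart that a member still has to be added even though its static pod is already scheduled, amounts to a missing idempotent membership check. A rough sketch of that check follows; this is not the cluster-etcd-operator's actual code, and the endpoint, peer URL, and function name are illustrative assumptions.

```
// Hedged sketch of an idempotent membership check: before assuming a
// scheduled etcd static pod can join, list the current members and add
// the peer URL only if it is still missing. Not the operator's real
// code; names and URLs are illustrative.
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// ensureMember adds peerURL to the cluster if no existing member lists it.
func ensureMember(ctx context.Context, cli *clientv3.Client, peerURL string) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return err
	}
	for _, m := range resp.Members {
		for _, u := range m.PeerURLs {
			if u == peerURL {
				// Already a member; nothing to do, even after an
				// operator restart.
				return nil
			}
		}
	}
	// Member is missing: add it so the static pod can actually join.
	_, err = cli.MemberAdd(ctx, []string{peerURL})
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://192.168.1.29:2379"},
		DialTimeout: 5 * time.Second,
		// TLS config omitted for brevity; see the sketch above.
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := ensureMember(ctx, cli, "https://192.168.1.121:2380"); err != nil {
		log.Fatalf("ensure member: %v", err)
	}
}
```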

Comment 6 ge liu 2020-10-10 08:52:15 UTC
Verified on 4.6.0-0.nightly-2020-10-08-210814 with regression testing; have not hit it again.

Comment 8 errata-xmlrpc 2020-10-27 16:46:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196