Bug 1883772 - 1/3 etcd peer connection failed with 10 tries and timeout
Summary: 1/3 etcd peer connection failed with 10 tries and timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1884011
 
Reported: 2020-09-30 08:16 UTC by weiwei jiang
Modified: 2020-10-27 16:47 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:46:56 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 457 0 None closed Bug 1883772: pkg/operator/clustermembercontroller: resync every minute 2021-01-08 06:00:39 UTC
Github openshift etcd pull 56 0 None closed Bug 1883772: discover-etcd-initial-cluster: improve error handling when we dont scale member 2021-01-08 06:00:39 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:47:09 UTC

Description weiwei jiang 2020-09-30 08:16:20 UTC
Description of problem:
One of the etcd members failed with the following issue and is not serving:
#### attempt 9
      member={name="etcd-bootstrap", peerURLs=[https://192.168.1.239:2380], clientURLs=[https://192.168.1.239:2379]}
      member={name="wj45uos929a-4qdsw-master-2", peerURLs=[https://192.168.1.29:2380], clientURLs=[https://192.168.1.29:2379]}
      member={name="wj45uos929a-4qdsw-master-1", peerURLs=[https://192.168.1.11:2380], clientURLs=[https://192.168.1.11:2379]}
      target=nil, err=<nil>
#### sleeping...
timed out

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-09-27-230429

How reproducible:
Not sure; the gathered info will be attached.

Steps to Reproduce:
1. Install an OCP cluster
2. Check whether the etcd cluster is running well
3.

Actual results:
1 of 3 etcd members is not serving

Expected results:
All 3 etcd members should be running well

Additional info:
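For reference, the "#### attempt N" / "#### sleeping..." / "timed out" output in the description comes from a bounded retry loop in discover-etcd-initial-cluster: the container waits for its own member to appear in the etcd member list and gives up after the 10 attempts mentioned in the summary. Below is a minimal Go sketch of that pattern, not the actual openshift/etcd code; the function name waitForMember and the timeout values are illustrative assumptions.

```
// Sketch only, not the actual discover-etcd-initial-cluster implementation.
// waitForMember and the timeout values are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

const maxAttempts = 10

// waitForMember polls the cluster until targetPeerURL appears in the member
// list, mirroring the "#### attempt N ... #### sleeping... timed out" output
// seen in the description.
func waitForMember(endpoints []string, targetPeerURL string) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		fmt.Printf("#### attempt %d\n", attempt)

		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   endpoints,
			DialTimeout: 2 * time.Second,
		})
		if err == nil {
			ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			resp, merr := cli.MemberList(ctx)
			cancel()
			cli.Close()
			if merr == nil {
				for _, m := range resp.Members {
					for _, peerURL := range m.PeerURLs {
						if peerURL == targetPeerURL {
							// Our member was added; safe to start etcd.
							return nil
						}
					}
				}
			}
		}

		fmt.Println("#### sleeping...")
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out waiting for member %s after %d attempts", targetPeerURL, maxAttempts)
}
```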

Comment 1 weiwei jiang 2020-09-30 08:18:05 UTC
Because the gathered info is a bit large, it is shared via the shared drive: http://file.apac.redhat.com/~wjiang/etcd_clock_difference.tar.gz

Comment 3 Sam Batschelet 2020-09-30 13:52:36 UTC
This is an interesting situation. When the cluster-etcd-operator attempts to create an etcd client, it expects to be able to connect to one of the endpoints, or it will error (context deadline exceeded).

So here the client being created cannot connect to https://192.168.1.121:2379, https://192.168.1.11:2379, or https://192.168.1.29:2379, and when it tries to connect to https://192.168.1.239:2379 (bootstrap) it gets a TLS auth error.

### operator unable to connect
```
https://192.168.1.121:2379 https://192.168.1.11:2379 https://192.168.1.29:2379 https://192.168.1.239:2379]: context deadline exceeded"
  W0929 07:50:26.160549       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.121:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.121:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.160863       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.11:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.11:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.164114       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.29:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.29:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.170630       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.239:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority". Reconnecting...
```
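For illustration, roughly what that client-creation path looks like against the etcd clientv3 API. This is a minimal sketch, not the operator's actual code; checkEndpoints and the 15-second timeouts are assumptions.

```
// Sketch only, not the cluster-etcd-operator's actual client code.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// checkEndpoints builds a client against all endpoints and verifies that at
// least one of them is reachable. If none responds within the deadline, the
// call fails with "context deadline exceeded", matching the operator log above.
func checkEndpoints(endpoints []string, tlsConfig *tls.Config) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 15 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	if _, err := cli.MemberList(ctx); err != nil {
		return fmt.Errorf("unable to reach etcd at %v: %w", endpoints, err)
	}
	return nil
}
```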

### etcd bootstrap logging TLS error.
```
  2020-09-29 07:50:18.275973 I | embed: rejected connection from "192.168.1.121:40526" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:24.124276 I | embed: rejected connection from "192.168.1.121:40608" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:25.129634 I | embed: rejected connection from "192.168.1.121:40688" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:26.447765 I | embed: rejected connection from "192.168.1.121:40752" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:29.395488 I | embed: rejected connection from "192.168.1.121:40860" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:33.302567 I | embed: rejected connection from "192.168.1.121:40934" (error "remote error: tls: bad certificate", ServerName "")
```
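Note that the two logs show the same handshake failure from both sides: the operator at 192.168.1.121 rejects the bootstrap member's serving certificate as signed by an unknown authority and sends a TLS alert, which the bootstrap etcd records as "remote error: tls: bad certificate". A minimal sketch of the client-side tls.Config involved is below; the file paths and function name are illustrative, not the operator's actual code.

```
// Sketch only: how a client-side tls.Config is typically assembled. If the CA
// bundle passed here does not include the signer of the bootstrap member's
// serving certificate, the handshake fails client-side with "x509: certificate
// signed by unknown authority" and is logged server-side as
// "remote error: tls: bad certificate". File paths are illustrative.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
)

func clientTLSConfig(caFile, certFile, keyFile string) (*tls.Config, error) {
	caPEM, err := ioutil.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates found in %s", caFile)
	}

	clientCert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}

	return &tls.Config{
		RootCAs:      pool,                          // CAs trusted for etcd serving certs
		Certificates: []tls.Certificate{clientCert}, // client cert presented to etcd
	}, nil
}
```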

The operator keeps trying but then the process dies after it loses its leader status.

>   W0929 07:50:37.860383       1 leaderelection.go:69] leader election lost

Then, when the operator starts back up, because the etcd static pod (which is failing) has already been scheduled, it seems to have forgotten that the member still needs to be added.

I see a few bugs here.
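The linked PRs track the fixes: openshift/etcd#56 improves the error handling in discover-etcd-initial-cluster, and cluster-etcd-operator#457 resyncs the clustermembercontroller every minute so the member-add check is re-evaluated even after the operator process restarts. Below is a minimal sketch of what a one-minute resync buys, not the operator's actual controller code (which is built on openshift/library-go's controller machinery); ensureMemberAdded is a hypothetical stand-in for the controller's sync logic.

```
// Sketch only: a periodic resync means the member-scaling check is re-run
// every minute, even if the operator restarted and no new event arrives to
// trigger it. ensureMemberAdded is a hypothetical stand-in for the
// clustermembercontroller sync logic.
package main

import (
	"context"
	"log"
	"time"
)

func runClusterMemberController(ctx context.Context, ensureMemberAdded func(context.Context) error) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		// Re-check on every tick: if a master's etcd static pod was scheduled
		// but its member was never added to the cluster, this sync gets another
		// chance to add it.
		if err := ensureMemberAdded(ctx); err != nil {
			log.Printf("clustermembercontroller sync failed: %v", err)
		}

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```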

Comment 6 ge liu 2020-10-10 08:52:15 UTC
Verified on 4.6.0-0.nightly-2020-10-08-210814 with regression testing; have not hit it again.

Comment 8 errata-xmlrpc 2020-10-27 16:46:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

