Bug 1883772 - 1/3 etcd peer connection failed with 10 tries and timeout
Summary: 1/3 etcd peer connection failed with 10 tries and timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1884011
 
Reported: 2020-09-30 08:16 UTC by weiwei jiang
Modified: 2020-10-27 16:47 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:46:56 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 457 0 None closed Bug 1883772: pkg/operator/clustermembercontroller: resync every minute 2021-01-08 06:00:39 UTC
Github openshift etcd pull 56 0 None closed Bug 1883772: discover-etcd-initial-cluster: improve error handling when we dont scale member 2021-01-08 06:00:39 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:47:09 UTC

Description weiwei jiang 2020-09-30 08:16:20 UTC
Description of problem:
One of the etcd members failed with the following issue and is not serving:
#### attempt 9
      member={name="etcd-bootstrap", peerURLs=[https://192.168.1.239:2380], clientURLs=[https://192.168.1.239:2379]}
      member={name="wj45uos929a-4qdsw-master-2", peerURLs=[https://192.168.1.29:2380], clientURLs=[https://192.168.1.29:2379]}
      member={name="wj45uos929a-4qdsw-master-1", peerURLs=[https://192.168.1.11:2380], clientURLs=[https://192.168.1.11:2379]}
      target=nil, err=<nil>
#### sleeping...
timed out

Version-Release number of selected component (if applicable):
4.5.0-0.nightly-2020-09-27-230429

How reproducible:
Not sure; the gathered info will be attached.

Steps to Reproduce:
1. Install an OCP cluster
2. Check whether the etcd cluster is running well
3.

Actual results:
1 of 3 etcd members is not serving

Expected results:
All 3 etcd members should be running well

Additional info:
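For reference, the "#### attempt N" / "#### sleeping..." / "timed out" output in the description comes from a bounded retry loop in discover-etcd-initial-cluster: the container waits for its own member to appear in the etcd member list and gives up after the 10 attempts mentioned in the summary. Below is a minimal Go sketch of that pattern, not the actual openshift/etcd code; the function name waitForMember and the timeout values are illustrative assumptions.

```
// Sketch only, not the actual discover-etcd-initial-cluster implementation.
// waitForMember and the timeout values are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

const maxAttempts = 10

// waitForMember polls the cluster until targetPeerURL appears in the member
// list, mirroring the "#### attempt N ... #### sleeping... timed out" output
// seen in the description.
func waitForMember(endpoints []string, targetPeerURL string) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		fmt.Printf("#### attempt %d\n", attempt)

		cli, err := clientv3.New(clientv3.Config{
			Endpoints:   endpoints,
			DialTimeout: 2 * time.Second,
		})
		if err == nil {
			ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			resp, merr := cli.MemberList(ctx)
			cancel()
			cli.Close()
			if merr == nil {
				for _, m := range resp.Members {
					for _, peerURL := range m.PeerURLs {
						if peerURL == targetPeerURL {
							// Our member was added; safe to start etcd.
							return nil
						}
					}
				}
			}
		}

		fmt.Println("#### sleeping...")
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("timed out waiting for member %s after %d attempts", targetPeerURL, maxAttempts)
}
```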

Comment 1 weiwei jiang 2020-09-30 08:18:05 UTC
Because the gathered info is a bit large, it is shared via the shared drive: http://file.apac.redhat.com/~wjiang/etcd_clock_difference.tar.gz

Comment 3 Sam Batschelet 2020-09-30 13:52:36 UTC
This is an interesting situation. When the cluster-etcd-operator attempts to create an etcd client, it expects to be able to connect to one of the endpoints, or it will error (context deadline exceeded).

So here the client being created cannot connect to https://192.168.1.121:2379, https://192.168.1.11:2379, or https://192.168.1.29:2379, and when it tries to connect to https://192.168.1.239:2379 (bootstrap) it gets a TLS auth error.

### operator unable to connect
```
https://192.168.1.121:2379 https://192.168.1.11:2379 https://192.168.1.29:2379 https://192.168.1.239:2379]: context deadline exceeded"
  W0929 07:50:26.160549       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.121:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.121:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.160863       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.11:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.11:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.164114       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.29:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 192.168.1.29:2379: connect: connection refused". Reconnecting...
  W0929 07:50:26.170630       1 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {https://192.168.1.239:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority". Reconnecting...
```
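For illustration, roughly what that client-creation path looks like against the etcd clientv3 API. This is a minimal sketch, not the operator's actual code; checkEndpoints and the 15-second timeouts are assumptions.

```
// Sketch only, not the cluster-etcd-operator's actual client code.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/clientv3"
)

// checkEndpoints builds a client against all endpoints and verifies that at
// least one of them is reachable. If none responds within the deadline, the
// call fails with "context deadline exceeded", matching the operator log above.
func checkEndpoints(endpoints []string, tlsConfig *tls.Config) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 15 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()
	if _, err := cli.MemberList(ctx); err != nil {
		return fmt.Errorf("unable to reach etcd at %v: %w", endpoints, err)
	}
	return nil
}
```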

### etcd bootstrap logging TLS error.
```
  2020-09-29 07:50:18.275973 I | embed: rejected connection from "192.168.1.121:40526" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:24.124276 I | embed: rejected connection from "192.168.1.121:40608" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:25.129634 I | embed: rejected connection from "192.168.1.121:40688" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:26.447765 I | embed: rejected connection from "192.168.1.121:40752" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:29.395488 I | embed: rejected connection from "192.168.1.121:40860" (error "remote error: tls: bad certificate", ServerName "")
  2020-09-29 07:50:33.302567 I | embed: rejected connection from "192.168.1.121:40934" (error "remote error: tls: bad certificate", ServerName "")
```
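Note that the two logs show the same handshake failure from both sides: the operator at 192.168.1.121 rejects the bootstrap member's serving certificate as signed by an unknown authority and sends a TLS alert, which the bootstrap etcd records as "remote error: tls: bad certificate". A minimal sketch of the client-side tls.Config involved is below; the file paths and function name are illustrative, not the operator's actual code.

```
// Sketch only: how a client-side tls.Config is typically assembled. If the CA
// bundle passed here does not include the signer of the bootstrap member's
// serving certificate, the handshake fails client-side with "x509: certificate
// signed by unknown authority" and is logged server-side as
// "remote error: tls: bad certificate". File paths are illustrative.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
)

func clientTLSConfig(caFile, certFile, keyFile string) (*tls.Config, error) {
	caPEM, err := ioutil.ReadFile(caFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates found in %s", caFile)
	}

	clientCert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}

	return &tls.Config{
		RootCAs:      pool,                          // CAs trusted for etcd serving certs
		Certificates: []tls.Certificate{clientCert}, // client cert presented to etcd
	}, nil
}
```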

The operator keeps trying but then the process dies after it loses its leader status.

>   W0929 07:50:37.860383       1 leaderelection.go:69] leader election lost

Then, when the operator starts back up, because the etcd static pod (which is failing) has already been scheduled, it seems to have forgotten that the member still needs to be added.

I see a few bugs here.
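The linked PRs track the fixes: openshift/etcd#56 improves the error handling in discover-etcd-initial-cluster, and cluster-etcd-operator#457 resyncs the clustermembercontroller every minute so the member-add check is re-evaluated even after the operator process restarts. Below is a minimal sketch of what a one-minute resync buys, not the operator's actual controller code (which is built on openshift/library-go's controller machinery); ensureMemberAdded is a hypothetical stand-in for the controller's sync logic.

```
// Sketch only: a periodic resync means the member-scaling check is re-run
// every minute, even if the operator restarted and no new event arrives to
// trigger it. ensureMemberAdded is a hypothetical stand-in for the
// clustermembercontroller sync logic.
package main

import (
	"context"
	"log"
	"time"
)

func runClusterMemberController(ctx context.Context, ensureMemberAdded func(context.Context) error) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()

	for {
		// Re-check on every tick: if a master's etcd static pod was scheduled
		// but its member was never added to the cluster, this sync gets another
		// chance to add it.
		if err := ensureMemberAdded(ctx); err != nil {
			log.Printf("clustermembercontroller sync failed: %v", err)
		}

		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}
```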

Comment 6 ge liu 2020-10-10 08:52:15 UTC
Verified on 4.6.0-0.nightly-2020-10-08-210814 with regression testing; have not hit it again.

Comment 8 errata-xmlrpc 2020-10-27 16:46:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

