Bug 1698456

Summary: [stability] Destroying 1 etcd member makes the OCP cluster inaccessible
Product: OpenShift Container Platform
Reporter: ge liu <geliu>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED CURRENTRELEASE
QA Contact: ge liu <geliu>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.1.0
CC: ccoleman, gblomqui, jiajliu, mifiedle, schoudha, wking, wsun
Target Milestone: ---
Keywords: BetaBlocker, TestBlocker
Target Release: 4.1.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-05-02 12:18:06 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: etcd log (Flags: none)

Description ge liu 2019-04-10 12:05:01 UTC
Description of problem:

Made a typo in etcd-member.yaml to destroy 1 etcd member. After doing so, all 'oc' commands fail with this error:

The connection to the server api.geliu.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?

After correcting the typo in etcd-member.yaml, all 'oc' commands work again.



How reproducible:
Always

Version: 4.0.0-0.ci-2019-04-10-033454

Steps to Reproduce:
1. Log in to a master node and modify etcd-member.yaml:

        --listen-peer-urls=https://0.0.0.0:2380 \
        --listen-metrics-urls                   <== typo introduced here (flag split before the "=")
=https://0.0.0.0:9978 \

2. $ oc get pods
The connection to the server api.geliu.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
$ oc get project
The connection to the server api.geliu.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?

3. Correct the typo in etcd-member.yaml:

 --listen-peer-urls=https://0.0.0.0:2380 \
 --listen-metrics-urls=https://0.0.0.0:9978 \

4. All 'oc' commands work again.

Actual results:
As described in the summary: after destroying 1 etcd member, the cluster API is not accessible.
Expected results:
The OCP cluster should keep working when 1 of 3 etcd members is destroyed.

Comment 2 W. Trevor King 2019-04-10 12:26:33 UTC
This is similar to bug 1679215, except that bug is about node replacement, while this one is about continuing to function in a degraded state.

Comment 3 Sam Batschelet 2019-04-10 12:32:46 UTC
etcd is still quorate with a single node down; there is no reason, other than a client balancer issue, that the API server would not choose a new etcd member and work as expected, even if the etcd leader was killed off.


# Expected workflow:

1.) etcd election timeout
2.) new leader is elected
3.) client timeout
4.) client balancer chooses a new etcd endpoint
5.) resume
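
To make that workflow concrete, here is a minimal sketch (not the apiserver's actual wiring; the endpoint names and key are placeholders, and client TLS is omitted) of an etcd clientv3 client pointed at all three members. With quorum intact, a request that hits the dead endpoint should time out and be retried against a healthy member:

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    // All three member endpoints; one of them is assumed to be down.
    cli, err := clientv3.New(clientv3.Config{
        Endpoints: []string{
            "https://etcd-0.example.com:2379",
            "https://etcd-1.example.com:2379",
            "https://etcd-2.example.com:2379",
        },
        DialTimeout: 5 * time.Second,
        // TLS: &tls.Config{...}, // client certificates omitted in this sketch
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    // Steps 3-5 of the workflow above: the call times out on the dead member,
    // the balancer moves to another endpoint, and the read resumes.
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    resp, err := cli.Get(ctx, "/sanity-check")
    if err != nil {
        fmt.Println("read failed despite quorum:", err)
        return
    }
    fmt.Println("read served by a healthy member, revision:", resp.Header.Revision)
}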

Comment 4 Greg Blomquist 2019-04-10 12:57:11 UTC
Ge Liu, can you retest, confirm that you killed the leader, then confirm that a new leader is eventually selected?
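
One way to check that (a sketch only, not part of the original report; endpoint names are placeholders and client TLS is omitted) is to query Status on each endpoint and compare each member's ID against the leader ID it reports:

package main

import (
    "context"
    "fmt"
    "time"

    "go.etcd.io/etcd/clientv3"
)

func main() {
    endpoints := []string{
        "https://etcd-0.example.com:2379",
        "https://etcd-1.example.com:2379",
        "https://etcd-2.example.com:2379",
    }
    cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    for _, ep := range endpoints {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        st, err := cli.Status(ctx, ep)
        cancel()
        if err != nil {
            // The destroyed member should show up here as unreachable.
            fmt.Printf("%s: unreachable (%v)\n", ep, err)
            continue
        }
        fmt.Printf("%s: member=%x leader=%x isLeader=%v\n",
            ep, st.Header.MemberId, st.Leader, st.Header.MemberId == st.Leader)
    }
}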

Comment 5 Mike Fiedler 2019-04-11 00:04:31 UTC
Adding the TestBlocker flag.   This blocks master/etcd HA and resiliency testing for the 4.1 system test run in progress.

Comment 6 Mike Fiedler 2019-04-11 00:05:42 UTC
Disregard comment 5 - wrong BZ. Apologies.

Comment 7 ge liu 2019-04-11 03:33:39 UTC
Created attachment 1554381 [details]
etcd log

Comment 9 ge liu 2019-04-11 04:33:42 UTC
Attached the etcd log.

Comment 11 Sam Batschelet 2019-04-11 11:36:00 UTC
The bug touches gRPC-go and etcd clientv3. I have not verified this yet, but it looks like the balancer is broken, per the issues below.

- https://github.com/etcd-io/etcd/issues/9949
- https://github.com/etcd-io/etcd/pull/10489
- https://github.com/kubernetes/kubernetes/issues/72102

Comment 12 Sam Batschelet 2019-04-22 14:47:12 UTC
*** Bug 1699456 has been marked as a duplicate of this bug. ***

Comment 13 Sam Batschelet 2019-04-22 14:48:38 UTC
Logging example from bug 1699456, which has been marked as a duplicate.

2019-04-12T18:25:18.046585457+00:00 stderr F I0412 18:25:18.046567       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069030       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069360       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.073724663+00:00 stderr F I0412 18:25:18.073688       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073729       1 clientconn.go:953] Failed to dial etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073740       1 clientconn.go:953] Failed to dial etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.082036906+00:00 stderr F I0412 18:25:18.082004       1 clientconn.go:551] parsed scheme: ""
2019-04-12T18:25:18.082117937+00:00 stderr F I0412 18:25:18.082105       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
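
The handshake errors above show the client dialing etcd-1 and etcd-2 while validating their serving certificates against etcd-0's name, which is consistent with the balancer reusing the first endpoint's server name for every connection. A minimal Go sketch of that failure mode (host names are placeholders; the real code path goes through the gRPC balancer rather than a raw TLS dial):

package main

import (
    "crypto/tls"
    "fmt"
)

func main() {
    // Dial etcd-1 but verify the presented certificate against etcd-0's name,
    // mirroring the mismatch seen in the log above.
    conn, err := tls.Dial("tcp", "etcd-1.example.com:2379", &tls.Config{
        ServerName: "etcd-0.example.com",
    })
    if err != nil {
        // Expected: x509: certificate is valid for etcd-1.example.com, ..., not etcd-0.example.com
        fmt.Println("handshake failed:", err)
        return
    }
    conn.Close()
}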

Comment 14 liujia 2019-04-23 07:52:58 UTC
Hit the same issue as bug 1699456 when setting up UPI on VMware.
time="2019-04-23T05:28:34Z" level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.jliu-demo.qe.devcluster.openshift.com:6443..."
time="2019-04-23T05:28:35Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:29:08Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:29:40Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:30:13Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:30:46Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:31:12Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.jliu-demo.qe.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 139.178.76.6:6443: connect: connection refused"
time="2019-04-23T05:31:35Z" level=info msg="API v1.13.4+81fc896 up"
time="2019-04-23T05:31:35Z" level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
time="2019-04-23T06:01:35Z" level=info msg="Use the following commands to gather logs from the cluster"
time="2019-04-23T06:01:35Z" level=info msg="openshift-install gather bootstrap --help"
time="2019-04-23T06:01:35Z" level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"

Checking the openshift-kube-apiserver logs:
I0423 06:53:53.612060       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}]
I0423 06:53:53.612145       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.654818       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.660209       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.660243       1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
I0423 06:53:53.686866       1 master.go:228] Using reconciler: lease
I0423 06:53:53.687309       1 clientconn.go:551] parsed scheme: ""
I0423 06:53:53.687324       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0423 06:53:53.687354       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}]
I0423 06:53:53.687387       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.688678       1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
W0423 06:53:53.693322       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
W0423 06:53:53.693536       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.733572       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.733612       1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
W0423 06:53:53.733621       1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.

Comment 15 liujia 2019-04-23 07:53:50 UTC
The above was on payload 4.1.0-0.nightly-2019-04-22-005054.

Comment 20 ge liu 2019-04-30 07:39:16 UTC
Verified with 4.1.0-0.nightly-2019-04-29-235412. Introduced the typo on a regular etcd member and on the leader node; in each case that etcd pod fails, but the remaining 2 etcd members keep working.

Comment 21 liujia 2019-04-30 09:48:54 UTC
Still hit the issue from comment 14 on 4.1.0-0.nightly-2019-04-29-235412.

I will re-open bug 1699456 so the installation case can be tracked there.

Comment 22 Eric Paris 2019-05-02 13:31:17 UTC
*** Bug 1699456 has been marked as a duplicate of this bug. ***