Description of problem:
Made a typo in etcd-member.yaml to destroy 1 etcd member. After that, every 'oc' command fails with:
  The connection to the server api.geliu.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
After correcting the typo in etcd-member.yaml, 'oc' commands work again.

Version:
4.0.0-0.ci-2019-04-10-033454

How reproducible:
Always

Steps to Reproduce:
1. Log in to a master node and modify etcd-member.yaml, introducing a typo into the --listen-metrics-urls flag:
     --listen-peer-urls=https://0.0.0.0:2380 \
     --listen-metrics-urls=https://0.0.0.0:9978 \   <== create the typo in this flag
2. Run oc commands:
     $ oc get pods
     The connection to the server api.geliu.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
     $ oc get project
     The connection to the server api.geliu.qe.devcluster.openshift.com:6443 was refused - did you specify the right host or port?
3. Correct the typo in etcd-member.yaml:
     --listen-peer-urls=https://0.0.0.0:2380 \
     --listen-metrics-urls=https://0.0.0.0:9978 \
4. All oc commands work again.

Actual results:
With a single etcd member broken, all 'oc' commands fail with "connection refused".

Expected results:
The OCP cluster should keep working when 1 of 3 etcd members is destroyed.
This is similar to bug 1679215, except that one is about node replacement, while this one is about continuing to function in a degraded state.
etcd is still quorate with a single node down, so there is no reason, other than a client balancer issue, that the API server would not choose a new etcd member and work as expected, even if the etcd leader was killed off.

Expected workflow:
1.) etcd election timeout
2.) new leader is elected
3.) client timeout
4.) client balancer chooses a new etcd endpoint
5.) resume
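To make the expectation concrete, below is a minimal sketch (not the actual API server wiring) of an etcd clientv3 consumer configured with all three members: with one member down but quorum intact, a read issued through the shared client should still succeed once the balancer picks a live endpoint. Endpoint names are placeholders and the TLS setup is abbreviated; the import path assumes a 3.x-era clientv3.

package main

import (
	"context"
	"crypto/tls"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Placeholder endpoints; in the cluster these are the three member URLs
	// the API server is configured with.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://etcd-0.example.com:2379",
			"https://etcd-1.example.com:2379",
			"https://etcd-2.example.com:2379",
		},
		DialTimeout: 5 * time.Second,
		// Real usage loads the etcd CA bundle and a client certificate here.
		TLS: &tls.Config{},
	})
	if err != nil {
		log.Fatalf("failed to create etcd client: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// With one member down but the cluster still quorate, a working balancer
	// should route this read to a healthy endpoint (steps 3-5 above) rather
	// than failing outright.
	if _, err := cli.Get(ctx, "health-probe"); err != nil {
		log.Fatalf("read failed despite a quorate cluster: %v", err)
	}
	log.Println("client balancer reached a healthy member")
}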
Ge Liu, can you retest, confirm that you killed the leader, then confirm that a new leader is eventually selected?
Adding the TestBlocker flag. This blocks master/etcd HA and resiliency testing for the 4.1 system test run in progress.
Disregard comment 5 - wrong bz. Apologies
Created attachment 1554381 [details] etcd log
The etcd log is in the attachment above.
The bug touches gRPC-go and etcd clientv3. Not yet verified, but it looks like the client balancer is broken, per the issues below:
- https://github.com/etcd-io/etcd/issues/9949
- https://github.com/etcd-io/etcd/pull/10489
- https://github.com/kubernetes/kubernetes/issues/72102
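One way to separate "the cluster lost quorum" from "the shared client balancer is stuck on a dead endpoint" is to probe each member individually with the clientv3 maintenance API, which takes an explicit endpoint. A rough sketch only, with placeholder hostnames and abbreviated TLS setup:

package main

import (
	"context"
	"crypto/tls"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Placeholders for the real member URLs.
	endpoints := []string{
		"https://etcd-0.example.com:2379",
		"https://etcd-1.example.com:2379",
		"https://etcd-2.example.com:2379",
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
		TLS:         &tls.Config{}, // real usage loads the etcd CA and client cert
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Status is issued per endpoint, so a quorate cluster should report a
	// leader here even while the shared KV client appears wedged.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			log.Printf("%s: unreachable: %v", ep, err)
			continue
		}
		log.Printf("%s: leader=%x raftTerm=%d", ep, st.Leader, st.RaftTerm)
	}
}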
*** Bug 1699456 has been marked as a duplicate of this bug. ***
Logging example from #1699456, which has been marked as a duplicate:

2019-04-12T18:25:18.046585457+00:00 stderr F I0412 18:25:18.046567 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069030 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069360 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.073724663+00:00 stderr F I0412 18:25:18.073688 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073729 1 clientconn.go:953] Failed to dial etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073740 1 clientconn.go:953] Failed to dial etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.082036906+00:00 stderr F I0412 18:25:18.082004 1 clientconn.go:551] parsed scheme: ""
2019-04-12T18:25:18.082117937+00:00 stderr F I0412 18:25:18.082105 1 clientconn.go:557] scheme "" not registered, fallback to default scheme
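The x509 errors above look consistent with the dialer re-using the first endpoint's hostname for TLS verification while actually connecting to a different member. A standalone sketch of that mechanism, using placeholder hostnames rather than the real cluster:

package main

import (
	"crypto/tls"
	"log"
)

func main() {
	// Pinning ServerName to the first member while dialing a different one
	// reproduces the same class of error seen in the log:
	// "x509: certificate is valid for ... etcd-1..., not etcd-0...".
	cfg := &tls.Config{ServerName: "etcd-0.example.com"}

	conn, err := tls.Dial("tcp", "etcd-1.example.com:2379", cfg)
	if err != nil {
		log.Printf("handshake failed as expected: %v", err)
		return
	}
	defer conn.Close()
	log.Println("handshake unexpectedly succeeded")
}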
Hit the same issue as #1699456 when setting up UPI on VMware.

time="2019-04-23T05:28:34Z" level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.jliu-demo.qe.devcluster.openshift.com:6443..."
time="2019-04-23T05:28:35Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:29:08Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:29:40Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:30:13Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:30:46Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:31:12Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.jliu-demo.qe.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 139.178.76.6:6443: connect: connection refused"
time="2019-04-23T05:31:35Z" level=info msg="API v1.13.4+81fc896 up"
time="2019-04-23T05:31:35Z" level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
time="2019-04-23T06:01:35Z" level=info msg="Use the following commands to gather logs from the cluster"
time="2019-04-23T06:01:35Z" level=info msg="openshift-install gather bootstrap --help"
time="2019-04-23T06:01:35Z" level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"

Checked the openshift-kube-apiserver logs:

I0423 06:53:53.612060 1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}]
I0423 06:53:53.612145 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.654818 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.660209 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.660243 1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
I0423 06:53:53.686866 1 master.go:228] Using reconciler: lease
I0423 06:53:53.687309 1 clientconn.go:551] parsed scheme: ""
I0423 06:53:53.687324 1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0423 06:53:53.687354 1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}]
I0423 06:53:53.687387 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.688678 1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
W0423 06:53:53.693322 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
W0423 06:53:53.693536 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.733572 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.733612 1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
W0423 06:53:53.733621 1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
working on payload 4.1.0-0.nightly-2019-04-22-005054
Verified with 4.1.0-0.nightly-2019-04-29-235412: introduced the typo on an etcd member node and on the leader node. The affected etcd pod fails, but the remaining 2 etcd members keep working well.
Still hit the issue from comment 14 on 4.1.0-0.nightly-2019-04-29-235412, so bug #1699456 will be re-opened to track it on the installation side.