Bug 1698456
| Summary: | [stability] Destroy 1 etcd member will result in ocp cluster is not accessible | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | ge liu <geliu> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | ge liu <geliu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 4.1.0 | CC: | ccoleman, gblomqui, jiajliu, mifiedle, schoudha, wking, wsun |
| Target Milestone: | --- | Keywords: | BetaBlocker, TestBlocker |
| Target Release: | 4.1.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-05-02 12:18:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | etcd log (attachment 1554381) | | |
Description (ge liu, 2019-04-10 12:05:01 UTC)
This is similar to bug 1679215, except that one is about node replacement, while this one is about continuing to function in a degraded state. etcd is still quorate with a single node down, so there is no reason, other than a client balancer issue, that the API server would not choose a new etcd member and work as expected, even if the etcd leader was killed off.

Expected workflow:

1. etcd election timeout
2. new leader is elected
3. client timeout
4. client balancer chooses a new etcd endpoint
5. resume

Ge Liu, can you retest, confirm that you killed the leader, and then confirm that a new leader is eventually elected?

Adding the TestBlocker flag. This blocks master/etcd HA and resiliency testing for the 4.1 system test run in progress.

Created attachment 1554381 [details]: etcd log
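The expected failover workflow can be sketched as a minimal, self-contained Go program. This is an illustration only, with hypothetical endpoint names; the real etcd clientv3 balancer drives endpoint selection off gRPC connectivity state, not a boolean map:

```go
package main

import (
	"errors"
	"fmt"
)

// cluster simulates member reachability; in the real etcd clientv3
// balancer the health signal comes from gRPC connectivity state.
type cluster struct {
	endpoints []string
	down      map[string]bool
}

// pick walks the endpoint list starting after the last-used index and
// returns the first reachable endpoint (step 4 of the workflow: the
// client balancer chooses a new etcd endpoint).
func (c *cluster) pick(last int) (int, string, error) {
	for i := 1; i <= len(c.endpoints); i++ {
		idx := (last + i) % len(c.endpoints)
		if !c.down[c.endpoints[idx]] {
			return idx, c.endpoints[idx], nil
		}
	}
	return -1, "", errors.New("no live endpoints")
}

func main() {
	c := &cluster{
		endpoints: []string{"etcd-0:2379", "etcd-1:2379", "etcd-2:2379"},
		down:      map[string]bool{},
	}

	idx, ep, _ := c.pick(-1)
	fmt.Println("using", ep)

	// Kill the current member (it may be the leader); 2 of 3 members
	// remain, so the cluster is still quorate and should keep serving.
	c.down[ep] = true
	if _, next, err := c.pick(idx); err == nil {
		fmt.Println("failed over to", next)
	}
}
```

The point of the sketch is that with 2 of 3 members alive, a working balancer always has a healthy endpoint to switch to, which is why the API server outage observed here points at the client balancer rather than at etcd quorum.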
Pasted the etcd log in the attachment.

The bug touches gRPC-go and etcd clientv3. This has not been verified, but it appears the balancer is broken per the issues below:

- https://github.com/etcd-io/etcd/issues/9949
- https://github.com/etcd-io/etcd/pull/10489
- https://github.com/kubernetes/kubernetes/issues/72102

*** Bug 1699456 has been marked as a duplicate of this bug. ***

Logging example from #1699456, which has been marked as a duplicate:

```
2019-04-12T18:25:18.046585457+00:00 stderr F I0412 18:25:18.046567 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069030 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.069595475+00:00 stderr F W0412 18:25:18.069360 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.mffiedler-132.qe.devcluster.openshift.com, not etcd-0.mffiedler-132.qe.devcluster.openshift.com". Reconnecting...
2019-04-12T18:25:18.073724663+00:00 stderr F I0412 18:25:18.073688 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.mffiedler-132.qe.devcluster.openshift.com:2379 <nil>}]
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073729 1 clientconn.go:953] Failed to dial etcd-2.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.073782523+00:00 stderr F W0412 18:25:18.073740 1 clientconn.go:953] Failed to dial etcd-1.mffiedler-132.qe.devcluster.openshift.com:2379: context canceled; please retry.
2019-04-12T18:25:18.082036906+00:00 stderr F I0412 18:25:18.082004 1 clientconn.go:551] parsed scheme: ""
2019-04-12T18:25:18.082117937+00:00 stderr F I0412 18:25:18.082105 1 clientconn.go:557] scheme "" not registered, fallback to default scheme
```

Hit the same issue as #1699456 when setting up UPI on VMware:

```
time="2019-04-23T05:28:34Z" level=info msg="Waiting up to 30m0s for the Kubernetes API at https://api.jliu-demo.qe.devcluster.openshift.com:6443..."
time="2019-04-23T05:28:35Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:29:08Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:29:40Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:30:13Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:30:46Z" level=debug msg="Still waiting for the Kubernetes API: the server could not find the requested resource"
time="2019-04-23T05:31:12Z" level=debug msg="Still waiting for the Kubernetes API: Get https://api.jliu-demo.qe.devcluster.openshift.com:6443/version?timeout=32s: dial tcp 139.178.76.6:6443: connect: connection refused"
time="2019-04-23T05:31:35Z" level=info msg="API v1.13.4+81fc896 up"
time="2019-04-23T05:31:35Z" level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
time="2019-04-23T06:01:35Z" level=info msg="Use the following commands to gather logs from the cluster"
time="2019-04-23T06:01:35Z" level=info msg="openshift-install gather bootstrap --help"
time="2019-04-23T06:01:35Z" level=fatal msg="failed to wait for bootstrapping to complete: timed out waiting for the condition"
```

Checking the openshift-kube-apiserver logs:

```
I0423 06:53:53.612060 1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}]
I0423 06:53:53.612145 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.654818 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.660209 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.660243 1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
I0423 06:53:53.686866 1 master.go:228] Using reconciler: lease
I0423 06:53:53.687309 1 clientconn.go:551] parsed scheme: ""
I0423 06:53:53.687324 1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0423 06:53:53.687354 1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}]
I0423 06:53:53.687387 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.688678 1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
W0423 06:53:53.693322 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
W0423 06:53:53.693536 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.733572 1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.733612 1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
W0423 06:53:53.733621 1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
```

Working on payload 4.1.0-0.nightly-2019-04-22-005054.

Verified with 4.1.0-0.nightly-2019-04-29-235412: stopped an etcd member or the leader node; that etcd pod fails, but the remaining 2 etcd nodes keep working.

Still hitting the issue from comment 14 on 4.1.0-0.nightly-2019-04-29-235412, so bug #1699456 will be re-opened for installation to track it.

*** Bug 1699456 has been marked as a duplicate of this bug. ***