Created attachment 1593769 [details]
oc describe pod etcd-member-ip-10-0-137-127.us-east-2.compute.internal

Description of problem:
During an upgrade of 4.1.3 -> 4.1.7, the etcd operator failed to upgrade and left etcd in a bad place.

oc get pods -n openshift-etcd
NAME                                                     READY   STATUS             RESTARTS   AGE
etcd-member-ip-10-0-137-127.us-east-2.compute.internal   0/2     CrashLoopBackOff   509        19h
etcd-member-ip-10-0-144-157.us-east-2.compute.internal   2/2     Running            2          21h
etcd-member-ip-10-0-165-149.us-east-2.compute.internal   2/2     Running            2          21h

Please see the attachment for the oc describe pod output for the failing etcd member.

Version-Release number of selected component (if applicable):
4.1.3 -> 4.1.7 upgrade

How reproducible:
Unsure; hit it once while upgrading about 15 clusters.

Steps to Reproduce:
1. Be on 4.1.3.
2. Run oc adm upgrade (to 4.1.7), as sketched below.
3. After about an hour, notice the cluster is in a bad state.

Expected results:
The cluster is upgraded with etcd intact.

Additional info:
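For reference, a minimal sketch of the upgrade trigger and the follow-up checks; the --to value is an assumption based on the target version above, the steps only say "oc adm upgrade":

  # trigger the upgrade to the target release
  $ oc adm upgrade --to=4.1.7

  # watch overall progress and the etcd member pods
  $ oc get clusterversion
  $ oc get pods -n openshift-etcd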
> Jul 25 19:21:36 ip-10-0-137-127 hyperkube[849]: 2019-06-27 16:45:30.850958 E | rafthttp: failed to read 2a8a0609d54f71a1 on stream MsgApp v2 (context canceled)

The underlying issue is a bug in the gRPC-go[1] version used by etcd.

[1] https://github.com/etcd-io/etcd/issues/9956
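For anyone triaging a similar cluster, a hedged sketch for spotting the same symptom in a member's logs (the pod name is taken from the listing above; adjust for your cluster):

  $ oc logs -n openshift-etcd etcd-member-ip-10-0-137-127.us-east-2.compute.internal \
      -c etcd-member | grep -E 'rafthttp|context canceled'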
I am not sure if this fix can be backported to 3.3.x, but it will be part of 3.4.
mbarrett, we probably should increase the priority/severity on this. During a 4.1.3 -> 4.1.7 upgrade, an etcd node got into a bad state and had to be manually recovered from the remaining two healthy etcd nodes.
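For the record, the manual recovery followed the usual etcdctl member-replacement flow. This is a rough sketch, not the exact commands we ran; the member ID, endpoint, node name, and cert paths are all placeholders:

  # from a healthy member, identify and remove the broken member
  $ ETCDCTL_API=3 etcdctl --endpoints=https://<healthy-member>:2379 \
      --cacert=/etc/ssl/etcd/ca.crt --cert=/path/peer.crt --key=/path/peer.key \
      member list
  $ ETCDCTL_API=3 etcdctl --endpoints=https://<healthy-member>:2379 \
      --cacert=/etc/ssl/etcd/ca.crt --cert=/path/peer.crt --key=/path/peer.key \
      member remove <BROKEN_MEMBER_ID>

  # re-add the member so it rejoins with a clean data dir
  $ ETCDCTL_API=3 etcdctl --endpoints=https://<healthy-member>:2379 \
      --cacert=/etc/ssl/etcd/ca.crt --cert=/path/peer.crt --key=/path/peer.key \
      member add etcd-member-<node> --peer-urls=https://<node>:2380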
I also ran into this issue with a brand new 4.1.4 cluster after upgrading to 4.1.7 and then to 4.1.8:

NAME                                        READY   STATUS             RESTARTS   AGE
etcd-member-bedrock4-master1.ccs.ornl.gov   2/2     Running            0          29m
etcd-member-bedrock4-master2.ccs.ornl.gov   2/2     Running            0          170m
etcd-member-bedrock4-master3.ccs.ornl.gov   1/2     CrashLoopBackOff   14         20m

I agree that the priority should be bumped; this is a major issue.
The etcd team is working on this problem, but it is not trivial. We have a new client balancer in 3.4 that we can backport to 3.3, but we will need weeks of soak testing. I am trying to get the old balancer working against the latest grpc-go, but with so many API changes I am afraid this will not be possible with the old code. @Jason, it would be helpful to understand why etcd is not starting; we cannot assume this is the same issue. Can you also provide logs?
Absolutely, here are the must-gather logs for the openshift-etcd namespace. I had to pull the master3 etcd logs manually, since it looks like must-gather skips a pod that is in CrashLoopBackOff (that seems like a bug in itself).
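For reference, roughly how I pulled the crash-looping container's logs by hand (pod and container names taken from the listing above):

  # must-gather for everything it does collect
  $ oc adm must-gather

  # logs from the crash-looping container, including the previous attempt
  $ oc logs -n openshift-etcd etcd-member-bedrock4-master3.ccs.ornl.gov -c etcd-metrics
  $ oc logs -n openshift-etcd etcd-member-bedrock4-master3.ccs.ornl.gov -c etcd-metrics --previous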
Created attachment 1600021 [details] must-gather logs for etcd cluster with one node down after upgrade
@Jason, after reviewing the logs you provided, it appears the etcd-metrics container is crashing because of a TLS issue, possibly invalid TLS certificates. It does not look like the upstream grpc-go balancer issue. Here are the log snippets that show the problem.

Errors from member3 (etcd-member container) pod logs:

2019-08-02 13:19:09.079780 I | embed: rejected connection from "160.91.205.23:38140" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.102355 I | embed: rejected connection from "160.91.205.23:38142" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.151596 I | embed: rejected connection from "160.91.205.23:38146" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.151670 I | embed: rejected connection from "160.91.205.23:38144" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.158155 I | embed: rejected connection from "160.91.205.23:38150" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.171770 I | embed: rejected connection from "160.91.205.21:51544" (error "remote error: tls: bad certificate", ServerName "ETCD-2.BEDROCK4.CCS.ORNL.GOV")

Errors from member3 (etcd-metrics container):

2019-08-02 13:15:20.286646 I | etcdmain: ServerTLS: cert = /etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt, key = /etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.key, ca = /etc/ssl/etcd/metric-ca.crt, trusted-ca = , client-cert-auth = false, crl-file =
2019-08-02 13:15:20.287204 I | etcdmain: listening for grpc-proxy client requests on 127.0.0.1:9977
2019-08-02 13:15:20.287580 I | etcdmain: ClientTLS: cert = /etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt, key = /etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.key, ca = /etc/ssl/etcd/ca.crt, trusted-ca = , client-cert-auth = false, crl-file =
WARNING: 2019/08/02 13:15:20 Failed to dial ETCD-2.BEDROCK4.CCS.ORNL.GOV:9978: connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority"; please retry.
WARNING: 2019/08/02 13:15:25 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp: operation was canceled"; Reconnecting to {ETCD-2.BEDROCK4.CCS.ORNL.GOV:9978 0 <nil>}
WARNING: 2019/08/02 13:15:25 Failed to dial ETCD-2.BEDROCK4.CCS.ORNL.GOV:9978: grpc: the connection is closing; please retry.
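To confirm the diagnosis on the affected master, a hedged sketch using the cert paths from the log above (run on the node itself; the colons in the paths need quoting):

  # check who signed the metrics serving cert and whether it is expired
  $ openssl x509 -in "/etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt" \
      -noout -subject -issuer -dates

  # verify the cert chains to the metric CA that etcd is configured with
  $ openssl verify -CAfile /etc/ssl/etcd/metric-ca.crt \
      "/etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt"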
context deadline exceeded

Status of the containers from the pod YAML:

containerStatuses:
- containerID: cri-o://aa66e438743902a2381686b511f1f9b6261be75d7970cf03f0300911eb15cf3d
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7174d5682d63fe637bd0fcc8ca99a047a4340a220c6eb3233418ce3ca326ef3a
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7174d5682d63fe637bd0fcc8ca99a047a4340a220c6eb3233418ce3ca326ef3a
  lastState: {}
  name: etcd-member
  ready: true
  restartCount: 0
  state:
    running:
      startedAt: 2019-08-01T19:02:12Z
- containerID: cri-o://4b6660a3b78ff17c2cb5a20c1cda186e199bd34d0dee8f8fd95e2ea7a691d47c
  image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7174d5682d63fe637bd0fcc8ca99a047a4340a220c6eb3233418ce3ca326ef3a
  imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7174d5682d63fe637bd0fcc8ca99a047a4340a220c6eb3233418ce3ca326ef3a
  lastState:
    terminated:
      containerID: cri-o://4b6660a3b78ff17c2cb5a20c1cda186e199bd34d0dee8f8fd95e2ea7a691d47c
      exitCode: 1
      finishedAt: 2019-08-02T13:10:13Z
      message: "2019-08-02 13:10:08.264977 I | etcdmain: ServerTLS: cert = /etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt, key = /etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.key, ca = /etc/ssl/etcd/metric-ca.crt, trusted-ca = , client-cert-auth = false, crl-file = \n2019-08-02 13:10:08.265563 I | etcdmain: listening for grpc-proxy client requests on 127.0.0.1:9977\n2019-08-02 13:10:08.265911 I | etcdmain: ClientTLS: cert = /etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt, key = /etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.key, ca = /etc/ssl/etcd/ca.crt, trusted-ca = , client-cert-auth = false, crl-file = \nWARNING: 2019/08/02 13:10:08 Failed to dial ETCD-2.BEDROCK4.CCS.ORNL.GOV:9978: connection error: desc = \"transport: authentication handshake failed: x509: certificate signed by unknown authority\"; please retry.\ncontext deadline exceeded\n"
      reason: Error
      startedAt: 2019-08-02T13:10:08Z

Please feel free to open a new bug on this and we can review the core reason for the failures.
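As an aside, the terminated-container message can be pulled directly without dumping the full YAML. A sketch, assuming the crashing container is etcd-metrics (per the cert paths in the log) and using the pod name from the earlier listing:

  $ oc get pod -n openshift-etcd etcd-member-bedrock4-master3.ccs.ornl.gov \
      -o jsonpath='{.status.containerStatuses[?(@.name=="etcd-metrics")].lastState.terminated.message}'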
etcd-3.3.15 landed, moving this to MODIFIED.
etcd-3.3.17 landed, and I have not triggered this issue since. I still have some concern about it because I also haven't been able to reproduce it in OCP 4.3 recently. Anyway, closing it; reopen if it appears later.
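For anyone re-verifying, a sketch for checking which etcd build is running inside a member pod (the pod name is a placeholder):

  $ oc exec -n openshift-etcd etcd-member-<node> -c etcd-member -- etcd --version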
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062