Bug 1733594 - etcd failed during upgrade and left etcd in bad state
Summary: etcd failed during upgrade and left etcd in bad state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-26 17:04 UTC by Matt Woodson
Modified: 2020-01-23 11:05 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Cause: Under certain circumstances gRPC would panic due to a send on a closed channel. The result could be a catastrophic member failure requiring disaster recovery remediation. Fix: etcd v3.3.17 included a bump to gRPC-go v1.23, which included fixes resolving the potential failure.
Clone Of:
Environment:
Last Closed: 2020-01-23 11:04:29 UTC
Target Upstream Version:
Embargoed:


Attachments
oc describe pod etcd-member-ip-10-0-137-127.us-east-2.compute.internal (12.85 KB, text/plain), 2019-07-26 17:04 UTC, Matt Woodson
must-gather logs for etcd cluster with one node down after upgrade (736.06 KB, application/gzip), 2019-08-02 13:26 UTC, Jason Kincl


Links
Red Hat Product Errata RHBA-2020:0062, last updated 2020-01-23 11:05:06 UTC

Description Matt Woodson 2019-07-26 17:04:28 UTC
Created attachment 1593769 [details]
oc describe pod etcd-member-ip-10-0-137-127.us-east-2.compute.internal

Description of problem:

During an upgrade from 4.1.3 to 4.1.7, the etcd operator failed to upgrade and left etcd in a bad state.


 oc get pods -n openshift-etcd
NAME                                                     READY   STATUS             RESTARTS   AGE
etcd-member-ip-10-0-137-127.us-east-2.compute.internal   0/2     CrashLoopBackOff   509        19h
etcd-member-ip-10-0-144-157.us-east-2.compute.internal   2/2     Running            2          21h
etcd-member-ip-10-0-165-149.us-east-2.compute.internal   2/2     Running            2          21h


Please see the attachment for the oc describe pod etcd_... output.



Version-Release number of selected component (if applicable):
4.1.3 -> 4.1.7 upgrade


How reproducible:
Unsure; hit it once while upgrading about 15 clusters.

Steps to Reproduce:
1. Be on 4.1.3.
2. Run oc adm upgrade to 4.1.7 (see the sketch below).
3. After about an hour, notice the cluster is in a bad state.
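
For reference, a minimal sketch of step 2; the exact flags used for this upgrade are not recorded in this report, so the version argument is illustrative:

  # show current version and available updates, then request the 4.1.7 update
  oc adm upgrade
  oc adm upgrade --to=4.1.7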




Expected results:
The cluster to be upgraded with etcd intact.



Additional info:

Comment 1 Sam Batschelet 2019-07-26 19:51:26 UTC
> Jul 25 19:21:36 ip-10-0-137-127 hyperkube[849]: 2019-06-27 16:45:30.850958 E | rafthttp: failed to read 2a8a0609d54f71a1 on stream MsgApp v2 (context canceled)

The underlying issue is a bug in the gRPC-go[1] version used by etcd.

[1] https://github.com/etcd-io/etcd/issues/9956
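
For anyone triaging a similar crash-looping member, one way to check for this failure mode is to pull the previous run of the etcd-member container and look for the rafthttp/gRPC errors quoted above (a sketch; the pod name is the one from this report):

  # inspect the last crashed run of the etcd-member container for rafthttp/gRPC errors
  oc logs -n openshift-etcd etcd-member-ip-10-0-137-127.us-east-2.compute.internal \
    -c etcd-member --previous | grep -iE 'rafthttp|grpc|panic'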

Comment 2 Sam Batschelet 2019-07-26 19:52:31 UTC
I am not sure if this fix can be backported to 3.3.x, but it will be part of 3.4.

Comment 3 Jacob Lucky 2019-07-29 15:51:00 UTC
mbarrett, we should probably increase the priority/severity on this. During a 4.1.3 -> 4.1.7 upgrade, an etcd node got into a bad state and had to be manually recovered from the remaining two healthy etcd nodes.
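
For context, this kind of manual recovery generally follows the standard etcd member-replacement flow. A rough sketch is below, run against a surviving member; the endpoint and certificate arguments are placeholders (the real paths on these hosts follow the /etc/ssl/etcd/... pattern visible in the logs later in this bug), not the exact procedure used here:

  # from a shell on (or inside the etcd-member container of) a healthy member
  export ETCDCTL_API=3
  etcdctl --endpoints=https://<healthy-member-ip>:2379 \
    --cacert=/etc/ssl/etcd/ca.crt --cert=<client-cert> --key=<client-key> \
    member list
  # remove the failed member by ID, then re-add it so it can rejoin with a clean data dir
  etcdctl --endpoints=https://<healthy-member-ip>:2379 \
    --cacert=/etc/ssl/etcd/ca.crt --cert=<client-cert> --key=<client-key> \
    member remove <failed-member-id>
  etcdctl --endpoints=https://<healthy-member-ip>:2379 \
    --cacert=/etc/ssl/etcd/ca.crt --cert=<client-cert> --key=<client-key> \
    member add <member-name> --peer-urls=https://<failed-member-ip>:2380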

Comment 8 Jason Kincl 2019-08-01 19:51:29 UTC
I also ran into this issue with a brand new 4.1.4 cluster upgrading to 4.1.7 and then to 4.1.8.

NAME                                        READY   STATUS             RESTARTS   AGE
etcd-member-bedrock4-master1.ccs.ornl.gov   2/2     Running            0          29m
etcd-member-bedrock4-master2.ccs.ornl.gov   2/2     Running            0          170m
etcd-member-bedrock4-master3.ccs.ornl.gov   1/2     CrashLoopBackOff   14         20m

I agree that the priority should be bumped; this is a major issue.

Comment 10 Sam Batschelet 2019-08-02 12:46:08 UTC
The etcd team is working on this problem, but it is not trivial. We have a new client balancer in 3.4 that we can backport to 3.3, but we will need weeks for soak testing. I am trying to get the old balancer working with the latest gRPC-go, but with the many API changes I am afraid this will not be possible with the old code.

@Jason it would be helpful to understand why etcd is not starting; we cannot assume this is the same issue. Can you also provide logs?

Comment 11 Jason Kincl 2019-08-02 13:26:03 UTC
Absolutely, here are the must-gather logs for the openshift-etcd namespace. I had to manually pull the master3 etcd logs, since it looks like must-gather will skip a pod that is in CrashLoopBackOff (that seems like a bug in itself).

Comment 12 Jason Kincl 2019-08-02 13:26:46 UTC
Created attachment 1600021 [details]
must-gather logs for etcd cluster with one node down after upgrade

Comment 13 Alay Patel 2019-08-04 18:31:23 UTC
@Jason, after reviewing the logs you provided, it appears the etcd-metrics container is crashing because of a TLS issue, possibly related to invalid TLS certificates. It does not look like the upstream gRPC-go balancer issue. Here are the log snippets that show the problem:

Errors from member3 (etcd-member container) pod logs:
 
2019-08-02 13:19:09.079780 I | embed: rejected connection from "160.91.205.23:38140" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.102355 I | embed: rejected connection from "160.91.205.23:38142" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.151596 I | embed: rejected connection from "160.91.205.23:38146" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.151670 I | embed: rejected connection from "160.91.205.23:38144" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.158155 I | embed: rejected connection from "160.91.205.23:38150" (error "remote error: tls: bad certificate", ServerName "bedrock4.ccs.ornl.gov")
2019-08-02 13:19:09.171770 I | embed: rejected connection from "160.91.205.21:51544" (error "remote error: tls: bad certificate", ServerName "ETCD-2.BEDROCK4.CCS.ORNL.GOV")
 
Errors from member3 (etcd-metrics container):
 
2019-08-02 13:15:20.286646 I | etcdmain: ServerTLS: cert = /etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt, key = /etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.key, ca = /etc/ssl/etcd/metric-ca.crt, trusted-ca = , client-cert-auth = false, crl-file =
2019-08-02 13:15:20.287204 I | etcdmain: listening for grpc-proxy client requests on 127.0.0.1:9977
2019-08-02 13:15:20.287580 I | etcdmain: ClientTLS: cert = /etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt, key = /etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.key, ca = /etc/ssl/etcd/ca.crt, trusted-ca = , client-cert-auth = false, crl-file =
WARNING: 2019/08/02 13:15:20 Failed to dial ETCD-2.BEDROCK4.CCS.ORNL.GOV:9978: connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority"; please retry.
WARNING: 2019/08/02 13:15:25 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: Error while dialing dial tcp: operation was canceled"; Reconnecting to {ETCD-2.BEDROCK4.CCS.ORNL.GOV:9978 0  <nil>}
WARNING: 2019/08/02 13:15:25 Failed to dial ETCD-2.BEDROCK4.CCS.ORNL.GOV:9978: grpc: the connection is closing; please retry.
context deadline exceeded
 
 
Status of the containers from the pod yaml:
 
  containerStatuses:
  - containerID: cri-o://aa66e438743902a2381686b511f1f9b6261be75d7970cf03f0300911eb15cf3d
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7174d5682d63fe637bd0fcc8ca99a047a4340a220c6eb3233418ce3ca326ef3a
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7174d5682d63fe637bd0fcc8ca99a047a4340a220c6eb3233418ce3ca326ef3a
    lastState: {}
    name: etcd-member
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2019-08-01T19:02:12Z
  - containerID: cri-o://4b6660a3b78ff17c2cb5a20c1cda186e199bd34d0dee8f8fd95e2ea7a691d47c
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7174d5682d63fe637bd0fcc8ca99a047a4340a220c6eb3233418ce3ca326ef3a
    imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7174d5682d63fe637bd0fcc8ca99a047a4340a220c6eb3233418ce3ca326ef3a
    lastState:
      terminated:
        containerID: cri-o://4b6660a3b78ff17c2cb5a20c1cda186e199bd34d0dee8f8fd95e2ea7a691d47c
        exitCode: 1
        finishedAt: 2019-08-02T13:10:13Z
        message: "2019-08-02 13:10:08.264977 I | etcdmain: ServerTLS: cert = /etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt,
          key = /etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.key,
          ca = /etc/ssl/etcd/metric-ca.crt, trusted-ca = , client-cert-auth = false,
          crl-file = \n2019-08-02 13:10:08.265563 I | etcdmain: listening for grpc-proxy
          client requests on 127.0.0.1:9977\n2019-08-02 13:10:08.265911 I | etcdmain:
          ClientTLS: cert = /etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt,
          key = /etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.key, ca
          = /etc/ssl/etcd/ca.crt, trusted-ca = , client-cert-auth = false, crl-file
          = \nWARNING: 2019/08/02 13:10:08 Failed to dial ETCD-2.BEDROCK4.CCS.ORNL.GOV:9978:
          connection error: desc = \"transport: authentication handshake failed: x509:
          certificate signed by unknown authority\"; please retry.\ncontext deadline
          exceeded\n"
        reason: Error
        startedAt: 2019-08-02T13:10:08Z

Please feel free to open a new bug on this and we can review the core reason for the failures.
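
As a quick way to confirm a certificate mismatch like the one described above, the serving and peer certificates on the failing node can be checked against the CAs they are expected to chain to. A sketch using the file paths from the log output above; how you get a shell on the affected master is left out:

  # does the metrics serving cert chain to the metrics CA?
  openssl verify -CAfile /etc/ssl/etcd/metric-ca.crt \
    '/etc/ssl/etcd/system:etcd-metric:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt'
  # check subject, issuer, and validity dates of the peer cert
  openssl x509 -noout -subject -issuer -dates \
    -in '/etc/ssl/etcd/system:etcd-peer:ETCD-2.BEDROCK4.CCS.ORNL.GOV.crt'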

Comment 17 Michal Fojtik 2019-11-07 09:42:36 UTC
etcd-3.3.15 landed; moving this to MODIFIED.

Comment 19 ge liu 2019-11-18 08:31:17 UTC
etcd-3.3.17 landed, and I have not triggered this issue. I still have some concern about it because I also have not been able to reproduce it in OCP 4.3 recently; anyway, closing it, and it can be reopened if it appears later.

Comment 21 errata-xmlrpc 2020-01-23 11:04:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

