Bug 1667557 - replacement etcd member for failed etcd stuck in Init:CrashLoopBackoff with etcd-setup-environment failures [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 1678921
Depends On:
Blocks:
Reported: 2019-01-18 19:35 UTC by Mike Fiedler
Modified: 2020-05-04 11:13 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:12:48 UTC
Target Upstream Version:
mfojtik: needinfo? (sbatsche)


Attachments
container log, system journal from etcd system, pod describe and events (3.42 KB, application/gzip)
2019-01-18 19:37 UTC, Mike Fiedler


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-etcd-operator pull 157 0 None closed bug 1801397: pkg/operator/guardbudget: delete MCO deployment 2021-02-05 01:35:42 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:13:13 UTC

Description Mike Fiedler 2019-01-18 19:35:51 UTC
Description of problem:

Not sure if etcd operator is correct component but it is the only etcd component I see.

4.0 next gen environment.  etcd image:  quay.io/coreos/etcd:v3.3.10   

After shutting down the AWS instance for a master/etcd node, a new master instance is eventually created and all expected pods except etcd start successfully. The etcd pod is stuck in Init:CrashLoopBackOff.

Going over to the node and inspecting crictl logs for Exited containers shows multiple containers with this sequence:

I0118 19:11:56.658651       1 run.go:47] Version: 3.11.0-476-gce2eba47-dirty
I0118 19:11:56.659405       1 run.go:57] ip addr is 10.0.32.134
I0118 19:11:56.660218       1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
I0118 19:11:56.660931       1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:11:56.661321       1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
E0118 19:11:56.661766       1 run.go:63] error looking up self: could not find self
I0118 19:12:56.664666       1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
I0118 19:12:56.667127       1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:12:56.699516       1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
E0118 19:12:56.701394       1 run.go:63] error looking up self: could not find self
I0118 19:13:56.663895       1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:13:56.666625       1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
I0118 19:13:56.668350       1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
E0118 19:13:56.670195       1 run.go:63] error looking up self: could not find self
I0118 19:14:56.663643       1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:14:56.665511       1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
I0118 19:14:56.667998       1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
E0118 19:14:56.669378       1 run.go:63] error looking up self: could not find self
I0118 19:15:56.663753       1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
I0118 19:15:56.666173       1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:15:56.667747       1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
E0118 19:15:56.669263       1 run.go:63] error looking up self: could not find self
I0118 19:16:56.663848       1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
I0118 19:16:56.665982       1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:16:56.668658       1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
E0118 19:16:56.670205       1 run.go:63] error looking up self: could not find self
I0118 19:16:56.670665       1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:16:56.672229       1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
I0118 19:16:56.672746       1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
E0118 19:16:56.673150       1 run.go:63] error looking up self: could not find self
F0118 19:16:56.673168       1 main.go:30] Error executing etcd-setup-environment: could not find self: timed out waiting for the condition

The new member never joins the cluster
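
The timestamps in the log above show the lookup being retried about once a minute for roughly five minutes before the fatal "timed out waiting for the condition". A minimal Python sketch of that retry pattern follows. It is not the etcd-setup-environment source; poll_until, its parameters, lookup_self, and the fake clock are hypothetical names used only for illustration, and the 60 s / 300 s values are read off the log timestamps rather than taken from the tool's configuration.

```python
import time
from typing import Callable, Optional


def poll_until(condition: Callable[[], Optional[str]],
               interval_s: float,
               timeout_s: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> str:
    """Call condition() until it returns a value or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        result = condition()
        if result is not None:
            return result
        if clock() >= deadline:
            # Same wording as the fatal line in the log; the real tool's
            # error path is not shown in this report.
            raise TimeoutError("timed out waiting for the condition")
        sleep(interval_s)


class FakeClock:
    """Stand-in clock so the example finishes instantly (illustration only)."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def lookup_self() -> Optional[str]:
    # Assumption: the lookup never succeeds, as in this report.
    print("error looking up self: could not find self")
    return None


fake = FakeClock()
try:
    poll_until(lookup_self, interval_s=60, timeout_s=300,
               sleep=fake.sleep, clock=fake.time)
except TimeoutError as err:
    print(f"could not find self: {err}")
```

Because the condition can never become true while the records are unchanged, retrying only delays the failure; the init container exits and the kubelet restarts it, which matches the Init:CrashLoopBackOff status.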


Version-Release number of selected component (if applicable):

etcd:v3.3.10
origin-v4.0-2019-01-17-162014

How reproducible: Unknown, 1 for 1 so far.   Will retry.


Steps to Reproduce:
1. Create standard 3 master/3 worker next gen cluster on AWS
2. Verify the cluster is working well - all master components and etcd running
3. From the AWS console stop 1 master instance
4. Wait for the replacement master instance to be created and report as NodeReady
5. oc get pods -n kube-system

Actual results:

NAME                                                    READY     STATUS                  RESTARTS   AGE                                                                                                                                                                                  
etcd-member-ip-10-0-15-124.us-east-2.compute.internal   1/1       Running                 3          23h                                                                                                                                                                                  
etcd-member-ip-10-0-27-53.us-east-2.compute.internal    1/1       Running                 0          1h                                                                                                                                                                                   
etcd-member-ip-10-0-32-134.us-east-2.compute.internal   0/1       Init:CrashLoopBackOff   4          29m      


Expected results:

3 running etcd pods and all etcd reporting as cluster members
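
The expected result can be checked mechanically against the oc get pods -n kube-system listing. The Python sketch below parses output in the column layout shown under "Actual results" and reports etcd member pods that are not fully running. unhealthy_etcd_pods is a hypothetical helper, and the etcd-member- prefix is taken from the pod names in this report.

```python
from typing import List, Tuple


def unhealthy_etcd_pods(listing: str) -> List[Tuple[str, str]]:
    """Return (name, status) for etcd member pods that are not fully Running."""
    rows = [line.split() for line in listing.splitlines() if line.strip()]
    unhealthy = []
    for columns in rows[1:]:  # skip the NAME/READY/STATUS header row
        name, ready, status = columns[0], columns[1], columns[2]
        if not name.startswith("etcd-member-"):
            continue
        ready_count, wanted_count = ready.split("/")
        if status != "Running" or ready_count != wanted_count:
            unhealthy.append((name, status))
    return unhealthy


sample = """\
NAME                                                    READY     STATUS                  RESTARTS   AGE
etcd-member-ip-10-0-15-124.us-east-2.compute.internal   1/1       Running                 3          23h
etcd-member-ip-10-0-27-53.us-east-2.compute.internal    1/1       Running                 0          1h
etcd-member-ip-10-0-32-134.us-east-2.compute.internal   0/1       Init:CrashLoopBackOff   4          29m
"""

for name, status in unhealthy_etcd_pods(sample):
    print(f"{name}: {status}")
```

This only flags pods that are present but unhealthy. It would not notice a member that is missing altogether, so a complete check would also compare the number of etcd member pods against the expected three.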


Additional info:

Will attach pod and journal logs along with kube events and describe pod of crash looping pod.

Comment 1 Mike Fiedler 2019-01-18 19:37:48 UTC
Created attachment 1521669
container log, system journal from etcd system, pod describe and events

Comment 2 Mike Fiedler 2019-01-18 19:38:58 UTC
@sbatsche - not sure if this component is correct.

Comment 3 Mike Fiedler 2019-01-18 20:04:05 UTC
When I bring the stopped node back up, the replacement node/instance is terminated as expected, but I'm left with a 2-member etcd cluster.

oc get pods -n kube-system
NAME                                                    READY     STATUS    RESTARTS   AGE
etcd-member-ip-10-0-15-124.us-east-2.compute.internal   1/1       Running   3          23h
etcd-member-ip-10-0-27-53.us-east-2.compute.internal    1/1       Running   0          2h


Moving this to Master for triage to make sure it does not sit in a dormant component unnoticed.

Comment 4 Sam Batschelet 2019-01-18 20:46:38 UTC
Mike, thanks for the report; I will take a look. Basically, etcd-setup-environment is tasked with setting two environment variables consumed by the container: ETCD_IPV4_ADDRESS and ETCD_DNS_NAME, both of which are required data points for the Pod.

https://github.com/openshift/machine-config-operator/blob/ce2eba470b0fe85a06139d0a905b6241c5ceb409/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L87 

It seems that we have a poor assumption somewhere in the process. My guess is that because we use SRV discovery and those DNS records are static, the SRV data goes stale as soon as something such as a name or IP changes. SRV discovery in etcd is a bootstrap mechanism; it is not intended to be a source of truth for cluster membership, because etcd, unlike say Consul, doesn't manage DNS. This problem is probably not trivial to resolve: we have the bootstrap mechanics working, but an etcd-operator that would manage this type of thing honestly does not yet exist in a form that could handle this task. Also, since the container image depends on this workflow, it needs to be rethought.

The wheels are turning but this will take some time to resolve. In the meantime I will verify my theories.
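
To illustrate the theory above: if the init container finds its own DNS name by resolving each SRV target and comparing the result with the local IP, static records can never match a replacement master that comes up with a new address. The Python sketch below assumes that lookup shape. It is not the actual etcd-setup-environment code; the example.com names and the record-to-IP mapping are invented, and find_self, build_environment, and static_resolver are hypothetical helpers. Only the two variable names, ETCD_IPV4_ADDRESS and ETCD_DNS_NAME, and the pod IPs come from this bug.

```python
from typing import Callable, Dict, Iterable, List, Optional


def find_self(local_ip: str,
              srv_targets: Iterable[str],
              resolve: Callable[[str], List[str]]) -> Optional[str]:
    """Return the SRV target whose addresses include local_ip, else None."""
    for target in srv_targets:
        if local_ip in resolve(target):
            return target.rstrip(".")
    return None


def build_environment(local_ip: str,
                      srv_targets: Iterable[str],
                      resolve: Callable[[str], List[str]]) -> Dict[str, str]:
    """Build the two variables the Pod needs, or fail if no record matches."""
    dns_name = find_self(local_ip, srv_targets, resolve)
    if dns_name is None:
        raise LookupError("could not find self")
    return {"ETCD_IPV4_ADDRESS": local_ip, "ETCD_DNS_NAME": dns_name}


# Invented static records, created once at bootstrap and never updated.
STATIC_RECORDS = {
    "etcd-0.example.com.": ["10.0.15.124"],
    "etcd-1.example.com.": ["10.0.27.53"],
    "etcd-2.example.com.": ["10.0.40.7"],  # assumed address of the lost master
}


def static_resolver(name: str) -> List[str]:
    return STATIC_RECORDS.get(name, [])


# A surviving master still matches its record.
print(build_environment("10.0.27.53", STATIC_RECORDS, static_resolver))

# The replacement master has a new IP that no record points at.
try:
    build_environment("10.0.32.134", STATIC_RECORDS, static_resolver)
except LookupError as err:
    print(f"error looking up self: {err}")
```

Under these assumptions the failure is deterministic: no amount of retrying helps until either the DNS record for the lost member is updated or the new member is added by a mechanism other than SRV bootstrap.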

Comment 5 zhou ying 2019-01-22 02:49:09 UTC
Hit the same error when stopping one of the master instances from the AWS console.

Comment 6 Sam Batschelet 2019-01-23 22:38:23 UTC
Update: I am currently working on making adjustments to the machine-config-operator logic to handle a lost master and bring the new etcd instance properly into the cluster. The issue is as described above and I have verified this. The init container is trying to bootstrap using SRV discovery and this is not possible after the cluster has been bootstrapped.

Comment 7 Michal Fojtik 2019-03-07 12:17:58 UTC
*** Bug 1678921 has been marked as a duplicate of this bug. ***

Comment 9 Mike Fiedler 2019-04-11 00:07:02 UTC
Adding TestBlocker keyword.  This blocks master/etcd HA and resiliency tests for the 4.1 system test run.

Comment 10 Greg Blomquist 2019-04-17 20:31:44 UTC
4.1 couples the control plane etcd with the masters.  Because of this, in 4.1 we can't currently auto-recover from replacing a master node.  Specifically, we can't auto-recover if we lose a master node, and get a new master node with a new IP.

I talked to Eric Paris, and verified that losing a master node and replacing it with a new master node with a new IP technically puts us into a disaster recovery scenario.  At that point, human intervention is required to make sure the new master node successfully joins the cluster.

Sam is looking at some possible mitigation paths, but none that can be delivered in 4.1.  I think we should move this to a 4.2 test scenario.

Comment 11 Eric Paris 2019-04-18 00:29:50 UTC
For 4.1, I believe that we should verify that a human can recover a cluster if 1 master is removed and replaced.

For some (potentially distant) future version of 4.x, I would like this to be automatic. But for 4.1, our goal should be that a human is capable of recovery.

Comment 22 ge liu 2020-03-09 08:39:09 UTC
Per comment 20, closing this bug. Please reopen it if there are any concerns. @Mike Fiedler, @yinzhou

Comment 24 errata-xmlrpc 2020-05-04 11:12:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

