Bug 1667557
| Summary: | replacement etcd member for failed etcd stuck in Init:CrashLoopBackoff with etcd-setup-environment failures | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Etcd | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED ERRATA | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.0 | CC: | aos-bugs, eparis, gblomqui, jokerman, mfojtik, mmccomas, rhowe, sbatsche, xxia, yinzhou |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 4.4.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-05-04 11:12:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | container log, system journal from etcd system, pod describe and events (attachment 1521669) | | |
Description
Mike Fiedler
2019-01-18 19:35:51 UTC
Created attachment 1521669 [details]
container log, system journal from etcd system, pod describe and events
@sbatsche - not sure if this component is correct.

When I bring the stopped node back up, the replacement node/instance is terminated as expected, but I'm left with a 2-member etcd cluster:

```
$ oc get pods -n kube-system
NAME                                                    READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-15-124.us-east-2.compute.internal   1/1     Running   3          23h
etcd-member-ip-10-0-27-53.us-east-2.compute.internal    1/1     Running   0          2h
```

Moving this to Master for triage to make sure it does not sit in a dormant component unnoticed.

Mike, thanks for the report, I will take a look. Basically, etcd-setup-environment is tasked with setting two environment variables consumed by the container, ETCD_IPV4_ADDRESS and ETCD_DNS_NAME, which are required data points for the pod (the first sketch at the end of this report illustrates the handoff):

https://github.com/openshift/machine-config-operator/blob/ce2eba470b0fe85a06139d0a905b6241c5ceb409/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L87

It seems that we have a poor assumption somewhere in the process. My guess is that, because we use SRV discovery and those DNS records are static, any change such as a new name or IP leaves the SRV data stale. SRV discovery in etcd is a bootstrap mechanism, not intended to be a source of truth for its members, basically because etcd, unlike say Consul, doesn't manage DNS (see the second sketch at the end of this report). This problem is probably not trivial to resolve: we have the bootstrap mechanics working, but an etcd-operator that would manage this type of thing honestly does not exist yet in a form that could handle this task. Also, since the container image is dependent on this workflow, it needs to be rethought. The wheels are turning, but this will take some time to resolve. In the meantime I will verify my theories.

Met the same error when stopping one of the master instances in the AWS console.

Update: I am currently working on making adjustments to the machine-config-operator logic to handle a lost master and bring the new etcd instance properly into the cluster. The issue is as described above and I have verified this. The init container is trying to bootstrap using SRV discovery, and this is not possible after the cluster has been bootstrapped.

*** Bug 1678921 has been marked as a duplicate of this bug. ***

Adding TestBlocker keyword. This blocks master/etcd HA and resiliency tests for the 4.1 system test run.

4.1 couples the control-plane etcd with the masters. Because of this, in 4.1 we can't currently auto-recover from replacing a master node. Specifically, we can't auto-recover if we lose a master node and get a new master node with a new IP. I talked to Eric Paris and verified that losing a master node and replacing it with a new master node with a new IP technically puts us into a disaster recovery scenario. At that point, human intervention is required to make sure the new master node successfully joins the cluster (the last sketch at the end of this report outlines the shape of that intervention). Sam is looking at some possible mitigation paths, but none that can be delivered in 4.1. I think we should move this to a 4.2 test scenario.

For 4.1 I believe that we should verify that a human can recover a cluster if 1 master is removed and replaced. For some (potentially distant) future version of 4.x I would like this to be automatic. But for 4.1 our goal should be that a human is capable of recovery.

Per comment 20, closing this bug. Please reopen it if there is any concern. @Mike Fiedler, @yinzhou

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
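First sketch: a minimal illustration of the init-container handoff Sam describes above, assuming the init container resolves the node's IP and etcd DNS name and writes them to a file that the etcd-member container sources before starting etcd. This is a sketch only, not the actual OpenShift manifest (that is linked in Sam's comment); the file path, domain, and flag set here are placeholders.

```
# Sketch: etcd-setup-environment would write something like
#
#   ETCD_IPV4_ADDRESS=10.0.15.124
#   ETCD_DNS_NAME=etcd-0.mycluster.example.com
#
# and the etcd-member container would consume it roughly like this:
set -euo pipefail
source /run/etcd/environment   # illustrative path

exec etcd \
  --name "${ETCD_DNS_NAME}" \
  --discovery-srv "mycluster.example.com" \
  --advertise-client-urls "https://${ETCD_IPV4_ADDRESS}:2379" \
  --initial-advertise-peer-urls "https://${ETCD_IPV4_ADDRESS}:2380"
```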
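Second sketch: why stale SRV data matters. With --discovery-srv, etcd consults the _etcd-server-ssl._tcp / _etcd-server._tcp SRV records only while bootstrapping a brand-new cluster; once the cluster exists, membership lives in etcd itself. The discovery domain below is a placeholder.

```
# Placeholder discovery domain for illustration.
DISCOVERY_DOMAIN=mycluster.example.com

# At bootstrap, etcd resolves its peers from static SRV records like these:
dig +short SRV "_etcd-server-ssl._tcp.${DISCOVERY_DOMAIN}"
dig +short SRV "_etcd-server._tcp.${DISCOVERY_DOMAIN}"

# After bootstrap, the cluster's own member list is authoritative; a node
# whose name or IP changed cannot rejoin by re-running SRV discovery.
# (Add --endpoints/--cacert/--cert/--key as appropriate for a TLS cluster.)
etcdctl member list
```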
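Last sketch: the rough shape of the manual recovery described above, i.e. what a human would do to bring a replacement master's etcd member into an existing cluster using the etcd members API. Endpoints, certificate paths, member names, and IPs are placeholders (the endpoint reuses a surviving member from the pod list above as an example); the supported procedure for a given OpenShift release is the product's disaster-recovery documentation, not this sketch.

```
# Placeholder endpoints and certificate paths; adjust to the cluster's layout.
export ETCDCTL_API=3
ARGS="--endpoints=https://10.0.27.53:2379 --cacert=/etc/ssl/etcd/ca.crt --cert=/etc/ssl/etcd/peer.crt --key=/etc/ssl/etcd/peer.key"

# 1. Find the ID of the member whose node was lost.
etcdctl $ARGS member list

# 2. Remove that member so the cluster no longer expects its old name/IP.
etcdctl $ARGS member remove <dead-member-id>

# 3. Register the replacement under its new name and IP; the new etcd then
#    joins with --initial-cluster-state=existing instead of relying on the
#    (now stale) SRV records.
etcdctl $ARGS member add etcd-member-new --peer-urls=https://<new-master-ip>:2380
```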