Description of problem:

Not sure if etcd operator is the correct component but it is the only etcd component I see. 4.0 next gen environment. etcd image: quay.io/coreos/etcd:v3.3.10

After shutting down the AWS instance for a master/etcd, a new master instance is eventually created and all expected pods except etcd are started successfully. The etcd pod is stuck in Init:CrashLoopBackOff. Going over to the node and inspecting crictl logs for Exited containers shows multiple containers with this sequence:

I0118 19:11:56.658651 1 run.go:47] Version: 3.11.0-476-gce2eba47-dirty
I0118 19:11:56.659405 1 run.go:57] ip addr is 10.0.32.134
I0118 19:11:56.660218 1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
I0118 19:11:56.660931 1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:11:56.661321 1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
E0118 19:11:56.661766 1 run.go:63] error looking up self: could not find self
I0118 19:12:56.664666 1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
I0118 19:12:56.667127 1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:12:56.699516 1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
E0118 19:12:56.701394 1 run.go:63] error looking up self: could not find self
I0118 19:13:56.663895 1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:13:56.666625 1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
I0118 19:13:56.668350 1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
E0118 19:13:56.670195 1 run.go:63] error looking up self: could not find self
I0118 19:14:56.663643 1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:14:56.665511 1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
I0118 19:14:56.667998 1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
E0118 19:14:56.669378 1 run.go:63] error looking up self: could not find self
I0118 19:15:56.663753 1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
I0118 19:15:56.666173 1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:15:56.667747 1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
E0118 19:15:56.669263 1 run.go:63] error looking up self: could not find self
I0118 19:16:56.663848 1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
I0118 19:16:56.665982 1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:16:56.668658 1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
E0118 19:16:56.670205 1 run.go:63] error looking up self: could not find self
I0118 19:16:56.670665 1 run.go:137] checking against mffiedler30-etcd-2.qe.devcluster.openshift.com.
I0118 19:16:56.672229 1 run.go:137] checking against mffiedler30-etcd-1.qe.devcluster.openshift.com.
I0118 19:16:56.672746 1 run.go:137] checking against mffiedler30-etcd-0.qe.devcluster.openshift.com.
E0118 19:16:56.673150 1 run.go:63] error looking up self: could not find self
F0118 19:16:56.673168 1 main.go:30] Error executing etcd-setup-environment: could not find self: timed out waiting for the condition

The new member never joins the cluster.

Version-Release number of selected component (if applicable):
etcd:v3.3.10
origin-v4.0-2019-01-17-162014

How reproducible:
Unknown, 1 for 1 so far. Will retry.

Steps to Reproduce:
1. Create a standard 3 master/3 worker next gen cluster on AWS
2. Verify the cluster is working well - all master components and etcd running
3. From the AWS console stop 1 master instance
4. Wait for the replacement master instance to be created and report as NodeReady
5. oc get pods -n kube-system

Actual results:

NAME                                                     READY   STATUS                  RESTARTS   AGE
etcd-member-ip-10-0-15-124.us-east-2.compute.internal    1/1     Running                 3          23h
etcd-member-ip-10-0-27-53.us-east-2.compute.internal     1/1     Running                 0          1h
etcd-member-ip-10-0-32-134.us-east-2.compute.internal    0/1     Init:CrashLoopBackOff   4          29m

Expected results:
3 running etcd pods and all etcd members reporting as cluster members

Additional info:
Will attach pod and journal logs along with kube events and describe pod of the crash looping pod.
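For reference, the "checking against" loop in the log appears to do roughly the following (a hypothetical sketch, not the actual etcd-setup-environment source; the member names and node IP are copied from the log above, the matching logic is an assumption):

package main

import (
	"fmt"
	"net"
)

func main() {
	selfIP := "10.0.32.134" // the replacement master's IP, from the log above
	members := []string{
		"mffiedler30-etcd-0.qe.devcluster.openshift.com",
		"mffiedler30-etcd-1.qe.devcluster.openshift.com",
		"mffiedler30-etcd-2.qe.devcluster.openshift.com",
	}

	for _, name := range members {
		fmt.Printf("checking against %s.\n", name)
		ips, err := net.LookupHost(name)
		if err != nil {
			continue // the record may not resolve at all
		}
		for _, ip := range ips {
			if ip == selfIP {
				fmt.Printf("found self: %s -> %s\n", name, ip)
				return
			}
		}
	}
	// The etcd-N records still resolve to the old master's address, so the
	// replacement node never matches any of them and the init container fails.
	fmt.Println("error looking up self: could not find self")
}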
Created attachment 1521669 [details] container log, system journal from etcd system, pod describe and events
@sbatsche - not sure if this component is correct.
When I bring the stopped node back up, the replacement node/instance is terminated as expected but I'm left with a 2-member etcd cluster.

oc get pods -n kube-system
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-15-124.us-east-2.compute.internal    1/1     Running   3          23h
etcd-member-ip-10-0-27-53.us-east-2.compute.internal     1/1     Running   0          2h

Moving this to Master for triage to make sure it does not sit in a dormant component unnoticed.
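To confirm the two-member state from the etcd side, something like the following membership check can be used (a sketch using the etcd clientv3 Go API, equivalent to `etcdctl member list`; the endpoint is a placeholder, TLS setup is omitted, and the import path varies by etcd version):

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.example.com:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
		// TLS omitted here; a real 4.x cluster requires client certificates.
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// After the scenario above this would list only two members instead of three.
	for _, m := range resp.Members {
		fmt.Printf("%x: %s %v\n", m.ID, m.Name, m.PeerURLs)
	}
}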
Mike, thanks for the report. I will take a look.

Basically, etcd-setup-environment is tasked with setting two ENV vars consumed by the container, ETCD_IPV4_ADDRESS and ETCD_DNS_NAME, which are required data points for the Pod. https://github.com/openshift/machine-config-operator/blob/ce2eba470b0fe85a06139d0a905b6241c5ceb409/templates/master/00-master/_base/files/etc-kubernetes-manifests-etcd-member.yaml#L87

It seems that we have a poor assumption somewhere in the process. My guess is that because we use SRV discovery, and those DNS records are static, if something changes such as the name or IP, that SRV data is now stale. SRV discovery in etcd is a bootstrap mechanism, not intended to be a source of truth for its members, basically because etcd, unlike say Consul, doesn't manage DNS.

This problem is probably not trivial to resolve. We have the bootstrap mechanics working, but etcd-operator, which would manage this type of thing, honestly does not exist yet in a form that could handle this task. Also, since the container image is dependent on this workflow, it needs to be rethought. The wheels are turning, but this will take some time to resolve. In the meantime I will verify my theories.
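To illustrate the staleness point, the member names the init container checks come from SRV records created at install time, along the lines of the sketch below (a rough illustration only; the service name "etcd-server-ssl" follows etcd's standard TLS discovery convention, and the discovery domain shown is a guess, not confirmed from this cluster):

package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Standard etcd SRV discovery record for TLS peers:
	// _etcd-server-ssl._tcp.<cluster domain>   (domain below is a placeholder)
	_, srvs, err := net.LookupSRV("etcd-server-ssl", "tcp", "cluster.example.com")
	if err != nil {
		log.Fatal(err)
	}
	for _, srv := range srvs {
		// These targets (etcd-0, etcd-1, etcd-2) are fixed at bootstrap time and
		// are not updated when a master is replaced, which is why a replacement
		// node with a new IP can never "find self" behind them.
		fmt.Printf("member record: %s:%d\n", srv.Target, srv.Port)
	}
}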
Met the same error when stopping one of the master instances from the AWS console.
Update: I am currently working on making adjustments to the machine-config-operator logic to handle a lost master and bring the new etcd instance properly into the cluster. The issue is as described above and I have verified this. The init container is trying to bootstrap using SRV discovery and this is not possible after the cluster has been bootstrapped.
*** Bug 1678921 has been marked as a duplicate of this bug. ***
Adding TestBlocker keyword. This blocks master/etcd HA and resiliency tests for the 4.1 system test run.
4.1 couples the control plane etcd with the masters. Because of this, in 4.1 we can't currently auto-recover from replacing a master node; specifically, we can't auto-recover if we lose a master node and get a new master node with a new IP. I talked to Eric Paris and verified that losing a master node and replacing it with a new master node with a new IP technically puts us into a disaster recovery scenario. At that point, human intervention is required to make sure the new master node successfully joins the cluster. Sam is looking at some possible mitigation paths, but none that can be delivered in 4.1. I think we should move this to a 4.2 test scenario.
For 4.1 I believe that we should verify that a human can recover a cluster if 1 master is removed and replaced. For some (potentially distant) future version of 4.x I would like this to be automatic, but for 4.1 our goal should be that a human is capable of recovery.
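For illustration, the core of that manual recovery is replacing the stale member entry, roughly as in this sketch (a hypothetical example using the etcd clientv3 API, equivalent to `etcdctl member remove` followed by `etcdctl member add`; the endpoint, member ID, and peer URL are placeholders, and this is not the documented recovery procedure, which also involves certificates and the member's static pod):

package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0.example.com:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
		// TLS omitted; a real cluster needs client certificates here.
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// 1. Remove the member that belonged to the lost master.
	staleID := uint64(0xdeadbeef) // placeholder ID taken from a member list
	if _, err := cli.MemberRemove(ctx, staleID); err != nil {
		log.Fatal(err)
	}

	// 2. Add the replacement master's peer URL so etcd expects it to join.
	newPeer := []string{"https://10.0.32.134:2380"} // placeholder peer URL
	if _, err := cli.MemberAdd(ctx, newPeer); err != nil {
		log.Fatal(err)
	}
}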
Per comment 20, closing this bug. Please reopen it if there are any concerns. @Mike Fiedler, @yinzhou
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days