Description of problem:
See below.

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-09-204138

How reproducible:
Always

Steps to Reproduce:
1. Create a 4.1 cluster, then check etcd:

sh-4.2# etcdctl --endpoints etcd-0.xxia-0514.qe.devcluster.openshift.com:2379,etcd-1.xxia-0514.qe.devcluster.openshift.com:2379,etcd-2.xxia-0514.qe.devcluster.openshift.com:2379 endpoint status --write-out table
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                     ENDPOINT                      |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| etcd-0.xxia-0514.qe.devcluster.openshift.com:2379 | 947450319109c4c3 | 3.3.10  | 59 MB   | false     |         6 |      49919 |
| etcd-1.xxia-0514.qe.devcluster.openshift.com:2379 | 85ee3a8c618451fd | 3.3.10  | 59 MB   | true      |         6 |      49919 |
| etcd-2.xxia-0514.qe.devcluster.openshift.com:2379 | df24b96e3a492be3 | 3.3.10  | 58 MB   | false     |         6 |      49919 |
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

2. Stop the etcd-0 and etcd-1 hosts in the AWS console, one by one. Now only etcd-2 remains, and it is not the etcd leader.

3. Follow the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit# and come to the "./recover.sh ..." step:

# First ssh to xxia-0514-5lxkq-master-0, which is auto-recreated by the machine API.
[root@ip-10-0-132-26 ~]# ./recover.sh 10.0.131.160
Creating asset directory ./assets
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
Backing up /etc/etcd/etcd.conf to ./assets/backup/
Trying to backup etcd client certs..
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-1/secrets/etcd-client does not contain etcd client certs, trying next source ..
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-2/secrets/etcd-client does not contain etcd client certs, trying next source ..
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-3/secrets/etcd-client does not contain etcd client certs, trying next source ..
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-4/secrets/etcd-client does not contain etcd client certs, trying next source ..
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-5/secrets/etcd-client does not contain etcd client certs, trying next source ..
Stopping etcd..
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Local etcd snapshot file not found, backup skipped..
Populating template..
Starting etcd client cert recovery agent..
Waiting for certs to generate..
Waiting for certs to generate..
Waiting for certs to generate..
...repeated...

Actual results:
3. The script does not complete; the "Waiting for certs to generate" message repeats indefinitely. Checking further shows a pod that is not Running:

$ oc get po -n openshift-etcd
NAME                                                             READY   STATUS             RESTARTS   AGE
etcd-generate-certs-ip-10-0-132-26.ap-south-1.compute.internal   0/2     CrashLoopBackOff   18         24m
etcd-member-ip-10-0-131-160.ap-south-1.compute.internal          2/2     Running            0          3h53m
...

$ oc logs etcd-generate-certs-ip-10-0-132-26.ap-south-1.compute.internal -n openshift-etcd -c generate-certs
+ source /run/etcd/environment
++ ETCD_IPV4_ADDRESS=10.0.132.26
++ ETCD_DNS_NAME=etcd-0.xxia-0514.qe.devcluster.openshift.com
+ '[' -e /etc/ssl/etcd/system:etcd-server:etcd-0.xxia-0514.qe.devcluster.openshift.com.crt -a -e /etc/ssl/etcd/system:etcd-server:etcd-0.xxia-0514.qe.devcluster.openshift.com.key ']'
/bin/sh: line 8: ETCD_WILDCARD_DNS_NAME: unbound variable

Expected results:
3. The recovery should complete.

Additional info:
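The last log line points at the failure mode: the generate-certs script is apparently running with the shell's nounset option (`set -u`), so expanding `ETCD_WILDCARD_DNS_NAME` while it is missing from /run/etcd/environment aborts the container and causes the CrashLoopBackOff. A minimal sketch of the behavior and a defensive guard (this is an illustration, not the actual generate-certs template):

```shell
#!/bin/sh
# Illustration: under `set -u`, expanding an unset variable aborts the
# script with an "unbound variable" error, matching the log above.
unset ETCD_WILDCARD_DNS_NAME
set -u

# An unguarded `echo "$ETCD_WILDCARD_DNS_NAME"` would abort here.
# Guarded expansion: `${VAR:-}` substitutes an empty string instead of
# aborting, letting the script detect the missing value and report it.
wildcard="${ETCD_WILDCARD_DNS_NAME:-}"
if [ -z "$wildcard" ]; then
  echo "ETCD_WILDCARD_DNS_NAME is not set in the sourced environment"
fi
```

This is why comment 1 asks for the contents of /run/etcd/environment: the file sourced by the script defines only `ETCD_IPV4_ADDRESS` and `ETCD_DNS_NAME` in the log above, with no wildcard DNS name.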
Can you please provide the following to debug:
1.) On ip-10-0-132-26: `cat /run/etcd/environment`
2.) The output of tokenize-signer.sh: `cat ./assets/manifests/kube-etcd-cert-signer.yaml`
As per comment 7, the PR has not been merged to 4.1 yet, and this bug's target release is 4.1, so I re-assigned it; we will try the PR and update later.
4.1 PR: https://github.com/openshift/machine-config-operator/pull/771
Finally verified in 4.1.0-0.nightly-2019-05-24-040103 by following the latest http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html together with https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5 . The cluster can be restored back to 3 masters, and sanity testing did not find any cluster problems.
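For context on why this scenario needs the disaster recovery procedure at all: etcd requires a majority of members, floor(n/2)+1, to keep quorum. Stopping two of the three masters in step 2 drops the cluster below that threshold, while the restored 3-member cluster can again tolerate one failure. A quick sketch of the arithmetic:

```shell
#!/bin/sh
# Quorum arithmetic for etcd: a cluster of n voting members needs
# floor(n/2)+1 members alive to maintain quorum.
quorum() { echo $(( $1 / 2 + 1 )); }

members=3
alive=1   # step 2 stops etcd-0 and etcd-1, leaving only etcd-2
echo "quorum needed: $(quorum "$members"), members alive: $alive"
if [ "$alive" -lt "$(quorum "$members")" ]; then
  echo "quorum lost: the cluster cannot serve writes; recovery is required"
fi
```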
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758