Bug 1709802

Summary: [DR] Growing etcd to include the new hosts via recover.sh doesn't complete
Product: OpenShift Container Platform
Reporter: Xingxing Xia <xxia>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED ERRATA
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.1.0
CC: bleanhar, gblomqui, vrutkovs
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:48:48 UTC
Type: Bug

Description Xingxing Xia 2019-05-14 10:58:26 UTC
Description of problem:
See below.

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-09-204138

How reproducible:
Always

Steps to Reproduce:
1. Create 4.1 cluster. Then check etcd:
sh-4.2# etcdctl --endpoints etcd-0.xxia-0514.qe.devcluster.openshift.com:2379,etcd-1.xxia-0514.qe.devcluster.openshift.com:2379,etcd-2.xxia-0514.qe.devcluster.openshift.com:2379 endpoint status --write-out table
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+                                    
|                     ENDPOINT                      |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |                                    
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+                                    
| etcd-0.xxia-0514.qe.devcluster.openshift.com:2379 | 947450319109c4c3 |  3.3.10 |   59 MB |     false |         6 |      49919 |                                    
| etcd-1.xxia-0514.qe.devcluster.openshift.com:2379 | 85ee3a8c618451fd |  3.3.10 |   59 MB |      true |         6 |      49919 |                                    
| etcd-2.xxia-0514.qe.devcluster.openshift.com:2379 | df24b96e3a492be3 |  3.3.10 |   58 MB |     false |         6 |      49919 |                                    
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

2. Stop the etcd-0 and etcd-1 hosts from the AWS console, one by one. Only etcd-2 remains, and it was not the etcd leader.

3. Follow the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit# up to the "./recover.sh ..." steps:
# First ssh to xxia-0514-5lxkq-master-0, which was automatically recreated by the machine API.
[root@ip-10-0-132-26 ~]# ./recover.sh 10.0.131.160
Creating asset directory ./assets 
Downloading etcdctl binary.. 
etcdctl version: 3.3.10 
API version: 3.3 
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/ 
Backing up /etc/etcd/etcd.conf to ./assets/backup/ 
Trying to backup etcd client certs.. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-1/secrets/etcd-client does not contain etcd client certs, trying next source .. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-2/secrets/etcd-client does not contain etcd client certs, trying next source .. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-3/secrets/etcd-client does not contain etcd client certs, trying next source .. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-4/secrets/etcd-client does not contain etcd client certs, trying next source .. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-5/secrets/etcd-client does not contain etcd client certs, trying next source .. 
Stopping etcd.. 
Waiting for etcd-member to stop 
Waiting for etcd-member to stop 
Waiting for etcd-member to stop 
Waiting for etcd-member to stop 
Waiting for etcd-member to stop 
Local etcd snapshot file not found, backup skipped.. 
Populating template.. 
Starting etcd client cert recovery agent..
Waiting for certs to generate..
Waiting for certs to generate..
Waiting for certs to generate..
...repeated...
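
For context on step 2: etcd needs a majority of members to make progress, so a 3-member cluster with two hosts stopped has lost quorum and must be recovered. The sketch below is illustrative only; `quorum` and `leader_of` are hypothetical helper names, not part of recover.sh or etcdctl. It computes the majority for a given member count and picks the leader out of `endpoint status --write-out table` output like that shown in step 1.

```shell
#!/bin/sh
# Illustrative helpers only (hypothetical names, not part of recover.sh).

# quorum N: print the majority needed for an N-member etcd cluster.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# leader_of: read "etcdctl endpoint status --write-out table" output on
# stdin and print the endpoint whose IS LEADER column is "true".
# Border lines contain no "|" and the header row never equals "true",
# so only data rows can match.
leader_of() {
    awk -F'|' 'NF > 6 {
        gsub(/ /, "", $2)   # ENDPOINT column
        gsub(/ /, "", $6)   # IS LEADER column
        if ($6 == "true") print $2
    }'
}
```

For example, `quorum 3` prints 2, which is why a single surviving member cannot serve writes on its own. Piping the step 1 table into `leader_of` would print the etcd-1 endpoint.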

Actual results:
3. recover.sh does not complete, and the "Waiting for certs to generate.." message keeps repeating.
Checking the pods shows that the etcd-generate-certs pod is not Running:
$ oc get po -n openshift-etcd
NAME                                                             READY   STATUS             RESTARTS   AGE                                                           
etcd-generate-certs-ip-10-0-132-26.ap-south-1.compute.internal   0/2     CrashLoopBackOff   18         24m                                                           
etcd-member-ip-10-0-131-160.ap-south-1.compute.internal          2/2     Running            0          3h53m
...

$ oc logs etcd-generate-certs-ip-10-0-132-26.ap-south-1.compute.internal -n openshift-etcd -c generate-certs
+ source /run/etcd/environment
++ ETCD_IPV4_ADDRESS=10.0.132.26
++ ETCD_DNS_NAME=etcd-0.xxia-0514.qe.devcluster.openshift.com
+ '[' -e /etc/ssl/etcd/system:etcd-server:etcd-0.xxia-0514.qe.devcluster.openshift.com.crt -a -e /etc/ssl/etcd/system:etcd-server:etcd-0.xxia-0514.qe.devcluster.openshift.com.key ']'
/bin/sh: line 8: ETCD_WILDCARD_DNS_NAME: unbound variable
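
The "unbound variable" message is what the shell prints when a script running with `set -u` (nounset) expands a variable that was never set. The trace above shows /run/etcd/environment defining only ETCD_IPV4_ADDRESS and ETCD_DNS_NAME. The following is a minimal sketch of that failure mode and of a defensive check, not the actual generate-certs script; `require_var` is a hypothetical helper introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical reproduction; this is NOT the real generate-certs script.
set -u

# Mimic the environment file from the trace: the wildcard name is absent.
ETCD_IPV4_ADDRESS=10.0.132.26
ETCD_DNS_NAME=etcd-0.xxia-0514.qe.devcluster.openshift.com
unset ETCD_WILDCARD_DNS_NAME

# Under "set -u", a plain expansion of an unset variable aborts the shell.
# It runs in a subshell here so that only the subshell exits.
if ! ( : "$ETCD_WILDCARD_DNS_NAME" ) 2>/dev/null; then
    echo "plain expansion of ETCD_WILDCARD_DNS_NAME fails under set -u"
fi

# require_var NAME: succeed only if variable NAME is set and non-empty.
# The ${NAME:-} form supplies an empty default, so it is safe under "set -u".
require_var() {
    eval "value=\${$1:-}"
    if [ -z "$value" ]; then
        echo "missing required variable: $1" >&2
        return 1
    fi
}
```

A check such as `require_var ETCD_WILDCARD_DNS_NAME || exit 1` placed right after sourcing the environment file would report the missing value by name, which is easier to diagnose than a pod in CrashLoopBackOff.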

Expected results:
3. Should complete.

Additional info:

Comment 2 Sam Batschelet 2019-05-14 12:47:44 UTC
Can you please provide the following for debugging?

1.) On ip-10-0-132-26, the output of `cat /run/etcd/environment`
2.) The output of tokenize-signer.sh: `cat ./assets/manifests/kube-etcd-cert-signer.yaml`

Comment 9 ge liu 2019-05-17 09:15:48 UTC
Per comment 7, the PR has not yet been merged to 4.1, and this bug's target release is 4.1, so it has been reassigned. We will try the PR and update later.

Comment 10 Greg Blomquist 2019-05-17 12:54:36 UTC
4.1 PR: https://github.com/openshift/machine-config-operator/pull/771

Comment 15 Xingxing Xia 2019-05-24 09:03:15 UTC
Finally verified in 4.1.0-0.nightly-2019-05-24-040103 by following the latest http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html together with https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5 . The cluster can be restored to 3 masters, and the sanity test found no cluster problems.

Comment 17 errata-xmlrpc 2019-06-04 10:48:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758