Bug 1709802 - [DR] Growing etcd to include the new hosts via recover.sh doesn't complete
Summary: [DR] Growing etcd to include the new hosts via recover.sh doesn't complete
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-14 10:58 UTC by Xingxing Xia
Modified: 2019-06-04 10:49 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:48 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:49:49 UTC

Description Xingxing Xia 2019-05-14 10:58:26 UTC
Description of problem:
Growing etcd to include the new hosts via recover.sh does not complete; see the steps below.

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-09-204138

How reproducible:
Always

Steps to Reproduce:
1. Create 4.1 cluster. Then check etcd:
sh-4.2# etcdctl --endpoints etcd-0.xxia-0514.qe.devcluster.openshift.com:2379,etcd-1.xxia-0514.qe.devcluster.openshift.com:2379,etcd-2.xxia-0514.qe.devcluster.openshift.com:2379 endpoint status --write-out table
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+                                    
|                     ENDPOINT                      |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |                                    
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+                                    
| etcd-0.xxia-0514.qe.devcluster.openshift.com:2379 | 947450319109c4c3 |  3.3.10 |   59 MB |     false |         6 |      49919 |                                    
| etcd-1.xxia-0514.qe.devcluster.openshift.com:2379 | 85ee3a8c618451fd |  3.3.10 |   59 MB |      true |         6 |      49919 |                                    
| etcd-2.xxia-0514.qe.devcluster.openshift.com:2379 | df24b96e3a492be3 |  3.3.10 |   58 MB |     false |         6 |      49919 |                                    
+---------------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

2. Stop the etcd-0 and etcd-1 hosts in the AWS console one at a time (an AWS CLI equivalent is sketched after the recover.sh output below). Now only etcd-2 remains, and it is not the etcd leader.

3. Follow the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit# up to the "./recover.sh ..." step:
# First ssh to xxia-0514-5lxkq-master-0 which is auto-recreated by machine api.
[root@ip-10-0-132-26 ~]# ./recover.sh 10.0.131.160
Creating asset directory ./assets 
Downloading etcdctl binary.. 
etcdctl version: 3.3.10 
API version: 3.3 
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/ 
Backing up /etc/etcd/etcd.conf to ./assets/backup/ 
Trying to backup etcd client certs.. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-1/secrets/etcd-client does not contain etcd client certs, trying next source .. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-2/secrets/etcd-client does not contain etcd client certs, trying next source .. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-3/secrets/etcd-client does not contain etcd client certs, trying next source .. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-4/secrets/etcd-client does not contain etcd client certs, trying next source .. 
/etc/kubernetes/static-pod-resources/kube-apiserver-pod-5/secrets/etcd-client does not contain etcd client certs, trying next source .. 
Stopping etcd.. 
Waiting for etcd-member to stop 
Waiting for etcd-member to stop 
Waiting for etcd-member to stop 
Waiting for etcd-member to stop 
Waiting for etcd-member to stop 
Local etcd snapshot file not found, backup skipped.. 
Populating template.. 
Starting etcd client cert recovery agent..
Waiting for certs to generate..
Waiting for certs to generate..
Waiting for certs to generate..
...repeated...
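
For reference, step 2 can also be done from the AWS CLI instead of the web console. The instance Name tags and IDs below are placeholders, not the actual instances from this cluster:

# Look up the instance IDs of the two master hosts to stop (Name tag values are hypothetical).
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=xxia-0514-5lxkq-master-1,xxia-0514-5lxkq-master-2" \
  --query 'Reservations[].Instances[].InstanceId' --output text

# Stop them one at a time, waiting for each to reach "stopped".
aws ec2 stop-instances --instance-ids i-0aaaaaaaaaaaaaaaa
aws ec2 wait instance-stopped --instance-ids i-0aaaaaaaaaaaaaaaa
aws ec2 stop-instances --instance-ids i-0bbbbbbbbbbbbbbbb
aws ec2 wait instance-stopped --instance-ids i-0bbbbbbbbbbbbbbbb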

Actual results:
3. recover.sh does not complete; the "Waiting for certs to generate" message repeats indefinitely.
Checking the pods shows the cert-generation pod is not Running:
$ oc get po -n openshift-etcd
NAME                                                             READY   STATUS             RESTARTS   AGE                                                           
etcd-generate-certs-ip-10-0-132-26.ap-south-1.compute.internal   0/2     CrashLoopBackOff   18         24m                                                           
etcd-member-ip-10-0-131-160.ap-south-1.compute.internal          2/2     Running            0          3h53m
...

oc logs etcd-generate-certs-ip-10-0-132-26.ap-south-1.compute.internal -n openshift-etcd -c generate-certs
+ source /run/etcd/environment
++ ETCD_IPV4_ADDRESS=10.0.132.26
++ ETCD_DNS_NAME=etcd-0.xxia-0514.qe.devcluster.openshift.com
+ '[' -e /etc/ssl/etcd/system:etcd-server:etcd-0.xxia-0514.qe.devcluster.openshift.com.crt -a -e /etc/ssl/etcd/system:etcd-server:etcd-0.xxia-0514.qe.devcluster.openshift.com.key ']'
/bin/sh: line 8: ETCD_WILDCARD_DNS_NAME: unbound variable
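
The unbound-variable message above is the shell aborting: the generate-certs script sources /run/etcd/environment, which (per the log) sets ETCD_IPV4_ADDRESS and ETCD_DNS_NAME but not ETCD_WILDCARD_DNS_NAME, and the script runs with nounset, so the first expansion of the missing variable kills the container, hence the CrashLoopBackOff. A minimal sketch of the failure mode, illustrative only and not the actual script:

#!/bin/bash
# Illustrative reproduction of the failure, not the real generate-certs script.
set -eu                       # nounset: expanding an unset variable is fatal

source /run/etcd/environment  # per the log, sets ETCD_IPV4_ADDRESS and ETCD_DNS_NAME

echo "server cert: system:etcd-server:${ETCD_DNS_NAME}"

# If the environment file never defines ETCD_WILDCARD_DNS_NAME, the next line
# aborts with: ETCD_WILDCARD_DNS_NAME: unbound variable
echo "peer SAN: ${ETCD_WILDCARD_DNS_NAME}"

# A defensive default (${VAR:-}) would avoid the abort, but the real fix is to
# populate the wildcard DNS name before the script expands it:
echo "peer SAN: ${ETCD_WILDCARD_DNS_NAME:-}"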

Expected results:
3. recover.sh should complete and grow etcd to include the new host.

Additional info:

Comment 2 Sam Batschelet 2019-05-14 12:47:44 UTC
Can you please provide the following to help debug:

1.) on ip-10-0-132-26: `cat /run/etcd/environment`
2.) the output of tokenize-signer.sh: `cat ./assets/manifests/kube-etcd-cert-signer.yaml`
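
Based on the generate-certs log above, /run/etcd/environment appears to define ETCD_IPV4_ADDRESS and ETCD_DNS_NAME but not ETCD_WILDCARD_DNS_NAME; a quick way to confirm that on ip-10-0-132-26 (commands are illustrative):

cat /run/etcd/environment
# Per the generate-certs log this should show at least:
#   ETCD_IPV4_ADDRESS=10.0.132.26
#   ETCD_DNS_NAME=etcd-0.xxia-0514.qe.devcluster.openshift.com
grep ETCD_WILDCARD_DNS_NAME /run/etcd/environment \
  || echo "ETCD_WILDCARD_DNS_NAME is not set in the environment file"

cat ./assets/manifests/kube-etcd-cert-signer.yaml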

Comment 9 ge liu 2019-05-17 09:15:48 UTC
As noted in comment 7, the PR has not yet been merged to 4.1 and this bug's target release is 4.1, so I re-assigned it; we will try the PR and update later.

Comment 10 Greg Blomquist 2019-05-17 12:54:36 UTC
4.1 PR: https://github.com/openshift/machine-config-operator/pull/771

Comment 15 Xingxing Xia 2019-05-24 09:03:15 UTC
Finally verified in 4.1.0-0.nightly-2019-05-24-040103 by following the latest http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-1-infra-recovery.html together with https://bugzilla.redhat.com/show_bug.cgi?id=1713219#c5. The cluster can be restored back to 3 masters, and a sanity test found no cluster problems.

Comment 17 errata-xmlrpc 2019-06-04 10:48:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

