Bug 1709252

Summary:	[DR] Running restore.sh failed with 'Error: data-dir "/var/lib/etcd" exists'
Product:	OpenShift Container Platform	Reporter:	Xingxing Xia <xxia>
Component:	Etcd	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED CURRENTRELEASE	QA Contact:	ge liu <geliu>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.1.0	CC:	aos-bugs, bleanhar, jokerman, mmccomas, sbatsche, vlaad
Target Milestone:	---
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-05-15 18:31:58 UTC	Type:	Bug

Description Xingxing Xia 2019-05-13 09:29:48 UTC

Description of problem:
Running restore.sh failed with 'Error: data-dir "/var/lib/etcd" exists' when following the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit
It looks like the script's `cp -rap ${ETCD_DATA_DIR} $ASSET_DIR/backup/` step leaves the data-dir in place and should use `mv` instead: mv -rap ${ETCD_DATA_DIR} $ASSET_DIR/backup/?

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-release:4.1.0-rc.3

How reproducible:
Always

Steps to Reproduce:
1. Prepare a 4.1 cluster in which only one master remains alive, as follows:
Create an IPI 4.1 cluster which has 3 masters and 2 workers.
Check cluster status is normal (e.g. `oc get no`, `oc get co`, `oc get po -A | grep -vE "(Running|Completed)"` etc.)

Then stop one master, xxia-0513-destructive-6ksqs-master-0, and wait a moment (a new master will be recreated by the machine API). The cluster is still alive.

Then stop another master, xxia-0513-destructive-6ksqs-master-1. The cluster goes down because 2 of 3 etcd pods are down (i.e. etcd quorum loss) due to https://bugzilla.redhat.com/show_bug.cgi?id=1667557
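
For reference, the quorum arithmetic behind this step can be sketched as below. This is only an illustration: the function names `quorum` and `has_quorum` are made up here and are not part of restore.sh. It assumes the usual majority rule, where a cluster of n etcd members needs floor(n/2) + 1 healthy members.

```shell
#!/bin/sh
# Hypothetical helpers, not part of restore.sh.

# quorum N: print the number of members needed for a majority of N.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum MEMBERS HEALTHY: succeed if HEALTHY members still form a majority.
has_quorum() {
    [ "$2" -ge "$(quorum "$1")" ]
}

# Walk through the scenario above: 3 members, then 1 down, then 2 down.
members=3
for healthy in 3 2 1; do
    if has_quorum "$members" "$healthy"; then
        echo "$healthy of $members healthy: quorum kept"
    else
        echo "$healthy of $members healthy: quorum lost"
    fi
done
```

With 3 members the majority is 2, so the cluster survives the first master being stopped but not the second.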

2. Create a bastion, then follow the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit

Actual results:
2. At the `sh restore.sh` step, the script fails with the following error:
[root@ip-10-0-129-162 xxia-test-master-replacement]# sh restore.sh
Creating asset directory ./assets
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
Backing up etcd data-dir..
Stopping etcd..
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Restoring etcd member etcd-member-ip-10-0-129-162.ap-south-1.compute.internal from snapshot..
2019-05-13 08:32:04.753779 I | pkg/netutil: resolving etcd-2.xxia-0513-destructive.qe.devcluster.openshift.com:2380 to 10.0.129.162:2380
Error: data-dir "/var/lib/etcd" exists

Expected results:
The error should not happen.

Additional info:

Comment 1 Xingxing Xia 2019-05-13 10:26:13 UTC

(In reply to Xingxing Xia from comment #0)
> Looks like the script `cp -rap ${ETCD_DATA_DIR}  $ASSET_DIR/backup/` should
> use `mv`: mv -rap ${ETCD_DATA_DIR}  $ASSET_DIR/backup/?
After retrying with `mv`, the error disappeared. So the script should be updated before users run it.
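
A minimal sketch of the backup step using `mv`, run on throwaway paths. This is not the actual restore.sh code: the helper name `backup_data_dir` and the temporary directory layout are assumptions for illustration. Note also that, as far as I know, `mv` accepts none of the `-r`, `-a`, `-p` flags that `cp` does, so the sketch uses plain `mv`.

```shell
#!/bin/sh
# Hypothetical helper, not taken from restore.sh.
# Moves the etcd data-dir into the backup directory. Unlike cp, mv removes
# the source path, so the later restore can recreate the data-dir.
backup_data_dir() {
    data_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    if [ -d "$data_dir" ]; then
        mv "$data_dir" "$backup_dir/" || return 1
    fi
    # Succeed only if the data-dir is now absent.
    [ ! -e "$data_dir" ]
}

# Demo on a temporary directory standing in for /var/lib/etcd and ./assets.
work=$(mktemp -d)
mkdir -p "$work/etcd/member"
echo data > "$work/etcd/member/db"

if backup_data_dir "$work/etcd" "$work/assets/backup"; then
    echo "data-dir moved to backup; restore can proceed"
else
    echo "data-dir still present; restore would fail" >&2
fi
```

The final check in the helper mirrors the failure seen above: the restore step refused to run because the data-dir still existed after the backup.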

Comment 3 Xingxing Xia 2019-05-14 11:11:26 UTC

Verified that the new script works without error while testing https://bugzilla.redhat.com/show_bug.cgi?id=1709802.