Bug 1709252

Summary: [DR] Running restore.sh failed with 'Error: data-dir "/var/lib/etcd" exists'
Product: OpenShift Container Platform
Reporter: Xingxing Xia <xxia>
Component: Etcd
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED CURRENTRELEASE
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.1.0
CC: aos-bugs, bleanhar, jokerman, mmccomas, sbatsche, vlaad
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-05-15 18:31:58 UTC
Type: Bug

Description Xingxing Xia 2019-05-13 09:29:48 UTC
Description of problem:
Running restore.sh failed with 'Error: data-dir "/var/lib/etcd" exists' when following the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit
It looks like the script's `cp -rap ${ETCD_DATA_DIR}  $ASSET_DIR/backup/` should use `mv` instead: mv -rap ${ETCD_DATA_DIR}  $ASSET_DIR/backup/?
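
A minimal sketch of why the copy leaves the restore blocked while a move does not. It runs against temporary directories that stand in for the real paths, and it uses a plain `mv`, on the assumption that GNU `mv` does not accept the `-r`, `-a` and `-p` flags that `cp` takes. The directory layout is illustrative, not taken from restore.sh.

```sh
#!/bin/sh
# Sketch only: temporary directories stand in for the real paths.
set -eu
WORK=$(mktemp -d)
ETCD_DATA_DIR="$WORK/var/lib/etcd"   # stand-in for /var/lib/etcd
ASSET_DIR="$WORK/assets"             # stand-in for ./assets
mkdir -p "$ETCD_DATA_DIR/member" "$ASSET_DIR/backup"

# Copying keeps the original directory in place, so a restore into
# the same path would still find it.
cp -rap "$ETCD_DATA_DIR" "$ASSET_DIR/backup/"
[ -d "$ETCD_DATA_DIR" ] && echo "after cp: data-dir still exists"

# Moving removes the original, leaving the path free for the restore.
rm -rf "$ASSET_DIR/backup/etcd"
mv "$ETCD_DATA_DIR" "$ASSET_DIR/backup/"
[ -d "$ETCD_DATA_DIR" ] || echo "after mv: data-dir is gone"
```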

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-release:4.1.0-rc.3

How reproducible:
Always

Steps to Reproduce:
1. Prepare a 4.1 cluster in which only one master remains alive:
Create an IPI 4.1 cluster which has 3 masters and 2 workers.
Check cluster status is normal (e.g. `oc get no`, `oc get co`, `oc get po -A | grep -vE "(Running|Completed)"` etc.)

Then stop one master, xxia-0513-destructive-6ksqs-master-0, and wait a moment (a new master will be recreated by the machine API); the cluster is still alive.

Then stop another master, xxia-0513-destructive-6ksqs-master-1. The cluster will be dead because 2 of 3 etcd pods are down (i.e. etcd quorum loss) due to https://bugzilla.redhat.com/show_bug.cgi?id=1667557

2. Create a bastion, then follow the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit
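
For reference, the quorum loss in step 1 follows from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 of them to be available. A small sketch of that arithmetic, with the member counts taken from the steps above and illustrative variable names:

```sh
#!/bin/sh
# Majority quorum for an etcd cluster: floor(members / 2) + 1.
members=3   # etcd members in the 3-master cluster from step 1
alive=1     # etcd members left once master-0 and master-1 are stopped
quorum=$(( members / 2 + 1 ))

if [ "$alive" -lt "$quorum" ]; then
  status="quorum lost"
else
  status="quorum held"
fi
echo "$status: $alive of $members etcd members alive, $quorum needed"
```

With one master stopped, 2 of 3 members remain and quorum holds, which matches the cluster staying alive after the first stop.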

Actual results:
2. Running the step `sh restore.sh` fails with the error below:
[root@ip-10-0-129-162 xxia-test-master-replacement]# sh restore.sh
Creating asset directory ./assets
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
Backing up etcd data-dir..
Stopping etcd..
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Restoring etcd member etcd-member-ip-10-0-129-162.ap-south-1.compute.internal from snapshot..
2019-05-13 08:32:04.753779 I | pkg/netutil: resolving etcd-2.xxia-0513-destructive.qe.devcluster.openshift.com:2380 to 10.0.129.162:2380
Error: data-dir "/var/lib/etcd" exists

Expected results:
The error should not happen.

Additional info:

Comment 1 Xingxing Xia 2019-05-13 10:26:13 UTC
(In reply to Xingxing Xia from comment #0)
> Looks like the script `cp -rap ${ETCD_DATA_DIR}  $ASSET_DIR/backup/` should
> use `mv`: mv -rap ${ETCD_DATA_DIR}  $ASSET_DIR/backup/?
After retrying with `mv`, the error disappeared, so the script should be updated for users.
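
One way the updated backup step could fail fast is sketched below. This is not the actual restore.sh fix; `backup_data_dir` is a hypothetical helper name, and the demo runs against a temporary directory rather than /var/lib/etcd.

```sh
#!/bin/sh
set -eu

# Hypothetical helper: move the data-dir into the backup directory and
# confirm the original path is free before any restore is attempted.
backup_data_dir() {
  data_dir=$1
  backup_dir=$2
  mkdir -p "$backup_dir"
  mv "$data_dir" "$backup_dir/"
  if [ -e "$data_dir" ]; then
    echo "data-dir $data_dir still exists, not restoring" >&2
    return 1
  fi
}

# Demo against a throwaway directory tree.
demo=$(mktemp -d)
mkdir -p "$demo/etcd/member"
backup_data_dir "$demo/etcd" "$demo/assets/backup"
echo "backup stored under $demo/assets/backup"
```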

Comment 3 Xingxing Xia 2019-05-14 11:11:26 UTC
Verified that the new script works without error while testing https://bugzilla.redhat.com/show_bug.cgi?id=1709802.