1709252 – [DR] Running restore.sh failed with 'Error: data-dir "/var/lib/etcd" exists'

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Etcd
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Sam Batschelet
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-05-13 09:29 UTC by Xingxing Xia
Modified:	2019-05-15 18:31 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-05-15 18:31:58 UTC
Target Upstream Version:
Embargoed:

Description Xingxing Xia 2019-05-13 09:29:48 UTC

Description of problem:
Running restore.sh fails with 'Error: data-dir "/var/lib/etcd" exists' when following the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit
It looks like the script's `cp -rap ${ETCD_DATA_DIR} $ASSET_DIR/backup/` step should use `mv` instead: mv -rap ${ETCD_DATA_DIR} $ASSET_DIR/backup/?
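
To illustrate the suspected cause, here is a minimal sketch that compares the two backup approaches on a scratch directory. The paths are hypothetical stand-ins created with mktemp, not the real /var/lib/etcd, and this is not the actual restore.sh. Plain `mv` is used because `mv` does not accept `cp`'s `-rap` flags.

```shell
#!/bin/sh
# Scratch stand-ins for the variables named in restore.sh (assumed layout).
WORK_DIR=$(mktemp -d)
ETCD_DATA_DIR="$WORK_DIR/etcd"
ASSET_DIR="$WORK_DIR/assets"
mkdir -p "$ETCD_DATA_DIR/member" "$ASSET_DIR/backup"
echo "wal" > "$ETCD_DATA_DIR/member/wal"

# cp -rap copies the tree, so the original data-dir is still in place afterwards.
cp -rap "$ETCD_DATA_DIR" "$ASSET_DIR/backup/"
if [ -d "$ETCD_DATA_DIR" ]; then AFTER_CP="exists"; else AFTER_CP="gone"; fi

# Discard the copy, then back up with mv: the data-dir path is freed.
rm -rf "$ASSET_DIR/backup/etcd"
mv "$ETCD_DATA_DIR" "$ASSET_DIR/backup/"
if [ -d "$ETCD_DATA_DIR" ]; then AFTER_MV="exists"; else AFTER_MV="gone"; fi

echo "data-dir after cp: $AFTER_CP"
echo "data-dir after mv: $AFTER_MV"
```

With `cp`, the data-dir is still present when the restore step runs, which matches the reported error; with `mv`, the backup keeps the contents and the original path no longer exists.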

Version-Release number of selected component (if applicable):
quay.io/openshift-release-dev/ocp-release:4.1.0-rc.3

How reproducible:
Always

Steps to Reproduce:
1. Prepare a 4.1 cluster in which only one master remains alive:
Create an IPI 4.1 cluster which has 3 masters and 2 workers.
Check cluster status is normal (e.g. `oc get no`, `oc get co`, `oc get po -A | grep -vE "(Running|Completed)"` etc.)

Then stop one master, xxia-0513-destructive-6ksqs-master-0, and wait a moment (a new master will be recreated by the machine API); the cluster is still alive.

Then stop another master, xxia-0513-destructive-6ksqs-master-1. The cluster will be dead because 2 of 3 etcd pods are down (i.e. etcd quorum loss) due to https://bugzilla.redhat.com/show_bug.cgi?id=1667557

2. Create a bastion, then follow the doc https://docs.google.com/document/d/1Z7xow84WdLUkgFiOaeY-QXmH1H8wnTg2vP1pQiuj22o/edit

Actual results:
2. Running the step `sh restore.sh` fails with the following error:
[root@ip-10-0-129-162 xxia-test-master-replacement]# sh restore.sh
Creating asset directory ./assets
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
Backing up etcd data-dir..
Stopping etcd..
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Waiting for etcd-member to stop
Restoring etcd member etcd-member-ip-10-0-129-162.ap-south-1.compute.internal from snapshot..
2019-05-13 08:32:04.753779 I | pkg/netutil: resolving etcd-2.xxia-0513-destructive.qe.devcluster.openshift.com:2380 to 10.0.129.162:2380
Error: data-dir "/var/lib/etcd" exists

Expected results:
The error should not happen.
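
One way to surface this condition earlier would be a pre-flight check before the restore step. The sketch below is only an illustration: `require_absent_data_dir` is a hypothetical helper, not part of restore.sh, and it simply mirrors the condition reported in the log above (the restore refuses to run while the data-dir path exists). It is exercised against a scratch directory rather than /var/lib/etcd.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when the data-dir path is free,
# 1 (with a message on stderr) when something still exists there.
require_absent_data_dir() {
  data_dir="$1"
  if [ -e "$data_dir" ]; then
    echo "data-dir \"$data_dir\" still exists; move it aside before restoring" >&2
    return 1
  fi
  return 0
}

# Example run on a scratch directory.
SCRATCH=$(mktemp -d)
mkdir "$SCRATCH/etcd"

if require_absent_data_dir "$SCRATCH/etcd"; then BEFORE_MOVE="ok"; else BEFORE_MOVE="blocked"; fi

# Moving the directory aside frees the path, so the check now passes.
mv "$SCRATCH/etcd" "$SCRATCH/etcd.bak"
if require_absent_data_dir "$SCRATCH/etcd"; then AFTER_MOVE="ok"; else AFTER_MOVE="blocked"; fi

echo "before move: $BEFORE_MOVE, after move: $AFTER_MOVE"
```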

Additional info:

Comment 1 Xingxing Xia 2019-05-13 10:26:13 UTC

(In reply to Xingxing Xia from comment #0)
> Looks like the script `cp -rap ${ETCD_DATA_DIR}  $ASSET_DIR/backup/` should
> use `mv`: mv -rap ${ETCD_DATA_DIR}  $ASSET_DIR/backup/?
After retrying with `mv`, the error disappeared, so the script should be updated for users.

Comment 3 Xingxing Xia 2019-05-14 11:11:26 UTC

Verified that the new script works without error while testing https://bugzilla.redhat.com/show_bug.cgi?id=1709802.
