Bug 1712826

Summary: [DR] 'etcd-snapshot-restore.sh' doesn't support pre-check or re-run
Product: OpenShift Container Platform
Component: Etcd
Version: 4.1.0
Target Release: 4.2.0
Target Milestone: ---
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: ge liu <geliu>
Assignee: Sam Batschelet <sbatsche>
QA Contact: ge liu <geliu>
CC: alpatel
Keywords: Regression
Last Closed: 2019-10-16 06:29:13 UTC
Type: Bug

Description ge liu 2019-05-22 10:21:51 UTC
Description of problem:

Currently, 'etcd-snapshot-restore.sh' does not pre-check its parameters and does not support being re-run, so if a parameter is wrong, the etcd pod on that host is lost after the command is re-run.
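
For illustration, a minimal sketch of the kind of up-front validation this report asks for, assuming the script keeps its current two positional arguments (the variable names and messages below are illustrative, not taken from the actual script):

  #!/usr/bin/env bash
  # Illustrative pre-check only -- not the real etcd-snapshot-restore.sh.
  SNAPSHOT_FILE="$1"
  INITIAL_CLUSTER="$2"

  # Validate both arguments before stopping static pods or touching /var/lib/etcd,
  # so a typo in the snapshot path fails fast instead of leaving the node broken.
  if [ -z "$SNAPSHOT_FILE" ] || [ -z "$INITIAL_CLUSTER" ]; then
      echo "usage: $0 <snapshot-file> <initial-cluster>" >&2
      exit 1
  fi
  if [ ! -f "$SNAPSHOT_FILE" ]; then
      echo "Snapshot file not found: $SNAPSHOT_FILE (aborting before any destructive step)" >&2
      exit 1
  fi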

The documented procedure is the 'Restoring back to a previous cluster state' section:
http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-2-restoring-cluster-state.html

Version-Release number of selected component:
4.1.0-0.nightly-2019-05-22-050858

How reproducible:
Always


Steps to Reproduce:

1. Run the restore command with a wrong parameter (a snapshot path that does not exist):

$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/snapshot.db $INITIAL_CLUSTER
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
..stopping etcd-member.yaml
..stopping kube-scheduler-pod.yaml
..stopping kube-apiserver-pod.yaml
..stopping kube-controller-manager-pod.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
Stopping all containers..
dd64fdacbec58768788b7379b1e53e4ed799a73fd6e0d2420770d7021e30b7a5
................
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Snapshot file not found, restore failed: /home/core/snapshot.db.


2. Re-run the command with the correct parameter; it succeeds:

$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/backup/snapshot.db $INITIAL_CLUSTER
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
Stopping etcd..
Stopping kubelet..
Stopping all containers..
etcd data-dir backup found ./assets/backup/etcd..
Removing etcd data-dir /var/lib/etcd
Restoring etcd member etcd-member-ip-10-0-145-112.us-east-2.compute.internal from snapshot..
2019-05-22 08:02:43.485146 I | pkg/netutil: resolving etcd-1.qe-geliu-0522.qe.devcluster.openshift.com:2380 to 10.0.145.112:2380
2019-05-22 08:02:43.706751 I | mvcc: restore compact to 19181
2019-05-22 08:02:43.740028 I | etcdserver/membership: added member d27de2dfc9f1364 [https://etcd-1.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
2019-05-22 08:02:43.740079 I | etcdserver/membership: added member 452d4689e721facb [https://etcd-0.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
2019-05-22 08:02:43.740120 I | etcdserver/membership: added member 7b59239765b160e4 [https://etcd-2.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
Starting static pods..
Starting kubelet..
[core@ip-10-0-145-112 ~]$ 

3. Repeat the correct steps on all the other master hosts.

4. Check the etcd pods after more than 30 minutes: the etcd pod on the host where the restore command had to be re-run is lost, while the etcd pods on the hosts where the restore succeeded on the first run are up:

# oc get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-140-26.us-east-2.compute.internal    2/2     Running   0          43m
etcd-member-ip-10-0-173-234.us-east-2.compute.internal   2/2     Running   0          43m


Actual results:
As in the title: the etcd pod is lost on the host where the restore command was re-run.

Expected results:
The restore command pre-checks its parameters or supports being re-run.
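
Since the script already downloads an etcdctl binary, one possible pre-check (a hedged sketch, not the script's actual behavior; the snapshot path is the correct one from step 2) would be to verify that the file is a readable, valid etcd snapshot before any pod is stopped or /var/lib/etcd is removed:

  # Hedged example: fails fast if the path is wrong or the file is not an etcd snapshot.
  # The etcdctl on PATH is illustrative; the script downloads its own copy.
  sudo ETCDCTL_API=3 etcdctl snapshot status /home/core/assets/backup/snapshot.db -w table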

Comment 2 ge liu 2019-05-23 05:26:38 UTC
Sam,
OK, that makes sense, thanks.

Comment 4 ge liu 2019-08-19 09:03:56 UTC
Hello Alay,

Regarding the time between step 2 and step 3: they ran at almost the same time. I opened 3 terminals, copied the restore command line into each of them, and then pressed 'Enter' one by one, so the time difference should be 1-5 seconds.

I also hit a new issue with a 4.1 nightly build (4.1.0-0.nightly-2019-08-14-043700): after restoring on the 3 master nodes, the cluster crashed (I had not tried a re-run, just a first run with the default, correct steps). I reproduced it several times on different 4.1 environments with the same payload, so I will open a new bug to track this new issue (a must-gather will be attached).

I also tried this bug on 4.2 and it passed with the reproduction steps, so I think 4.2 does not have this issue. Since 4.2.0 is this bug's target release, I verified it with 4.2.0-0.nightly-2019-08-18-222019.

Comment 5 ge liu 2019-08-19 10:09:45 UTC
Hi Alay, the new bug for 4.1.z is here: Bug 1743190 - [DR] Cluster crashed after etcd was restored back to a previous cluster state.
Please take a look, thanks.

Comment 7 ge liu 2019-08-20 02:11:43 UTC
Alay, I also agree to close it; verified on 4.2, thanks.

Comment 9 errata-xmlrpc 2019-10-16 06:29:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922