Bug 1712826 - [DR] 'etcd-snapshot-restore.sh' doesn't support pre-check or re-run
Summary: [DR] 'etcd-snapshot-restore.sh' doesn't support pre-check or re-run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-22 10:21 UTC by ge liu
Modified: 2019-10-16 06:29 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:29:13 UTC
Target Upstream Version:
Embargoed:




Links:
Github: openshift machine-config-operator pull 1060 (last updated 2019-08-14 18:08:33 UTC)
Red Hat Product Errata: RHBA-2019:2922 (last updated 2019-10-16 06:29:33 UTC)

Description ge liu 2019-05-22 10:21:51 UTC
Description of problem:

Currently, 'etcd-snapshot-restore.sh' does not support a pre-check or re-run, so if a command parameter is wrong, the etcd pod on that host is lost after the command is re-run.

In doc section: 'Restoring back to a previous cluster state'
http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-2-restoring-cluster-state.html
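For illustration only, here is a minimal sketch of the kind of pre-flight check the script could run before it stops the static pods or removes /var/lib/etcd. The variable name SNAPSHOT_FILE, the use of 'etcdctl snapshot status', and the assumption that etcdctl is already on the host are mine, not the actual script or the eventual fix:

#!/usr/bin/env bash
# Hypothetical pre-flight check, run before any destructive step.
set -euo pipefail

SNAPSHOT_FILE="${1:?Usage: etcd-snapshot-restore.sh <snapshot> <initial-cluster>}"

# Fail fast if the snapshot path is wrong, instead of discovering this
# only after the static pods are stopped and /var/lib/etcd is removed.
if [ ! -f "$SNAPSHOT_FILE" ]; then
    echo "Snapshot file not found, aborting before any changes: $SNAPSHOT_FILE" >&2
    exit 1
fi

# Optionally confirm the file really is a readable etcd v3 snapshot
# (assumes etcdctl is already available on the host).
if ! ETCDCTL_API=3 etcdctl snapshot status "$SNAPSHOT_FILE" >/dev/null 2>&1; then
    echo "File is not a readable etcd snapshot, aborting: $SNAPSHOT_FILE" >&2
    exit 1
fi

echo "Pre-check passed for $SNAPSHOT_FILE, proceeding with restore.."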

Version-Release number of selected component (if applicable):
4.1.0-0.nightly-2019-05-22-050858

How reproducible:
Always


Steps to Reproduce:

1. Run the restore command with a wrong parameter (a snapshot path that does not exist):

$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/snapshot.db $INITIAL_CLUSTER
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
..stopping etcd-member.yaml
..stopping kube-scheduler-pod.yaml
..stopping kube-apiserver-pod.yaml
..stopping kube-controller-manager-pod.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
Stopping all containers..
dd64fdacbec58768788b7379b1e53e4ed799a73fd6e0d2420770d7021e30b7a5
................
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Snapshot file not found, restore failed: /home/core/snapshot.db.


2. Re-run the command with the correct parameter; this time it succeeds:

$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/backup/snapshot.db $INITIAL_CLUSTER
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
Stopping etcd..
Stopping kubelet..
Stopping all containers..
etcd data-dir backup found ./assets/backup/etcd..
Removing etcd data-dir /var/lib/etcd
Restoring etcd member etcd-member-ip-10-0-145-112.us-east-2.compute.internal from snapshot..
2019-05-22 08:02:43.485146 I | pkg/netutil: resolving etcd-1.qe-geliu-0522.qe.devcluster.openshift.com:2380 to 10.0.145.112:2380
2019-05-22 08:02:43.706751 I | mvcc: restore compact to 19181
2019-05-22 08:02:43.740028 I | etcdserver/membership: added member d27de2dfc9f1364 [https://etcd-1.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
2019-05-22 08:02:43.740079 I | etcdserver/membership: added member 452d4689e721facb [https://etcd-0.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
2019-05-22 08:02:43.740120 I | etcdserver/membership: added member 7b59239765b160e4 [https://etcd-2.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
Starting static pods..
Starting kubelet..
[core@ip-10-0-145-112 ~]$ 

3. Repeat the correct steps on all other master hosts.

4. Check the etcd pods after more than 30 minutes. The etcd pod on the host where the restore command had to be re-run is lost, while the etcd pods on the hosts where the restore command succeeded on the first run are up:

# oc get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-140-26.us-east-2.compute.internal    2/2     Running   0          43m
etcd-member-ip-10-0-173-234.us-east-2.compute.internal   2/2     Running   0          43m


Actual results:
As described in the summary: the etcd pod on the host where the restore command was re-run never comes back.
Expected results:
The restore command should pre-check its parameters or support being re-run.
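For the re-run half of that request, here is a rough sketch of what an idempotent retry could look like, assuming the backup layout the script prints above (./assets/backup/etcd) and the standard 'member' subdirectory of an etcd data dir; the BACKUP_DIR name is mine and this is illustrative only, not the change that was actually merged:

# Hypothetical re-run guard: if an earlier failed invocation already
# removed /var/lib/etcd, put back the data-dir backup the script made
# so a second run starts from a consistent state.
BACKUP_DIR="./assets/backup/etcd"   # path as shown in the script output above

if [ ! -d /var/lib/etcd/member ] && [ -d "$BACKUP_DIR/member" ]; then
    echo "etcd data-dir missing after a previous failed run, restoring backup.."
    rm -rf /var/lib/etcd
    cp -a "$BACKUP_DIR" /var/lib/etcd
fi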

Comment 2 ge liu 2019-05-23 05:26:38 UTC
Sam, 
OK, that makes sense, thanks.

Comment 4 ge liu 2019-08-19 09:03:56 UTC
Hello Alay,

Regarding the time between steps 2 and 3: they happened at almost the same time. I opened 3 terminals, copied the restore command line into each of them, and then pressed 'Enter' one by one (so the time difference should be within 1-5 seconds).


I also hit a new issue with a 4.1 nightly build (4.1.0-0.nightly-2019-08-14-043700): after restoring on the 3 master nodes, the cluster crashed (I have not tried a re-run, just the first run with the default correct steps). I tried several times on different 4.1 environments (with the same payload), so I will open a new bug to track this new issue (must-gather will be attached).

I also tried this bug on 4.2; it passed with the reproduction steps, so I think 4.2 does not have this issue. Since 4.2 is this bug's target release, I verified it with 4.2.0-0.nightly-2019-08-18-222019.

Comment 5 ge liu 2019-08-19 10:09:45 UTC
Hi Alay, the new bug for 4.1.z is here: Bug 1743190 - [DR]Cluster crashed after etcd restored back to previous cluster state.
Please take a look, thanks.

Comment 7 ge liu 2019-08-20 02:11:43 UTC
Alay, I also agree to close it; verified on 4.2, thanks.

Comment 9 errata-xmlrpc 2019-10-16 06:29:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

