Description of problem:
Currently, 'etcd-snapshot-restore.sh' doesn't support a pre-check or re-run, so if the command parameters contain an error, etcd pods will be lost after re-running the command.

In doc section: 'Restoring back to a previous cluster state'
http://file.rdu.redhat.com/~ahoffer/2019/disaster-recovery/disaster_recovery/scenario-2-restoring-cluster-state.html

4.1.0-0.nightly-2019-05-22-050858

How reproducible:
Always

Steps to Reproduce:
1. Run the restore command with a wrong parameter:

$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/snapshot.db $INITIAL_CLUSTER
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
..stopping etcd-member.yaml
..stopping kube-scheduler-pod.yaml
..stopping kube-apiserver-pod.yaml
..stopping kube-controller-manager-pod.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
Stopping all containers..
dd64fdacbec58768788b7379b1e53e4ed799a73fd6e0d2420770d7021e30b7a5
................
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Snapshot file not found, restore failed: /home/core/snapshot.db.

2. Re-run the command with the correct parameter; it completes successfully:

$ sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/backup/snapshot.db $INITIAL_CLUSTER
Downloading etcdctl binary..
etcdctl version: 3.3.10
API version: 3.3
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
Stopping etcd..
Stopping kubelet..
Stopping all containers..
etcd data-dir backup found ./assets/backup/etcd..
Removing etcd data-dir /var/lib/etcd
Restoring etcd member etcd-member-ip-10-0-145-112.us-east-2.compute.internal from snapshot..
2019-05-22 08:02:43.485146 I | pkg/netutil: resolving etcd-1.qe-geliu-0522.qe.devcluster.openshift.com:2380 to 10.0.145.112:2380
2019-05-22 08:02:43.706751 I | mvcc: restore compact to 19181
2019-05-22 08:02:43.740028 I | etcdserver/membership: added member d27de2dfc9f1364 [https://etcd-1.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
2019-05-22 08:02:43.740079 I | etcdserver/membership: added member 452d4689e721facb [https://etcd-0.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
2019-05-22 08:02:43.740120 I | etcdserver/membership: added member 7b59239765b160e4 [https://etcd-2.qe-geliu-0522.qe.devcluster.openshift.com:2380] to cluster 8b87ee45d99db215
Starting static pods..
Starting kubelet..
[core@ip-10-0-145-112 ~]$

3. Repeat the correct steps on all other master hosts.

4. Check the etcd pods after more than 30 minutes. The etcd pods on hosts where the restore command had to be re-run are lost; only the etcd pods on hosts where the restore command succeeded on the first run are started:

# oc get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-ip-10-0-140-26.us-east-2.compute.internal    2/2     Running   0          43m
etcd-member-ip-10-0-173-234.us-east-2.compute.internal   2/2     Running   0          43m

Actual results:
As described above.

Expected results:
The restore command performs a pre-check, or supports being re-run.
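The failure in step 1 happens because the script stops the static pods and removes /var/lib/etcd before it ever checks that the snapshot argument points at a real file. A minimal sketch of the requested pre-check is below; the function name `precheck_snapshot` is hypothetical and not part of the actual etcd-snapshot-restore.sh script:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: validate the snapshot argument BEFORE any
# destructive step (stopping static pods, removing /var/lib/etcd).
# precheck_snapshot is an illustrative helper, not the real script's API.

precheck_snapshot() {
  local snapshot_file="$1"

  # Reject a missing argument up front.
  if [ -z "$snapshot_file" ]; then
    echo "usage: etcd-snapshot-restore.sh <snapshot file> <INITIAL_CLUSTER>" >&2
    return 1
  fi

  # Reject a nonexistent snapshot path while the cluster is still untouched.
  if [ ! -f "$snapshot_file" ]; then
    echo "Snapshot file not found, aborting before stopping pods: $snapshot_file" >&2
    return 1
  fi

  # If etcdctl is already available, its integrity check could also run
  # here, e.g.: etcdctl snapshot status "$snapshot_file"
  return 0
}
```

Calling this at the top of the script would make the step 1 failure exit cleanly before the pods are stopped, so no re-run of the destructive path is needed.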
Sam, OK, it makes sense, thanks.
Hello Alay,

Regarding the time between steps 2 and 3: they ran at almost the same time. I opened 3 terminals, copied the restore command line into each of them, and then pressed 'Enter' one by one, so the time difference should be within 1-5 seconds.

I also hit a new issue with a 4.1 nightly build (4.1.0-0.nightly-2019-08-14-043700): after restoring on the 3 master nodes, the cluster crashed (I had not tried a re-run, just the first run with the default correct steps). I tried several times on different 4.1 environments (with the same payload), so I will open a new bug to track this new issue (must-gather will be attached).

I also tried the reproduction steps on 4.2 and it passed, so I think 4.2 does not have this issue. Because of this bug's target release, I verified it with 4.2.0-0.nightly-2019-08-18-222019.
Hi Alay, the new bug for 4.1.z is here: Bug 1743190 - [DR] Cluster crashed after etcd restored back to previous cluster state. Please take a look, thanks.
Alay, I also agree to close it; verified on 4.2, thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922