Description of problem:
As the title says, after running etcd-snapshot-restore.sh the cluster does not recover and does not come up. The init waits forever.
Steps to Reproduce:
1. Run etcd backup
2. Run etcd recovery on all master nodes at almost the same time:
sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
etcdctl binary found..
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
Waiting for etcd-member to stop
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Restoring etcd member etcd-member-ip-xx-0-xx-xx.us-east-2.compute.internal from snapshot..
2020-02-04 10:24:23.814873 I | pkg/netutil: resolving etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380 to 10.0.133.86:2380
2020-02-04 10:24:24.888901 I | mvcc: restore compact to 36320
2020-02-04 10:24:24.931880 I | etcdserver/membership: added member 2d943f923b8a592c [https://etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931931 I | etcdserver/membership: added member 6bac41ec208af2e4 [https://etcd-1.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931966 I | etcdserver/membership: added member c00499925d7cc1a6 [https://etcd-2.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
Starting static pods..
3. The cluster does not start.
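The restore output in step 2 reports three "added member" lines. A minimal sketch for pulling those lines out of a saved copy of the output, to confirm that every node restored the same membership, could look like the following. The function names are hypothetical and not part of etcd-snapshot-restore.sh; the log format is assumed to match the lines pasted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of etcd-snapshot-restore.sh).
# Reads restore output on stdin and prints "<member-id> <peer-url>" for each
# "etcdserver/membership: added member" line, ignoring everything else.
list_restored_members() {
    sed -n 's|.*etcdserver/membership: added member \([0-9a-f]*\) \[\([^]]*\)] to cluster.*|\1 \2|p'
}

# Hypothetical helper: succeeds only if the number of "added member" lines
# on stdin equals the expected member count given as the first argument.
check_member_count() {
    expected="$1"
    actual=$(list_restored_members | wc -l | tr -d ' ')
    [ "$actual" -eq "$expected" ]
}
```

Assuming the restore output was saved to a file such as restore.log (a placeholder name), `list_restored_members < restore.log` prints one line per member, and `check_member_count 3 < restore.log` fails if fewer than three members were added on that node.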
Expected results:
The etcd recovery succeeds and the cluster starts afterward.
Hello Suresh, as we discussed on Slack, I am filing this bug to track this issue.
Working on a script overhaul to accommodate the introduction of cluster-etcd-operator. Made good progress today. Testing various scenarios, including revision changes to etcd, which make recovery a bit tricky.
Manually tested the changes and confirmed that they work.
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1808338
Please ignore comment 6; that new bug is blocking OCP 4.4.
Failed to verify with 4.5.0-0.ci-2020-03-01-212531; after the restore, the cluster crashed:
1. Create an etcd backup on master 1.
2. Copy the etcd backup to masters 2 and 3.
3. Run the restore on masters 1, 2, and 3 at the same time.
4. The cluster crashed.
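Step 3 above can be sketched as a small driver that starts the restore on every master in the background and then waits for all of them. This is only an illustration under assumptions: the host names, the `core` SSH user, and the `DRY_RUN` switch are placeholders that do not come from this report; the script path and backup directory are taken from the original reproduction command. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: host names and the SSH user are placeholders.
MASTERS="${MASTERS:-master-0 master-1 master-2}"        # hypothetical host names
BACKUP_DIR="${BACKUP_DIR:-/home/core/assets/mybackup}"  # path from the report
RESTORE="/usr/local/bin/etcd-snapshot-restore.sh"

# Print the restore command for one host.
# INITIAL_CLUSTER must be set by the caller, as in the original command.
restore_cmd() {
    printf 'ssh core@%s sudo %s %s/ %s\n' "$1" "$RESTORE" "$BACKUP_DIR" "$INITIAL_CLUSTER"
}

# Start the restore on every master at roughly the same time.
# With DRY_RUN=1 (the default) the commands are printed instead of executed.
run_parallel_restore() {
    for host in $MASTERS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            restore_cmd "$host"
        else
            ssh "core@$host" "sudo $RESTORE $BACKUP_DIR/ '$INITIAL_CLUSTER'" &
        fi
    done
    wait   # block until every background restore has exited
}
```

Running `INITIAL_CLUSTER=... run_parallel_restore` prints one command per master; setting `DRY_RUN=0` would execute them concurrently, which is the timing this comment describes.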
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.