Description of problem:
As title. After running etcd-snapshot-restore.sh from the command line, the cluster does not recover and does not come up. The init waits forever.

Version:
4.4.0-0.nightly-2020-02-03-163409

How reproducible:
Always

Steps to Reproduce:
1. Run etcd backup.
2. Run etcd recovery on all master nodes at almost the same time:

sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
etcdctl binary found..
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
..stopping etcd-member.yaml
..stopping kube-scheduler-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
6462b16624c64aa4b8681cb7088a5d99e30e665f0ce9a02350c8da4d6e006e71
................................
2faee1f433da5b7ef17186160649ea70331cc9b91c7912cec0498e61c6a37e70
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Restoring etcd member etcd-member-ip-xx-0-xx-xx.us-east-2.compute.internal from snapshot..
2020-02-04 10:24:23.814873 I | pkg/netutil: resolving etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380 to 10.0.133.86:2380
2020-02-04 10:24:24.888901 I | mvcc: restore compact to 36320
2020-02-04 10:24:24.931880 I | etcdserver/membership: added member 2d943f923b8a592c [https://etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931931 I | etcdserver/membership: added member 6bac41ec208af2e4 [https://etcd-1.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931966 I | etcdserver/membership: added member c00499925d7cc1a6 [https://etcd-2.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
Starting static pods..
..starting etcd-member.yaml
..starting kube-scheduler-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
Starting kubelet..

3. The cluster cannot start.

Actual results:
As title

Expected results:
etcd recovery succeeds and the cluster starts afterwards.
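
The report does not show the value of $INITIAL_CLUSTER used in step 2. As a rough sketch, assuming it follows etcd's usual initial-cluster format (comma-separated name=peerURL pairs), it could be assembled as below. The function name build_initial_cluster is made up for illustration, and the member names in the usage note are placeholders because the report masks the real ones.

```shell
#!/usr/bin/env bash
# Sketch only: build_initial_cluster is a hypothetical helper, not part of
# etcd-snapshot-restore.sh. It joins alternating "name peerURL" arguments
# into the form name1=url1,name2=url2,...
build_initial_cluster() {
  local out="" name url
  while [ "$#" -ge 2 ]; do
    name="$1"
    url="$2"
    shift 2
    # Prefix a comma only when something has already been appended.
    out="${out:+${out},}${name}=${url}"
  done
  printf '%s\n' "$out"
}
```

Possible usage, with placeholder member names (NODE0 and so on stand in for the masked node names) and the peer URLs taken from the log above:

INITIAL_CLUSTER="$(build_initial_cluster etcd-member-NODE0 https://etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380 etcd-member-NODE1 https://etcd-1.geliu0204-5.qe.devcluster.openshift.com:2380 etcd-member-NODE2 https://etcd-2.geliu0204-5.qe.devcluster.openshift.com:2380)"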
Hello Suresh, as we discussed in Slack, I am filing this bug to track this issue.
Working on a script overhaul to accommodate the introduction of cluster-etcd-operator. Made good progress today. Testing out various scenarios, including revision changes to etcd, which make recovery a bit tricky.
Manually tested the changes and confirmed that they work.
blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1808338
Please ignore comment 6; that new bug is a blocker for OCP 4.4.
Failed to verify with 4.5.0-0.ci-2020-03-01-212531. After the restore, the cluster crashed:
1. Create an etcd backup on master 1.
2. Copy the etcd backup to masters 2 and 3.
3. Run the restoration on masters 1, 2, and 3 at the same time.
4. The cluster crashed.
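
For anyone re-running this verification, the copy and simultaneous-restore steps could be scripted roughly as follows. This is only a sketch under several assumptions: it is run from a workstation with SSH access to the masters as user core, the host names passed in are hypothetical, and the parent directory /home/core/assets/ already exists on the target masters. The backup path and restore script path are the ones from the original description. By default it only prints the commands (DRY_RUN=1).

```shell
#!/usr/bin/env bash
# Sketch only: host names are supplied by the caller and are hypothetical.
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
BACKUP_DIR="/home/core/assets/mybackup/"          # path from the report
BACKUP_PARENT="/home/core/assets/"                # assumed to exist on every master
RESTORE="/usr/local/bin/etcd-snapshot-restore.sh" # script path from the report

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# copy_backup SRC DST...: copy the backup from SRC to each DST.
# scp -3 routes the transfer through the local machine.
copy_backup() {
  local src="$1" dst
  shift
  for dst in "$@"; do
    run scp -3 -r "core@${src}:${BACKUP_DIR%/}" "core@${dst}:${BACKUP_PARENT}"
  done
}

# restore_all INITIAL_CLUSTER HOST...: start the restore on every host in the
# background, then wait for all of them. Returns non-zero if any restore failed.
restore_all() {
  local initial_cluster="$1" host pid rc=0
  local pids=()
  shift
  for host in "$@"; do
    run ssh "core@${host}" sudo "$RESTORE" "$BACKUP_DIR" "$initial_cluster" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || rc=1
  done
  return "$rc"
}
```

A dry run such as copy_backup master1 master2 master3 followed by restore_all "$INITIAL_CLUSTER" master1 master2 master3 prints the scp and ssh commands for review. Because the restores start in the background, the order of the printed ssh lines is not guaranteed.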
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409