+++ This bug was initially created as a clone of Bug #1797989 +++

Description of problem:
After running the etcd-snapshot-restore.sh command, the cluster does not recover; it never comes back up and the init waits forever.

Version:
4.4.0-0.nightly-2020-02-03-163409

How reproducible:
Always

Steps to Reproduce:
1. Run the etcd backup.
2. Run etcd recovery on all master nodes at almost the same time:

sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
etcdctl binary found..
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
..stopping etcd-member.yaml
..stopping kube-scheduler-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
6462b16624c64aa4b8681cb7088a5d99e30e665f0ce9a02350c8da4d6e006e71
................................
2faee1f433da5b7ef17186160649ea70331cc9b91c7912cec0498e61c6a37e70
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Restoring etcd member etcd-member-ip-xx-0-xx-xx.us-east-2.compute.internal from snapshot..
2020-02-04 10:24:23.814873 I | pkg/netutil: resolving etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380 to 10.0.133.86:2380
2020-02-04 10:24:24.888901 I | mvcc: restore compact to 36320
2020-02-04 10:24:24.931880 I | etcdserver/membership: added member 2d943f923b8a592c [https://etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931931 I | etcdserver/membership: added member 6bac41ec208af2e4 [https://etcd-1.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931966 I | etcdserver/membership: added member c00499925d7cc1a6 [https://etcd-2.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
Starting static pods..
..starting etcd-member.yaml
..starting kube-scheduler-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
Starting kubelet..

3. The cluster does not start.

Actual results:
The cluster does not come back up after the restore.

Expected results:
etcd recovery succeeds and the cluster starts afterwards.

--- Additional comment from ge liu on 2020-02-04 11:29:26 UTC ---

Hello Suresh, as we discussed on Slack, filing this bug to track the issue.

--- Additional comment from Suresh Kolichala on 2020-02-20 19:08:25 UTC ---

Working on a script overhaul to accommodate the introduction of the cluster-etcd-operator. Made good progress today. Testing out various scenarios, including revision changes to etcd, which makes recovery a bit tricky.
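For context, a minimal sketch of how the restore step above is typically driven from each master. The cluster domain, member names, and INITIAL_CLUSTER value below are illustrative placeholders (they are not taken from this cluster); only the script path and its two arguments come from the output above.

    # Hypothetical values; substitute the real master member names and cluster domain.
    CLUSTER_DOMAIN="example.devcluster.openshift.com"
    # INITIAL_CLUSTER maps each etcd member name to its peer URL (standard etcd --initial-cluster format).
    export INITIAL_CLUSTER="etcd-member-master-0=https://etcd-0.${CLUSTER_DOMAIN}:2380,etcd-member-master-1=https://etcd-1.${CLUSTER_DOMAIN}:2380,etcd-member-master-2=https://etcd-2.${CLUSTER_DOMAIN}:2380"

    # Step 2 above: run the restore on each master, passing the backup directory and the member list.
    sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER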
blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1808338
Failed to verify it with 4.4.0-0.nightly-2020-03-03-033819. After the restore completed, two etcd pods started, but the last one (the one on which the etcd backup was created and then copied to the others) could not start up.

1. $ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/mybackup
Creating asset directory /home/core/assets
4df5080a94f5a09ae5ba31b9942731df5cf72bf8eca5610b7904e1c54a686714
etcdctl version: 3.3.18
API version: 3.3
Trying to backup etcd client certs..
etcd client certs found in /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7 backing up to /home/core/assets/backup/
Backing up /etc/kubernetes/manifests/etcd-pod.yaml to /home/core/assets/backup/
Trying to backup latest static pod resources..
Snapshot saved at ./assets/mybackup/snapshot_2020-03-03_081740.db
snapshot db and kube resources are successfully saved to ./assets/mybackup!

2. Copy ./assets to the other 2 master nodes.

3. Run the restore operation on each master node:

sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
d5caeb16b61e4055dc383e17b2d44750a17da353df5661fbd439314384128124
etcdctl version: 3.3.18
API version: 3.3
etcd-pod.yaml found in /home/core/assets/backup/
Stopping all static pods..
..stopping kube-scheduler-pod.yaml
..stopping etcd-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
Stopping etcd..
Stopping kubelet..
5d349f87a8ca7b1db4bf56b9e7fb18d1f42226993e7ef1a4593b3cbd6f86a0db
....................
cc8877876cf7e1c8555e4431570b08632d6cead1e0bd67af7d6dddcffbc2b856
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Removing newer etcd pod resources...
Copying /home/core/assets/mybackup//snapshot_2020-03-03_081740.db to /var/lib/etcd-backup
Starting static pods..
..starting kube-scheduler-pod.yaml
..starting etcd-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
Copying to /etc/kubernetes/manifests
Starting kubelet..

4. After the restore completed, checked node status within and after 10 minutes; the cluster crashed:
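For reference, a rough sketch of the verification flow above. Host names, copy paths, and the status-check commands are assumptions added for illustration; the backup and restore script invocations themselves are the ones shown in the output.

    # Step 1: on the first master, take the backup into a local asset directory.
    sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/mybackup

    # Step 2: copy the backup assets to the other two masters (host names are placeholders).
    scp -r ./assets core@master-1:/home/core/
    scp -r ./assets core@master-2:/home/core/

    # Step 3: on each master, restore from the copied snapshot (INITIAL_CLUSTER as discussed above).
    sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER

    # Step 4: from a host with a working kubeconfig, watch node and etcd pod status
    # (namespace assumed for a 4.4 cluster-etcd-operator managed cluster).
    oc get nodes
    oc get pods -n openshift-etcd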
Verified with 4.4.0-0.nightly-2020-03-06-001126, and added some comments to the doc draft: https://docs.google.com/document/d/1hIt0qUth5uTAomTzZnhqwBQ51udfHpXn9ScIZ5jb7LA/edit#
*** Bug 1807447 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581