Bug 1807959
| Summary: | [DR]etcd restore blocked cluster recovery in ocp 4.4 | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Suresh Kolichala <skolicha> |
| Component: | Etcd | Assignee: | Suresh Kolichala <skolicha> |
| Status: | CLOSED ERRATA | QA Contact: | ge liu <geliu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4 | CC: | amcdermo, geliu |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | 4.4.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1797989 | Environment: | |
| Last Closed: | 2020-05-04 11:43:08 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1797989 | | |
| Bug Blocks: | | | |
Description
Suresh Kolichala
2020-02-27 14:52:25 UTC
Failed to verify with 4.4.0-0.nightly-2020-03-03-033819: after the restore completed, two etcd pods started, but the last one (the pod on the node where the etcd backup was created and then copied to the other masters) could not start up.

1. Take the etcd backup on one master node:

$ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/mybackup
Creating asset directory /home/core/assets
4df5080a94f5a09ae5ba31b9942731df5cf72bf8eca5610b7904e1c54a686714
etcdctl version: 3.3.18
API version: 3.3
Trying to backup etcd client certs..
etcd client certs found in /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7 backing up to /home/core/assets/backup/
Backing up /etc/kubernetes/manifests/etcd-pod.yaml to /home/core/assets/backup/
Trying to backup latest static pod resources..
Snapshot saved at ./assets/mybackup/snapshot_2020-03-03_081740.db
snapshot db and kube resources are successfully saved to ./assets/mybackup!

2. Copy ./assets to the other two master nodes.

3. Run the restore operation on each master node (see the sketch at the end of this report for one way the assets might be distributed and $INITIAL_CLUSTER assembled):

sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
d5caeb16b61e4055dc383e17b2d44750a17da353df5661fbd439314384128124
etcdctl version: 3.3.18
API version: 3.3
etcd-pod.yaml found in /home/core/assets/backup/
Stopping all static pods..
..stopping kube-scheduler-pod.yaml
..stopping etcd-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
Stopping etcd..
Stopping kubelet..
5d349f87a8ca7b1db4bf56b9e7fb18d1f42226993e7ef1a4593b3cbd6f86a0db
....................
cc8877876cf7e1c8555e4431570b08632d6cead1e0bd67af7d6dddcffbc2b856
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Removing newer etcd pod resources...
Copying /home/core/assets/mybackup//snapshot_2020-03-03_081740.db to /var/lib/etcd-backup
Starting static pods..
..starting kube-scheduler-pod.yaml
..starting etcd-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
Copying to /etc/kubernetes/manifests
Starting kubelet..

4. After the restore completed, node status was checked over the next 10 minutes and the cluster crashed.

Verified with 4.4.0-0.nightly-2020-03-06-001126, and added some comments to the doc draft:
https://docs.google.com/document/d/1hIt0qUth5uTAomTzZnhqwBQ51udfHpXn9ScIZ5jb7LA/edit#

*** Bug 1807447 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
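For reference, steps 2 and 3 above rely on the backup assets being present on every master and on an $INITIAL_CLUSTER string listing each etcd member's peer URL. The following is a minimal sketch of that preparation, assuming hypothetical host names (master-0/1/2.example.com), IPs (10.0.0.10-12), and member names; the real values must be taken from the cluster being restored.

```bash
# Sketch only: host names, IPs, and etcd member names are hypothetical
# placeholders, not values from this cluster.

# Copy the backup assets from the recovery master to the other two masters
# (step 2 above).
for host in master-1.example.com master-2.example.com; do
  scp -r /home/core/assets "core@${host}:/home/core/"
done

# Assemble INITIAL_CLUSTER as a comma-separated list of <member-name>=<peer-URL>
# pairs; 2380 is the standard etcd peer port.
INITIAL_CLUSTER="etcd-member-master-0=https://10.0.0.10:2380,etcd-member-master-1=https://10.0.0.11:2380,etcd-member-master-2=https://10.0.0.12:2380"

# Run the restore on each master node (step 3 above).
sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ "$INITIAL_CLUSTER"
```

The sketch only illustrates the expected shape of the arguments; in practice the member names and peer URLs come from the cluster's own etcd configuration.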