Bug 1807959 - [DR]etcd restore blocked cluster recovery in ocp 4.4
Summary: [DR]etcd restore blocked cluster recovery in ocp 4.4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.4.0
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Duplicates: 1807447
Depends On: 1797989
Blocks:
 
Reported: 2020-02-27 14:52 UTC by Suresh Kolichala
Modified: 2020-05-04 11:43 UTC (History)
CC: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1797989
Environment:
Last Closed: 2020-05-04 11:43:08 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1517 0 None closed [release-4.4] Bug 1807959: Changes to fix DR scripts with new CEO environment and manifests 2020-11-04 14:15:46 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:43:31 UTC

Description Suresh Kolichala 2020-02-27 14:52:25 UTC
+++ This bug was initially created as a clone of Bug #1797989 +++

Description of problem:

As the title says, after running the etcd-snapshot-restore.sh command the cluster could not recover and never comes back up; initialization waits forever.

Version: 4.4.0-0.nightly-2020-02-03-163409

How reproducible:
Always

Steps to Reproduce:
1. Run etcd backup
2. Run the etcd restore on all master nodes at almost the same time:

sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
etcdctl binary found..
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
..stopping etcd-member.yaml
..stopping kube-scheduler-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
6462b16624c64aa4b8681cb7088a5d99e30e665f0ce9a02350c8da4d6e006e71
................................
2faee1f433da5b7ef17186160649ea70331cc9b91c7912cec0498e61c6a37e70
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Restoring etcd member etcd-member-ip-xx-0-xx-xx.us-east-2.compute.internal from snapshot..
2020-02-04 10:24:23.814873 I | pkg/netutil: resolving etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380 to 10.0.133.86:2380
2020-02-04 10:24:24.888901 I | mvcc: restore compact to 36320
2020-02-04 10:24:24.931880 I | etcdserver/membership: added member 2d943f923b8a592c [https://etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931931 I | etcdserver/membership: added member 6bac41ec208af2e4 [https://etcd-1.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931966 I | etcdserver/membership: added member c00499925d7cc1a6 [https://etcd-2.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
Starting static pods..
..starting etcd-member.yaml
..starting kube-scheduler-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
Starting kubelet..


3. The cluster could not start.

Actual results:
As described above: the cluster does not come back up after the restore.

Expected results:
The etcd restore succeeds and the cluster starts afterwards.
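
For reference, a minimal sketch, assuming a standard 4.4 installation, of the kind of checks one might run on a master node to decide whether the restore actually brought the cluster back (the commands are illustrative and not taken verbatim from this report):

# Hypothetical post-restore health checks, run on a master node.
# Confirm the restarted static pod containers are actually running:
sudo crictl ps | grep -E 'etcd|kube-apiserver'

# Once the API server answers, confirm nodes and the etcd operator report healthy:
oc get nodes
oc get clusteroperators etcd

In the failure described here, the cluster never reaches the point where these checks succeed.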

--- Additional comment from ge liu on 2020-02-04 11:29:26 UTC ---

Hello Suresh, as we discussed in Slack, I am filing this bug to track this issue.

--- Additional comment from Suresh Kolichala on 2020-02-20 19:08:25 UTC ---

Working on a script overhaul to accommodate the introduction of the cluster-etcd-operator. Made good progress today. Testing out various scenarios, including revision changes to etcd, which make recovery a bit tricky.
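
As context for the revision problem mentioned above, a minimal sketch of how the revisioned static-pod resources can be inspected on a master node; the kube-apiserver path pattern matches what the backup log below prints (kube-apiserver-pod-7), while the etcd directory names are an assumption:

# Each static-pod revision keeps its resources in a numbered directory, so a
# restore has to put back a set of resources consistent with the manifest it restores.
ls -d /etc/kubernetes/static-pod-resources/etcd-pod-* 2>/dev/null
ls -d /etc/kubernetes/static-pod-resources/kube-apiserver-pod-*

# The manifests the kubelet actually runs live here:
ls /etc/kubernetes/manifests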

Comment 3 ge liu 2020-03-02 10:24:41 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1808338

Comment 5 ge liu 2020-03-03 09:33:07 UTC
Failed to verify with 4.4.0-0.nightly-2020-03-03-033819.
After the restore completed, only two etcd pods started; the last one (the master on which the etcd backup was created and from which it was copied to the others) could not start up.

1. $ sudo /usr/local/bin/etcd-snapshot-backup.sh ./assets/mybackup
Creating asset directory /home/core/assets
4df5080a94f5a09ae5ba31b9942731df5cf72bf8eca5610b7904e1c54a686714
etcdctl version: 3.3.18
API version: 3.3
Trying to backup etcd client certs..
etcd client certs found in /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7 backing up to /home/core/assets/backup/
Backing up /etc/kubernetes/manifests/etcd-pod.yaml to /home/core/assets/backup/
Trying to backup latest static pod resources..
Snapshot saved at ./assets/mybackup/snapshot_2020-03-03_081740.db
snapshot db and kube resources are successfully saved to ./assets/mybackup!

2. Copy ./assets to the other two master nodes (a sketch of this copy is included after step 4 below).

3. Run restore operations on each master node:

 sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
d5caeb16b61e4055dc383e17b2d44750a17da353df5661fbd439314384128124
etcdctl version: 3.3.18
API version: 3.3
etcd-pod.yaml found in /home/core/assets/backup/
Stopping all static pods..
..stopping kube-scheduler-pod.yaml
..stopping etcd-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
Stopping etcd..
Stopping kubelet..
5d349f87a8ca7b1db4bf56b9e7fb18d1f42226993e7ef1a4593b3cbd6f86a0db
....................
cc8877876cf7e1c8555e4431570b08632d6cead1e0bd67af7d6dddcffbc2b856
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Removing newer etcd pod resources...
Copying /home/core/assets/mybackup//snapshot_2020-03-03_081740.db to /var/lib/etcd-backup
Starting static pods..
..starting kube-scheduler-pod.yaml
..starting etcd-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
Copying  to /etc/kubernetes/manifests
Starting kubelet..

4. After the restore completed, node status was checked both within and after 10 minutes; the cluster crashed.
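
As referenced in step 2 above, a minimal sketch of how the copy and restore steps could be driven; the node names and cluster domain are hypothetical, and the $INITIAL_CLUSTER value is purely illustrative; the real member names and peer URLs must be taken from your own cluster (compare the member list printed by the restore log in the original description):

# Hypothetical: push the backup assets from the first master to the other two.
for node in master-1.example.com master-2.example.com; do
  scp -r /home/core/assets core@${node}:/home/core/
done

# Hypothetical member list in name=peerURL form; substitute your cluster's values.
INITIAL_CLUSTER="etcd-0=https://etcd-0.<cluster-domain>:2380,etcd-1=https://etcd-1.<cluster-domain>:2380,etcd-2=https://etcd-2.<cluster-domain>:2380"

# Then, on each master:
sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER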

Comment 7 ge liu 2020-03-07 03:08:26 UTC
Verified with 4.4.0-0.nightly-2020-03-06-001126, and added some comments to the doc draft: https://docs.google.com/document/d/1hIt0qUth5uTAomTzZnhqwBQ51udfHpXn9ScIZ5jb7LA/edit#

Comment 8 Suresh Kolichala 2020-03-10 17:52:15 UTC
*** Bug 1807447 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2020-05-04 11:43:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

