Bug 1797989 - [DR]etcd restore blocked cluster recovery in ocp 4.4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1807959
 
Reported: 2020-02-04 11:26 UTC by ge liu
Modified: 2020-07-13 17:14 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
The docs are updated to describe the new scripts for etcd backup and restore.
Clone Of:
: 1807959 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:14:06 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1490 0 None closed Bug 1797989: Changes to fix DR scripts with new CEO environment and manifests 2020-11-12 20:21:24 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:14:28 UTC

Description ge liu 2020-02-04 11:26:49 UTC
Description of problem:

As the title says: after running the etcd-snapshot-restore.sh command, the cluster could not recover and doesn't come up. The init waits forever.

Version: 4.4.0-0.nightly-2020-02-03-163409

How reproducible:
Always

Steps to Reproduce:
1. Run an etcd backup
2. Run the etcd recovery on all master nodes at almost the same time:

sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
etcdctl binary found..
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
..stopping etcd-member.yaml
..stopping kube-scheduler-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
6462b16624c64aa4b8681cb7088a5d99e30e665f0ce9a02350c8da4d6e006e71
................................
2faee1f433da5b7ef17186160649ea70331cc9b91c7912cec0498e61c6a37e70
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Restoring etcd member etcd-member-ip-xx-0-xx-xx.us-east-2.compute.internal from snapshot..
2020-02-04 10:24:23.814873 I | pkg/netutil: resolving etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380 to 10.0.133.86:2380
2020-02-04 10:24:24.888901 I | mvcc: restore compact to 36320
2020-02-04 10:24:24.931880 I | etcdserver/membership: added member 2d943f923b8a592c [https://etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931931 I | etcdserver/membership: added member 6bac41ec208af2e4 [https://etcd-1.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931966 I | etcdserver/membership: added member c00499925d7cc1a6 [https://etcd-2.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
Starting static pods..
..starting etcd-member.yaml
..starting kube-scheduler-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
Starting kubelet..


3. The cluster could not start
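The $INITIAL_CLUSTER value passed to etcd-snapshot-restore.sh in step 2 follows etcd's standard --initial-cluster format: comma-separated name=peer-URL pairs, one per master, as also visible in the "added member" log lines above. A minimal sketch of assembling that string — the member names and domain below are illustrative placeholders, not taken from this cluster:

```shell
#!/bin/sh
# Build an etcd --initial-cluster string: comma-separated name=peerURL pairs.
# MEMBERS and DOMAIN are hypothetical examples, not values from this cluster.
MEMBERS="etcd-0 etcd-1 etcd-2"
DOMAIN="example.devcluster.openshift.com"
INITIAL_CLUSTER=""
for m in $MEMBERS; do
  # Append "name=https://name.domain:2380", comma-separating after the first entry.
  INITIAL_CLUSTER="${INITIAL_CLUSTER:+$INITIAL_CLUSTER,}${m}=https://${m}.${DOMAIN}:2380"
done
echo "$INITIAL_CLUSTER"
```

In the steps above, the resulting string would be the second argument to the script, e.g. `sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER`.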

Actual results:
The cluster does not start after the restore.

Expected results:
The etcd restore succeeds and the cluster starts afterwards.

Comment 1 ge liu 2020-02-04 11:29:26 UTC
Hello Suresh, as we discussed in Slack, I'm filing this bug to track this issue.

Comment 2 Suresh Kolichala 2020-02-20 19:08:25 UTC
Working on a script overhaul to accommodate the introduction of the cluster-etcd-operator. Made good progress today. Testing out various scenarios, including revision changes to etcd, which makes the recovery a bit tricky.

Comment 3 Suresh Kolichala 2020-02-27 15:03:09 UTC
Manually tested the changes; they are working.

Comment 6 ge liu 2020-02-28 09:48:12 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1808338

Comment 7 ge liu 2020-03-02 10:23:53 UTC
Please ignore comment 6; that new bug blocks OCP 4.4.

Comment 8 ge liu 2020-03-03 09:39:04 UTC
Failed to verify with 4.5.0-0.ci-2020-03-01-212531; after the restore, the cluster crashed:
1. Create an etcd backup on master 1
2. Copy the etcd backup to masters 2 and 3
3. Run the restoration on masters 1, 2, and 3 at the same time
4. The cluster crashed

Comment 12 errata-xmlrpc 2020-07-13 17:14:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

