Bug 1797989 - [DR]etcd restore blocked cluster recovery in ocp 4.4
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Suresh Kolichala
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks: 1807959
 
Reported: 2020-02-04 11:26 UTC by ge liu
Modified: 2020-07-13 17:14 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
The docs are updated to describe the new scripts for etcd backup and restore.
Clone Of:
: 1807959 (view as bug list)
Environment:
Last Closed: 2020-07-13 17:14:06 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1490 0 None closed Bug 1797989: Changes to fix DR scripts with new CEO environment and manifests 2020-11-12 20:21:24 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-07-13 17:14:28 UTC

Description ge liu 2020-02-04 11:26:49 UTC
Description of problem:

As the title says: after running the etcd-snapshot-restore.sh command, the cluster could not recover and doesn't come up. The init waits forever.

Version: 4.4.0-0.nightly-2020-02-03-163409

How reproducible:
Always

Steps to Reproduce:
1. Run an etcd backup
2. Run the etcd recovery on all master nodes at almost the same time:

sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER
etcdctl binary found..
etcd-member.yaml found in ./assets/backup/
Stopping all static pods..
..stopping etcd-member.yaml
..stopping kube-scheduler-pod.yaml
..stopping kube-controller-manager-pod.yaml
..stopping kube-apiserver-pod.yaml
Stopping etcd..
Waiting for etcd-member to stop
Stopping kubelet..
6462b16624c64aa4b8681cb7088a5d99e30e665f0ce9a02350c8da4d6e006e71
................................
2faee1f433da5b7ef17186160649ea70331cc9b91c7912cec0498e61c6a37e70
Waiting for all containers to stop... (1/60)
All containers are stopped.
Backing up etcd data-dir..
Removing etcd data-dir /var/lib/etcd
Removing newer static pod resources...
Restoring etcd member etcd-member-ip-xx-0-xx-xx.us-east-2.compute.internal from snapshot..
2020-02-04 10:24:23.814873 I | pkg/netutil: resolving etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380 to 10.0.133.86:2380
2020-02-04 10:24:24.888901 I | mvcc: restore compact to 36320
2020-02-04 10:24:24.931880 I | etcdserver/membership: added member 2d943f923b8a592c [https://etcd-0.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931931 I | etcdserver/membership: added member 6bac41ec208af2e4 [https://etcd-1.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
2020-02-04 10:24:24.931966 I | etcdserver/membership: added member c00499925d7cc1a6 [https://etcd-2.geliu0204-5.qe.devcluster.openshift.com:2380] to cluster 2cdebd980fecf908
Starting static pods..
..starting etcd-member.yaml
..starting kube-scheduler-pod.yaml
..starting kube-controller-manager-pod.yaml
..starting kube-apiserver-pod.yaml
Starting kubelet..


3. The cluster could not start
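The $INITIAL_CLUSTER value passed to etcd-snapshot-restore.sh in step 2 follows etcd's standard --initial-cluster format: comma-separated name=peer-URL pairs, one per master, as also visible in the "added member" log lines above. A minimal sketch of assembling that string — the member names and domain below are illustrative placeholders, not taken from this cluster:

```shell
#!/bin/sh
# Build an etcd --initial-cluster string: comma-separated name=peerURL pairs.
# MEMBERS and DOMAIN are hypothetical examples, not values from this cluster.
MEMBERS="etcd-0 etcd-1 etcd-2"
DOMAIN="example.devcluster.openshift.com"
INITIAL_CLUSTER=""
for m in $MEMBERS; do
  # Append "name=https://name.domain:2380", comma-separating after the first entry.
  INITIAL_CLUSTER="${INITIAL_CLUSTER:+$INITIAL_CLUSTER,}${m}=https://${m}.${DOMAIN}:2380"
done
echo "$INITIAL_CLUSTER"
```

In the steps above, the resulting string would be the second argument to the script, e.g. `sudo /usr/local/bin/etcd-snapshot-restore.sh /home/core/assets/mybackup/ $INITIAL_CLUSTER`.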

Actual results:
The cluster does not start after the restore.

Expected results:
The etcd restore succeeds and the cluster starts afterwards.

Comment 1 ge liu 2020-02-04 11:29:26 UTC
Hello Suresh, as we discussed in Slack, I'm filing this bug to track this issue.

Comment 2 Suresh Kolichala 2020-02-20 19:08:25 UTC
Working on a script overhaul to accommodate the introduction of the cluster-etcd-operator. Made good progress today. Testing out various scenarios, including revision changes to etcd, which makes the recovery a bit tricky.

Comment 3 Suresh Kolichala 2020-02-27 15:03:09 UTC
Manually tested the changes; they are working.

Comment 6 ge liu 2020-02-28 09:48:12 UTC
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1808338

Comment 7 ge liu 2020-03-02 10:23:53 UTC
Please ignore comment 6; that new bug blocks OCP 4.4.

Comment 8 ge liu 2020-03-03 09:39:04 UTC
Failed to verify with 4.5.0-0.ci-2020-03-01-212531; after the restore, the cluster crashed:
1. Create an etcd backup on master 1
2. Copy the etcd backup to masters 2 and 3
3. Run the restoration on masters 1, 2, and 3 at the same time
4. The cluster crashed

Comment 12 errata-xmlrpc 2020-07-13 17:14:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

