Bug 1741101 - Disaster Guide not working including restore ETCD and cert expire.
Summary: Disaster Guide not working including restore ETCD and cert expire.
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-08-14 09:20 UTC by Miheer Salunke
Modified: 2020-02-21 12:01 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-21 12:01:21 UTC
Target Upstream Version:


Description Miheer Salunke 2019-08-14 09:20:47 UTC
Description of problem:

Disaster Guide not working including restore ETCD and cert expire.

We followed the guide and it failed. We do not wish to open a support case during the evaluation period. Please treat this issue as important; it needs the support team's full attention.

Background:
OCP4 is installed on VMware ESXi with vCenter. Our proposed backup and restore approach is to take a snapshot of the CoreOS machines, revert to that snapshot, and then restore etcd.

We have two snapshots:
(1) 06/25/2019 - taken right after the clean install of OCP4, on 4.1.0, without a memory snapshot
(2) 08/09/2019 - taken after OCP4 had been in use for a while, on 4.1.4, including a memory snapshot

In the end, the procedures in the following documents did not work:
[DOC1] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-2-restoring-cluster-state.html
[DOC2] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html



Situation:

On 08/09/2019
We took snapshot (2), backed up etcd, and then upgraded to 4.1.6 using OTA. Afterwards, we reverted to snapshot (2). The UI console and oc both worked. After we restored etcd using DOC1, logging in to the UI console failed intermittently, and we later found that 2 of the 3 api-server instances could not start properly and kept restarting.

On 12/09/2019
Because the environment was not stable, we decided to revert to snapshot (1), the clean install. Once all the machines had started, we found that the UI console and oc could not be accessed. The error message reported that the X509 certificate was expired or invalid. We then reverted to snapshot (2) and got the same error message. We therefore followed the DOC2 procedure, starting from snapshot (2). In the end, the procedure failed and kubelet could not restart.
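
One possible reading of the X509 error above is that certificates captured in an old snapshot are already past their validity period when the machines are started later. A minimal Python sketch of that check follows. The 30-day lifetime is a made-up value for illustration only; this report does not state the actual certificate lifetimes, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def cert_is_valid(not_before, not_after, now):
    """Return True if `now` falls inside the certificate validity window."""
    return not_before <= now <= not_after


# Snapshot (1) date taken from this report.
issued = datetime(2019, 6, 25, tzinfo=timezone.utc)
# ASSUMPTION: hypothetical lifetime, not a value from the report or the product.
lifetime = timedelta(days=30)
expires = issued + lifetime

# Date this bug was reported, used here as a stand-in for the revert date.
checked_at = datetime(2019, 8, 14, tzinfo=timezone.utc)

# The certificate restored from the snapshot is outside its validity window.
print(cert_is_valid(issued, expires, checked_at))
```

Under these assumed values the check prints False, which matches the symptom of a reverted machine presenting an expired certificate.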

Issue 1 - recover-kubeconfig.sh, which DOC2 requires, is missing from the machine; we found it on Bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=1723928).
Issue 2 - After we completed all the steps in DOC2, the procedure failed and kubelet could no longer restart. (The steps included running apiserver-recovery, regenerating and re-accepting the certificates, and restarting kubelet.)

The whole environment is now down.

To summarise: the disaster_recovery guide is not working at all, and we would like a solid guide for backup and restore.


Version-Release number of selected component (if applicable):
OCP 4.1

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Miheer Salunke 2019-08-14 09:21:44 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=1722807 related?

Comment 11 Stephen Cuppett 2020-02-21 12:01:21 UTC
This is being tracked now in: https://issues.redhat.com/browse/OCPPLAN-2339