Bug 1741101

Summary: Disaster Guide not working including restore ETCD and cert expire.
Product: OpenShift Container Platform Reporter: Miheer Salunke <misalunk>
Component: Etcd Assignee: Sam Batschelet <sbatsche>
Status: CLOSED DEFERRED QA Contact: ge liu <geliu>
Severity: high Docs Contact:
Priority: high    
Version: 4.1.0 CC: ansverma, gblomqui, geliu, mfojtik, nagrawal, sbatsche, scuppett, sttts
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-21 12:01:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Miheer Salunke 2019-08-14 09:20:47 UTC
Description of problem:

The disaster recovery guide does not work, including the etcd restore and expired-certificate procedures.

We followed the guide and it failed. We do not wish to create a support case during the evaluation period. Please treat this as important; it needs the support team's full attention.

Background:
OCP4 is installed on VMware ESXi with vCenter. We decided to propose the following backup and restore approach: take a snapshot of the CoreOS machines, revert to the snapshot, and then restore etcd.

We have two snapshots:
(1) 06/25/2019 - taken right after the clean install of OCP4, on 4.1.0, without a memory snapshot
(2) 08/09/2019 - taken after OCP4 had been in use for a while, on 4.1.4, including a memory snapshot

In the end, the following documented procedures did not work:
[DOC1] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-2-restoring-cluster-state.html
[DOC2] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html



Situation:

On 08/09/2019
We took snapshot (2), backed up etcd, and then upgraded to 4.1.6 over the air (OTA). Afterwards, we reverted to snapshot (2). The UI console and oc both worked. After we restored etcd using DOC1, logging in to the UI console failed intermittently, and we later found that 2 of the 3 API servers could not start properly and kept restarting.

On 12/09/2019
As the environment was not stable, we decided to revert to snapshot (1), the clean install. Once all the machines had started, we found that the UI console and oc could not be accessed. The error message reported that an X509 certificate had expired or was invalid. We then reverted to snapshot (2) and got the same error message. We therefore followed the DOC2 procedure, starting from snapshot (2). In the end, the procedure failed and the kubelet could not restart.
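
One way to confirm the expiry diagnosis above is to compare a certificate's notAfter date with the time at which the snapshot was reverted. The following is a minimal Python sketch, not part of the original report. It assumes the notAfter line was obtained with `openssl x509 -noout -enddate` against one of the cluster certificates; the sample date and the revert time are hypothetical values chosen for illustration, not data from this cluster.

```python
import ssl
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse a line such as 'notAfter=Jun 26 12:00:00 2019 GMT'.

    This is the format printed by `openssl x509 -noout -enddate`.
    """
    value = line.strip().split("=", 1)[-1]
    # ssl.cert_time_to_seconds understands the certificate date format
    # and returns seconds since the epoch (UTC).
    seconds = ssl.cert_time_to_seconds(value)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def is_expired(not_after_line: str, now: datetime) -> bool:
    """Return True if the certificate is no longer valid at `now`."""
    return now >= parse_not_after(not_after_line)


# Hypothetical values: a certificate that ran out shortly after the first
# snapshot, checked at a revert time some weeks later.
sample_line = "notAfter=Jun 26 12:00:00 2019 GMT"
revert_time = datetime(2019, 8, 12, tzinfo=timezone.utc)
print(is_expired(sample_line, revert_time))  # prints True
```

If the check returns True for the certificates inside a reverted snapshot, the X509 error is expected after the revert, independent of the etcd restore.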

Issue 1 - recover-kubeconfig.sh, which DOC2 requires, is missing from the machine; we found it on Bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=1723928).
Issue 2 - After we completed all the steps from DOC2 (including running apiserver-recovery, regenerating and re-accepting the certificates, and restarting the kubelet), the result was a failure and the kubelet could no longer restart.
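
The "re-accept cert" step in Issue 2 amounts to approving the certificate signing requests (CSRs) that are still pending. The following is a minimal Python sketch of how the pending ones could be picked out, not part of the original report. It assumes the input is the JSON printed by `oc get csr -o json` and treats a CSR with no status conditions as pending; the CSR names in the sample are hypothetical.

```python
import json
from typing import List


def pending_csr_names(csr_list_json: str) -> List[str]:
    """Return the names of CSRs that have been neither approved nor denied.

    Assumption: a CSR is pending when its status has no conditions.
    """
    doc = json.loads(csr_list_json)
    names = []
    for item in doc.get("items", []):
        conditions = item.get("status", {}).get("conditions") or []
        if not conditions:
            names.append(item["metadata"]["name"])
    return names


# Hypothetical sample standing in for `oc get csr -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "csr-aaaaa"}, "status": {}},
        {"metadata": {"name": "csr-bbbbb"},
         "status": {"conditions": [{"type": "Approved"}]}},
        {"metadata": {"name": "csr-ccccc"}},
    ]
})
print(pending_csr_names(sample))  # prints ['csr-aaaaa', 'csr-ccccc']
```

Each returned name could then be passed to `oc adm certificate approve <name>`. If the list stays empty after the kubelet restarts, the kubelet may not be submitting new requests at all, which would be worth noting in the report.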

The whole environment is now down.

To summarise: the disaster_recovery guide does not work at all, and we would like a reliable guide for backup and restore.


Version-Release number of selected component (if applicable):
OCP 4.1

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Miheer Salunke 2019-08-14 09:21:44 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=1722807 related?

Comment 11 Stephen Cuppett 2020-02-21 12:01:21 UTC
This is being tracked now in: https://issues.redhat.com/browse/OCPPLAN-2339