Description of problem:
The disaster recovery guides do not work, for both restoring etcd and recovering from expired certificates. We followed the guides and the procedures failed. We would prefer not to open a support case during our evaluation period. Please treat this issue as important and give it the support team's full attention.

Background:
OCP4 is installed on VMware ESXi with vCenter. We propose the following as our backup and restore approach: take a snapshot of the CoreOS machines, revert to the snapshot, and then restore etcd.

We have two snapshots:
(1) 06/25/2019 - taken right after a clean install of OCP4, on 4.1.0, without a memory snapshot
(2) 08/09/2019 - taken after OCP4 had been in use for a while, on 4.1.4, including a memory snapshot

The following documentation does not work:
[DOC1] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-2-restoring-cluster-state.html
[DOC2] https://docs.openshift.com/container-platform/4.1/disaster_recovery/scenario-3-expired-certs.html

Situation:
On 08/09/2019
We took snapshot (2), backed up etcd, and then upgraded to 4.1.6 using OTA. Afterwards, we reverted to snapshot (2). The UI console and oc were both working. After we restored etcd using DOC1, logging in to the UI console failed intermittently, and we later found that 2 of the 3 api-server instances could not start properly and kept restarting.

On 12/09/2019
As the environment was not stable, we decided to revert to snapshot (1), the clean install. Once all the machines had started, we found that the UI console and oc could not be accessed. The error message reported that the X509 certificate was expired or invalid. We tried reverting to snapshot (2) and got the same error message. We therefore followed the DOC2 procedure starting from snapshot (2). In the end, the procedure failed and the kubelet could not restart.

Issue 1 - recover-kubeconfig.sh, which DOC2 requires, is missing on the machine. We found it in Bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=1723928).
Issue 2 - After we completed all the steps from DOC2 (which included running apiserver-recovery, regenerating and re-accepting the certificates, and restarting the kubelet), the procedure failed and the kubelet could no longer restart. The whole environment is now down.

To summarise, we would like a solid guide for backup and restore. The disaster_recovery guide does not work at all.

Version-Release number of selected component (if applicable):
OCP 4.1

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Is https://bugzilla.redhat.com/show_bug.cgi?id=1722807 related to this?
This is being tracked now in: https://issues.redhat.com/browse/OCPPLAN-2339