Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1973274

Summary: Update doc on how long a cluster can remain down and be gracefully restarted
Product: OpenShift Container Platform
Reporter: Nitish Kaushik <nkaushik>
Component: Documentation
Assignee: Mike Pytlak <mpytlak>
Status: CLOSED CURRENTRELEASE
QA Contact: Ke Wang <kewang>
Severity: high
Docs Contact: Vikram Goyal <vigoyal>
Priority: high
Version: 4.6
CC: aos-bugs, jokerman, kahara, kewang, mas-hatada, maszulik, mfuruta, rgangwar, rh-container, vfarias, vgoyal, xxia
Target Milestone: ---   
Target Release: 4.6.z   
Hardware: x86_64   
OS: Other   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-08-12 19:49:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Nitish Kaushik 2021-06-17 14:13:28 UTC
Document URL: https://docs.openshift.com/container-platform/4.6/backup_and_restore/graceful-cluster-shutdown.html

Or

https://docs.openshift.com/container-platform/4.6/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html

Section Number and Name: To include "Cold DR system shutdown approach for maintenance" 

Describe the issue: 
Maintenance work, or a cold DR approach, can require an OCP cluster to be shut down for an extended period. The documentation should state how long a cluster can remain down without causing problems when it is brought back up. Neither [0] nor [1] gives a time limit for a shutdown.

[0] https://docs.openshift.com/container-platform/4.6/backup_and_restore/graceful-cluster-shutdown.html


[1] https://docs.openshift.com/container-platform/4.6/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html

Suggestions for improvement: 
>To include Cold DR approach
It is not recommended to keep a cluster down for more than 365 days at a stretch, because doing so may cause cluster instability. Up to OCP 4.6, and up to 4.7.3, the only concern is the kube-apiserver-to-kubelet-signer certificate, which is rotated at 80% of a year (the 292nd day), with a second rotation that removes the old certificates from the config map (on the 365th day). If a cluster is shut down on day 2 or 3 and is not brought back up before the 365th day, the kube-apiserver certificates will expire and the cluster will be hard to recover after the 365th day. It is therefore advised not to keep a cluster down for more than a year. If a longer shutdown is required, plan a certificate rotation before the 365th day and then shut the cluster down again.
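
The day counts above can be checked with a small calculation. The following is only a sketch in Python: the 365-day validity and the 80% rotation point are the figures given in this description, not values read from a cluster, and the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumptions taken from the description above, not from a live cluster:
# the signer certificate is valid for 365 days and rotates at 80% of that.
VALIDITY_DAYS = 365
ROTATION_PERCENT = 80


def rotation_day(validity_days: int = VALIDITY_DAYS,
                 percent: int = ROTATION_PERCENT) -> int:
    """Day, counted from issuance, on which the first rotation is expected."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return validity_days * percent // 100


def is_restart_safe(issued: date, planned_restart: date,
                    validity_days: int = VALIDITY_DAYS) -> bool:
    """True if the planned restart falls strictly before certificate expiry."""
    return planned_restart < issued + timedelta(days=validity_days)


if __name__ == "__main__":
    print("rotation expected on day", rotation_day())
    print("restart safe:", is_restart_safe(date(2021, 1, 1), date(2021, 12, 31)))
```

A restart planned for the expiry date itself is treated as unsafe, which is the conservative reading of the "before the 365th day" advice.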

Additional information:

Comment 2 Masaki Hatada 2021-06-18 02:27:46 UTC
Dear Red Hat,

Although comment 0 focuses on the kube-apiserver-to-kubelet-signer certificate, we need to take care of every certificate that cannot be recovered by the procedure at https://docs.openshift.com/container-platform/4.6/backup_and_restore/disaster_recovery/scenario-3-expired-certs.html .

There are many certificates in OpenShift 4, according to https://docs.openshift.com/container-platform/4.6/security/certificate_types_descriptions/ .
On the other hand, scenario-3-expired-certs.html handles only control plane and node certificates.

Is kube-apiserver-to-kubelet-signer really the only certificate we need to take care of in cold DR?
We believe the answer is yes, but we would like to know Red Hat's opinion.

Best Regards,
Masaki Hatada

Comment 4 Mike Pytlak 2021-07-20 18:17:07 UTC
Capturing conversation that I had with Stephanie Stout and Vikram Goyal.

Confirmed via NEC TAMS that there are two separate things here:
    • For this bug, we need to update the docs to state the maximum time a cluster can be down before certificate expiry becomes a problem. We should be able to gather this after talking with Eng and QE.
    • The larger question of hot and cold DR can stay in the RFE, which requires Tushar/Mike to weigh in and some Eng support before we can eventually take it on. The suggestion is to leave that for now. See: https://mailman-int.corp.redhat.com/archives/openshift-sme/2021-June/msg00769.html.

Comment 5 Vivek Goyal 2021-07-20 18:24:59 UTC
You probably meant vigoyal (Vikram Goyal) and not vgoyal (vgoyal)

Comment 6 Mike Pytlak 2021-07-20 19:29:56 UTC
(In reply to Vivek Goyal from comment #5)
> You probably meant vigoyal (Vikram Goyal) and not
> vgoyal (vgoyal)

Correct. Sorry about that.

Comment 19 Mike Pytlak 2021-08-02 20:58:16 UTC
Ready for QE review

Added a note to "Shutting down a cluster gracefully", stating that a cluster can remain down for up to 1 year and be expected to restart gracefully.
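
As an illustration of the one-year window, the sketch below computes how many days remain before a certificate's expiry timestamp. It is not part of the PR. The `days_until_expiry` helper and the sample timestamp are hypothetical, and the timestamp is assumed to be a UTC value in the form `YYYY-MM-DDTHH:MM:SSZ`.

```python
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days from `now` until the given expiry timestamp.

    `not_after` is assumed to be UTC in the form YYYY-MM-DDTHH:MM:SSZ.
    A negative result means the certificate has already expired.
    """
    expiry = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return (expiry - now).days


if __name__ == "__main__":
    # Hypothetical expiry value, used only for illustration.
    sample = "2022-08-02T00:00:00Z"
    remaining = days_until_expiry(sample, datetime.now(timezone.utc))
    print("days until expiry:", remaining)
```

Comparing the result with the planned length of the shutdown shows whether the cluster would come back up before the certificate expires.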

This can be reviewed at https://github.com/openshift/openshift-docs/pull/35099

Please also see additional request for verification in the PR.

Comment 20 Mike Pytlak 2021-08-05 14:16:50 UTC
Updating based on comments in the PR.

Comment 21 Mike Pytlak 2021-08-05 18:22:43 UTC
Updates are complete.

This can be reviewed at https://github.com/openshift/openshift-docs/pull/35099

Please also see additional request for verification in the PR.

Comment 22 Mike Pytlak 2021-08-10 13:44:40 UTC
QE approved the changes in the PR.

(CC: @xxia)