Bug 1731441

Summary:	[DOCS] Openshift on AWS: Shutdown of running cluster and start after 1-2 days not working
Product:	OpenShift Container Platform	Reporter:	szustkowski
Component:	Documentation	Assignee:	Kathryn Alexander <kalexand>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Gaoyun Pei <gpei>
Severity:	low	Docs Contact:	Vikram Goyal <vigoyal>
Priority:	low
Version:	4.1.z	CC:	aos-bugs, clasohm, jokerman, mfojtik, mjahangi, mmccomas, rhowe, sjenning, sttts, wsun
Target Milestone:	---	Keywords:	Reopened
Target Release:	4.1.z
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2019-11-01 15:01:39 UTC	Type:	Bug

Description szustkowski 2019-07-19 12:28:01 UTC

Description of problem:

A running OpenShift 4.1 cluster on AWS whose EC2 instances are stopped to save money during a test phase does not come up again after 1-2 days.

Version-Release number of the following components:

./openshift-install v4.1.6-201907101224-dirty
built from commit e8e6d8998bed2087244a14be16185235b43d6407
release image quay.io/openshift-release-dev/ocp-release@sha256:aa955a9ec40e55e5d9c0203a995b398e8c1031473dae24ed405efe9a95b43186


How reproducible:
Reproducible at least 2 times

Steps to Reproduce:
1. Install OpenShift 4.1 on installer-provisioned AWS, without an installer-config file
2. Stop all EC2 instances
3. Start them again immediately
4. Try to access the cluster: It works
5. Stop them again
6. Wait for 2 days
7. Start them again
8. Give them enough time to boot up: Around 30 minutes
9. Try to access the cluster: Doesn't work anymore
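
The stop and start steps above could be scripted roughly as follows. This is only a sketch, not part of the original report: it builds the argument lists for the AWS CLI commands `aws ec2 stop-instances` and `aws ec2 start-instances` without running them. The helper name `ec2_power_command` and the instance ID in the usage note are hypothetical.

```python
from typing import List


def ec2_power_command(action: str, instance_ids: List[str]) -> List[str]:
    """Build the argv for `aws ec2 stop-instances` or `aws ec2 start-instances`.

    Hypothetical helper for illustration; it only constructs the command
    and does not call AWS.
    """
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    if not instance_ids:
        raise ValueError("at least one instance ID is required")
    # e.g. aws ec2 stop-instances --instance-ids i-aaa i-bbb
    return ["aws", "ec2", f"{action}-instances", "--instance-ids", *instance_ids]
```

The resulting list can be passed to `subprocess.run(...)` on a machine with the AWS CLI configured, for example `ec2_power_command("stop", ["i-0123456789abcdef0"])` for step 2 and the same call with `"start"` for step 3.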

Actual results:
Open the Cluster Web Console in browser: Chrome shows ERR_CONNECTION_CLOSED

Expected results: The cluster is fully operational and the web frontend is accessible.

Additional info:
Possibly related to https://github.com/openshift/installer/issues/818. Maybe it's an AWS infrastructure issue, though. However, in this case the openshift-installer should install the cluster in such a way that it survives restarts, even if it is installed by an AWS noob.

Comment 3 Seth Jennings 2019-07-29 19:18:03 UTC

This is a documented requirement of the new product architecture: you must keep the cluster running for at least 24 hours after installation so that the components can rotate to their non-installation certificates.
https://docs.openshift.com/container-platform/4.1/installing/installing_bare_metal/installing-bare-metal.html#installation-generate-ignition-configs_installing-bare-metal
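
As a rough illustration of the 24-hour rule, a guard like the one below could be run before stopping the instances. It is a minimal sketch under stated assumptions: the names `MIN_UPTIME` and `safe_to_stop` are hypothetical, the installation timestamp has to be supplied by the caller, and the check only compares wall-clock time. It assumes the cluster ran continuously since installation and does not verify that the certificates were actually rotated.

```python
from datetime import datetime, timedelta, timezone

# Assumption taken from the comment above: the cluster must stay up for
# at least 24 hours after installation.
MIN_UPTIME = timedelta(hours=24)


def safe_to_stop(installed_at: datetime, now: datetime,
                 min_uptime: timedelta = MIN_UPTIME) -> bool:
    """Return True if at least `min_uptime` has elapsed since installation.

    Hypothetical helper: it only compares timestamps and assumes the
    cluster has been running the whole time.
    """
    return now - installed_at >= min_uptime


if __name__ == "__main__":
    # Example timestamp chosen for illustration only.
    installed = datetime(2019, 7, 17, 12, 0, tzinfo=timezone.utc)
    print(safe_to_stop(installed, datetime.now(timezone.utc)))
```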

Comment 4 Ryan Howe 2019-07-30 20:24:39 UTC

We need the same note for the AWS IPI install, stating that the cluster must stay up for 24 hours after the install.

Comment 5 Ryan Howe 2019-07-30 20:27:12 UTC

This warning needs to be added to every type of install:
  
    https://docs.openshift.com/container-platform/4.1/installing/*

Comment 7 Kathryn Alexander 2019-10-17 16:02:50 UTC

The PR to add the note to all of the installation assemblies is here: https://github.com/openshift/openshift-docs/pull/17424

@Gaoyun Pei, will you PTAL?

Comment 10 Gaoyun Pei 2019-10-21 08:24:47 UTC

PR https://github.com/openshift/openshift-docs/pull/17424 lgtm.

Comment 12 Kathryn Alexander 2019-10-21 13:53:30 UTC

I've merged the change and am waiting for it to go live.

Comment 13 Kathryn Alexander 2019-11-01 15:01:39 UTC

This change is live on docs.openshift.com, e.g.: https://docs.openshift.com/container-platform/4.2/installing/installing_aws/installing-aws-default.html#installation-launching-installer_installing-aws-default

and on the portal, e.g.: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.2/html/installing/installing-on-aws#installation-launching-installer_installing-aws-customizations