Bug 2058718 - OpenShift installation cannot finish due to timeout
Summary: OpenShift installation cannot finish due to timeout
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Assignee: aos-install
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-25 17:15 UTC by Miguel Figueiredo Nunes
Modified: 2022-04-07 01:21 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-07 01:21:13 UTC
Target Upstream Version:
Embargoed:




Links
Github stolostron backlog issues 20257 (last updated 2022-02-25 21:05:10 UTC)

Description Miguel Figueiredo Nunes 2022-02-25 17:15:48 UTC
Description of the problem:
The customer is attempting to install a cluster and cannot succeed because the installer times out after the control plane finishes installing.

Release version:
2.5

Operator snapshot version:
N/A

OCP version:
4.7.x

Browser Info:

Steps to reproduce:
1. Attempt to create a cluster from the ACM interface
2. Wait for the control plane to finish installing
3. Wait for the installation to finish

Actual results:
Cluster is not installed; the installation process times out

Expected results:
Cluster properly installed

Additional info:
The customer accesses the Internet through a proxy. Downloading the image and additional components during the installation is quite slow.

Comment 2 daliu 2022-03-01 06:11:54 UTC
Key point about customer env: 
https://access.redhat.com/support/cases/#/case/03125216/discussion?commentId=a0a2K00000eYKr8QAG

[call in 2/23 from 10am~11am]

What was discussed on the call:

- The customer is having issues downloading the OVA (it takes too much time to download the image through the company's proxy)
- Once the download finishes and the installation starts, it also takes too much time to finish (~4 hours)
- The process does not conclude properly; the control plane never stabilizes
- The timeout occurs at different points, with different objects being downloaded from the Internet
- The customer received a negative response about network issues from the company's network team
- The customer wants to avoid having to create a container image mirror, if possible

The action plan:

- Estimate how much raw data a new install needs to download (estimation only)
- Check whether ACM has any kind of configurable timeout
- Open a BZ ticket with the ACM engineering team to find out how long ACM will wait for a new cluster to be running and ready, and how that check works

I'm opening the ticket and asking the engineering team to provide an answer for us quickly.


@efried Could you help to take a look?

Comment 5 Eric Fried 2022-03-03 15:36:02 UTC
From what I can see in the installer log from the linked case, the installer was invoked correctly, which generally means ACM/hive did what they were supposed to do. The next step would be asking the installer team to have a look. However, it would be good to resolve the versioning and networking questions first so as not to send them on a wild goose chase (búsqueda inútil).

Comment 15 Scott Dodson 2022-03-10 17:20:19 UTC
The installer assumes that a cluster can complete the installation within approximately 100 minutes spread across several milestones, i.e., infrastructure provisioning, Kube API available on the bootstrap host, bootstrapping complete, installation complete, and console route availability. If an installation does not complete within this timeframe, it is assumed that some critical flaw exists which would preclude successful cluster operation. In this case that assumption seems likely to be accurate if the installation takes up to 4 hours.

As to how much data: the release image for a given architecture is approximately 9 GiB and the OVA is 1 GiB. So assume 1 GiB for the OVA import process, then up to 9 GiB per host. All of these assets could be retrieved and staged locally, but it seems there is no agreement to do so, and thus no chance of an installation completing in the allotted time.
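As a rough worked example only (assuming a typical three-master, three-worker topology, which is not stated in this case): 1 GiB for the OVA plus 6 hosts x 9 GiB comes to roughly 55 GiB of raw download for a single installation, before any retries, upgrades, or node additions.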

The only option left then is to script around the installer to wait additional time. They can loop on `openshift-install wait-for bootstrap-complete`, then run `openshift-install destroy bootstrap`, then loop on `openshift-install wait-for install-complete`.
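
A minimal sketch of such a wrapper, assuming the install assets live in ./install-dir and that simply re-running the wait-for phases until they succeed is acceptable (the retry interval and paths are illustrative assumptions, not a prescribed procedure):

#!/bin/bash
# Illustrative wrapper: keep re-entering the installer's wait phases instead of
# giving up when the built-in ~100-minute budget expires.
ASSETS_DIR=./install-dir   # assumed location of the install assets

# Loop until bootstrapping completes.
until openshift-install wait-for bootstrap-complete --dir "$ASSETS_DIR" --log-level=info; do
  echo "bootstrap not complete yet, retrying..."
  sleep 60
done

# Tear down the bootstrap host once the control plane is up.
openshift-install destroy bootstrap --dir "$ASSETS_DIR"

# Loop until the installation (operators, console route) completes.
until openshift-install wait-for install-complete --dir "$ASSETS_DIR" --log-level=info; do
  echo "installation not complete yet, retrying..."
  sleep 60
done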

Keep in mind, even if we simply wait longer to install, this cluster is not likely to deliver the expected quality of service. It will not scale well: adding nodes will retrieve up to 9 GiB of data per node, likely taking hours rather than the minutes expected of any other OpenShift cluster. Likewise, upgrading the cluster will consume roughly the same network bandwidth. Even simply rolling out configuration changes is likely to trigger pods to relocate between nodes, potentially pulling significant data. To me the only real solution for an environment with such limited connectivity is to treat this as a disconnected cluster.

Comment 16 Patrick Dillon 2022-03-11 14:57:11 UTC
Note that automatic collection of logs failed: 

time="2022-01-05T18:56:37Z" level=error msg="Attempted to gather debug logs after installation failure: failed to create SSH client: failed to use pre-existing agent, make sure the appropriate keys exist in the agent for authentication: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain"
time="2022-01-05T18:56:37Z" level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition"
time="2022-01-05T18:56:38Z" level=error msg="error after waiting for command completion" error="exit status 1" installID=jvhthhqd
time="2022-01-05T18:56:38Z" level=error msg="error provisioning cluster" error="exit status 1" installID=jvhthhqd
time="2022-01-05T18:56:38Z" level=error msg="error running openshift-install, running deprovision to clean up" error="exit status 1" installID=jvhthhqd
time="2022-01-05T18:56:38Z" level=debug msg="Unable to find log storage actuator. Disabling gathering logs." installID=jvhthhqd

Do we know if something is blocking the SSH connection to the bootstrap host? Otherwise, can you help troubleshoot with the customer? I'm not sure exactly how this is handled with hive/ACM.
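
One way to check this from the machine running the installer (a sketch only, assuming the bootstrap IP is known and that the key loaded into the SSH agent should match the one referenced in the install-config; the key path below is a placeholder):

# List the keys currently loaded in the ssh-agent the installer would use.
ssh-add -l

# If the expected key is missing, load it (path is an assumption).
ssh-add ~/.ssh/id_ed25519

# Try to reach the bootstrap host directly; the installer gathers logs as the 'core' user.
ssh -v core@<bootstrap-ip>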

Comment 21 daliu 2022-03-24 08:59:21 UTC
@alfrgarc 
As discussed in: https://coreos.slack.com/archives/C68TNFWA2/p1648104444911029?thread_ts=1647633318.225879&cid=C68TNFWA2

For the failed case, I think the user could repair it with the following steps:
1. Leave the backing cluster alone by setting cd.spec.preserveOnDelete: true on the clusterdeployment (doc); a sketch of this patch is shown after this list.
2. Reimport the real cluster following the cluster import doc.
3. Delete/destroy the failed managedcluster in the ACM UI (it will delete the managedcluster resource and delete the clusterdeployment; with step 1, the real cluster will not be deleted).
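
A minimal sketch of step 1 from the hub cluster, assuming the clusterdeployment shares the managed cluster's name and namespace (the names below are placeholders):

# Mark the ClusterDeployment so deleting it does not deprovision the real cluster.
oc patch clusterdeployment <cluster-name> -n <cluster-namespace> \
  --type merge -p '{"spec":{"preserveOnDelete":true}}'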

Comment 22 daliu 2022-03-25 01:04:54 UTC
I also added one optional step to repair the clusterdeployment; with it the customer can use Hibernate/destroy in the ACM UI.

3. [Optional] Adopt (https://github.com/openshift/hive/blob/master/docs/using-hive.md#cluster-adoption) the cluster again (the cd's name, namespace, and spec.clusterName should be the same as the managedcluster name). With this step, you can use the Hibernate/destroy feature in the ACM UI; it will be the same as having provisioned the cluster successfully. A sketch is shown below.
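
A rough sketch of what such an adoption could look like, based on the linked hive doc; the exact required fields (platform section, pull secret, etc.) are described there, and every name below is a placeholder:

# Apply an adoption-style ClusterDeployment on the hub; fields are illustrative only.
oc apply -f - <<'EOF'
apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
  name: <managedcluster-name>        # must match the managedcluster name
  namespace: <managedcluster-name>
spec:
  clusterName: <managedcluster-name>
  baseDomain: <base-domain>
  installed: true                    # mark the cluster as already provisioned
  clusterMetadata:
    clusterID: <cluster-uuid>
    infraID: <infra-id>
    adminKubeconfigSecretRef:
      name: <admin-kubeconfig-secret>
  # plus the platform and pull-secret sections required by the adoption doc
EOF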

https://coreos.slack.com/archives/C68TNFWA2/p1648104444911029?thread_ts=1647633318.225879&cid=C68TNFWA2

Comment 25 daliu 2022-04-07 00:20:25 UTC
@apizarro 
Great. So this issue can be closed, right?

Comment 26 Alfredo Pizarro 2022-04-07 01:10:05 UTC
Yes please. Thank you.

