Bug 1803805 - Timeouts are too short for openshift-baremetal-installer and not adjustable.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Stephen Benjamin
QA Contact: Polina Rabinovich
URL:
Whiteboard:
Depends On: 1794755
Blocks: 1771572
 
Reported: 2020-02-17 13:44 UTC by Stephen Benjamin
Modified: 2020-03-26 07:27 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1794755
Environment:
Last Closed: 2020-03-10 23:53:54 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 3116 0 None closed [release-4.3] Bug 1803805: cmd/openshift-install/create: wait 60 minutes for baremetal 2021-02-05 23:45:09 UTC
Red Hat Product Errata RHBA-2020:0676 0 None None None 2020-03-10 23:54:11 UTC

Description Stephen Benjamin 2020-02-17 13:44:00 UTC
+++ This bug was initially created as a clone of Bug #1794755 +++

Timeouts are too short for openshift-baremetal-installer and not adjustable.

Version:
4.4.0-0.nightly-2020-01-23-054055

Running openshift-baremetal-installer on bare metal often times out.
We need either longer timeouts or the ability to adjust them as needed.


One deployment I tried failed with the following output in the log:
time="2020-01-23T18:23:27-05:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-01-23-054055: 99% complete"
time="2020-01-23T18:26:12-05:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, machine-api, monitoring"
time="2020-01-23T18:29:57-05:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-01-23-054055: 99% complete"
time="2020-01-23T18:32:27-05:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-01-23-054055: 99% complete, waiting on authentication, console, ingress, monitoring"
time="2020-01-23T18:35:42-05:00" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.4.0-0.nightly-2020-01-23-054055: 99% complete"
time="2020-01-23T18:38:27-05:00" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, ingress, monitoring"
time="2020-01-23T18:41:37-05:00" level=error msg="Cluster operator authentication Degraded is True with IngressStateEndpoints_MissingSubsets::RouteStatus_FailedHost: IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server\nRouteStatusDegraded: route is not available at canonical host oauth-openshift.apps.qe1.kni.lab.eng.bos.redhat.com: []"
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator authentication Progressing is Unknown with NoData: "
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator authentication Available is Unknown with NoData: "
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator console Progressing is True with RouteSyncProgressingFailedHost: RouteSyncProgressing: route is not available at canonical host []"
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator console Available is Unknown with NoData: "
time="2020-01-23T18:41:37-05:00" level=error msg="Cluster operator ingress Degraded is True with IngressControllersDegraded: Some ingresscontrollers are degraded: default"
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator ingress Progressing is True with Reconciling: Not all ingress controllers are available.\nMoving to release version \"4.4.0-0.nightly-2020-01-23-054055\".\nMoving to ingress-controller image version \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0fe7520a269c7dcd6a1ed69670f5b1796b58117216192bd8a45470bb758e9e5b\"."
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator ingress Available is False with IngressUnavailable: Not all ingress controllers are available."
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator insights Disabled is False with : "
time="2020-01-23T18:41:37-05:00" level=error msg="Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main"
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator monitoring Available is False with : "
time="2020-01-23T18:41:37-05:00" level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
time="2020-01-23T18:41:37-05:00" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, ingress, monitoring"


When I later checked the status of the cluster, it had... actually deployed successfully.


Another deployment on the same setup failed with the following error in the log:

time="2020-01-24T01:52:12-05:00" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[1]: Still creating... [26m20s elapsed]"
time="2020-01-24T01:52:22-05:00" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[0]: Still creating... [26m30s elapsed]"
time="2020-01-24T01:52:22-05:00" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[2]: Still creating... [26m30s elapsed]"
time="2020-01-24T01:52:22-05:00" level=debug msg="module.masters.ironic_node_v1.openshift-master-host[1]: Still creating... [26m30s elapsed]"
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error msg="Error: could not contact API: timeout reached"
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error msg="  on ../../tmp/openshift-install-696469293/masters/main.tf line 1, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2020-01-24T01:52:26-05:00" level=error msg="   1: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error msg="Error: could not contact API: timeout reached"
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error msg="  on ../../tmp/openshift-install-696469293/masters/main.tf line 1, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2020-01-24T01:52:26-05:00" level=error msg="   1: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error msg="Error: could not contact API: timeout reached"
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error msg="  on ../../tmp/openshift-install-696469293/masters/main.tf line 1, in resource \"ironic_node_v1\" \"openshift-master-host\":"
time="2020-01-24T01:52:26-05:00" level=error msg="   1: resource \"ironic_node_v1\" \"openshift-master-host\" {"
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=error
time="2020-01-24T01:52:26-05:00" level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"

This deployment really did fail. The bootstrap VM is still running, and there are no errors related to starting containers, so I assume it simply took a long time to pull the container images.

Comment 4 errata-xmlrpc 2020-03-10 23:53:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0676

