Bug 1843979 - Allow increasing the timeout value to accommodate slower baremetal nodes in deployment
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Pierre Prinetti
QA Contact: David Sanz
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-04 14:21 UTC by Chris Janiszewski
Modified: 2020-08-19 09:11 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-06 15:44:42 UTC
Target Upstream Version:
Embargoed:



Description Chris Janiszewski 2020-06-04 14:21:53 UTC
Description of problem:
When deploying OCP 4.5 on a mix of VMs (masters) and BMs (workers), the deployment fails with the following:

DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 85% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: downloading update
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 0% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 3% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 8% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 13% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 14% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 85% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 85% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 86% complete
DEBUG Still waiting for the cluster to initialize: Cluster operator console is reporting a failure: RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.ocpra.hextupleo.lab/health returns '503 Service Unavailable'
INFO Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.9.67.155:6443/.well-known/oauth-authorization-server endpoint data
INFO Cluster operator authentication Available is False with :  
INFO Cluster operator insights Disabled is False with :  
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 4; 1 nodes are at revision 5
FATAL failed to initialize the cluster: Cluster operator console is reporting a failure: RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.ocpra.hextupleo.lab/health returns '503 Service Unavailable'

However, the process seems to finish fine after another half an hour.
(shiftstack) [stack@chrisj-undercloud-osp13 ~]$ oc get nodes
NAME                       STATUS   ROLES    AGE     VERSION
ocpra-dq8hl-master-0       Ready    master   4h37m   v1.18.3+1bc7b9e
ocpra-dq8hl-master-1       Ready    master   4h37m   v1.18.3+1bc7b9e
ocpra-dq8hl-master-2       Ready    master   4h37m   v1.18.3+1bc7b9e
ocpra-dq8hl-worker-qw2hl   Ready    worker   4h8m    v1.18.3+1bc7b9e
(shiftstack) [stack@chrisj-undercloud-osp13 ~]$ openstack server list
+--------------------------------------+--------------------------+--------+-----------------------------------------------------------+-------------+-----------+
| ID                                   | Name                     | Status | Networks                                                  | Image       | Flavor    |
+--------------------------------------+--------------------------+--------+-----------------------------------------------------------+-------------+-----------+
| 364bc802-46d1-4fbb-91da-b060c745dbd5 | ocpra-dq8hl-worker-qw2hl | ACTIVE | baremetal=10.9.67.150, 10.9.65.108; StorageNFS=10.9.65.14 | rhcos45-raw | baremetal |
| 05fab68c-5744-40c9-ba33-199a8ba8abf8 | ocpra-dq8hl-master-2     | ACTIVE | baremetal=10.9.67.155; StorageNFS=10.9.65.11              | rhcos45-raw | m1.large  |
| 927aac7b-7e28-4ce3-a83e-f8a1ca36024d | ocpra-dq8hl-master-0     | ACTIVE | baremetal=10.9.67.165; StorageNFS=10.9.65.13              | rhcos45-raw | m1.large  |
| 659112d9-9fd5-423f-9c16-85d0bde91515 | ocpra-dq8hl-master-1     | ACTIVE | baremetal=10.9.67.151; StorageNFS=10.9.65.20              | rhcos45-raw | m1.large  |
+--------------------------------------+--------------------------+--------+-----------------------------------------------------------+-------------+-----------+
(shiftstack) [stack@chrisj-undercloud-osp13 ~]$ oc get clusteroperator                                                                                                                
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h4m
cloud-credential                           4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h45m
cluster-autoscaler                         4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h35m
config-operator                            4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h36m
console                                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h7m
csi-snapshot-controller                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      9m
dns                                        4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h39m
etcd                                       4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h39m
image-registry                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h10m
ingress                                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h10m
insights                                   4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h36m
kube-apiserver                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h38m
kube-controller-manager                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h38m
kube-scheduler                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h37m
kube-storage-version-migrator              4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h10m
machine-api                                4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h33m
machine-approver                           4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h36m
machine-config                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h34m
marketplace                                4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h35m
monitoring                                 4.5.0-0.nightly-2020-05-29-105132   True        False         False      10m
network                                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h40m
node-tuning                                4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h40m
openshift-apiserver                        4.5.0-0.nightly-2020-05-29-105132   True        False         False      36m
openshift-controller-manager               4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h33m
openshift-samples                          4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h34m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h39m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h39m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h35m
service-ca                                 4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h40m
storage                                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h36m


Executing the following also confirms that the deployment eventually completes successfully:
(shiftstack) [stack@chrisj-undercloud-osp13 ~]$ openshift-install --dir=ocpra wait-for install-complete
INFO Waiting up to 30m0s for the cluster at https://api.ocpra.hextupleo.lab:6443 to initialize... 
INFO Waiting up to 10m0s for the openshift-console route to be created... 
INFO Install complete!                            
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/stack/ocpra/auth/kubeconfig' 
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.ocpra.hextupleo.lab 
INFO Login to the console with user: "kubeadmin", and password: "WZ8Zz-a9EB4-CCVsD-faQCX" 
INFO Time elapsed: 1s     

There should be a way to increase the timeout to accommodate different hardware that can take 30 minutes longer than a VM to deploy.

Version-Release number of the following components:
openshift-install 4.5.0-0.nightly-2020-05-29-105132

Comment 6 Martin André 2020-06-05 17:22:06 UTC
Just a note that BM deployments already use a 60-minute timeout instead of 30 to accommodate the longer boot time of BM nodes, and are *still* hitting the timeout occasionally.
https://github.com/openshift/installer/blob/3d6f27a/cmd/openshift-install/create.go#L348

Comment 8 Aleksandar Kostadinov 2020-06-15 14:04:56 UTC
Hello, while on it, could you make installation timeouts configurable via a configuration file, environment variables, or flags? I see that depending on the installation type (even non-bare-metal) and the properties of the underlying infrastructure, the installation can time out. This was already discussed in bug 1819746, which resulted in a documentation change.

But for QE performing many kinds of temporary installations on different infrastructure, having manual steps in the process is not a viable option. On the other hand, ignoring a failed installer execution and checking the cluster later in an automated fashion leaves room for false positives.
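
A rough sketch of what such an automated flow could look like (the paths, retry count, and timeouts below are placeholders chosen for this report, not anything the installer itself provides): retry `wait-for install-complete` a bounded number of times, then gate the job on every cluster operator reporting Available, so that a slow-but-successful installation is not recorded as a failure.

INSTALL_DIR=ocpra                                     # placeholder: whatever --dir the job used
export KUBECONFIG="${INSTALL_DIR}/auth/kubeconfig"
for attempt in 1 2 3; do                              # bounded retries instead of ignoring the failure
  openshift-install --dir="${INSTALL_DIR}" wait-for install-complete && break
done
# Fail the job unless every cluster operator eventually reports Available=True.
oc wait clusteroperators --all --for=condition=Available --timeout=30m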

Thank you.

Comment 10 Pierre Prinetti 2020-08-06 15:44:42 UTC
Unfortunately, we are not able to provide a solution in code at this stage for the installer's `wait-for install-complete`.

We are documenting a workaround for attaching bare metal machines in this PR: https://github.com/openshift/installer/pull/3955

Day 2 operations should be covered by this patch, which increases the waiting time for CSRs to two hours: https://github.com/openshift/cluster-machine-approver/pull/37
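
If a bare metal node comes up even more slowly than that extended window allows, the generic day-2 step of approving its pending CSRs by hand still applies (standard `oc` commands, not something introduced by the patch above):

export KUBECONFIG=/home/stack/ocpra/auth/kubeconfig   # kubeconfig of this report's cluster
# Approve every CSR that has no status yet, i.e. is still Pending.
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve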

Comment 11 Aleksandar Kostadinov 2020-08-06 18:58:22 UTC
> we are not able to provide a solution in code at this stage

Pierre, could you clarify? Does it mean we drop the feature for the current version, or do we plan to leave it as is for the foreseeable future?

In case we only drop the feature for the current version, how about keeping the issue open and changing the target version?

Comment 12 Pierre Prinetti 2020-08-07 10:26:25 UTC
(In reply to Aleksandar Kostadinov from comment #11)
> > we are not able to provide a solution in code at this stage
> 
> Pierre, could you clarify? Does it mean we drop the feature for the current
> version, or do we plan to leave it as is for the foreseeable future?
> 
> In case we only drop the feature for the current version, how about keeping
> the issue open and changing the target version?

The problem at hand ("the Installer timeout expires before installation is complete") has an easy workaround ("Just run `openshift-install wait-for install-complete` again").
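
A minimal sketch of wrapping that workaround, assuming the same --dir used in this report, so a timed-out `create cluster` is immediately followed by a second wait:

INSTALL_DIR=ocpra
openshift-install --dir="${INSTALL_DIR}" create cluster || \
  openshift-install --dir="${INSTALL_DIR}" wait-for install-complete   # gives the cluster up to another 30 minutes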

I personally think that the best way forward would be to let the user customise the timeout duration, for example with a command-line flag.
However, since this is a change to the Installer (as opposed to a platform-specific change), it requires a degree of coordination that is hard to obtain with a low-priority bug.

We can have a discussion with the Installer team by treating the change as a feature, rather than a bug, for an upcoming release. The first step in this direction may be to open an issue, or a pull request, in github.com/openshift/enhancements.

