Description of problem:

When deploying OCP 4.5 on a mix of VMs (masters) and BMs (workers), the deployment fails with the following:

DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 85% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: downloading update
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 0% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 3% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 8% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 13% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 14% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 85% complete
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 85% complete, waiting on authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, csi-snapshot-controller, image-registry, ingress, kube-storage-version-migrator, monitoring
DEBUG Still waiting for the cluster to initialize: Working towards 4.5.0-0.nightly-2020-05-29-105132: 86% complete
DEBUG Still waiting for the cluster to initialize: Cluster operator console is reporting a failure: RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.ocpra.hextupleo.lab/health returns '503 Service Unavailable'
INFO Cluster operator authentication Progressing is True with _WellKnownNotReady: Progressing: got '404 Not Found' status while trying to GET the OAuth well-known https://10.9.67.155:6443/.well-known/oauth-authorization-server endpoint data
INFO Cluster operator authentication Available is False with :
INFO Cluster operator insights Disabled is False with :
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 2 nodes are at revision 4; 1 nodes are at revision 5
FATAL failed to initialize the cluster: Cluster operator console is reporting a failure: RouteHealthDegraded: route not yet available, https://console-openshift-console.apps.ocpra.hextupleo.lab/health returns '503 Service Unavailable'

However, the process seems to finish fine after another half an hour.
(shiftstack) [stack@chrisj-undercloud-osp13 ~]$ oc get nodes
NAME                       STATUS   ROLES    AGE     VERSION
ocpra-dq8hl-master-0       Ready    master   4h37m   v1.18.3+1bc7b9e
ocpra-dq8hl-master-1       Ready    master   4h37m   v1.18.3+1bc7b9e
ocpra-dq8hl-master-2       Ready    master   4h37m   v1.18.3+1bc7b9e
ocpra-dq8hl-worker-qw2hl   Ready    worker   4h8m    v1.18.3+1bc7b9e

(shiftstack) [stack@chrisj-undercloud-osp13 ~]$ openstack server list
+--------------------------------------+--------------------------+--------+-----------------------------------------------------------+-------------+-----------+
| ID                                   | Name                     | Status | Networks                                                  | Image       | Flavor    |
+--------------------------------------+--------------------------+--------+-----------------------------------------------------------+-------------+-----------+
| 364bc802-46d1-4fbb-91da-b060c745dbd5 | ocpra-dq8hl-worker-qw2hl | ACTIVE | baremetal=10.9.67.150, 10.9.65.108; StorageNFS=10.9.65.14 | rhcos45-raw | baremetal |
| 05fab68c-5744-40c9-ba33-199a8ba8abf8 | ocpra-dq8hl-master-2     | ACTIVE | baremetal=10.9.67.155; StorageNFS=10.9.65.11              | rhcos45-raw | m1.large  |
| 927aac7b-7e28-4ce3-a83e-f8a1ca36024d | ocpra-dq8hl-master-0     | ACTIVE | baremetal=10.9.67.165; StorageNFS=10.9.65.13              | rhcos45-raw | m1.large  |
| 659112d9-9fd5-423f-9c16-85d0bde91515 | ocpra-dq8hl-master-1     | ACTIVE | baremetal=10.9.67.151; StorageNFS=10.9.65.20              | rhcos45-raw | m1.large  |
+--------------------------------------+--------------------------+--------+-----------------------------------------------------------+-------------+-----------+

(shiftstack) [stack@chrisj-undercloud-osp13 ~]$ oc get clusteroperator
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h4m
cloud-credential                           4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h45m
cluster-autoscaler                         4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h35m
config-operator                            4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h36m
console                                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h7m
csi-snapshot-controller                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      9m
dns                                        4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h39m
etcd                                       4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h39m
image-registry                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h10m
ingress                                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h10m
insights                                   4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h36m
kube-apiserver                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h38m
kube-controller-manager                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h38m
kube-scheduler                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h37m
kube-storage-version-migrator              4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h10m
machine-api                                4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h33m
machine-approver                           4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h36m
machine-config                             4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h34m
marketplace                                4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h35m
monitoring                                 4.5.0-0.nightly-2020-05-29-105132   True        False         False      10m
network                                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h40m
node-tuning                                4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h40m
openshift-apiserver                        4.5.0-0.nightly-2020-05-29-105132   True        False         False      36m
openshift-controller-manager               4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h33m
openshift-samples                          4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h34m
operator-lifecycle-manager                 4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h39m
operator-lifecycle-manager-catalog         4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h39m
operator-lifecycle-manager-packageserver   4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h35m
service-ca                                 4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h40m
storage                                    4.5.0-0.nightly-2020-05-29-105132   True        False         False      4h36m

Executing the following also confirms that the deployment should eventually complete successfully:

(shiftstack) [stack@chrisj-undercloud-osp13 ~]$ openshift-install --dir=ocpra wait-for install-complete
INFO Waiting up to 30m0s for the cluster at https://api.ocpra.hextupleo.lab:6443 to initialize...
INFO Waiting up to 10m0s for the openshift-console route to be created...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/stack/ocpra/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.ocpra.hextupleo.lab
INFO Login to the console with user: "kubeadmin", and password: "WZ8Zz-a9EB4-CCVsD-faQCX"
INFO Time elapsed: 1s

There should be a way to increase the timeout to accommodate different hardware that can take 30 minutes longer than VMs to deploy.

Version-Release number of the following components:
openshift-install 4.5.0-0.nightly-2020-05-29-105132
Just a note that BM deployments already use a 60-minute timeout instead of 30 minutes, to accommodate the longer boot time of BM nodes, and they *still* hit the timeout occasionally. https://github.com/openshift/installer/blob/3d6f27a/cmd/openshift-install/create.go#L348
Hello, while you are at it, could you make the installation timeouts configurable via a configuration file, environment variables, or command-line flags? Depending on the installation type (even non-bare-metal) and the properties of the underlying infrastructure, the installation can time out. This was already discussed in bug 1819746, which resulted in a documentation change. But for QE performing many kinds of temporary installations on different infrastructure, having manual steps in the process is not a viable option. On the other hand, ignoring a failed installer execution and checking the cluster later in an automated fashion leaves room for false positives. Thank you.
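For now, the most we can script on the QE side is to wrap the installer ourselves: re-run its wait phase a few times and then check the ClusterOperators explicitly rather than trusting a single exit code. A rough sketch of such a wrapper follows; the asset directory name and retry count are just examples, not anything the installer prescribes:

#!/usr/bin/env bash
# Example wrapper for automated installs: retry the installer's own wait phase,
# then confirm every ClusterOperator reports Available=True before declaring success.
set -euo pipefail

INSTALL_DIR=ocpra   # example asset directory; adjust per environment
export KUBECONFIG="${INSTALL_DIR}/auth/kubeconfig"

for attempt in 1 2 3; do
  if openshift-install --dir="${INSTALL_DIR}" wait-for install-complete --log-level=debug; then
    break
  fi
  echo "wait-for install-complete failed (attempt ${attempt}); retrying..."
done

# Column 3 of 'oc get clusteroperators' is AVAILABLE; anything other than "True"
# means the cluster has not actually converged, regardless of the installer's exit code.
not_ready="$(oc get clusteroperators --no-headers | awk '$3 != "True"')"
if [ -n "${not_ready}" ]; then
  echo "Cluster operators not yet Available:"
  echo "${not_ready}"
  exit 1
fi
echo "All cluster operators are Available."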
Unfortunately, we are not able to provide a solution in code at this stage for the installer's `wait-for install-complete`.

We are documenting a workaround for attaching bare metal machines in this PR: https://github.com/openshift/installer/pull/3955

Day 2 operations should be covered by this patch, which increases the waiting time for CSRs to two hours: https://github.com/openshift/cluster-machine-approver/pull/37
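Until that machine-approver change is available, bare metal workers that come up after the approval window can still be brought into the cluster by approving their pending CSRs manually. This is the standard documented procedure, reproduced here only for convenience:

# Approve every CSR that is still Pending (i.e. has no status yet). In practice this
# is run twice per node: once for the client CSR and once for the serving CSR that
# appears after the first is approved.
oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
  | xargs --no-run-if-empty oc adm certificate approve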
> we are not able to provide a solution in code at this stage

Pierre, could you clarify? Does that mean we drop the feature for the current version, or do we plan to leave it as-is for the foreseeable future?

If we are only dropping the feature for the current version, how about keeping the issue open and changing the target version?
(In reply to Aleksandar Kostadinov from comment #11)
> > we are not able to provide a solution in code at this stage
>
> Pierre, could you clarify? Does that mean we drop the feature for the
> current version, or do we plan to leave it as-is for the foreseeable future?
>
> If we are only dropping the feature for the current version, how about
> keeping the issue open and changing the target version?

The problem at hand ("the Installer timeout expires before installation is complete") has an easy workaround ("just run `openshift-install wait-for install-complete` again").

I personally think that the best way forward would be to let the user customise the timeout duration, for example with a command-line flag. However, since this is a change to the Installer itself (as opposed to a platform-specific change), it requires a degree of coordination that is hard to obtain with a low-priority bug.

We can have a discussion with the Installer team by treating the change as a feature, rather than a bug, for an upcoming release. The first step in this direction may be to open an issue, or a pull request, in github.com/openshift/enhancements.
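To illustrate: the first command below is the actual workaround today, while the second is a purely hypothetical sketch of what the proposed customisation could look like; no such flag exists in openshift-install at the moment.

# Actual workaround today: re-run the wait phase; it picks up state from the asset directory.
openshift-install --dir=ocpra wait-for install-complete

# Hypothetical sketch of the proposed customisation -- this flag does NOT exist yet.
openshift-install --dir=ocpra wait-for install-complete --timeout=90m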