Created attachment 1764127
Failure in run install

Description of problem:
My installation failed right at the beginning (within ~20 seconds) with this error:

Cluster installation failed
Failed generating kubeconfig files for cluster 92d85eef-339a-4d80-9e83-361a49a3318f: command oc exited with non-zero exit code 1: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-release:4.7.2-x86_64: Get "https://quay.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers) . Reset the installation process to return to the configuration and try again. Some hosts may need to be re-registered by rebooting into the Discovery ISO.

All hosts were in error in step 0/7.

To proceed, I reset the cluster, rebooted the hosts, and started again. However, this could be avoided by having the agent or installer retry the call to quay.io a few more times before giving up, which would make such errors rarer.

Version-Release number of selected component (if applicable):
Release tag: stable
Assisted Installer UI version: quay.io/ocpmetal/ocp-metal-ui:2fe99dd56daff096177e5d9a1b644c8a3ee5b039
Assisted Installer UI library version: 0.0.12-wizard
Assisted Installer: quay.io/ocpmetal/assisted-installer:c107911c4756e4473405e893ee80f4a6b079ac4f
Assisted Installer Controller: quay.io/ocpmetal/assisted-installer-controller:c107911c4756e4473405e893ee80f4a6b079ac4f
Assisted Installer Service: quay.io/ocpmetal/assisted-service:e0df002062f80149769707e72e5952da16897aef
Discovery Agent: quay.io/ocpmetal/assisted-installer-agent:edbaff3f6b1343b6e51c64d461923ac592820476

How reproducible:
Rarely

Steps to Reproduce:
1. This is the regular Assisted Installer (AI) flow

Additional info:
See screenshot
ercohen, don't we already have retries?
Note that the cluster might fail during preparing-for-installation for multiple reasons, and there is no reason to require a host reboot in that case. So I think that's what we should fix.
There is work in progress that should mitigate this issue (the user won't need to reset the installation and reboot all nodes). If the assisted-installer fails for any reason during preparing-for-installation, it will set the cluster status to insufficient. The cluster will recover to the ready status if all is well.
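The recovery behavior described above can be pictured as a small state transition. This is only an illustrative sketch, not the assisted-service implementation: the `ClusterStatus` enum, the `INSTALLING` member, and both function names are hypothetical, and only the "preparing-for-installation", "insufficient", and "ready" statuses come from this thread.

```python
from enum import Enum


class ClusterStatus(Enum):
    READY = "ready"
    PREPARING = "preparing-for-installation"
    INSUFFICIENT = "insufficient"
    INSTALLING = "installing"  # assumed name for the next stage, not from this thread


def after_preparation(succeeded: bool) -> ClusterStatus:
    """Status chosen once preparing-for-installation finishes.

    A failure moves the cluster to 'insufficient' instead of a terminal
    error, so no reset or host reboot is needed.
    """
    return ClusterStatus.INSTALLING if succeeded else ClusterStatus.INSUFFICIENT


def after_validation(status: ClusterStatus, all_is_well: bool) -> ClusterStatus:
    """An 'insufficient' cluster returns to 'ready' once its checks pass."""
    if status is ClusterStatus.INSUFFICIENT and all_is_well:
        return ClusterStatus.READY
    return status


# A failed preparation followed by a healthy re-check.
status = after_preparation(succeeded=False)
recovered = after_validation(status, all_is_well=True)
```

Under these assumptions, a transient failure such as the quay.io timeout leaves the cluster one step away from a new installation attempt, although the attempt still has to be started again.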
This will still fail the automation, and I think most users also won't like having to manually retry the installation, even if it's simple. Would you consider adding the retry after all?
Sure, adding retries does make sense regardless of how the installation gets back on track. I'll reopen and remove the won't fix resolution.
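For illustration, the kind of retry being discussed could look like the sketch below. It is a generic retry-with-exponential-backoff helper, not the actual fix: the `retry` function, its parameters, and the `flaky_pull` stand-in for the call to quay.io are all hypothetical, and the real change would live in the installer's own code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            # Catching everything keeps the sketch short; real code would
            # retry only on transient errors such as timeouts.
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= backoff  # wait longer before each new attempt


# Hypothetical stand-in for the registry call: times out twice, then succeeds.
calls = {"count": 0}


def flaky_pull() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("context deadline exceeded")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry(flaky_pull, attempts=5, sleep=delays.append)
```

Passing `sleep` as a parameter is a design choice that lets the helper be exercised without real waiting. With a bounded number of attempts, a short registry outage is absorbed, while a persistent one still fails with the original error.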
Verified on OCP-Metal-v1.0.19.1