Bug 1940149

Summary: [RFE] Retry the getting of the image from quay.io
Product: OpenShift Container Platform
Reporter: Udi Kalifon <ukalifon>
Component: assisted-installer
Assignee: Eran Cohen <ercohen>
assisted-installer sub component: Installer
QA Contact: Udi Kalifon <ukalifon>
Status: CLOSED CURRENTRELEASE
Docs Contact:
Severity: medium
Priority: unspecified
CC: ercohen, yobshans
Version: 4.7
Keywords: Reopened
Target Milestone: ---
Target Release: internal.milestone
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: OCP-Metal-v1.0.19.1
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-28 08:45:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Failure in run install none

Description Udi Kalifon 2021-03-17 17:16:36 UTC
Created attachment 1764127
Failure in run install

Description of problem:
My installation failed right at the beginning (within ~20 seconds) with this error:

Cluster installation failed
Failed generating kubeconfig files for cluster 92d85eef-339a-4d80-9e83-361a49a3318f: command oc exited with non-zero exit code 1: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-release:4.7.2-x86_64: Get "https://quay.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers) .
Reset the installation process to return to the configuration and try again. Some hosts may need to be re-registered by rebooting into the Discovery ISO.


All hosts were in error in step 0/7.

To proceed, I reset the cluster, rebooted the hosts, and started again. However, this could be avoided by having the agent or installer retry the call to quay.io a few more times before giving up, which would make such errors rarer.
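
To illustrate the kind of retry being requested, here is a minimal sketch of retrying a flaky call with exponential backoff. It is not the assisted-installer implementation: the function name `retry`, its parameters, and the choice of `TimeoutError`/`ConnectionError` as the retryable errors are all assumptions made for this example.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` up to `attempts` times, sleeping between failures.

    Hypothetical helper: only exceptions listed in `retry_on` are retried,
    and the delay is multiplied by `backoff` after every failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(delay)
            delay *= backoff
    raise AssertionError("unreachable")
```

With a wrapper like this, a single timeout while contacting the registry would cost a short delay rather than failing the whole installation; only a failure that persists across all attempts would reach the user.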


Version-Release number of selected component (if applicable):
Release tag
    stable
Assisted Installer UI version
    quay.io/ocpmetal/ocp-metal-ui:2fe99dd56daff096177e5d9a1b644c8a3ee5b039
Assisted Installer UI library version
    0.0.12-wizard
Assisted Installer
    quay.io/ocpmetal/assisted-installer:c107911c4756e4473405e893ee80f4a6b079ac4f
Assisted Installer Controller
    quay.io/ocpmetal/assisted-installer-controller:c107911c4756e4473405e893ee80f4a6b079ac4f
Assisted Installer Service
    quay.io/ocpmetal/assisted-service:e0df002062f80149769707e72e5952da16897aef
Discovery Agent
    quay.io/ocpmetal/assisted-installer-agent:edbaff3f6b1343b6e51c64d461923ac592820476


How reproducible:
Rarely


Steps to Reproduce:
1. Run the regular Assisted Installer (AI) flow


Additional info:
See screenshot

Comment 1 Ronnie Lazar 2021-03-17 18:51:17 UTC
ercohen, don't we already have retries?

Comment 3 Eran Cohen 2021-03-18 11:25:58 UTC
Note that the cluster might fail during preparing-for-installation for multiple reasons, and there is no reason to require a host reboot in that case.
So I think that's what we should fix.

Comment 4 Eran Cohen 2021-03-21 07:30:37 UTC
There is work in progress that should mitigate this issue (the user won't need to reset the installation & reboot all nodes).
If the assisted-installer fails for any reason during preparing-for-installation, it will set the cluster status to insufficient.
The cluster will recover back to ready status if all is well.
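
The recovery path described in this comment can be sketched as a small state transition. This is only an illustration under stated assumptions: the enum `ClusterStatus`, the functions `after_preparation` and `after_validation`, and the `installing` status are hypothetical names for this example, not the assisted-service code.

```python
from enum import Enum


class ClusterStatus(Enum):
    INSUFFICIENT = "insufficient"
    READY = "ready"
    PREPARING = "preparing-for-installation"
    INSTALLING = "installing"  # assumed name for the next step


def after_preparation(succeeded: bool) -> ClusterStatus:
    """Status after the preparing-for-installation step finishes."""
    # A failure drops the cluster to insufficient instead of a terminal error,
    # so no reset or host reboot is needed.
    return ClusterStatus.INSTALLING if succeeded else ClusterStatus.INSUFFICIENT


def after_validation(status: ClusterStatus, all_checks_pass: bool) -> ClusterStatus:
    """An insufficient cluster returns to ready once all checks pass."""
    if status is ClusterStatus.INSUFFICIENT and all_checks_pass:
        return ClusterStatus.READY
    return status
```

In this sketch a failed preparation leaves the cluster in a recoverable state, so the user can start the installation again once the cluster is back to ready.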

Comment 5 Udi Kalifon 2021-03-22 13:44:35 UTC
This will still fail the automation, and I think most users also won't want to manually retry the installation, even if it's simple. Would you consider adding the retry after all?

Comment 6 Eran Cohen 2021-03-25 07:42:34 UTC
Sure, adding retries does make sense regardless of how the installation gets back on track.
I'll reopen and remove the won't fix resolution.

Comment 7 Yuri Obshansky 2021-05-05 13:14:38 UTC
Verified on OCP-Metal-v1.0.19.1