Bug 1940149 - [RFE] Retry the getting of the image from quay.io
Summary: [RFE] Retry the getting of the image from quay.io
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
: internal.milestone
Assignee: Eran Cohen
QA Contact: Udi Kalifon
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-17 17:16 UTC by Udi Kalifon
Modified: 2022-08-28 08:45 UTC

Fixed In Version: OCP-Metal-v1.0.19.1
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-28 08:45:59 UTC
Target Upstream Version:
Embargoed:


Attachments
Failure in run install (181.27 KB, image/png)
2021-03-17 17:16 UTC, Udi Kalifon

Description Udi Kalifon 2021-03-17 17:16:36 UTC
Created attachment 1764127
Failure in run install

Description of problem:
My installation failed right at the beginning (within ~20 seconds) with this error:

Cluster installation failed
Failed generating kubeconfig files for cluster 92d85eef-339a-4d80-9e83-361a49a3318f: command oc exited with non-zero exit code 1: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-release:4.7.2-x86_64: Get "https://quay.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers) .
Reset the installation process to return to the configuration and try again. Some hosts may need to be re-registered by rebooting into the Discovery ISO.


All hosts were in error in step 0/7.

To proceed, I reset the cluster, rebooted the hosts, and started again. However, this could be avoided by having the agent or installer retry the call to quay.io a few more times before giving up, which would make such errors rarer.
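
A minimal sketch of the kind of retry being requested, assuming the call to quay.io can be wrapped in a function that raises on a timeout or connection failure. The name retry_with_backoff and all of its parameters are hypothetical and are not taken from the assisted-installer code; the actual fix may look quite different.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 1.0,
    backoff_factor: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` tries are used up.

    Hypothetical helper for illustration only. Errors not listed in
    `retry_on` are treated as permanent and propagate immediately.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Wait before the next try, growing the delay each time.
            sleep(delay)
            delay *= backoff_factor
    raise AssertionError("unreachable")


if __name__ == "__main__":
    # Stand-in for the registry call: fails twice with a timeout, then succeeds.
    state = {"calls": 0}

    def fetch_release_image() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise TimeoutError("context deadline exceeded")
        return "image metadata"

    print(retry_with_backoff(fetch_release_image, initial_delay=0.1))
```

With a handful of attempts and a growing delay, a short registry outage like the one above would be absorbed by the retries instead of failing the whole installation at step 0/7.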


Version-Release number of selected component (if applicable):
Release tag
    stable
Assisted Installer UI version
    quay.io/ocpmetal/ocp-metal-ui:2fe99dd56daff096177e5d9a1b644c8a3ee5b039
Assisted Installer UI library version
    0.0.12-wizard
Assisted Installer
    quay.io/ocpmetal/assisted-installer:c107911c4756e4473405e893ee80f4a6b079ac4f
Assisted Installer Controller
    quay.io/ocpmetal/assisted-installer-controller:c107911c4756e4473405e893ee80f4a6b079ac4f
Assisted Installer Service
    quay.io/ocpmetal/assisted-service:e0df002062f80149769707e72e5952da16897aef
Discovery Agent
    quay.io/ocpmetal/assisted-installer-agent:edbaff3f6b1343b6e51c64d461923ac592820476


How reproducible:
Rarely


Steps to Reproduce:
1. Run the regular Assisted Installer (AI) flow


Additional info:
See screenshot

Comment 1 Ronnie Lazar 2021-03-17 18:51:17 UTC
ercohen, don't we already have retries?

Comment 3 Eran Cohen 2021-03-18 11:25:58 UTC
Note that the cluster might fail during preparing-for-installation for multiple reasons, and there is no reason to require a host reboot in that case.
So I think that's what we should fix.

Comment 4 Eran Cohen 2021-03-21 07:30:37 UTC
There is work in progress that should mitigate this issue (the user won't need to reset the installation & reboot all nodes).
If the assisted-installer fails for any reason during preparing-for-installation, it will set the cluster status to insufficient.
The cluster will then return to ready status if all is well.

Comment 5 Udi Kalifon 2021-03-22 13:44:35 UTC
This will still fail the automation, and I think most users won't want to manually retry the installation, even if it's simple. Would you consider adding the retry after all?

Comment 6 Eran Cohen 2021-03-25 07:42:34 UTC
Sure, adding retries makes sense regardless of how the installation gets back on track.
I'll reopen and remove the won't fix resolution.

Comment 7 Yuri Obshansky 2021-05-05 13:14:38 UTC
Verified on OCP-Metal-v1.0.19.1

