Bug 2008583

Summary:	Agents wrong validation failure on failing to fetch image needed for installation
Product:	Red Hat Advanced Cluster Management for Kubernetes	Reporter:	Trey West <trwest>
Component:	Infrastructure Operator	Assignee:	Igal Tsoiref <itsoiref>
Status:	CLOSED ERRATA	QA Contact:
Severity:	medium	Docs Contact:	Derek <dcadzow>
Priority:	medium
Version:	rhacm-2.4	CC:	aos-bugs, asegurap, ccrum, fpercoco, juhsu, mfilanov, otuchfel, ppinjark, rlopezma, trwest, yfirst
Target Milestone:	---	Keywords:	Triaged
Target Release:	rhacm-2.5	Flags:	juhsu: rhacm-2.4.z+
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	AI-Team-Core
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-05-03 16:44:03 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Trey West 2021-09-28 15:27:15 UTC

Description of problem:

After updating the assisted-service image, once installation starts for a multi-node static IPv4 cluster, agent validations fail with the message:

Failed to fetch container images needed for installation from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12

Version-Release number of selected component (if applicable):
master

How reproducible:
3/3

Steps to Reproduce:
1. Update assisted-service to 8cf51023fecb0ef0d213b8d0d216176f903f0190
2. Start multi-node static IPv4 installation

Actual results:

Installation stops and agent validation fails


Expected results:

Installation completes

Additional info:

Comment 1 Omer Tuchfeld 2021-09-28 15:35:27 UTC

The issue seems to stem from stale validations getting stuck in the images_status host column in the database:

>{
>  "quay.io/ocpmetal/assisted-installer:latest": {
>    "download_rate": 5.410359904885138,
>    "name": "quay.io/ocpmetal/assisted-installer:latest",
>    "result": "success",
>    "size_bytes": 483222869,
>    "time": 89.314366788
>  },
>  "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64": {
>    "download_rate": 10.222948248549839,
>    "name": "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64",
>    "result": "success",
>    "size_bytes": 336632138,
>    "time": 32.92906604
>  },
>  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12": {
>    "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12",
>    "result": "failure"
>  },
>  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361": {
>    "download_rate": 36.50355792926888,
>    "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361",
>    "result": "success",
>    "size_bytes": 442229333,
>    "time": 12.114691227
>  }
>}

Notice the two different quay.io/openshift-release-dev/ocp-v4.0-art-dev images in particular: the first one failed to pull for whatever reason. The service then got upgraded and was no longer asking the agent to try to pull that image. So the failure entry is now "stuck" forever in the database, causing validations to fail, even though the image is no longer relevant and no longer needed for installation to proceed.
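To illustrate the effect described above, here is a minimal sketch, not the assisted-service code. It assumes the validation simply scans every entry recorded in images_status, and it assumes the three successfully pulled images are the ones the service currently requires. The function names naive_validation_passes and filtered_validation_passes are hypothetical.

```python
# Sketch only: models the images_status column as a plain dict.
# Function names and the "required images" list are assumptions for illustration.

FAILED = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12"
CURRENT = "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361"
INSTALLER = "quay.io/ocpmetal/assisted-installer:latest"
RELEASE = "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64"

# Trimmed copy of the images_status content shown in this comment.
images_status = {
    INSTALLER: {"name": INSTALLER, "result": "success"},
    RELEASE: {"name": RELEASE, "result": "success"},
    FAILED: {"name": FAILED, "result": "failure"},
    CURRENT: {"name": CURRENT, "result": "success"},
}


def naive_validation_passes(status):
    """Fail if ANY recorded image failed, including stale entries."""
    return all(entry.get("result") != "failure" for entry in status.values())


def filtered_validation_passes(status, required_images):
    """Fail only if an image that is currently required has failed.

    An image with no recorded status yet is treated as pending, not failed.
    """
    for image in required_images:
        entry = status.get(image)
        if entry is not None and entry.get("result") == "failure":
            return False
    return True


# Assumed list of images the service asks for after the update.
required_now = [INSTALLER, RELEASE, CURRENT]

print(naive_validation_passes(images_status))
print(filtered_validation_passes(images_status, required_now))
```

Under these assumptions the naive check fails because of the stale entry alone, while the filtered check passes because every image that is still required was pulled successfully.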

Comment 3 Flavio Percoco 2021-10-06 16:33:08 UTC

@ppinjark :wave: hey, can we look into this in more detail? Omer and Yuval did an initial investigation; we need to assess how critical the bug is and what we can do.

It would be interesting to see if we can test this in the upgrade CI.

Comment 4 Omer Tuchfeld 2021-10-06 16:42:23 UTC

I want to note that this is not an issue with upgrades specifically, but an issue with any change of the to-be-installed OpenShift release image. Such a change might be caused by, among other things, a service upgrade (since such an upgrade might change the specific images we use for each major release).

But this issue can be much more easily recreated by just changing a cluster's ClusterImageSet in kubeapi AI - if the agent happened to fail to pull the image that was in the first ClusterImageSet, then it'll forever fail validations, even if the new image gets pulled just fine.
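The ClusterImageSet scenario can be sketched as follows. This is an illustration under stated assumptions, not the fix that was merged: the image references are placeholders, and prune_stale_images is a hypothetical helper showing one possible way to drop status entries for images that are no longer requested.

```python
# Sketch only: placeholder image references and a hypothetical pruning helper.

OLD_IMAGE = "registry.example/release@sha256:old"  # from the first ClusterImageSet
NEW_IMAGE = "registry.example/release@sha256:new"  # from the updated ClusterImageSet


def prune_stale_images(status, required_images):
    """Return a copy of status holding only images that are still required."""
    required = set(required_images)
    return {name: entry for name, entry in status.items() if name in required}


def has_failure(status):
    """True if any recorded image pull result is a failure."""
    return any(entry.get("result") == "failure" for entry in status.values())


# 1. The agent fails to pull the image from the first ClusterImageSet.
status = {OLD_IMAGE: {"name": OLD_IMAGE, "result": "failure"}}

# 2. The ClusterImageSet changes and the new image is pulled just fine.
status[NEW_IMAGE] = {"name": NEW_IMAGE, "result": "success"}

# Without pruning, the old failure still fails the validation.
print(has_failure(status))

# With pruning against the currently required images, the stale entry is gone.
pruned = prune_stale_images(status, [NEW_IMAGE])
print(has_failure(pruned))
```

Whether stale entries are better removed when the required image list changes or ignored at validation time is a design choice that this sketch does not settle.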

Comment 5 Flavio Percoco 2021-10-07 12:32:46 UTC

I think we should fix this issue, but it's not worth backporting it. We don't support changing the OCP version after cluster creation.

Let's add a proper validation and fail gracefully rather than leaving the cluster in an undesired state.

Comment 6 Michael Filanov 2021-10-10 07:50:44 UTC

Comment 7 Michael Filanov 2021-10-10 07:51:10 UTC

@otuchfel is that similar to https://bugzilla.redhat.com/show_bug.cgi?id=2012099 ?

Comment 8 Omer Tuchfeld 2021-10-10 10:00:10 UTC

Might be, might not be. Can't know without looking at the database or having access to the agents.

Comment 9 Pawan Pinjarkar 2021-10-12 14:29:19 UTC

The fix for this bug is to be handled by https://issues.redhat.com/browse/OCPBUGSM-34971 and https://github.com/openshift/assisted-service/pull/2752

Comment 10 Trey West 2022-04-13 14:55:16 UTC

Verified on 2.5.0-DOWNSTREAM-2022-04-12-04-50-40

Comment 11 rlopezma 2022-04-18 14:31:37 UTC

Issue also seen in OCP 4.9 with ACM 2.4.

Two SNO clusters were installed, and one of them failed for this reason. It happens randomly and cannot be reproduced reliably.

Comment 13 Trey West 2022-04-22 19:31:55 UTC

Verified on 2.4.3-DOWNSTREAM-2022-04-13-07-05-00

Comment 20 errata-xmlrpc 2022-05-03 16:44:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.4.4 security updates and bug fixes), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1681