Description of problem:
After updating the assisted-service image, once installation starts for a multi-node static IPv4 cluster, agent validations fail with the message:
Failed to fetch container images needed for installation from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12

Version-Release number of selected component (if applicable):
master

How reproducible:
3/3

Steps to Reproduce:
1. Update assisted-service to 8cf51023fecb0ef0d213b8d0d216176f903f0190
2. Start a multi-node static IPv4 installation

Actual results:
Installation stops and agent validation fails

Expected results:
Installation completes

Additional info:
The issue seems to stem from stale validations getting stuck in the images_status host column in the database:

{
  "quay.io/ocpmetal/assisted-installer:latest": {
    "download_rate": 5.410359904885138,
    "name": "quay.io/ocpmetal/assisted-installer:latest",
    "result": "success",
    "size_bytes": 483222869,
    "time": 89.314366788
  },
  "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64": {
    "download_rate": 10.222948248549839,
    "name": "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64",
    "result": "success",
    "size_bytes": 336632138,
    "time": 32.92906604
  },
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12": {
    "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12",
    "result": "failure"
  },
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361": {
    "download_rate": 36.50355792926888,
    "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361",
    "result": "success",
    "size_bytes": 442229333,
    "time": 12.114691227
  }
}

Notice the two different quay.io/openshift-release-dev/ocp-v4.0-art-dev images in particular: the first one failed to pull for whatever reason. The service was then upgraded and no longer asked the agent to pull that image, so the failed status is now "stuck" in the database forever, causing validations to fail even though the image is no longer relevant and no longer needed for the installation to proceed.
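A minimal sketch of the kind of fix this implies, with hypothetical type and function names (the actual assisted-service schema and validation code differ): before evaluating the "images available" validation, prune any cached entry in images_status whose image is no longer in the currently required set, so a stale failure for a superseded release image cannot block installation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ImageStatus mirrors one entry in the host's images_status column.
// Field names here are illustrative, not the exact service schema.
type ImageStatus struct {
	Name   string `json:"name"`
	Result string `json:"result"`
}

// pruneStaleStatuses drops any cached image status whose name is not in
// the set of images currently required for installation.
func pruneStaleStatuses(raw []byte, required map[string]bool) (map[string]ImageStatus, error) {
	var statuses map[string]ImageStatus
	if err := json.Unmarshal(raw, &statuses); err != nil {
		return nil, err
	}
	for name := range statuses {
		if !required[name] {
			delete(statuses, name) // stale entry from an earlier release image
		}
	}
	return statuses, nil
}

// allImagesAvailable is the validation: every currently required image
// must have a successful pull recorded.
func allImagesAvailable(statuses map[string]ImageStatus, required map[string]bool) bool {
	for name := range required {
		s, ok := statuses[name]
		if !ok || s.Result != "success" {
			return false
		}
	}
	return true
}

func main() {
	// Simplified version of the JSON above: an old release image that
	// failed to pull, plus the image actually needed now.
	raw := []byte(`{
		"old-release-image": {"name": "old-release-image", "result": "failure"},
		"new-release-image": {"name": "new-release-image", "result": "success"}
	}`)
	required := map[string]bool{"new-release-image": true}

	statuses, err := pruneStaleStatuses(raw, required)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(statuses), allImagesAvailable(statuses, required))
}
```

With the pruning in place the stale failure is discarded and the validation passes on the image that was actually pulled, instead of failing forever on an image the agent is no longer asked to download.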
@ppinjark :wave: hey, can we look into this in more detail? Omer and Yuval did an initial investigation; we need to assess how critical the bug is and what we can do. It would be interesting if we could test this in the upgrade CI.
I want to note that this is not an issue with upgrades specifically, but an issue with any change to the to-be-installed OpenShift release image - which might be caused by, among other things, a service upgrade (since such an upgrade might change the specific images we use for each major release). The issue can be recreated much more easily by just changing a cluster's ClusterImageSet in kubeapi AI - if the agent happened to fail to pull the image referenced by the first ClusterImageSet, it will fail validations forever, even if the new image gets pulled just fine.
I think we should fix this issue, but it's not worth backporting. We don't support changing the OCP version after creation. Let's add a proper validation and fail gracefully rather than leaving the cluster in an undesired state.
@otuchfel is that similar to https://bugzilla.redhat.com/show_bug.cgi?id=2012099 ?
Might be, might not be. Can't know without looking at the database / having access to the agents
The fix for this bug is to be handled by https://issues.redhat.com/browse/OCPBUGSM-34971 and https://github.com/openshift/assisted-service/pull/2752
Verified on 2.5.0-DOWNSTREAM-2022-04-12-04-50-40
The issue is also seen in OCP 4.9 with ACM 2.4. Two SNO clusters were installed; one of them failed for this reason. It happens randomly and cannot be reproduced reliably.
Verified on 2.4.3-DOWNSTREAM-2022-04-13-07-05-00
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.4.4 security updates and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1681