Description of problem:
After updating the assisted-service image, once installation starts for a multi-node static IPv4 cluster, agent validations fail with the message:
Failed to fetch container images needed for installation from quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12

Version-Release number of selected component (if applicable):
master

How reproducible:
3/3

Steps to Reproduce:
1. Update assisted-service to 8cf51023fecb0ef0d213b8d0d216176f903f0190
2. Start a multi-node static IPv4 installation

Actual results:
Installation stops and agent validation fails

Expected results:
Installation completes

Additional info:
The issue seems to stem from stale validations getting stuck in the images_status host column in the database:

{
  "quay.io/ocpmetal/assisted-installer:latest": {
    "download_rate": 5.410359904885138,
    "name": "quay.io/ocpmetal/assisted-installer:latest",
    "result": "success",
    "size_bytes": 483222869,
    "time": 89.314366788
  },
  "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64": {
    "download_rate": 10.222948248549839,
    "name": "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64",
    "result": "success",
    "size_bytes": 336632138,
    "time": 32.92906604
  },
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12": {
    "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12",
    "result": "failure"
  },
  "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361": {
    "download_rate": 36.50355792926888,
    "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361",
    "result": "success",
    "size_bytes": 442229333,
    "time": 12.114691227
  }
}

Notice the two different quay.io/openshift-release-dev/ocp-v4.0-art-dev images in particular: the first one failed to pull for whatever reason. The service was then upgraded and no longer asked the agent to pull that image, so the failed status is now "stuck" in the database forever, causing validations to fail even though the image is no longer relevant and no longer needed for the installation to proceed.
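A minimal sketch of the kind of fix this implies, with hypothetical type and function names (the actual assisted-service schema and validation code differ): before evaluating the "images available" validation, prune any cached entry in images_status whose image is no longer in the currently required set, so a stale failure for a superseded release image cannot block installation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ImageStatus mirrors one entry in the host's images_status column.
// Field names here are illustrative, not the exact service schema.
type ImageStatus struct {
	Name   string `json:"name"`
	Result string `json:"result"`
}

// pruneStaleStatuses drops any cached image status whose name is not in
// the set of images currently required for installation.
func pruneStaleStatuses(raw []byte, required map[string]bool) (map[string]ImageStatus, error) {
	var statuses map[string]ImageStatus
	if err := json.Unmarshal(raw, &statuses); err != nil {
		return nil, err
	}
	for name := range statuses {
		if !required[name] {
			delete(statuses, name) // stale entry from an earlier release image
		}
	}
	return statuses, nil
}

// allImagesAvailable is the validation: every currently required image
// must have a successful pull recorded.
func allImagesAvailable(statuses map[string]ImageStatus, required map[string]bool) bool {
	for name := range required {
		s, ok := statuses[name]
		if !ok || s.Result != "success" {
			return false
		}
	}
	return true
}

func main() {
	// Simplified version of the JSON above: an old release image that
	// failed to pull, plus the image actually needed now.
	raw := []byte(`{
		"old-release-image": {"name": "old-release-image", "result": "failure"},
		"new-release-image": {"name": "new-release-image", "result": "success"}
	}`)
	required := map[string]bool{"new-release-image": true}

	statuses, err := pruneStaleStatuses(raw, required)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(statuses), allImagesAvailable(statuses, required))
}
```

With the pruning in place the stale failure is discarded and the validation passes on the image that was actually pulled, instead of failing forever on an image the agent is no longer asked to download.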
@ppinjark :wave: hey, can we look into this in more detail? Omer and Yuval did an initial investigation; we need to assess how critical the bug is and what we can do. It would be interesting if we could test this in the upgrade CI.
I want to note that this is not an issue with upgrades specifically, but an issue with any change to the to-be-installed OpenShift release image - which might be caused by, among other things, a service upgrade (since such an upgrade might change the specific images we use for each major release). The issue can be recreated much more easily by just changing a cluster's ClusterImageSet in kubeapi AI - if the agent happened to fail to pull the image referenced by the first ClusterImageSet, it will fail validations forever, even if the new image gets pulled just fine.
I think we should fix this issue, but it's not worth backporting. We don't support changing the OCP version after creation. Let's add a proper validation and fail gracefully rather than leaving the cluster in an undesired state.
@otuchfel is that similar to https://bugzilla.redhat.com/show_bug.cgi?id=2012099 ?
Might be, might not be. Can't know without looking at the database / having access to the agents
The fix for this bug is to be handled by https://issues.redhat.com/browse/OCPBUGSM-34971 and https://github.com/openshift/assisted-service/pull/2752
Verified on 2.5.0-DOWNSTREAM-2022-04-12-04-50-40
The issue is also seen in OCP 4.9 with ACM 2.4. Two SNO clusters were installed; one of them failed for this reason. It happens randomly and cannot be reproduced reliably.
Verified on 2.4.3-DOWNSTREAM-2022-04-13-07-05-00
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.4.4 security updates and bug fixes), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1681