Bug 2008583
Summary: | Agents wrong validation failure on failing to fetch image needed for installation | ||
---|---|---|---|
Product: | Red Hat Advanced Cluster Management for Kubernetes | Reporter: | Trey West <trwest> |
Component: | Infrastructure Operator | Assignee: | Igal Tsoiref <itsoiref> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | medium | Docs Contact: | Derek <dcadzow> |
Priority: | medium | ||
Version: | rhacm-2.4 | CC: | aos-bugs, asegurap, ccrum, fpercoco, juhsu, mfilanov, otuchfel, ppinjark, rlopezma, trwest, yfirst |
Target Milestone: | --- | Keywords: | Triaged |
Target Release: | rhacm-2.5 | Flags: | juhsu: rhacm-2.4.z+ |
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | AI-Team-Core | ||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2022-05-03 16:44:03 UTC | Type: | Bug |
Description
Trey West
2021-09-28 15:27:15 UTC
The issue seems to stem from stale validations getting stuck in the images_status host column in the database:
>{
> "quay.io/ocpmetal/assisted-installer:latest": {
> "download_rate": 5.410359904885138,
> "name": "quay.io/ocpmetal/assisted-installer:latest",
> "result": "success",
> "size_bytes": 483222869,
> "time": 89.314366788
> },
> "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64": {
> "download_rate": 10.222948248549839,
> "name": "quay.io/openshift-release-dev/ocp-release:4.8.10-x86_64",
> "result": "success",
> "size_bytes": 336632138,
> "time": 32.92906604
> },
> "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12": {
> "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:415e56a6d8d2b8576c59d08a7c57a7d081fa59e7dc3cd8782ed7bcd6080c7c12",
> "result": "failure"
> },
> "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361": {
> "download_rate": 36.50355792926888,
> "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:befaf3fd0a48a325588ef3cd50c92dde165a0f80b153007e672cc8df95d39361",
> "result": "success",
> "size_bytes": 442229333,
> "time": 12.114691227
> }
>}
Notice the 2 different quay.io/openshift-release-dev/ocp-v4.0-art-dev images in particular - the first one failed pulling for whatever reason. Then the service got upgraded, and the service was no longer asking the agent to try and pull it. So now it's "stuck" forever in the database, causing validations to fail, even though the image is no longer relevant and no longer needed for installation to proceed.
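The failure mode described above can be illustrated with a minimal Python sketch. It is not code from assisted-service: the function names (`failed_images`, `prune_stale`, `images_available`) and the shortened image names are hypothetical, and only the `result` field mirrors the JSON above. It assumes the service knows the list of images it currently requires, and shows how a check over every stored entry fails on a stale record while a check restricted to the required images passes.

```python
def failed_images(images_status):
    """Return the names of all stored entries whose pull result is 'failure'."""
    return sorted(
        name for name, entry in images_status.items()
        if entry.get("result") == "failure"
    )


def prune_stale(images_status, required_images):
    """Keep only the entries for images that are currently required."""
    required = set(required_images)
    return {
        name: entry for name, entry in images_status.items()
        if name in required
    }


def images_available(images_status, required_images):
    """Pass only if every required image has a recorded successful pull."""
    return all(
        images_status.get(name, {}).get("result") == "success"
        for name in required_images
    )


# Hypothetical data shaped like the images_status column shown above.
status = {
    "example.io/installer:latest": {"result": "success"},
    # Stale entry: the pull failed, and the image is no longer requested.
    "example.io/release-payload@sha256:old": {"result": "failure"},
    "example.io/release-payload@sha256:new": {"result": "success"},
}
required = [
    "example.io/installer:latest",
    "example.io/release-payload@sha256:new",
]

# A check over every stored entry trips on the stale record.
naive_ok = not failed_images(status)

# A check restricted to the currently required images ignores it.
scoped_ok = images_available(prune_stale(status, required), required)

print("naive check passes:", naive_ok)
print("scoped check passes:", scoped_ok)
```

Whether stale entries should be deleted from the database or merely ignored at validation time is a design choice this sketch does not settle; it only shows that scoping the check to the current image set is enough to avoid the stuck validation.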
@ppinjark :wave: hey, can we look into this in more detail? Omer and Yuval did an initial investigation; we need to assess how critical the bug is and what we can do. It would be interesting if we could test this in the upgrade CI.

I want to note that this is not an issue with upgrades specifically, but an issue with any change of the to-be-installed OpenShift release image, which might be caused by, among other things, a service upgrade (since such an upgrade might change the specific images we use for each major release). But this issue can be recreated much more easily by just changing a cluster's ClusterImageSet in kubeapi AI: if the agent happened to fail to pull the image that was in the first ClusterImageSet, then it will fail validations forever, even if the new image gets pulled just fine.

I think we should fix this issue, but it's not worth backporting. We don't support changing the OCP version after creation. Let's add a proper validation and fail gracefully rather than leaving the cluster in an undesired state.

@otuchfel is that similar to https://bugzilla.redhat.com/show_bug.cgi?id=2012099 ?

Might be, might not be. Can't know without looking at the database / having access to the agents.

The fix for this bug is to be handled by https://issues.redhat.com/browse/OCPBUGSM-34971 and https://github.com/openshift/assisted-service/pull/2752

Verified on 2.5.0-DOWNSTREAM-2022-04-12-04-50-40

Issue also seen in OCP 4.9 with ACM 2.4. Two SNO clusters were installed, and one of them failed for this reason. It happens randomly and cannot be reproduced reliably.

Verified on 2.4.3-DOWNSTREAM-2022-04-13-07-05-00

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Advanced Cluster Management 2.4.4 security updates and bug fixes), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:1681