Bug 2180146
| Summary: | upgrade cnv from 4.12.1 to v4.13.0.rhel9-1819 is stuck | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Debarati Basu-Nag <dbasunag> |
| Component: | Virtualization | Assignee: | Igor Bezukh <ibezukh> |
| Status: | CLOSED ERRATA | QA Contact: | Kedar Bidarkar <kbidarka> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 4.13.0 | CC: | acardace, dshchedr, fdeutsch, guchen, ibezukh, iholder, jlejosne, jpeimer, kbidarka, kmajcher, lpivarc, pousley, stirabos, ycui |
| Target Milestone: | --- | Keywords: | TestBlocker, UpgradeBlocker |
| Target Release: | 4.13.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | hco-bundle-v4.13.0.rhel9-2129 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-05-18 02:58:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
@iholder FYI. Simone, could it be that HCO is not correctly using the API? I see:

```go
Registry:    registry,
ImagePrefix: imagePrefix,
```

https://github.com/kubevirt/kubevirt/blob/9d100a211ca3bfe0925137b002d8b6b4590da922/pkg/virt-operator/util/config.go#L371-L372

Thus the registry and image prefix (of all images) should be separated from the image name. Shouldn't HCO just set registry and imagePrefix (not nothing?) and then the image names including the shasum?

I tend to think it's irrelevant: before https://github.com/kubevirt/kubevirt/pull/8673 virt-operator was composing image names like:

```go
Image: fmt.Sprintf("%s/%s%s%s", repository, imagePrefix, "virt-launcher", launcherVersion),
```

where "virt-launcher" was hardcoded, but this was preventing us from renaming it to "virt-launcher-rhel9", hence the aforementioned PR. With https://github.com/kubevirt/kubevirt/pull/8673/commits/0bc97bbabfd30968b9abde1670d22778c4410ee6, passing VirtLauncherShasumEnvName got deprecated and we now have the ability to directly pass:

```yaml
- name: VIRT_LAUNCHER_IMAGE
  value: some.registry.com:tag5
```

which is independent from the values of repository and imagePrefix.

Now, because of that, we are no longer using VIRT_LAUNCHER_SHASUM but VIRT_LAUNCHER_IMAGE, so:

```go
func (c *KubeVirtDeploymentConfig) UseShasums() bool {
	return c.VirtOperatorSha != "" && c.VirtApiSha != "" && c.VirtControllerSha != "" && c.VirtHandlerSha != "" && c.VirtLauncherSha != ""
}
```

returns false, and so:

```go
func (c *KubeVirtDeploymentConfig) GetOperatorVersion() string {
	if c.UseShasums() {
		return c.VirtOperatorSha
	}
	return c.KubeVirtVersion
}
```

returns a fixed tag instead of the SHA digest of the operator as expected by HCO.

As a quick and dirty workaround we can eventually pass both VIRT_LAUNCHER_SHASUM and VIRT_LAUNCHER_IMAGE (and so on) just to unblock this. But we will still need a proper fix, since VIRT_LAUNCHER_SHASUM and friends are deprecated.
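For illustration only, the image-name composition change discussed here can be sketched as follows. This is not the actual kubevirt source; function names and sample values are hypothetical:

```go
package main

import "fmt"

// Legacy behavior (pre kubevirt PR #8673): the launcher image name was
// composed from repository, imagePrefix, a hardcoded "virt-launcher"
// and a version suffix, so the image could not be renamed.
func composeImageLegacy(repository, imagePrefix, launcherVersion string) string {
	return fmt.Sprintf("%s/%s%s%s", repository, imagePrefix, "virt-launcher", launcherVersion)
}

// New behavior: an explicit VIRT_LAUNCHER_IMAGE value, when set, wins
// over the composed name, which is what allows the -rhel9 rename.
func resolveLauncherImage(explicitImage, repository, imagePrefix, launcherVersion string) string {
	if explicitImage != "" {
		return explicitImage
	}
	return composeImageLegacy(repository, imagePrefix, launcherVersion)
}

func main() {
	// Composed name: the "virt-launcher" part is fixed.
	fmt.Println(composeImageLegacy("registry.example.com/cnv", "", ":v4.13.0"))
	// Explicit image: independent of repository and imagePrefix.
	fmt.Println(resolveLauncherImage("some.registry.com/virt-launcher-rhel9:tag5", "registry.example.com/cnv", "", ":v4.13.0"))
}
```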
So IIUC we should also check whether a SHASUM is being used instead of a tag in the VIRT_*_IMAGE definitions.

@stirabos thanks for the detailed troubleshooting; this looks like a real regression that KubeVirt should fix.

It looks like the root cause of this issue is that the observedKubevirtVersion and targetKubevirtVersion status fields are missing in the KV CR. I would expect them to exist and hold the "latest" value. Can you please attach the virt-operator pod logs? TIA, Igor

Also, can you please provide the virt-operator deployment YAML, or a must-gather bundle if possible?

I verified with v4.13.0.rhel9-2051
===========
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.versions"
[
{
"name": "operator",
"version": "4.12.3"
}
]
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.conditions"
[
{
"lastTransitionTime": "2023-04-14T21:23:19Z",
"message": "Reconcile completed successfully",
"observedGeneration": 8,
"reason": "ReconcileCompleted",
"status": "True",
"type": "ReconcileComplete"
},
{
"lastTransitionTime": "2023-04-14T21:29:00Z",
"message": "Reconcile completed successfully",
"observedGeneration": 8,
"reason": "ReconcileCompleted",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2023-04-14T21:23:19Z",
"message": "HCO is now upgrading to version 4.13.0",
"observedGeneration": 8,
"reason": "HCOUpgrading",
"status": "True",
"type": "Progressing"
},
{
"lastTransitionTime": "2023-04-14T21:29:00Z",
"message": "Reconcile completed successfully",
"observedGeneration": 8,
"reason": "ReconcileCompleted",
"status": "False",
"type": "Degraded"
},
{
"lastTransitionTime": "2023-04-14T21:29:00Z",
"message": "Reconcile completed successfully",
"observedGeneration": 8,
"reason": "ReconcileCompleted",
"status": "True",
"type": "Upgradeable"
}
]
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$
Please let me know if this was fixed in a later build.
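As a side note to the earlier suggestion of checking whether the VIRT_*_IMAGE definitions pin a digest rather than a tag, a minimal check could look like this. The helper name is hypothetical, not kubevirt or HCO API:

```go
package main

import (
	"fmt"
	"strings"
)

// isDigestReference reports whether an image reference is pinned by a
// sha256 digest (…@sha256:<64 hex chars>) rather than a mutable tag.
// Hypothetical helper for illustration only.
func isDigestReference(image string) bool {
	const marker = "@sha256:"
	i := strings.LastIndex(image, marker)
	if i < 0 {
		return false
	}
	digest := image[i+len(marker):]
	if len(digest) != 64 {
		return false
	}
	for _, r := range digest {
		if !strings.ContainsRune("0123456789abcdef", r) {
			return false
		}
	}
	return true
}

func main() {
	// Digest-pinned reference (as CNV 4.13 sets VIRT_API_IMAGE): true.
	fmt.Println(isDigestReference("registry.redhat.io/container-native-virtualization/virt-api-rhel9@sha256:1b6ea3379e211320d4d0dbc95f31d39fe77f50527921466b92e5b3c8bd6671b2"))
	// Tag-based reference: false.
	fmt.Println(isDigestReference("some.registry.com/virt-launcher:tag5"))
}
```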
Hi, sorry if I wasn't clear: I need the must-gather logs, since the provided information isn't useful for reproducing the issue or for understanding the problem scope.

Successfully upgraded OCP:
===========================
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get clusterversion -o json | jq ".items[0].status.history"
[
{
"acceptedRisks": "Target release version=\"\" image=\"quay.io/openshift-release-dev/ocp-release:4.13.0-rc.5-x86_64\" cannot be verified, but continuing anyway because the update was forced: release images that are not accessed via digest cannot be verified\nForced through blocking failures: Multiple precondition checks failed:\n* Precondition \"ClusterVersionUpgradeable\" failed because of \"MultipleReasons\": Cluster should not be upgraded between minor versions for multiple reasons: ClusterVersionOverridesSet,AdminAckRequired\n* Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.\n* Kubernetes 1.26 and therefore OpenShift 4.13 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958394 for details and instructions.\n* Precondition \"EtcdRecentBackup\" failed because of \"ControllerStarted\": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required\n* Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"UnknownUpdate\": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.12.14 to 4.13.0-rc.5 is unknown.",
"completionTime": "2023-04-25T01:05:31Z",
"image": "quay.io/openshift-release-dev/ocp-release:4.13.0-rc.5-x86_64",
"startedTime": "2023-04-24T22:48:17Z",
"state": "Completed",
"verified": false,
"version": "4.13.0-rc.5"
},
{
"completionTime": "2023-04-24T20:28:29Z",
"image": "quay.io/openshift-release-dev/ocp-release@sha256:157cc02d63bfe67988429fd803da632e495e230d811759b1aed1e6ffa7a3f31a",
"startedTime": "2023-04-24T19:03:29Z",
"state": "Completed",
"verified": false,
"version": "4.12.14"
}
]
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$
=========================
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get csv -n openshift-cnv
NAME DISPLAY VERSION REPLACES PHASE
jaeger-operator.v1.39.0-3 Red Hat OpenShift distributed tracing platform 1.39.0-3 jaeger-operator.v1.34.1-5 Succeeded
kiali-operator.v1.57.6 Kiali Operator 1.57.6 kiali-operator.v1.57.5 Succeeded
kubevirt-hyperconverged-operator.v4.13.0 OpenShift Virtualization 4.13.0 kubevirt-hyperconverged-operator.v4.12.3 Succeeded
openshift-pipelines-operator-rh.v1.10.0 Red Hat OpenShift Pipelines 1.10.0 Succeeded
servicemeshoperator.v2.3.2 Red Hat OpenShift Service Mesh 2.3.2-0 servicemeshoperator.v2.3.1 Succeeded
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$
========================
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.versions"
[
{
"name": "operator",
"version": "4.13.0"
}
]
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:3205
This is a real bug: we broke it when we started referencing the kubevirt images by full URL in order to correctly handle the -rhel9 suffix.

On the HCO deployment we have:

```yaml
spec:
  containers:
  - command:
    - hyperconverged-cluster-operator
    env:
    ...
    - name: KUBEVIRT_VERSION
      value: sha256:71bec82a806105f964ca1b5a5827765f7d2d5e2bf5a09e4ec2bf192ee2d9b4fe
```

and on the virt-operator deployment we have:

```yaml
spec:
  ...
  containers:
  ...
    command:
    - virt-operator
    env:
    ...
    - name: KUBEVIRT_VERSION
      value: sha256:71bec82a806105f964ca1b5a5827765f7d2d5e2bf5a09e4ec2bf192ee2d9b4fe
```

so once KubeVirt has successfully progressed to the next version, HCO expects virt-operator to report on the KubeVirt CR:

```yaml
status:
  ...
  operatorVersion: sha256:71bec82a806105f964ca1b5a5827765f7d2d5e2bf5a09e4ec2bf192ee2d9b4fe
```

but now we have:

```yaml
status:
  ...
  operatorVersion: v0.59.0-38-g38327148b
```

The root cause is here, in https://github.com/kubevirt/kubevirt/blob/9d100a211ca3bfe0925137b002d8b6b4590da922/pkg/virt-operator/util/config.go#LL476-L478C2:

```go
func (c *KubeVirtDeploymentConfig) GetOperatorVersion() string {
	if c.UseShasums() {
		return c.VirtOperatorSha
	}
	return c.KubeVirtVersion
}

...

func (c *KubeVirtDeploymentConfig) UseShasums() bool {
	return c.VirtOperatorSha != "" && c.VirtApiSha != "" && c.VirtControllerSha != "" && c.VirtHandlerSha != "" && c.VirtLauncherSha != ""
}
```

and now we are using:

```yaml
- name: VIRT_API_IMAGE
  value: registry.redhat.io/container-native-virtualization/virt-api-rhel9@sha256:1b6ea3379e211320d4d0dbc95f31d39fe77f50527921466b92e5b3c8bd6671b2
- name: VIRT_CONTROLLER_IMAGE
  value: registry.redhat.io/container-native-virtualization/virt-controller-rhel9@sha256:7da1fb9817791d97a5295bfdd7eb58cf9d5ea572cad6b42fbcae83efe5d562e6
- name: VIRT_HANDLER_IMAGE
  value: registry.redhat.io/container-native-virtualization/virt-handler-rhel9@sha256:e4ff8bf9d57f500cbb3c70a9035e405c51cc9c2266fda589112551725c70fbb6
- name: VIRT_LAUNCHER_IMAGE
  value: registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:8a614bd3db0c4e487f1dd55ee42ddcedb6fbdd153d6bf77132235a090561815c
- name: VIRT_EXPORTPROXY_IMAGE
  value: registry.redhat.io/container-native-virtualization/virt-exportproxy-rhel9@sha256:5e26e46c7d83ff78f3e22e37405d6d5102037567f2d5cb6fcf267707d6a34b57
- name: VIRT_EXPORTSERVER_IMAGE
  value: registry.redhat.io/container-native-virtualization/virt-exportserver-rhel9@sha256:9ad67f9f27dd483f9ed354d60be2c7f3e51b7d4f889e7f481f7c4ed5a05215b1
- name: GS_IMAGE
  value: registry.redhat.io/container-native-virtualization/libguestfs-tools-rhel9@sha256:fedaf6e56ece795c600384bd2262029505be3aa4b6dc0a70b51422be02d35ca2
- name: KUBEVIRT_VERSION
  value: sha256:71bec82a806105f964ca1b5a5827765f7d2d5e2bf5a09e4ec2bf192ee2d9b4fe
```

instead of:

```yaml
- name: KUBEVIRT_VERSION
  value: sha256:8282de03fb1f9c5e083e1306ccf1f7dcd21c44c8d24db389288dffea8521d05c
- name: VIRT_API_SHASUM
  value: sha256:9b2ecdf39075135b9858ff9c174b19771ef8f7f8f1d516178c124f2f2c618f88
- name: VIRT_CONTROLLER_SHASUM
  value: sha256:f09248b28154dcf64a38d4cdf0a14771af4e8988fc2dbfc13d2126ce93cc5f36
- name: VIRT_HANDLER_SHASUM
  value: sha256:1ab0f7eefed6bd5d8906f64444e5e6da59d65d5caef18b288a0f6cf790a198a0
- name: VIRT_LAUNCHER_SHASUM
  value: sha256:f2c199bdf254adbef295ef3c533ec1cccb4f08de1a79e246ff6c9cf5b2b3c863
- name: VIRT_EXPORTPROXY_SHASUM
  value: sha256:5fff19dce2924aee371fc2cfe234ebf223b5e05093d9f16a289ee9b055759226
- name: VIRT_EXPORTSERVER_SHASUM
  value: sha256:686d2ce29de8c85e7a2a06388fbd21a5b5fb4e2d9164472b82b4bc43d2300212
- name: GS_SHASUM
  value: sha256:0de4d7a9a408e9542fa851f2002baf73caa6321e1f70d5a56b4c62f624c8bbf3
```

as we did up to CNV 4.12.

Moving this to the virt component.
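A minimal standalone sketch of the UseShasums/GetOperatorVersion fallback described above (trimmed struct for illustration, not the real KubeVirtDeploymentConfig):

```go
package main

import "fmt"

// Trimmed-down version of the config quoted above; field names follow
// the kubevirt snippet, but this is a sketch, not the actual type.
type deploymentConfig struct {
	KubeVirtVersion   string
	VirtOperatorSha   string
	VirtApiSha        string
	VirtControllerSha string
	VirtHandlerSha    string
	VirtLauncherSha   string
}

// useShasums mirrors UseShasums: true only when every SHASUM is set.
func (c *deploymentConfig) useShasums() bool {
	return c.VirtOperatorSha != "" && c.VirtApiSha != "" &&
		c.VirtControllerSha != "" && c.VirtHandlerSha != "" && c.VirtLauncherSha != ""
}

// operatorVersion mirrors GetOperatorVersion: it falls back to the
// version tag when the SHASUM fields are empty.
func (c *deploymentConfig) operatorVersion() string {
	if c.useShasums() {
		return c.VirtOperatorSha
	}
	return c.KubeVirtVersion
}

func main() {
	// With VIRT_*_IMAGE instead of VIRT_*_SHASUM, the Sha fields stay
	// empty, so the reported version is the tag, not the digest HCO expects.
	c := &deploymentConfig{KubeVirtVersion: "v0.59.0-38-g38327148b"}
	fmt.Println(c.useShasums())      // false
	fmt.Println(c.operatorVersion()) // v0.59.0-38-g38327148b
}
```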