Bug 2180146

Summary: upgrade cnv from 4.12.1 to v4.13.0.rhel9-1819 is stuck
Product: Container Native Virtualization (CNV) Reporter: Debarati Basu-Nag <dbasunag>
Component: Virtualization Assignee: Igor Bezukh <ibezukh>
Status: CLOSED ERRATA QA Contact: Kedar Bidarkar <kbidarka>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 4.13.0 CC: acardace, dshchedr, fdeutsch, guchen, ibezukh, iholder, jlejosne, jpeimer, kbidarka, kmajcher, lpivarc, pousley, stirabos, ycui
Target Milestone: --- Keywords: TestBlocker, UpgradeBlocker
Target Release: 4.13.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: hco-bundle-v4.13.0.rhel9-2129 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-05-18 02:58:23 UTC Type: Bug

Comment 2 Simone Tiraboschi 2023-03-20 21:11:34 UTC
This is a real bug: we broke it when we started referencing kubevirt images with the full URL to correctly handle the -rhel9 suffix.

On HCO deployment we have:

    spec:
      containers:
      - command:
        - hyperconverged-cluster-operator
        env:
...
        - name: KUBEVIRT_VERSION
          value: sha256:71bec82a806105f964ca1b5a5827765f7d2d5e2bf5a09e4ec2bf192ee2d9b4fe


and on virt-operator deployment we have:

    spec:
...
      containers:
...
        command:
        - virt-operator
        env:
...
        - name: KUBEVIRT_VERSION
          value: sha256:71bec82a806105f964ca1b5a5827765f7d2d5e2bf5a09e4ec2bf192ee2d9b4fe


so once KubeVirt has successfully progressed to the next version,
HCO expects virt-operator to report this on the KubeVirt CR:

    status:
    ...
      operatorVersion: sha256:71bec82a806105f964ca1b5a5827765f7d2d5e2bf5a09e4ec2bf192ee2d9b4fe

but now we have:

    status:
    ...
      operatorVersion: v0.59.0-38-g38327148b


The root cause is here:

func (c *KubeVirtDeploymentConfig) GetOperatorVersion() string {
	if c.UseShasums() {
		return c.VirtOperatorSha
	}
	return c.KubeVirtVersion
}

...

func (c *KubeVirtDeploymentConfig) UseShasums() bool {
	return c.VirtOperatorSha != "" && c.VirtApiSha != "" && c.VirtControllerSha != "" && c.VirtHandlerSha != "" && c.VirtLauncherSha != ""
}

in https://github.com/kubevirt/kubevirt/blob/9d100a211ca3bfe0925137b002d8b6b4590da922/pkg/virt-operator/util/config.go#LL476-L478C2


and now we are using:
        - name: VIRT_API_IMAGE
          value: registry.redhat.io/container-native-virtualization/virt-api-rhel9@sha256:1b6ea3379e211320d4d0dbc95f31d39fe77f50527921466b92e5b3c8bd6671b2
        - name: VIRT_CONTROLLER_IMAGE
          value: registry.redhat.io/container-native-virtualization/virt-controller-rhel9@sha256:7da1fb9817791d97a5295bfdd7eb58cf9d5ea572cad6b42fbcae83efe5d562e6
        - name: VIRT_HANDLER_IMAGE
          value: registry.redhat.io/container-native-virtualization/virt-handler-rhel9@sha256:e4ff8bf9d57f500cbb3c70a9035e405c51cc9c2266fda589112551725c70fbb6
        - name: VIRT_LAUNCHER_IMAGE
          value: registry.redhat.io/container-native-virtualization/virt-launcher-rhel9@sha256:8a614bd3db0c4e487f1dd55ee42ddcedb6fbdd153d6bf77132235a090561815c
        - name: VIRT_EXPORTPROXY_IMAGE
          value: registry.redhat.io/container-native-virtualization/virt-exportproxy-rhel9@sha256:5e26e46c7d83ff78f3e22e37405d6d5102037567f2d5cb6fcf267707d6a34b57
        - name: VIRT_EXPORTSERVER_IMAGE
          value: registry.redhat.io/container-native-virtualization/virt-exportserver-rhel9@sha256:9ad67f9f27dd483f9ed354d60be2c7f3e51b7d4f889e7f481f7c4ed5a05215b1
        - name: GS_IMAGE
          value: registry.redhat.io/container-native-virtualization/libguestfs-tools-rhel9@sha256:fedaf6e56ece795c600384bd2262029505be3aa4b6dc0a70b51422be02d35ca2
        - name: KUBEVIRT_VERSION
          value: sha256:71bec82a806105f964ca1b5a5827765f7d2d5e2bf5a09e4ec2bf192ee2d9b4fe


instead of:
                - name: KUBEVIRT_VERSION
                  value: sha256:8282de03fb1f9c5e083e1306ccf1f7dcd21c44c8d24db389288dffea8521d05c
                - name: VIRT_API_SHASUM
                  value: sha256:9b2ecdf39075135b9858ff9c174b19771ef8f7f8f1d516178c124f2f2c618f88
                - name: VIRT_CONTROLLER_SHASUM
                  value: sha256:f09248b28154dcf64a38d4cdf0a14771af4e8988fc2dbfc13d2126ce93cc5f36
                - name: VIRT_HANDLER_SHASUM
                  value: sha256:1ab0f7eefed6bd5d8906f64444e5e6da59d65d5caef18b288a0f6cf790a198a0
                - name: VIRT_LAUNCHER_SHASUM
                  value: sha256:f2c199bdf254adbef295ef3c533ec1cccb4f08de1a79e246ff6c9cf5b2b3c863
                - name: VIRT_EXPORTPROXY_SHASUM
                  value: sha256:5fff19dce2924aee371fc2cfe234ebf223b5e05093d9f16a289ee9b055759226
                - name: VIRT_EXPORTSERVER_SHASUM
                  value: sha256:686d2ce29de8c85e7a2a06388fbd21a5b5fb4e2d9164472b82b4bc43d2300212
                - name: GS_SHASUM
                  value: sha256:0de4d7a9a408e9542fa851f2002baf73caa6321e1f70d5a56b4c62f624c8bbf3

as we did up to CNV 4.12.

Moving this to virt component.

Comment 3 Simone Tiraboschi 2023-03-20 21:18:22 UTC
@iholder FYI

Comment 4 Fabian Deutsch 2023-03-21 08:40:25 UTC
Simone, could it be that HCO is not using the API correctly? I see:

		Registry:              registry,
		ImagePrefix:           imagePrefix,

https://github.com/kubevirt/kubevirt/blob/9d100a211ca3bfe0925137b002d8b6b4590da922/pkg/virt-operator/util/config.go#L371-L372

Thus the registry and image prefix (of all images) should be separated from the image name.
So shouldn't HCO just set registry and imagePrefix (not nothing?) and then the image names including the shasum?

Comment 5 Simone Tiraboschi 2023-03-21 09:42:36 UTC
I tend to think it's irrelevant:
before https://github.com/kubevirt/kubevirt/pull/8673
virt-operator was composing image names like:
				Image: fmt.Sprintf("%s/%s%s%s", repository, imagePrefix, "virt-launcher", launcherVersion),

where "virt-launcher" was hardcoded, which prevented us from renaming it to "virt-launcher-rhel9"; hence the aforementioned PR.

With:
https://github.com/kubevirt/kubevirt/pull/8673/commits/0bc97bbabfd30968b9abde1670d22778c4410ee6

passing VirtLauncherShasumEnvName got deprecated and we now have the ability to directly pass
- name: VIRT_LAUNCHER_IMAGE
  value: some.registry.com:tag5

which is now independent from the value of repository and imagePrefix.
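
The difference between the two ways of naming the image can be sketched as follows. This is only an illustration, not the virt-operator code: `resolveLauncherImage` and its parameters are hypothetical names, and the sketch assumes that the version argument already carries its separator (`:tag` or `@sha256:...`), as the `%s/%s%s%s` format string quoted above suggests.

```go
package main

import "fmt"

// resolveLauncherImage is a hypothetical helper, not part of KubeVirt.
// If a full image reference is supplied (as VIRT_LAUNCHER_IMAGE would
// provide), it is used verbatim and repository/imagePrefix are ignored.
// Otherwise the name is composed the old way, with "virt-launcher"
// hardcoded.
func resolveLauncherImage(override, repository, imagePrefix, version string) string {
	if override != "" {
		return override
	}
	// Assumption: version includes its own separator, e.g. ":v1".
	return fmt.Sprintf("%s/%s%s%s", repository, imagePrefix, "virt-launcher", version)
}

func main() {
	// Old style: the image name cannot be changed to virt-launcher-rhel9.
	fmt.Println(resolveLauncherImage("", "example.registry/ns", "", ":v1"))
	// New style: the full reference wins, whatever the other values are.
	fmt.Println(resolveLauncherImage("some.registry.com:tag5", "example.registry/ns", "", ":v1"))
}
```

The registry and namespace values in `main` are placeholders chosen for the example.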

Now, due to that, we are no longer using VIRT_LAUNCHER_SHASUM but VIRT_LAUNCHER_IMAGE

so
func (c *KubeVirtDeploymentConfig) UseShasums() bool {
	return c.VirtOperatorSha != "" && c.VirtApiSha != "" && c.VirtControllerSha != "" && c.VirtHandlerSha != "" && c.VirtLauncherSha != ""
}

returns false

and so
func (c *KubeVirtDeploymentConfig) GetOperatorVersion() string {
	if c.UseShasums() {
		return c.VirtOperatorSha
	}
	return c.KubeVirtVersion
}

returns a fixed tag instead of the SHA digest of the operator as expected by HCO.
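
A minimal, self-contained model of this fallback is sketched below. The `deploymentConfig` type is a simplified stand-in for `KubeVirtDeploymentConfig` that models only the fields quoted above; how the real fields are populated from the environment is not shown, and the digest strings are short placeholders rather than real values.

```go
package main

import "fmt"

// deploymentConfig is a simplified, hypothetical stand-in for
// KubeVirtDeploymentConfig; only the fields quoted above are modeled.
type deploymentConfig struct {
	KubeVirtVersion   string
	VirtOperatorSha   string
	VirtApiSha        string
	VirtControllerSha string
	VirtHandlerSha    string
	VirtLauncherSha   string
}

// UseShasums mirrors the quoted check: every shasum field must be set.
func (c *deploymentConfig) UseShasums() bool {
	return c.VirtOperatorSha != "" && c.VirtApiSha != "" &&
		c.VirtControllerSha != "" && c.VirtHandlerSha != "" &&
		c.VirtLauncherSha != ""
}

// GetOperatorVersion mirrors the quoted fallback to KubeVirtVersion.
func (c *deploymentConfig) GetOperatorVersion() string {
	if c.UseShasums() {
		return c.VirtOperatorSha
	}
	return c.KubeVirtVersion
}

func main() {
	// 4.12 style: all shasum fields populated (placeholder digests).
	withShasums := &deploymentConfig{
		KubeVirtVersion:   "v0.0.0-placeholder",
		VirtOperatorSha:   "sha256:operator",
		VirtApiSha:        "sha256:api",
		VirtControllerSha: "sha256:controller",
		VirtHandlerSha:    "sha256:handler",
		VirtLauncherSha:   "sha256:launcher",
	}
	// 4.13 style: component shasums are empty because only the
	// VIRT_*_IMAGE variables are passed.
	withImagesOnly := &deploymentConfig{
		KubeVirtVersion: "v0.0.0-placeholder",
		VirtOperatorSha: "sha256:operator",
	}
	fmt.Println(withShasums.GetOperatorVersion())
	fmt.Println(withImagesOnly.GetOperatorVersion())
}
```

In the second case the component shasum fields are empty, so the version string is returned even though an operator digest is available.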

As a quick and dirty workaround we could pass both VIRT_LAUNCHER_SHASUM and VIRT_LAUNCHER_IMAGE (and so on) just to unblock this.
But we will still need a proper fix, since VIRT_LAUNCHER_SHASUM and the other *_SHASUM variables are deprecated.

Comment 6 Igor Bezukh 2023-03-21 09:59:53 UTC
So IIUC we should also check whether a SHA digest is being used instead of a tag in the VIRT_*_IMAGE definitions.
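
One way such a check could look is sketched below. This is an assumption about the shape of a fix, not the change that was actually merged: `digestFromImage` and `useDigests` are hypothetical helpers, and the sketch only looks for an `@sha256:` suffix without validating the hex length.

```go
package main

import (
	"fmt"
	"strings"
)

// digestFromImage is a hypothetical helper. It returns the digest part of
// an image reference such as "registry/ns/name@sha256:<hex>", and false
// when the reference carries no sha256 digest (for example "name:tag").
func digestFromImage(image string) (string, bool) {
	idx := strings.LastIndex(image, "@")
	if idx < 0 {
		return "", false
	}
	digest := image[idx+1:]
	if !strings.HasPrefix(digest, "sha256:") || len(digest) == len("sha256:") {
		return "", false
	}
	return digest, true
}

// useDigests reports whether at least one image is given and every image
// reference is pinned by digest rather than by tag.
func useDigests(images ...string) bool {
	if len(images) == 0 {
		return false
	}
	for _, img := range images {
		if _, ok := digestFromImage(img); !ok {
			return false
		}
	}
	return true
}

func main() {
	api := "registry.redhat.io/container-native-virtualization/virt-api-rhel9@sha256:1b6ea3379e211320d4d0dbc95f31d39fe77f50527921466b92e5b3c8bd6671b2"
	tagged := "some.registry.com:tag5"
	fmt.Println(useDigests(api))
	fmt.Println(useDigests(api, tagged))
}
```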

Comment 7 Antonio Cardace 2023-03-21 10:16:12 UTC
@stirabos thanks for the detailed troubleshooting, this looks like a real regression KubeVirt should fix.

Comment 8 Igor Bezukh 2023-03-30 07:54:31 UTC
It looks like the root cause for this issue is that the observedKubevirtVersion and targetKubevirtVersion status fields are missing in the KV CR. I would expect them to exist and have the "latest" value.

Can you please attach the virt-operator pod logs?

TIA
Igor

Comment 9 Igor Bezukh 2023-03-30 13:33:06 UTC
Also, can you please provide the virt-operator deployment YAML, or a must-gather bundle if possible?

Comment 10 Debarati Basu-Nag 2023-04-14 23:51:26 UTC
I verified with v4.13.0.rhel9-2051 
===========
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.versions"
[
  {
    "name": "operator",
    "version": "4.12.3"
  }
]
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.conditions"
[
  {
    "lastTransitionTime": "2023-04-14T21:23:19Z",
    "message": "Reconcile completed successfully",
    "observedGeneration": 8,
    "reason": "ReconcileCompleted",
    "status": "True",
    "type": "ReconcileComplete"
  },
  {
    "lastTransitionTime": "2023-04-14T21:29:00Z",
    "message": "Reconcile completed successfully",
    "observedGeneration": 8,
    "reason": "ReconcileCompleted",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2023-04-14T21:23:19Z",
    "message": "HCO is now upgrading to version 4.13.0",
    "observedGeneration": 8,
    "reason": "HCOUpgrading",
    "status": "True",
    "type": "Progressing"
  },
  {
    "lastTransitionTime": "2023-04-14T21:29:00Z",
    "message": "Reconcile completed successfully",
    "observedGeneration": 8,
    "reason": "ReconcileCompleted",
    "status": "False",
    "type": "Degraded"
  },
  {
    "lastTransitionTime": "2023-04-14T21:29:00Z",
    "message": "Reconcile completed successfully",
    "observedGeneration": 8,
    "reason": "ReconcileCompleted",
    "status": "True",
    "type": "Upgradeable"
  }
]
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ 


Please let me know if this was fixed in a later build.
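
For repeated checks, the same condition can be read programmatically instead of by eye. The sketch below is only an illustration: `upgradeInProgress` is a hypothetical helper, and it decodes just the condition fields that appear in the jq output above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// condition models only the fields of an HCO status condition used here.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Reason  string `json:"reason"`
	Message string `json:"message"`
}

// upgradeInProgress is a hypothetical helper. It reports whether the
// Progressing condition is "True" and returns that condition's message.
// If no Progressing condition is present it returns false and "".
func upgradeInProgress(conditionsJSON []byte) (bool, string, error) {
	var conds []condition
	if err := json.Unmarshal(conditionsJSON, &conds); err != nil {
		return false, "", err
	}
	for _, c := range conds {
		if c.Type == "Progressing" {
			return c.Status == "True", c.Message, nil
		}
	}
	return false, "", nil
}

func main() {
	// Trimmed copy of the Progressing condition from the output above.
	raw := []byte(`[{"type":"Progressing","status":"True","reason":"HCOUpgrading","message":"HCO is now upgrading to version 4.13.0"}]`)
	inProgress, msg, err := upgradeInProgress(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(inProgress, msg)
}
```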

Comment 11 Igor Bezukh 2023-04-15 07:53:04 UTC
Hi,

Sorry if I wasn't clear. I need the must-gather logs, since the information provided isn't sufficient for reproducing the problem or for understanding its scope.

Comment 14 Debarati Basu-Nag 2023-04-25 16:06:14 UTC
Successfully upgraded OCP:
===========================
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get clusterversion -o json | jq ".items[0].status.history"
[
  {
    "acceptedRisks": "Target release version=\"\" image=\"quay.io/openshift-release-dev/ocp-release:4.13.0-rc.5-x86_64\" cannot be verified, but continuing anyway because the update was forced: release images that are not accessed via digest cannot be verified\nForced through blocking failures: Multiple precondition checks failed:\n* Precondition \"ClusterVersionUpgradeable\" failed because of \"MultipleReasons\": Cluster should not be upgraded between minor versions for multiple reasons: ClusterVersionOverridesSet,AdminAckRequired\n* Disabling ownership via cluster version overrides prevents upgrades. Please remove overrides before continuing.\n* Kubernetes 1.26 and therefore OpenShift 4.13 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958394 for details and instructions.\n* Precondition \"EtcdRecentBackup\" failed because of \"ControllerStarted\": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required\n* Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"UnknownUpdate\": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.12.14 to 4.13.0-rc.5 is unknown.",
    "completionTime": "2023-04-25T01:05:31Z",
    "image": "quay.io/openshift-release-dev/ocp-release:4.13.0-rc.5-x86_64",
    "startedTime": "2023-04-24T22:48:17Z",
    "state": "Completed",
    "verified": false,
    "version": "4.13.0-rc.5"
  },
  {
    "completionTime": "2023-04-24T20:28:29Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:157cc02d63bfe67988429fd803da632e495e230d811759b1aed1e6ffa7a3f31a",
    "startedTime": "2023-04-24T19:03:29Z",
    "state": "Completed",
    "verified": false,
    "version": "4.12.14"
  }
]
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ 
=========================
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get csv -n openshift-cnv
NAME                                       DISPLAY                                          VERSION    REPLACES                                   PHASE
jaeger-operator.v1.39.0-3                  Red Hat OpenShift distributed tracing platform   1.39.0-3   jaeger-operator.v1.34.1-5                  Succeeded
kiali-operator.v1.57.6                     Kiali Operator                                   1.57.6     kiali-operator.v1.57.5                     Succeeded
kubevirt-hyperconverged-operator.v4.13.0   OpenShift Virtualization                         4.13.0     kubevirt-hyperconverged-operator.v4.12.3   Succeeded
openshift-pipelines-operator-rh.v1.10.0    Red Hat OpenShift Pipelines                      1.10.0                                                Succeeded
servicemeshoperator.v2.3.2                 Red Hat OpenShift Service Mesh                   2.3.2-0    servicemeshoperator.v2.3.1                 Succeeded
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ 
========================
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$ oc get hco kubevirt-hyperconverged -n openshift-cnv -o json | jq ".status.versions"
[
  {
    "name": "operator",
    "version": "4.13.0"
  }
]
[cnv-qe-jenkins@cnv-qe-infra-01 ~]$

Comment 16 errata-xmlrpc 2023-05-18 02:58:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.13.0 Images security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3205