Bug 2008511 - [CNV-4.8.3] VMI is in LiveMigrate loop when Upgrading Cluster from 2.6.7/4.7.32 to OCP 4.8.13
Summary: [CNV-4.8.3] VMI is in LiveMigrate loop when Upgrading Cluster from 2.6.7/4.7.32 to OCP 4.8.13
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.6.7
Hardware: Unspecified
OS: Unspecified
Importance: high urgent
Target Milestone: ---
Target Release: 4.8.3
Assignee: Jed Lejosne
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks: 2010742 2013494 2014509
 
Reported: 2021-09-28 12:18 UTC by Kedar Bidarkar
Modified: 2022-01-04 17:25 UTC
CC List: 7 users

Fixed In Version: virt-operator-container-v4.8.3-5 hco-bundle-registry-container-v4.8.3-17
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2010742 2013494
Environment:
Last Closed: 2021-12-02 17:01:42 UTC
Target Upstream Version:
Embargoed:


Attachments
Virt Logs from the cluster. (2.94 MB, application/gzip)
2021-09-28 12:18 UTC, Kedar Bidarkar


Links
Github kubevirt/kubevirt pull 6482 (open): migration: generate empty isos on target for cloud-inits, configmaps, secrets, ... (last updated 2021-10-04 18:08:59 UTC)
Github kubevirt/kubevirt pull 6517 (open): [release-0.41] migration: generate empty isos on target for cloud-inits, configmaps, secrets, ... (last updated 2021-10-04 21:30:47 UTC)
Red Hat Product Errata RHSA-2021:4914 (last updated 2021-12-02 17:01:48 UTC)

Description Kedar Bidarkar 2021-09-28 12:18:33 UTC
Created attachment 1826962 [details]
Virt Logs from the cluster.

Description of problem:

I have been upgrading a cluster from OCP/CNV 4.6/2.5.8 to OCP/CNV 4.7.32/2.6.7 to OCP 4.8.13 for another bug verification.

CNV_SOURCE=osbs; the 2.5.8 to 2.6.7 upgrade path was: 2.5.8 -> 2.6.4 -> 2.6.5 -> 2.6.6 -> 2.6.7
After 2.6.7, the cluster status was good.


When upgrading from OCP/CNV 4.7.32/2.6.7 to OCP 4.8.13 (that is, CNV 2.6.7 running on OCP 4.8.13), we see the VMI in a LiveMigrate loop.

From the pastebin below, the upgrade seems complete, but the worker nodes are not yet updated to 4.8 images.

The worker nodes are stuck because the vm2-ocs-rhel84 VM is in a constant LiveMigrate loop.

I guess that if I stop/start or restart the VM, the worker nodes might continue with the upgrade, but I feel the VMI should have live-migrated successfully.
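For reference, the stop/start or restart workaround mentioned above would look roughly like this (virtctl stop/start/restart are standard subcommands; the default namespace is assumed, as elsewhere in this bug):

$ virtctl restart vm2-ocs-rhel84 -n default    # or: virtctl stop vm2-ocs-rhel84 followed by virtctl start vm2-ocs-rhel84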

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
The upgrade is stuck due to a constant VMI live-migrate loop.

Expected results:
The upgrade is not stuck due to a VMI live-migrate loop; the VMI live-migrates successfully.

Additional info:
Running must-gather shows:

ClusterOperators:
	clusteroperator/machine-config is not available (Cluster not available for 4.8.13) because Failed to resync 4.8.13 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)
	clusteroperator/operator-lifecycle-manager is not upgradeable because ClusterServiceVersions blocking cluster upgrade: openshift-local-storage/local-storage-operator.4.6.0-202109030220 is incompatible with OpenShift minor versions greater than 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible with OpenShift minor versions greater than 4.8


1) The local-storage operator was not updated; it remains the version installed with 4.6/2.5.8.
2) OCS was not updated; it remains the version installed with 4.6/2.5.8.
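For reference, a minimal sketch of commands that could be used to confirm the stuck worker pool described above (standard oc commands; exact output will differ per cluster):

$ oc get clusteroperators machine-config operator-lifecycle-manager   # the two operators reported above
$ oc get mcp worker                                                   # worker MachineConfigPool: degraded, 0 of 3 updated
$ oc get nodes -o wide                                                # worker nodes still on the old 4.7 machine image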

Comment 2 Jed Lejosne 2021-09-28 13:36:31 UTC
(In reply to Kedar Bidarkar from comment #0)
> [...]
> 	clusteroperator/operator-lifecycle-manager is not upgradeable because
> ClusterServiceVersions blocking cluster upgrade:
> openshift-local-storage/local-storage-operator.4.6.0-202109030220 is
> incompatible with OpenShift minor versions greater than
> 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible
> with OpenShift minor versions greater than 4.8

This is odd... Technically, the OpenShift minor version was never *greater* than 4.8, it was equal at worst. Do we know why we got that error?
Also, it looks like "local-storage-operator" was never upgraded from 4.6 to 4.7, do we know why?

Unrelated, but do we know for sure that the VMI was never migrated from a 4.8 node to a 4.8 node?
There's probably no reason for such a migration to have occurred, but if it did this could be another occurrence of the persistent target issue.

The VM is a RHEL 8.4 using OCS for storage, correct? Did we get a repro (or a non-repro) with a different kind of VM?
Finally, is this expected to occur with a fresh OCP 4.7.32 that we upgrade to 4.8.13, or does it only happen when doing 4.6 -> 4.7 -> 4.8?

Thank you!

Comment 3 Kedar Bidarkar 2021-09-28 17:11:07 UTC
    >       clusteroperator/operator-lifecycle-manager is not upgradeable because
    > ClusterServiceVersions blocking cluster upgrade:
    > openshift-local-storage/local-storage-operator.4.6.0-202109030220 is
    > incompatible with OpenShift minor versions greater than
    > 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible
    > with OpenShift minor versions greater than 4.8

    This is odd... Technically, the OpenShift minor version was never *greater*
    than 4.8, it was equal at worst. Do we know why we got that error?
    Also, it looks like "local-storage-operator" was never upgraded from 4.6 to
    4.7, do we know why?

[kbidarka] I was not aware of the need to upgrade the "local-storage-operator"; I realized this when I saw the must-gather logs.
We usually upgrade from N-1 to N versions, but this time I was trying the N-3 to N version upgrade.

What I see now is that both containerDisk- and DV-based VMIs fail to migrate.
 


    Unrelated, but do we know for sure that the VMI was never migrated from a 4.8
    node to a 4.8 node?

[kbidarka] What I see is that during the upgrade process the worker node image is upgraded, and during this process it tries to live-migrate the VMI running on that worker node.

As the migration never completes successfully, it has created almost 249 target pods for each VMI.

Because of this, the upgrade process is stuck and the worker nodes cannot finish updating.
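For context, a sketch of how the loop shows up from the CLI (assuming the VMIs run in the default namespace, as in the logs above; vmim is the short name for VirtualMachineInstanceMigration):

$ oc get vmim -n default                                         # evacuation migrations created by virt-controller
$ oc get pods -n default | grep -c virt-launcher-vm2-ocs-rhel84  # one leftover target pod per failed attempt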
 

    There's probably no reason for such a migration to have occurred, but if it did
    this could be another occurrence of the persistent target issue.

    The VM is a RHEL 8.4 using OCS for storage, correct? Did we get a repro (or a
    non-repro) with a different kind of VM?

[kbidarka] Yes. On the same cluster we also have a containerDisk VM using the image "kubevirt/cirros-container-disk-demo:latest".
 

    Finally, is this expected to occur with a fresh OCP 4.7.32 that we upgrade to
    4.8.13, or does it only happen when doing 4.6 -> 4.7 -> 4.8?

 
[kbidarka] I think "CNV/OCP - 2.6.7/4.7.32 to 4.8.2/4.8.13" is yet to be tested. @Debarati Basu-Nag, please correct me if required.

Comment 4 David Vossel 2021-09-28 20:15:34 UTC
Ultimately, the migration is failing, which is why it's being retried over and over again. From the cluster-scoped logs, the reason I see for the failure is:

{"component":"virt-controller","kind":"","level":"error","msg":"target pod default/virt-launcher-vm2-ocs-rhel84-z9c9g shutdown during migration","name":"kubevirt-evacuation-jn5g9","namespace":"default","pos":"migration.go:296","service":"http","timestamp":"2021-09-28T11:40:09.524350Z","uid":"c7d876ac-a445-45ea-93c7-344563c6a696"}


The target pod is shutting down before the migration completes. Now, if I look at the handler logs for the VMI source pod that is being migrated, I see a bunch of these errors:

"{"component":"virt-handler","level":"info","msg":"re-enqueuing VirtualMachineInstance default/vm2-ocs-rhel84","pos":"vm.go:1146","reason":"Can not update a VirtualMachineInstance with unresponsive command server.","timestamp":"2021-09-28T11:37:12.297404Z"}"

The updated virt-handler can't contact the old VMI pod's command socket for some reason, which means it will never be able to kick off the migration. 

The end result is that we have the cluster scope continually trying to create migration targets for the VMI, and those target pods time out because the migration never kicks off.

So, the question here is: why can't the new virt-handler communicate with the old VMI pod in order to kick off the migration? Either we broke compatibility of the VMI command socket, or the virt-launcher instance itself is somehow in a bad state where it can't respond.

Can we get logs from the source virt-launcher pod please?
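For completeness, a sketch of how those logs could be collected (the compute container name and the kubevirt.io=virt-handler label are assumptions based on standard KubeVirt deployments; pod name hashes are placeholders):

$ oc logs -n default virt-launcher-vm2-ocs-rhel84-<hash> -c compute > source-virt-launcher.log
$ oc get pods -n openshift-cnv -l kubevirt.io=virt-handler -o wide    # find the handler running on the source node
$ oc logs -n openshift-cnv virt-handler-<hash> > virt-handler.log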

Comment 5 Vladik Romanovsky 2021-09-28 21:01:41 UTC
It appears that the issue is that the destination pod is generating a cloud-init disk of a different size than the source virt-launcher.

{"component":"virt-launcher","level":"error","msg":"operation failed: migration of disk vdb failed: Source and target image have different sizes","pos":"qemuMigrationNBDReportMirrorError:488","subcomponent":"libvirt","thread":"35"...

- name: cloudinitdisk
    target: vdb

That has probably been caused by a recent change [1] that affected the ISO generation process.

We could mitigate this issue by supporting ISO generation without the 4k alignment for older workloads.
 


[1] https://github.com/kubevirt/kubevirt/pull/5806
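To make the size mismatch concrete, a small illustrative calculation (the numbers are made up; the point is only that padding an ISO up to a 4 KiB boundary changes its reported size, so an unpadded source image and a padded target image can never match):

$ SIZE=366592; ALIGN=4096                          # hypothetical cloud-init ISO size in bytes
$ echo $(( (SIZE + ALIGN - 1) / ALIGN * ALIGN ))   # size after padding to the next 4096-byte boundary
368640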

Comment 6 Roman Mohr 2021-09-29 10:17:51 UTC
(In reply to Vladik Romanovsky from comment #5)
> It appears that the issue is that the destination pod is generating a
> cloud-init disk of a different size than the source virt-launcher.
> 
> {"component":"virt-launcher","level":"error","msg":"operation failed:
> migration of disk vdb failed: Source and target image have different
> sizes","pos":"qemuMigrationNBDReportMirrorError:488","subcomponent":
> "libvirt","thread":"35"...
> 
> - name: cloudinitdisk
>     target: vdb
> 
> That's probably has been caused by a recent change[1] that affected the iso
> generation process.
> 
> We could mitigate this issue by supporting iso generation without the 4k
> alignment for older workloads.
>  
> 
> 
> [1] https://github.com/kubevirt/kubevirt/pull/5806

Since the change made sure that domains can start which couldn't start otherwise, I wonder whether we would then be unable to migrate because the newer qemu rejects the ISO for being unaligned.

Comment 8 Simone Tiraboschi 2021-09-30 10:40:09 UTC
(In reply to Jed Lejosne from comment #2)
> (In reply to Kedar Bidarkar from comment #0)
> > [...]
> > 	clusteroperator/operator-lifecycle-manager is not upgradeable because
> > ClusterServiceVersions blocking cluster upgrade:
> > openshift-local-storage/local-storage-operator.4.6.0-202109030220 is
> > incompatible with OpenShift minor versions greater than
> > 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible
> > with OpenShift minor versions greater than 4.8
> 
> This is odd... Technically, the OpenShift minor version was never *greater*
> than 4.8, it was equal at worst. Do we know why we got that error?
> Also, it looks like "local-storage-operator" was never upgraded from 4.6 to
> 4.7, do we know why?

I admit that the wording of the OLM error message can appear a bit harsh, but technically it is accurate: due to deprecated v1beta1 APIs that were removed in Kubernetes 1.22 and thus OCP 4.9, v2.6.7 is not supposed to work on OCP 4.9 and greater, which technically is "OpenShift minor versions greater than 4.8".
That error message is simply reporting that OCP will not upgrade to 4.9 until openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is replaced by a version that is able to work there (v4.8.1 for kubevirt-hyperconverged-operator).
So this is absolutely expected and completely unrelated to the current bug.
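As an aside, a hedged way to inspect the constraint OLM is enforcing here, assuming the CSV carries the usual olm.properties annotation with an olm.maxOpenShiftVersion entry:

$ oc get csv kubevirt-hyperconverged-operator.v2.6.7 -n openshift-cnv \
    -o jsonpath='{.metadata.annotations.olm\.properties}'
# expected to print something like: [{"type":"olm.maxOpenShiftVersion","value":"4.8"}]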

Comment 9 Kedar Bidarkar 2021-10-20 20:08:43 UTC
[kbidarka@localhost ocp-cnv-scripts]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.7   OpenShift Virtualization   2.6.7     kubevirt-hyperconverged-operator.v2.6.6   Replacing
kubevirt-hyperconverged-operator.v4.8.3   OpenShift Virtualization   4.8.3     kubevirt-hyperconverged-operator.v2.6.7   Installing


[kbidarka@localhost ocp-cnv-scripts]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.3   OpenShift Virtualization   4.8.3     kubevirt-hyperconverged-operator.v2.6.7   Installing

[kbidarka@localhost ocp-cnv-scripts]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.3   OpenShift Virtualization   4.8.3     kubevirt-hyperconverged-operator.v2.6.7   Succeeded

[kbidarka@localhost ocp-cnv-scripts]$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{
>       "op": "add",
>       "path": "/spec/workloadUpdateStrategy",
>       "value": {
>             "workloadUpdateMethods": [
>                 "LiveMigrate"
>             ]
>       }
>   }]'
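A quick sanity check that the annotation was propagated (the KubeVirt CR name kubevirt-kubevirt-hyperconverged is the usual HCO-managed name and is an assumption here):

$ oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged \
    -o jsonpath='{.spec.workloadUpdateStrategy}'
# should include LiveMigrate once HCO has applied the jsonpatch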

[kbidarka@localhost ocp-cnv-scripts]$ oc get nodes
NAME                                STATUS   ROLES    AGE   VERSION
virtkbidupg9-9k69v-master-0         Ready    master   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-master-1         Ready    master   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-master-2         Ready    master   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-worker-0-4q5fj   Ready    worker   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-worker-0-594n4   Ready    worker   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-worker-0-vxldw   Ready    worker   18h   v1.21.1+a620f50

[kbidarka@localhost ocp-cnv-scripts]$ oc get vmi 
NAME             AGE   PHASE     IP             NODENAME
vm-ocs-rhel79    98m   Running   xx.yy.zz.168   virtkbidupg9-9k69v-worker-0-4q5fj
vm-ocs-rhel84    98m   Running   xx.yy.zz.169   virtkbidupg9-9k69v-worker-0-4q5fj
vm2-ocs-rhel84   98m   Running   xx.yy.zz.167   virtkbidupg9-9k69v-worker-0-4q5fj

[kbidarka@localhost ocp-cnv-scripts]$ oc get pods
NAME                                 READY   STATUS      RESTARTS   AGE
virt-launcher-vm-ocs-rhel79-5l6dm    0/1     Completed   0          69m
virt-launcher-vm-ocs-rhel79-mwr5p    1/1     Running     0          11m
virt-launcher-vm-ocs-rhel84-tz52w    1/1     Running     0          11m
virt-launcher-vm-ocs-rhel84-wsl44    0/1     Completed   0          73m
virt-launcher-vm2-ocs-rhel84-6mzbx   0/1     Completed   0          73m
virt-launcher-vm2-ocs-rhel84-xgvpb   1/1     Running     0          11m


Source VMI pod image: container-native-virtualization/virt-launcher/images/v2.6.7-6

Target VMI pod image: container-native-virtualization/virt-launcher/images/v4.8.3-6
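These image versions can be read directly from the running pods, for example (the compute container name is assumed from standard virt-launcher pods; the default namespace is assumed, as earlier in this bug):

$ oc get pod virt-launcher-vm2-ocs-rhel84-xgvpb -n default \
    -o jsonpath='{.spec.containers[?(@.name=="compute")].image}'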

[kbidarka@localhost ocp-cnv-scripts]$ virtctl console vm-ocs-rhel79

Successfully connected to vm-ocs-rhel79 console. The escape sequence is ^]

[cloud-user@vm-ocs-rhel79 ~]$ 
[cloud-user@vm-ocs-rhel79 ~]$ [kbidarka@localhost ocp-cnv-scripts]$ 
[kbidarka@localhost ocp-cnv-scripts]$ virtctl console vm-ocs-rhel84
Successfully connected to vm-ocs-rhel84 console. The escape sequence is ^]

[cloud-user@vm-ocs-rhel84 ~]$ [kbidarka@localhost ocp-cnv-scripts]$ 
[kbidarka@localhost ocp-cnv-scripts]$ virtctl console vm2-ocs-rhel84
Successfully connected to vm2-ocs-rhel84 console. The escape sequence is ^]

[cloud-user@vm2-ocs-rhel84 ~]$ [kbidarka@localhost ocp-cnv-scripts]$ 
[kbidarka@localhost ocp-cnv-scripts]$

Comment 11 Kedar Bidarkar 2021-10-20 20:16:53 UTC
This bug was VERIFIED on a setup with CNV 2.5.8 installed and then upgraded to 4.8.3.

CNV upgrade path: 2.5.8 -> 2.6.4 -> 2.6.5 -> 2.6.6 -> 2.6.7 -> 4.8.3
CNV_SOURCE=production

[kbidarka@localhost ocp-cnv-scripts]$ oc get ip -A 
NAMESPACE                 NAME            CSV                                         APPROVAL    APPROVED
openshift-cnv             install-2sq8n   kubevirt-hyperconverged-operator.v2.6.5     Manual      true
openshift-cnv             install-6rs94   kubevirt-hyperconverged-operator.v2.6.4     Manual      true
openshift-cnv             install-gpvwq   kubevirt-hyperconverged-operator.v2.6.6     Manual      true
openshift-cnv             install-rqfzk   kubevirt-hyperconverged-operator.v2.6.7     Manual      true
openshift-cnv             install-rtssx   kubevirt-hyperconverged-operator.v4.8.3     Manual      true
openshift-local-storage   install-kcr5x   local-storage-operator.4.6.0-202109030220   Automatic   true
openshift-local-storage   install-t4pq2   local-storage-operator.4.7.0-202110051546   Automatic   true
openshift-local-storage   install-vl9tn   local-storage-operator.4.6.0-202109030220   Automatic   true
openshift-storage         install-44vs6   ocs-operator.v4.6.8                         Automatic   true
openshift-storage         install-4gn2j   ocs-operator.v4.7.4                         Automatic   true
[kbidarka@localhost ocp-cnv-scripts]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.3   OpenShift Virtualization   4.8.3     kubevirt-hyperconverged-operator.v2.6.7   Succeeded

---

As part of another bug, https://bugzilla.redhat.com/show_bug.cgi?id=2014509#c3, upgrade and live migration of VMIs from 2.6.7 to 4.8.3 was also tested and worked successfully.

Comment 15 errata-xmlrpc 2021-12-02 17:01:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.3 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4914

