Created attachment 1826962 [details]
Virt logs from the cluster.

Description of problem:
I have been upgrading a cluster from OCP/CNV 4.6/2.5.8 to OCP/CNV 4.7.32/2.6.7 to OCP 4.8.13 for another bug verification. CNV_SOURCE=osbs

The 2.5.8 to 2.6.7 upgrade path was: 2.5.8 to 2.6.4 to 2.6.5 to 2.6.6 to 2.6.7
After 2.6.7, the cluster status was GOOD.

When upgrading from OCP/CNV 4.7.32/2.6.7 to OCP 4.8.13 (that is, CNV 2.6.7 running on OCP 4.8.13), we see "VMI is in LiveMigrate loop".

From the below pastebin, the OCP upgrade appears complete, but the worker nodes have not yet been updated to the 4.8 images. The worker nodes are stuck because the vm2-ocs-rhel84 VM is in a constant LiveMigrate loop. I suspect that if I stop/start or restart the VM, the worker nodes would continue with the upgrade, but the VMI should have live-migrated successfully on its own.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The upgrade is stuck due to a constant VMI live-migration loop.

Expected results:
The upgrade is not stuck due to a VMI live-migration loop.

Additional info:
Running must-gather shows:

ClusterOperators:
  clusteroperator/machine-config is not available (Cluster not available for 4.8.13) because Failed to resync 4.8.13 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)
  clusteroperator/operator-lifecycle-manager is not upgradeable because ClusterServiceVersions blocking cluster upgrade: openshift-local-storage/local-storage-operator.4.6.0-202109030220 is incompatible with OpenShift minor versions greater than 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible with OpenShift minor versions greater than 4.8

1) The local-storage operator was not updated; it remains the version installed with 4.6/2.5.8.
2) OCS was not updated; it remains the version installed with 4.6/2.5.8.
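For reference, a minimal sketch of how to observe the loop on this cluster (assuming the default namespace and the VM name above; not taken from the must-gather):

  # Migration objects created by the evacuation controller:
  oc get vmim -n default
  # Target pods that keep getting created and torn down for the stuck VMI:
  oc get pods -n default | grep virt-launcher-vm2-ocs-rhel84
  # Current migration state recorded in the VMI status:
  oc get vmi vm2-ocs-rhel84 -n default -o yaml | grep -A 10 migrationState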
(In reply to Kedar Bidarkar from comment #0)
> [...]
> clusteroperator/operator-lifecycle-manager is not upgradeable because
> ClusterServiceVersions blocking cluster upgrade:
> openshift-local-storage/local-storage-operator.4.6.0-202109030220 is
> incompatible with OpenShift minor versions greater than
> 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible
> with OpenShift minor versions greater than 4.8

This is odd... Technically, the OpenShift minor version was never *greater* than 4.8; it was equal at worst. Do we know why we got that error?
Also, it looks like "local-storage-operator" was never upgraded from 4.6 to 4.7, do we know why?

Unrelated, but do we know for sure that the VMI was never migrated from a 4.8 node to a 4.8 node? There's probably no reason for such a migration to have occurred, but if it did, this could be another occurrence of the persistent target issue.

The VM is a RHEL 8.4 using OCS for storage, correct? Did we get a repro (or a non-repro) with a different kind of VM?

Finally, is this expected to occur with a fresh OCP 4.7.32 that we upgrade to 4.8.13, or does it only happen when doing 4.6 -> 4.7 -> 4.8?

Thank you!
(In reply to Jed Lejosne from comment #2)
> clusteroperator/operator-lifecycle-manager is not upgradeable because
> ClusterServiceVersions blocking cluster upgrade:
> openshift-local-storage/local-storage-operator.4.6.0-202109030220 is
> incompatible with OpenShift minor versions greater than
> 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible
> with OpenShift minor versions greater than 4.8
>
> This is odd... Technically, the OpenShift minor version was never *greater*
> than 4.8, it was equal at worst. Do we know why we got that error?
> Also, it looks like "local-storage-operator" was never upgraded from 4.6 to
> 4.7, do we know why?

[kbidarka] I was not aware that the "local-storage-operator" needed upgrading; I realized this when I saw the must-gather logs. We usually upgrade from N-1 to N versions, but this time I was trying the N-3 to N version upgrade. What I see, though, is that both containerDisk- and DV-based VMIs currently fail to migrate.

> Unrelated, but do we know for sure that the VMI was never migrated from a
> 4.8 node to a 4.8 node?

[kbidarka] What I see is that during the upgrade process the worker node image is being upgraded, and it is during this process that it tries to live-migrate the VMI running on the worker node. As the migration never completes successfully, it has created almost 249 target pods for each VMI. Because of this, the upgrade process is stuck and the worker nodes are unable to drain and update (see the commands sketched below).

> There's probably no reason for such a migration to have occurred, but if it
> did this could be another occurrence of the persistent target issue.
>
> The VM is a RHEL 8.4 using OCS for storage, correct? Did we get a repro (or
> a non-repro) with a different kind of VM?

[kbidarka] Yes, and on the same cluster we also have a containerDisk-based VM using the image "kubevirt/cirros-container-disk-demo:latest".

> Finally, is this expected to occur with a fresh OCP 4.7.32 that we upgrade
> to 4.8.13, or does it only happen when doing 4.6 -> 4.7 -> 4.8?

[kbidarka] I think "CNV/OCP - 2.6.7/4.7.32 to 4.8.2/4.8.13" is yet to be tested. @Debarati Basu-Nag, please correct me if required.
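A quick way to confirm the pile-up and the stuck node update described above (namespace and names assumed from this report; the exact count will differ):

  # Roughly how many target pods have been created for the stuck VMI so far:
  oc get pods -n default --no-headers | grep -c 'virt-launcher-vm2-ocs-rhel84'
  # The worker MachineConfigPool that cannot finish because the node will not drain:
  oc get mcp worker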
Ultimately, the migration is failing, which is why it's being tried over and over again.

From the cluster-scoped logs, the reason I see for the failure is:

{"component":"virt-controller","kind":"","level":"error","msg":"target pod default/virt-launcher-vm2-ocs-rhel84-z9c9g shutdown during migration","name":"kubevirt-evacuation-jn5g9","namespace":"default","pos":"migration.go:296","service":"http","timestamp":"2021-09-28T11:40:09.524350Z","uid":"c7d876ac-a445-45ea-93c7-344563c6a696"}

The target pod is shutting down before the migration completes.

Now if I look at the virt-handler logs for the VMI source pod that is being migrated, I see a bunch of these errors:

{"component":"virt-handler","level":"info","msg":"re-enqueuing VirtualMachineInstance default/vm2-ocs-rhel84","pos":"vm.go:1146","reason":"Can not update a VirtualMachineInstance with unresponsive command server.","timestamp":"2021-09-28T11:37:12.297404Z"}

The updated virt-handler can't contact the old VMI pod's command socket for some reason, which means it will never be able to kick off the migration. The end result is that the cluster scope keeps creating migration targets for the VMI, and those target pods time out because the migration never starts.

So, the question here is: why can't the new virt-handler communicate with the old VMI pod in order to kick off the migration? Either we broke compatibility of the VMI command socket, or the virt-launcher instance itself is somehow in a bad state where it can't respond.

Can we get logs from the source virt-launcher pod please?
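In case it helps gather what's being asked for, a sketch of the commands (the pod name suffix below is a placeholder for the actual source pod; the libvirt/virt-launcher output lives in the "compute" container):

  # Source virt-launcher pod logs (replace <suffix> with the real source pod name):
  oc logs -n default virt-launcher-vm2-ocs-rhel84-<suffix> -c compute
  # virt-handler logs, to correlate with the "unresponsive command server" messages:
  oc logs -n openshift-cnv -l kubevirt.io=virt-handler --tail=200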
It appears that the issue is that the destination pod is generating a cloud-init disk of a different size than the source virt-launcher:

{"component":"virt-launcher","level":"error","msg":"operation failed: migration of disk vdb failed: Source and target image have different sizes","pos":"qemuMigrationNBDReportMirrorError:488","subcomponent":"libvirt","thread":"35"...

  - name: cloudinitdisk
    target: vdb

That has probably been caused by a recent change [1] that affected the ISO generation process.

We could mitigate this issue by supporting ISO generation without the 4k alignment for older workloads.

[1] https://github.com/kubevirt/kubevirt/pull/5806
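To illustrate the size mismatch (the numbers below are hypothetical, only to show the effect of the 4k rounding introduced by that PR): an ISO 9660 image is naturally a multiple of 2 KiB, so once one side rounds the file up to the next 4 KiB boundary and the other does not, the two images can no longer match in size:

  # Hypothetical source ISO size: a 2 KiB multiple that is not 4 KiB aligned
  iso_size=374784
  # What a launcher applying the 4k alignment would produce instead
  aligned=$(( (iso_size + 4095) / 4096 * 4096 ))
  echo "source: ${iso_size} bytes, target: ${aligned} bytes"   # 374784 vs 376832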
(In reply to Vladik Romanovsky from comment #5)
> It appears that the issue is that the destination pod is generating a
> cloud-init disk of a different size than the source virt-launcher.
>
> {"component":"virt-launcher","level":"error","msg":"operation failed:
> migration of disk vdb failed: Source and target image have different
> sizes","pos":"qemuMigrationNBDReportMirrorError:488","subcomponent":
> "libvirt","thread":"35"...
>
> - name: cloudinitdisk
>   target: vdb
>
> That has probably been caused by a recent change [1] that affected the ISO
> generation process.
>
> We could mitigate this issue by supporting ISO generation without the 4k
> alignment for older workloads.
>
> [1] https://github.com/kubevirt/kubevirt/pull/5806

Since that change was made to ensure that domains which otherwise can't start are able to start, I wonder whether, with that mitigation, we would then be unable to migrate because the newer QEMU rejects the ISO for being unaligned.
(In reply to Jed Lejosne from comment #2)
> (In reply to Kedar Bidarkar from comment #0)
> > [...]
> > clusteroperator/operator-lifecycle-manager is not upgradeable because
> > ClusterServiceVersions blocking cluster upgrade:
> > openshift-local-storage/local-storage-operator.4.6.0-202109030220 is
> > incompatible with OpenShift minor versions greater than
> > 4.8,openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is incompatible
> > with OpenShift minor versions greater than 4.8
>
> This is odd... Technically, the OpenShift minor version was never *greater*
> than 4.8, it was equal at worst. Do we know why we got that error?
> Also, it looks like "local-storage-operator" was never upgraded from 4.6 to
> 4.7, do we know why?

I admit that the wording of the OLM error message can appear a bit harsh, but it is technically accurate: because of deprecated v1beta1 APIs that were removed in Kubernetes 1.22 and therefore OCP 4.9, v2.6.7 is not expected to work on OCP 4.9 and later, which technically is "OpenShift minor versions greater than 4.8".

That error message is simply reporting that OCP will not upgrade to 4.9 until openshift-cnv/kubevirt-hyperconverged-operator.v2.6.7 is replaced by a version that is expected to work there (v4.8.1 for kubevirt-hyperconverged-operator).

So this is absolutely expected and completely unrelated to the current bug.
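For completeness, one way to see where that gate comes from is to look at the OLM properties carried by the CSV. The exact annotation content below is an assumption about how v2.6.7 expresses the maximum OpenShift version, not output taken from this cluster:

  oc get csv -n openshift-cnv kubevirt-hyperconverged-operator.v2.6.7 \
    -o jsonpath='{.metadata.annotations.olm\.properties}'
  # expected to include something like: [{"type":"olm.maxOpenShiftVersion","value":"4.8"}]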
[kbidarka@localhost ocp-cnv-scripts]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v2.6.7   OpenShift Virtualization   2.6.7     kubevirt-hyperconverged-operator.v2.6.6   Replacing
kubevirt-hyperconverged-operator.v4.8.3   OpenShift Virtualization   4.8.3     kubevirt-hyperconverged-operator.v2.6.7   Installing

[kbidarka@localhost ocp-cnv-scripts]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.3   OpenShift Virtualization   4.8.3     kubevirt-hyperconverged-operator.v2.6.7   Installing

[kbidarka@localhost ocp-cnv-scripts]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.3   OpenShift Virtualization   4.8.3     kubevirt-hyperconverged-operator.v2.6.7   Succeeded

[kbidarka@localhost ocp-cnv-scripts]$ oc annotate --overwrite -n openshift-cnv hyperconverged kubevirt-hyperconverged kubevirt.kubevirt.io/jsonpatch='[{
>   "op": "add",
>   "path": "/spec/workloadUpdateStrategy",
>   "value": {
>     "workloadUpdateMethods": [
>       "LiveMigrate"
>     ]
>   }
> }]'

[kbidarka@localhost ocp-cnv-scripts]$ oc get nodes
NAME                                STATUS   ROLES    AGE   VERSION
virtkbidupg9-9k69v-master-0         Ready    master   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-master-1         Ready    master   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-master-2         Ready    master   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-worker-0-4q5fj   Ready    worker   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-worker-0-594n4   Ready    worker   18h   v1.21.1+a620f50
virtkbidupg9-9k69v-worker-0-vxldw   Ready    worker   18h   v1.21.1+a620f50

[kbidarka@localhost ocp-cnv-scripts]$ oc get vmi
NAME             AGE   PHASE     IP             NODENAME
vm-ocs-rhel79    98m   Running   xx.yy.zz.168   virtkbidupg9-9k69v-worker-0-4q5fj
vm-ocs-rhel84    98m   Running   xx.yy.zz.169   virtkbidupg9-9k69v-worker-0-4q5fj
vm2-ocs-rhel84   98m   Running   xx.yy.zz.167   virtkbidupg9-9k69v-worker-0-4q5fj

[kbidarka@localhost ocp-cnv-scripts]$ oc get pods
NAME                                 READY   STATUS      RESTARTS   AGE
virt-launcher-vm-ocs-rhel79-5l6dm    0/1     Completed   0          69m
virt-launcher-vm-ocs-rhel79-mwr5p    1/1     Running     0          11m
virt-launcher-vm-ocs-rhel84-tz52w    1/1     Running     0          11m
virt-launcher-vm-ocs-rhel84-wsl44    0/1     Completed   0          73m
virt-launcher-vm2-ocs-rhel84-6mzbx   0/1     Completed   0          73m
virt-launcher-vm2-ocs-rhel84-xgvpb   1/1     Running     0          11m

Source VMI pod image: container-native-virtualization/virt-launcher/images/v2.6.7-6
Target VMI pod image: container-native-virtualization/virt-launcher/images/v4.8.3-6

[kbidarka@localhost ocp-cnv-scripts]$ virtctl console vm-ocs-rhel79
Successfully connected to vm-ocs-rhel79 console. The escape sequence is ^]

[cloud-user@vm-ocs-rhel79 ~]$
[cloud-user@vm-ocs-rhel79 ~]$
[kbidarka@localhost ocp-cnv-scripts]$

[kbidarka@localhost ocp-cnv-scripts]$ virtctl console vm-ocs-rhel84
Successfully connected to vm-ocs-rhel84 console. The escape sequence is ^]

[cloud-user@vm-ocs-rhel84 ~]$
[kbidarka@localhost ocp-cnv-scripts]$

[kbidarka@localhost ocp-cnv-scripts]$ virtctl console vm2-ocs-rhel84
Successfully connected to vm2-ocs-rhel84 console. The escape sequence is ^]

[cloud-user@vm2-ocs-rhel84 ~]$
[kbidarka@localhost ocp-cnv-scripts]$
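As a side note, a sketch of how to double-check that the jsonpatch above actually reached the KubeVirt CR and to watch the resulting workload-update migrations (the KubeVirt CR name below is the usual HCO-managed one and is an assumption, not taken from this cluster):

  oc get kubevirt -n openshift-cnv kubevirt-kubevirt-hyperconverged \
    -o jsonpath='{.spec.workloadUpdateStrategy}'
  # Watch the migrations as the 2.6.7 launchers are replaced by 4.8.3 ones:
  oc get vmim -n default -w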
This bug was VERIFIED on a setup with CNV 2.5.8 installed and upgraded to 4.8.3.

CNV upgrade path: 2.5.8 to 2.6.4 to 2.6.5 to 2.6.6 to 2.6.7 to 4.8.3
CNV_SOURCE=production

[kbidarka@localhost ocp-cnv-scripts]$ oc get ip -A
NAMESPACE                 NAME            CSV                                         APPROVAL    APPROVED
openshift-cnv             install-2sq8n   kubevirt-hyperconverged-operator.v2.6.5     Manual      true
openshift-cnv             install-6rs94   kubevirt-hyperconverged-operator.v2.6.4     Manual      true
openshift-cnv             install-gpvwq   kubevirt-hyperconverged-operator.v2.6.6     Manual      true
openshift-cnv             install-rqfzk   kubevirt-hyperconverged-operator.v2.6.7     Manual      true
openshift-cnv             install-rtssx   kubevirt-hyperconverged-operator.v4.8.3     Manual      true
openshift-local-storage   install-kcr5x   local-storage-operator.4.6.0-202109030220   Automatic   true
openshift-local-storage   install-t4pq2   local-storage-operator.4.7.0-202110051546   Automatic   true
openshift-local-storage   install-vl9tn   local-storage-operator.4.6.0-202109030220   Automatic   true
openshift-storage         install-44vs6   ocs-operator.v4.6.8                         Automatic   true
openshift-storage         install-4gn2j   ocs-operator.v4.7.4                         Automatic   true

[kbidarka@localhost ocp-cnv-scripts]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.3   OpenShift Virtualization   4.8.3     kubevirt-hyperconverged-operator.v2.6.7   Succeeded

---

As part of another bug, https://bugzilla.redhat.com/show_bug.cgi?id=2014509#c3, upgrade and live migration of VMIs from 2.6.7 to 4.8.3 were also tested and worked successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.3 Images security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4914