Description of problem:
During an OCP upgrade (OCP 4.4.13 -> 4.5.3) with CNV 2.3, VMI migration fails when the target node is evicted as part of the upgrade; the source VMI's migrationState is not updated and the source node remains in Ready,SchedulingDisabled.

Version-Release number of selected component (if applicable):
OCP 4.4.13 -> 4.5.3 with CNV 2.3

How reproducible:
?

Steps to Reproduce:
1. Create a VM with an NFS DV and start the VM
2. Initiate the OCP upgrade

Actual results:
The source VMI's migrationState is not updated and the source node remains in Ready,SchedulingDisabled. The migration never takes place, and a new migration cannot be triggered manually ("in-flight migration detected").

Expected results:
The source VMI's migrationState should be updated as either successful or failed. If failed, a new migration job should be triggered to allow node eviction.

Additional info:
The VMI is running on host-172-16-0-39. A migration job is triggered with target node host-172-16-0-27. However, host-172-16-0-27 starts eviction and the target VMI cannot be created.
=============================================
$ oc get pod -n upgrade-test-upgrade -owide | grep nfs
virt-launcher-vm-for-product-upgrade-nfs-4cj5s   0/1   Completed   0   178m   10.129.2.10   host-172-16-0-27   <none>   <none>
virt-launcher-vm-for-product-upgrade-nfs-77t82   0/1   Error       0   143m   10.129.2.37   host-172-16-0-27   <none>   <none>
virt-launcher-vm-for-product-upgrade-nfs-7v58h   1/1   Running     0   176m   10.131.0.30   host-172-16-0-39   <none>   <none>

VMI migrationState is not updated:
=============================================
migrationState:
  migrationUid: ac1330cc-a499-41ce-b2c4-24927ea9eab6
  sourceNode: host-172-16-0-39
  targetDirectMigrationNodePorts:
    "40099": 49153
    "41555": 49152
    "46761": 0
  targetNode: host-172-16-0-27
  targetNodeAddress: 10.129.2.6
  targetNodeDomainDetected: true
  targetPod: virt-launcher-vm-for-product-upgrade-nfs-77t82
=============================================

Source node remains in Ready,SchedulingDisabled:
$ oc describe node host-172-16-0-3
  machineconfiguration.openshift.io/reason: failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod "virt-launcher-vm-for-product-upgrade-nfs-7v5...
  machineconfiguration.openshift.io/ssh: accessed
  machineconfiguration.openshift.io/state: Degraded

All logs and relevant files attached.
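For anyone triaging a similar report: the stuck state is visible in whether `migrationState` ever gains a `completed` or `failed` field. A minimal sketch of that check, using a trimmed sample of the status block above in place of live cluster output (against a real cluster you would feed `oc get vmi vm-for-product-upgrade-nfs -n upgrade-test-upgrade -o json` into the same filter; the sample file path is arbitrary):

```shell
# Trimmed stand-in for the VMI JSON from the cluster; the fields match the
# migrationState shown above, where neither "completed" nor "failed" is set.
cat <<'EOF' > /tmp/vmi.json
{"status": {"migrationState": {
  "migrationUid": "ac1330cc-a499-41ce-b2c4-24927ea9eab6",
  "sourceNode": "host-172-16-0-39",
  "targetNode": "host-172-16-0-27"
}}}
EOF
# A resolved migration has completed: true or failed: true. In this bug
# neither key is ever set, so this prints "unresolved".
python3 - <<'PY'
import json

state = json.load(open("/tmp/vmi.json"))["status"]["migrationState"]
print("resolved" if state.get("completed") or state.get("failed") else "unresolved")
PY
```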
Created attachment 1702113 [details] logs, yaml files and additional info
Does this appear with or without the fix for bug #1856979?
Without. This is 2.3 code.
I took a close look at the logs. The cause appears to be that virt-launcher did not successfully notify virt-handler that the migration had failed. It may well be related to bz1856979, but the exact cause of the communication error is unknown.

It's unfortunate, but we really don't have any data to indicate what has occurred here. The logs are silent as to "why". All we see is virt-launcher's log indicating that the migration failed, and that this caused a domain event, as illustrated by these two lines:

{"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed","name":"vm-for-product-upgrade-nfs","namespace":"upgrade-test-upgrade","pos":"manager.go:509","reason":"virError(Code=9, Domain=10, Message='operation failed: Lost connection to destination host')","timestamp":"2020-07-22T14:10:23.959648Z","uid":"2094e14e-399c-47da-8f0a-aa20d0a715c3"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 0 with reason 1 received","pos":"client.go:259","timestamp":"2020-07-22T14:10:23.967926Z"}

After that we have no more information. All we know is that virt-handler never receives that domain notify event, and virt-launcher is silent as to why or how this could have occurred. This lack of information is due to us silently ignoring errors on virt-launcher's side when there is a communication error with virt-handler. Basically, all we know is that virt-launcher attempted to contact virt-handler to alert it of the migration failure, but the notification never got there.

The client-side portion of bz1856979 may help, but without knowing exactly what has occurred I can't say that with any certainty. What I do know with certainty is that our logging in this area needs to improve, which I'm addressing now upstream.
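For reference, the two entries above can be pulled out of the attached virt-launcher log with a plain grep over the JSON lines. A sketch, recreating two trimmed entries in a temp file so the filter is reproducible (the real log path in the attachment will differ, and the full entries carry more fields than shown here):

```shell
# Recreate two trimmed virt-launcher JSON log entries; the msg and timestamp
# fields match the ones quoted above, other fields are omitted.
LOGFILE=/tmp/virt-launcher.log
cat <<'EOF' > "$LOGFILE"
{"component":"virt-launcher","level":"error","msg":"Live migration failed","timestamp":"2020-07-22T14:10:23.959648Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 0 with reason 1 received","timestamp":"2020-07-22T14:10:23.967926Z"}
EOF
# Filter for the migration failure and the domain lifecycle event that
# virt-handler apparently never received; prints both matching lines.
grep -E '"msg":"(Live migration failed|DomainLifecycle)' "$LOGFILE"
```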
FYI, https://github.com/kubevirt/kubevirt/pull/3885 improves our logging in this area. While it does not fix anything, without these log messages it is difficult to understand exactly what has occurred here.
@vromanso - can you please advise on a workaround?
Thanks for that, Vladik. I've updated the doc text, written a release note with the suggested workaround, and tagged you on GitHub for review: https://github.com/openshift/openshift-docs/pull/24710
Added my comment to the PR
@Ruth - I can't see a comment from you on the docs PR: https://github.com/openshift/openshift-docs/pull/24710 Moving back to ON_QA
@Andrew - I see my comment there; maybe you cannot see it because I am not a contributor? Will add it here:

* If container-native virtualization 2.3 is installed on your {product-title} 4.4 cluster, upgrading the cluster to version 4.5 causes a migrating virtual machine instance (VMI) to fail when the target node is evicted during the upgrade.

--> "causes a migrating virtual" -> "may cause a migrating virtual"
---> "when the target node is evicted during the upgrade." should be removed

This is because the virt-launcher Pod does not successfully notify the virt-handler Pod that migration has failed. The result is that the source VMI `migrationState` is not updated, and the source node remains in a `Ready,SchedulingDisabled` state.

---> "and the source node remains in a `Ready,SchedulingDisabled` state." should be removed.
(In reply to Ruth Netser from comment #15) > @Andrew - I see my commetn there, maybe as I am not a contributor you cannot > see it? > Will add it here: > > * If container-native virtualization 2.3 is installed on your > {product-title} 4.4 cluster, upgrading the cluster to version 4.5 causes a > migrating virtual machine instance (VMI) to fail when the target node is > evicted during the upgrade. > > --> "causes a migrating virtual" -> "may cause a migrating virtual" > ---> "when the target node is evicted during the upgrade." should be removed > > This is because the virt-launcher Pod does not successfully notify the > virt-handler Pod that migration has failed. > The result is that the source VMI `migrationState` is not updated, and the > source node remains in a `Ready,SchedulingDisabled` state. > ---> "and the source node remains in a `Ready,SchedulingDisabled` state." > should be removed. Managed to add my comment in the PR as well
Thanks Ruth. Updated the PR as suggested.
Reviewed, moving to verify. (Note that since we could not reproduce this bug, I could not verify the workaround.)