Description of problem:
During an OCP upgrade (OCP 4.4.13 -> 4.5.3) with CNV 2.3, VMI migration fails when the target node is evicted as part of the upgrade; the source VMI's migrationState is not updated and the source node remains in Ready,SchedulingDisabled.

Version-Release number of selected component (if applicable):
OCP 4.4.13 -> 4.5.3 with CNV 2.3

How reproducible:
?

Steps to Reproduce:
1. Create a VM with an NFS DV and start the VM
2. Initiate the OCP upgrade

Actual results:
The source VMI's migrationState is not updated and the source node remains in Ready,SchedulingDisabled. The migration never takes place, and a new migration cannot be triggered manually ("in-flight migration detected").

Expected results:
The source VMI's migrationState should be updated as either successful or failed. If failed, a new migration job should be triggered to allow node eviction.

Additional info:
The VMI is running on host-172-16-0-39. A migration job is triggered with target node host-172-16-0-27. However, host-172-16-0-27 starts eviction and the target VMI cannot be created.
=============================================
$ oc get pod -n upgrade-test-upgrade -owide | grep nfs
virt-launcher-vm-for-product-upgrade-nfs-4cj5s   0/1   Completed   0   178m   10.129.2.10   host-172-16-0-27   <none>   <none>
virt-launcher-vm-for-product-upgrade-nfs-77t82   0/1   Error       0   143m   10.129.2.37   host-172-16-0-27   <none>   <none>
virt-launcher-vm-for-product-upgrade-nfs-7v58h   1/1   Running     0   176m   10.131.0.30   host-172-16-0-39   <none>   <none>

VMI migrationState is not updated:
=============================================
migrationState:
  migrationUid: ac1330cc-a499-41ce-b2c4-24927ea9eab6
  sourceNode: host-172-16-0-39
  targetDirectMigrationNodePorts:
    "40099": 49153
    "41555": 49152
    "46761": 0
  targetNode: host-172-16-0-27
  targetNodeAddress: 10.129.2.6
  targetNodeDomainDetected: true
  targetPod: virt-launcher-vm-for-product-upgrade-nfs-77t82
=============================================

Source node remains in Ready,SchedulingDisabled:
$ oc describe node host-172-16-0-3
  machineconfiguration.openshift.io/reason: failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod "virt-launcher-vm-for-product-upgrade-nfs-7v5...
  machineconfiguration.openshift.io/ssh: accessed
  machineconfiguration.openshift.io/state: Degraded

All logs and relevant files attached.
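For anyone triaging a similar report: the stuck state is visible in whether `migrationState` ever gains a `completed` or `failed` field. A minimal sketch of that check, using a trimmed sample of the status block above in place of live cluster output (against a real cluster you would feed `oc get vmi vm-for-product-upgrade-nfs -n upgrade-test-upgrade -o json` into the same filter; the sample file path is arbitrary):

```shell
# Trimmed stand-in for the VMI JSON from the cluster; the fields match the
# migrationState shown above, where neither "completed" nor "failed" is set.
cat <<'EOF' > /tmp/vmi.json
{"status": {"migrationState": {
  "migrationUid": "ac1330cc-a499-41ce-b2c4-24927ea9eab6",
  "sourceNode": "host-172-16-0-39",
  "targetNode": "host-172-16-0-27"
}}}
EOF
# A resolved migration has completed: true or failed: true. In this bug
# neither key is ever set, so this prints "unresolved".
python3 - <<'PY'
import json

state = json.load(open("/tmp/vmi.json"))["status"]["migrationState"]
print("resolved" if state.get("completed") or state.get("failed") else "unresolved")
PY
```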
Created attachment 1702113 [details] logs, yaml files and additional info
Does this appear with or without the fix for bug #1856979?
Without. This is 2.3 code.
I took a close look at the logs. The cause appears to be that virt-launcher did not successfully notify virt-handler that the migration had failed. It may well be related to bz1856979, but the exact cause of the communication error is unknown.

It's unfortunate, but we really don't have any data to indicate what has occurred here. The logs are silent as to "why". All we see is virt-launcher's log indicating that the migration failed, and that this caused a domain event, as illustrated by these two lines:

{"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed","name":"vm-for-product-upgrade-nfs","namespace":"upgrade-test-upgrade","pos":"manager.go:509","reason":"virError(Code=9, Domain=10, Message='operation failed: Lost connection to destination host')","timestamp":"2020-07-22T14:10:23.959648Z","uid":"2094e14e-399c-47da-8f0a-aa20d0a715c3"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 0 with reason 1 received","pos":"client.go:259","timestamp":"2020-07-22T14:10:23.967926Z"}

After that we have no more information. All we know is that virt-handler never receives that domain notify event, and virt-launcher is silent as to why or how this could have occurred. This lack of information is due to us silently ignoring errors on virt-launcher's side when there is a communication error with virt-handler. Basically, all we know is that virt-launcher attempted to contact virt-handler to alert it of the migration failure, but the notification never got there.

The client-side portion of bz1856979 may help, but without knowing exactly what has occurred I can't say that with any certainty. What I do know with certainty is that our logging in this area needs to improve, which I'm addressing now upstream.
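For reference, the two entries above can be pulled out of the attached virt-launcher log with a plain grep over the JSON lines. A sketch, recreating two trimmed entries in a temp file so the filter is reproducible (the real log path in the attachment will differ, and the full entries carry more fields than shown here):

```shell
# Recreate two trimmed virt-launcher JSON log entries; the msg and timestamp
# fields match the ones quoted above, other fields are omitted.
LOGFILE=/tmp/virt-launcher.log
cat <<'EOF' > "$LOGFILE"
{"component":"virt-launcher","level":"error","msg":"Live migration failed","timestamp":"2020-07-22T14:10:23.959648Z"}
{"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 0 with reason 1 received","timestamp":"2020-07-22T14:10:23.967926Z"}
EOF
# Filter for the migration failure and the domain lifecycle event that
# virt-handler apparently never received; prints both matching lines.
grep -E '"msg":"(Live migration failed|DomainLifecycle)' "$LOGFILE"
```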
FYI, https://github.com/kubevirt/kubevirt/pull/3885 improves our logging in this area. While it does not fix anything, without these log messages it is difficult to understand exactly what has occurred here.
@vromanso - can you please advise on a workaround?
Thanks for that, Vladik. I've updated the doc text, written a release note with the suggested workaround, and tagged you on GitHub for review: https://github.com/openshift/openshift-docs/pull/24710
Added my comment to the PR
@Ruth - I can't see a comment from you on the docs PR: https://github.com/openshift/openshift-docs/pull/24710 Moving back to ON_QA
@Andrew - I see my comment there; maybe you cannot see it because I am not a contributor? Will add it here:

* If container-native virtualization 2.3 is installed on your {product-title} 4.4 cluster, upgrading the cluster to version 4.5 causes a migrating virtual machine instance (VMI) to fail when the target node is evicted during the upgrade.

--> "causes a migrating virtual" -> "may cause a migrating virtual"
---> "when the target node is evicted during the upgrade." should be removed

This is because the virt-launcher Pod does not successfully notify the virt-handler Pod that migration has failed. The result is that the source VMI `migrationState` is not updated, and the source node remains in a `Ready,SchedulingDisabled` state.

---> "and the source node remains in a `Ready,SchedulingDisabled` state." should be removed.
(In reply to Ruth Netser from comment #15) > @Andrew - I see my commetn there, maybe as I am not a contributor you cannot > see it? > Will add it here: > > * If container-native virtualization 2.3 is installed on your > {product-title} 4.4 cluster, upgrading the cluster to version 4.5 causes a > migrating virtual machine instance (VMI) to fail when the target node is > evicted during the upgrade. > > --> "causes a migrating virtual" -> "may cause a migrating virtual" > ---> "when the target node is evicted during the upgrade." should be removed > > This is because the virt-launcher Pod does not successfully notify the > virt-handler Pod that migration has failed. > The result is that the source VMI `migrationState` is not updated, and the > source node remains in a `Ready,SchedulingDisabled` state. > ---> "and the source node remains in a `Ready,SchedulingDisabled` state." > should be removed. Managed to add my comment in the PR as well
Thanks Ruth. Updated the PR as suggested.
Reviewed, moving to verify. (Note that since we could not reproduce this bug, I could not verify the workaround.)