Bug 2100054 - Windows VM with WSL2 guest fails to migrate
Summary: Windows VM with WSL2 guest fails to migrate
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Itamar Holder
QA Contact: vsibirsk
URL:
Whiteboard:
Depends On: 2104817 2115240
Blocks:
Reported: 2022-06-22 09:57 UTC by vsibirsk
Modified: 2022-10-13 14:22 UTC
CC: 12 users

Fixed In Version: hco-bundle-registry-container-v4.11.0-587
Doc Type: Known Issue
Doc Text:
Cause: As of 4.11.0, the TSC frequency flag must be set explicitly in libvirt. VMs coming from 4.10.z or earlier do not have a TSC frequency setting.
Consequence: Windows VMs with a WSL2 guest (i.e., CNV VMs using nested virtualization) fail to migrate.
Workaround (if any): None currently.
Result: The fix is expected in a 4.11 z-stream release.
Clone Of:
Environment:
Last Closed: 2022-09-14 19:36:10 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 7986 | 0 | None | open | [Bug-fix]: Windows VM HyperV Reenlightenment enabled fails to migrate | 2022-07-15 21:24:45 UTC
Red Hat Product Errata | RHSA-2022:6526 | 0 | None | None | None | 2022-09-14 19:36:18 UTC

Description vsibirsk 2022-06-22 09:57:22 UTC
Description of problem:
Migration of a Windows VM that has a WSL2 guest fails.

Version-Release number of selected component (if applicable):
Openshift version: 4.11.0-fc.0
CNV version: 4.11.0
HCO image: brew.registry.redhat.io/rh-osbs/iib:251904


How reproducible:
100%

Steps to Reproduce:
1. Create a Windows 10 VM with:
cpu:
  features:
  - name: vmx
    policy: require
2. Start the VM (it should start OK, and the WSL2 Linux guest should also start OK).
3. Migrate the VM.

Actual results:
Migration fails

Expected results:
Migration succeeds

Additional info:
Two types of error messages were found in the source virt-handler logs for the failed migration:
1)
{"component":"virt-launcher","level":"error","msg":"internal error: qemu unexpectedly closed the monitor: 2022-06-21T14:03:47.472258Z qemu-kvm: Guest enabled re-enlightenment notifications, 'tsc-frequency=' has to be specified","pos":"virNetClientProgramDispatchError:172","subcomponent":"libvirt","thread":"30","timestamp":"2022-06-21T14:03:47.763000Z"}
{"component":"virt-launcher","level":"info","msg":"2022-06-21T14:03:47.472291Z qemu-kvm: error while loading state for instance 0x0 of device 'cpu'","subcomponent":"libvirt","timestamp":"2022-06-21T14:03:47.763469Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed.","name":"windows-wsl2-1655820075-1479514","namespace":"virt-general-test-wls2","pos":"live-migration-source.go:1052","reason":"error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-06-21T14:03:47.472258Z qemu-kvm: Guest enabled re-enlightenment notifications, 'tsc-frequency=' has to be specified\n2022-06-21T14:03:47.472291Z qemu-kvm: error while loading state for instance 0x0 of device 'cpu'\n2022-06-21T14:03:47.474064Z qemu-kvm: load of migration failed: Invalid argument')","timestamp":"2022-06-21T14:03:47.830105Z","uid":"407b7386-4285-40cd-8fb3-a37253f11f7f"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to migrate vmi","name":"windows-wsl2-1655820075-1479514","namespace":"virt-general-test-wls2","pos":"server.go:111","reason":"migration job 7d8cf65e-0ca3-4b53-a1c5-cfb912fb0f65 already executed, finished at 2022-06-21 14:03:47.725005494 +0000 UTC, completed: true, failed: true, abortStatus: ","timestamp":"2022-06-21T14:03:47.914858Z","uid":"407b7386-4285-40cd-8fb3-a37253f11f7f"}

2)
{"component":"virt-launcher","level":"error","msg":"operation failed: migration out job: unexpectedly failed","pos":"qemuMigrationJobCheckStatus:1749","subcomponent":"libvirt","thread":"33","timestamp":"2022-06-21T19:40:49.306000Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Failed to migrate vmi","name":"windows-wsl2-1655828106-1022391","namespace":"virt-general-test-wls2","pos":"server.go:111","reason":"migration job 89a5647e-04bb-4389-a0e2-975e3d570034 already executed, finished at 2022-06-21 19:40:49.646566613 +0000 UTC, completed: true, failed: true, abortStatus: ","timestamp":"2022-06-21T19:40:49.769054Z","uid":"099cb027-11f1-45d7-a99a-b5f51b724a33"}
{"component":"virt-launcher","level":"error","msg":"internal error: qemu unexpectedly closed the monitor: 2022-06-21T19:40:49.239598Z qemu-kvm: Guest enabled re-enlightenment notifications, 'tsc-frequency=' has to be specified","pos":"virNetClientProgramDispatchError:172","subcomponent":"libvirt","thread":"33","timestamp":"2022-06-21T19:40:50.010000Z"}
{"component":"virt-launcher","level":"info","msg":"2022-06-21T19:40:49.239642Z qemu-kvm: error while loading state for instance 0x0 of device 'cpu'","subcomponent":"libvirt","timestamp":"2022-06-21T19:40:50.011006Z"}
{"component":"virt-launcher","level":"error","msg":"End of file while reading data: Input/output error","pos":"virNetSocketReadWire:1792","subcomponent":"libvirt","thread":"29","timestamp":"2022-06-21T19:40:50.011000Z"}
{"component":"virt-launcher","level":"error","msg":"internal error: client socket is closed","pos":"virNetClientSendInternal:2159","subcomponent":"libvirt","thread":"33","timestamp":"2022-06-21T19:40:50.028000Z"}
{"component":"virt-launcher","kind":"","level":"error","msg":"Live migration failed.","name":"windows-wsl2-1655828106-1022391","namespace":"virt-general-test-wls2","pos":"live-migration-source.go:1052","reason":"error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=7, Message='internal error: client socket is closed')","timestamp":"2022-06-21T19:40:50.039747Z","uid":"099cb027-11f1-45d7-a99a-b5f51b724a33"}


Removing
cpu:
  features:
  - name: vmx
    policy: require
from the VM spec fixes the migration issue, but then the WSL2 guest won't work.

Comment 1 Itamar Holder 2022-06-23 09:29:07 UTC
First update:

I found this QEMU commit: https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07116.html
There, it says: "Require 'tsc-frequency=' command line option to be specified for successful migration when re-enlightenment was enabled by the guest."

This means the behavior is intentional on QEMU's side, which suggests a KubeVirt issue: perhaps we don't provide that argument.

I'm still looking for a workaround / solution. Will update as soon as I have more information.

Comment 2 Itamar Holder 2022-06-23 10:10:46 UTC
Second update:

*** WORKAROUND FOUND ***

*Bug root cause:*
In converter.go [1], we're adding a tsc frequency parameter if vmi.Status.TopologyHints != nil && vmi.Status.TopologyHints.TSCFrequency != nil.
In virt-controller [2], we check whether TopologyHints need to be added to the VMI by calling TopologyHintsRequiredForVMI(). That function checks whether the "invtsc" feature exists in the VMI's spec (vmi.Spec.Domain.CPU.Features) with a "require" or "force" policy. Only then do we add the TopologyHints to the VMI, which (in [1]) leads to this parameter being added and passed on to libvirt / QEMU.

*Workaround*
Simply add the invtsc feature to the VM's spec:

cpu:
  features:
  - name: vmx
    policy: require
  - name: invtsc
    policy: require

After doing so the VM is able to migrate successfully.

[1] https://github.com/kubevirt/kubevirt/blob/v0.54.0/pkg/virt-launcher/virtwrap/converter/converter.go#L1691
[2] https://github.com/kubevirt/kubevirt/blob/v0.54.0/pkg/virt-controller/watch/vmi.go#L438
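The gating described in the root cause can be sketched as follows. This is a simplified approximation of the v0.54.0 behavior, not KubeVirt's code: `CPUFeature`, `topologyHintsRequired` and `tscFrequencyFor` are hypothetical stand-ins for `TopologyHintsRequiredForVMI()` and the converter check, and the frequency is passed in directly instead of being read from `vmi.Status.TopologyHints`.

```go
package main

import "fmt"

// CPUFeature mirrors one entry of the VM spec's cpu.features list.
type CPUFeature struct {
	Name   string
	Policy string
}

// topologyHintsRequired approximates TopologyHintsRequiredForVMI() in v0.54.0:
// hints are only added when "invtsc" is requested with "require" or "force".
func topologyHintsRequired(features []CPUFeature) bool {
	for _, f := range features {
		if f.Name == "invtsc" && (f.Policy == "require" || f.Policy == "force") {
			return true
		}
	}
	return false
}

// tscFrequencyFor approximates the converter: a TSC frequency is only handed to
// libvirt/QEMU when topology hints (carrying the frequency) were added.
func tscFrequencyFor(features []CPUFeature, hintHz uint64) (uint64, bool) {
	if !topologyHintsRequired(features) {
		return 0, false
	}
	return hintHz, true
}

func main() {
	vmxOnly := []CPUFeature{{Name: "vmx", Policy: "require"}}
	withInvtsc := []CPUFeature{
		{Name: "vmx", Policy: "require"},
		{Name: "invtsc", Policy: "require"},
	}
	const hintHz = 2_400_000_000 // illustrative value, not a real node's frequency

	for _, spec := range [][]CPUFeature{vmxOnly, withInvtsc} {
		hz, ok := tscFrequencyFor(spec, hintHz)
		fmt.Printf("features=%v frequency passed=%v (%d Hz)\n", spec, ok, hz)
	}
}
```

The sketch only explains why the vmx-only spec never received a frequency; as comment 14 notes, adding invtsc is not a usable workaround for Windows guests.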

Comment 3 Fabian Deutsch 2022-06-28 07:48:26 UTC
A few questions

1. What changed from 4.9 and 4.10 to 4.11 that this flag is now required but not before?
2. From the HANA template we know that the TSC frequency is also sometimes required. Why is the frequency not required here?

Comment 5 Itamar Holder 2022-06-28 10:12:47 UTC
(In reply to Fabian Deutsch from comment #3)
> A few questions
> 
> 1. What changed from 4.9 and 4.10 to 4.11 that this flag is now required but
> not before?
> 2. From the HANA template we know that the tsc frequency is also sometimes
> required. Why is the frequency not required here?

Hey Fabian,

1. As written in comment #1, the change came from a QEMU commit from March 2021: https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg07116.html.
2. I see that this template has "invtsc" enabled without "vmx" or Hyper-V enabled. I'm not sure why the template needs it, but I don't think it's related to the WSL2 case.

Comment 6 Fabian Deutsch 2022-06-28 12:11:57 UTC
Hey Itamar.

1.1 Do we know in what RHEL-AV/RHEL version this has landed?
1.2 Why do you think that the qemu patch is relevant?

Comment 7 Itamar Holder 2022-07-03 14:39:23 UTC
(In reply to Fabian Deutsch from comment #6)
> Hey Itamar.
> 
> 1.1 Do we know in what RHEL-AV/RHEL version this has landed?
> 1.2 Why do you think that the qemu patch is relevant?

1. This landed in QEMU 6.0.0. In CNV 4.9/4.10 we use QEMU 5.2.0. In CNV 4.11 we use QEMU 6.2.0.
2. Because it breaks backward compatibility in a way that forces VMs with Hyper-V re-enlightenment enabled to provide the 'tsc-frequency=' parameter.

The description of this PR is now updated with all the information; I recommend looking there for more details: https://github.com/kubevirt/kubevirt/pull/7986

Comment 8 Kedar Bidarkar 2022-07-05 17:02:00 UTC
Resetting the "Target Release" version to 4.11.1, as suggested by the Nested Virt Support update in the KCS article [1].

[1] - https://access.redhat.com/solutions/6692341

Comment 13 sgott 2022-07-12 13:25:02 UTC
As a clarification to Comment #12, there was some confusion in the verification procedure. While the VMs did come up according to KubeVirt, Windows did not boot. Windows cannot use the invtsc flag.

Comment 14 Itamar Holder 2022-07-21 14:21:53 UTC
Clarification & update:

The fix:
This bug is being fixed in this PR: https://github.com/kubevirt/kubevirt/pull/7986.
Essentially, we have a mechanism that finds the minimum TSC frequency on the cluster and provides it to QEMU. The same mechanism is now used for Windows VMs with re-enlightenment enabled. You can read the PR description for more info.

This fix, however, will only land in 4.11.1.
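The idea behind the fix can be sketched as below. This is a hypothetical simplification, not the PR's code: the node-to-frequency map, `VMSpec` with its `HypervReenlightenment` field (standing in for KubeVirt's Hyper-V `reenlightenment` feature), `requiresTSCFrequency` and `minTSCFrequency` are all illustrative names. It only shows the two parts described above: extending the "needs a TSC frequency" check to re-enlightenment VMs, and picking the minimum frequency reported across the cluster.

```go
package main

import (
	"errors"
	"fmt"
)

// Feature is one cpu.features entry (name + policy).
type Feature struct{ Name, Policy string }

// VMSpec is a hypothetical, trimmed-down VM description for this sketch.
type VMSpec struct {
	Features []Feature
	// HypervReenlightenment stands in for the Hyper-V re-enlightenment setting.
	HypervReenlightenment bool
}

// requiresTSCFrequency extends the old invtsc-only check with the
// re-enlightenment case that the fix adds.
func requiresTSCFrequency(vm VMSpec) bool {
	if vm.HypervReenlightenment {
		return true
	}
	for _, f := range vm.Features {
		if f.Name == "invtsc" && (f.Policy == "require" || f.Policy == "force") {
			return true
		}
	}
	return false
}

// minTSCFrequency returns the lowest non-zero TSC frequency (Hz) reported by
// the nodes; a zero entry means the node reported nothing.
func minTSCFrequency(nodeTSCHz map[string]uint64) (uint64, error) {
	var lowest uint64
	found := false
	for _, hz := range nodeTSCHz {
		if hz == 0 {
			continue
		}
		if !found || hz < lowest {
			lowest, found = hz, true
		}
	}
	if !found {
		return 0, errors.New("no node reported a TSC frequency")
	}
	return lowest, nil
}

func main() {
	// Illustrative values only.
	nodes := map[string]uint64{
		"node-a": 2_900_000_000,
		"node-b": 2_400_000_000,
		"node-c": 0, // did not report a frequency
	}
	vm := VMSpec{
		Features:              []Feature{{Name: "vmx", Policy: "require"}},
		HypervReenlightenment: true,
	}

	if !requiresTSCFrequency(vm) {
		fmt.Println("no TSC frequency needed")
		return
	}
	hz, err := minTSCFrequency(nodes)
	if err != nil {
		fmt.Println("cannot pick a TSC frequency:", err)
		return
	}
	fmt.Printf("use %d Hz as the VM's TSC frequency\n", hz)
}
```

The sketch leaves out how frequencies are collected from nodes and how scheduling uses them; the PR description covers those details.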

The workaround:
Unfortunately, the workaround does not work, because it turns out that KVM does not support Windows with invtsc.
No other working workaround has been found so far. I will edit the KBase article to include this information.

Comment 15 Shikha Jhala 2022-07-27 20:05:39 UTC
Added the known issue to the 4.11 release notes. @zpeng, please review: https://github.com/openshift/openshift-docs/pull/48328. Thank you.

Comment 16 zhe peng 2022-07-28 11:34:54 UTC
Checked the related content in the PR, LGTM.

Comment 17 vsibirsk 2022-08-07 10:10:28 UTC
Verified on:
Openshift version: 4.11.0-rc.6
CNV version: 4.11.0
HCO image: brew.registry.redhat.io/rh-osbs/iib:286788

Windows VM with WSL2 guest starts and migrates successfully.

Comment 20 errata-xmlrpc 2022-09-14 19:36:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.11.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6526

