Bug 2003473 - Failed to Migrate Windows VM with CDROM (readonly)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 4.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: Jed Lejosne
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On:
Blocks: 2007776
Reported: 2021-09-12 17:35 UTC by Israel Pinto
Modified: 2021-11-02 16:01 UTC

Fixed In Version: virt-operator-container-v4.9.0-48 hco-bundle-registry-container-v4.9.0-212
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2007776
Environment:
Last Closed: 2021-11-02 16:01:09 UTC
Target Upstream Version:
Embargoed:


Attachments
win10 VM booted successfully (436.69 KB, image/png)
2021-09-24 13:42 UTC, Kedar Bidarkar


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubevirt kubevirt pull 6424 | 0 | None | Merged | migration: persist XML changes to the target | 2021-11-25 20:16:13 UTC
Red Hat Product Errata | RHSA-2021:4104 | 0 | None | None | None | 2021-11-02 16:01:30 UTC

Description Israel Pinto 2021-09-12 17:35:48 UTC
Description of problem:
Failed to migrate Windows VM with CDROM 

Version-Release number of selected component (if applicable):
OCP: 4.9.0-rc.0
Red Hat Enterprise Linux CoreOS 49.84.202109041651-0 (Ootpa)  
Kernel: 4.18.0-305.12.1.el8_4.x86_64
CNV: 4.9


Steps to Reproduce:
1. Create Windows VM with CDROM
2. Migrate VM

Actual results:
Migration failed.

Additional info:
1. Tested migration of a Windows VM with CDROM on CNV 4.8.2: migration passed
2. Tested migration of a RHEL VM with CDROM on CNV 4.9: migration passed
3. Tested with OS (windows) which doesn't boot - Same results migration failed
4. Disk size on windows:
(cdrom)     CDFS  D:\  378.7 MiB  378.7 MiB
(root disk) NTFS  C:\  33.96 GiB  20.26 GiB

Comment 2 Israel Pinto 2021-09-12 17:40:11 UTC
libvirt spec for the disk:
<disk type='file' device='cdrom'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' discard='unmap'/>
  <source file='/var/run/kubevirt-ephemeral-disks/disk-data/windows-guest-tools/disk.qcow2' index='2'/>
  <backingStore type='file' index='3'>
    <format type='raw'/>
    <source file='/var/run/kubevirt/container-disks/disk_0.img'/>
  </backingStore>
  <target dev='sda' bus='sata'/>
  <readonly/>
  <alias name='ua-windows-guest-tools'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
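
For readers who want to pull the relevant attributes out of a disk element like the one above, here is a minimal sketch using only the Python standard library. The helper name summarize_disk is hypothetical and is not part of KubeVirt or libvirt; the XML is copied from this comment.

```python
import xml.etree.ElementTree as ET

# Disk element copied from the comment above.
DISK_XML = """
<disk type='file' device='cdrom'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' discard='unmap'/>
  <source file='/var/run/kubevirt-ephemeral-disks/disk-data/windows-guest-tools/disk.qcow2' index='2'/>
  <backingStore type='file' index='3'>
    <format type='raw'/>
    <source file='/var/run/kubevirt/container-disks/disk_0.img'/>
  </backingStore>
  <target dev='sda' bus='sata'/>
  <readonly/>
  <alias name='ua-windows-guest-tools'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
"""


def summarize_disk(disk_xml: str) -> dict:
    """Hypothetical helper: extract the fields relevant to this bug."""
    disk = ET.fromstring(disk_xml)
    # find("source") only matches direct children, so it returns the overlay,
    # not the <source> nested inside <backingStore>.
    backing = disk.find("backingStore/source")
    return {
        "device": disk.get("device"),
        "bus": disk.find("target").get("bus"),
        "readonly": disk.find("readonly") is not None,
        "overlay": disk.find("source").get("file"),
        "backing": backing.get("file") if backing is not None else None,
    }


summary = summarize_disk(DISK_XML)
print(summary)
```

The summary makes it easy to see that the CD-ROM is a read-only SATA device whose qcow2 overlay sits on top of a raw container-disk image.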

Comment 4 Jed Lejosne 2021-09-14 13:39:24 UTC
@ipinto that line from "Additional info" confuses me:
3. Tested with OS (windows) which doesn't boot - Same results migration failed

What do you mean by "which doesn't boot"? Also, in "Same results", which results is that referring to?
Could we also get more info on the specific version of Windows you used?

Thank you!

Comment 5 Kedar Bidarkar 2021-09-14 14:28:04 UTC
@Jed, checked with Israel, below is the info.

3. Tested with OS (windows) which doesn't boot - Same results migration failed
Which Windows OS was used with 4.9 here? This was Windows 10.

Roman had asked us to check whether we could migrate a VM that does not boot but still has the CD-ROM attached --> Result: migration still failed.

Comment 6 Jed Lejosne 2021-09-14 18:20:34 UTC
Thanks Kedar.
Still unclear what a VM (that does not boot) is... Does a regular VM migrate properly?
I just tried to repro this by creating a Windows 10 VM using the wizard, and the migration request fails with:
Error migrating VirtualMachine Internal error occurred: admission webhook "migration-create-validator.kubevirt.io" denied the request: Cannot migrate VMI, Reason: DisksNotLiveMigratable, Message: cannot migrate VMI: PVC win10-afraid-hoverfly is not shared, live migration requires that all PVCs must be shared (using ReadWriteMany access mode)

Am I doing something wrong?
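
The rule quoted in that webhook message (every PVC must use the ReadWriteMany access mode) can be illustrated with a short sketch. This is not KubeVirt's validator code: the function name non_shared_pvcs, the second PVC, and the ReadWriteOnce mode assumed for win10-afraid-hoverfly are all made up for illustration, since the thread does not show the actual PVC spec.

```python
def non_shared_pvcs(pvcs):
    """Return the names of PVCs that lack the ReadWriteMany access mode."""
    return [
        pvc["metadata"]["name"]
        for pvc in pvcs
        if "ReadWriteMany" not in pvc["spec"].get("accessModes", [])
    ]


# Hypothetical PVC objects, shaped like the Kubernetes API would return them.
pvcs = [
    {
        "metadata": {"name": "win10-afraid-hoverfly"},
        # Assumption: the real access mode is not shown in this bug.
        "spec": {"accessModes": ["ReadWriteOnce"]},
    },
    {
        "metadata": {"name": "example-shared-disk"},
        "spec": {"accessModes": ["ReadWriteMany"]},
    },
]

blocked = non_shared_pvcs(pvcs)
print(blocked)
```

Any name in the resulting list would be expected to trigger a DisksNotLiveMigratable rejection like the one above.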

Comment 7 Guohua Ouyang 2021-09-15 01:37:44 UTC
(In reply to Jed Lejosne from comment #6)

Which storage class are you using? A VM is not migratable if the storage class is hostpath-provisioner. It's better to use 'ocs-storagecluster-ceph-rbd' if it's available.

Comment 8 Guohua Ouyang 2021-09-15 01:43:35 UTC
I could reproduce this issue on CNV 4.9 + OCP 4.9.0-rc.0 and CNV 4.8.2-17 + OCP-4.8.11.
The steps are the same as in bug 1966903:
1. Create a Windows VM from the UI with storage class 'ocs-storagecluster-ceph-rbd' (the windows-guest-tools CD-ROM disk is attached by default)
2. Start the VM
3. Migrate the VM

Comment 9 Jed Lejosne 2021-09-15 20:20:31 UTC
I was able to reproduce this too, thanks to Kedar, on an OCP 4.9.0-rc.1 cluster.
The VM used ocs-storagecluster-ceph-rbd and had the windows-guest-tools ISO attached.

It's important to note that the migration never actually fails (or it does after more than an hour).
Hopefully that's what others noticed as well.
I gathered a pile of logs and started reading through it.
It's hard to figure out which warnings/errors are critical and which aren't, but a recurring one caught my eye.
I kept getting this in the source virt-launcher logs:

Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats)

Interestingly, a quick Google search yielded https://bugzilla.redhat.com/show_bug.cgi?id=1784343, in which the same error causes the same problem, but in OpenStack.
The root cause there was also related to multiple disks!

Other than that, I think these few things are worth noting:
- The (paused) target libvirt domain is marked as transient (not persistent). Maybe this is expected of unfinished migration targets, but if not then this is an issue.
- I got a lot of messages about picking a PDB at random. What's interesting is that sometimes the PDB that's picked is "kubevirt-migration-pdb-kubevirt-migrate-vm-xxx" instead of "kubevirt-disruption-budget-xxx". Not sure if that's a concern, but it looks suspicious.
- Cancelling the migration seems to be impossible, and so is stopping the VM. I'm actually not sure how to force-stop everything for a second attempt.
- The migration happens in Pre-Copy mode; Post-Copy doesn't seem to be enabled in CNV.
- The only diff between xml dumps of the source vs the target is probably irrelevant:
  -      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
  +      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
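
A comparison like the one in the last bullet can be reproduced with Python's difflib. This is only a sketch: the two single-line "dumps" below contain just the line quoted above, and the helper name changed_lines is hypothetical. Real dumps would come from something like `virsh dumpxml` on each side.

```python
import difflib

# Stand-ins for the source and target XML dumps, one string per line.
source_dump = [
    "<target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>",
]
target_dump = [
    "<target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>",
]


def changed_lines(a, b):
    """Return only the removed (-) and added (+) lines of a unified diff."""
    diff = list(difflib.unified_diff(a, b, lineterm="", n=0))
    # The first two entries are the '---' / '+++' file headers; skip them,
    # then drop the '@@' hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


changed = changed_lines(source_dump, target_dump)
for line in changed:
    print(line)
```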

Comment 12 Jed Lejosne 2021-09-16 18:39:28 UTC
The target domain being transient is probably a symptom of the issue.
On live migration, we modify the domain XML to remove all migration metadata.
However, to do that, we use https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_MIGRATE_PARAM_DEST_XML
I just found out that the parameter above does not modify the persistent XML. Therefore, migration data is persisted, which might be what breaks things.
To persist the XML changes, we also need to use https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_MIGRATE_PARAM_PERSIST_XML
The previously mentioned openstack/nova project ran into this as well and fixed it there: https://opendev.org/openstack/nova/commit/1bb8ee95d4c3ddc3f607ac57526b75af1b7fbcff
I will open and link a kubevirt PR shortly to address this, and cross my fingers that it also fixes this issue.
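
The idea of passing the cleaned-up XML under both parameters can be sketched as follows. This is an illustration in Python, not the KubeVirt change (KubeVirt is written in Go), and the helper name build_migration_params is hypothetical. The two string values are assumed to match libvirt's typed-parameter names; real code should take them from the libvirt bindings (libvirt.VIR_MIGRATE_PARAM_DEST_XML and libvirt.VIR_MIGRATE_PARAM_PERSIST_XML) instead of hard-coding them.

```python
# Assumed string values of the libvirt typed-parameter names. Hard-coded here
# only so the sketch runs without the libvirt bindings installed.
VIR_MIGRATE_PARAM_DEST_XML = "destination_xml"
VIR_MIGRATE_PARAM_PERSIST_XML = "persistent_xml"


def build_migration_params(cleaned_xml: str, persist: bool = True) -> dict:
    """Hypothetical helper: build the typed-parameter dict for a migration.

    DEST_XML alone only affects the running (live) definition on the target;
    PERSIST_XML is also needed for the change to reach the persistent one.
    """
    params = {VIR_MIGRATE_PARAM_DEST_XML: cleaned_xml}
    if persist:
        params[VIR_MIGRATE_PARAM_PERSIST_XML] = cleaned_xml
    return params


# Placeholder XML; a real caller would pass the domain XML with the
# migration metadata already stripped.
params = build_migration_params("<domain>...</domain>")
print(sorted(params))
```

With libvirt-python, a dict like this would typically be handed to virDomain.migrateToURI3 together with the migration flags; whether the domain is persisted on the target also depends on those flags.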

Comment 14 Kedar Bidarkar 2021-09-24 13:33:40 UTC
VERIFIED with virt-operator-container-v4.9.0-48

Created 2 Windows 10 VMs via the templates in the web UI and was able to live-migrate both of them successfully.

VM1 was created on node-15 and successfully LiveMigrated to node-13
VM2 was created on node-15 again and successfully LiveMigrated to node-14

Comment 15 Kedar Bidarkar 2021-09-24 13:42:42 UTC
Created attachment 1825894
win10 VM booted successfully

Comment 21 sgott 2021-09-24 15:46:50 UTC
Moving to ON_QA based on comment #14

Comment 22 Kedar Bidarkar 2021-09-24 18:17:16 UTC
VERIFIED with virt-operator-container-v4.9.0-48

Comment 26 errata-xmlrpc 2021-11-02 16:01:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:4104

