Description of problem:
Failed to migrate a Windows VM with a CD-ROM attached.

Version-Release number of selected component (if applicable):
OCP: 4.9.0-rc.0
Red Hat Enterprise Linux CoreOS 49.84.202109041651-0 (Ootpa)
Kernel: 4.18.0-305.12.1.el8_4.x86_64
CNV: 4.9

Steps to Reproduce:
1. Create a Windows VM with a CD-ROM
2. Migrate the VM

Actual results:
Migration failed.

Additional info:
1. Tested migration of a Windows VM with a CD-ROM on CNV 4.8.2 - migration passed
2. Tested migration of a RHEL VM with a CD-ROM on CNV 4.9 - migration passed
3. Tested with a (Windows) OS which doesn't boot - same result, migration failed
4. Disk sizes on Windows:
   (cdrom) CDFS D:\ 378.7 MiB / 378.7 MiB
   (root disk) NTFS C:\ 33.96 GiB / 20.26 GiB
libvirt spec for the disk:

<disk type='file' device='cdrom'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' discard='unmap'/>
  <source file='/var/run/kubevirt-ephemeral-disks/disk-data/windows-guest-tools/disk.qcow2' index='2'/>
  <backingStore type='file' index='3'>
    <format type='raw'/>
    <source file='/var/run/kubevirt/container-disks/disk_0.img'/>
  </backingStore>
  <target dev='sda' bus='sata'/>
  <readonly/>
  <alias name='ua-windows-guest-tools'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
@ipinto, this line from "Additional info" confuses me:

> 3. Tested with OS (windows) which doesn't boot - Same results migration failed

What do you mean by "which doesn't boot"? Also, in "Same results", which results is that referring to? Could we also get more info on the specific version of Windows you used? Thank you!
@Jed, I checked with Israel, below is the info.

> 3. Tested with OS (windows) which doesn't boot - Same results migration failed

Which Windows OS was used with 4.9 here? It was Windows 10.

Roman had asked us to check whether we can migrate a VM that does not boot but still has the CD-ROM attached. Result: migration still failed.
Thanks Kedar. It's still unclear to me what a VM "that does not boot" is... Does a regular VM migrate properly?
I just tried to reproduce this by creating a Windows 10 VM using the wizard, and the migration request fails with:

Error migrating VirtualMachine Internal error occurred: admission webhook "migration-create-validator.kubevirt.io" denied the request: Cannot migrate VMI, Reason: DisksNotLiveMigratable, Message: cannot migrate VMI: PVC win10-afraid-hoverfly is not shared, live migration requires that all PVCs must be shared (using ReadWriteMany access mode)

Am I doing something wrong?
(In reply to Jed Lejosne from comment #6)
> Thanks Kedar.
> Still unclear what a VM (that does not boot) is... Does a regular VM migrate
> properly?
> I just tried to repro this by creating a Windows 10 VM using the wizard, and
> the migration request fails with:
> Error migrating VirtualMachine Internal error occurred: admission webhook
> "migration-create-validator.kubevirt.io" denied the request: Cannot migrate
> VMI, Reason: DisksNotLiveMigratable, Message: cannot migrate VMI: PVC
> win10-afraid-hoverfly is not shared, live migration requires that all PVCs
> must be shared (using ReadWriteMany access mode)
>
> Am I doing something wrong?

Which storage class do you use? The VM is not migratable if the storage class is hostpath-provisioner. Better to use 'ocs-storagecluster-ceph-rbd' if it's available.
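For reference, a minimal sketch of a PVC that would satisfy the webhook's ReadWriteMany requirement (the claim name and size here are made up for illustration; with ceph-rbd, RWX generally also requires block volume mode):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: win10-rootdisk              # hypothetical name
spec:
  accessModes:
    - ReadWriteMany                 # required for live migration
  volumeMode: Block                 # RWX on ceph-rbd needs block mode
  resources:
    requests:
      storage: 40Gi
  storageClassName: ocs-storagecluster-ceph-rbd
```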
I could reproduce this issue on CNV 4.9 + OCP 4.9.0-rc.0 and on CNV 4.8.2-17 + OCP 4.8.11. The steps are the same as in bug 1966903:
1. Create a Windows VM from the UI with storage class 'ocs-storagecluster-ceph-rbd' (the windows-guest-tools CD-ROM disk is attached by default)
2. Start the VM
3. Migrate the VM
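Step 3 above can also be triggered by posting a migration object directly, which is a minimal sketch assuming the VMI is named 'win10' (hypothetical name):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migrate-win10               # hypothetical name
spec:
  vmiName: win10                    # name of the running VMI to migrate
```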
I was able to reproduce this too, thanks to Kedar, on an OCP 4.9.0-rc.1 cluster. The VM used ocs-storagecluster-ceph-rbd and had the windows-guest-tools ISO attached.

It's important to note that the migration never actually fails (or only does after more than an hour). Hopefully that's what others noticed as well.

I gathered a pile of logs and started reading through them. It's hard to figure out which warnings/errors are critical and which aren't, but a recurring one caught my eye. I kept getting this in the source virt-launcher logs:

Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainGetJobStats)

Interestingly, a quick Google search yielded https://bugzilla.redhat.com/show_bug.cgi?id=1784343, where the same error causes the same problem, but in OpenStack. The root cause there was even also related to multiple disks!

Other than that, I think these few things are worth noting:
- The (paused) target libvirt domain is marked as transient (not persistent). Maybe this is expected of unfinished migration targets, but if not then this is an issue.
- I got a lot of messages about picking a PDB at random. What's interesting is that sometimes the PDB that's picked is "kubevirt-migration-pdb-kubevirt-migrate-vm-xxx" instead of "kubevirt-disruption-budget-xxx". Not sure if that's a concern, but it looks suspicious.
- Cancelling the migration seems to be impossible, and so is stopping the VM. I'm actually not sure how to force-stop everything for a second attempt.
- The migration happens in pre-copy mode; post-copy doesn't seem to be enabled in CNV.
- The only diff between XML dumps of the source vs the target is probably irrelevant:
  - <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
  + <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
The target domain being transient is probably a symptom of the issue.

On live migration, we modify the domain XML to remove all migration metadata. However, to do that, we use:
https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_MIGRATE_PARAM_DEST_XML

I just found out that this parameter does not modify the persistent XML, so the migration metadata ends up persisted on the target, which might be what breaks things. To persist the XML changes as well, we also need to use:
https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_MIGRATE_PARAM_PERSIST_XML

The previously mentioned openstack/nova project ran into this as well and fixed it there: https://opendev.org/openstack/nova/commit/1bb8ee95d4c3ddc3f607ac57526b75af1b7fbcff

I will open and link a kubevirt PR shortly to address this, and cross my fingers that it also fixes this issue.
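To make the fix concrete, here is a minimal Python sketch (not the actual KubeVirt code, which is Go) of the parameter set the migrate call needs. The constant names are the real libvirt ones; the string values mirror the keys libvirt defines for these typed parameters. The helper function is hypothetical, for illustration only:

```python
# Sketch of the libvirt migration parameters involved in this bug.
# The constant-to-string mapping matches libvirt's definitions in
# libvirt-domain.h; build_migrate_params() is a made-up helper.
VIR_MIGRATE_PARAM_DEST_XML = "destination_xml"      # live XML on target only
VIR_MIGRATE_PARAM_PERSIST_XML = "persistent_xml"    # persistent XML on target

def build_migrate_params(live_xml: str, persistent_xml: str) -> dict:
    """Return the params dict for a migrate call.

    Passing only DEST_XML (the pre-fix behavior) leaves stale migration
    metadata in the persistent config on the target; PERSIST_XML must be
    supplied as well so both copies of the domain XML are rewritten.
    """
    return {
        VIR_MIGRATE_PARAM_DEST_XML: live_xml,
        VIR_MIGRATE_PARAM_PERSIST_XML: persistent_xml,
    }
```

With libvirt-python, such a dict would be handed to `virDomain.migrate3()`; the point is simply that both keys must be present.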
VERIFIED with virt-operator-container-v4.9.0-48

Created 2 Windows 10 VMs from the templates in the Web UI and was able to successfully live-migrate them.
VM1 was created on node-15 and successfully live-migrated to node-13.
VM2 was created on node-15 again and successfully live-migrated to node-14.
Created attachment 1825894 [details] win10 VM booted successfully
Moving to ON_QA based on comment #14
VERIFIED with virt-operator-container-v4.9.0-48
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.9.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:4104