Bug 2010000

Summary: Live storage migration breaks Windows (UEFI) virtual machine
Product: [oVirt] ovirt-engine
Reporter: Patrick <patrick.lomakin>
Component: BLL.Storage
Assignee: Arik <ahadas>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Avihai <aefrat>
Severity: unspecified
Priority: unspecified
Docs Contact:
Version: 4.4.8.6
CC: ahadas, bugs
Target Milestone: ---
Flags: ahadas: needinfo?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-01-02 17:37:12 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Patrick 2021-10-02 19:48:52 UTC
Description of problem:

A problem with live disk migration between different storage domains.

Version-Release number of selected component (if applicable): 4.4.8.6

How reproducible:


Steps to Reproduce:
1. Created a storage domain (iSCSI and Gluster);
2. Created a virtual machine with Windows 10 (UEFI);
3. Placed the virtual machine on the Gluster domain;
4. Started the virtual machine and, while it was running, moved the disk from the Gluster storage domain to iSCSI.

Actual results:

After rebooting the VM, the machine no longer started; boot drops into recovery mode. All attempts to restore the machine were unsuccessful.

Expected results:

Successful virtual machine startup after live disk migration

Additional info:

I want to note that this does not happen if the machine was turned off during the disk move. In that case, it starts up without any problems. The type of domain the disk is migrated from or to does not matter. In the case of live migration, the Windows (UEFI) VM breaks every time.

Comment 1 RHEL Program Management 2021-10-04 05:27:28 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 2 Arik 2021-10-04 11:12:34 UTC
Eyal, as it happens after live storage migration, why do we suspect it's a Virt issue?

Comment 3 Eyal Shenitzky 2021-10-04 12:55:55 UTC
(In reply to Arik from comment #2)
> Eyal, as it happens after live storage migration, why do we suspect it's a
> Virt issue?

I saw Sandro put it on virt so I assumed you talked on that bug (just set the assignee to you so you will not miss it).

But if you are asking, I suspect it might be related to the new snapshot flow added by the virt team, and it happens only with a running Windows VM, which is the virt area.

Comment 4 Arik 2021-10-04 18:49:08 UTC
(In reply to Eyal Shenitzky from comment #3)
> (In reply to Arik from comment #2)
> > Eyal, as it happens after live storage migration, why do we suspect it's a
> > Virt issue?
> 
> I saw Sandro put it on virt so I assumed you talked on that bug (just set
> the assignee to you so you will not miss it).

I see

> 
> But if you are asking, I suspect it might related to the new snapshot flow
> added by virt team

Can you elaborate on that? I don't recall introducing a new snapshot flow

> and it happens only in running windows VM which is virt
> area.

Because it is a running Windows VM? So if incremental backup failed for a Windows VM, the virt team would need to handle that as well? :)

OK, I'm switching it to the storage team as the live storage migration flow is the primary suspect here.
That doesn't mean it's a RHV/storage issue though, it could also be a platform issue.
And I'll try to assist

Comment 5 Arik 2021-10-04 19:23:21 UTC
Patrick, can you please provide more information:

1. What is the configuration (preferably in the form of a domain XML) of the VM? we'll be particularly interested in the disk properties (their type, whether they had snapshots - such things)

2. Does it happen only with Windows+UEFI? Did you try with the same type of disks and a different operating system?

3. You wrote that the OS gets into recovery mode - we have a similar bug in which the guest is not able to boot from a copied disk (bz 1983638) but you also mention that the VM crashed. Do you mean the qemu process crashed during the live storage migration?

4. You wrote that "all attempts to restore the machine to work were unsuccessful" - so the VM is no longer able to boot from this disk at all? even after hard reset?
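To answer question 1, the disk properties Arik asks about can be read straight out of the VM's domain XML (e.g. as dumped by `virsh dumpxml <vm>` on the host). Here is a minimal sketch using only the Python standard library; the sample XML and the `win10-uefi` name are illustrative placeholders, not taken from the reporter's environment.

```python
# Hypothetical sketch: extract per-disk properties (type, driver format,
# target bus) from a libvirt-style domain XML. The sample below is
# illustrative only, not the reporter's actual configuration.
import xml.etree.ElementTree as ET

SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <name>win10-uefi</name>
  <devices>
    <disk type='block' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source dev='/dev/mapper/example-lun'/>
      <target dev='sda' bus='scsi'/>
      <boot order='1'/>
    </disk>
  </devices>
</domain>
"""

def disk_summary(domain_xml: str):
    """Return a dict of (type, format, bus) for each <disk> in the domain."""
    root = ET.fromstring(domain_xml)
    disks = []
    for disk in root.findall("./devices/disk"):
        driver = disk.find("driver")
        target = disk.find("target")
        disks.append({
            "type": disk.get("type"),
            "format": driver.get("type") if driver is not None else None,
            "bus": target.get("bus") if target is not None else None,
        })
    return disks

print(disk_summary(SAMPLE_DOMAIN_XML))
```

For a snapshot chain, the `<backingStore>` elements under each `<disk>` would also be worth including, since live storage migration rewrites that chain.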

Comment 6 Patrick 2021-10-05 21:22:00 UTC
Arik, more information:

1. First of all, I use iSCSI for storage. A storage domain was added in the Manager over several paths. A preallocated 100G disk with the "boot" flag was added to the machine. Next, I deployed a Windows Server machine in UEFI mode. The firmware type is "Q35 Chipset with UEFI". No snapshots were created.

2. I also have virtual machines with CentOS that, with the same disk configuration and "Q35 Chipset with UEFI", migrated successfully and started successfully after reboot. I have not tried migrating machines with other chipset/firmware types.

3. The QEMU process was not affected. The problem is specifically with the OS.
Here is what I tried:
I booted via a live Windows ISO and went into the command line. I installed the VirtIO-SCSI driver and, through the diskpart utility, mounted the system disk and the recovery disk. After mounting, I disabled Windows's automatic recovery mode and could see the error message: "The operating system couldn't be loaded because the digital signature of a file couldn't be verified. Error code: 0xc0000428". Unfortunately, the error does not specify which file failed verification. An attempt to disable the signature check was also unsuccessful.


4. Yes, I lost some machines, but I was able to copy the files from them. The problem occurs specifically with Windows and only after a live migration. If you migrate just the VM disk while the VM is turned off, everything succeeds and Windows starts without problems.

Comment 7 Patrick 2021-10-06 12:38:32 UTC
Please note that a template created from a working virtual machine with Windows Server 2019 also does not work, regardless of whether a thin or thick disk is used. A clone of the working virtual machine does not work either. After startup, the virtual machine reports that there is no disk. The "bootable" flag is set on the virtual machine's disk.

Comment 8 Arik 2021-10-06 17:38:58 UTC
Thanks for the quick follow-up.
I believe some of what you wrote in comment 7 describes different issues - it would be hard to help with those unless you provide us with the VM configuration (for the case where the disk is not recognized) and details on why you say the template doesn't work.

So let's concentrate on the case we have more information on, which is that the VM enters recovery state after live storage migration (for clarity, let's not use the term "live migration", which can refer to both live storage migration, of the disk, and live migration of the virtual machine).

Is it correct that you're using UEFI without secure boot in oVirt (as you previously wrote) and that UEFI is set with secure boot within the VM? (You can check that in the UEFI settings - the menu you see when pressing ESC during the initial phase of the boot process.) In that case, I think a simple workaround for you would be to disable secure boot in the UEFI settings.

But anyway, I wonder if the different signature is really detected because the active volume changes or due to a different reason.
I'd suggest checking if it also happens with SATA disks (and if not, what version of virtio drivers you use).
It would also be interesting to check if it happens when you take a working VM, shut it down, create a snapshot for it and start it on the same host.
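One way to test Arik's hunch that "the active volume changes" is to check whether the migrated image is even byte-identical to the source. A minimal stdlib sketch, assuming raw images readable as plain files and a shut-down VM (so the content is stable); the paths are placeholders. For qcow2/sparse volumes, `qemu-img compare` is the more appropriate tool, since it compares guest-visible content rather than on-disk bytes.

```python
# Hypothetical sketch: compare a source and a migrated disk image by
# SHA-256 digest, reading in 1 MiB chunks so large images don't fill RAM.
# Paths are placeholders; run only while the VM is powered off.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a (possibly large) disk image in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def images_match(src: str, dst: str) -> bool:
    """True if both images have identical byte content."""
    return sha256_of(src) == sha256_of(dst)
```

If the digests match yet the guest still fails signature verification, that would point away from the copy itself and toward something like the NVRAM/boot-entry state changing across the migration.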

Comment 9 Patrick 2021-11-14 19:39:04 UTC
I'm sorry for the long response. I have the protected download turned off. As I noticed the problem occurs only when migrating an enabled virtual machine between storage domains with parameters - Q35 (UEFI). If the virtual machine is shut down and just move its disk to another storage domain, it starts up without any problems.

Comment 12 Arik 2021-11-16 10:13:38 UTC
(In reply to Patrick from comment #9)
> I'm sorry for the long response. I have the protected download turned off.
> As I noticed the problem occurs only when migrating an enabled virtual
> machine between storage domains with parameters - Q35 (UEFI).

What do you mean by "protected download"?
Does "enabled virtual machine" mean a virtual machine with that "protected download" option enabled?

> If the virtual
> machine is shut down and just move its disk to another storage domain, it
> starts up without any problems.

Sure, if the virtual machine is turned off and it managed to boot, moving its disks around shouldn't matter.

Comment 13 Patrick 2021-11-23 06:33:08 UTC
Arik, I'm sorry, it's a translation mistake. It meant "Secure Boot".

Comment 15 Arik 2021-12-19 20:47:54 UTC
What version of virtio-drivers did you use?

Comment 16 Arik 2022-01-02 17:37:12 UTC
I wasn't able to reproduce this with Win10+UEFI.
Please reopen if it still happens, with relevant (engine, vdsm) logs and the version of the virtio drivers that were installed.