Description of problem:
While running a live storage migration operation to move a vDisk from one Storage Domain to another, VM switched to NotResponding, SnapshotVDSCommand failed due timeout on engine, then VDS changed to not responding too.
LSM flow continued, AFAIU, this flow would be fixed in 1497355.
We now see the auto-generated snapshot on source storage and vDisks were created on destination storage domain, although VM still uses vols on source storage domain.
Version-Release number of selected component (if applicable):
Guest OS: Windows 2012 R2 x64 with latest guest agent installed
100% in customer's environment
Steps to Reproduce:
1. W2k12R2 running VM with 1 preallocated vDisk
2. Move a vDisk from one SD to another. Block based in this case
freeze() filesystem call hangs on guest and LSM operation won't finish correctly, VM will still be using source vDisks. auto-generated snapshot is created but useless in this flow.
freeze() filesystem call on guest should succeed so LSM operation can complete and vDisk moved to destination SD
This has little to do with our guest agent. In VDSM we use libvirt command virDomainFSFreeze() which in turn calls guest-fsfreeze-freeze command of QEMU Guest Agent.
From the VDSM position there's not much we can do. We cannot even timeout the request to freeze, but that wouldn't help much anyway. I'm not even sure there's much libvirt can do about it. Even if libvirt timeouts on the request there's no way of knowing in which state the VM and it's disks are (notice that the thaw request later failed). It seems libvirt itself does not know in which state the VM is (moreover, my guess is that the domain remains internaly locked) and that is why the VM is reported as unresponsive in Engine (again, this has nothing to do with guest agent).
I'm moving this to storage for review of the storage flow.
Javier, can you please:
1) describe how to reproduce this
2) Make sure at what state is the QEMU GA service on the guest *before* the request to LSM? From event logs it looks like the service manager times out waiting for QEMU GA to start, but the service finishes starting 50 seconds later.
3) share libvirt logs for the VM, we may need to open another bug on libvirt/qemu
Frankly, there's no point in the virDomainFSFreeze call in this flow. This isn't a snapshot we're ever going to use as a such, and we don't care about its consistency.
The call was removed by bug 1506697, which should make this bz a mute point.
As a workaround, user shared that he was able to perform the LSM operation by stopping ovirt-guest-agent & qemu-ga inside the guest, this way the vDisk was moved between storage domains, he also recorded a video of the operation, let me know if this is relevant so user can upload into our FTP server.
(In reply to Javier Coscia from comment #11)
> As a workaround, user shared that he was able to perform the LSM operation
> by stopping ovirt-guest-agent & qemu-ga inside the guest, this way the vDisk
> was moved between storage domains, he also recorded a video of the
> operation, let me know if this is relevant so user can upload into our FTP
Verified with the following code:
Verified with the following scenario:
1. Start the VM and connect to the console
2. Ran LSM >>>> The files system of the VM did not freeze and all operations run continued
Moving to VERIFIED
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.