Bug 1557448 - remoteDispatchDomainFSFreeze hangs when taking a snapshot
Status: CLOSED ERRATA
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: ---
Assigned To: Benny Zlotnik
QA Contact: Kevin Alon Goldblatt
Depends On: 1506697
Reported: 2018-03-16 11:16 EDT by Javier Coscia
Modified: 2018-05-15 13:50 EDT
CC List: 12 users

Last Closed: 2018-05-15 13:48:31 EDT
Type: Bug
oVirt Team: Storage

External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 3194802 | None | None | None | 2018-03-20 14:52 EDT
Red Hat Product Errata | RHEA-2018:1488 | None | None | None | 2018-05-15 13:50 EDT

Description Javier Coscia 2018-03-16 11:16:33 EDT
Description of problem:

While running a live storage migration (LSM) operation to move a vDisk from one storage domain to another, the VM switched to NotResponding, SnapshotVDSCommand failed due to a timeout on the engine, and then the host (VDS) also changed to NotResponding.

The LSM flow continued anyway; as far as I understand, this flow is to be fixed in bug 1497355.

We now see the auto-generated snapshot on the source storage domain and the vDisks created on the destination storage domain, although the VM still uses the volumes on the source storage domain.



Version-Release number of selected component (if applicable):

rhevm-4.1.9.2-0.1.el7.noarch
rhev-guest-tools-iso-4.1-7.el7ev.noarch

Guest OS: Windows Server 2012 R2 x64 with the latest guest agent installed


How reproducible:
100% in the customer's environment

Steps to Reproduce:
1. Run a Windows Server 2012 R2 (W2k12R2) VM with one preallocated vDisk.
2. Move the vDisk from one storage domain (SD) to another (block-based in this case).



Actual results:

The filesystem freeze() call hangs on the guest and the LSM operation does not finish correctly; the VM keeps using the source vDisks. The auto-generated snapshot is created but is useless in this flow.




Expected results:

The filesystem freeze() call on the guest should succeed, so that the LSM operation can complete and the vDisk is moved to the destination SD.
Comment 5 Tomáš Golembiovský 2018-03-19 08:35:09 EDT
This has little to do with our guest agent. In VDSM we use the libvirt call virDomainFSFreeze(), which in turn issues the guest-fsfreeze-freeze command of the QEMU Guest Agent.

From the VDSM side there's not much we can do. We cannot even time out the freeze request, and that wouldn't help much anyway. I'm not even sure there's much libvirt can do about it. Even if libvirt times out on the request, there's no way of knowing what state the VM and its disks are in (notice that the thaw request later failed). It seems libvirt itself does not know what state the VM is in (moreover, my guess is that the domain remains internally locked), and that is why the VM is reported as unresponsive in the Engine (again, this has nothing to do with the guest agent).

I'm moving this to storage for review of the storage flow.

Javier, can you please:

1) Describe how to reproduce this.

2) Check what state the QEMU GA service on the guest is in *before* the LSM request. From the event logs it looks like the service manager times out waiting for QEMU GA to start, but the service finishes starting 50 seconds later.

3) Share the libvirt logs for the VM; we may need to open another bug against libvirt/QEMU.
Comment 6 Allon Mureinik 2018-03-19 08:54:19 EDT
Frankly, there's no point in the virDomainFSFreeze call in this flow. This isn't a snapshot we're ever going to use as such, and we don't care about its consistency.

The call was removed by bug 1506697, which should make this bug a moot point.
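The freeze/snapshot/thaw sequence discussed in comments 5 and 6 can be sketched in Python. This is a minimal illustration under stated assumptions, not VDSM's actual code: `take_snapshot` and `create_snapshot` are hypothetical names, and `dom` is assumed to behave like a libvirt-python `virDomain`, whose `fsFreeze()`/`fsThaw()` methods wrap virDomainFSFreeze()/virDomainFSThaw(). With `quiesce=False`, as for the LSM auto-generated snapshot after bug 1506697, the guest agent is never contacted, so a hung qemu-ga cannot block the snapshot.

```python
from typing import Callable, List, Optional, Protocol


class FreezableDomain(Protocol):
    """Assumed subset of libvirt-python's virDomain used by this sketch."""

    def fsFreeze(self, mountpoints: Optional[List[str]] = None, flags: int = 0) -> int: ...

    def fsThaw(self, mountpoints: Optional[List[str]] = None, flags: int = 0) -> int: ...


def take_snapshot(dom: FreezableDomain,
                  create_snapshot: Callable[[], str],
                  quiesce: bool) -> str:
    """Hypothetical helper: optionally freeze guest filesystems around a snapshot."""
    if not quiesce:
        # No guest agent round trip: a hung qemu-ga cannot block this path.
        return create_snapshot()

    # Blocks until the guest agent answers guest-fsfreeze-freeze;
    # mountpoints=None is assumed to mean "all mounted filesystems".
    dom.fsFreeze(None, 0)
    try:
        return create_snapshot()
    finally:
        # Attempt a thaw even if the snapshot itself raised.
        dom.fsThaw(None, 0)
```

When `quiesce=True`, the `fsFreeze()` call blocks until the guest agent replies, which is the hang reported here; the `finally` clause only guarantees a thaw attempt once the freeze has returned.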
Comment 11 Javier Coscia 2018-03-20 14:46:24 EDT
As a workaround, the user reported that he was able to perform the LSM operation by stopping ovirt-guest-agent and qemu-ga inside the guest; this way the vDisk was moved between storage domains. He also recorded a video of the operation; let me know if this is relevant so the user can upload it to our FTP server.
Comment 13 Allon Mureinik 2018-03-21 06:29:44 EDT
(In reply to Javier Coscia from comment #11)
> As a workaround, user shared that he was able to perform the LSM operation
> by stopping ovirt-guest-agent & qemu-ga inside the guest, this way the vDisk
> was moved between storage domains, he also recorded a video of the
> operation, let me know if this is relevant so user can upload into our FTP
> server.

Yes, please.
Comment 18 Kevin Alon Goldblatt 2018-03-27 10:29:20 EDT
Verified with the following code:
----------------------------------------
ovirt-engine-4.2.2.5-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64

Verified with the following scenario:
----------------------------------------
1. Started the VM and connected to the console
2. Ran LSM: the file system of the VM did not freeze and all running operations continued


Moving to VERIFIED
Comment 25 errata-xmlrpc 2018-05-15 13:48:31 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488
