Bug 1557448

Summary: remoteDispatchDomainFSFreeze hangs when taking a snapshot
Product: Red Hat Enterprise Virtualization Manager
Reporter: Javier Coscia <jcoscia>
Component: ovirt-engine
Assignee: Benny Zlotnik <bzlotnik>
Status: CLOSED ERRATA
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.9
CC: bzlotnik, daniel-oliveira, ebenahar, jcoscia, lsurette, mkenneth, rbalakri, Rhev-m-bugs, srevivo, ykaul, ylavi
Target Milestone: ovirt-4.2.0
Flags: lsvaty: testing_plan_complete-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-15 17:48:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1506697
Bug Blocks:

Description Javier Coscia 2018-03-16 15:16:33 UTC
Description of problem:

While running a live storage migration (LSM) operation to move a vDisk from one Storage Domain to another, the VM switched to NotResponding, SnapshotVDSCommand failed due to a timeout on the engine, and then the VDS changed to not responding too.

The LSM flow continued; AFAIU, this flow would be fixed in 1497355.

We now see the auto-generated snapshot on the source storage, and the vDisks were created on the destination storage domain, although the VM still uses the volumes on the source storage domain.



Version-Release number of selected component (if applicable):

rhevm-4.1.9.2-0.1.el7.noarch
rhev-guest-tools-iso-4.1-7.el7ev.noarch

Guest OS: Windows 2012 R2 x64 with latest guest agent installed 


How reproducible:
100% in customer's environment

Steps to Reproduce:
1. Run a Windows 2012 R2 VM with 1 preallocated vDisk
2. Move the vDisk from one SD to another (block-based in this case)



Actual results:

The freeze() filesystem call hangs on the guest and the LSM operation does not finish correctly; the VM is still using the source vDisks. The auto-generated snapshot is created but is useless in this flow.




Expected results:

The freeze() filesystem call on the guest should succeed so that the LSM operation can complete and the vDisk is moved to the destination SD.

Comment 5 Tomáš Golembiovský 2018-03-19 12:35:09 UTC
This has little to do with our guest agent. In VDSM we use the libvirt call virDomainFSFreeze(), which in turn calls the guest-fsfreeze-freeze command of the QEMU Guest Agent.
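
As a rough illustration of the last step in that chain: the QEMU Guest Agent speaks a JSON protocol, and the freeze request is its guest-fsfreeze-freeze command. The sketch below only builds the request messages; it does not talk to a guest, to libvirt, or to VDSM. The helper name qga_command is hypothetical and is not part of any of those projects.

```python
import json


def qga_command(name, arguments=None):
    """Serialize a QEMU Guest Agent request as a JSON string.

    Hypothetical helper for illustration only.
    """
    msg = {"execute": name}
    if arguments:
        msg["arguments"] = arguments
    return json.dumps(msg)


# The fsfreeze-related commands of the QEMU Guest Agent protocol.
FREEZE = qga_command("guest-fsfreeze-freeze")
STATUS = qga_command("guest-fsfreeze-status")
THAW = qga_command("guest-fsfreeze-thaw")

if __name__ == "__main__":
    for cmd in (FREEZE, STATUS, THAW):
        print(cmd)
```

If the agent inside the guest never answers the freeze request, everything stacked on top of it (libvirt, VDSM, Engine) ends up waiting for that reply.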

From the VDSM position there's not much we can do. We cannot even time out the freeze request, but that wouldn't help much anyway. I'm not even sure there's much libvirt can do about it. Even if libvirt times out on the request, there's no way of knowing what state the VM and its disks are in (notice that the thaw request later failed). It seems libvirt itself does not know what state the VM is in (moreover, my guess is that the domain remains internally locked), and that is why the VM is reported as unresponsive in Engine (again, this has nothing to do with the guest agent).
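
A minimal sketch of why a caller-side timeout alone would not settle this, assuming a blocking freeze call that is simply abandoned after a deadline. Nothing here is VDSM or libvirt code: call_with_timeout, hypothetical_hanging_freeze and guest_state are made-up names, and the "guest" is just a dictionary.

```python
import threading


def call_with_timeout(func, timeout):
    """Run func in a daemon thread; return (finished, result)."""
    result = {}

    def worker():
        result["value"] = func()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The caller stops waiting, but the call itself is still in flight.
        return False, None
    return True, result.get("value")


# Stand-in for a guest that answers the freeze request only much later.
release = threading.Event()
guest_state = {"frozen": False}


def hypothetical_hanging_freeze():
    release.wait()                 # the guest agent is stuck here
    guest_state["frozen"] = True   # ...and may still freeze afterwards
    return "frozen"


if __name__ == "__main__":
    finished, _ = call_with_timeout(hypothetical_hanging_freeze, timeout=0.1)
    print("finished in time:", finished)
    print("state known to caller:", guest_state["frozen"])
    release.set()  # the stuck call completes after the caller gave up
```

After the timeout the caller only knows that no reply arrived; the filesystems may or may not become frozen later, which is the unknown state described above.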

I'm moving this to storage for review of the storage flow.

Javier, can you please:

1) describe how to reproduce this

2) Check what state the QEMU GA service on the guest is in *before* the LSM request. From the event logs it looks like the service manager times out waiting for QEMU GA to start, but the service finishes starting 50 seconds later.

3) share libvirt logs for the VM, we may need to open another bug on libvirt/qemu

Comment 6 Allon Mureinik 2018-03-19 12:54:19 UTC
Frankly, there's no point in the virDomainFSFreeze call in this flow. This isn't a snapshot we're ever going to use as such, and we don't care about its consistency.

The call was removed by bug 1506697, which should make this bz a moot point.

Comment 11 Javier Coscia 2018-03-20 18:46:24 UTC
As a workaround, the user reported that he was able to perform the LSM operation by stopping ovirt-guest-agent & qemu-ga inside the guest; this way the vDisk was moved between storage domains. He also recorded a video of the operation. Let me know if this is relevant so the user can upload it to our FTP server.

Comment 13 Allon Mureinik 2018-03-21 10:29:44 UTC
(In reply to Javier Coscia from comment #11)
> As a workaround, user shared that he was able to perform the LSM operation
> by stopping ovirt-guest-agent & qemu-ga inside the guest, this way the vDisk
> was moved between storage domains, he also recorded a video of the
> operation, let me know if this is relevant so user can upload into our FTP
> server.

Yes please

Comment 18 Kevin Alon Goldblatt 2018-03-27 14:29:20 UTC
Verified with the following code:
----------------------------------------
ovirt-engine-4.2.2.5-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64

Verified with the following scenario:
----------------------------------------
1. Start the VM and connect to the console
2. Run LSM >>>> The file system of the VM did not freeze and all running operations continued


Moving to VERIFIED

Comment 25 errata-xmlrpc 2018-05-15 17:48:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 26 Franta Kust 2019-05-16 13:08:47 UTC
BZ<2>Jira Resync

Comment 27 Daniel Gur 2019-08-28 13:14:48 UTC
sync2jira

Comment 28 Daniel Gur 2019-08-28 13:19:50 UTC
sync2jira