1557448 – remoteDispatchDomainFSFreeze hangs when taking a snapshot

Summary: remoteDispatchDomainFSFreeze hangs when taking a snapshot

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	4.1.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	ovirt-4.2.0
Target Release:	---
Assignee:	Benny Zlotnik
QA Contact:	Kevin Alon Goldblatt
Docs Contact:
URL:
Whiteboard:
Depends On:	1506697
Blocks:

Reported:	2018-03-16 15:16 UTC by Javier Coscia
Modified:	2021-09-09 13:26 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-05-15 17:48:31 UTC
oVirt Team:	Storage
Target Upstream Version:
Embargoed:
Flags:	lsvaty: testing_plan_complete-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	3194802	0	None	None	None	2018-03-20 18:52:58 UTC
Red Hat Product Errata	RHEA-2018:1488	0	None	None	None	2018-05-15 17:50:38 UTC

Description Javier Coscia 2018-03-16 15:16:33 UTC

Description of problem:

While running a live storage migration (LSM) operation to move a vDisk from one Storage Domain to another, the VM switched to NotResponding, SnapshotVDSCommand failed due to a timeout on the engine, and then the VDS changed to not responding as well.

The LSM flow continued; AFAIU, this flow would be fixed in bug 1497355.

We now see the auto-generated snapshot on the source storage, and the vDisks were created on the destination storage domain, although the VM still uses the volumes on the source storage domain.



Version-Release number of selected component (if applicable):

rhevm-4.1.9.2-0.1.el7.noarch
rhev-guest-tools-iso-4.1-7.el7ev.noarch

Guest OS: Windows 2012 R2 x64 with latest guest agent installed 


How reproducible:
100% in customer's environment

Steps to Reproduce:
1. Run a Windows 2012 R2 VM with 1 preallocated vDisk
2. Move the vDisk from one SD to another (block-based in this case)



Actual results:

The freeze() filesystem call hangs on the guest and the LSM operation does not finish correctly; the VM keeps using the source vDisks. The auto-generated snapshot is created but is useless in this flow.




Expected results:

The freeze() filesystem call on the guest should succeed so that the LSM operation can complete and the vDisk is moved to the destination SD.

Comment 5 Tomáš Golembiovský 2018-03-19 12:35:09 UTC

This has little to do with our guest agent. In VDSM we use the libvirt call virDomainFSFreeze(), which in turn calls the guest-fsfreeze-freeze command of the QEMU Guest Agent.

From the VDSM side there's not much we can do. We cannot even time out the freeze request, but that wouldn't help much anyway. I'm not even sure there's much libvirt can do about it. Even if libvirt times out on the request, there's no way of knowing what state the VM and its disks are in (notice that the thaw request later failed). It seems libvirt itself does not know what state the VM is in (moreover, my guess is that the domain remains internally locked), and that is why the VM is reported as unresponsive in Engine (again, this has nothing to do with the guest agent).
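
For readers unfamiliar with the call chain, here is a rough sketch of the freeze/snapshot/thaw pattern, not the actual VDSM code. It assumes `dom` is a libvirt-python virDomain (whose fsFreeze() and fsThaw() methods wrap virDomainFSFreeze() and virDomainFSThaw()); `snapshot_with_freeze` and `take_snapshot` are hypothetical names used only for illustration.

```python
def snapshot_with_freeze(dom, take_snapshot):
    """Freeze guest filesystems, take a snapshot, then always attempt a thaw.

    Assumption: `dom` exposes fsFreeze()/fsThaw() like a libvirt-python
    virDomain. `take_snapshot` is a hypothetical zero-argument callable.
    """
    # Blocks until the QEMU Guest Agent answers guest-fsfreeze-freeze.
    # If the agent hangs inside the guest, this call hangs as well.
    dom.fsFreeze()
    try:
        return take_snapshot()
    finally:
        # Attempt the thaw even if the snapshot failed; the thaw itself
        # may still fail if the agent is stuck.
        dom.fsThaw()
```

Note that the sketch has no timeout around fsFreeze(): as described above, giving up on the call would not tell the caller whether the guest filesystems ended up frozen.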

I'm moving this to storage for review of the storage flow.

Javier, can you please:

1) describe how to reproduce this

2) Check what state the QEMU GA service is in on the guest *before* the LSM request. From the event logs it looks like the service manager times out waiting for QEMU GA to start, but the service finishes starting 50 seconds later.

3) share the libvirt logs for the VM; we may need to open another bug on libvirt/qemu
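
One possible way to check the agent state from the host side is to send it the guest-ping and guest-fsfreeze-status commands, for example through virsh qemu-agent-command. The sketch below only builds the JSON requests and parses a reply; the helper names `qga_command` and `parse_fsfreeze_status` are hypothetical, and how the request is transported to the guest is left out.

```python
import json


def qga_command(name):
    """Build the JSON request for a QEMU Guest Agent command without arguments."""
    return json.dumps({"execute": name})


def parse_fsfreeze_status(reply):
    """Return the status string from a guest-fsfreeze-status reply.

    Assumption: `reply` is the raw JSON text returned by the agent, either
    {"return": "thawed"}, {"return": "frozen"} or an {"error": {...}} object.
    """
    data = json.loads(reply)
    if "error" in data:
        raise RuntimeError(data["error"].get("desc", "guest agent error"))
    return data["return"]
```

For example, the string produced by qga_command("guest-fsfreeze-status") could be passed to virsh qemu-agent-command for the affected domain, and the output fed to parse_fsfreeze_status().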

Comment 6 Allon Mureinik 2018-03-19 12:54:19 UTC

Frankly, there's no point in the virDomainFSFreeze call in this flow. This isn't a snapshot we're ever going to use as such, and we don't care about its consistency.

The call was removed by bug 1506697, which should make this bz a moot point.
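
The idea can be illustrated roughly as follows. This is not the engine code from bug 1506697, just a sketch of the decision; `SnapshotPurpose` and `should_freeze_guest` are hypothetical names.

```python
from enum import Enum


class SnapshotPurpose(Enum):
    # Hypothetical classification of why a snapshot is being taken.
    USER = "user"
    LIVE_STORAGE_MIGRATION = "lsm"


def should_freeze_guest(purpose):
    """Return True only when the snapshot's consistency actually matters.

    The auto-generated LSM snapshot is never used as a restore point,
    so freezing the guest filesystems for it buys nothing.
    """
    return purpose is not SnapshotPurpose.LIVE_STORAGE_MIGRATION
```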

Comment 11 Javier Coscia 2018-03-20 18:46:24 UTC

As a workaround, the user reported that he was able to perform the LSM operation by stopping ovirt-guest-agent and qemu-ga inside the guest; with both stopped, the vDisk was moved between storage domains. He also recorded a video of the operation. Let me know if it is relevant so the user can upload it to our FTP server.

Comment 13 Allon Mureinik 2018-03-21 10:29:44 UTC

(In reply to Javier Coscia from comment #11)
> As a workaround, user shared that he was able to perform the LSM operation
> by stopping ovirt-guest-agent & qemu-ga inside the guest, this way the vDisk
> was moved between storage domains, he also recorded a video of the
> operation, let me know if this is relevant so user can upload into our FTP
> server.

Yes please

Comment 18 Kevin Alon Goldblatt 2018-03-27 14:29:20 UTC

Verified with the following code:
----------------------------------------
ovirt-engine-4.2.2.5-0.1.el7.noarch
vdsm-4.20.23-1.el7ev.x86_64

Verified with the following scenario:
----------------------------------------
1. Start the VM and connect to the console
2. Run LSM >>>> The file system of the VM did not freeze and all running operations continued


Moving to VERIFIED

Comment 25 errata-xmlrpc 2018-05-15 17:48:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 26 Franta Kust 2019-05-16 13:08:47 UTC

BZ<2>Jira Resync

Comment 27 Daniel Gur 2019-08-28 13:14:48 UTC

sync2jira

Comment 28 Daniel Gur 2019-08-28 13:19:50 UTC

sync2jira
