Bug 1572801 - RHEV-guest-tools fsfreeze causes snapshot failure when Windows cannot sync filesystem
Summary: RHEV-guest-tools fsfreeze causes snapshot failure when Windows cannot sync filesystem
Keywords:
Status: CLOSED DUPLICATE of bug 1613505
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-guest-agent
Version: 4.0.7
Hardware: Unspecified
OS: Windows
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Tomáš Golembiovský
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1536954 1596457 1613505
Blocks: 1547768
 
Reported: 2018-04-27 22:15 UTC by amashah
Modified: 2019-08-07 11:21 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-31 15:07:29 UTC
oVirt Team: Virt
Target Upstream Version:




Links:
Red Hat Knowledge Base (Solution) 3426861, last updated 2018-04-27 22:36:54 UTC

Description amashah 2018-04-27 22:15:18 UTC
Description of problem:
When trying to take a snapshot of a Windows VM, the guest agent issues an fsfreeze. If the guest VM is unable to sync its volumes, the VM hangs and the snapshot fails.

Version-Release number of selected component (if applicable): 4.0.7


How reproducible:
It has happened on two VMs.

Steps to Reproduce:
1. Take a live snapshot.
2. The Windows guest times out while waiting for the release writes command.

Actual results:
Failed snapshot and/or VM hung.

Expected results:
The guest agent should provide a warning and the snapshot should fail more gracefully.

Additional info:

Similar issue to BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1557448 
except in this case it is with a snapshot instead of LSM.

vdsm:

jsonrpc.Executor/1::INFO::2018-04-08 11:23:16,636::vm::3303::virt.vm::(freeze) vmId=`147d2e3d-709a-4eed-af01-9a776370aadf`::Freezing guest filesystems

jsonrpc.Executor/3::ERROR::2018-04-08 11:24:00,228::vm::2813::virt.vm::(getIoTuneResponse) vmId=`147d2e3d-709a-4eed-af01-9a776370aadf`::getVmIoTune failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2798, in getIoTuneResponse
    libvirt.VIR_DOMAIN_AFFECT_LIVE)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 77, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainFSFreeze)

var/log/messages:
Apr  9 10:26:39 host-1234 journal: Cannot start job (query, none) for domain V3TCERSEI; current job is (modify, none) owned by (5290 remoteDispatchDomainFSFreeze, 0 <null>) for (83003s, 0s)
Apr  9 10:26:39 host-1234 journal: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainFSFreeze)

audit_log:

 2018-04-08 17:26:19.111+00    | 201804081122 creating new virtual machine vm-example-SNAP (vm-example)
 2018-04-08 17:26:17.821+00    | 201804081122 create snapshot successful (vm-example)
 2018-04-08 17:26:16.141+00    | Failed to complete snapshot 'oVirtBackup' creation for VM 'vm-example'.
 2018-04-08 17:26:15.85+00     | Failed to create live snapshot 'oVirtBackup' for VM 'vm-example'. VM restart is recommended. Note that using the created snapshot might cause data inconsistency.
 2018-04-08 17:26:15.81+00     | Host host.example.com is not responding. It will stay in Connecting state for a grace period of 61 seconds and after that an attempt to fence the host will be issued.
 2018-04-08 17:26:15.801+00    | VDSM host.example.com command failed: Message timeout which can be caused by communication issues
 2018-04-08 17:24:07.985+00    | VM vm-example is not responding.
 2018-04-08 17:23:00.642+00    | Snapshot 'oVirtBackup' creation for VM 'vm-example' was initiated by user@example.com-new-authz.


From Windows Event Logs:

# grep '4/8/2018 11' SystemLog.csv |grep Error
Error,4/8/2018 11:23:48 AM,volsnap,8,None,The flush and hold writes operation on volume L: timed out while waiting for a release writes command.
Error,4/8/2018 11:23:48 AM,volsnap,8,None,The flush and hold writes operation on volume \\?\Volume{1b1112cf-2a98-11e5-9c0f-806e6f6e6963} timed out while waiting for a release writes command.
Error,4/8/2018 11:23:48 AM,volsnap,8,None,The flush and hold writes operation on volume E: timed out while waiting for a release writes command.
Error,4/8/2018 11:23:48 AM,volsnap,8,None,The flush and hold writes operation on volume C: timed out while waiting for a release writes command.


Nothing reported in guest agent logs.

Comment 1 Yaniv Kaul 2018-04-29 06:58:35 UTC
This looks, first of all, like a qemu-guest-agent bug, followed by a libvirt bug, no?

Comment 2 Michal Skrivanek 2018-04-29 17:16:56 UTC
likely yes. But I want to check why we lose connectivity to vdsm - that shouldn't happen either.

Other than that, it shouldn't be that critical, assuming the guest is trying to sync data and then times out

Comment 14 Germano Veit Michel 2018-06-28 01:16:00 UTC
Same thing every day on 2 Windows 2016 VMs in the case I'm attaching.

We enabled debug logs in the Windows guest agent (latest version, 4.2_5) and it clearly shows the freeze being received. But it does not seem to complete.

There doesn't seem to be any log on the Windows side, so we are now capturing a coredump from the qemu guest agent (to check where it's stopped) and also a qemu-kvm coredump (to check the status of the virtio disk requests) on that case.

But from the VDSM side, I don't think things are that good; there are improvements that could be made in vdsm too before moving the bug to qga. See this:

1. Freeze requested

2018-06-28 02:01:50,516+1000 INFO  (jsonrpc/0) [virt.vm] (vmId='85169716-2f97-417e-bd4c-1036619923a9') Freezing guest filesystems (vm:4313)

2. As fsfreeze takes a long time and blocks the monitor, VM monitoring doesn't run:
2018-06-28 02:02:41,797+1000 WARN  (jsonrpc/7) [virt.vm] (vmId='85169716-2f97-417e-bd4c-1036619923a9') monitor became unresponsive (command timeout, age=61.9699999997) (vm:6120)

This is a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1552092

3. And we can see the snapshot waiting on fsfreeze:

2018-06-28 02:02:43,558+1000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: .....
[....]
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2842, in fsFreeze
  ret = libvirtmod.virDomainFSFreeze(self._o, mountpoints, flags) (executor:363)

4. This goes on for almost 4 hours (duration=14100); then the guest agent is restarted (due to a guest reboot) and VDSM finally continues with the snapshot:

2018-06-28 05:56:43,611+1000 WARN  (vdsm.Scheduler) [Executor] Worker blocked: .... timeout=60, duration=14100
[....]
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 2842, in fsFreeze
  ret = libvirtmod.virDomainFSFreeze(self._o, mountpoints, flags) (executor:363)

2018-06-28 05:57:21,145+1000 WARN  (jsonrpc/1) [virt.vm] (vmId='85169716-2f97-417e-bd4c-1036619923a9') monitor became unresponsive (command timeout, age=14141.32) (vm:6120)

Problems:
* What is the point of completing a snapshot 4 hours after it was requested? And it was only 4 hours because the guest rebooted and disconnected the agent; otherwise it could be much longer.
* Shouldn't vdsm have a timeout for completing the freeze, and then proceed with (or abandon?) the snapshot? Waiting indefinitely for the agent does not sound right.
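The timeout idea in the second bullet can be sketched as follows. This is a hypothetical illustration, not vdsm code: the function names, the simulated hang, and the timeout value are all made up for the example.

```python
import threading
import time

def freeze_guest_filesystems(done):
    # Stand-in for the blocking libvirt fsFreeze call; here it simulates
    # a Windows guest agent that never answers the freeze command.
    time.sleep(5)
    done.set()

def live_snapshot_with_freeze_timeout(timeout):
    """Wait at most `timeout` seconds for the freeze before giving up,
    instead of blocking the snapshot operation indefinitely."""
    done = threading.Event()
    worker = threading.Thread(
        target=freeze_guest_filesystems, args=(done,), daemon=True)
    worker.start()
    if not done.wait(timeout):
        # Abandon the snapshot: completing it hours later, against
        # filesystems that were never actually frozen, has no point.
        return "snapshot aborted: fsfreeze timed out"
    return "snapshot completed with frozen filesystems"

print(live_snapshot_with_freeze_timeout(0.2))
```

The open question from comment 15 remains: after such an abort, the guest's disks and agent are in an undefined state, so the operator would still need to be told the VM may need attention.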

Comment 15 Tomáš Golembiovský 2018-06-28 11:18:48 UTC
(In reply to Germano Veit Michel from comment #14)

> Problems:
> * What is the point of completing a snapshot 4 hours after it was requested?
> And it was 4 hours only because the guest rebooted and disconnected the
> agent. Otherwise it could be much more.

No point at all. IMHO the snapshot will be corrupted anyway. The guest is running but we (falsely) assume the filesystems are frozen.

> * Shouldn't vdsm have a timeout to complete the freeze and then proceed (or
> abandon?) the snapshot? Waiting indefinitely for the agent does not sound
> right.

I agree with abandoning the attempt to do the snapshot. But I wonder how do we treat the VM afterwards? The disks and the agent(s) are in undefined state. Should we just go on and pretend nothing happened? Should we suggest that VM may need reboot/intervention?

Michal, ideas?

Comment 16 Michal Skrivanek 2018-06-28 13:14:33 UTC
(In reply to Tomáš Golembiovský from comment #15)
> (In reply to Germano Veit Michel from comment #14)
> 
> > Problems:
> > * What is the point of completing a snapshot 4 hours after it was requested?
> > And it was 4 hours only because the guest rebooted and disconnected the
> > agent. Otherwise it could be much more.
> 
> No point at all. IMHO the snapshot will be corrupted anyway. The guest is
> running but we (falsely) assume the filesystems are frozen.
> 
> > * Shouldn't vdsm have a timeout to complete the freeze and then proceed (or
> > abandon?) the snapshot? Waiting indefinitely for the agent does not sound
> > right.
> 
> I agree with abandoning the attempt to do the snapshot. But I wonder how do
> we treat the VM afterwards? The disks and the agent(s) are in undefined
> state. Should we just go on and pretend nothing happened? Should we suggest
> that VM may need reboot/intervention?
> 
> Michal, ideas?

That's a consequence of the hung call. I suggest moving this bug too to RHEL libvirt (or qemu-guest-agent) to get a sensible timeout for this action, as it seems that:
1) libvirt blocks other calls - bug 1552092
2) when the agent doesn't respond, the call just never finishes; the vdsm logic to retry without the agent is then never triggered and the RHV snapshot operation hangs as well
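Point 2 can be illustrated with a minimal sketch (hypothetical Python, not vdsm's actual code; all names are illustrative): the retry-without-agent fallback only runs when the freeze call returns an error, so a call that blocks forever never reaches it.

```python
def live_snapshot(fs_freeze, snapshot):
    """Sketch of the retry-without-agent logic. The except branch runs
    only if fs_freeze() raises; a freeze call that blocks forever never
    returns at all, so the fallback below is never triggered."""
    try:
        fs_freeze()
        frozen = True
    except RuntimeError:
        frozen = False          # fallback: proceed without the agent
    snapshot()
    return frozen

# An agent that errors out quickly still lets the snapshot proceed:
def failing_freeze():
    raise RuntimeError("guest agent is not connected")

result = live_snapshot(failing_freeze, lambda: None)
print(result)  # False: snapshot taken via the fallback path
```

This is why a timeout inside libvirt or qemu-guest-agent matters: it turns the hang into an error return, which is the only signal this control flow can react to.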

Comment 17 Germano Veit Michel 2018-06-28 23:14:31 UTC
(In reply to Michal Skrivanek from comment #16)
> (In reply to Tomáš Golembiovský from comment #15)
> > (In reply to Germano Veit Michel from comment #14)
> > 
> > > Problems:
> > > * What is the point of completing a snapshot 4 hours after it was requested?
> > > And it was 4 hours only because the guest rebooted and disconnected the
> > > agent. Otherwise it could be much more.
> > 
> > No point at all. IMHO the snapshot will be corrupted anyway. The guest is
> > running but we (falsely) assume the filesystems are frozen.
> > 
> > > * Shouldn't vdsm have a timeout to complete the freeze and then proceed (or
> > > abandon?) the snapshot? Waiting indefinitely for the agent does not sound
> > > right.
> > 
> > I agree with abandoning the attempt to do the snapshot. But I wonder how do
> > we treat the VM afterwards? The disks and the agent(s) are in undefined
> > state. Should we just go on and pretend nothing happened? Should we suggest
> > that VM may need reboot/intervention?
> > 
> > Michal, ideas?
> 
> that's a consequence of the hung call. I suggest to move also this bug to
> RHEL libvirt (or qemu-guest-agent) to have a sensible timeout for this
> action, as it seems that
> 1) libvirt blocks other calls - bug 1552092
> 2) when agent doesn't respond it just never finish, then the vdsm logic to
> retry without agent is never triggered and the RHV snapshot operation hangs
> as well

So if libvirt returns with a timeout, the snapshot will be aborted instead of moving forward like it does now when the agent disconnects?

But again, is it the correct behavior to go ahead with the snapshot if the agent disconnects? I think this should abort the snapshot, and vdsm does not seem to do this currently.

If I'm not wrong, there could be some vdsm improvements tracked by this BZ and we can open another for the guest-agent/windows problem.

Comment 18 Germano Veit Michel 2018-06-28 23:18:17 UTC
(In reply to Germano Veit Michel from comment #17)
> But again, is the correct behavior to go ahead with the snapshot if the
> agent disconnects? I think this should abort the snapshot and vdsm does not
> seem to do this curently.
> 
> If I'm not wrong, there could be some vdsm improvements tracked by this BZ
> and we can open another for the guest-agent/windows problem.

Let me make my comment clearer:

2018-06-28 02:01:50,516+1000 INFO  (jsonrpc/0) [virt.vm] (vmId='85169716-2f97-417e-bd4c-1036619923a9') Freezing guest filesystems (vm:4313)

4 hours later the qemu-agent exits inside the VM and the snapshot completes.

2018-06-28 05:57:27,819+1000 WARN  (jsonrpc/0) [virt.vm] (vmId='85169716-2f97-417e-bd4c-1036619923a9') Unable to freeze guest filesystems: Guest agent is not responding: Guest agent disappeared while executing command (vm:4318)
2018-06-28 05:57:27,973+1000 INFO  (jsonrpc/0) [virt.vm] (vmId='85169716-2f97-417e-bd4c-1036619923a9') Completed live snapshot (vm:4535)
2018-06-28 05:57:27,973+1000 INFO  (jsonrpc/0) [virt.vm] (vmId='85169716-2f97-417e-bd4c-1036619923a9') Thawing guest filesystems (vm:4336)
2018-06-28 05:57:27,974+1000 WARN  (jsonrpc/0) [virt.vm] (vmId='85169716-2f97-417e-bd4c-1036619923a9') Unable to thaw guest filesystems: Guest agent is not responding: QEMU guest agent is not connected (vm:4341)

So if the guest agent dies or something else goes wrong BEFORE a supposed new timeout is hit, vdsm will go ahead with the snapshot. This wouldn't be correct, I think.
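One way to express the change Germano is arguing for is to treat an agent failure during freeze as fatal for the snapshot. A sketch under that assumption (hypothetical Python; the exception class, flag, and messages are illustrative, not vdsm's API):

```python
class GuestAgentError(Exception):
    """Raised when the guest agent dies or disconnects mid-freeze."""

def take_live_snapshot(freeze, abort_on_freeze_failure=True):
    try:
        freeze()
    except GuestAgentError as exc:
        if abort_on_freeze_failure:
            # Proposed behavior: a dead agent means the disks are in an
            # undefined state, so abort rather than risk an inconsistent
            # snapshot completing hours after it was requested.
            return f"snapshot aborted: {exc}"
        # Current vdsm behavior: log a warning and snapshot anyway.
        return "snapshot taken without freeze (may be inconsistent)"
    return "snapshot taken with frozen filesystems"

def disappearing_agent():
    raise GuestAgentError("guest agent disappeared while executing command")

print(take_live_snapshot(disappearing_agent))
print(take_live_snapshot(disappearing_agent, abort_on_freeze_failure=False))
```

As comments 25-26 note below in the thread, snapshotting without a working agent has been the accepted behavior since live snapshots were introduced, so this sketch shows a policy choice rather than a bug fix.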

Comment 19 Germano Veit Michel 2018-06-29 03:47:00 UTC
Here is the qemu-ga bug, for those interested.

https://bugzilla.redhat.com/show_bug.cgi?id=1596457

Comment 23 Marina Kalinin 2018-08-03 18:39:43 UTC
(In reply to Germano Veit Michel from comment #19)
> Here is the qemu-ga bug, for those interested.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1596457

Germano, is this bug still relevant, or can we close it as a duplicate of your bug?

Comment 24 Germano Veit Michel 2018-08-05 22:37:43 UTC
(In reply to Marina from comment #23)
> (In reply to Germano Veit Michel from comment #19)
> > Here is the qemu-ga bug, for those interested.
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1596457
> 
> Germano, is this bug still relevant or we can close it as a duplicate of
> your bug?

Hi Marina

I'm not sure. From what I can see, the RHV side could have some improvements as well, to be more resilient to guest issues during fsfreeze. See comments #15-17.

Comment 25 Michal Skrivanek 2018-08-06 13:09:50 UTC
Germano, the behavior of taking a snapshot when the agent is not available (potentially corrupted) is the original behavior we have had in RHV since... well... since live snapshot was introduced. I'm not sure we should change it; even if you think we should, it's really a question for storage (Tal?)

If we keep the existing behavior then it should all be fine once the guest agent is working correctly. We can consider switching monitoring over to bug 1552092, but that's a risky change and I'd prefer to track it separately and not in a zstream.

Comment 26 Tal Nisan 2018-08-06 15:03:38 UTC
I agree, taking a snapshot without a working guest agent can indeed lead to inconsistencies

Comment 27 Germano Veit Michel 2018-08-06 22:05:03 UTC
Hi Michal and Tal.

I trust your assessment. If you say it's a risky change then it's probably not worth it.

So do we just keep this bug to track consuming the fix of BZ 1596457 into the RHV Guest Tools ISO?

Comment 28 Marina Kalinin 2018-08-07 16:58:35 UTC
To summarize, this bug suggests 3 possible resolution paths:
(1) Respin a new RHV Guest Tools ISO to consume the fix in bz#1596457.
(2) Change the current behavior of RHV live snapshots and cancel the snapshot if the guest is not responding.
(3) Send --nowait to libvirt when we collect qemu stats via the qemu monitor, to consume the fix from bz#1552092.

How realistic are those suggestions?
(3) Sounds too risky to consume in z-stream and needs to be evaluated more in depth. I suggest opening an RFE for RHV to consume that fix.
(2) Changing existing behavior seems like a lot of work at this point. I think we may get away with writing it down as a known issue. I suggest we continue tracking it in this BZ.
(1) Sounds the easiest and fastest to do at this moment.
I will raise the severity of that bz, ask for a 7.5.z fix + open a bug against RHV Guest Tools to consume it in the ISO for Windows guests.

Michal / Germano, FYI and ack.

Comment 29 Michal Skrivanek 2018-08-07 17:07:43 UTC
+1 on all above

Comment 30 Marina Kalinin 2018-08-07 17:17:37 UTC
Bz for RHV Guest tools:
https://bugzilla.redhat.com/show_bug.cgi?id=1613505

Comment 31 Marina Kalinin 2018-08-07 17:34:55 UTC
This is the RFE for #3 to consider:
https://bugzilla.redhat.com/show_bug.cgi?id=1613514

Comment 34 Michal Skrivanek 2018-10-31 15:07:29 UTC

*** This bug has been marked as a duplicate of bug 1613505 ***

