Bug 1932284 - Engine handled FS freeze is not fast enough for Windows systems
Summary: Engine handled FS freeze is not fast enough for Windows systems
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.6
Target Release: 4.4.6
Assignee: Liran Rotenberg
QA Contact: Qin Yuan
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-24 12:46 UTC by Roman Hodain
Modified: 2024-06-14 00:28 UTC
CC List: 5 users

Fixed In Version: ovirt-engine-4.4.6.3
Doc Type: Bug Fix
Doc Text:
Previously, the engine-config value LiveSnapshotPerformFreezeInEngine was set to false by default and was intended for use only with cluster compatibility levels below 4.4. However, the value was stored for the general version, so it applied to all cluster levels. With this release, each cluster compatibility level has its own value, defaulting to false for 4.4 and above. This reduces unnecessary overhead and removes time-outs of the file system freeze command.
Clone Of:
Environment:
Last Closed: 2021-06-01 13:22:12 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:



Links
System                  ID              Private  Priority  Status  Summary                         Last Updated
Red Hat Product Errata  RHSA-2021:2179  0        None      None    None                            2021-06-01 13:23:04 UTC
oVirt gerrit            113802          0        master    MERGED  core: deprecate FreezeInEngine  2021-03-09 11:12:10 UTC

Description Roman Hodain 2021-02-24 12:46:28 UTC
Description of problem:
When LiveSnapshotPerformFreezeInEngine is set to true, the snapshot operation is not fast enough, and the FS is unfrozen before the snapshot has finished.

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
From time to time, depending on the environment load.

Steps to Reproduce:
1. Create many snapshots for multiple Windows VMs at the same time.


Actual results:
2021-02-24 09:03:07,961+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] Failed in 'ThawVDS' method
2021-02-24 09:03:07,963+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host01.example.com command ThawVDS failed: internal error: unable to execute QEMU agent command 'guest-fsfreeze-thaw': couldn't hold writes: fsfreeze is limited up to 10 seconds:


Expected results:
The FS thaw should complete before the 10-second timeout. Alternatively, we need to at least make sure that the FS is frozen while the snapshot is taken. If that is guaranteed, this should not be logged as an ERROR, but rather as a warning.

Additional info:

2021-02-24 09:02:44,098+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (default task-1506) [84755dbc-a211-408b-8d58-9d1368d0c76d] START, FreezeVDSCommand(HostName = host01.example.com, VdsAndVmIDVDSParametersBase:{hostId='34c607b5-954e-4620-8022-f9176a99257a', vmId='c5854e55-fce4-496a-89dd-e28903b604a1'}), log id: 3e0162e9

...

2021-02-24 09:02:44,847+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (default task-1506) [84755dbc-a211-408b-8d58-9d1368d0c76d] FINISH, FreezeVDSCommand, return: , log id: 3e0162e9
2021-02-24 09:02:44,849+01 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1506) [84755dbc-a211-408b-8d58-9d1368d0c76d] EVENT_ID: FREEZE_VM_SUCCESS(10,767), Guest filesystems on VM VM02 have been frozen successfully.

...

2021-02-24 09:03:06,737+01 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] START, ThawVDSCommand(HostName = host01.example.com, VdsAndVmIDVDSParametersBase:{hostId='34c607b5-954e-4620-8022-f9176a99257a', vmId='c5854e55-fce4-496a-89dd-e28903b604a1'}), log id: 1b7217f8
2021-02-24 09:03:07,961+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] Failed in 'ThawVDS' method
2021-02-24 09:03:07,963+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host01.example.com command ThawVDS failed: internal error: unable to execute QEMU agent command 'guest-fsfreeze-thaw': couldn't hold writes: fsfreeze is limited up to 10 seconds:
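
The gap between the freeze and the thaw can be measured directly from the engine.log timestamps above. The following is a minimal Python sketch that assumes the timestamp layout shown in these excerpts (date, time, comma, milliseconds, UTC offset in whole hours); the function names are hypothetical helpers, not part of oVirt. For the excerpt above it shows that ThawVDSCommand started about 21.9 seconds after FreezeVDSCommand finished, well past the 10-second limit quoted in the error.

```python
import re
from datetime import datetime, timedelta, timezone

# engine.log lines start with e.g. "2021-02-24 09:02:44,847+01":
# date, time, milliseconds after a comma, and a UTC offset in whole hours.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})([+-]\d{2})")

# Limit quoted in the ThawVDS error ("fsfreeze is limited up to 10 seconds").
FSFREEZE_LIMIT = timedelta(seconds=10)


def parse_engine_ts(line: str) -> datetime:
    """Return the timezone-aware timestamp at the start of an engine.log line."""
    m = TS_RE.match(line)
    if m is None:
        raise ValueError(f"no engine.log timestamp at start of {line[:40]!r}")
    base, millis, offset_hours = m.groups()
    tz = timezone(timedelta(hours=int(offset_hours)))
    ts = datetime.strptime(base, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return ts + timedelta(milliseconds=int(millis))


def frozen_for(freeze_finished: str, thaw_started: str) -> timedelta:
    """Time between the FreezeVDSCommand FINISH line and the ThawVDSCommand START line."""
    return parse_engine_ts(thaw_started) - parse_engine_ts(freeze_finished)


# Abbreviated copies of the FINISH FreezeVDSCommand and START ThawVDSCommand lines.
freeze_line = "2021-02-24 09:02:44,847+01 INFO  [...FreezeVDSCommand] ... FINISH, FreezeVDSCommand"
thaw_line = "2021-02-24 09:03:06,737+01 INFO  [...ThawVDSCommand] ... START, ThawVDSCommand"

gap = frozen_for(freeze_line, thaw_line)
print(gap.total_seconds(), gap > FSFREEZE_LIMIT)  # 21.89 True
```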

Comment 2 Nir Soffer 2021-02-26 13:22:51 UTC
We have other flows that freeze before taking a snapshot, and then do
something else which is guaranteed to complete within 10 seconds.

The flow was added to support Cinder-based Ceph storage, where the
snapshot is created on the storage side. I think we use the same
flow for managed block storage (cinderlib-based), which works in
the same way.

I think snapshots in OpenStack work in the same way, so they have
the same issue with snapshots taken more than 10 seconds after the
freeze.

Where does the 10-second limit in fsfreeze come from?

    internal error: unable to execute QEMU agent command 'guest-fsfreeze-thaw':
    couldn't hold writes: fsfreeze is limited up to 10 seconds

Is it configurable?

Comment 3 Roman Hodain 2021-02-26 14:05:52 UTC
This is unfortunately not configurable. It comes directly from the VSS subsystem. MS does not allow the FS to be frozen longer than these 10s.

Comment 4 Nir Soffer 2021-02-26 15:13:28 UTC
(In reply to Roman Hodain from comment #3)
> This is unfortunately not configurable. It comes directly from the VSS
> subsystem. MS does not allow the FS to be frozen longer than these 10s.

So we need to change all the flows using freeze to handle the case where
fsfreeze times out after 10 seconds. One way to handle this is to pause
the VM right after the freeze completes; this prevents the guest from
undoing the freeze too early, so we can ensure consistent snapshots.

Another way is to treat this as best effort. If the guest undoes the freeze
behind our back, you will not have a consistent snapshot.
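
The two options above can be sketched in Python to make the call ordering explicit. This is a hypothetical illustration only: GuestVm, FreezeExpired and RecordingVm are invented names, not oVirt, VDSM or libvirt APIs, and treating a failed thaw as "the guest already released the freeze" is an assumption based on the ThawVDS error in the description. In option 1 the VM is resumed before the thaw, because guest-fsfreeze-thaw is executed by the guest agent inside the running guest.

```python
from typing import Callable, List, Protocol


class FreezeExpired(Exception):
    """Hypothetical error: the thaw failed because the guest already released
    the freeze (for example, the Windows VSS 10-second limit expired)."""


class GuestVm(Protocol):
    """Hypothetical VM control interface, not an oVirt, VDSM or libvirt API."""

    def freeze(self) -> None: ...
    def thaw(self) -> None: ...
    def pause(self) -> None: ...
    def resume(self) -> None: ...


def snapshot_paused(vm: GuestVm, take_snapshot: Callable[[], None]) -> None:
    """Option 1: pause right after the freeze completes, so the guest cannot
    undo the freeze before the snapshot is taken."""
    vm.freeze()
    vm.pause()
    try:
        take_snapshot()
    finally:
        vm.resume()  # the guest agent must be running to process the thaw
        vm.thaw()


def snapshot_best_effort(vm: GuestVm, take_snapshot: Callable[[], None]) -> bool:
    """Option 2: best effort. Returns False if the guest undid the freeze
    behind our back, meaning the snapshot may not be consistent."""
    vm.freeze()
    try:
        take_snapshot()
    except BaseException:
        try:
            vm.thaw()
        except FreezeExpired:
            pass  # keep the original snapshot error
        raise
    try:
        vm.thaw()
    except FreezeExpired:
        return False
    return True


class RecordingVm:
    """Test double that records calls and can simulate an expired freeze."""

    def __init__(self, freeze_expires: bool = False) -> None:
        self.calls: List[str] = []
        self.freeze_expires = freeze_expires

    def freeze(self) -> None:
        self.calls.append("freeze")

    def pause(self) -> None:
        self.calls.append("pause")

    def resume(self) -> None:
        self.calls.append("resume")

    def thaw(self) -> None:
        self.calls.append("thaw")
        if self.freeze_expires:
            raise FreezeExpired("fsfreeze is limited up to 10 seconds")


vm = RecordingVm()
snapshot_paused(vm, lambda: vm.calls.append("snapshot"))
print(vm.calls)  # ['freeze', 'pause', 'snapshot', 'resume', 'thaw']
```

With option 2, a False return value is the natural place to log a warning instead of an ERROR, as suggested in the expected results of the description.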

Comment 5 Arik 2021-03-01 12:36:52 UTC
Let's not call the FS freeze from the engine, to mitigate this issue in case the freeze operation times out.

Comment 8 Qin Yuan 2021-04-19 07:11:36 UTC
Verified with:
ovirt-engine-4.4.6.3-0.8.el8ev.noarch

Steps:
1. Check the default LiveSnapshotPerformFreezeInEngine values on a freshly installed ovirt-engine-4.4.6.3:

2. Check LiveSnapshotPerformFreezeInEngine values after engine upgrade to ovirt-engine-4.4.6.3:
1) set LiveSnapshotPerformFreezeInEngine to true on ovirt-engine-4.4.5.11, restart engine
   # engine-config -s LiveSnapshotPerformFreezeInEngine=true
   # systemctl restart ovirt-engine
   # engine-config -g LiveSnapshotPerformFreezeInEngine
   LiveSnapshotPerformFreezeInEngine: true version: general
2) upgrade ovirt-engine-4.4.5.11 to ovirt-engine-4.4.6.3, check LiveSnapshotPerformFreezeInEngine

3. Check setting LiveSnapshotPerformFreezeInEngine values on ovirt-engine-4.4.6.3:
1) set LiveSnapshotPerformFreezeInEngine of cluster compatibility level 4.2 to false
   # engine-config -s LiveSnapshotPerformFreezeInEngine=false --cver=4.2
2) set LiveSnapshotPerformFreezeInEngine of cluster compatibility level 4.4 to true
   # engine-config -s LiveSnapshotPerformFreezeInEngine=true --cver=4.4
3) restart engine, check LiveSnapshotPerformFreezeInEngine
 
4. Check whether the LiveSnapshotPerformFreezeInEngine configuration works as expected when creating a live snapshot without memory on ovirt-engine-4.4.6.3:
1) check a VM with cluster compatibility version 4.3:
   - make sure LiveSnapshotPerformFreezeInEngine of cluster compatibility level 4.3 is true
   - create a cluster with compatibility version 4.3, add a 4.3 host
   - create and run a VM named testvm_43
   - create a live snapshot without memory on VM testvm_43
   - check whether the guest filesystems are frozen
   - check whether the snapshot is created successfully

2) check a VM with cluster compatibility version 4.6:
   - make sure LiveSnapshotPerformFreezeInEngine of cluster compatibility level 4.6 is false
   - create a cluster with compatibility version 4.6, add a 4.4.6 host
   - create and run a Windows VM named testwinvm_46
   - create a live snapshot without memory on Windows VM testwinvm_46
   - check that the guest filesystems are not frozen
   - check whether the snapshot is created successfully

Results:
1. In a freshly installed ovirt-engine-4.4.6.3, each cluster compatibility level has its own LiveSnapshotPerformFreezeInEngine value, and all of them default to false.
# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: false version: 4.2
LiveSnapshotPerformFreezeInEngine: false version: 4.3
LiveSnapshotPerformFreezeInEngine: false version: 4.4
LiveSnapshotPerformFreezeInEngine: false version: 4.5
LiveSnapshotPerformFreezeInEngine: false version: 4.6

2. If LiveSnapshotPerformFreezeInEngine is true in the old engine, after upgrading to ovirt-engine-4.4.6.3 it remains true for cluster compatibility levels below 4.4 and changes to false for levels 4.4 and above.
# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: true version: 4.2
LiveSnapshotPerformFreezeInEngine: true version: 4.3
LiveSnapshotPerformFreezeInEngine: false version: 4.4
LiveSnapshotPerformFreezeInEngine: false version: 4.5
LiveSnapshotPerformFreezeInEngine: false version: 4.6
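
The upgrade behaviour observed in result 2 can be modelled in a few lines of Python. This is a sketch of the observed outcome only, not the actual ovirt-engine upgrade script; the names CLUSTER_LEVELS, migrate_freeze_in_engine and format_like_engine_config are hypothetical. Versions are kept as integer tuples so that comparisons such as (4, 3) < (4, 4) stay correct even for levels that would sort wrongly as strings.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int]

# Cluster compatibility levels listed by engine-config in the results above.
CLUSTER_LEVELS: List[Version] = [(4, 2), (4, 3), (4, 4), (4, 5), (4, 6)]
FIRST_LEVEL_WITHOUT_ENGINE_FREEZE: Version = (4, 4)


def migrate_freeze_in_engine(general_value: bool) -> Dict[Version, bool]:
    """Model of the observed upgrade result: keep the old 'general' value for
    levels below 4.4, and use false for 4.4 and above."""
    return {
        level: general_value if level < FIRST_LEVEL_WITHOUT_ENGINE_FREEZE else False
        for level in CLUSTER_LEVELS
    }


def format_like_engine_config(values: Dict[Version, bool]) -> List[str]:
    """Render the values in the same shape as the `engine-config -g` output above."""
    return [
        f"LiveSnapshotPerformFreezeInEngine: {str(v).lower()} version: {major}.{minor}"
        for (major, minor), v in sorted(values.items())
    ]


for line in format_like_engine_config(migrate_freeze_in_engine(True)):
    print(line)
```

Calling migrate_freeze_in_engine(False) yields false for every level, which matches the fresh-install defaults in result 1.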

3. In ovirt-engine-4.4.6.3, LiveSnapshotPerformFreezeInEngine can be set for each cluster compatibility level individually by using the --cver option.
# engine-config -s LiveSnapshotPerformFreezeInEngine=false --cver=4.2
# engine-config -s LiveSnapshotPerformFreezeInEngine=true --cver=4.4
# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: false version: 4.2
LiveSnapshotPerformFreezeInEngine: true version: 4.3
LiveSnapshotPerformFreezeInEngine: true version: 4.4
LiveSnapshotPerformFreezeInEngine: false version: 4.5
LiveSnapshotPerformFreezeInEngine: false version: 4.6
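
Which value applies to a given VM depends on its cluster compatibility level, as results 4 and 5 show. A minimal Python sketch that parses `engine-config -g` output in the format printed above and looks up the value for one level follows; the function names are hypothetical, and falling back to a "general" entry (the format seen before 4.4.6) when no per-level entry exists is an assumption, not documented engine behaviour.

```python
import re
from typing import Dict

# Matches lines such as "LiveSnapshotPerformFreezeInEngine: true version: 4.4"
# or, before 4.4.6, "LiveSnapshotPerformFreezeInEngine: true version: general".
LINE_RE = re.compile(r"^LiveSnapshotPerformFreezeInEngine: (true|false) version: (\S+)$")


def parse_freeze_in_engine(output: str) -> Dict[str, bool]:
    """Map each version label printed by `engine-config -g` to its value."""
    values: Dict[str, bool] = {}
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            values[m.group(2)] = m.group(1) == "true"
    return values


def engine_freezes_guest(values: Dict[str, bool], cluster_level: str) -> bool:
    """Whether a live snapshot without memory on a VM in `cluster_level` gets an
    engine-side freeze. The fallback to 'general' is an assumption."""
    if cluster_level in values:
        return values[cluster_level]
    return values["general"]  # raises KeyError if neither entry exists


sample = """\
LiveSnapshotPerformFreezeInEngine: false version: 4.2
LiveSnapshotPerformFreezeInEngine: true version: 4.3
LiveSnapshotPerformFreezeInEngine: true version: 4.4
LiveSnapshotPerformFreezeInEngine: false version: 4.5
LiveSnapshotPerformFreezeInEngine: false version: 4.6
"""
values = parse_freeze_in_engine(sample)
print(engine_freezes_guest(values, "4.4"), engine_freezes_guest(values, "4.6"))  # True False
```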

4. If LiveSnapshotPerformFreezeInEngine for the VM's cluster compatibility level is true, the guest filesystems are frozen when taking a live snapshot without memory on the VM:

engine.log:
2021-04-19 09:30:23,284+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-397) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] EVENT_ID: FREEZE_VM_INITIATED(10,766), Freeze of guest filesystems on VM testvm_43 was initiated.
2021-04-19 09:30:23,326+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-397) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] EVENT_ID: FREEZE_VM_SUCCESS(10,767), Guest filesystems on VM testvm_43 have been frozen successfully.
2021-04-19 09:30:23,702+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-397) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'snap_43' creation for VM 'testvm_43' was initiated by admin@internal-authz.
2021-04-19 09:30:30,043+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-13) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] START, ThawVDSCommand(HostName = host_43, VdsAndVmIDVDSParametersBase:{hostId='feeac0e6-d722-4606-9ac9-5e560c835442', vmId='4d3dd86b-6979-40ae-b5f9-6bb39ea7a2c7'}), log id: 4942efce
2021-04-19 09:30:30,057+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-13) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] FINISH, ThawVDSCommand, return: , log id: 4942efce
2021-04-19 09:30:32,403+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-86) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'snap_43' creation for VM 'testvm_43' has been completed.

5. If LiveSnapshotPerformFreezeInEngine for the VM's cluster compatibility level is false, the guest filesystems are not frozen when taking a live snapshot without memory on the VM:

engine.log:
2021-04-19 09:36:58,298+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-551) [b5f803ed-338c-4e86-b4d1-f7a5db66fccd] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'snapwin_46' creation for VM 'testwinvm_46' was initiated by admin@internal-authz.
2021-04-19 09:37:41,983+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-73) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'snapwin_46' creation for VM 'testwinvm_46' has been completed.

6. As shown in 5, creating the live snapshot on the Windows VM took more than 40 seconds, yet no guest filesystem freeze was performed and there was no ThawVDS error.
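
The "check whether the guest filesystems are frozen" steps can be automated by scanning engine.log for the freeze and thaw markers that appear in the excerpts above, restricted to one flow by its correlation ID (the bracketed UUID). This is a minimal Python sketch under that assumption; the marker list is taken only from the log lines quoted in this report, and freeze_events and thaw_failed are hypothetical helper names.

```python
from typing import List, Sequence

# Markers of an engine-side freeze or thaw, as seen in the engine.log excerpts above.
FREEZE_MARKERS = (
    "FREEZE_VM_INITIATED",
    "FREEZE_VM_SUCCESS",
    "FreezeVDSCommand",
    "ThawVDSCommand",
)


def freeze_events(log_lines: Sequence[str], correlation_id: str) -> List[str]:
    """Return the lines of one flow (matched by correlation ID) that show an
    engine-side guest filesystem freeze or thaw."""
    tag = f"[{correlation_id}]"
    return [
        line for line in log_lines
        if tag in line and any(marker in line for marker in FREEZE_MARKERS)
    ]


def thaw_failed(log_lines: Sequence[str], correlation_id: str) -> bool:
    """True if the flow logged the "Failed in 'ThawVDS' method" error."""
    tag = f"[{correlation_id}]"
    return any(tag in line and "Failed in 'ThawVDS' method" in line for line in log_lines)


# Abbreviated copy of the Windows VM flow from result 5.
win_lines = [
    "2021-04-19 09:36:58,298+03 INFO  [...AuditLogDirector] (...) "
    "[b5f803ed-338c-4e86-b4d1-f7a5db66fccd] EVENT_ID: USER_CREATE_SNAPSHOT(45), ...",
]
print(freeze_events(win_lines, "b5f803ed-338c-4e86-b4d1-f7a5db66fccd"))  # []
```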

Comment 12 errata-xmlrpc 2021-06-01 13:22:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager security update (ovirt-engine) [ovirt-4.4.6]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2179

Comment 13 Red Hat Bugzilla 2023-09-15 01:02:02 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

