Description of problem:
When LiveSnapshotPerformFreezeInEngine is set to true the snapshot operation is not fast enough and the FS is unfrozen before we finished it.

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Time to time. Depending on the environment load

Steps to Reproduce:
1. Generate a lot of snapshot creation for multiple Windows systems

Actual results:
2021-02-24 09:03:07,961+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] Failed in 'ThawVDS' method
2021-02-24 09:03:07,963+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host01.example.com command ThawVDS failed: internal error: unable to execute QEMU agent command 'guest-fsfreeze-thaw': couldn't hold writes: fsfreeze is limited up to 10 seconds:

Expected results:
The FS thaw should be finished before the timeout of 10s. Or we need to at least make sure that the FS is frozen when we take the snapshot. If that is in place this should not be an ERROR, but rather a warning.

Additional info:
2021-02-24 09:02:44,098+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (default task-1506) [84755dbc-a211-408b-8d58-9d1368d0c76d] START, FreezeVDSCommand(HostName = host01.example.com, VdsAndVmIDVDSParametersBase:{hostId='34c607b5-954e-4620-8022-f9176a99257a', vmId='c5854e55-fce4-496a-89dd-e28903b604a1'}), log id: 3e0162e9
...
2021-02-24 09:02:44,847+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FreezeVDSCommand] (default task-1506) [84755dbc-a211-408b-8d58-9d1368d0c76d] FINISH, FreezeVDSCommand, return: , log id: 3e0162e9
2021-02-24 09:02:44,849+01 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-1506) [84755dbc-a211-408b-8d58-9d1368d0c76d] EVENT_ID: FREEZE_VM_SUCCESS(10,767), Guest filesystems on VM VM02 have been frozen successfully.
...
2021-02-24 09:03:06,737+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] START, ThawVDSCommand(HostName = host01.example.com, VdsAndVmIDVDSParametersBase:{hostId='34c607b5-954e-4620-8022-f9176a99257a', vmId='c5854e55-fce4-496a-89dd-e28903b604a1'}), log id: 1b7217f8
2021-02-24 09:03:07,961+01 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] Failed in 'ThawVDS' method
2021-02-24 09:03:07,963+01 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-64) [84755dbc-a211-408b-8d58-9d1368d0c76d] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host01.example.com command ThawVDS failed: internal error: unable to execute QEMU agent command 'guest-fsfreeze-thaw': couldn't hold writes: fsfreeze is limited up to 10 seconds:
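
To make the timing in the log excerpt above explicit, here is a small sketch that computes the gap between the end of FreezeVDSCommand and the start of ThawVDSCommand. It is only an illustration: the helper name `parse_engine_ts` and the constant `FREEZE_LIMIT_SECONDS` are made up for this example, and the sketch assumes both timestamps carry the same UTC offset, so the offset is simply ignored.

```python
import re
from datetime import datetime

# Limit quoted in the guest agent error ("fsfreeze is limited up to 10 seconds").
FREEZE_LIMIT_SECONDS = 10.0


def parse_engine_ts(line):
    """Parse the leading engine.log timestamp, ignoring the UTC offset.

    Hypothetical helper for this example, not part of ovirt-engine.
    """
    match = re.match(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})", line)
    if match is None:
        raise ValueError("no timestamp at start of line")
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


# Timestamps taken from the log lines above (rest of each line omitted).
freeze_done = parse_engine_ts("2021-02-24 09:02:44,847+01 INFO FINISH, FreezeVDSCommand")
thaw_start = parse_engine_ts("2021-02-24 09:03:06,737+01 INFO START, ThawVDSCommand")

elapsed = (thaw_start - freeze_done).total_seconds()
print(f"frozen for {elapsed:.3f}s, limit is {FREEZE_LIMIT_SECONDS:.0f}s")
# prints: frozen for 21.890s, limit is 10s
```

So in this run the thaw was requested roughly 22 seconds after the freeze completed, well past the 10 second limit reported in the error.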
We have other flows using freeze before taking a snapshot, and doing something else which is guaranteed to complete in 10 seconds.

The flow was added to support Cinder based Ceph storage where the snapshot is created on the storage side. I think we use the same flow for managed block storage (cinderlib based), which works in the same way.

I think snapshots in OpenStack work in the same way, so they have the same issue with a snapshot taken more than 10 seconds after the freeze.

Where is the 10 seconds limit in fsfreeze coming from?

internal error: unable to execute QEMU agent command 'guest-fsfreeze-thaw': couldn't hold writes: fsfreeze is limited up to 10 seconds

Is it configurable?
This is unfortunately not configurable. It comes directly from the VSS subsystem. MS does not allow the FS to be frozen longer than these 10s.
(In reply to Roman Hodain from comment #3)
> This is unfortunately not configurable. It comes directly from the VSS
> subsystem. MS does not allow the FS to be frozen longer than these 10s.

So we need to change all the code using these flows to handle the case when fsfreeze timed out after 10 seconds.

One way to handle this is to pause the VM right after the freeze completes - this will prevent the guest from undoing the freeze too early so we can ensure consistent snapshots.

Another way is to treat this as best effort. If the guest undoes the freeze behind our back, you will not have a consistent snapshot.
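
The first option above (pause right after the freeze) could be sketched roughly as follows. This is not engine or VDSM code: `FakeVm`, its method names, and `snapshot_paused` are all hypothetical stand-ins used only to show the ordering of the calls, and the sketch also treats a failed thaw as a warning rather than an error, as asked for in the bug description.

```python
import logging

log = logging.getLogger("snapshot-sketch")


class FakeVm:
    """Hypothetical stand-in for a VM handle; it only records the calls made."""

    def __init__(self, thaw_fails=False):
        self.calls = []
        self.thaw_fails = thaw_fails

    def freeze(self):
        self.calls.append("freeze")

    def pause(self):
        self.calls.append("pause")

    def resume(self):
        self.calls.append("resume")

    def thaw(self):
        self.calls.append("thaw")
        if self.thaw_fails:
            # Simulates the guest having already thawed on its own.
            raise RuntimeError("fsfreeze is limited up to 10 seconds")


def snapshot_paused(vm, take_snapshot):
    """Freeze, pause, snapshot, resume, thaw; a failed thaw is only a warning."""
    vm.freeze()
    vm.pause()  # per the comment above: a paused guest cannot undo the freeze
    try:
        take_snapshot()
    finally:
        vm.resume()
        try:
            vm.thaw()
        except RuntimeError as err:
            log.warning("thaw failed after snapshot was taken: %s", err)


vm = FakeVm(thaw_fails=True)
snapshot_paused(vm, lambda: vm.calls.append("snapshot"))
print(vm.calls)
# prints: ['freeze', 'pause', 'snapshot', 'resume', 'thaw']
```

The point of the ordering is that the snapshot happens while the guest is both frozen and paused, so a thaw error afterwards no longer says anything about the consistency of the snapshot.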
Let's not call the FS-freeze from the engine to mitigate this issue in case the freeze operation times out
Verified with:
ovirt-engine-4.4.6.3-0.8.el8ev.noarch

Steps:
1. Check default LiveSnapshotPerformFreezeInEngine values on a freshly installed ovirt-engine-4.4.6.3.

2. Check LiveSnapshotPerformFreezeInEngine values after engine upgrade to ovirt-engine-4.4.6.3:
1) set LiveSnapshotPerformFreezeInEngine to true on ovirt-engine-4.4.5.11, restart engine
# engine-config -s LiveSnapshotPerformFreezeInEngine=true
# systemctl restart ovirt-engine
# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: true version: general
2) upgrade ovirt-engine-4.4.5.11 to ovirt-engine-4.4.6.3, check LiveSnapshotPerformFreezeInEngine

3. Check setting LiveSnapshotPerformFreezeInEngine values on ovirt-engine-4.4.6.3:
1) set LiveSnapshotPerformFreezeInEngine of cluster compatibility level 4.2 to false
# engine-config -s LiveSnapshotPerformFreezeInEngine=false --cver=4.2
2) set LiveSnapshotPerformFreezeInEngine of cluster compatibility level 4.4 to true
# engine-config -s LiveSnapshotPerformFreezeInEngine=true --cver=4.4
3) restart engine, check LiveSnapshotPerformFreezeInEngine

4. Check if LiveSnapshotPerformFreezeInEngine configurations work as expected when creating live snapshot without memory on ovirt-engine-4.4.6.3:
1) check VM with cluster compatibility version 4.3:
- make sure LiveSnapshotPerformFreezeInEngine of cluster compatibility level 4.3 is true
- create a cluster with compatibility version 4.3, add a 4.3 host
- create and run a VM named testvm_43
- create live snapshot without memory on VM testvm_43
- check if there is freezing guest filesystem process
- check if snapshot is created successfully
2) check VM with cluster compatibility version 4.6:
- make sure LiveSnapshotPerformFreezeInEngine of cluster compatibility level 4.6 is false
- create a cluster with compatibility version 4.6, add a 4.4.6 host
- create and run a Windows VM named testwinvm_46
- create live snapshot without memory on Windows VM testwinvm_46
- check if there is no freezing guest filesystem process
- check if snapshot is created successfully

Results:
1. In a freshly installed ovirt-engine-4.4.6.3, each cluster compatibility level has its own LiveSnapshotPerformFreezeInEngine value, and all default to false.
# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: false version: 4.2
LiveSnapshotPerformFreezeInEngine: false version: 4.3
LiveSnapshotPerformFreezeInEngine: false version: 4.4
LiveSnapshotPerformFreezeInEngine: false version: 4.5
LiveSnapshotPerformFreezeInEngine: false version: 4.6

2. If LiveSnapshotPerformFreezeInEngine is true in the old engine, after upgrade to ovirt-engine-4.4.6.3 it remains true for cluster compatibility level < 4.4 and changes to false for cluster compatibility level >= 4.4.
# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: true version: 4.2
LiveSnapshotPerformFreezeInEngine: true version: 4.3
LiveSnapshotPerformFreezeInEngine: false version: 4.4
LiveSnapshotPerformFreezeInEngine: false version: 4.5
LiveSnapshotPerformFreezeInEngine: false version: 4.6

3. In ovirt-engine-4.4.6.3, LiveSnapshotPerformFreezeInEngine can be set for each cluster compatibility level individually by using the --cver option.
# engine-config -s LiveSnapshotPerformFreezeInEngine=false --cver=4.2
# engine-config -s LiveSnapshotPerformFreezeInEngine=true --cver=4.4
# engine-config -g LiveSnapshotPerformFreezeInEngine
LiveSnapshotPerformFreezeInEngine: false version: 4.2
LiveSnapshotPerformFreezeInEngine: true version: 4.3
LiveSnapshotPerformFreezeInEngine: true version: 4.4
LiveSnapshotPerformFreezeInEngine: false version: 4.5
LiveSnapshotPerformFreezeInEngine: false version: 4.6

4. If LiveSnapshotPerformFreezeInEngine of the VM's cluster compatibility level is true, there is a freezing guest filesystem process when taking live snapshot without memory on the VM:
engine.log:
2021-04-19 09:30:23,284+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-397) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] EVENT_ID: FREEZE_VM_INITIATED(10,766), Freeze of guest filesystems on VM testvm_43 was initiated.
2021-04-19 09:30:23,326+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-397) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] EVENT_ID: FREEZE_VM_SUCCESS(10,767), Guest filesystems on VM testvm_43 have been frozen successfully.
2021-04-19 09:30:23,702+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-397) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'snap_43' creation for VM 'testvm_43' was initiated by admin@internal-authz.
2021-04-19 09:30:30,043+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-13) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] START, ThawVDSCommand(HostName = host_43, VdsAndVmIDVDSParametersBase:{hostId='feeac0e6-d722-4606-9ac9-5e560c835442', vmId='4d3dd86b-6979-40ae-b5f9-6bb39ea7a2c7'}), log id: 4942efce
2021-04-19 09:30:30,057+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.ThawVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-13) [44fe97a1-bc29-4f60-81e7-a37c2edfb503] FINISH, ThawVDSCommand, return: , log id: 4942efce
2021-04-19 09:30:32,403+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-86) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'snap_43' creation for VM 'testvm_43' has been completed.

5. If LiveSnapshotPerformFreezeInEngine of the VM's cluster compatibility level is false, there is no freezing guest filesystem process when taking live snapshot without memory on the VM:
engine.log:
2021-04-19 09:36:58,298+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-551) [b5f803ed-338c-4e86-b4d1-f7a5db66fccd] EVENT_ID: USER_CREATE_SNAPSHOT(45), Snapshot 'snapwin_46' creation for VM 'testwinvm_46' was initiated by admin@internal-authz.
2021-04-19 09:37:41,983+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-73) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_SUCCESS(68), Snapshot 'snapwin_46' creation for VM 'testwinvm_46' has been completed.

6. According to 5, creating the live snapshot on the Windows VM took more than 40s; there is no freezing guest filesystem process and no ThawVDS error.
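
The per-version output in results 1 to 3 above can also be checked mechanically. The following is a minimal sketch, assuming the output of `engine-config -g LiveSnapshotPerformFreezeInEngine` has been captured as text in the format shown above; the function names `parse_engine_config` and `version_key` are invented for this example, and the "general" version entry seen before the upgrade is not handled.

```python
import re


def parse_engine_config(output):
    """Map cluster compatibility version -> bool from the captured output text."""
    pattern = re.compile(
        r"LiveSnapshotPerformFreezeInEngine:\s*(true|false)\s+version:\s*(\S+)"
    )
    return {version: value == "true" for value, version in pattern.findall(output)}


def version_key(version):
    """Turn '4.4' into (4, 4) so versions compare numerically (numeric versions only)."""
    return tuple(int(part) for part in version.split("."))


# Output from result 2 above (after upgrade with the old value set to true).
after_upgrade = """\
LiveSnapshotPerformFreezeInEngine: true version: 4.2
LiveSnapshotPerformFreezeInEngine: true version: 4.3
LiveSnapshotPerformFreezeInEngine: false version: 4.4
LiveSnapshotPerformFreezeInEngine: false version: 4.5
LiveSnapshotPerformFreezeInEngine: false version: 4.6
"""

values = parse_engine_config(after_upgrade)

# Expectation from result 2: true is kept only for compatibility levels < 4.4.
expected = {version: version_key(version) < (4, 4) for version in values}
print(values == expected)
# prints: True
```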
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager security update (ovirt-engine) [ovirt-4.4.6]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2179
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days