When the live snapshot job is aborted it can be due: 1. libvirt error that returns to VDSM. 2. The job hit the timeout (currently default to 30 minutes), then calling to abort the job in libvirt. Both triggers at the end this part: https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464 Which it true only to the second flow. We need to improve the logging. We can still detect where the abortion came from from other logging and timing but we should make it easier. Reference bug: https://bugzilla.redhat.com/show_bug.cgi?id=1916122#c12
Verified on RHV 4.4.6-6. All looks good to me. Env: - Engine instance with RHV 4.4.6-6 (ovirt-engine-4.4.6.5-0.17.el8ev) and RHEL 8.4 installed. - 3 hosts with RHV 4.4.6-6 and RHEL 8.4, vdsm-4.40.60.5-1.el8ev I verified both triggers: (I) The job hit the timeout, then calling to abort the job in libvirt. Steps: 1. Run "engine-config -s LiveSnapshotTimeoutInMinutes=1" on engine machine. 2. Run "systemctl restart ovirt-engine" on engine machine. 3. Create a 4.6 data center and a 4.6 cluster. 4. Install a host and create a new NFS storage domain. 5. Create a VM from a RHEL 8 template with disk, 63000 MB Memory Size, 63000 MB Maximum memory and 63000 MB Physical Memory Guaranteed. 6. Run the VM. 7. Install "stress" tool in the VM. 8. Run "stress --vm 1 --vm-bytes 60000M --vm-keep -t 60m" in the VM. 9. Create snapshot with memory and wait until the snapshot fail. Results (As Expected): 1. The "LiveSnapshotTimeoutInMinutes" property is set as 1 min. 2. The engine was restarted. 3. The 4.6 data center and the 4.6 cluster were created. 4. The host was installed and the NFS storage domain was created. 5. The VM was created. 6. The VM is running. 7. The package is installed. 8. The VM's memory is stressed. 9. The snapshot was failed. Logs: "engine.log": 2021-04-29 17:31:06,223+03 WARN [org.ovirt.engine.core.bll.storage.disk.AddDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] VM is null - no unlocking 2021-04-29 17:31:06,264+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] EVENT_ID: USER_ADD_DISK_FINISH ED_FAILURE(2,022), Add-Disk operation failed to complete. 2021-04-29 17:31:06,277+03 WARN [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] Trying to release exclusive lock which does not exist, lock key: '517548c6-75a8-4833-9a5a-112f3eccd09fVM' 2021-04-29 17:31:06,278+03 INFO [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] Lock freed to object 'EngineLock:{ex clusiveLocks='[517548c6-75a8-4833-9a5a-112f3eccd09f=VM]', sharedLocks=''}' 2021-04-29 17:31:06,326+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] EVENT_ID: USER_CREATE_SNAPSHOT _FINISHED_FAILURE(69), Failed to complete snapshot '1' creation for VM 'VDSM'. 2021-04-29 17:31:06,327+03 WARN [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] Command 'CreateSnapshotForVm' id: '0 3380316-97c6-4663-88cf-91ed91f4d009' end method execution failed, as the command isn't marked for endAction() retries silently ignoring "vdsm.log": 2021-04-29 17:31:00,842+0300 INFO (virt/50a60924) [virt.vm] (vmId='517548c6-75a8-4833-9a5a-112f3eccd09f') Snapshot timeout reached, operation aborted (snapshot:472) 2021-04-29 17:31:00,842+0300 ERROR (virt/50a60924) [virt.vm] (vmId='517548c6-75a8-4833-9a5a-112f3eccd09f') Unable to take snapshot (snapshot:478) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py", line 462, in snapshot self._vm.run_dom_snapshot(snapxml, snap_flags) File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5856, in run_dom_snapshot self._dom.snapshotCreateXML(snapxml, snap_flags) File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f ret = attr(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper ret = f(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper return func(inst, *args, **kwargs) File "/usr/lib64/python3.6/site-packages/libvirt.py", line 3049, in snapshotCreateXML raise libvirtError('virDomainSnapshotCreateXML() failed') libvirt.libvirtError: operation aborted: snapshot job: canceled by client (II) libvirt error that returns to VDSM. Steps: 1. Run "engine-config -s LiveSnapshotTimeoutInMinutes=30" on engine machine. 2. Run "systemctl restart ovirt-engine" on engine machine. 3. Create a 4.6 data center and a 4.6 cluster. 4. Install a host and create a new NFS storage domain. 5. Create a VM from a RHEL 8 template with disk, 63000 MB Memory Size, 63000 MB Maximum memory and 63000 MB Physical Memory Guaranteed. 6. Run the VM. 7. Install "stress" tool in the VM. 8. Run "stress --vm 1 --vm-bytes 60000M --vm-keep -t 60m" in the VM. 9. Create snapshot with memory. 10. Run "virsh -r domjobinfo VM_NAME" in the host the VM is running on until there is a snapshot job on the VM. (Instead of the VM_NAME, put the vm name you chose) 11. Run "virsh domjobabort VM_NAME" in the host the VM is running on. (Instead of the VM_NAME, put the vm name you chose) 12. Wait until the snapshot fail. Results (As Expected): 1. The "LiveSnapshotTimeoutInMinutes" property is set as 30 min. 2. The engine was restarted. 3. The 4.6 data center and the 4.6 cluster were created. 4. The host was installed and the NFS storage domain was created. 5. The VM was created. 6. The VM is running. 7. The package is installed. 8. The VM's memory is stressed. 9. The snapshot in progress. 10. The domain got the snapshot job on. 11. The job was aborted. 12. The snapshot was failed. Logs: "engine.log": 021-04-29 18:09:44,132+03 ERROR [org.ovirt.engine.core.bll.snapshots.CreateLiveSnapshotForVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [c7194afe-5a9d-4dbd-a65e-5c907f009ca7] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateLiveSnapshotForVmCommand' with failure. 2021-04-29 18:09:44,188+03 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [c7194afe-5a9d-4dbd-a65e-5c907f009ca7] Command 'CreateSnapshotForVm' id: '1def18d8-8780-4a7d-a782-771549da8320' child commands '[bcdb68f8-1fd1-447b-9d2b-5128cf6ddabd, 980bc287-933a-4aea-ad58-deca1653acd6, 521a088f-b884-4ccf-bbca-41b8a04e1785, e0ad8e1c-4ff3-40b6-8858-0a16ace1a6ed]' executions were completed, status 'FAILED' 2021-04-29 18:09:45,245+03 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [c7194afe-5a9d-4dbd-a65e-5c907f009ca7] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand' with failure. "vdsm.log": 2021-04-29 18:09:13,527+0300 WARN (virt/cb961973) [virt.vm] (vmId='517548c6-75a8-4833-9a5a-112f3eccd09f') Snapshot operation aborted by libvirt: operation aborted: snapshot job: canceled by client (snapshot:47 7) 2021-04-29 18:09:13,527+0300 ERROR (virt/cb961973) [virt.vm] (vmId='517548c6-75a8-4833-9a5a-112f3eccd09f') Unable to take snapshot (snapshot:478) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py", line 462, in snapshot self._vm.run_dom_snapshot(snapxml, snap_flags) File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5856, in run_dom_snapshot self._dom.snapshotCreateXML(snapxml, snap_flags) File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f ret = attr(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper ret = f(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper return func(inst, *args, **kwargs) File "/usr/lib64/python3.6/site-packages/libvirt.py", line 3049, in snapshotCreateXML raise libvirtError('virDomainSnapshotCreateXML() failed') libvirt.libvirtError: operation aborted: snapshot job: canceled by client
This bugzilla is included in oVirt 4.4.6 release, published on May 4th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.