Bug 1933669 - Improve live snapshot logs in case of aborting
Summary: Improve live snapshot logs in case of aborting
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.13
Hardware: Unspecified
OS: Unspecified
unspecified
low
Target Milestone: ovirt-4.4.6
: 4.40.60.5
Assignee: Liran Rotenberg
QA Contact: Tamir
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-03-01 11:34 UTC by Liran Rotenberg
Modified: 2021-11-04 19:28 UTC (History)
4 users (show)

Fixed In Version: vdsm-4.40.60.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-05 05:35:53 UTC
oVirt Team: Virt
Embargoed:
pm-rhel: ovirt-4.4+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 114331 0 master MERGED snapshot: fix abort log 2021-04-21 07:59:24 UTC

Description Liran Rotenberg 2021-03-01 11:34:16 UTC
When the live snapshot job is aborted it can be due:
1. libvirt error that returns to VDSM.
2. The job hit the timeout (currently default to 30 minutes), then calling to abort the job in libvirt.

Both triggers at the end this part:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/jobs/snapshot.py#L464

Which it true only to the second flow. We need to improve the logging.

We can still detect where the abortion came from from other logging and timing but we should make it easier.

Reference bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1916122#c12

Comment 1 Tamir 2021-04-29 16:12:02 UTC
Verified on RHV 4.4.6-6. All looks good to me.

Env:
  - Engine instance with RHV 4.4.6-6 (ovirt-engine-4.4.6.5-0.17.el8ev) and RHEL 8.4 installed.
  - 3 hosts with RHV 4.4.6-6 and RHEL 8.4, vdsm-4.40.60.5-1.el8ev

I verified both triggers:

(I) The job hit the timeout, then calling to abort the job in libvirt.

Steps:
1. Run "engine-config -s LiveSnapshotTimeoutInMinutes=1" on engine machine.
2. Run "systemctl restart ovirt-engine" on engine machine.
3. Create a 4.6 data center and a 4.6 cluster.
4. Install a host and create a new NFS storage domain.
5. Create a VM from a RHEL 8 template with disk, 63000 MB Memory Size, 63000 MB Maximum memory and 63000 MB Physical Memory Guaranteed.
6. Run the VM.
7. Install "stress" tool in the VM.
8. Run "stress --vm 1 --vm-bytes 60000M --vm-keep -t 60m" in the VM.
9. Create snapshot with memory and wait until the snapshot fail.

Results (As Expected):

1. The "LiveSnapshotTimeoutInMinutes" property is set as 1 min.
2. The engine was restarted.
3. The 4.6 data center and the 4.6 cluster were created.
4. The host was installed and the NFS storage domain was created.
5. The VM was created.
6. The VM is running.
7. The package is installed.
8. The VM's memory is stressed.
9. The snapshot was failed. 

Logs:

"engine.log":

2021-04-29 17:31:06,223+03 WARN  [org.ovirt.engine.core.bll.storage.disk.AddDiskCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] VM is null - no unlocking
2021-04-29 17:31:06,264+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] EVENT_ID: USER_ADD_DISK_FINISH
ED_FAILURE(2,022), Add-Disk operation failed to complete.
2021-04-29 17:31:06,277+03 WARN  [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] Trying to release exclusive lock which does not 
exist, lock key: '517548c6-75a8-4833-9a5a-112f3eccd09fVM'
2021-04-29 17:31:06,278+03 INFO  [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] Lock freed to object 'EngineLock:{ex
clusiveLocks='[517548c6-75a8-4833-9a5a-112f3eccd09f=VM]', sharedLocks=''}'
2021-04-29 17:31:06,326+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] EVENT_ID: USER_CREATE_SNAPSHOT
_FINISHED_FAILURE(69), Failed to complete snapshot '1' creation for VM 'VDSM'.
2021-04-29 17:31:06,327+03 WARN  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-82) [] Command 'CreateSnapshotForVm' id: '0
3380316-97c6-4663-88cf-91ed91f4d009' end method execution failed, as the command isn't marked for endAction() retries silently ignoring



"vdsm.log":

2021-04-29 17:31:00,842+0300 INFO  (virt/50a60924) [virt.vm] (vmId='517548c6-75a8-4833-9a5a-112f3eccd09f') Snapshot timeout reached, operation aborted (snapshot:472)
2021-04-29 17:31:00,842+0300 ERROR (virt/50a60924) [virt.vm] (vmId='517548c6-75a8-4833-9a5a-112f3eccd09f') Unable to take snapshot (snapshot:478)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py", line 462, in snapshot
    self._vm.run_dom_snapshot(snapxml, snap_flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5856, in run_dom_snapshot
    self._dom.snapshotCreateXML(snapxml, snap_flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 3049, in snapshotCreateXML
    raise libvirtError('virDomainSnapshotCreateXML() failed')
libvirt.libvirtError: operation aborted: snapshot job: canceled by client



(II)  libvirt error that returns to VDSM.

Steps:
1. Run "engine-config -s LiveSnapshotTimeoutInMinutes=30" on engine machine.
2. Run "systemctl restart ovirt-engine" on engine machine.
3. Create a 4.6 data center and a 4.6 cluster.
4. Install a host and create a new NFS storage domain.
5. Create a VM from a RHEL 8 template with disk, 63000 MB Memory Size, 63000 MB Maximum memory and 63000 MB Physical Memory Guaranteed.
6. Run the VM.
7. Install "stress" tool in the VM.
8. Run "stress --vm 1 --vm-bytes 60000M --vm-keep -t 60m" in the VM.
9. Create snapshot with memory.
10. Run "virsh -r domjobinfo VM_NAME" in the host the VM is running on until there is a snapshot job on the VM. (Instead of the VM_NAME, put the vm name you chose)
11. Run "virsh domjobabort VM_NAME" in the host the VM is running on. (Instead of the VM_NAME, put the vm name you chose)
12. Wait until the snapshot fail.

Results (As Expected):

1. The "LiveSnapshotTimeoutInMinutes" property is set as 30 min.
2. The engine was restarted.
3. The 4.6 data center and the 4.6 cluster were created.
4. The host was installed and the NFS storage domain was created.
5. The VM was created.
6. The VM is running.
7. The package is installed.
8. The VM's memory is stressed.
9. The snapshot in progress.
10. The domain got the snapshot job on.
11. The job was aborted.
12. The snapshot was failed. 

Logs:

"engine.log":

021-04-29 18:09:44,132+03 ERROR [org.ovirt.engine.core.bll.snapshots.CreateLiveSnapshotForVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [c7194afe-5a9d-4dbd-a65e-5c907f009ca7] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateLiveSnapshotForVmCommand' with failure.
2021-04-29 18:09:44,188+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-10) [c7194afe-5a9d-4dbd-a65e-5c907f009ca7] Command 'CreateSnapshotForVm' id: '1def18d8-8780-4a7d-a782-771549da8320' child commands '[bcdb68f8-1fd1-447b-9d2b-5128cf6ddabd, 980bc287-933a-4aea-ad58-deca1653acd6, 521a088f-b884-4ccf-bbca-41b8a04e1785, e0ad8e1c-4ff3-40b6-8858-0a16ace1a6ed]' executions were completed, status 'FAILED'
2021-04-29 18:09:45,245+03 ERROR [org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-99) [c7194afe-5a9d-4dbd-a65e-5c907f009ca7] Ending command 'org.ovirt.engine.core.bll.snapshots.CreateSnapshotForVmCommand' with failure.


"vdsm.log":

2021-04-29 18:09:13,527+0300 WARN  (virt/cb961973) [virt.vm] (vmId='517548c6-75a8-4833-9a5a-112f3eccd09f') Snapshot operation aborted by libvirt: operation aborted: snapshot job: canceled by client (snapshot:47
7)
2021-04-29 18:09:13,527+0300 ERROR (virt/cb961973) [virt.vm] (vmId='517548c6-75a8-4833-9a5a-112f3eccd09f') Unable to take snapshot (snapshot:478)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py", line 462, in snapshot
    self._vm.run_dom_snapshot(snapxml, snap_flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5856, in run_dom_snapshot
    self._dom.snapshotCreateXML(snapxml, snap_flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 3049, in snapshotCreateXML
    raise libvirtError('virDomainSnapshotCreateXML() failed')
libvirt.libvirtError: operation aborted: snapshot job: canceled by client

Comment 3 Sandro Bonazzola 2021-05-05 05:35:53 UTC
This bugzilla is included in oVirt 4.4.6 release, published on May 4th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.