Description of problem:
When creating a second or third snapshot on a VM, the snapshot creation fails; in one specific case the host the VM was running on became unresponsive because of high CPU usage on that host.

Version-Release number of selected component (if applicable):
Engine: RHEV 4.4.0-17
Hosts: RHEL 8.2 with RHEV 4.4.0-17

How reproducible:
100% - tested on 2 different environments.

Steps to Reproduce:
1. Create a new VM
2. Create a snapshot for the VM
3. Create another snapshot for the VM

Actual results:
The first snapshot is created, but the second snapshot creation fails at step 3; on one specific environment the host the VM was running on became unresponsive due to high CPU usage.

Expected results:
Both snapshots are created successfully.

Additional info:
Logs from the 2 environments are attached. The zip contains 2 directories named after the environments:
"Compute_6" - engine.log and vdsm.log from a functioning host (while the snapshot process ran).
"Compute_2" - engine.log and vdsm.log from the host with the high CPU usage (while the snapshot process ran).
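For reference, a minimal reproduction sketch of the steps above using the oVirt Python SDK (ovirtsdk4). The engine URL, credentials, CA file and VM name are placeholders, not values from this report:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholders - adjust to the environment under test.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=test_snapshot')[0]   # placeholder VM name
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

# Create two snapshots in a row (steps 2 and 3 above); wait for each snapshot
# to leave the LOCKED state before requesting the next one.
for name in ('snapshot1', 'snapshot2'):
    snap = snapshots_service.add(
        types.Snapshot(description=name, persist_memorystate=False))
    snap_service = snapshots_service.snapshot_service(snap.id)
    while snap_service.get().snapshot_status == types.SnapshotStatus.LOCKED:
        time.sleep(5)

connection.close()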
Please attach the logs.
I could not reproduce this. I could use more information on the "specific environment": host OS, repo versions, etc. It could also be an issue of lacking resources or an environment problem.
Created attachment 1658118 [details] added logs
Created attachment 1658526 [details] Logs are empty

The log directories seem to be empty as per the screenshot. I will try to install RHEL 8.2, which I need for another issue anyway, and try this again. Meanwhile, please see if you can provide the logs.
I tried to reproduce the bug again, and it didn't reproduce. I am closing this issue.
Created attachment 1662140 [details] scenario1
Created attachment 1662141 [details] scenario2

There is a bug; I reproduced it in the following cases:

Scenario 1:
1. Create a VM based on the 8.1 template (official qe-infra template from glance - attached in template.xml).
2. Run the VM, click Create Snapshot, and name the snapshot - got the event "Snapshot 'snapshot1' creation for VM 'test_snapshot' has been completed.".
3. Trying to log in to the VM (via console) causes the error:
2020-02-10 14:39:35,713+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_snapshot is down with error. Exit message: Lost connection with qemu process.
Logs attached as logs_snapshot_scenario1.tar.gz.

Scenario 2:
1. Create a VM based on the 8.1 template.
2. Run the VM, log in to the VM using the console, and create a file there.
3. Creating a snapshot causes:
2020-02-10 15:04:11,038+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [] EVENT_ID: VM_DOWN_ERROR(119), VM golden_env_mixed_virtio_5 is down with error. Exit message: Lost connection with qemu process.
Logs attached as logs_snapshot_scenario2.tar.gz.
The reproduction is on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-18
Created attachment 1662229 [details] libvirt logs when simulating failure with debugging

This issue is reproducible with the following steps after creating a VM (the 8.1 template is not necessary):
2. Run the VM, log in to the VM using the console, and create a file there.
3. Create a snapshot - this causes the failure.

According to the engine log, the failure is due to libvirt failing on the USB controller:

2020-02-10 19:47:08,802+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-79) [] VM 'd2ddc775-9a4e-4897-b816-e87634fb6c68' managed non pluggable device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='0b60b210-88b0-4fe4-970e-d295b9a73278', vmId='d2ddc775-9a4e-4897-b816-e87634fb6c68'}', device='usb', type='CONTROLLER', specParams='[index=0, model=piix3-uhci]', address='', managed='true', plugged='false', readOnly='false', deviceAlias='', customProperties='[]', snapshotId='null', logicalName='null', hostDevice='null'}'

The vdsm log reports an unknown task in the storage layer:

2020-02-10 19:47:12,067+0200 ERROR (jsonrpc/5) [storage.TaskManager.Task] (Task='9b2f7c26-2148-405e-886e-ae181cd0008e') Unexpected error (task:874)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 881, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-111>", line 2, in revertTask
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2311, in revertTask
    return self.taskMng.revertTask(taskID=taskID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/taskManager.py", line 159, in revertTask
    t = self._getTask(taskID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/taskManager.py", line 86, in _getTask
    raise se.UnknownTask(taskID)
vdsm.storage.exception.UnknownTask: Task id unknown: ('34317ef1-b467-49a0-9ca9-1b2a74a1ee02',)

I am providing the libvirt log accordingly.
(In reply to Steven Rosenberg from comment #14)
> I am providing the libvirt log accordingly.

Please review the libvirt log with debugging information to see if we can narrow down the issue as per the engine error. Thank you.
I'm not familiar with snapshots, Peter?
2020-02-10 18:09:29.708+0000: 3530: info : qemuMonitorSend:1072 : QEMU_MONITOR_SEND_MSG: mon=0x7f25040e6880 msg={"execute":"transaction","arguments":{"actions":[{"type":"blockdev-snapshot-sync","data":{"device":"drive-ua-647ef928-caee-4ae7-9d17-6f5d44f57c74","snapshot-file":"/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__44/b05445db-5f5b-4af1-b142-84fc00cecbd2/images/647ef928-caee-4ae7-9d17-6f5d44f57c74/0d591eb9-41ba-4a41-9d60-abff3f83df00","format":"qcow2","mode":"existing"}}]},"id":"libvirt-50"} fd=-1
2020-02-10 18:09:29.708+0000: 3526: info : virObjectRef:401 : OBJECT_REF: obj=0x7f25040e6880
2020-02-10 18:09:29.708+0000: 3526: info : qemuMonitorIOWrite:546 : QEMU_MONITOR_IO_WRITE: mon=0x7f25040e6880 buf={"execute":"transaction","arguments":{"actions":[{"type":"blockdev-snapshot-sync","data":{"device":"drive-ua-647ef928-caee-4ae7-9d17-6f5d44f57c74","snapshot-file":"/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__44/b05445db-5f5b-4af1-b142-84fc00cecbd2/images/647ef928-caee-4ae7-9d17-6f5d44f57c74/0d591eb9-41ba-4a41-9d60-abff3f83df00","format":"qcow2","mode":"existing"}}]},"id":"libvirt-50"} len=427 ret=427 errno=0
2020-02-10 18:09:29.708+0000: 3526: info : virObjectUnref:349 : OBJECT_UNREF: obj=0x7f25040e6880
2020-02-10 18:09:30.131+0000: 3526: info : virObjectRef:401 : OBJECT_REF: obj=0x7f25040e6880
2020-02-10 18:09:30.132+0000: 3526: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"return": {}, "id": "libvirt-50"}]
2020-02-10 18:09:30.132+0000: 3526: info : qemuMonitorJSONIOProcessLine:242 : QEMU_MONITOR_RECV_REPLY: mon=0x7f25040e6880 reply={"return": {}, "id": "libvirt-50"}

The snapshot was successful here ^^

2020-02-10 18:09:30.132+0000: 3526: info : virObjectUnref:349 : OBJECT_UNREF: obj=0x7f25040e6880
2020-02-10 18:09:30.132+0000: 3530: info : virObjectUnref:349 : OBJECT_UNREF: obj=0x7f25040e6880
2020-02-10 18:09:30.132+0000: 3530: debug : qemuDomainObjExitMonitorInternal:8243 : Exited monitor (mon=0x7f25040e6880 vm=0x7f25081a5050 name=VMLinux)
2020-02-10 18:09:30.132+0000: 3530: debug : qemuDomainObjEndJob:8104 : Stopping job: async nested (async=snapshot vm=0x7f25081a5050 name=VMLinux)
2020-02-10 18:09:30.132+0000: 3530: debug : virStorageFileBackendFileDeinit:55 : deinitializing FS storage file 0x7f25500438b0 (file:/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__44/b05445db-5f5b-4af1-b142-84fc00cecbd2/images/647ef928-caee-4ae7-9d17-6f5d44f57c74/0d591eb9-41ba-4a41-9d60-abff3f83df00)
2020-02-10 18:09:30.132+0000: 3530: debug : virFileClose:114 : Closed fd 38
2020-02-10 18:09:30.169+0000: 3530: error : virProcessRunInFork:1170 : internal error: child reported (status=125):
2020-02-10 18:09:30.169+0000: 3530: debug : virFileClose:114 : Closed fd 37
2020-02-10 18:09:30.170+0000: 3530: debug : virFileClose:114 : Closed fd 38
2020-02-10 18:09:30.209+0000: 3530: error : virProcessRunInFork:1170 : internal error: child reported (status=125): internal error: child reported (status=125):
2020-02-10 18:09:30.209+0000: 3530: debug : virFileClose:114 : Closed fd 37
2020-02-10 18:09:30.209+0000: 3530: warning : qemuDomainSnapshotUpdateDiskSources:15405 : Unable to move disk metadata on vm VMLinux

But this failed. I don't see any other failure regarding the VM dying or anything in the logs, so the only suspicious thing is the seclabel remembering failure.
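For anyone correlating these logs themselves, a minimal sketch that prints the snapshot transaction and any libvirt error lines recorded after it. The log path and the substring markers are assumptions based on the lines quoted above:

# Sketch only: path and markers assumed from the log excerpt above.
LOG = "/var/log/libvirt/libvirtd.log"

snapshot_seen = False
with open(LOG, errors="replace") as f:
    for line in f:
        if "QEMU_MONITOR_SEND_MSG" in line and "blockdev-snapshot-sync" in line:
            snapshot_seen = True
            print("snapshot transaction:", line.rstrip())
        elif snapshot_seen and " error : " in line:
            print("subsequent error:", line.rstrip())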
(In reply to Peter Krempa from comment #17)
> I think those are harmless, in fact a symptom (e.g. the file that we are trying to modify XATTRs on does not exist). qemuDomainSnapshotUpdateDiskSources() calls qemuSecurityMoveImageMetadata(), but it does not return an error in case of failure; it only prints a warning (as can be seen in the log).
I traced this issue to the following engine failures:

2020-02-12 17:50:53,896+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [9852fb15-cc2c-4bd9-9ea1-5703668f5f47] Failed in 'SnapshotVDS' method
2020-02-12 17:50:53,903+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [9852fb15-cc2c-4bd9-9ea1-5703668f5f47] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM Host1 command SnapshotVDS failed: Snapshot failed

The vdsm fails here:

2020-02-12 17:50:53,893+0200 ERROR (jsonrpc/4) [virt.vm] (vmId='59b1108c-fe3b-46ea-889b-37e25ae655cc') The base volume doesn't exist: {'device': 'disk', 'domainID': '5c9478e8-01b3-4ae6-a932-4f318eba013a', 'imageID': '85995710-9995-456d-98c8-ac54bbb3f478', 'volumeID': '2da0a69e-08f7-4e70-b1ac-de9de296446d'} (vm:3998)

The issue seems to be that the engine and vdsm are out of sync. Could you take the recent 4.4 build, which AFAIK should contain both, in order to synchronize them?
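To rule out such a version mismatch before retesting, here is a minimal sketch (with a hypothetical engine URL and credentials) using the oVirt Python SDK to print the version each host reports to the engine, so it can be compared with the engine build:

import ovirtsdk4 as sdk

# Placeholders - adjust to the environment under test.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    ca_file='ca.pem',
)
# Print the build each host reports to the engine.
for host in connection.system_service().hosts_service().list():
    version = host.version.full_version if host.version else 'unknown'
    print(host.name, version)
connection.close()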
Created attachment 1662768 [details] logs for re-test on 4.4.0-20

Re-tested on:
http://bob.eng.lab.tlv.redhat.com/builds/4.4/rhv-4.4.0-20/rhv-release-engine-latest-4.4.noarch.rpm
http://bob.eng.lab.tlv.redhat.com/builds/4.4/rhv-4.4.0-20/rhv-release-host-latest-4.4.noarch.rpm

The behavior has changed: the create snapshot action fails with an error, and the following exceptions appear when trying to refresh the vds.

engine.log fragment for scenario 1:
2020-02-12 21:59:05,880+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-80) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snapshot_1' creation for VM 'test_snapshot'.

engine.log fragment for scenario 2:
2020-02-12 22:04:14,009+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-80) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snapshot_2' creation for VM 'test_snapshot_2'.

There are java.lang.NullPointerExceptions while trying to refresh the vds.
The exceptions upon vds refreshing are not related to this issue; it is another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1802469
(In reply to Polina from comment #20)
> Created attachment 1662768 [details]
> logs for re-test on 4.4.0-20

I reviewed the logs. It appears that either the VDSM and libvirt logs are not complete, or the time differs significantly between your host and the engine. This is the last line of the vdsm log (which matches neither 21:59:05 nor 22:04:14):

2020-02-12 15:10:29,333-0500 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:312)

Could you please ensure the timestamps are in sync and provide complete logs?
Note that libvirtd logs are always in UTC regardless of the host settings.
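Since the engine logs in local time (+02 in these excerpts) while libvirtd always logs in UTC, a small sketch for converting an engine.log timestamp to UTC when lining the two logs up; the example timestamp is taken from the engine.log fragment quoted above:

from datetime import datetime, timezone

# engine.log carries a short UTC offset ("+02"); pad it to "+0200" so that
# strptime's %z accepts it, then convert to UTC to compare with libvirtd lines.
engine_ts = "2020-02-12 21:59:05,880+02"
dt = datetime.strptime(engine_ts + "00", "%Y-%m-%d %H:%M:%S,%f%z")
print(dt.astimezone(timezone.utc))   # 2020-02-12 19:59:05.880000+00:00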
Created attachment 1662958 [details] logs for synchronized engine and vdsm

I synchronized the time on the engine and the host, and I am providing new logs for the reproduced failure.

Engine timestamp:
2020-02-13 15:45:24,687+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-19) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_snapshot is down with error. Exit message: Lost connection with qemu process.
(In reply to Polina from comment #24)
> I synchronized time on engine and the host and provide the new logs for the reproduced failure

Do you have the qemu (VM) logs?
Created attachment 1663294 [details] logs attached

A new set of logs containing the qemu log for VM 'test_snapshot3'.
I checked the host Polina is using as per her log (host_mixed_1), and the vdsm on this host still has the following code in vm.py:

            try:
                vmDrive = self.findDriveByUUIDs(baseDrv)
            except LookupError:
                # The volume we want to snapshot doesn't exist
                self.log.error("The base volume doesn't exist: %s", baseDrv)
                return response.error('snapshotErr')

This error path no longer exists in the current repository: https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vm.py

Please advise what package Polina needs in order to synchronize the vdsm and engine versions for master (4.4).
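A quick way to check whether a given host still carries that old error path is to look for it in the installed module; a minimal sketch, where the installation path is an assumption and may differ per Python version:

# Sketch only: the vm.py path is an assumption; adjust to the host's Python version.
PATH = "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py"

with open(PATH) as f:
    source = f.read()

# True means the host still runs the old code quoted above.
print("old 'base volume' error path present:", "The base volume doesn't exist" in source)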
This should be retested in the next release, when the rpms are updated accordingly.
Verified on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-23 according to https://bugzilla.redhat.com/show_bug.cgi?id=1797316#c11 and https://bugzilla.redhat.com/show_bug.cgi?id=1797316#c31
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247