Bug 1797316 - Snapshot creation from a VM fails on the second snapshot and afterwards
Summary: Snapshot creation from a VM fails on the second snapshot and afterwards
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.4.0
Assignee: Steven Rosenberg
QA Contact: Polina
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-02-02 14:20 UTC by Tamir
Modified: 2020-08-04 13:21 UTC (History)
CC: 11 users

Fixed In Version:
Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
Upgrade package(s) to version: rhv-4.4.0-23 Highlights and important bug fixes: Enhancements to VM snapshots caused a regression due to inconsistencies between the VDSM and RHV Manager versions. This upgrade fixes the issue by synchronizing the RHV Manager version to match the VDSM version.
Clone Of:
Environment:
Last Closed: 2020-08-04 13:21:33 UTC
oVirt Team: Virt
Target Upstream Version:


Attachments
added logs (330 bytes, application/zip)
2020-02-06 08:21 UTC, Tamir
Logs are empty (26.94 KB, image/png)
2020-02-06 16:09 UTC, Steven Rosenberg
scenario1 (2.02 MB, application/gzip)
2020-02-10 13:27 UTC, Polina
scenario2 (902.48 KB, application/gzip)
2020-02-10 13:28 UTC, Polina
libvirt logs when simulating failure with debugging (4.02 MB, text/plain)
2020-02-10 18:32 UTC, Steven Rosenberg
logs for re-test on 4.4.0-20 (2.95 MB, application/gzip)
2020-02-12 20:16 UTC, Polina
logs for synchronized engine and vdsm (872.97 KB, application/gzip)
2020-02-13 13:55 UTC, Polina
logs (1.79 MB, application/gzip)
2020-02-16 08:58 UTC, Polina


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:3247 0 None None None 2020-08-04 13:21:52 UTC

Description Tamir 2020-02-02 14:20:08 UTC
Description of problem:
When I create a second or third snapshot on a random VM, the snapshot creation fails, and in one specific case the host on which the VM runs stopped responding (because of high CPU usage on that host).

Version-Release number of selected component (if applicable):
Engine: RHEV 4.4.0-17
Hosts: RHEL 8.2 OS with RHEV 4.4.0-17

How reproducible:
100% - Tested on 2 different environments. 

Steps to Reproduce:
1. Create a new VM
2. Create a snapshot for the VM
3. Create another snapshot for the VM

Actual results:
The first snapshot was created, but the second snapshot creation process fails in the 2nd step, and in one specific environment the host on which the VM was running stopped responding due to high CPU usage.

Expected results:
Both snapshots are created successfully.

Additional info:
Logs from the 2 environments are attached.
I created 2 directories in the zip, named after the environments:

"Compute_6" - this folder contains engine.log and vdsm.log from the functioning host (while the snapshot process ran).
"Compute_2" - this folder contains engine.log and vdsm.log from the host with the high CPU usage (while the snapshot process ran).

Comment 1 Michal Skrivanek 2020-02-03 08:12:45 UTC
please attach the logs

Comment 2 Steven Rosenberg 2020-02-05 16:22:59 UTC
I was not able to reproduce this. I could use more information on the "specific environment": host OS, repo versions, etc. It could also be an issue with a lack of resources or the environment itself.

Comment 3 Tamir 2020-02-06 08:21:13 UTC
Created attachment 1658118 [details]
added logs

Comment 6 Steven Rosenberg 2020-02-06 16:09:21 UTC
Created attachment 1658526 [details]
Logs are empty

The log directories seem to be empty, as per the screenshot, but I will try to install RHEL 8.2 (which I need for another issue anyway) and try this again. Meanwhile, please see if you can provide the logs.

Comment 9 Tamir 2020-02-09 08:51:46 UTC
I tried to reproduce the bug again, and it didn't reproduce. I am closing this issue.

Comment 10 Polina 2020-02-10 13:27:09 UTC
Created attachment 1662140 [details]
scenario1

Comment 11 Polina 2020-02-10 13:28:26 UTC
Created attachment 1662141 [details]
scenario2

There is a bug; I reproduce it in the following cases:

Scenario1: 

1. Create a VM based on the 8.1 template (official qe-infra template from glance - attached in template.xml).
2. Run the VM, click Create Snapshot, and name the snapshot - got the event "Snapshot 'snapshot1' creation for VM 'test_snapshot' has been completed.".
3. Trying to log in to the VM (using the console) causes this error:

2020-02-10 14:39:35,713+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-22) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_snapshot is down with error. Exit message: Lost connection with qemu process.

logs attached as logs_snapshot_scenario1.tar.gz


Scenario 2:
1. Create a VM based on the 8.1 template.
2. Run the VM, log in to the VM using the console, and create a file there.
3. Creating a snapshot causes:

2020-02-10 15:04:11,038+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-33) [] EVENT_ID: VM_DOWN_ERROR(119), VM golden_env_mixed_virtio_5 is down with error. Exit message: Lost connection with qemu process.

logs attached as logs_snapshot_scenario2.tar.gz

Comment 12 Polina 2020-02-10 13:30:30 UTC
Reproduced on http://bob-dr.lab.eng.brq.redhat.com/builds/4.4/rhv-4.4.0-18

Comment 14 Steven Rosenberg 2020-02-10 18:32:44 UTC
Created attachment 1662229 [details]
libvirt logs when simulating failure with debugging

This issue is reproducible by following these steps after creating a VM (the 8.1 template is not necessary):

2. Run the VM, log in to the VM using the console, and create a file there.
3. Create a snapshot - this causes the failure.


According to the engine log, the failure is due to libvirt failing on the USB controller:

2020-02-10 19:47:08,802+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-79) [] VM 'd2ddc775-9a4e-4897-b816-e87634fb6c68' managed non pluggable device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='0b60b210-88b0-4fe4-970e-d295b9a73278', vmId='d2ddc775-9a4e-4897-b816-e87634fb6c68'}', device='usb', type='CONTROLLER', specParams='[index=0, model=piix3-uhci]', address='', managed='true', plugged='false', readOnly='false', deviceAlias='', customProperties='[]', snapshotId='null', logicalName='null', hostDevice='null'}'


The vdsm log reports an unknown task in the storage layer:


2020-02-10 19:47:12,067+0200 ERROR (jsonrpc/5) [storage.TaskManager.Task] (Task='9b2f7c26-2148-405e-886e-ae181cd0008e') Unexpected error (task:874)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 881, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-111>", line 2, in revertTask
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2311, in revertTask
    return self.taskMng.revertTask(taskID=taskID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/taskManager.py", line 159, in revertTask
    t = self._getTask(taskID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/taskManager.py", line 86, in _getTask
    raise se.UnknownTask(taskID)
vdsm.storage.exception.UnknownTask: Task id unknown: ('34317ef1-b467-49a0-9ca9-1b2a74a1ee02',)
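The traceback shows `_getTask` raising when the task ID is absent from the manager's table. The pattern can be illustrated with a minimal sketch (a simplified stand-in with hypothetical names, not the actual vdsm TaskManager):

```python
# Simplified stand-in for vdsm's TaskManager lookup: reverting a task
# that was never registered (or was already cleaned up) raises UnknownTask.
class UnknownTask(Exception):
    pass

class TaskManager:
    def __init__(self):
        self._tasks = {}  # taskID -> task object

    def register(self, task_id, task):
        self._tasks[task_id] = task

    def _get_task(self, task_id):
        # Mirrors taskManager._getTask: an unknown ID fails immediately.
        if task_id not in self._tasks:
            raise UnknownTask(task_id)
        return self._tasks[task_id]

    def revert_task(self, task_id):
        # Mirrors taskManager.revertTask: the lookup happens first,
        # so an unknown ID fails before any revert work is attempted.
        return self._get_task(task_id)

mgr = TaskManager()
try:
    mgr.revert_task("34317ef1-b467-49a0-9ca9-1b2a74a1ee02")
except UnknownTask as e:
    print("Task id unknown:", e)
```

This matches the shape of the traceback above: `revertTask` delegates to `_getTask`, which is where the exception originates.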

I am providing the libvirt log accordingly.

Comment 15 Steven Rosenberg 2020-02-10 18:34:05 UTC
(In reply to Steven Rosenberg from comment #14)

Please review the libvirt log with debugging information to see if we can narrow down the issue as per the engine error.

Thank you

Comment 16 Michal Privoznik 2020-02-11 08:16:38 UTC
I'm not familiar with snapshots, Peter?

Comment 17 Peter Krempa 2020-02-11 08:32:12 UTC
2020-02-10 18:09:29.708+0000: 3530: info : qemuMonitorSend:1072 : QEMU_MONITOR_SEND_MSG: mon=0x7f25040e6880 msg={"execute":"transaction","arguments":{"actions":[{"type":"blockdev-snapshot-sync","data":{"device":"drive-ua-647ef928-caee-4ae7-9d17-6f5d44f57c74","snapshot-file":"/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__44/b05445db-5f5b-4af1-b142-84fc00cecbd2/images/647ef928-caee-4ae7-9d17-6f5d44f57c74/0d591eb9-41ba-4a41-9d60-abff3f83df00","format":"qcow2","mode":"existing"}}]},"id":"libvirt-50"}
 fd=-1
2020-02-10 18:09:29.708+0000: 3526: info : virObjectRef:401 : OBJECT_REF: obj=0x7f25040e6880
2020-02-10 18:09:29.708+0000: 3526: info : qemuMonitorIOWrite:546 : QEMU_MONITOR_IO_WRITE: mon=0x7f25040e6880 buf={"execute":"transaction","arguments":{"actions":[{"type":"blockdev-snapshot-sync","data":{"device":"drive-ua-647ef928-caee-4ae7-9d17-6f5d44f57c74","snapshot-file":"/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__44/b05445db-5f5b-4af1-b142-84fc00cecbd2/images/647ef928-caee-4ae7-9d17-6f5d44f57c74/0d591eb9-41ba-4a41-9d60-abff3f83df00","format":"qcow2","mode":"existing"}}]},"id":"libvirt-50"}
 len=427 ret=427 errno=0
2020-02-10 18:09:29.708+0000: 3526: info : virObjectUnref:349 : OBJECT_UNREF: obj=0x7f25040e6880
2020-02-10 18:09:30.131+0000: 3526: info : virObjectRef:401 : OBJECT_REF: obj=0x7f25040e6880
2020-02-10 18:09:30.132+0000: 3526: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"return": {}, "id": "libvirt-50"}]
2020-02-10 18:09:30.132+0000: 3526: info : qemuMonitorJSONIOProcessLine:242 : QEMU_MONITOR_RECV_REPLY: mon=0x7f25040e6880 reply={"return": {}, "id": "libvirt-50"}

The snapshot was successful here ^^

2020-02-10 18:09:30.132+0000: 3526: info : virObjectUnref:349 : OBJECT_UNREF: obj=0x7f25040e6880
2020-02-10 18:09:30.132+0000: 3530: info : virObjectUnref:349 : OBJECT_UNREF: obj=0x7f25040e6880
2020-02-10 18:09:30.132+0000: 3530: debug : qemuDomainObjExitMonitorInternal:8243 : Exited monitor (mon=0x7f25040e6880 vm=0x7f25081a5050 name=VMLinux)
2020-02-10 18:09:30.132+0000: 3530: debug : qemuDomainObjEndJob:8104 : Stopping job: async nested (async=snapshot vm=0x7f25081a5050 name=VMLinux)
2020-02-10 18:09:30.132+0000: 3530: debug : virStorageFileBackendFileDeinit:55 : deinitializing FS storage file 0x7f25500438b0 (file:/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__44/b05445db-5f5b-4af1-b142-84fc00cecbd2/images/647ef928-caee-4ae7-9d17-6f5d44f57c74/0d591eb9-41ba-4a41-9d60-abff3f83df00)
2020-02-10 18:09:30.132+0000: 3530: debug : virFileClose:114 : Closed fd 38
2020-02-10 18:09:30.169+0000: 3530: error : virProcessRunInFork:1170 : internal error: child reported (status=125): 
2020-02-10 18:09:30.169+0000: 3530: debug : virFileClose:114 : Closed fd 37
2020-02-10 18:09:30.170+0000: 3530: debug : virFileClose:114 : Closed fd 38
2020-02-10 18:09:30.209+0000: 3530: error : virProcessRunInFork:1170 : internal error: child reported (status=125): internal error: child reported (status=125): 
2020-02-10 18:09:30.209+0000: 3530: debug : virFileClose:114 : Closed fd 37
2020-02-10 18:09:30.209+0000: 3530: warning : qemuDomainSnapshotUpdateDiskSources:15405 : Unable to move disk metadata on vm VMLinux

But this failed. I don't see any other failure regarding the VM dying, or anything else in the logs, so the only suspicious thing is the seclabel remembering failure.
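For reference, the QMP "transaction" payload libvirt logged above can be reconstructed as follows (a minimal sketch; the snapshot path is a shortened placeholder, not the full path from this environment):

```python
import json

# Build the same QMP command shape libvirt sent above: a "transaction"
# wrapping a single blockdev-snapshot-sync action on one drive.
def build_snapshot_transaction(device, snapshot_file, cmd_id):
    return {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "blockdev-snapshot-sync",
                    "data": {
                        "device": device,                # drive alias of the disk
                        "snapshot-file": snapshot_file,  # pre-created qcow2 overlay
                        "format": "qcow2",
                        "mode": "existing",              # overlay already exists on storage
                    },
                }
            ]
        },
        "id": cmd_id,
    }

payload = build_snapshot_transaction(
    "drive-ua-647ef928-caee-4ae7-9d17-6f5d44f57c74",
    "/rhev/data-center/mnt/.../0d591eb9-41ba-4a41-9d60-abff3f83df00",  # shortened
    "libvirt-50",
)
print(json.dumps(payload))
```

With `"mode": "existing"`, QEMU expects the overlay file to have been created on storage beforehand (here by vdsm), which is why the transaction itself succeeds while the later metadata move can still fail.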

Comment 18 Michal Privoznik 2020-02-11 12:54:11 UTC
(In reply to Peter Krempa from comment #17)
>

I think those are harmless, in fact a symptom (e.g. the file that we are trying to modify XATTRs on does not exist). The qemuDomainSnapshotUpdateDiskSources() calls qemuSecurityMoveImageMetadata() but doesn't return an error in case of failure but prints a warning (as can be seen in the log).

Comment 19 Steven Rosenberg 2020-02-12 16:54:26 UTC
I traced this issue to the following engine failures:

2020-02-12 17:50:53,896+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [9852fb15-cc2c-4bd9-9ea1-5703668f5f47] Failed in 'SnapshotVDS' method

2020-02-12 17:50:53,903+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [9852fb15-cc2c-4bd9-9ea1-5703668f5f47] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM Host1 command SnapshotVDS failed: Snapshot failed


The vdsm fails here:

2020-02-12 17:50:53,893+0200 ERROR (jsonrpc/4) [virt.vm] (vmId='59b1108c-fe3b-46ea-889b-37e25ae655cc') The base volume doesn't exist: {'device': 'disk', 'domainID': '5c9478e8-01b3-4ae6-a932-4f318eba013a', 'imageID': '85995710-9995-456d-98c8-ac54bbb3f478', 'volumeID': '2da0a69e-08f7-4e70-b1ac-de9de296446d'} (vm:3998)

The issue seems to be that the engine and vdsm are out of sync.

Could you take the most recent 4.4 build, which AFAIK should contain both, in order to synchronize them?

Comment 20 Polina 2020-02-12 20:16:22 UTC
Created attachment 1662768 [details]
logs for re-test on 4.4.0-20

re-tested on 
http://bob.eng.lab.tlv.redhat.com/builds/4.4/rhv-4.4.0-20/rhv-release-engine-latest-4.4.noarch.rpm
http://bob.eng.lab.tlv.redhat.com/builds/4.4/rhv-4.4.0-20/rhv-release-host-latest-4.4.noarch.rpm

Behavior has changed: the create snapshot action fails with an error, and the following exceptions appear upon trying to refresh the vds.

engine.log fragment for scenario1:
2020-02-12 21:59:05,880+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-80) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snapshot_1' creation for VM 'test_snapshot'.

engine.log fragment for scenario2:
2020-02-12 22:04:14,009+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-80) [] EVENT_ID: USER_CREATE_SNAPSHOT_FINISHED_FAILURE(69), Failed to complete snapshot 'snapshot_2' creation for VM 'test_snapshot_2'.

There are java.lang.NullPointerExceptions while trying to refresh the vds.

Comment 21 Polina 2020-02-13 09:14:46 UTC
The exceptions upon vds refreshing are not related to this issue; it is another bug: https://bugzilla.redhat.com/show_bug.cgi?id=1802469

Comment 22 Steven Rosenberg 2020-02-13 10:32:47 UTC
(In reply to Polina from comment #20)

I reviewed the logs. It appears that either the VDSM and libvirt logs are incomplete, or the clocks on your host and engine differ significantly.

This is the last line in the vdsm log (which matches neither 21:59:05 nor 22:04:14):

2020-02-12 15:10:29,333-0500 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:312)

Could you please ensure the timestamps are in sync and provide complete logs?

Comment 23 Peter Krempa 2020-02-13 10:51:47 UTC
Note that libvirtd logs are always in UTC regardless of the host settings.

Comment 24 Polina 2020-02-13 13:55:00 UTC
Created attachment 1662958 [details]
logs for synchronized engine and vdsm

I synchronized the time on the engine and the host, and I am providing new logs for the reproduced failure.
Engine timestamp:
2020-02-13 15:45:24,687+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-19) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_snapshot is down with error. Exit message: Lost connection with qemu process.

Comment 25 Steven Rosenberg 2020-02-13 14:48:48 UTC
(In reply to Polina from comment #24)

Do you have the qemu (VM) logs?

Comment 26 Polina 2020-02-16 08:58:27 UTC
Created attachment 1663294 [details]
logs


Attached a new set of logs, containing the qemu log for the VM named 'test_snapshot3'.

Comment 27 Steven Rosenberg 2020-02-16 13:47:59 UTC
I checked the host Polina is using as per her log, host_mixed_1, and the vdsm on this host still has the following code in vm.py:


            try:
                vmDrive = self.findDriveByUUIDs(baseDrv)
            except LookupError:
                # The volume we want to snapshot doesn't exist
                self.log.error("The base volume doesn't exist: %s", baseDrv)
                return response.error('snapshotErr')
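To see how that branch surfaces as the "Snapshot failed" the engine reports, here is a minimal runnable sketch of the lookup-and-error path (the drive list, helper names, and the response shape including the error code are simplified illustrative stand-ins for the vdsm code above, not its actual API):

```python
# Simplified stand-in for the vm.py snippet above: if the base volume's
# drive cannot be found by its UUIDs, the snapshot verb returns an error
# response instead of proceeding.
def find_drive_by_uuids(drives, base_drv):
    # Match on the UUID triple identifying the volume on storage.
    keys = ("domainID", "imageID", "volumeID")
    for drive in drives:
        if all(drive.get(k) == base_drv.get(k) for k in keys):
            return drive
    raise LookupError(base_drv)

def snapshot(drives, base_drv):
    try:
        vm_drive = find_drive_by_uuids(drives, base_drv)
    except LookupError:
        # The volume we want to snapshot doesn't exist on this VM.
        # Error code/message shapes are illustrative only.
        return {"status": {"code": 48, "message": "Snapshot failed"}}
    return {"status": {"code": 0, "message": "Done"}, "drive": vm_drive}

drives = [{"domainID": "d1", "imageID": "i1", "volumeID": "v1"}]
missing = {"domainID": "d1", "imageID": "i1", "volumeID": "v-other"}
print(snapshot(drives, missing)["status"]["message"])  # Snapshot failed
```

This is consistent with the logs in comment 19: the engine, believing a different volume chain than the host, asks vdsm to snapshot a base volume the host cannot find, and vdsm answers with the 'snapshotErr' response the engine then reports as "SnapshotVDS failed".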

This error path no longer exists in the current repository:

https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vm.py


Please advise what package Polina needs in order to synchronize the vdsm and engine versions for master (4.4).

Comment 29 Steven Rosenberg 2020-02-20 10:14:34 UTC
This should be retested in the next release, when the rpms are updated accordingly.

Comment 41 errata-xmlrpc 2020-08-04 13:21:33 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

