Bug 1957992 - Engine does not monitor async snapshot job on the host if the cluster is at 4.3 level, proceeds immediately
Keywords:
Status: CLOSED DUPLICATE of bug 1950209
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Nobody
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-06 23:57 UTC by Germano Veit Michel
Modified: 2021-05-19 17:32 UTC
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-07 03:44:11 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6023851 0 None None None 2021-05-07 00:59:16 UTC

Description Germano Veit Michel 2021-05-06 23:57:22 UTC
Description of problem:

The 4.4 engine appears to use the "sync" snapshot logic when the VM is on a 4.3 cluster level, while VDSM returns the snapshot command immediately (async).
This triggers a race condition: the snapshot job on the host runs in parallel with engine steps that should only run after the snapshot finishes on the host.
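The race can be sketched with a minimal, self-contained Python toy model (not the actual engine or VDSM code; all names here are illustrative). The "host" thread activates the snapshot volume asynchronously, while the "engine" proceeds immediately and queries the volume before it exists:

```python
import threading
import time

# Toy model of the race. The real flow is SnapshotVDSCommand returning
# immediately while the host-side snapshot job is still preparing the LV.
volume_ready = False

def host_snapshot_job():
    """Simulated host-side job: LV activation takes some time."""
    global volume_ready
    time.sleep(0.2)          # stand-in for slow LVM activation
    volume_ready = True      # only now would the LV path exist

def engine_get_image_info():
    """Simulated engine: queries at once instead of waiting (the bug)."""
    return "success" if volume_ready else "No such file or directory"

job = threading.Thread(target=host_snapshot_job)
job.start()
result = engine_get_image_info()   # races with the host job and fails
job.join()
```

Because the engine-side query runs immediately after the job is started, it observes the volume as missing, mirroring the qemu-img "No such file or directory" error in step 6 below.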

1. Create new volume for snapshot

2021-05-07 09:35:08,983+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26281) [3ada30af-720f-43ae-899b-9ab393f57352] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='d90838f8-a485-47e2-af01-df1a5edcf469', ignoreFailoverLimit='false', storageDomainId='7eb7cfce-ae07-4f41-91c4-b53e81973c90', imageGroupId='2b429890-a743-42ac-927a-5e0673d14904', imageSizeInBytes='1073741824', volumeFormat='COW', newImageId='08c33239-554a-4d40-93aa-1dc4880c3e52', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='40f69732-4ddb-4f25-b6ca-c4b0d0605c71', sourceImageGroupId='2b429890-a743-42ac-927a-5e0673d14904', shouldAddBitmaps='false'}), log id: 2b8f3f62

2. SPM Finishes creating volume

2021-05-07 09:35:12,615+10 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-96) [] SPMAsyncTask::PollTask: Polling task '03cc1d19-d5dc-4323-81b4-ae1fcd7237d9' (Parent Command 'CreateSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'.

3. Engine sends snapshot command to host running the VM

2021-05-07 09:35:17,594+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-40) [3ada30af-720f-43ae-899b-9ab393f57352] START, SnapshotVDSCommand(HostName = host4.kvm.local, SnapshotVDSCommandParameters:{hostId='4fa0aefb-83f4-4718-aad4-20d5c518896f', vmId='f090cb15-0933-4f8a-8de7-a2fa231d33d3'}), log id: 512442c3

4. VDSM returns immediately and schedules a job for the snapshot

2021-05-07 09:35:17,598+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-40) [3ada30af-720f-43ae-899b-9ab393f57352] FINISH, SnapshotVDSCommand, return: c66e32e6-806a-424b-8eef-4def305136a5, log id: 512442c3

5. The engine does *not* issue GetHostJobsVDSCommand; it proceeds immediately to querying the image. Within 3 seconds it has already run the Image and Volume info commands and moves on to the Qemu image info:

2021-05-07 09:35:20,953+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetQemuImageInfoVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-63) [3ada30af-720f-43ae-899b-9ab393f57352] START, GetQemuImageInfoVDSCommand(HostName = host4.kvm.local, GetVolumeInfoVDSCommandParameters:{hostId='4fa0aefb-83f4-4718-aad4-20d5c518896f', storagePoolId='d90838f8-a485-47e2-af01-df1a5edcf469', storageDomainId='7eb7cfce-ae07-4f41-91c4-b53e81973c90', imageGroupId='2b429890-a743-42ac-927a-5e0673d14904', imageId='08c33239-554a-4d40-93aa-1dc4880c3e52'}), log id: 6db63d67

6. The snapshot job on the host is still running, and has not activated the new volume LV yet.

2021-05-07 09:35:21,070+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-63) [3ada30af-720f-43ae-899b-9ab393f57352] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host4.kvm.local command GetQemuImageInfoVDS failed: Command ['/usr/bin/qemu-img', 'info', '--output', 'json', '-f', 'qcow2', '-U', '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52'] failed with rc=1 out=b'' err=b"qemu-img: Could not open '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52': Could not open '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52': No such file or directory\n"

Version-Release number of selected component (if applicable):
- ovirt-engine-4.4.6.6-0.10.el8ev.noarch (labs)
- vdsm-4.40.60.3-1.el8ev.x86_64 (labs)
- customer on 4.4.5

How reproducible:
Always

Steps to Reproduce:
* Block storage
* To make the bug always hit, add a delay in the snapshot job before it activates the LV, to expose the race. In /usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py (around line 374):

    prepared_drives = {}

    time.sleep(10)   # <-- added delay to expose the race
    for vm_dev_name, vm_device in new_drives.items():
        # Adding the device before requesting to prepare it as we want
        # to be sure to teardown it down even when prepareVolumePath
        # failed for some unknown issue that left the volume active.
        prepared_drives[vm_dev_name] = vm_device
* Without this, add load and/or lots of storage for slow LVM commands to slow down the snapshot job on the host.

Actual results:
* Logs filled with errors on GetQemuImageInfoVDSCommand

Expected results:
* No errors, engine waits for snapshot to be done before proceeding.

Additional info:
* Seems harmless, but I don't know what else could happen. This is a race condition that could perhaps lead to other issues (e.g. what if the snapshot job fails, or dumpxml is run too early?)

Comment 1 Germano Veit Michel 2021-05-06 23:59:04 UTC
To make it clearer, this happens with:
- a 4.4 engine
- the cluster level at 4.3 (or at 4.4 with the VM not yet rebooted)

Comment 4 Germano Veit Michel 2021-05-07 03:42:35 UTC
Is it the same as BZ1950209?

Comment 5 Germano Veit Michel 2021-05-07 03:44:11 UTC
Looking at the patches, it's the same. Glad it's already targeted to 4.4.6.

*** This bug has been marked as a duplicate of bug 1950209 ***

