Bug 1957992

Summary: Engine does not monitor async snapshot job on the host if the cluster is at 4.3 level, proceeds immediately
Product: Red Hat Enterprise Virtualization Manager Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engineAssignee: Nobody <nobody>
Status: CLOSED DUPLICATE QA Contact: meital avital <mavital>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.4.6CC: ahadas, mzamazal
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-07 03:44:11 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2021-05-06 23:57:22 UTC
Description of problem:

The 4.4 engine seems to be doing a "sync" snapshot logic if the VM is on a 4.3 CL, while VDSM returns the snapshot command immediately (async).
This triggers a race condition, as the snapshot job in the host runs in parallel with engine steps that should be done after snapshot finishes on the host.

1. Create new volume for snapshot

2021-05-07 09:35:08,983+10 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26281) [3ada30af-720f-43ae-899b-9ab393f57352] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='d90838f8-a485-47e2-af01-df1a5edcf469', ignoreFailoverLimit='false', storageDomainId='7eb7cfce-ae07-4f41-91c4-b53e81973c90', imageGroupId='2b429890-a743-42ac-927a-5e0673d14904', imageSizeInBytes='1073741824', volumeFormat='COW', newImageId='08c33239-554a-4d40-93aa-1dc4880c3e52', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='40f69732-4ddb-4f25-b6ca-c4b0d0605c71', sourceImageGroupId='2b429890-a743-42ac-927a-5e0673d14904', shouldAddBitmaps='false'}), log id: 2b8f3f62

2. SPM Finishes creating volume

2021-05-07 09:35:12,615+10 INFO  [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-96) [] SPMAsyncTask::PollTask: Polling task '03cc1d19-d5dc-4323-81b4-ae1fcd7237d9' (Parent Command 'CreateSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'.

3. Engine sends snapshot command to host running the VM

2021-05-07 09:35:17,594+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-40) [3ada30af-720f-43ae-899b-9ab393f57352] START, SnapshotVDSCommand(HostName = host4.kvm.local, SnapshotVDSCommandParameters:{hostId='4fa0aefb-83f4-4718-aad4-20d5c518896f', vmId='f090cb15-0933-4f8a-8de7-a2fa231d33d3'}), log id: 512442c3

4. VDSM returns immediately and schedules a job for the snapshot

2021-05-07 09:35:17,598+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-40) [3ada30af-720f-43ae-899b-9ab393f57352] FINISH, SnapshotVDSCommand, return: c66e32e6-806a-424b-8eef-4def305136a5, log id: 512442c3

5. Engine does *not* do GetHostJobsVDSCommand, and proceeds immediately to getting the info on the image, 3s later it already did Image and Volume info commands, and is doing Qemu:

2021-05-07 09:35:20,953+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetQemuImageInfoVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-63) [3ada30af-720f-43ae-899b-9ab393f57352] START, GetQemuImageInfoVDSCommand(HostName = host4.kvm.local, GetVolumeInfoVDSCommandParameters:{hostId='4fa0aefb-83f4-4718-aad4-20d5c518896f', storagePoolId='d90838f8-a485-47e2-af01-df1a5edcf469', storageDomainId='7eb7cfce-ae07-4f41-91c4-b53e81973c90', imageGroupId='2b429890-a743-42ac-927a-5e0673d14904', imageId='08c33239-554a-4d40-93aa-1dc4880c3e52'}), log id: 6db63d67

6. The snapshot job on the host is still running, and has not activated the new volume LV yet.

2021-05-07 09:35:21,070+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-63) [3ada30af-720f-43ae-899b-9ab393f57352] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host4.kvm.local command GetQemuImageInfoVDS failed: Command ['/usr/bin/qemu-img', 'info', '--output', 'json', '-f', 'qcow2', '-U', '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52'] failed with rc=1 out=b'' err=b"qemu-img: Could not open '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52': Could not open '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52': No such file or directory\n"

Version-Release number of selected component (if applicable):
- ovirt-engine-4.4.6.6-0.10.el8ev.noarch (labs)
- vdsm-4.40.60.3-1.el8ev.x86_6 (labs)
- customer on 4.4.5

How reproducible:
Always

Steps to Reproduce:
* Block storage
* To make the bug always hit, add a delay on the snapshot job before activating the LV to expose the race:
   /usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py
   374	        prepared_drives = {}
   375	
   376	        time.sleep(10)   <-------
   377	        for vm_dev_name, vm_device in new_drives.items():
   378	            # Adding the device before requesting to prepare it as we want
   379	            # to be sure to teardown it down even when prepareVolumePath
   380	            # failed for some unknown issue that left the volume active.
   381	            prepared_drives[vm_dev_name] = vm_device
* Without this, add load and/or lots of storage for slow LVM commands to slow down the snapshot job on the host.

Actual results:
* Logs filled with errors on GetQemuImageInfoVDSCommand

Expected results:
* No errors, engine waits for snapshot to be done before proceeding.

Additional info:
* Seems harmless but I don't know what else could happen, this is a race condition that could perhaps lead to other issues (i.e. what is the snapshot job fails, or dumpxml is done too early?)

Comment 1 Germano Veit Michel 2021-05-06 23:59:04 UTC
To make it clearer:
- 4.4 engine
- CL at 4.3 (or at 4.4 and VM not rebooted)

Comment 4 Germano Veit Michel 2021-05-07 03:42:35 UTC
Is it the same as BZ1950209?

Comment 5 Germano Veit Michel 2021-05-07 03:44:11 UTC
Looking at the patches, its the same. Glad its already on target to 4.4.6.

*** This bug has been marked as a duplicate of bug 1950209 ***