Description of problem:

The 4.4 engine appears to use the "sync" snapshot logic when the VM is at a 4.3 cluster level, while VDSM returns from the snapshot command immediately (async). This triggers a race condition: the snapshot job on the host runs in parallel with engine steps that should only run after the snapshot has finished on the host.

1. The engine creates a new volume for the snapshot:

2021-05-07 09:35:08,983+10 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.CreateVolumeVDSCommand] (EE-ManagedThreadFactory-engine-Thread-26281) [3ada30af-720f-43ae-899b-9ab393f57352] START, CreateVolumeVDSCommand( CreateVolumeVDSCommandParameters:{storagePoolId='d90838f8-a485-47e2-af01-df1a5edcf469', ignoreFailoverLimit='false', storageDomainId='7eb7cfce-ae07-4f41-91c4-b53e81973c90', imageGroupId='2b429890-a743-42ac-927a-5e0673d14904', imageSizeInBytes='1073741824', volumeFormat='COW', newImageId='08c33239-554a-4d40-93aa-1dc4880c3e52', imageType='Sparse', newImageDescription='', imageInitialSizeInBytes='0', imageId='40f69732-4ddb-4f25-b6ca-c4b0d0605c71', sourceImageGroupId='2b429890-a743-42ac-927a-5e0673d14904', shouldAddBitmaps='false'}), log id: 2b8f3f62

2. The SPM finishes creating the volume:

2021-05-07 09:35:12,615+10 INFO [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-96) [] SPMAsyncTask::PollTask: Polling task '03cc1d19-d5dc-4323-81b4-ae1fcd7237d9' (Parent Command 'CreateSnapshot', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') returned status 'finished', result 'success'.

3. The engine sends the snapshot command to the host running the VM:

2021-05-07 09:35:17,594+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-40) [3ada30af-720f-43ae-899b-9ab393f57352] START, SnapshotVDSCommand(HostName = host4.kvm.local, SnapshotVDSCommandParameters:{hostId='4fa0aefb-83f4-4718-aad4-20d5c518896f', vmId='f090cb15-0933-4f8a-8de7-a2fa231d33d3'}), log id: 512442c3

4. VDSM returns immediately and schedules a job for the snapshot:

2021-05-07 09:35:17,598+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-40) [3ada30af-720f-43ae-899b-9ab393f57352] FINISH, SnapshotVDSCommand, return: c66e32e6-806a-424b-8eef-4def305136a5, log id: 512442c3

5. The engine does *not* run GetHostJobsVDSCommand and proceeds immediately to fetching the image info; 3 seconds later it has already run the Image and Volume info commands and is running the Qemu one:

2021-05-07 09:35:20,953+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetQemuImageInfoVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-63) [3ada30af-720f-43ae-899b-9ab393f57352] START, GetQemuImageInfoVDSCommand(HostName = host4.kvm.local, GetVolumeInfoVDSCommandParameters:{hostId='4fa0aefb-83f4-4718-aad4-20d5c518896f', storagePoolId='d90838f8-a485-47e2-af01-df1a5edcf469', storageDomainId='7eb7cfce-ae07-4f41-91c4-b53e81973c90', imageGroupId='2b429890-a743-42ac-927a-5e0673d14904', imageId='08c33239-554a-4d40-93aa-1dc4880c3e52'}), log id: 6db63d67

6. The snapshot job on the host is still running and has not activated the new volume LV yet, so the path the engine queries does not exist yet on the host (see the probe sketch below; the resulting engine error follows it).
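To make the failure mode concrete: the new volume's qcow2 path under /rhev/data-center only appears once the snapshot job activates the LV. A minimal probe sketch of the check the engine is effectively racing against; probe_volume is a hypothetical helper for illustration, not engine or VDSM code:

import json
import os
import subprocess

def probe_volume(path):
    """Return parsed qemu-img info for path, or None while the volume
    is unreachable (e.g. the LV has not been activated yet)."""
    if not os.path.exists(path):
        # LV not active yet: the device node is missing on this host.
        return None
    res = subprocess.run(
        ["/usr/bin/qemu-img", "info", "--output", "json",
         "-f", "qcow2", "-U", path],  # -U: query without taking the image lock
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if res.returncode != 0:
        return None
    return json.loads(res.stdout.decode("utf-8"))

During the race window, probe_volume() on the path from the log returns None; the engine queries it anyway and gets the failure below: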
2021-05-07 09:35:21,070+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-63) [3ada30af-720f-43ae-899b-9ab393f57352] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host4.kvm.local command GetQemuImageInfoVDS failed: Command ['/usr/bin/qemu-img', 'info', '--output', 'json', '-f', 'qcow2', '-U', '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52'] failed with rc=1 out=b'' err=b"qemu-img: Could not open '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52': Could not open '/rhev/data-center/mnt/blockSD/7eb7cfce-ae07-4f41-91c4-b53e81973c90/images/2b429890-a743-42ac-927a-5e0673d14904/08c33239-554a-4d40-93aa-1dc4880c3e52': No such file or directory\n"

Version-Release number of selected component (if applicable):
- ovirt-engine-4.4.6.6-0.10.el8ev.noarch (labs)
- vdsm-4.40.60.3-1.el8ev.x86_64 (labs)
- customer on 4.4.5

How reproducible:
Always

Steps to Reproduce:
* Block storage.
* To make the bug hit every time, add a delay in the snapshot job before it activates the LV, to expose the race (in /usr/lib/python3.6/site-packages/vdsm/virt/jobs/snapshot.py):

374         prepared_drives = {}
375
376         time.sleep(10)  <-------
377         for vm_dev_name, vm_device in new_drives.items():
378             # Adding the device before requesting to prepare it as we want
379             # to be sure to teardown it down even when prepareVolumePath
380             # failed for some unknown issue that left the volume active.
381             prepared_drives[vm_dev_name] = vm_device

* Without the injected delay, add load and/or lots of storage so that LVM commands are slow, slowing down the snapshot job on the host.

Actual results:
* Logs filled with GetQemuImageInfoVDSCommand errors.

Expected results:
* No errors; the engine waits for the snapshot to finish on the host before proceeding (a sketch of such a wait follows at the end of this report).

Additional info:
* This seems harmless, but I don't know what else could happen; it is a race condition that could perhaps lead to other issues (e.g. what if the snapshot job fails, or dumpxml is run too early?).
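For reference, the ordering that the expected behavior implies can be sketched against the vdsm-client API. This is a minimal sketch only: it assumes Host.getJobs reports the virt snapshot job by id and that 'pending'/'running' are the only non-terminal statuses; the helper name and polling parameters are made up for illustration, and the real fix belongs in the engine's (Java) command flow:

import time

from vdsm import client  # provided by the vdsm-client package

def wait_for_snapshot_job(host, job_id, timeout=300, interval=2):
    """Poll the host until the snapshot job is reaped or terminal."""
    cli = client.connect(host, 54321)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        jobs = cli.Host.getJobs(job_type="virt", job_ids=[job_id])
        job = jobs.get(job_id)
        # A job that is no longer reported has been reaped; a terminal
        # status means the new volume can be queried safely.
        if job is None or job["status"] not in ("pending", "running"):
            return job
        time.sleep(interval)
    raise TimeoutError(
        "snapshot job %s still running after %s seconds" % (job_id, timeout))

The point is only the ordering: the Image/Volume/Qemu info commands should run after the host-side job has finished, which is presumably what the GetHostJobsVDSCommand polling provides on the proper 4.4 async path.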
To make it clearer, the combination that hits this is:
- 4.4 engine
- cluster level at 4.3 (or at 4.4 with the VM not rebooted)
Is it the same as BZ1950209?
Looking at the patches, it's the same. Glad it's already targeted for 4.4.6.

*** This bug has been marked as a duplicate of bug 1950209 ***