(In reply to Ilan Zuckerman from comment #0)

...
> The failure is due to:
> ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is
> "[Cannot transfer Virtual Disk: Disk is locked. Please try again later.]".
> HTTP response code is 409.

This means the engine did not release the memory lock yet.

> Both, checksum checks, and the act of downloading the disks is done via the
> SDK example scripts:
> checksum_disk.py
> download_disk.py

Both scripts wait until the disk is OK, but unfortunately the engine changes
the disk to OK a few seconds before releasing the memory lock.

...
> From investigating both of the failures, I can see it occurs only on
> GlusterFS type storage.

Maybe because GlusterFS is slower, it causes additional delays on the engine
side that cause the lock to be released later.

...
> This issue is not marked as a regression because those are new automated
> tests. They were recently introduced (i think in rhv-4.4.5-1) and they are
> using different kind of approach for taking the checksums and downloading
> the disks (using directly SDK example scripts).

I think this is a duplicate of the same bug opened by virt-v2v. This is a
known issue.

> The flow of the failed TC is as following:
>
> 1. clone vm from a pre-defined template latest-rhel-guest-image-8.3 and
>    start it
> 2. Create 4 disks permutations with filesystem on VM:
>    virtio_scsiraw, virtiocow, virtio_scsicow, virtioraw
> 3. Stop the VM
> 4. Check disks checksums as in this example taken from the failed TC [1]

> 6. WAIT FOR DISKS STATUS TO BE 'OK'

This is not needed; checksum_disk and download_disk already do this, and they
should wait until the disk is free to use for the next operation.

> 7. CHECK ENGINE LOG TO SEE THAT ALL OF THE DISKS COMPLETED THEIR CHECKSUMS
>    BY SEARCHING THE FOLLOWING STRING: "Image Download with disk {disk}
>    succeeded" ,AND ONLY THEN CONTINUE

This may not be enough. I think the only way that will work today is to
monitor the events for the transfer job.

This should be fixed in the SDK scripts in helpers/imagetransfer.py: both
finalize/cancel must wait until the job has finished/failed before completing.

...
> * Per each storage type, steps 1 to 9 run once (with proxy download), AND
>   steps 4 to 9 once more (this time with transfer)

Does it fail only when using the proxy? Maybe the engine is installed on a
very slow host, and the download on the engine machine causes the engine to
respond more slowly and unlock the disk later?

...
> Shortly after the failure to download the disk [2], We see a null pointer
> exception in the engine log [3].

...
> [3]:
> 2021-01-30 18:56:51,142+02 ERROR
> [org.ovirt.engine.core.bll.tasks.CommandCallbacksPoller]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-92)
> [a8eeea5b-2886-4224-9998-01b704d7a7aa] Exception:
> java.lang.NullPointerException
>     at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.getImageAlias(TransferDiskImageCommand.java:317)
>     at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.getTransferDescription(TransferDiskImageCommand.java:1379)
>     at deployment.engine.ear.bll.jar//org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand.handleFinishedSuccess(TransferDiskImageCommand.java:893)

This error looks promising - the error is raised here:

    protected String getImageAlias() {
        return getParameters().getAddDiskParameters() != null ?
                getParameters().getAddDiskParameters().getDiskInfo().getDiskAlias() :
                getDiskImage().getDiskAlias();
    }

    protected DiskImage getDiskImage() {
        if (!Guid.isNullOrEmpty(getParameters().getImageId())) {
            setImageId(getParameters().getImageId());
            return super.getDiskImage();
        }
        DiskImage diskImage = super.getDiskImage();
        if (diskImage == null) {
            diskImage = (DiskImage) diskDao.get(getParameters().getImageGroupID());
        }
        return diskImage;
    }

Looks like getDiskImage() or getParameters().getAddDiskParameters().getDiskInfo()
returns null, which is probably a bug. Once we fix this issue and log a clear
error message, it will be clearer why this happened.

A second likely bug is that a failure to get the transfer description results
in a failure to clean up the job and release the lock properly.

Changing priority to high since we cannot have such issues in normal flows.
This will also affect backup applications.
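[Editorial note] For illustration only, a minimal sketch of the "wait for the transfer job" idea discussed in the comment above, not the actual engine or SDK fix. It assumes the engine eventually removes the finished transfer entity; the function name, timeouts and polling interval are made up for illustration:

    import time

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    def wait_for_transfer_cleanup(connection, transfer, disk_id, timeout=300):
        # Assumption: 'transfer' is the object returned by
        # imagetransfer.create_transfer(), and 'connection' is an open SDK connection.
        transfers_service = connection.system_service().image_transfers_service()
        transfer_service = transfers_service.image_transfer_service(transfer.id)

        deadline = time.monotonic() + timeout

        # Wait until the engine removes the transfer entity; FINISHED_SUCCESS alone
        # may be reported a few seconds before the disk lock is released.
        while True:
            try:
                current = transfer_service.get()
            except sdk.NotFoundError:
                break
            if current.phase == types.ImageTransferPhase.FINISHED_FAILURE:
                raise RuntimeError("Transfer %s failed" % transfer.id)
            if time.monotonic() > deadline:
                raise RuntimeError("Timed out waiting for transfer %s cleanup" % transfer.id)
            time.sleep(1)

        # Then wait until the disk itself is reported OK before the next operation.
        disk_service = connection.system_service().disks_service().disk_service(disk_id)
        while disk_service.get().status != types.DiskStatus.OK:
            if time.monotonic() > deadline:
                raise RuntimeError("Timed out waiting for disk %s to unlock" % disk_id)
            time.sleep(1)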
*** This bug has been marked as a duplicate of bug 1849861 ***
This blocks v2v bug 1976020.

In this comment we can see that image transfer switched state to FINISHED
9 seconds before the lock was released:
https://bugzilla.redhat.com/show_bug.cgi?id=1976020#c7
*** Bug 2001894 has been marked as a duplicate of this bug. ***
Hi,

I tried to reproduce bug 1994001 on RHEV 4.4.9-3 and encountered a similar
error to what was seen in this bug.

During my reproduction scenario, I ran QE's automation test case
"TestUploadImages", which uses the upload_disk.py and checksum_disk.py SDK
example scripts. After the upload, the flow does a checksum of the disk.

In my tests, I got the following errors:

2021-10-05 15:37:13,716 - ThreadPoolExecutor-2_1 - VDS - ERROR - [10.46.12.145] Failed to run command ['python3', '/usr/share/doc/python3-ovirt-engine-sdk4/examples/checksum_disk.py', '-c', 'engine', 'bc443017-6542-46b7-8e3c-61c277f68f44']
ERR: Traceback (most recent call last):
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/checksum_disk.py", line 50, in <module>
    connection, disk, types.ImageTransferDirection.DOWNLOAD)
  File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py", line 200, in create_transfer
    transfer = transfers_service.add(transfer)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line 14153, in add
    return self._internal_add(image_transfer, headers, query, wait)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 232, in _internal_add
    return future.wait() if wait else future
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55, in wait
    return self._code(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 229, in callback
    self._check_fault(response)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132, in _check_fault
    self._raise_error(response, body)
  File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118, in _raise_error
    raise error
ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is "[Cannot transfer Virtual Disk: Disk is locked. Please try again later.]". HTTP response code is 409.
OUT:

What is weird is that the test fails on types.ImageTransferDirection.DOWNLOAD,
although the test doesn't perform a download operation at all.

Attaching relevant logs.

Pavel, can you please advise on this?
(In reply to Amit Sharir from comment #13)
> Hi,
>
> I tried to reproduce bug 1994001 on RHEV 4.4.9-3 and encountered a similar
> error to what was seen in this bug.
> During my reproduction scenario, I ran QE's automation test case
> "TestUploadImages" which uses upload_disk.py and checksum_disk.py SDK's.
> After the upload, the flow does a checksum of the disk.
> In my tests, I got the following errors:
>
> 2021-10-05 15:37:13,716 - ThreadPoolExecutor-2_1 - VDS - ERROR -
> [10.46.12.145] Failed to run command ['python3',
> '/usr/share/doc/python3-ovirt-engine-sdk4/examples/checksum_disk.py', '-c',
> 'engine', 'bc443017-6542-46b7-8e3c-61c277f68f44'] ERR: Traceback (most
> recent call last):
>   File "/usr/share/doc/python3-ovirt-engine-sdk4/examples/checksum_disk.py",
> line 50, in <module>
>     connection, disk, types.ImageTransferDirection.DOWNLOAD)
>   File
> "/usr/share/doc/python3-ovirt-engine-sdk4/examples/helpers/imagetransfer.py",
> line 200, in create_transfer
>     transfer = transfers_service.add(transfer)
>   File "/usr/lib64/python3.6/site-packages/ovirtsdk4/services.py", line
> 14153, in add
>     return self._internal_add(image_transfer, headers, query, wait)
>   File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 232,
> in _internal_add
>     return future.wait() if wait else future
>   File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 55,
> in wait
>     return self._code(response)
>   File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 229,
> in callback
>     self._check_fault(response)
>   File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 132,
> in _check_fault
>     self._raise_error(response, body)
>   File "/usr/lib64/python3.6/site-packages/ovirtsdk4/service.py", line 118,
> in _raise_error
>     raise error
> ovirtsdk4.Error: Fault reason is "Operation Failed". Fault detail is
> "[Cannot transfer Virtual Disk: Disk is locked. Please try again later.]".
> HTTP response code is 409.
> OUT:
>
> What is weird is that the test fails on
> types.ImageTransferDirection.DOWNLOAD although the test doesn't perform a
> download operation at all.
> Attaching relevant logs.
>
> Pavel can you please advise on this?

The error that you encountered happens for the same reason as this bug (1923178).

After an image upload, "imagetransfer.finalize_transfer()" checks for
"ImageTransferPhase.FINISHED_SUCCESS/FINISHED_FAILURE" and finishes the moment
one of the final statuses appears, allowing the next command to start running.
But the final status appears too soon, before all the disk locks were released.
Thus the following image upload / download operation might fail.

Regarding your checksum issue and your surprise at seeing the seemingly
unrelated "ImageTransferDirection.DOWNLOAD" there - the checksum is actually a
2-step operation: (1) download the image, and (2) calculate the checksum of the
image downloaded at step #1.
https://github.com/oVirt/python-ovirt-engine-sdk4/blob/main/examples/checksum_disk.py#L50

So actually it's an upload+download scenario that fails due to the "locks not
yet released although the 1st command status was already updated as finished"
issue.
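[Editorial note] As a stopgap on the test side (not the engine fix), one option is to retry transfer creation while the engine still holds the disk lock. A minimal sketch, assuming the examples' helpers/imagetransfer.py is importable and that matching the fault text of the 409 error shown above is an acceptable heuristic; the helper name and retry parameters are hypothetical:

    import time

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    from helpers import imagetransfer

    def create_transfer_with_retry(connection, disk, direction, retries=30, delay=2):
        for attempt in range(1, retries + 1):
            try:
                return imagetransfer.create_transfer(connection, disk, direction)
            except sdk.Error as e:
                # Retry only on the "Disk is locked" fault; re-raise anything else
                # or give up after the last attempt.
                if "Disk is locked" not in str(e) or attempt == retries:
                    raise
                time.sleep(delay)

    # Usage:
    # transfer = create_transfer_with_retry(
    #     connection, disk, types.ImageTransferDirection.DOWNLOAD)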
As this bug and bug 2001894 were marked as duplicates, we/QE need to make sure
we also cover the dup bug flow.

@Amit, please add the bug 2001894 flow to the verification effort. From the dup
bug description:

"
We have a script that does the following flow:
1. Create snapshot
2. Download VM at the time of that snapshot
3. Delete snapshot

sometimes a failure when removing the snapshot
"
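[Editorial note] For reference, a minimal sketch of that flow with the Python SDK. The download step is omitted (it would be done with the SDK download examples), 'connection' and 'vm_id' are placeholders, and this is not the exact script from the dup bug:

    import time

    import ovirtsdk4.types as types

    def snapshot_flow(connection, vm_id):
        vm_service = connection.system_service().vms_service().vm_service(vm_id)
        snaps_service = vm_service.snapshots_service()

        # 1. Create a snapshot and wait until it is ready.
        snap = snaps_service.add(
            types.Snapshot(description="backup", persist_memorystate=False))
        snap_service = snaps_service.snapshot_service(snap.id)
        while snap_service.get().snapshot_status != types.SnapshotStatus.OK:
            time.sleep(1)

        # 2. Download the VM disks at the time of that snapshot (not shown here).

        # 3. Delete the snapshot - the step that sometimes failed while the engine
        #    still held the disk lock from the download transfer.
        snap_service.remove()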
Version:
ovirt-engine-4.4.9.2-0.6.el8ev.noarch
vdsm-4.40.90.2-1.el8ev.x86_64

Verification flow:
I used multiple automation/manual tests (with different storage types) that
reproduced this issue in the past - without any success in reproducing it now.

Verification conclusions:
The expected output matched the actual output. All the flows I mentioned
completed with no errors.

Bug verified.
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.