Created attachment 1814458 [details]
engine.log

Description of problem:
We see disk removal failures every once in a while after an upload operation. The issue shows up in our automation runs as part of the test case "TestUploadImages": at the end of this test case we delete the uploaded disks, and every once in a while the deletion fails. The errors in the logs are failures to remove the image's volumes and failures to remove the disks.

The main flow of this test case is:
- Upload a disk matrix via proxy_url/transfer_url.
- The upload is done on the SPM host using the SDK example script.
- MD5 checksums are compared before and after the upload with the help of an SDK example script.
- Disk images are uploaded one after the other (raw, qcow2_v3, qcow2_v3, ISO).
- Verify data correctness by comparing the md5sum of the original and uploaded images.

Engine log errors:

2021-08-04 14:14:00,606+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-9) [] BaseAsyncTask::logEndTaskFailure: Task 'fb73c87a-8af0-4615-a34e-02ac277cdf98' (Parent Command 'RemoveDisk', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = value=Could not remove all image's volumes: ('2f7e055e-4e74-4ea2-8a7a-df7cf5bdf89b [Errno 39] Directory not empty',) abortedcode=256',
-- Exception: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = value=Could not remove all image's volumes: ('2f7e055e-4e74-4ea2-8a7a-df7cf5bdf89b [Errno 39] Directory not empty',) abortedcode=256',

2021-08-04 14:14:00,623+03 ERROR [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engine-Thread-63816) [disks_delete_969f8082-fba2-4b92] Ending command 'org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand' with failure

VDSM log errors:

2021-08-04 14:13:51,333+0300 ERROR (tasks/9) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__storage3__nfs__1/c8c7cbec-d445-4320-b622-334e3494ecc7/images/_remove_me_2f7e055e-4e74-4ea2-8a7a-df7cf5bdf89b can't be removed (fileSD:258)

2021-08-04 14:13:51,333+0300 ERROR (tasks/9) [storage.TaskManager.Task] (Task='fb73c87a-8af0-4615-a34e-02ac277cdf98') Unexpected error (task:877)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 251, in purgeImage
    self.oop.os.rmdir(toDelDir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 238, in rmdir
    self._iop.rmdir(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 550, in rmdir
    return self._sendCommand("rmdir", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 39] Directory not empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1988, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1054, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 259, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
vdsm.storage.exception.ImageDeleteError: Could not remove all image's volumes: ('2f7e055e-4e74-4ea2-8a7a-df7cf5bdf89b [Errno 39] Directory not empty',)

Version-Release number of selected component (if applicable):
vdsm-4.40.80.3-1.el8ev.x86_64
ovirt-engine-4.4.8.2-0.11.el8ev.noarch

How reproducible:
A flaky test case that fails in ~25% of our automation runs. The bug is not reproducible manually; I succeeded only once, using the automation tests.

Steps to Reproduce:
1. Upload a disk via proxy_url/transfer_url and try to delete it afterward.

Actual results:
The procedure above is not reproducible manually; in our automation tests the issue reproduces ~25% of the time.

Expected results:
Deleting disks uploaded via proxy_url/transfer_url should always succeed.

Additional info:
The engine log and VDSM log from an automation reproduction are attached (these logs are from my local environment). If required, I can attach additional logs from our automation runs.
Created attachment 1814462 [details] vdsm2.log
Created attachment 1814463 [details] vdsm3.log
Nir, can you please have a look?
(In reply to Amit Sharir from comment #0)
> Steps to Reproduce:
> 1. Upload a disk via proxy_url/transfer_url and try to delete it afterward.

This is not detailed enough. You need to provide the exact command used, for example whether you are running upload_disk.py, or the exact API calls sent if you are using the API directly.

How do you wait for completion of the upload before you delete the disk? What may happen is that you wait until the disk status is "OK" and try to delete the disk before imageio closes the file descriptor, or before qemu-nbd (if you are using format="raw") closes the file descriptor. In this case vdsm may fail to purge the still-open image, and then deleting the image directory will fail with:

    OSError: [Errno 39] Directory not empty

Did you look in the vdsm log before this error? Do we see earlier errors before this one? I will look at this only after you provide detailed reproduction instructions. The best way to reproduce this is to create a script that I can run on my setup in a loop (e.g. 100 times).
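To illustrate the wait being asked about, here is a hedged sketch, assuming the ovirtsdk4 Python SDK and that the reproduction script already holds the image transfer service and disk service objects. It waits for the transfer itself to finish (i.e. for imageio/qemu-nbd to be done with the image) before removing the disk, rather than only waiting for the disk status:

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

def finalize_and_remove(transfer_service, disk_service, timeout=300):
    """Finalize an upload and remove the disk only after the transfer ends.

    transfer_service and disk_service are assumed to be the SDK services for
    the image transfer and the uploaded disk; timeout is in seconds.
    """
    deadline = time.monotonic() + timeout
    transfer_service.finalize()

    # Wait for the transfer to reach a terminal phase.
    while True:
        try:
            transfer = transfer_service.get()
        except sdk.NotFoundError:
            # On some engine versions the transfer entity is removed once it
            # completes successfully.
            break
        if transfer.phase == types.ImageTransferPhase.FINISHED_SUCCESS:
            break
        if transfer.phase == types.ImageTransferPhase.FINISHED_FAILURE:
            raise RuntimeError("image transfer failed")
        if time.monotonic() > deadline:
            raise RuntimeError("timed out waiting for the transfer to finish")
        time.sleep(1)

    # Then wait for the disk to be unlocked and remove it.
    while disk_service.get().status != types.DiskStatus.OK:
        if time.monotonic() > deadline:
            raise RuntimeError("timed out waiting for the disk to unlock")
        time.sleep(1)

    disk_service.remove()

Running a script built around this in a shell loop (e.g. 100 iterations) should show whether the failure depends on deleting the disk while the image is still open.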
Hi Nir. Sorry I missed the "need info" here. I am working on reproducing this issue and will update with all the relevant information and logs ASAP (it is only reproducible via Jenkins runs).
Updating that I am having trouble reproducing this bug on RHEV 4.4.9-3; I am getting failures due to different errors. I added my relevant findings to bug 1923178 in the following comment: https://bugzilla.redhat.com/show_bug.cgi?id=1923178#c13. For now, I will continue trying to reproduce this on older RHEV versions.
Following our meeting today - I will continue to work on this bug after https://bugzilla.redhat.com/show_bug.cgi?id=1923178 is solved.
(In reply to Amit Sharir from comment #11)
> Following our meeting today - I will continue to work on this bug after
> https://bugzilla.redhat.com/show_bug.cgi?id=1923178 is solved.

That bug is fixed and closed now; can you try to reproduce this one?
Since the fix of bug 1923178, I have not been able to reproduce this issue. I want to try a few more flows; if I don't succeed, I will close this bug.
(In reply to Amit Sharir from comment #13)
> Since the fix of bug 1923178, I have not been able to reproduce this issue.
> I want to try a few more flows; if I don't succeed, I will close this bug.

I couldn't reproduce this issue (on ovirt-engine-4.4.9.2-0.6.el8ev.noarch / vdsm-4.40.90.2-1.el8ev.x86_64), so I am closing this bug for now. I will reopen it if I encounter it again in the future.