Bug 1994001 - Removing disks fails every once in a while after upload operation (using proxy_url / transfer_url)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.80.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.0
Target Release: ---
Assignee: Eyal Shenitzky
QA Contact: Amit Sharir
URL:
Whiteboard:
Depends On: 1923178 2014017
Blocks:
 
Reported: 2021-08-16 13:56 UTC by Amit Sharir
Modified: 2022-04-01 16:09 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2021-10-21 12:52:02 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?
asharir: testing_plan_complete+


Attachments
engine.log (2.35 MB, text/plain)
2021-08-16 13:56 UTC, Amit Sharir
vdsm2.log (5.04 MB, text/plain)
2021-08-16 13:59 UTC, Amit Sharir
vdsm3.log (11.95 MB, text/plain)
2021-08-16 14:00 UTC, Amit Sharir


Links
Red Hat Issue Tracker: RHV-43038 (last updated 2021-08-16 13:57:51 UTC)

Description Amit Sharir 2021-08-16 13:56:26 UTC
Created attachment 1814458 [details]
engine.log

Description of problem:

We are seeing intermittent disk removal failures after upload operations. The issue shows up in our automation runs as part of the test case "TestUploadImages": at the end of this test case we delete the uploaded disks, and every once in a while this deletion fails.

The errors in the logs are failures to remove image volumes and failures to remove disks.

The main flow of this test case is uploading a disk matrix with proxy_url/transfer_url (a condensed sketch follows the list):

    - The upload is done on the SPM host using the SDK example script
    - MD5 checksums are compared before and after the upload with the help of
      an SDK example script
    - Uploads of the disk images are done one after the other (raw, qcow2_v3,
      qcow2_v3, ISO)
    - Data correctness is verified by comparing md5sums of the original and
      uploaded images
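
For reference, a condensed sketch of this upload flow, modeled on the SDK's upload_disk.py example. The engine URL, credentials, storage domain name, and image path below are placeholders, not our actual automation code:

    # Condensed sketch of the upload flow (placeholders throughout).
    import ssl
    import time
    from http import client as http_client
    from urllib.parse import urlparse

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    connection = sdk.Connection(
        url='https://engine.example.com/ovirt-engine/api',  # placeholder
        username='admin@internal',
        password='password',                                # placeholder
        ca_file='ca.pem',                                   # placeholder
    )
    system_service = connection.system_service()
    disks_service = system_service.disks_service()

    # Create the target disk and wait until it is unlocked.
    disk = disks_service.add(
        types.Disk(
            name='uploaded-disk',
            format=types.DiskFormat.COW,
            provisioned_size=1024 ** 3,
            storage_domains=[types.StorageDomain(name='nfs_1')],  # placeholder
        )
    )
    disk_service = disks_service.disk_service(disk.id)
    while disk_service.get().status != types.DiskStatus.OK:
        time.sleep(1)

    # Start an image transfer; the engine returns transfer_url (direct
    # to the host) and proxy_url (via the imageio proxy).
    transfers_service = system_service.image_transfers_service()
    transfer = transfers_service.add(
        types.ImageTransfer(
            disk=types.Disk(id=disk.id),
            direction=types.ImageTransferDirection.UPLOAD,
        )
    )
    transfer_service = transfers_service.image_transfer_service(transfer.id)
    while transfer.phase == types.ImageTransferPhase.INITIALIZING:
        time.sleep(1)
        transfer = transfer_service.get()

    # PUT the image data to transfer_url, or to proxy_url if the host
    # is not reachable directly.
    url = urlparse(transfer.transfer_url or transfer.proxy_url)
    context = ssl.create_default_context(cafile='ca.pem')  # engine CA
    conn = http_client.HTTPSConnection(url.netloc, context=context)
    with open('disk.qcow2', 'rb') as image:  # placeholder image file
        data = image.read()
    conn.request('PUT', url.path, body=data)
    conn.getresponse().read()

    transfer_service.finalize()

The finalize() call at the end matters for the deletion step: as discussed in the comments below, the disk must not be removed before the transfer has actually finished.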


Engine log errors:

2021-08-04 14:14:00,606+03 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-9) [] BaseAsyncTask::logEndTaskFailure: Task 'fb73c87a-8af0-4615-a34e-02ac277cdf98' (Parent Command 'RemoveDisk', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = value=Could not remove all image's volumes: ('2f7e055e-4e74-4ea2-8a7a-df7cf5bdf89b [Errno 39] Directory not empty',) abortedcode=256',
-- Exception: 'VDSGenericException: VDSErrorException: Failed in vdscommand to HSMGetAllTasksStatusesVDS, error = value=Could not remove all image's volumes: ('2f7e055e-4e74-4ea2-8a7a-df7cf5bdf89b [Errno 39] Directory not empty',) abortedcode=256',


2021-08-04 14:14:00,623+03 ERROR [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engine-Thread-63816) [disks_delete_969f8082-fba2-4b92] Ending command 'org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand' with failure    


VDSM log errors:

2021-08-04 14:13:51,333+0300 ERROR (tasks/9) [storage.StorageDomainManifest] removed image dir: /rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__storage3__nfs__1/c8c7cbec-d445-4320-b622-334e3494ecc7/images/_remove_me_2f7e055e-4e74-4ea2-8a7a-df7cf5bdf89b can't be removed (fileSD:258)
2021-08-04 14:13:51,333+0300 ERROR (tasks/9) [storage.TaskManager.Task] (Task='fb73c87a-8af0-4615-a34e-02ac277cdf98') Unexpected error (task:877)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 251, in purgeImage
    self.oop.os.rmdir(toDelDir)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/outOfProcess.py", line 238, in rmdir
    self._iop.rmdir(path)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 550, in rmdir
    return self._sendCommand("rmdir", {"path": path}, self.timeout)
  File "/usr/lib/python3.6/site-packages/ioprocess/__init__.py", line 479, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 39] Directory not empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1988, in purgeImage
    domain.purgeImage(sdUUID, imgUUID, volsByImg, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1054, in purgeImage
    self._manifest.purgeImage(sdUUID, imgUUID, volsImgs, discard)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/fileSD.py", line 259, in purgeImage
    raise se.ImageDeleteError("%s %s" % (imgUUID, str(e)))
vdsm.storage.exception.ImageDeleteError: Could not remove all image's volumes: ('2f7e055e-4e74-4ea2-8a7a-df7cf5bdf89b [Errno 39] Directory not empty',)

Version-Release number of selected component (if applicable):
vdsm-4.40.80.3-1.el8ev.x86_64
ovirt-engine-4.4.8.2-0.11.el8ev.noarch


How reproducible:
This is a flaky test case that fails in ~25% of our automation runs.
The bug isn't reproducible manually; I managed to reproduce it only once, using the automation tests.

Steps to Reproduce:
1. Upload a disk via proxy_url/transfer_url and try to delete it afterward.

Actual results:
The procedure above is not reproducible manually; in our automation tests the issue reproduces ~25% of the time.

Expected results:
Deleting disks uploaded via proxy_url/transfer_url should always succeed.

Additional info:
The engine log and vdsm logs from the automation reproduction are attached (these logs are from my local environment).
If required, I can attach additional logs from our automation runs.

Comment 2 Amit Sharir 2021-08-16 13:59:06 UTC
Created attachment 1814462 [details]
vdsm2.log

Comment 3 Amit Sharir 2021-08-16 14:00:26 UTC
Created attachment 1814463 [details]
vdsm3.log

Comment 7 Eyal Shenitzky 2021-08-23 11:37:32 UTC
Nir, can you please have a look?

Comment 8 Nir Soffer 2021-09-01 14:11:58 UTC
(In reply to Amit Sharir from comment #0)

> Steps to Reproduce:
> 1. Upload a disk via proxy_url/transfer_url and try to delete it afterward.

This is not detailed enough. You need to provide the exact command used, for
example whether you are running upload_disk.py, or the exact API calls sent if
you are using the API directly.

How do you wait for completion of the upload before you delete the disk?

What may happen is that you wait until the disk status is "OK" and then try to
delete the disk before imageio closes the file descriptor, or before qemu-nbd
(if you are using format="raw") closes the file descriptor. In this case vdsm
may fail to purge the open image, and deleting the directory will then fail
with:

    OSError: [Errno 39] Directory not empty
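
A minimal sketch of a wait that avoids this race, assuming the ovirtsdk4 transfer_service and disk_service objects from an upload flow like the one in the description (an illustration, not the reporter's automation code):

    # Sketch: after finalize(), wait for the transfer itself to finish,
    # not only for the disk status, before removing the disk.
    import time

    import ovirtsdk4 as sdk
    import ovirtsdk4.types as types

    transfer_service.finalize()

    while True:
        try:
            transfer = transfer_service.get()
        except sdk.NotFoundError:
            # The engine already removed the finished transfer.
            break
        if transfer.phase == types.ImageTransferPhase.FINISHED_SUCCESS:
            break
        time.sleep(1)

    # Only now wait for the disk to unlock, then delete it.
    while disk_service.get().status != types.DiskStatus.OK:
        time.sleep(1)
    disk_service.remove()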

Did you look in the vdsm log before this error? Do we see earlier errors leading up to it?

I will look at this only after you provide detailed reproduction instructions.

The best way to reproduce this is to create a script that I can run on my setup
in a loop (e.g. 100 times); see the sketch below.
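
For example, assuming the upload-and-delete flow is wrapped in a hypothetical upload_and_delete() function, the loop driver can be as simple as:

    # Hypothetical loop driver for the reproduction script.
    for i in range(100):
        print("iteration", i)
        upload_and_delete()  # hypothetical wrapper: upload, wait, delete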

Comment 9 Amit Sharir 2021-10-04 13:48:36 UTC
Hi Nir. 

Sorry, I missed the "need info" here.
I am working on reproducing this issue and will update ASAP with all the relevant information and logs (it is only reproducible via Jenkins runs).

Comment 10 Amit Sharir 2021-10-05 14:25:35 UTC
Updating that I am having some issues reproducing this bug on RHEV 4.4.9-3; I am getting failures due to different errors.
I added my relevant findings in bug 1923178 in the following comment: https://bugzilla.redhat.com/show_bug.cgi?id=1923178#c13.

For now, I will continue to try to reproduce this on older RHEV versions.

Comment 11 Amit Sharir 2021-10-06 12:07:48 UTC
Following our meeting today - I will continue to work on this bug after https://bugzilla.redhat.com/show_bug.cgi?id=1923178 is solved.

Comment 12 Eyal Shenitzky 2021-10-18 07:55:36 UTC
(In reply to Amit Sharir from comment #11)
> Following our meeting today - I will continue to work on this bug after
> https://bugzilla.redhat.com/show_bug.cgi?id=1923178 is solved.

The bug is fixed and closed, can you try to reproduce it now?

Comment 13 Amit Sharir 2021-10-18 08:41:53 UTC
Since the fix of bug 1923178, I haven't been able to reproduce this issue.
I want to try to reproduce it in a few more flows; if I don't succeed, I will close this bug.

Comment 15 Amit Sharir 2021-10-21 12:52:02 UTC
(In reply to Amit Sharir from comment #13)
> Since the fix of bug 1923178, I haven't been able to reproduce this issue.
> I want to try to reproduce it in a few more flows; if I don't succeed, I
> will close this bug.

I couldn't reproduce this issue (on ovirt-engine-4.4.9.2-0.6.el8ev.noarch / vdsm-4.40.90.2-1.el8ev.x86_64).

Thus, I am closing this bug for now.
I will reopen it if I encounter it in the future.

