Bug 1729359

Summary: Failed image upload leaves disk in locked state, requiring manual intervention to clean up.
Product: Red Hat Enterprise Virtualization Manager Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine    Assignee: Ahmad Khiet <akhiet>
Status: CLOSED ERRATA QA Contact: Ilan Zuckerman <izuckerm>
Severity: low Docs Contact:
Priority: unspecified    
Version: 4.3.4    CC: aefrat, akhiet, eshenitz, frolland, izuckerm, mkalinin, sbruce, sfishbai, tnisan, ybenshim
Target Milestone: ovirt-4.4.5    Keywords: TestOnly, ZStream
Target Release: 4.4.5   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.4.5.6 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-04-14 11:39:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments: upload log (flags: none)

Description Germano Veit Michel 2019-07-12 04:42:51 UTC
Description of problem:

Disk uploads can fail when a malformed disk image is uploaded. The problem is that such failed uploads leave the image locked instead of cleaning it up. The customer cannot clean up on their own and requires GSS to intervene manually. It would be nice to handle these failures gracefully.

Version-Release number of selected component (if applicable):
rhvm-4.3.5.1-0.1.el7.noarch (reproduced)
rhvm-4.3.3.7-0.1.el7.noarch (customer)

How reproducible:
100%

Steps to Reproduce:
1. Create a qcow2 image whose file size is larger than its virtual size (a quick sanity check is shown below):
   $ qemu-img create -f qcow2 test.qcow2 10G
   $ truncate -s 20G test.qcow2 
2. Upload via Admin Portal
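
For reference, a quick way to confirm the image from step 1 has the shape that triggers the problem (file size larger than virtual size); exact output wording varies by qemu version:
   $ qemu-img info test.qcow2   # reports a virtual size of 10G
   $ ls -l test.qcow2           # the file itself is 20G, larger than the virtual size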

Actual results:
Disk locked on failure

Expected results:
Disk unlocked on failure; the user is able to clean up on their own.
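
For illustration, "clean up on their own" here simply means removing the failed disk from the Admin Portal, or via the REST API with something like the following (engine URL, credentials and disk id are placeholders):
   $ curl -k -u admin@internal:PASSWORD -X DELETE \
         https://ENGINE_FQDN/ovirt-engine/api/disks/<disk-id>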

Additional info:

1. For the case described in "Steps to Reproduce", createVolume fails with:

2019-07-12 14:08:36,943+1000 ERROR (tasks/1) [storage.Volume] Unexpected error (volume:1267)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/volume.py", line 1224, in create
    initialSize=initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 498, in _create
    size, initialSize)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 543, in calculate_volume_alloc_size
    preallocate, capacity, initial_size)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/blockVolume.py", line 344, in calculate_volume_alloc_size
    initial_size)

2. The engine sees the failure to create the volume, but does not end the failed flow gracefully.

2019-07-12 14:08:51,627+10 WARN  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [2b1cdfc] Validation of action 'RemoveDisk' failed for user admin@internal-authz. Reasons: VAR__ACTION__REMOVE,VAR__TYPE__DISK,ACTION_TYPE_FAILED_VM_IMAGE_DOES_NOT_EXIST

2019-07-12 14:08:51,629+10 INFO  [org.ovirt.engine.core.bll.storage.disk.RemoveDiskCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [2b1cdfc] Lock freed to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='[]'}'

2019-07-12 14:08:51,629+10 INFO  [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [2b1cdfc] Lock freed to object 'EngineLock:{exclusiveLocks='[]', sharedLocks='[]'}'

2019-07-12 14:08:51,674+10 ERROR [org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [2b1cdfc] Error during log command: org.ovirt.engine.core.bll.storage.disk.image.TransferDiskImageCommand. Exception null
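
For reference, the manual intervention mentioned in the description typically means releasing the stuck image with the dbutils helper shipped with ovirt-engine; a rough sketch (the exact options may differ between versions, see the script's -h output):

   # list disks currently locked in the engine database
   $ /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -q -t disk
   # release a specific stuck disk by its id
   $ /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t disk <disk-id>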

Comment 2 Daniel Gur 2019-08-28 13:11:31 UTC
sync2jira

Comment 3 Daniel Gur 2019-08-28 13:15:43 UTC
sync2jira

Comment 4 Fedor Gavrilov 2019-10-02 12:03:46 UTC
Created attachment 1621820 [details]
upload log

Comment 11 Avihai 2019-10-28 08:52:33 UTC
Moving back to 'Assigned' as this bug reproduces; see the last comment by Yosi.

Comment 15 Ahmad Khiet 2020-12-15 13:43:30 UTC
Hi 

After trying to reproduce the bug in several ways, one of them being the steps provided in the bug description, I see no trace of the problem on the engine side; the upload works as expected.
I also tried to make an upload fail, but in the ways I tried, the engine cleaned the disks up.

I remember running into the manual cleanup of uploaded images myself, but at that time I was working on another bug and imageio was in the process of being upgraded.

Please verify that this still affects users, and whether there are other reproduction steps.

Comment 16 Avihai 2020-12-16 12:15:14 UTC
(In reply to Ahmad Khiet from comment #15)
> Hi 
> 
> After trying to reproduce the bug in several ways, one of them being the
> steps provided in the bug description, I see no trace of the problem on the
> engine side; the upload works as expected.
> I also tried to make an upload fail, but in the ways I tried, the engine
> cleaned the disks up.
> 
> I remember running into the manual cleanup of uploaded images myself, but at
> that time I was working on another bug and imageio was in the process of
> being upgraded.
> Please verify that this still affects users, and whether there are other
> reproduction steps.

Ilan, can you see if this reproduces? (Use Yossi's comment 10, as he managed to reproduce it a while back.)

This bug was opened by Germano, not QE, so QE does not know whether this still affects users/customers.
This is a question for the GSS focal point, Germano; setting NeedInfo on him.

Comment 17 Ilan Zuckerman 2020-12-16 15:01:55 UTC
I was able to reproduce this issue on:

ovirt-engine-4.4.4.4-0.9.el8ev.noarch
rhv-release-4.4.4-5-001.noarch

Steps as described in comment #10, with the same kind of errors in the engine and vdsm logs.
Attaching the logs.

Comment 22 Ahmad Khiet 2020-12-29 10:11:59 UTC
After testing on master I can't reproduce the bug; maybe there is some missing detail about why it reproduces for Ilan.
I tried different image sizes, but not 20G.

Following Ilan's logs, I see an error in AddDiskCommandCallback caused by a vdsm error raised at:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/storage/blockVolume.py#L320
with the following error:
    log.error("The requested initial %s is bigger "
              "than the max size %s", initial_size, max_size)
which is reflected back to the engine.
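
For context, the check that fires here boils down to comparing the requested initial allocation against the maximum allocation allowed for the volume. A simplified Python sketch of that idea (names and the overhead constant are illustrative, not the exact vdsm code):

    GIB = 1024 ** 3
    QCOW_OVERHEAD = 1.1  # illustrative metadata overhead factor for qcow2 volumes

    def calculate_volume_alloc_size(capacity, initial_size):
        """Validate the initial allocation requested for a sparse qcow2 volume."""
        max_size = int(capacity * QCOW_OVERHEAD)
        if initial_size > max_size:
            # This is the path hit by the reproducer: the engine passes the
            # uploaded file size (20G, i.e. 21474836480 bytes) as initial_size,
            # while a 10G qcow2 volume only allows roughly 11G here.
            raise ValueError("The requested initial %s is bigger than the max size %s"
                             % (initial_size, max_size))
        return initial_size

    # Reproducer numbers: 10G virtual size, 20G truncated file -> rejected.
    try:
        calculate_volume_alloc_size(capacity=10 * GIB, initial_size=20 * GIB)
    except ValueError as e:
        print(e)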

There is another, similar bug, which was solved by Benny:
https://bugzilla.redhat.com/show_bug.cgi?id=1758048

Ilan, did you upload to a file storage domain or a block one?

Comment 23 Ilan Zuckerman 2020-12-29 11:41:23 UTC
I uploaded the disk to a block storage domain.

Comment 24 Ahmad Khiet 2020-12-29 14:34:21 UTC
Now it reproduced on iSCSI.
There are two separate issues here:
The first is solved by Benny in https://gerrit.ovirt.org/c/ovirt-engine/+/111196 and is about the exceptions in the logs.

The second, about the image size, is the subject of this bug.

Comment 27 Ilan Zuckerman 2021-03-02 12:30:51 UTC
Trying to verify on rhv-4.4.5-7 according to the steps from comment #10.

After a failed attempt to upload the disk using the UI, it remains in the 'Disks' section with an 'illegal' status.
Is that the expected result?

From engine.log:

2021-03-02 14:22:05,174+02 ERROR [org.ovirt.engine.core.bll.tasks.SPMAsyncTask] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-3) [] BaseAsyncTask::logEndTaskFailure: Task 'ebc26725-f650-4b43-96df-416384c89fa5' (Parent Command 'AddImageFromScratch', Parameters Type 'org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters') ended with failure:
-- Result: 'cleanSuccess'
-- Message: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Error creating a new volume: ("Volume creation 336e6d35-49aa-476d-ac3b-c9bd26194a9e failed: Invalid parameter: 'initial_size=21474836480'",) abortedcode=205, code = 205',
-- Exception: 'VDSGenericException: VDSErrorException: Failed to HSMGetAllTasksStatusesVDS, error = value=Error creating a new volume: ("Volume creation 336e6d35-49aa-476d-ac3b-c9bd26194a9e failed: Invalid parameter: 'initial_size=21474836480'",) abortedcode=205, code = 205'
.
.
.
2021-03-02 14:22:09,861+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-86) [] EVENT_ID: USER_ADD_DISK_FINISHED_FAILURE(2,022), Add-Disk operation failed to complete.
2021-03-02 14:22:09,896+02 INFO  [org.ovirt.engine.core.bll.storage.disk.image.ImageTransferUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-86) [f6aaeaff-e4b4-4d3d-bac6-e2c4bffed7b1] Updating image transfer 6a3b8980-c0d5-4a19-b09c-96e3904b397f phase to Finished Failure



Now, attempting to remove the disk from the UI:

2021-03-02 14:28:58,365+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-37485) [8475c7d8-4584-4862-9e90-2322b8473bcd] EVENT_ID: IRS_BROKER_COMMAND_FAILURE(10,803), VDSM command DeleteImageGroupVDS failed: Image does not exist in domain: 'image=df7c7581-bbcb-4eac-b3c4-489bac0387fa, domain=9db95765-0fb7-485e-91f2-381354a66d13'
.
.
.
2021-03-02 14:28:58,639+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-37485) [8475c7d8-4584-4862-9e90-2322b8473bcd] EVENT_ID: USER_FINISHED_REMOVE_DISK(2,014), Disk test was successfully removed from domain iscsi_0 (User admin@internal-authz).



Result: the disk was removed from the UI.

Comment 28 Ahmad Khiet 2021-03-02 12:54:02 UTC
Hi, 

that is the expected behavior; otherwise, the image would still be 'locked' and could not be removed.

Comment 29 Ilan Zuckerman 2021-03-02 13:04:58 UTC
(In reply to Ahmad Khiet from comment #28)
> Hi, 
> 
> that is the expected behavior; otherwise, the image would still be 'locked'
> and could not be removed.

Good, switching to 'verified' according to comment #27.

Comment 34 errata-xmlrpc 2021-04-14 11:39:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1169