Created attachment 895411 [details]
vdsm and engine logs

Description of problem:
Removing a vm disk can fail with ResourceTimeout, which leaves the disk in illegal status.

Thread-31::DEBUG::2014-05-13 17:07:02,341::resourceManager::819::ResourceManager.Owner::(acquire) 1367603e-562a-4b4e-99ae-39dc1b4367bd: request for 'Storage.3012f98b-b2d5-48c8-b93e-e547e70dab64' timed out after '120.000000' seconds
Thread-31::ERROR::2014-05-13 17:07:02,344::task::866::TaskManager.Task::(_setError) Task=`1367603e-562a-4b4e-99ae-39dc1b4367bd`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1506, in deleteImage
    vars.task.getExclusiveLock(STORAGE, imgUUID)
  File "/usr/share/vdsm/storage/task.py", line 1329, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()

The engine reports that the image was removed successfully, but with a storage failure:

e0da05060, Call Stack: null, Custom Event ID: -1, Message: Disk virtio_scsi_raw_False_disk was successfully removed from domain iscsi_1 with storage failure (User admin).

Version-Release number of selected component (if applicable):
vdsm-4.14.7-1.el6ev.x86_64
rhevm-3.4.0-0.18.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
Setup: 2 iSCSI storage domains, a vm with 5 disks (2 preallocated, 2 sparse, 1 boot).
1. Live migrate all disks to the second storage domain.
2. After all disks have moved, power off the vm.
3. Deactivate all disks (except the boot disk) and remove them.

Actual results:
One of the disks is left in illegal status.

Expected results:
The disk should be deleted, or a proper error message should be shown.

Additional info:
The delete operation on the storage failed. What else would you expect us to do?
When performing live storage migration of a disk from sd1 to sd2, the operation finishes with a deleteImage on the image in sd1 (the source domain). For this operation, vdsm acquires an exclusive lock keyed on the volume UUID. Trying to delete the disk while the async deleteImage at the end of the LSM is still running makes vdsm try to acquire a lock for the same resource (the same volume UUID, this time for the target domain, sd2). That resource is already locked by the deleteImage on sd1, and since both resources have the same ID they share the same lock. After 2 minutes vdsm raises a ResourceTimeout exception and returns code 851 to the engine, which in turn changes the disk status to illegal, even though the volume was not actually deleted.
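To make the contention concrete, here is a minimal, runnable toy sketch (plain Python with hypothetical names such as SimpleResourceManager and made-up UUIDs; it is not vdsm's ResourceManager) of how a second exclusive acquire on a lock keyed only by the image UUID times out while the first holder is still running:

import threading

class ResourceTimeout(Exception):
    pass

class SimpleResourceManager:
    # Toy stand-in for an exclusive-lock table keyed by a single name.
    def __init__(self):
        self._locks = {}
        self._mutex = threading.Lock()

    def acquire_exclusive(self, name, timeout):
        with self._mutex:
            lock = self._locks.setdefault(name, threading.Lock())
        if not lock.acquire(timeout=timeout):
            raise ResourceTimeout("request for %r timed out" % name)
        return lock

rm = SimpleResourceManager()
img_uuid = "f81dc8e6-0000-0000-0000-000000000000"  # hypothetical image UUID

# deleteImage at the end of the LSM (source domain) holds the lock,
# keyed only by the image UUID, for as long as the delete runs.
lsm_lock = rm.acquire_exclusive(img_uuid, timeout=120)

# The user-initiated remove (target domain) asks for the very same key,
# so it blocks until it times out (a short timeout keeps the demo quick).
try:
    rm.acquire_exclusive(img_uuid, timeout=2)
except ResourceTimeout as e:
    print("second delete failed:", e)
finally:
    lsm_lock.release()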
It seems like a locking issue on the vdsm side - moving the component. @Fede - do we lock the disk only by volume id (rather than by volume and domain)?
(In reply to Daniel Erez from comment #3)
> It seems like a locking issue on the vdsm side - moving the component.
> @Fede - do we lock the disk only by volume id (rather than by volume and
> domain)?

More or less. We lock only by imgUUID:

@public
def deleteImage(self, sdUUID, spUUID, imgUUID, postZero=False,
                force=False):
    ...
    vars.task.getExclusiveLock(STORAGE, imgUUID)
    ...
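For comparison only, a sketch (again toy code with hypothetical names, not the actual vdsm change) showing that a lock keyed by (storage domain UUID, image UUID) would let the two deletes proceed independently, since the source- and target-domain deletes no longer share a resource:

import threading

class DomainScopedLocks:
    # Toy lock table keyed by (storage domain UUID, image UUID).
    def __init__(self):
        self._locks = {}
        self._mutex = threading.Lock()

    def try_acquire(self, sd_uuid, img_uuid, timeout):
        key = (sd_uuid, img_uuid)
        with self._mutex:
            lock = self._locks.setdefault(key, threading.Lock())
        return lock.acquire(timeout=timeout)

locks = DomainScopedLocks()
img = "f81dc8e6-0000-0000-0000-000000000000"  # hypothetical image UUID
sd1 = "11111111-0000-0000-0000-000000000000"  # hypothetical source domain
sd2 = "22222222-0000-0000-0000-000000000000"  # hypothetical target domain

print(locks.try_acquire(sd1, img, timeout=2))  # True: LSM cleanup on the source
print(locks.try_acquire(sd2, img, timeout=2))  # True: user delete on the target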
Verified on:
ovirt-engine-3.5.0-0.0.master.20140715172116.git4687dc1.el6.noarch
vdsm-4.16.0-27.git00146ed.el6.x86_64