Bug 1097648

Summary: deleteImage command after LSM fails with Resource timeout code 851, causing image status 'Illegal'
Product: Red Hat Enterprise Virtualization Manager
Reporter: Raz Tamir <ratamir>
Component: vdsm
Assignee: Daniel Erez <derez>
Status: CLOSED CURRENTRELEASE
QA Contact: Raz Tamir <ratamir>
Severity: high
Priority: unspecified
Version: 3.4.0
CC: amureini, bazulay, fsimonce, gklein, iheim, lpeer, ratamir, rbalakri, Rhev-m-bugs, scohen, yeylon
Target Release: 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: storage
Fixed In Version: ovirt-3.5.0-beta1.1
Doc Type: Bug Fix
Last Closed: 2015-02-16 13:41:32 UTC
Type: Bug
oVirt Team: Storage
Bug Blocks: 1142923, 1156165
Attachments:
vdsm and engine logs

Description Raz Tamir 2014-05-14 09:01:26 UTC
Created attachment 895411 [details]
vdsm and engine logs

Description of problem:
A remove VM disk operation that fails with ResourceTimeout causes the disk status to become illegal.

Thread-31::DEBUG::2014-05-13 17:07:02,341::resourceManager::819::ResourceManager.Owner::(acquire) 1367603e-562a-4b4e-99ae-39dc1b4367bd: request for 'Storage.3012f98b-b2d5-48c8-b93e-e547e70dab64' timed out after '120.000000' seconds
Thread-31::ERROR::2014-05-13 17:07:02,344::task::866::TaskManager.Task::(_setError) Task=`1367603e-562a-4b4e-99ae-39dc1b4367bd`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1506, in deleteImage
    vars.task.getExclusiveLock(STORAGE, imgUUID)
  File "/usr/share/vdsm/storage/task.py", line 1329, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()
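
The "code 851" from the bug summary is the numeric error code vdsm returns to the engine for this ResourceTimeout. A minimal sketch, assuming a vdsm-style storage exception hierarchy rather than quoting the actual source, of how such an error carries its code (Python 3):

    class StorageException(Exception):
        # Illustrative base class; vdsm's real hierarchy differs in detail.
        code = 100
        message = "General storage exception"

        def __init__(self, *value):
            self.value = value

        def __str__(self):
            return "%s: %s" % (self.message, repr(self.value))

    class ResourceTimeout(StorageException):
        # The numeric code the engine receives when the lock acquisition
        # times out (the "code 851" from the bug summary).
        code = 851
        message = "Resource timeout"

With this shape, str(ResourceTimeout()) renders as "Resource timeout: ()", matching the last line of the traceback above.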

The engine reports that the image was removed successfully, but with a storage failure:

e0da05060, Call Stack: null, Custom Event ID: -1, Message: Disk virtio_scsi_raw_False_disk was successfully removed from domain iscsi_1 with storage failure (User admin).


Version-Release number of selected component (if applicable):
vdsm-4.14.7-1.el6ev.x86_64
rhevm-3.4.0-0.18.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
Setup: 2 storage domains (iSCSI), a VM with 5 disks (2 preallocated, 2 sparse, 1 boot)
1. Live migrate all disks to the second SD
2. After all disks have moved, power off the VM
3. Deactivate all disks (except boot) and remove them

Actual results:
One of the disks ends up in status 'Illegal'

Expected results:
The disk should be deleted, or a proper error message should be shown

Additional info:

Comment 1 Allon Mureinik 2014-05-14 10:37:49 UTC
The delete operation on the storage failed. What else would you expect us to do?

Comment 2 Raz Tamir 2014-05-14 13:27:11 UTC
When performing live storage migration of a disk from sd1 to sd2, the operation finishes with a deleteImage on the image in sd1 (the source domain). For this operation, vdsm acquires an exclusive lock on the volume uuid.
Trying to delete the disk while the async deleteImage from the end of the LSM is still running causes vdsm to try to acquire the lock for the same resource (volume uuid), but this time for the target domain, sd2.
In fact, that resource is already locked for the deleteImage on sd1 (and both resources have the same ID, so they use the same lock). After 2 minutes, vdsm throws a ResourceTimeout exception and returns code 851 to the engine, which in turn changes the disk status to illegal, even though the volume was not deleted.
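
To make the contention concrete, here is a minimal, self-contained Python 3 sketch (a toy model, not vdsm code; the lock table, function names, and timeout handling are illustrative only) of two delete calls keyed on the image UUID alone, where the second caller blocks behind the first and times out:

    import threading

    class ResourceTimeout(Exception):
        """Raised when the exclusive lock cannot be acquired in time."""

    _locks = {}
    _locks_guard = threading.Lock()

    def _get_lock(key):
        # One lock object per resource key, shared by all callers.
        with _locks_guard:
            return _locks.setdefault(key, threading.Lock())

    def delete_image(sd_uuid, img_uuid, work, timeout=120):
        # The key uses only img_uuid, so the LSM cleanup on the source
        # domain and the user's delete on the target domain contend for
        # the same lock.
        lock = _get_lock(img_uuid)
        if not lock.acquire(timeout=timeout):
            raise ResourceTimeout("Storage.%s" % img_uuid)
        try:
            work(sd_uuid, img_uuid)
        finally:
            lock.release()

If the LSM cleanup thread calls delete_image('sd1', 'img', slow_cleanup) and, while it is still running, a second thread calls delete_image('sd2', 'img', remove), the second call raises ResourceTimeout after 120 seconds even though nothing on sd2 was touched.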

Comment 3 Daniel Erez 2014-05-14 13:38:14 UTC
It seems like a locking issue on the vdsm side - moving component.
@Fede - do we lock the disk only by volume id (rather than volume and domain)?

Comment 4 Federico Simoncelli 2014-05-14 16:06:40 UTC
(In reply to Daniel Erez from comment #3)
> It seems like a locking issue in vdsm side - moving component.
> @Fede - do we lock the disk only by volume id (rather than volume and
> domain)?

More or less. We lock only by imgUUID:

 @public
 def deleteImage(self, sdUUID, spUUID, imgUUID, postZero=False,
                 force=False):
     ...
     vars.task.getExclusiveLock(STORAGE, imgUUID)
     ...
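
One possible direction, sketched here only as an illustration and not necessarily the fix that shipped in ovirt-3.5.0-beta1.1, is to key the exclusive lock on both the storage domain and the image, so the source-domain cleanup and the target-domain delete no longer collide. Reusing the toy helpers from the sketch in comment 2:

    def delete_image(sd_uuid, img_uuid, work, timeout=120):
        # Composite key: deletes on sd1 and sd2 for the same image now take
        # different locks instead of contending on img_uuid alone, while two
        # deletes on the same domain still serialize correctly.
        lock = _get_lock("%s_%s" % (sd_uuid, img_uuid))
        if not lock.acquire(timeout=timeout):
            raise ResourceTimeout("Storage.%s_%s" % (sd_uuid, img_uuid))
        try:
            work(sd_uuid, img_uuid)
        finally:
            lock.release()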

Comment 5 Raz Tamir 2014-07-27 10:52:37 UTC
Verified - 
ovirt-engine-3.5.0-0.0.master.20140715172116.git4687dc1.el6.noarch
vdsm-4.16.0-27.git00146ed.el6.x86_64