Created attachment 895411 [details]
vdsm and engine logs

Description of problem:
Removing a vm disk can fail with ResourceTimeout, which leaves the disk in illegal status.

Thread-31::DEBUG::2014-05-13 17:07:02,341::resourceManager::819::ResourceManager.Owner::(acquire) 1367603e-562a-4b4e-99ae-39dc1b4367bd: request for 'Storage.3012f98b-b2d5-48c8-b93e-e547e70dab64' timed out after '120.000000' seconds
Thread-31::ERROR::2014-05-13 17:07:02,344::task::866::TaskManager.Task::(_setError) Task=`1367603e-562a-4b4e-99ae-39dc1b4367bd`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1506, in deleteImage
    vars.task.getExclusiveLock(STORAGE, imgUUID)
  File "/usr/share/vdsm/storage/task.py", line 1329, in getExclusiveLock
    timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 820, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()

The engine reports that the image was removed successfully, but with a storage failure:

e0da05060, Call Stack: null, Custom Event ID: -1, Message: Disk virtio_scsi_raw_False_disk was successfully removed from domain iscsi_1 with storage failure (User admin).

Version-Release number of selected component (if applicable):
vdsm-4.14.7-1.el6ev.x86_64
rhevm-3.4.0-0.18.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
Setup: 2 iSCSI storage domains, a vm with 5 disks (2 preallocated, 2 sparse, 1 boot).
1. Live migrate all disks to the second storage domain.
2. After all disks have moved, power off the vm.
3. Deactivate all disks (except the boot disk) and remove them.

Actual results:
One of the disks is left in illegal status.

Expected results:
The disk should be deleted, or a proper error message should be shown.

Additional info:
The delete operation on the storage failed. What else would you expect us to do?
When performing live storage migration of a disk from sd1 to sd2, the operation finishes with a deleteImage on the image in sd1 (the source domain). For this operation, vdsm acquires an exclusive lock keyed on the volume UUID. Trying to delete the disk while the async deleteImage at the end of the LSM is still running makes vdsm try to acquire a lock for the same resource (the same volume UUID, this time for the target domain, sd2). That resource is already locked by the deleteImage on sd1, and since both resources have the same ID they share the same lock. After 2 minutes vdsm raises a ResourceTimeout exception and returns code 851 to the engine, which in turn changes the disk status to illegal, even though the volume was not actually deleted.
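To make the contention concrete, here is a minimal, runnable toy sketch (plain Python with hypothetical names such as SimpleResourceManager and made-up UUIDs; it is not vdsm's ResourceManager) of how a second exclusive acquire on a lock keyed only by the image UUID times out while the first holder is still running:

import threading

class ResourceTimeout(Exception):
    pass

class SimpleResourceManager:
    # Toy stand-in for an exclusive-lock table keyed by a single name.
    def __init__(self):
        self._locks = {}
        self._mutex = threading.Lock()

    def acquire_exclusive(self, name, timeout):
        with self._mutex:
            lock = self._locks.setdefault(name, threading.Lock())
        if not lock.acquire(timeout=timeout):
            raise ResourceTimeout("request for %r timed out" % name)
        return lock

rm = SimpleResourceManager()
img_uuid = "f81dc8e6-0000-0000-0000-000000000000"  # hypothetical image UUID

# deleteImage at the end of the LSM (source domain) holds the lock,
# keyed only by the image UUID, for as long as the delete runs.
lsm_lock = rm.acquire_exclusive(img_uuid, timeout=120)

# The user-initiated remove (target domain) asks for the very same key,
# so it blocks until it times out (a short timeout keeps the demo quick).
try:
    rm.acquire_exclusive(img_uuid, timeout=2)
except ResourceTimeout as e:
    print("second delete failed:", e)
finally:
    lsm_lock.release()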
It seems like a locking issue on the vdsm side - moving the component. @Fede - do we lock the disk only by volume id (rather than by volume and domain)?
(In reply to Daniel Erez from comment #3)
> It seems like a locking issue on the vdsm side - moving the component.
> @Fede - do we lock the disk only by volume id (rather than by volume and
> domain)?

More or less. We lock only by imgUUID:

@public
def deleteImage(self, sdUUID, spUUID, imgUUID, postZero=False,
                force=False):
    ...
    vars.task.getExclusiveLock(STORAGE, imgUUID)
    ...
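For comparison only, a sketch (again toy code with hypothetical names, not the actual vdsm change) showing that a lock keyed by (storage domain UUID, image UUID) would let the two deletes proceed independently, since the source- and target-domain deletes no longer share a resource:

import threading

class DomainScopedLocks:
    # Toy lock table keyed by (storage domain UUID, image UUID).
    def __init__(self):
        self._locks = {}
        self._mutex = threading.Lock()

    def try_acquire(self, sd_uuid, img_uuid, timeout):
        key = (sd_uuid, img_uuid)
        with self._mutex:
            lock = self._locks.setdefault(key, threading.Lock())
        return lock.acquire(timeout=timeout)

locks = DomainScopedLocks()
img = "f81dc8e6-0000-0000-0000-000000000000"  # hypothetical image UUID
sd1 = "11111111-0000-0000-0000-000000000000"  # hypothetical source domain
sd2 = "22222222-0000-0000-0000-000000000000"  # hypothetical target domain

print(locks.try_acquire(sd1, img, timeout=2))  # True: LSM cleanup on the source
print(locks.try_acquire(sd2, img, timeout=2))  # True: user delete on the target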
Verified on:
ovirt-engine-3.5.0-0.0.master.20140715172116.git4687dc1.el6.noarch
vdsm-4.16.0-27.git00146ed.el6.x86_64