Bug 876663

Summary: [Storage] Cannot Extend storage domain while moving an image
Product: Red Hat Enterprise Virtualization Manager Reporter: Leonid Natapov <lnatapov>
Component: vdsm Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED CURRENTRELEASE QA Contact: Dafna Ron <dron>
Severity: high Docs Contact:
Priority: high    
Version: unspecified CC: abaron, bazulay, hateya, iheim, knesenko, lpeer, sgrinber, ykaul
Target Milestone: ---   
Target Release: 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: vdsm-4.10.2-10.0.el6ev Doc Type: Bug Fix
Doc Text:
Previously, it was impossible to extend a block storage domain (add more space) while another operation was in progress, for example the transfer (move) of an image from one storage domain to another. Now it is possible to extend the storage domain even while long-running tasks are in progress.
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 917401    
Attachments:
Description Flags
vdsm log none

Description Leonid Natapov 2012-11-14 16:51:22 UTC
Created attachment 645019 [details]
vdsm log

[Storage] Extending a storage domain fails with a timeout if other storage tasks are running in parallel. This might be critical during live storage migration.

Scenario:
Live migration of disks from one storage domain to another
Extend storage domain while moving disks between domains.
When we start moving a disk, the operation goes to the queue and fails after 2 minutes. So, in the case of live disk storage migration, both tasks fail.
Extend storage domain and move disk. 

full vdsm log attached.

Thread-15779::DEBUG::2012-11-14 17:56:37,839::resourceManager::705::ResourceManager.Owner::(acquire) 2bf82660-6468-4195-b712-33dccf9ae9a1: request for 'Storage.6093c92d-8817-4199-a2dc-9c04b07344b3' timed out after '120.000000' seconds
Thread-15779::ERROR::2012-11-14 17:56:37,840::task::853::TaskManager.Task::(_setError) Task=`2bf82660-6468-4195-b712-33dccf9ae9a1`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 38, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 631, in extendStorageDomain
    vars.task.getExclusiveLock(STORAGE, sdUUID)
  File "/usr/share/vdsm/storage/task.py", line 1301, in getExclusiveLock
    self.resOwner.acquire(namespace, resName, resourceManager.LockType.exclusive, timeout)
  File "/usr/share/vdsm/storage/resourceManager.py", line 706, in acquire
    raise se.ResourceTimeout()
ResourceTimeout: Resource timeout: ()

Comment 7 RHEL Program Management 2012-12-14 07:52:18 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 8 Ayal Baron 2012-12-26 10:18:28 UTC
Looking at this again it looks to me like it has nothing to do with the tasks, just a locking issue.
extendStorageDomain takes an exclusive lock on the domain and live storage migration (syncImage) takes a shared lock.  Can probably just change the lock to shared (I can't think of a reason for it to be exclusive).
Fede?
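
To illustrate the conflict described above, here is a minimal sketch of a shared/exclusive lock with a timeout. It is a toy model, not vdsm's resourceManager: the class SharedExclusiveLock and all of its methods are hypothetical names, the timeouts are shortened from 120 seconds to 0.1 seconds, and request queueing (which a real resource manager may apply) is not modelled.

```python
import threading
import time


class SharedExclusiveLock:
    """Toy shared/exclusive lock with a timeout (hypothetical, not vdsm code)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._shared_holders = 0
        self._exclusive_held = False

    def _blocked(self, exclusive):
        # An exclusive request conflicts with any holder;
        # a shared request conflicts only with an exclusive holder.
        if exclusive:
            return self._exclusive_held or self._shared_holders > 0
        return self._exclusive_held

    def acquire(self, exclusive, timeout):
        """Return True if the lock was granted within `timeout` seconds."""
        deadline = time.monotonic() + timeout
        with self._cond:
            while self._blocked(exclusive):
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # analogous to the ResourceTimeout in the log
                self._cond.wait(remaining)
            if exclusive:
                self._exclusive_held = True
            else:
                self._shared_holders += 1
            return True

    def release(self, exclusive):
        with self._cond:
            if exclusive:
                self._exclusive_held = False
            else:
                self._shared_holders -= 1
            self._cond.notify_all()


domain_lock = SharedExclusiveLock()

# A long-running image move holds the domain in shared mode.
move_granted = domain_lock.acquire(exclusive=False, timeout=0.1)

# An extend that asks for exclusive access times out behind the move.
extend_exclusive_granted = domain_lock.acquire(exclusive=True, timeout=0.1)

# An extend that asks for shared access is granted while the move still runs.
extend_shared_granted = domain_lock.acquire(exclusive=False, timeout=0.1)
```

In this model the exclusive request can only succeed once every shared holder has released the lock, which is why a long image move can outlast the timeout of a concurrent extend.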

Comment 9 Federico Simoncelli 2012-12-28 10:59:54 UTC
(In reply to comment #8)
> Looking at this again it looks to me like it has nothing to do with the
> tasks, just a locking issue.
> extendStorageDomain takes an exclusive lock on the domain and live storage
> migration (syncImage) takes a shared lock.  Can probably just change the
> lock to shared (I can't think of a reason for it to be exclusive).
> Fede?

You are correct. In this specific case the call was moveImage (probably during the copy of a template), but yes, it's a locking issue.

I think we can use a shared lock for extendStorageDomain as you suggested.

Thread-15692::INFO::2012-11-14 17:53:04,265::logUtils::37::dispatcher::(wrapper) Run and protect: moveImage(spUUID='e59ad86c-2e41-11e2-98f9-df8494d70a03', srcDomUUID='328fe2ec-4849-4e64-92d8-333f9ff0b09b', dstDomUUID='6093c92d-8817-4199-a2dc-9c04b07344b3', imgUUID='1630a362-5a55-47fb-86c1-a2af432d437a', vmUUID='', op=2, postZero='true', force='false')

[...]

Thread-15692::DEBUG::2012-11-14 17:53:04,793::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.6093c92d-8817-4199-a2dc-9c04b07344b3`ReqID=`e0d1d872-2eb7-4d60-931c-6f1b29dd318c`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-15692::DEBUG::2012-11-14 17:53:04,793::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.6093c92d-8817-4199-a2dc-9c04b07344b3' for lock type 'shared'
Thread-15692::DEBUG::2012-11-14 17:53:04,794::resourceManager::528::ResourceManager::(registerResource) Resource 'Storage.6093c92d-8817-4199-a2dc-9c04b07344b3' is free. Now locking as 'shared' (1 active user)

[...]

Thread-15779::DEBUG::2012-11-14 17:56:37,839::resourceManager::705::ResourceManager.Owner::(acquire) 2bf82660-6468-4195-b712-33dccf9ae9a1: request for 'Storage.6093c92d-8817-4199-a2dc-9c04b07344b3' timed out after '120.000000' seconds
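
For triage of similar reports, lock timeouts like the one above can be pulled out of a vdsm log with a short script. This is only a sketch: the regular expression is inferred from the log lines quoted in this bug, other vdsm versions may format these messages differently, and the names TIMEOUT_RE and find_lock_timeouts are hypothetical. It expects each log record on a single, unwrapped line.

```python
import re

# Pattern inferred from the log lines quoted in this report (an assumption,
# not an official vdsm log format specification).
TIMEOUT_RE = re.compile(
    r"^(?P<thread>Thread-\d+)::(?P<level>\w+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r".*?(?P<task>[0-9a-f-]{36}): request for '(?P<resource>[^']+)' "
    r"timed out after '(?P<seconds>[\d.]+)' seconds"
)


def find_lock_timeouts(lines):
    """Return one dict per resource-timeout record found in `lines`."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hit = match.groupdict()
            hit["seconds"] = float(hit["seconds"])
            hits.append(hit)
    return hits


# Two records copied from this report: a lock registration and a timeout.
sample = [
    "Thread-15692::DEBUG::2012-11-14 17:53:04,793::resourceManager::486::"
    "ResourceManager::(registerResource) Trying to register resource "
    "'Storage.6093c92d-8817-4199-a2dc-9c04b07344b3' for lock type 'shared'",
    "Thread-15779::DEBUG::2012-11-14 17:56:37,839::resourceManager::705::"
    "ResourceManager.Owner::(acquire) 2bf82660-6468-4195-b712-33dccf9ae9a1: "
    "request for 'Storage.6093c92d-8817-4199-a2dc-9c04b07344b3' timed out "
    "after '120.000000' seconds",
]

timeouts = find_lock_timeouts(sample)
for hit in timeouts:
    print(hit["timestamp"], hit["thread"], hit["resource"], hit["seconds"])
```

Matching the resource name from a timeout against earlier registerResource lines for the same resource shows which thread was holding the lock when the request expired.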

Comment 10 Federico Simoncelli 2013-01-02 15:37:21 UTC
commit 0b56650267094de24d32d209ca3b5bd02d19685a
Author: Federico Simoncelli <fsimonce>
Date:   Fri Dec 28 10:12:26 2012 -0500

    domain: use shared lock for extendStorageDomain
    
    The extendStorageDomain command shouldn't hold an exclusive lock on
    the storage domain since it can be executed also during other long
    tasks (e.g.: moveImage).

http://gerrit.ovirt.org/#/c/10451/

Comment 13 Dafna Ron 2013-03-13 13:45:17 UTC
verified on sf10 with vdsm-4.10.2-11.0.el6ev.x86_64 libvirt-0.10.2-18.el6_4.eblake.2.x86_64

Comment 14 Itamar Heim 2013-06-11 09:26:22 UTC
3.2 has been released
