Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1100527

Summary: ResourceManager queue is not being cleared properly, causing other tasks to time out while trying to get the locked resource
Product: Red Hat Enterprise Virtualization Manager Reporter: Marina Kalinin <mkalinin>
Component: vdsmAssignee: Federico Simoncelli <fsimonce>
Status: CLOSED WORKSFORME QA Contact: Aharon Canan <acanan>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.3.0CC: amureini, bazulay, dfediuck, fsimonce, gklein, iheim, lpeer, mkalinin, scohen, tdosek, yeylon
Target Milestone: ---   
Target Release: 3.4.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-06-08 22:10:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Marina Kalinin 2014-05-23 03:04:53 UTC
The queue of tasks waiting for the storage resource is not cleaned up properly (I see it mainly after an exclusive lock on the resource was requested by _shutDownUpgrade in sp.py).

As a result, even after the task releases the resource, the queue does not get cleared; other tasks enter the queue, time out, and never run.

An additional question is what causes this queue to empty itself at some point, about 2 hours after the initial lock was granted.
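For readers unfamiliar with the resourceManager internals, here is a minimal, assumption-based Python sketch of the semantics visible in the logs below (the class and all names are invented for illustration; this is not vdsm code): a request that times out is only marked canceled and stays queued until the holder's release path walks the queue.

```python
import threading

class ToyResource:
    """Toy model of the queue behavior seen in the logs below.
    Invented illustration code, NOT vdsm's implementation: a request
    that times out is only *marked* canceled and stays in the queue;
    stale entries are popped only when the current holder releases."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._held = False
        self._queue = []  # pending requests, oldest first

    def acquire(self, timeout):
        with self._mutex:
            if not self._held:
                self._held = True
                return True
            req = {"canceled": False, "granted": threading.Event()}
            self._queue.append(req)
        if req["granted"].wait(timeout):
            return True
        with self._mutex:
            req["canceled"] = True  # note: NOT removed from the queue here
        return False

    def release(self):
        with self._mutex:
            # The "Handling top request ... was canceled, Ignoring it" loop:
            # drop canceled entries until a live request can be granted.
            while self._queue:
                req = self._queue.pop(0)
                if req["canceled"]:
                    continue
                req["granted"].set()  # lock stays held by the new owner
                return
            self._held = False
```

In this simplified model, a timed-out waiter leaves a dead entry behind exactly as the logs show; only the holder's eventual release clears the backlog.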


Version-Release number of selected component (if applicable):
rhevm-3.3.1-0.48
vdsm-4.13.2-0.11

How reproducible:
On the customer site, every 40 minutes or so.
The customer has about 40 FC storage domains.
Once the ISO domain was put into maintenance, the problem occurred less frequently, but it still happens.
================================================================
Here are the logs from vdsm on the SPM:

We are looking into the resource: 7f7534f7-e5d4-44b3-b34a-4d0743734e7d

[mku: start here, requesting lock for upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d (why upgrade, btw?)]
vdsm.log.2.xz:Thread-8687::DEBUG::2014-05-22 14:51:35,616::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`a6455513-d6ef-445c-b118-2008fea52646`::Request was made in '/usr/share/vdsm/storage/sp.py' line '345' at '_shutDownUpgrade'
[mku: requesting exclusive lock]
vdsm.log.2.xz:Thread-8687::DEBUG::2014-05-22 14:51:35,721::resourceManager::541::ResourceManager::(registerResource) Trying to register resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d' for lock type 'exclusive'
vdsm.log.2.xz:Thread-8687::DEBUG::2014-05-22 14:51:36,001::resourceManager::600::ResourceManager::(registerResource) Resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is free. Now locking as 'exclusive' (1 active user)
[mku: now exclusive lock is granted]
vdsm.log.2.xz:Thread-8687::DEBUG::2014-05-22 14:51:36,097::resourceManager::237::ResourceManager.Request::(grant) ResName=`Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`a6455513-d6ef-445c-b118-2008fea52646`::Granted request
[mku: here we are trying to release the resource]
vdsm.log.2.xz:Thread-8687::DEBUG::2014-05-22 14:51:36,366::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d'
vdsm.log.2.xz:Thread-8687::DEBUG::2014-05-22 14:51:36,497::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d' (0 active users)
vdsm.log.2.xz:Thread-8687::DEBUG::2014-05-22 14:51:36,524::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is free, finding out if anyone is waiting for it.
[mku: here apparently we have released the resource]
vdsm.log.2.xz:Thread-8687::DEBUG::2014-05-22 14:51:36,536::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d', Clearing records.

[mku: here is next thread trying to register to the resource]
vdsm.log.2.xz:Thread-8722::DEBUG::2014-05-22 14:54:36,584::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`d3359f6c-f07a-4c68-ac82-4fe1c5171236`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1216' at 'deactivateStorageDomain'
vdsm.log.2.xz:Thread-812::DEBUG::2014-05-22 14:54:36,609::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
vdsm.log.2.xz:Thread-8722::DEBUG::2014-05-22 14:54:36,757::resourceManager::541::ResourceManager::(registerResource) Trying to register resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' for lock type 'exclusive'
vdsm.log.2.xz:Thread-812::DEBUG::2014-05-22 14:54:36,830::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
[mku: here we enter the queue for the first time and fail; only 1 request reported in the queue]
vdsm.log.2.xz:Thread-8722::DEBUG::2014-05-22 14:54:36,971::resourceManager::576::ResourceManager::(registerResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is currently locked, Entering queue (1 in queue)
vdsm.log.2.xz:Thread-8722::DEBUG::2014-05-22 14:56:37,161::resourceManager::209::ResourceManager.Request::(cancel) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`d3359f6c-f07a-4c68-ac82-4fe1c5171236`::Canceled request
vdsm.log.2.xz:Thread-8722::DEBUG::2014-05-22 14:56:37,559::resourceManager::818::ResourceManager.Owner::(acquire) 78619516-83df-42d8-8f2e-7d3505f200e6: request for 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' timed out after '120.000000' seconds
[mku: we continue requesting the lock for another task, and enter the queue again - now there are 2]
vdsm.log.2.xz:Thread-8744::DEBUG::2014-05-22 14:56:43,561::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`433ee983-dce1-485c-956f-be28daecdf53`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '2562' at 'getStoragePoolInfo'
vdsm.log.2.xz:Thread-8744::DEBUG::2014-05-22 14:56:43,577::resourceManager::541::ResourceManager::(registerResource) Trying to register resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' for lock type 'shared'
vdsm.log.2.xz:Thread-8744::DEBUG::2014-05-22 14:56:43,618::resourceManager::576::ResourceManager::(registerResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is currently locked, Entering queue (2 in queue)
[mku: it continues even further:]
vdsm.log.1.xz:Thread-9115::DEBUG::2014-05-22 15:30:04,722::resourceManager::576::ResourceManager::(registerResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is currently locked, Entering queue (16 in queue)

[mku: now, 2 hours later - the Owner released all the requests and now all the tasks are popped off the queue]
Thread-8687::DEBUG::2014-05-22 15:31:33,831::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d': < ResourceRef 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d', isValid: 'True' obj: 'None'>}
Thread-8687::DEBUG::2014-05-22 15:31:34,125::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-8687::DEBUG::2014-05-22 15:31:34,397::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d'
Thread-8687::DEBUG::2014-05-22 15:31:34,746::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' (0 active users)
Thread-8687::DEBUG::2014-05-22 15:31:34,953::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is free, finding out if anyone is waiting for it.
Thread-413::DEBUG::2014-05-22 15:31:34,994::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-8687::DEBUG::2014-05-22 15:31:35,063::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 16 requests in queue. Handling top request.
Thread-413::DEBUG::2014-05-22 15:31:35,108::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-8687::DEBUG::2014-05-22 15:31:35,203::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - exclusive: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:35,353::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 15 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:35,516::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:35,668::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 14 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:35,811::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:35,971::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 13 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:36,148::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:36,361::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 12 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:36,506::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:36,671::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 11 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:36,862::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:37,095::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 10 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:37,257::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-9132::DEBUG::2014-05-22 15:31:37,378::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-8687::DEBUG::2014-05-22 15:31:37,483::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 9 requests in queue. Handling top request.
Thread-9132::DEBUG::2014-05-22 15:31:37,541::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-8687::DEBUG::2014-05-22 15:31:37,637::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:37,801::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 8 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:37,840::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:37,866::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 7 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:37,876::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:37,961::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 6 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:38,001::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:38,093::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 5 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:38,250::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:38,452::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 4 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:38,614::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:38,880::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 3 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:39,055::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-8687::DEBUG::2014-05-22 15:31:39,228::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 2 requests in queue. Handling top request.
Thread-8687::DEBUG::2014-05-22 15:31:39,395::resourceManager::662::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: canceled' was canceled, Ignoring it.
Thread-812::DEBUG::2014-05-22 15:31:39,458::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
Thread-8687::DEBUG::2014-05-22 15:31:39,627::resourceManager::653::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' has 1 requests in queue. Handling top request.
Thread-812::DEBUG::2014-05-22 15:31:39,693::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-8687::DEBUG::2014-05-22 15:31:39,862::resourceManager::237::ResourceManager.Request::(grant) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`f77a73c1-ceb4-4057-a1e6-df6795b3ea6a`::Granted request
Thread-8687::DEBUG::2014-05-22 15:31:40,110::resourceManager::685::ResourceManager::(releaseResource) Request 'Request for Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d - shared: granted' was granted
Thread-8687::DEBUG::2014-05-22 15:31:40,281::resourceManager::693::ResourceManager::(releaseResource) This is a shared lock. Granting all shared requests

[mku: now it starts again - a slightly different originator of the lock request, and this time the queue is cleared properly.]
Thread-9147::DEBUG::2014-05-22 15:32:43,284::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`5360b79e-345e-4a8a-9ef2-a0980f58054d`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '587' at 'spmStart'
Thread-9147::DEBUG::2014-05-22 15:32:43,512::resourceManager::541::ResourceManager::(registerResource) Trying to register resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' for lock type 'exclusive'
Thread-9147::DEBUG::2014-05-22 15:32:43,763::resourceManager::600::ResourceManager::(registerResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is free. Now locking as 'exclusive' (1 active user)
Thread-9147::DEBUG::2014-05-22 15:32:43,918::resourceManager::237::ResourceManager.Request::(grant) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`5360b79e-345e-4a8a-9ef2-a0980f58054d`::Granted request
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:33:43,329::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`d8a384d2-02eb-4333-b470-0026a2a7584c`::Request was made in '/usr/share/vdsm/storage/sp.py' line '432' at '_upgradePool'
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:33:43,472::resourceManager::541::ResourceManager::(registerResource) Trying to register resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d' for lock type 'exclusive'
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:33:43,643::resourceManager::600::ResourceManager::(registerResource) Resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is free. Now locking as 'exclusive' (1 active user)
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:33:43,845::resourceManager::237::ResourceManager.Request::(grant) ResName=`Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`d8a384d2-02eb-4333-b470-0026a2a7584c`::Granted request

83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:34:19,426::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d'

83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:34:27,394::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d' (0 active users)
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:34:27,553::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is free, finding out if anyone is waiting for it.
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:34:27,762::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.upgrade_7f7534f7-e5d4-44b3-b34a-4d0743734e7d', Clearing records.

83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:35:52,979::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d': < ResourceRef 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d', isValid: 'True' obj: 'None'>}
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:35:53,197::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-9283::DEBUG::2014-05-22 15:35:53,549::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:35:53,546::resourceManager::615::ResourceManager::(releaseResource) Trying to release resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d'
Thread-9283::DEBUG::2014-05-22 15:35:53,668::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:35:53,723::resourceManager::634::ResourceManager::(releaseResource) Released resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' (0 active users)
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:35:54,058::resourceManager::640::ResourceManager::(releaseResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is free, finding out if anyone is waiting for it.
83660ad7-b4fa-46f9-b77e-2763ffed8f6f::DEBUG::2014-05-22 15:35:54,327::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d', Clearing records.

Thread-9287::DEBUG::2014-05-22 15:36:00,937::resourceManager::976::ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
Thread-9289::DEBUG::2014-05-22 15:36:03,314::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`31ae0a8d-7e1d-4f34-9433-d5fd772eb642`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '2562' at 'getStoragePoolInfo'
Thread-9289::DEBUG::2014-05-22 15:36:03,476::resourceManager::541::ResourceManager::(registerResource) Trying to register resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' for lock type 'shared'
Thread-9289::DEBUG::2014-05-22 15:36:03,787::resourceManager::600::ResourceManager::(registerResource) Resource 'Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d' is free. Now locking as 'shared' (1 active user)
Thread-9289::DEBUG::2014-05-22 15:36:03,982::resourceManager::237::ResourceManager.Request::(grant) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`31ae0a8d-7e1d-4f34-9433-d5fd772eb642`::Granted request

Comment 4 Federico Simoncelli 2014-05-23 10:53:08 UTC
The system seems to be under heavy load and operations are taking longer than expected:

- 38 block storage domains attached
- 15 VMs running on the SPM (at least 7 with host cpu usage over 50%)
- vdsm under heavy load (2724% cpu usage?)
- load average: 5.91, 5.54, 5.56

USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
qemu      42778 13.4  1.2 18078544 12831924 ?   Sl   03:11 114:20 /usr/libexec/qemu-kvm
qemu      61104 23.4  0.6 18077664 6696008 ?    Sl   03:14 198:47 /usr/libexec/qemu-kvm
qemu     149188 24.8  0.9 18040884 9870000 ?    Sl   02:58 215:10 /usr/libexec/qemu-kvm
qemu     161767 26.1  0.8 18037604 8630748 ?    Sl   03:02 224:36 /usr/libexec/qemu-kvm
qemu     135355 64.0  1.5 18046408 16905068 ?   Sl   02:55 556:10 /usr/libexec/qemu-kvm
qemu     129328 70.0  1.5 18076460 16881296 ?   Sl   02:52 609:44 /usr/libexec/qemu-kvm
qemu      82353 86.0  0.1 18076496 2077036 ?    Sl   03:20 724:54 /usr/libexec/qemu-kvm
qemu      30986  129  1.5 18039844 16795672 ?   Sl   03:08 1110:15 /usr/libexec/qemu-kvm
qemu      17820  132  1.5 18038676 16786008 ?   Sl   03:05 1136:22 /usr/libexec/qemu-kvm
qemu      54152  156  1.5 18048112 16812268 ?   Sl   03:13 1327:17 /usr/libexec/qemu-kvm
qemu     119067  216  0.2 9304372 3094524 ?     Sl   02:49 1891:31 /usr/libexec/qemu-kvm
vdsm      98759 2724  0.0 9797276 143440 ?      S<l  02:37 24128:44 /usr/bin/python /usr/share/vdsm/vdsm


That said, it seems that updating the VMs' OVFs is taking too long (3 minutes and 41 seconds):

Thread-8651::INFO::2014-05-22 14:48:39,717::logUtils::44::dispatcher::(wrapper) Run and protect: updateVM(spUUID='...
...
Thread-8651::ERROR::2014-05-22 14:52:20,187::SecureXMLRPCServer::168::root::(handle_error) client ('10.113.136.147', 52763)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 570, in process_request_thread
  File "/usr/lib64/python2.6/SocketServer.py", line 332, in finish_request
  File "/usr/lib64/python2.6/SocketServer.py", line 627, in __init__
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 490, in do_POST
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 384, in send_response
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 390, in send_header
  File "/usr/lib64/python2.6/socket.py", line 324, in write
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 222, in write
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 205, in _write_bio
SSLError: (32, 'Broken pipe')

Engine side:

2014-05-22 16:48:18,497 INFO  [org.ovirt.engine.core.bll.OvfDataUpdater] (DefaultQuartzScheduler_Worker-7) Attempting to update VMs/Templates Ovf.
2014-05-22 16:48:18,498 INFO  [org.ovirt.engine.core.bll.OvfDataUpdater] (DefaultQuartzScheduler_Worker-7) Attempting to update VM OVFs in Data Center VCP
...
2014-05-22 16:51:28,124 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-7) IrsBroker::Failed::UpdateVMVDS due to: TimeoutException:

I'd like to point out that the entire size of the XML-RPC updateVM request is ~450 KB (75 VMs)!

On the vdsm side, the majority of the time is spent trying to acquire and release the VMs' OVF locks (useless?):

Thread-8651::DEBUG::2014-05-22 14:48:39,765::resourceManager::197::ResourceManager.Request::(__init__) ResName=`Storage.7f7534f7-e5d4-44b3-b34a-4d0743734e7d`ReqID=`8fa1e150-cd58-450f-93df-fabac7852e22`::Request was made in '/usr/share/vdsm/storage/hsm.py' line '1330' at 'updateVM'
...
Thread-8651::DEBUG::2014-05-22 14:50:58,902::task::974::TaskManager.Task::(_decref) Task=`249892bc-d334-4d48-9c22-b73c6c3f296f`::ref 1 aborting False


Thread-8651::DEBUG::2014-05-22 14:51:14,212::resourceManager::939::ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {'...
...
Thread-8651::DEBUG::2014-05-22 14:52:19,346::resourceManager::648::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.a843d0b8-38cf-4056-b08d-9e1d31dd7e07_80767ba5-7c87-45dd-9a32-7330ed889af5', Clearing records.

For a total of 3 minutes and 24 seconds (just to acquire/release 76 locks).
If we hadn't spent this time, the operation would have lasted only ~17 seconds.
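The arithmetic behind those figures, taken from the log timestamps above (14:48:39 to 14:52:20):

```python
# updateVM ran 3 minutes 41 seconds in total, of which 3 minutes 24 seconds
# went to lock acquire/release, per the log excerpts above.
total_s = 3 * 60 + 41
lock_overhead_s = 3 * 60 + 24

print(total_s - lock_overhead_s)       # 17 -> the ~17 seconds of actual work
print(round(lock_overhead_s / 76, 2))  # 2.68 -> seconds per acquire/release pair
```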

The updateVM failure triggers an spmStop command:

2014-05-22 16:51:28,124 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-7) IrsBroker::Failed::UpdateVMVDS due to: TimeoutException:
2014-05-22 16:51:30,611 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-7) START, SpmStopVDSCommand(HostName = vcpfscrx900h02.vas.vodacom.corp, HostId = febf7f89-55f7-46e6-b65e-d4ee23f96f72, storagePoolId = 7f7534f7-e5d4-44b3-b34a-4d0743734e7d), log id: 5e0b63f

which on the vdsm side takes a really long time (40 minutes and 7 seconds) and times out as well:

Thread-8687::INFO::2014-05-22 14:51:33,565::logUtils::44::dispatcher::(wrapper) Run and protect: spmStop(spUUID='7f7534f7-e5d4-44b3-b34a-4d0743734e7d', options=None)
...
Thread-8687::ERROR::2014-05-22 15:31:40,828::SecureXMLRPCServer::168::root::(handle_error) client ('10.113.136.147', 34969)
Traceback (most recent call last):
  File "/usr/lib64/python2.6/SocketServer.py", line 570, in process_request_thread
  File "/usr/lib64/python2.6/SocketServer.py", line 332, in finish_request
  File "/usr/lib64/python2.6/SocketServer.py", line 627, in __init__
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 329, in handle
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 323, in handle_one_request
  File "/usr/lib64/python2.6/SimpleXMLRPCServer.py", line 490, in do_POST
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 384, in send_response
  File "/usr/lib64/python2.6/BaseHTTPServer.py", line 390, in send_header
  File "/usr/lib64/python2.6/socket.py", line 324, in write
  File "/usr/lib64/python2.6/socket.py", line 303, in flush
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 222, in write
  File "/usr/lib64/python2.6/site-packages/M2Crypto/SSL/Connection.py", line 205, in _write_bio
SSLError: bad write retry

The majority of the time (39 minutes and 22 seconds) is spent in:

Thread-8687::DEBUG::2014-05-22 14:51:54,540::sp::381::Storage.StoragePool::(cleanupMasterMount) unmounting /rhev/data-center/mnt/blockSD/80767ba5-7c87-45dd-9a32-7330ed889af5/master
Thread-8687::DEBUG::2014-05-22 15:31:16,887::mount::226::Storage.Misc.excCmd::(_runcmd) '/usr/bin/sudo -n /bin/umount /rhev/data-center/mnt/blockSD/80767ba5-7c87-45dd-9a32-7330ed889af5/master' (cwd None)

If we hadn't spent this time in cleanupMasterMount, spmStop would have taken only 45 seconds.

There's no logging in this area but looking at the code I believe the entire time is spent in the first lines of blockSD.BlockStorageDomain.doUnmountMaster():

1211     def doUnmountMaster(cls, masterdir):
...
1217         cls.__handleStuckUmount(masterdir)
1218         try:
1219             masterMount = mount.getMountFromTarget(masterdir)
1220         except OSError as ex:
1221             if ex.errno == errno.ENOENT:
1222                 return
1224             raise
1225         if masterMount.isMounted():
1226             # Try umount, take 1
1227             try:
1228                 masterMount.umount()

probably in __handleStuckUmount or in getMountFromTarget/isMounted.
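Since there is no logging in that area, one way to confirm where the time goes would be a throwaway timing decorator around those three calls (a hypothetical debugging aid, not part of vdsm):

```python
import functools
import logging
import time

def timed(fn):
    """Hypothetical debugging decorator: temporarily wrap
    __handleStuckUmount, getMountFromTarget and isMounted with it to
    confirm which of the three steps is eating the ~39 minutes."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()  # the host runs Python 2.6, so no time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            logging.debug("%s took %.2f seconds",
                          fn.__name__, time.time() - start)
    return wrapper
```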

Bottom line:

- we can try to make updateVM faster (probably getting rid of the locks if they're useless)
- we can try to investigate what is taking so long in doUnmountMaster

Anyway, in general this seems related to the high CPU usage on the host.
Probably Nir's work on bug 1095907 (also related to the mount module) should improve the situation here.

Meanwhile I suggest lowering the number of VMs currently running on the SPM, and possibly tweaking the OvfItemsCountPerUpdate config (maybe set it to 10-20) to reduce the number of OVFs sent per update.
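That tweak would be applied with engine-config on the RHEV-M machine; a sketch (the value 15 is just the midpoint of the suggested 10-20 range, not a recommendation):

```shell
# Check the current value:
engine-config -g OvfItemsCountPerUpdate

# Lower the number of OVFs sent per updateVM call:
engine-config -s OvfItemsCountPerUpdate=15

# The engine must be restarted for the change to take effect:
service ovirt-engine restart
```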

Comment 5 Allon Mureinik 2014-05-23 16:37:16 UTC
(In reply to Federico Simoncelli from comment #4)
> Bottom line:
> 
> - we can try to make updateVM faster (probably getting rids of the locks if
> they're useless)
Looking forward: in 3.4.0, on 3.4.0 DCs, we don't use it at all, but stream OVFs to a special disk per domain. So this should be mitigated considerably (see bug 1032686).

I'd suggest the following course of action:

1. Tweak the current system's config, as Federico suggested:
> Meanwhile I suggest to lower the amount of VMs that are currently running on
> the SPM and eventually tweak the OvfItemsCountPerUpdate config (maybe set it
> to 10-20) to lower the amount of ovf sent.

2. Roll out 3.3.3, which includes quite a bit of work to reduce CPU consumption:
> Anyway in general this seems related to the high cpu usage on the host.
> Probably Nir's work on bug 1095907 (also related to the mount module) should
> improve the situation here.

If this isn't enough, we should re-assess, but generally I don't want to put too much work into investigating a piece of code we're deprecating anyway.

Marina - we're basically waiting for your input on whether the above does the trick for the customer or not.

Comment 6 Marina Kalinin 2014-05-23 16:43:22 UTC
Allon, thank you.
This is exactly the action plan we have decided on.

Trying to implement Fede's suggestions with the customer now.

Comment 15 Marina Kalinin 2014-05-27 23:25:33 UTC
Doron / Allon,

Assuming the customer has 80 cores, what value would you recommend setting SpmVCpuConsumption to?
The customer set it to 9 and didn't notice any significant effect.
What would be a rule of thumb for this parameter? 50%?
Or is it insignificant now, and should we rather see how the customer environment behaves after the upgrade to 3.3.3?

Thanks,
Marina.

Comment 16 Doron Fediuck 2014-05-28 07:25:26 UTC
(In reply to Marina from comment #15)
> Doron / Allon,
> 
> Assuming the customer has 80 cores, what value would you recommend to set to
> SpmVCpuConsumption?
> The customer set it to 9 and didn't notice any significant effect.
> What would be a rule of thumb for this parameter? 50%?
> Or it is insignificant now, and we better see how the customer environment
> behaves after upgrade to 3.3.3?
> 
> Thanks,
> Marina.

Marina,
SpmVCpuConsumption is not directly related to this issue, so I suggest we
discuss it over email.

To answer your questions: SpmVCpuConsumption is one factor among many that are
considered when calculating load for load balancing and new-VM placement.
Additionally, we should consider VcpuConsumptionPercentage, which is used
together with SpmVCpuConsumption.

Each flow uses SpmVCpuConsumption differently, but generally speaking the
basic behavior did not change.

For new VM placement, SpmVCpuConsumption is being used for optimization
only as follows:

//vcpu = VcpuConsumptionPercentage
//spmCpu = SpmVCpuConsumption
host-score = (hostCpu / vcpu) + (pendingVcpus + vm.getNumOfCpus() + spmCpu) / hostCores;

For load balancing, SpmVCpuConsumption is being used for load calculations
as follows:

host-load = SpmVCpuConsumption * VcpuConsumptionPercentage / vds.getCpuCores() + vds.getUsageCpuPercent()
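A Python transliteration of the two formulas (names mirror the pseudocode above; the sample numbers, including a VcpuConsumptionPercentage of 10, are hypothetical):

```python
def host_score(host_cpu, vcpu, pending_vcpus, vm_cpus, spm_cpu, host_cores):
    """New-VM placement score; mirrors the host-score formula above."""
    return (host_cpu / vcpu) + (pending_vcpus + vm_cpus + spm_cpu) / host_cores

def host_load(spm_vcpu_consumption, vcpu_percentage, cpu_cores, usage_cpu_percent):
    """SPM host load for load balancing; mirrors the host-load formula above."""
    return (spm_vcpu_consumption * vcpu_percentage / cpu_cores
            + usage_cpu_percent)

# e.g. the customer's 80-core SPM host at 60% CPU with SpmVCpuConsumption=9
# and an assumed VcpuConsumptionPercentage of 10:
print(host_load(9, 10, 80, 60))  # 61.125
```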


So... as you can see, this changes from one host to another and from
one load pattern to another.
If you're looking to improve things using load balancing and wish to have
as few VMs as possible on the SPM host, you can try increasing it
and then give the system time to balance based on the new settings (please
verify the value is refreshed and a restart is not needed). A positive
indication will be when the logs show load balancing migrating VMs
from the SPM host to the other hosts.

Comment 17 Allon Mureinik 2014-05-29 08:37:39 UTC
Marina,

After implementing Doron's suggestions and upgrading to 3.3.3, was the issue resolved?
Are there any more action items on dev's side?

Comment 18 Tomas Dosek 2014-05-29 08:42:47 UTC
They have not upgraded to 3.3.3 yet (the maintenance window is scheduled for the end of this week).

However, implementing all the config changes and repairing the RCs on the sanlock side did make the environment stable.

I suggest we keep this bug open until they confirm the environment stayed stable after the upgrade, and close it as NOTABUG afterwards.

Does that make sense to you, Allon?

Comment 19 Allon Mureinik 2014-05-29 09:01:30 UTC
(In reply to Tomas Dosek from comment #18)
> They did not upgrade to 3.3.3 so far (maintenance window is scheduled for
> the end of this week).
> 
> However implementing all the config changes and repairing the RCs from
> sanlock side did make the environment stable.
> 
> I suggest we keep this bug opened until they confirm the env. stood stable
> after the upgrade and close it as NOTABUG afterwards.
> 
> Does that make sense to you Allon?
Yup, sounds great to me.
Please report the status after the upgrade.

Thanks!

Comment 20 Marina Kalinin 2014-05-29 14:25:50 UTC
Sure!

Comment 21 Allon Mureinik 2014-06-08 22:10:32 UTC
(In reply to Allon Mureinik from comment #19)
> (In reply to Tomas Dosek from comment #18)
> > They did not upgrade to 3.3.3 so far (maintenance window is scheduled for
> > the end of this week).
> > 
> > However implementing all the config changes and repairing the RCs from
> > sanlock side did make the environment stable.
> > 
> > I suggest we keep this bug opened until they confirm the env. stood stable
> > after the upgrade and close it as NOTABUG afterwards.
> > 
> > Does that make sense to you Allon?
> Yup, sounds great to me.
> Please report the status after the upgrade.

Discussed with Marina offline.
The customer env. seems to be stable, and it will take some time to orchestrate an upgrade. 
I'm closing this BZ as WORKSFORME (since, well... it does :-)). Feel free to reopen it if the issue resurfaces.