Bug 876561 - [Storage] [Infra] Exception in vdsm while trying to refresh (read) storage domain (task fails on timeout - keyError)
Keywords:
Status: CLOSED DUPLICATE of bug 908068
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.2.0
Assignee: Liron Aravot
QA Contact: Haim
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2012-11-14 13:17 UTC by Leonid Natapov
Modified: 2016-02-10 16:48 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-03-17 20:04:18 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
vdsm log (690.75 KB, application/octet-stream)
2012-11-14 13:17 UTC, Leonid Natapov

Description Leonid Natapov 2012-11-14 13:17:57 UTC
Created attachment 644828 [details]
vdsm log

[Storage][Infra] Although the extend succeeded, vdsm raises an exception (collecting storage domain monitoring information fails) while trying to extend a storage domain alongside another storage operation, such as moving a disk from one storage domain to another. Full vdsm log attached.
Scenario:

1. Set up with 2 iSCSI storage domains.
2. Move a disk from one storage domain to the other.
3. When the move process has begun, go to the Storage tab of the UI and edit one of the storage domains.
4. Try to extend the storage domain.

Thread-460::ERROR::2012-11-14 13:08:44,558::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 328fe2ec-4849-4e64-92d8-333f9ff0b09b monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 119, in _realProduce
    self.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 1063, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 1048, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/sdc.py", line 82, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 70, in rescan
    supervdsm.getProxy().forceIScsiScan()
  File "/usr/share/vdsm/supervdsm.py", line 69, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 60, in <lambda>
    **kwargs)
  File "<string>", line 2, in forceIScsiScan
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError:
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '1879e10'
---------------------------------------------------------------------------
Thread-460::DEBUG::2012-11-14 13:08:44,581::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) Domain 328fe2ec-4849-4e64-92d8-333f9ff0b09b changed its status to Invalid

Comment 2 Yaniv Bronhaim 2012-11-15 19:18:23 UTC
This part of the log is not related to this bug.
The bug appears when the extend VG operation times out. This happens because the extend operation waits too long for a storage resource that another, earlier operation on the same resource had not yet released when the extend request was received.

Please add the right log.

Comment 3 Haim 2012-11-18 08:23:38 UTC
(In reply to comment #2)
> This part of the log is not related to this bug.
> The bug appears when the extend VG operation times out. This happens because
> the extend operation waits too long for a storage resource that another,
> earlier operation on the same resource had not yet released when the extend
> request was received.
> 
> Please add the right log.

Changed the title - please fix the KeyError in task handling.

Comment 4 RHEL Program Management 2012-12-14 08:50:50 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 8 Yaniv Bronhaim 2013-03-03 10:33:08 UTC
I don't recall moving the bug to MODIFIED; it was probably done by mistake.. sorry about that.
I'm not sure why this exception appears: the refresh failed once and worked again 5 minutes later, supervdsm functions correctly in the previous and following calls, and no restart of vdsm takes place in this part of the log.
Is this reproducible? What do you edit in the Storage tab?

Comment 10 Ayal Baron 2013-03-07 19:44:07 UTC
Yeela, please take a look at this to see if it's been resolved already.

Comment 11 Yeela Kaplan 2013-03-10 16:34:39 UTC
The failure happens in extendStorageDomain, when trying to extend the SD with a LUN that is already in use.

This is an engine issue; vdsm reports the LUN as used in getDeviceList:

Thread-7481::INFO::2012-11-14 14:57:04,066::logUtils::39::dispatcher::(wrapper) Run and protect: getDeviceList, Return response: {'devList': [
{'status': 'used', 'fwrev': '0001', 'vgUUID': 'A3R7gy-7QPB-9UKl-Rx1M-kVqt-B4OH-AfItXd', 'pathlist': [{'connection': '10.35.64.10', 'iqn': 'iqn.1994-05.com.redhat:violet-rhel_612144559425363', 'portal': '1', 'port': '3260', 'initiatorname': 'default'}], 'logicalblocksize': '512', 'devtype': 'iSCSI', 'physicalblocksize': '512', 'serial': 'SIET_VIRTUAL-DISK', 'GUID': '1a95f4b3f-ba92-4250', 'productID': 'VIRTUAL-DISK', 'vendorID': 'IET', 'capacity': '10737418240', 'pathstatus': [{'physdev': 'sdg', 'state': 'active', 'type': 'iSCSI', 'lun': '1'}], 'pvUUID': 'fmS9Mj-bv7d-745o-PJd8-cn7m-CvDQ-DrjbSY'}
]}

and afterwards vdsm receives a request to extend the SD with that used LUN:

Thread-7512::INFO::2012-11-14 14:57:33,488::logUtils::37::dispatcher::(wrapper) Run and protect: extendStorageDomain(sdUUID='328fe2ec-4849-4e64-92d8-333f9ff0b09b', spUUID='e59ad86c-2e41-11e2-98f9-df8494d70a03', devlist=['1a95f4b3f-ba92-4250'], options=None)

and fails:
Thread-7512::DEBUG::2012-11-14 14:58:05,567::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n /sbin/lvm pvcreate --config " devices { preferred_names = [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ \\"a%1a95f4b3f-ba92-4250|3514f0c561000000b|3514f0c561000000c|3514f0c561000000d|3514f0c561000000e|36000402002ec07a76533518c00000000%\\", \\"r%.*%\\" ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } " --metadatasize 128m --metadatacopies 2 --metadataignore y /dev/mapper/1a95f4b3f-ba92-4250' (cwd None)
Thread-7512::DEBUG::2012-11-14 14:58:05,638::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = '  Can\'t initialize physical volume "/dev/mapper/1a95f4b3f-ba92-4250" of volume group "d4cb8e1a-85e2-4475-a081-c26c188e7b1c" without -ff\n'; <rc> = 5
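
To make the missing check concrete, here is a minimal illustrative sketch (plain Python, not actual ovirt-engine code, which is Java; the helper name is made up): the devlist passed to extendStorageDomain should simply exclude devices that getDeviceList already reported with status 'used'.

# Illustrative sketch only - not ovirt-engine code. It shows the kind of
# filtering the engine could apply to vdsm's getDeviceList response before
# building the devlist passed to extendStorageDomain.
def pick_extendable_luns(get_device_list_response):
    """Return GUIDs of devices not already reported as 'used' by vdsm."""
    return [dev['GUID']
            for dev in get_device_list_response['devList']
            if dev.get('status') != 'used']

if __name__ == '__main__':
    # Shape of the response trimmed from the log above.
    response = {'devList': [{'GUID': '1a95f4b3f-ba92-4250', 'status': 'used'}]}
    print(pick_extendable_luns(response))  # -> []; the used LUN is never sent to extendStorageDomain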

Regarding the KeyError issue, it is a duplicate of BZ#908068, which is already fixed and verified.
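
For context on where the KeyError itself comes from: in multiprocessing's manager protocol, serve_client looks up the proxy's ident in the server-side id_to_obj registry, so a proxy whose referent no longer exists on the server produces exactly this RemoteError/KeyError. A generic, self-contained sketch of that failure mode (plain Python, not vdsm/supervdsm code; the address, names, and trigger are hypothetical and not necessarily what BZ#908068 fixes):

# Generic sketch of the RemoteError/KeyError failure mode - not vdsm/supervdsm
# code, and not necessarily the trigger fixed in BZ#908068. A client holds a
# proxy whose ident is unknown to the server now answering on that address.
from multiprocessing.managers import BaseManager, RemoteError

class Scanner(object):
    def force_scan(self):
        return 'scanned'

class ProxyManager(BaseManager):
    pass

ProxyManager.register('get_scanner', callable=Scanner)

if __name__ == '__main__':
    addr, key = ('127.0.0.1', 50123), b'secret'   # arbitrary example address/key

    old_server = ProxyManager(address=addr, authkey=key)
    old_server.start()
    scanner = old_server.get_scanner()   # proxy token references an object held by old_server
    old_server.shutdown()                # that object (and its id_to_obj registry) is now gone

    new_server = ProxyManager(address=addr, authkey=key)
    new_server.start()                   # same address, empty registry

    try:
        scanner.force_scan()             # serve_client: id_to_obj[ident] -> KeyError
    except RemoteError as err:
        print(err)                       # remote traceback ends with KeyError: '<ident>'
    finally:
        new_server.shutdown()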

Comment 12 Ayal Baron 2013-03-10 16:56:05 UTC
(In reply to comment #11)
> The failure happens in extendStorageDomain, when trying to extend the SD
> with a LUN that is already in use.
> 
> This is an engine issue; vdsm reports the LUN as used in getDeviceList:
> 
> Thread-7481::INFO::2012-11-14
> 14:57:04,066::logUtils::39::dispatcher::(wrapper) Run and protect:
> getDeviceList, Return response: {'devList': [
> {'status': 'used', 'fwrev': '0001', 'vgUUID':
> 'A3R7gy-7QPB-9UKl-Rx1M-kVqt-B4OH-AfItXd', 'pathlist': [{'connection':
> '10.35.64.10', 'iqn': 'iqn.1994-05.com.redhat:violet-rhel_612144559425363',
> 'portal': '1', 'port': '3260', 'initiatorname': 'default'}],
> 'logicalblocksize': '512', 'devtype': 'iSCSI', 'physicalblocksize': '512',
> 'serial': 'SIET_VIRTUAL-DISK', 'GUID': '1a95f4b3f-ba92-4250', 'productID':
> 'VIRTUAL-DISK', 'vendorID': 'IET', 'capacity': '10737418240', 'pathstatus':
> [{'physdev': 'sdg', 'state': 'active', 'type': 'iSCSI', 'lun': '1'}],
> 'pvUUID': 'fmS9Mj-bv7d-745o-PJd8-cn7m-CvDQ-DrjbSY'}
> ]}
> 
> and afterwards vdsm receives a request to extend the SD with that used LUN:
> 
> Thread-7512::INFO::2012-11-14
> 14:57:33,488::logUtils::37::dispatcher::(wrapper) Run and protect:
> extendStorageDomain(sdUUID='328fe2ec-4849-4e64-92d8-333f9ff0b09b',
> spUUID='e59ad86c-2e41-11e2-98f9-df8494d70a03',
> devlist=['1a95f4b3f-ba92-4250'], options=None)
> 
> and fails:
> Thread-7512::DEBUG::2012-11-14
> 14:58:05,567::__init__::1164::Storage.Misc.excCmd::(_log) '/usr/bin/sudo -n
> /sbin/lvm pvcreate --config " devices { preferred_names =
> [\\"^/dev/mapper/\\"] ignore_suspended_devices=1 write_cache_state=0 dis
> able_after_error_count=3 filter = [
> \\"a%1a95f4b3f-ba92-
> 4250|3514f0c561000000b|3514f0c561000000c|3514f0c561000000d|3514f0c561000000e|
> 36000402002ec07a76533518c00000000%\\", \\"r%.*%\\" ] }  global { 
> locking_type=1  prioritise_write_locks
> =1  wait_for_locks=1 }  backup {  retain_min = 50  retain_days = 0 } "
> --metadatasize 128m --metadatacopies 2 --metadataignore y
> /dev/mapper/1a95f4b3f-ba92-4250' (cwd None)
> Thread-7512::DEBUG::2012-11-14
> 14:58:05,638::__init__::1164::Storage.Misc.excCmd::(_log) FAILED: <err> = ' 
> Can\'t initialize physical volume "/dev/mapper/1a95f4b3f-ba92-4250" of
> volume group "d4cb8e1a-85e2-4475-a081-c26c188e7b1c" withou
> t -ff\n'; <rc> = 5

Liron, I doubt this issue still exists but please verify.

Comment 13 Liron Aravot 2013-03-11 15:13:51 UTC
Ayal, I'll try to sum up the possible issues as I understand them from the vdsm log. I'm also adding needinfo? on Lesik to clarify which issue this bug refers to, as it's not clear from the bug or the comments (at least to me) what he meant, and I don't see a timeout error in the vdsm log.

1. When running moveImage, vdsm acquires a shared lock on the domains, while the extend attempts to acquire an exclusive lock, meaning that the extend operation on the domain will wait for that lock. The operation can fail on ResourceTimeout if the copy takes a long time (possibly - it's not in the present log, so I can't verify that); a minimal sketch of this lock contention follows the log excerpts below.

Thread-8048::INFO::2012-11-14 15:09:06,315::logUtils::37::dispatcher::(wrapper) Run and protect: moveImage(spUUID='e59ad86c-2e41-11e2-98f9-df8494d70a03', srcDomUUID='6093c92d-8817-4199-a2dc-9c04b07344b3', dstDomUUID='328fe2ec-4849-4e64-92d8-333f9ff0b09b', imgUUID='3510cfb6-c455-4a40-b90a-137e68cf7e74', vmUUID='', op=2, postZero='false', force='false')
Thread-8048::DEBUG::2012-11-14 15:09:06,825::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' for lock type 'shared'
Thread-8048::DEBUG::2012-11-14 15:09:06,825::resourceManager::528::ResourceManager::(registerResource) Resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' is free. Now locking as 'shared' (1 active user)
Thread-8048::DEBUG::2012-11-14 15:09:06,826::resourceManager::212::ResourceManager.Request::(grant) ResName=`Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b`ReqID=`51954bef-d40a-4e00-bb4d-34f5a9535724`::Granted request



If we try to extend in the meantime, we have to wait:

Run and protect: extendStorageDomain(sdUUID='328fe2ec-4849-4e64-92d8-333f9ff0b09b', spUUID='e59ad86c-2e41-11e2-98f9-df8494d70a03', devlist=['1paikov7'], options=None)
Thread-8082::DEBUG::2012-11-14 15:09:46,438::resourceManager::175::ResourceManager.Request::(__init__) ResName=`Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b`ReqID=`07c4d17a-bb72-4e47-b968-92ff2f185c38`::Request was made in '/usr/share/vdsm/storage/resourceManager.py' line '485' at 'registerResource'
Thread-8082::DEBUG::2012-11-14 15:09:46,438::resourceManager::486::ResourceManager::(registerResource) Trying to register resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' for lock type 'exclusive'
Thread-8082::DEBUG::2012-11-14 15:09:46,439::resourceManager::510::ResourceManager::(registerResource) Resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' is currently locked, Entering queue (1 in queue)
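
To illustrate issue 1, here is a minimal sketch (plain Python threading, not vdsm's resourceManager; the class and exception wording are made up) of an exclusive acquire queuing behind a long-lived shared lock and timing out, analogous to moveImage (shared) blocking extendStorageDomain (exclusive):

# Hypothetical sketch, not vdsm code: an exclusive acquire waits behind a
# long-lived shared lock and gives up after a timeout, analogous to vdsm's
# ResourceTimeout when the image copy holds the domain as 'shared'.
import threading
import time

class SimpleSharedExclusiveLock(object):
    def __init__(self):
        self._cond = threading.Condition()
        self._shared_users = 0
        self._exclusive = False

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._shared_users += 1

    def release_shared(self):
        with self._cond:
            self._shared_users -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, timeout):
        deadline = time.time() + timeout
        with self._cond:
            while self._shared_users or self._exclusive:
                remaining = deadline - time.time()
                if remaining <= 0:
                    # stand-in for vdsm's ResourceTimeout
                    raise RuntimeError("ResourceTimeout: resource still locked")
                self._cond.wait(remaining)
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()

def long_copy(lock):
    lock.acquire_shared()          # moveImage holds the domain as 'shared'
    try:
        time.sleep(5)              # the copy takes longer than the extend's timeout
    finally:
        lock.release_shared()

if __name__ == '__main__':
    lock = SimpleSharedExclusiveLock()
    copier = threading.Thread(target=long_copy, args=(lock,))
    copier.start()
    time.sleep(0.1)                # make sure the shared lock is already held
    try:
        lock.acquire_exclusive(timeout=2)   # extendStorageDomain wants 'exclusive'
    except RuntimeError as exc:
        print(exc)                 # -> ResourceTimeout: resource still locked
    copier.join()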


2. The other issue is that the extend fails (as Yeela wrote): we allow extending with used LUNs, and the operation can fail on initializing the physical volume. This can always happen, is present in the log, and doesn't seem related to the copy.


3. The log excerpt that was copy-pasted:
Thread-460::ERROR::2012-11-14 14:59:02,069::domainMonitor::208::Storage.DomainMonitorThread::(_monitorDomain) Error while collecting domain 328fe2ec-4849-4e64-92d8-333f9ff0b09b monitoring information
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 186, in _monitorDomain
    self.domain.selftest()
  File "/usr/share/vdsm/storage/sdc.py", line 49, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 119, in _realProduce
    self.refreshStorage()
  File "/usr/share/vdsm/storage/misc.py", line 1063, in helper
    return sm(*args, **kwargs)
  File "/usr/share/vdsm/storage/misc.py", line 1048, in __call__
    self.__lastResult = self.__func(*args, **kwargs)
  File "/usr/share/vdsm/storage/sdc.py", line 82, in refreshStorage
    multipath.rescan()
  File "/usr/share/vdsm/storage/multipath.py", line 70, in rescan
    supervdsm.getProxy().forceIScsiScan()
  File "/usr/share/vdsm/supervdsm.py", line 69, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 60, in <lambda>
    **kwargs)
  File "<string>", line 2, in forceIScsiScan
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 216, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
KeyError: '1879e10'
---------------------------------------------------------------------------
Thread-460::DEBUG::2012-11-14 14:59:02,069::domainMonitor::216::Storage.DomainMonitorThread::(_monitorDomain) Domain 328fe2ec-4849-4e64-92d8-333f9ff0b09b changed its status to Invalid


The monitoring on the domain failed, but the operation had already failed before that, so I don't see how it is related or relevant to the operation:
DEBUG::2012-11-14 14:58:05,650::resourceManager::538::ResourceManager::(releaseResource) Trying to release resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b'
Thread-7512::DEBUG::2012-11-14 14:58:05,650::resourceManager::553::ResourceManager::(releaseResource) Released resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' (0 active users)
Thread-7512::DEBUG::2012-11-14 14:58:05,650::resourceManager::558::ResourceManager::(releaseResource) Resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b' is free, finding out if anyone is waiting for it.
Thread-7512::DEBUG::2012-11-14 14:58:05,651::resourceManager::565::ResourceManager::(releaseResource) No one is waiting for resource 'Storage.328fe2ec-4849-4e64-92d8-333f9ff0b09b', Clearing records.
Thread-7512::ERROR::2012-11-14 14:58:05,651::dispatcher::66::Storage.Dispatcher.Protect::(run) {'status': {'message': 'Failed to initialize physical device: ("[\'/dev/mapper/1a95f4b3f-ba92-4250\']",)', 'code': 601}}

Comment 14 Liron Aravot 2013-03-11 15:38:57 UTC
Bad wording in the last comment. In the previous comment I summed up the issues following the different logs attached in the comments; now, regarding the KeyError, as Yeela mentioned, it should have been solved by the bug she linked.

Regarding the issue in Yeela's log - it can always happen when we fail to extend and doesn't seem related here.

The possible timeout error is caused by the locks taken on the domain.

Lesik/Haim - is this bug (which, as the title reflects, refers to the KeyError) still relevant?

Comment 15 Liron Aravot 2013-03-11 17:00:07 UTC
"Regards the issue in Yeela log - it can always happen when we fail to extend, doesn't seems related here."  -  i meant that the failure can happen regardless of waiting to acquire the lock, therefore not seem related to the timeout error.

Comment 16 Liron Aravot 2013-03-14 17:31:13 UTC
Haim, Lesik - do we have any issue left here? (Haim, you asked to fix the task error handling; Yeela gave a link to the bug in which it should have been solved.)

Comment 17 Haim 2013-03-17 20:04:18 UTC

*** This bug has been marked as a duplicate of bug 908068 ***

